CRISPRtracrRNA

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


BackofenLab/CRISPRtracrRNA
CRISPRtracrRNA: robust approach for CRISPR tracrRNA detection
Alexander Mitrofanov, Marcus Ziemann, … Rolf Backofen
Bioinformatics (2022)
@article{mitrofanov2022crisprtracrna,
  title={CRISPRtracrRNA: robust approach for CRISPR tracrRNA detection},
  author={Mitrofanov, Alexander and Ziemann, Marcus and Alkhnbashi, Omer S and Hess, Wolfgang R and Backofen, Rolf},
  journal={Bioinformatics},
  volume={38},
  number={Supplement\_2},
  pages={ii42--ii48},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btac466}
}
proto-bio/proto-tools/proto_tools/tools/gene_annotation/crispr_tracr_rna
Function: run_crispr_tracr_rna()
Description: Predict tracrRNA sequences from nucleotide CRISPR loci.
License: CRISPRtracrRNA’s own code is licensed under MIT. It also federates over bundled data sources and components, each under its own license terms. Review each source’s terms before commercial use or redistribution.

Background

CRISPRtracrRNA (Mitrofanov et al., 2022) detects trans-activating CRISPR RNA (tracrRNA) sequences in nucleotide CRISPR loci. A tracrRNA is a small non-coding RNA that base-pairs with the precursor crRNA in the Class 2 effector systems that depend on one: Type II (Cas9) and the tracrRNA-bearing Type V subtypes such as Cas12b, Cas12c, and Cas12e. The resulting RNA duplex licenses Cas-mediated cleavage of target DNA, and the tracrRNA is also the second component fused into the single-guide RNA used in modern genome editing. Because tracrRNAs share little primary-sequence conservation across families, single-model approaches such as an Infernal covariance-model search on its own can miss divergent tracrRNAs in newly sequenced and metagenomic genomes.

Internally, the pipeline runs the following steps:

  • Array detection with CRISPRidentify (machine learning).
  • Cas-cassette detection with CRISPRcasIdentifier (HMM and machine learning).
  • A tracrRNA candidate scan with Infernal cmsearch against curated covariance models.
  • Anti-repeat alignment using fasta36, vmatch, Clustal Omega, and BLAST.
  • RNA-RNA interaction prediction with IntaRNA.
  • Transcription-terminator detection with erpin.

A final ranking step combines the per-candidate features into a single weighted score. A faster model_run mode performs only the covariance-model scan and skips both the validation evidence and the ranking step.
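The difference between the two run modes described above can be summarized in a few lines of Python. This is only an illustrative sketch: the stage labels are shorthand invented here for readability and are not identifiers used by CRISPRtracrRNA or by the toolkit wrapper.

```python
from __future__ import annotations

# Stage labels are hypothetical shorthand for the steps described in the text,
# paired with the external program the text names for each step.
COMPLETE_RUN_STAGES = (
    ("array_detection", "CRISPRidentify"),
    ("cas_cassette", "CRISPRcasIdentifier"),
    ("covariance_model_scan", "Infernal cmsearch"),
    ("anti_repeat_alignment", "fasta36, vmatch, Clustal Omega, BLAST"),
    ("rna_rna_interaction", "IntaRNA"),
    ("terminator", "erpin"),
    ("ranking", "weighted multi-evidence score"),
)


def stages_for(run_type: str) -> list[str]:
    """Return the stage labels that a given run_type would execute."""
    if run_type == "model_run":
        # model_run performs only the covariance-model scan.
        return ["covariance_model_scan"]
    if run_type == "complete_run":
        return [name for name, _program in COMPLETE_RUN_STAGES]
    raise ValueError(f"unknown run_type: {run_type!r}")


if __name__ == "__main__":
    for mode in ("complete_run", "model_run"):
        print(mode, "->", ", ".join(stages_for(mode)))
```

Seen this way, model_run is a strict subset of complete_run, which is why its candidates carry no array, interaction, terminator, or ranking evidence.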

Learning Resources

  • BackofenLab/CRISPRtracrRNA (Bioinformatics Group Freiburg) - official repository with installation instructions, the canonical configuration surface, and the curated covariance models distributed with the tool.
  • EddyRivasLab/infernal (The Eddy/Rivas Laboratory, Harvard) - official repository and User’s Guide for the covariance-model search engine and the cmsearch E-value statistics that score tracrRNA candidates.
  • BackofenLab/IntaRNA (Bioinformatics Group Freiburg) - official repository for the RNA-RNA interaction predictor that scores the anti-repeat to repeat duplex.

Tools

CRISPRtracrRNA Prediction (crispr-tracr-rna)

Predicts tracrRNA candidates from one or more nucleotide sequences and returns, per input sequence, a list of CrisprTracrRNAPrediction rows sorted by ranking score. Each row carries the candidate position and sequence, CRISPR array context, anti-repeat similarity and coverage, predicted RNA-RNA interaction with the repeat, terminator location and score, distance to the nearest Cas-effector cassette, and a single weighted multi-evidence score.
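To make the idea of a single weighted multi-evidence score concrete, the sketch below combines per-candidate features with the default weight_* values listed in the API Reference. It rests on loud assumptions: the plain linear combination, the feature names (the weight names with the weight_ prefix removed), and the premise that every feature is already scaled to the 0-1 range are all illustrative. The formula and scaling actually used upstream are not documented on this page.

```python
from __future__ import annotations

# Defaults copied from the weight_* parameters in the API Reference.
# Keys drop the "weight_" prefix; these feature names are hypothetical.
DEFAULT_WEIGHTS = {
    "crispr_array_score": 0.5,
    "anti_repeat_sim": 0.5,
    "anti_repeat_coverage": 0.5,
    "anti_sim_coverage": 0.5,
    "interaction_score": 0.6,
    "model_hit_score": 0.9,
    "terminator_hit_score": 0.9,
    "consistency_orientation": 0.1,
    "consistency_anti_repeat_tail": 0.1,
    "consistency_tail_terminator": 0.1,
}


def weighted_score(features: dict[str, float],
                   weights: dict[str, float] | None = None) -> float:
    """Linear combination of features (ASSUMED form, not upstream's formula).

    Features missing from the input contribute 0.0, which mimics a candidate
    that lacks one evidence channel.
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    unknown = set(features) - set(weights)
    if unknown:
        raise KeyError(f"features without a weight: {sorted(unknown)}")
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


if __name__ == "__main__":
    candidates = {
        "model_hit_only": {"model_hit_score": 1.0},
        "model_hit_plus_terminator": {"model_hit_score": 1.0,
                                      "terminator_hit_score": 1.0},
    }
    # Sort top-ranked first, mirroring the ordering of the returned rows.
    ranked = sorted(candidates,
                    key=lambda name: weighted_score(candidates[name]),
                    reverse=True)
    print(ranked)
```

Under these assumptions, a candidate supported by several evidence channels outranks one supported only by a covariance-model hit, which is the behavior the ranking step is described as providing.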

API Reference

Source
sequences
List[string]
required
Nucleotide sequence(s) to predict tracrRNA from. Each sequence should contain a CRISPR locus. Labeled positionally (seq_0, seq_1, …); results are returned in input order.
Source
model_type
enum
default:"II"
CRISPR model type. Available options: II, all
run_type
enum
default:"complete_run"
Pipeline mode. Available options: complete_run, model_run
num_workers
integer
Parallel workers across input sequences (defaults to 1).
anti_repeat_similarity_threshold
number
default:"0.7"
Minimum anti-repeat ↔ repeat similarity (0-1).
anti_repeat_coverage_threshold
number
default:"0.6"
Minimum anti-repeat alignment coverage (0-1).
weight_crispr_array_score
number
default:"0.5"
Ranking weight for CRISPR array confidence.
weight_anti_repeat_sim
number
default:"0.5"
Ranking weight for anti-repeat similarity.
weight_anti_repeat_coverage
number
default:"0.5"
Ranking weight for anti-repeat coverage.
weight_anti_sim_coverage
number
default:"0.5"
Ranking weight for similarity × coverage.
weight_interaction_score
number
default:"0.6"
Ranking weight for IntaRNA interaction energy.
weight_model_hit_score
number
default:"0.9"
Ranking weight for the covariance-model tail hit.
weight_terminator_hit_score
number
default:"0.9"
Ranking weight for erpin terminator score.
weight_consistency_orientation
number
default:"0.1"
Ranking weight for orientation consistency.
weight_consistency_anti_repeat_tail
number
default:"0.1"
Ranking weight for anti-repeat ↔ tail consistency.
weight_consistency_tail_terminator
number
default:"0.1"
Ranking weight for tail ↔ terminator consistency.
perform_type_v_anti_repeat_analysis
boolean
default:"False"
Type V (Cas12) anti-repeat search.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
results
List[CrisprTracrRNASequenceResult]
One result per input sequence, each carrying all candidate hits that the upstream pipeline produced for that sequence (top-ranked first).
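The constraints stated in this reference (enum options, 0-1 thresholds, verbosity levels, and the boolean coercion for verbose) can be checked locally before a job is submitted. The dataclass below is a hypothetical helper written for illustration. It is not part of the toolkit, it mirrors only a subset of the documented parameters, and whether the wrapper performs the same checks is not stated on this page.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass

MODEL_TYPES = ("II", "all")
RUN_TYPES = ("complete_run", "model_run")


@dataclass
class TracrConfigSketch:
    """Hypothetical local mirror of a subset of the documented parameters."""

    model_type: str = "II"
    run_type: str = "complete_run"
    anti_repeat_similarity_threshold: float = 0.7
    anti_repeat_coverage_threshold: float = 0.6
    perform_type_v_anti_repeat_analysis: bool = False
    verbose: int = 0

    def __post_init__(self) -> None:
        if self.model_type not in MODEL_TYPES:
            raise ValueError(f"model_type must be one of {MODEL_TYPES}")
        if self.run_type not in RUN_TYPES:
            raise ValueError(f"run_type must be one of {RUN_TYPES}")
        for name in ("anti_repeat_similarity_threshold",
                     "anti_repeat_coverage_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        # Documented coercion: True becomes 1 and False becomes 0.
        self.verbose = int(self.verbose)
        if not 0 <= self.verbose <= 3:
            raise ValueError("verbose must be 0, 1, 2, or 3")


if __name__ == "__main__":
    # Screen Type V loci as well, with info-level logging.
    config = TracrConfigSketch(model_type="all",
                               perform_type_v_anti_repeat_analysis=True,
                               verbose=True)
    print(asdict(config))
```

The resulting dictionary could then be passed as keyword arguments to the tool call, assuming the call accepts these parameter names as keywords.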

Applications

Use this to confirm and characterize Type II and Type V CRISPR-Cas loci, since a detected tracrRNA is the component that completes a functional Class 2 locus and distinguishes a Cas9 or Cas12 system from an unaccompanied CRISPR array. Pair it with minced on a confirmed array to recover the crRNA spacers, then design a single-guide RNA by fusing a spacer-bearing crRNA with the detected tracrRNA scaffold for genome-editing experiments. Run it across metagenomes and uncultured genomes to discover novel Cas9 or Cas12 systems whose tracrRNAs are too divergent to be caught by covariance-model search alone.

Usage Tips

  • Provide each CRISPR locus with at least 5 kb of flanking sequence on either side. The multi-evidence pipeline needs adjacent context to locate the Cas cassette and the downstream transcription terminator. Loci submitted as narrow windows lose those evidence channels and fall back to a covariance-model-only score.
  • model_type defaults to "II", which only screens for Cas9 systems. To also screen tracr-bearing Type V (Cas12b, Cas12c, Cas12e, …) loci, set model_type="all" and perform_type_v_anti_repeat_analysis=True. The Type V path is off by default because it is slower and irrelevant when only Cas9 loci are of interest.
  • Type I and Type III CRISPR systems do not use a tracrRNA. A complete_run on such a locus returns array context and Cas annotations but empty tracrRNA fields, with the ranking score reflecting only the partial evidence.
  • The ten weight_* ranking knobs interact. Sweep them together against a held-out positive and negative set rather than tuning a single weight in isolation, and keep upstream’s documented defaults when there is no specific objective to optimize for.
  • run_type="model_run" is the high-throughput pre-filter, not the final answer. It runs only the Infernal cmsearch step and returns candidates with E-values but none of the array, interaction, or terminator evidence, so re-run promising candidates through complete_run before drawing conclusions.
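The flanking-sequence tip above can be applied with a small helper that cuts a locus out of a larger contig. This is a minimal sketch under stated assumptions: coordinates are 0-based and half-open, the contig is treated as linear (no wrap-around for circular genomes), and the function name is invented here.

```python
from __future__ import annotations


def window_with_flanks(contig: str, locus_start: int, locus_end: int,
                       flank: int = 5000) -> tuple[str, int]:
    """Return the locus plus up to `flank` bases on each side.

    Coordinates are 0-based, half-open. The second return value is the
    offset of the window within the contig.
    """
    if not 0 <= locus_start < locus_end <= len(contig):
        raise ValueError("locus coordinates fall outside the contig")
    # Clamp to the contig ends; near an edge the flank is simply shorter.
    start = max(0, locus_start - flank)
    end = min(len(contig), locus_end + flank)
    return contig[start:end], start


if __name__ == "__main__":
    contig = "ACGT" * 5000  # 20 kb toy contig
    window, offset = window_with_flanks(contig, 8000, 9000)
    print(len(window), offset)
```

If candidate positions are reported relative to the submitted sequence, adding the returned offset maps them back to contig coordinates. When the window is clamped at a contig edge, the shortened side may still lack the context the evidence steps need.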

Toolkit Notes

These apply to every CRISPRtracrRNA tool in this toolkit (crispr-tracr-rna).
  • Runs on CPU only. The pipeline drives Infernal, IntaRNA, fasta36, vmatch, Clustal Omega, BLAST, and erpin, all CPU-based programs. There is no GPU acceleration to enable.
  • Initial install pulls model archives from Google Drive. complete_run mode requires the CRISPRcasIdentifier ML and HMM archives, which the standalone install fetches once. Google Drive rate-limits anonymous fetches, so on a failed install retry after a minute or follow the upstream README to place the two archives in the CRISPRcasIdentifier directory by hand. After install the runtime needs no further network access.
  • num_workers parallelizes across input sequences, not within a sequence. Each worker runs the full pipeline in its own working directory to avoid file-name collisions between concurrent jobs. The default of 1 is single-process; set it explicitly when batch-scanning many loci. The wrapper caps the effective worker count at len(sequences), so over-provisioning is safe.
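The parallelism note above describes a common pattern: one job per input sequence, a worker count capped at the number of sequences, and an isolated working directory per job. The sketch below illustrates that pattern with the standard library only. It is not the wrapper's implementation: run_one is a hypothetical stand-in for the per-sequence pipeline call, and threads are used here purely to keep the example short.

```python
from __future__ import annotations

import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Sequence


def effective_workers(num_workers: int, n_sequences: int) -> int:
    """Cap the worker count at the number of sequences (never below 1)."""
    return max(1, min(num_workers, n_sequences))


def run_batch(sequences: Sequence[str],
              run_one: Callable[[str, str, Path], object],
              num_workers: int = 1) -> list[object]:
    """Run `run_one(label, sequence, workdir)` once per input sequence."""

    def job(indexed: tuple[int, str]) -> object:
        index, sequence = indexed
        label = f"seq_{index}"  # positional labels, as in the API Reference
        # A private directory per job avoids file-name collisions.
        with tempfile.TemporaryDirectory(prefix=f"{label}_") as workdir:
            return run_one(label, sequence, Path(workdir))

    workers = effective_workers(num_workers, len(sequences))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order.
        return list(pool.map(job, enumerate(sequences)))


if __name__ == "__main__":
    def fake_pipeline(label: str, sequence: str, workdir: Path) -> tuple:
        # Placeholder for the real per-sequence work.
        return label, len(sequence), workdir.is_dir()

    print(run_batch(["ACGT", "ACGTACGT"], fake_pipeline, num_workers=8))
```

Because parallelism is across sequences only, a single long contig gains nothing from a higher num_workers; splitting a batch into per-locus windows is what exposes work to the pool.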
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.