CRISPRtracrRNA

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


BackofenLab/CRISPRtracrRNA


CRISPRtracrRNA: robust approach for CRISPR tracrRNA detection

Alexander Mitrofanov, Marcus Ziemann, … Rolf Backofen

Bioinformatics (2022)


@article{mitrofanov2022crisprtracrna,
  title={CRISPRtracrRNA: robust approach for CRISPR tracrRNA detection},
  author={Mitrofanov, Alexander and Ziemann, Marcus and Alkhnbashi, Omer S and Hess, Wolfgang R and Backofen, Rolf},
  journal={Bioinformatics},
  volume={38},
  number={Supplement\_2},
  pages={ii42--ii48},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btac466}
}


proto-bio/proto-tools/proto_tools/tools/gene_annotation/crispr_tracr_rna


Function	Description
`run_crispr_tracr_rna()`	Predict tracrRNA sequences from nucleotide CRISPR loci

License: CRISPRtracrRNA’s own code is licensed under MIT. It also relies on bundled data sources and components, each under its own license terms. Bundled dependencies:

CRISPRcasIdentifier: GPL-3.0

Review each source’s terms before commercial use or redistribution.

Background

CRISPRtracrRNA (Mitrofanov et al., 2022) detects trans-activating CRISPR RNA (tracrRNA) sequences in nucleotide sequences that contain CRISPR loci. A tracrRNA is a small non-coding RNA that base-pairs with the precursor crRNA in the Class 2 effector systems that depend on one, namely Type II (Cas9) and the tracrRNA-bearing Type V subtypes such as Cas12b, Cas12c, and Cas12e. The resulting RNA duplex licenses Cas-mediated cleavage of target DNA. The tracrRNA is also the second component fused into the single-guide RNA used in modern genome editing. Because tracrRNAs share little primary-sequence conservation across families, single-model approaches such as Infernal covariance-model search alone can miss divergent tracrRNAs in newly sequenced and metagenomic genomes.

Internally, the pipeline runs an array-detection step with CRISPRidentify (machine learning), a Cas-cassette step with CRISPRcasIdentifier (HMM and machine learning), a tracrRNA candidate scan with Infernal cmsearch against curated covariance models, an anti-repeat alignment step using fasta36, vmatch, Clustal Omega, and BLAST, an RNA-RNA interaction step with IntaRNA, and a transcription-terminator step with erpin. A final ranking step combines the per-candidate features into a single weighted score. A faster model_run mode performs only the covariance-model scan and skips the validation evidence and the ranking step.
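To make the anti-repeat idea concrete, the sketch below looks for the stretch of a candidate sequence that is most complementary to a CRISPR repeat. It is a deliberately simplified illustration and not the pipeline's method: the real anti-repeat step uses fasta36, vmatch, Clustal Omega, and BLAST, while this toy version uses an ungapped sliding window with strict Watson-Crick pairing (no G-U wobble, no bulges). The function names and the example sequences are hypothetical.

```python
from typing import Tuple

# Watson-Crick complement table, working in the DNA alphabet.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string (U is treated as T)."""
    return seq.upper().replace("U", "T").translate(_COMPLEMENT)[::-1]


def best_anti_repeat_window(repeat: str, candidate: str) -> Tuple[int, float]:
    """Slide the repeat's reverse complement along the candidate (ungapped).

    Returns (start, identity) of the best window, where identity is the
    fraction of matching positions. Returns (-1, 0.0) if nothing matches
    or the candidate is shorter than the repeat.
    """
    target = reverse_complement(repeat)
    candidate = candidate.upper().replace("U", "T")
    best_start, best_identity = -1, 0.0
    for start in range(len(candidate) - len(target) + 1):
        window = candidate[start:start + len(target)]
        identity = sum(a == b for a, b in zip(window, target)) / len(target)
        if identity > best_identity:
            best_start, best_identity = start, identity
    return best_start, best_identity


# Toy example: the candidate embeds a perfect anti-repeat at offset 4.
toy_repeat = "GTTTTAGAGCTA"
toy_candidate = "CCCC" + reverse_complement(toy_repeat) + "GGGG"
start, identity = best_anti_repeat_window(toy_repeat, toy_candidate)
print(start, identity)
```

Natural anti-repeats are rarely perfect complements, which is one reason the pipeline reports similarity and coverage separately and filters on thresholds for each.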

Learning Resources

BackofenLab/CRISPRtracrRNA (Bioinformatics Group Freiburg) - official repository with installation instructions, the canonical configuration surface, and the curated covariance models distributed with the tool.
EddyRivasLab/infernal (The Eddy/Rivas Laboratory, Harvard) - official repository and User’s Guide for the covariance-model search engine and the cmsearch E-value statistics that score tracrRNA candidates.
BackofenLab/IntaRNA (Bioinformatics Group Freiburg) - official repository for the RNA-RNA interaction predictor that scores the anti-repeat to repeat duplex.

Tools

CRISPRtracrRNA Prediction (`crispr-tracr-rna`)

Predicts tracrRNA candidates from one or more nucleotide sequences and returns, per input sequence, a list of CrisprTracrRNAPrediction rows sorted by ranking score. Each row carries the candidate position and sequence, CRISPR array context, anti-repeat similarity and coverage, predicted RNA-RNA interaction with the repeat, terminator location and score, distance to the nearest Cas-effector cassette, and a single weighted multi-evidence score.
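As an illustration of how a single weighted multi-evidence score can order candidates, the sketch below combines per-candidate features with the default weights listed in the config reference. It rests on two assumptions that this page does not confirm: that the score is a plain weighted sum, and that each feature is already normalized to the range 0 to 1. The feature key names are hypothetical labels derived from the `weight_*` parameter names.

```python
from typing import Dict, List

# Default weights as documented for CrisprTracrRNAConfig (weight_* parameters).
DEFAULT_WEIGHTS: Dict[str, float] = {
    "crispr_array_score": 0.5,
    "anti_repeat_sim": 0.5,
    "anti_repeat_coverage": 0.5,
    "anti_sim_coverage": 0.5,
    "interaction_score": 0.6,
    "model_hit_score": 0.9,
    "terminator_hit_score": 0.9,
    "consistency_orientation": 0.1,
    "consistency_anti_repeat_tail": 0.1,
    "consistency_tail_terminator": 0.1,
}


def weighted_score(features: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """ASSUMPTION: linear weighted sum; missing features contribute zero."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def rank_candidates(candidates: List[Dict[str, float]],
                    weights: Dict[str, float] = DEFAULT_WEIGHTS) -> List[Dict[str, float]]:
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidates, key=lambda f: weighted_score(f, weights), reverse=True)


# A candidate supported by every evidence channel vs. a covariance-model hit only.
full_evidence = {name: 1.0 for name in DEFAULT_WEIGHTS}
model_only = {"model_hit_score": 1.0}
ranked = rank_candidates([model_only, full_evidence])
print(round(weighted_score(full_evidence), 3), round(weighted_score(model_only), 3))
```

Under these assumptions, a candidate backed only by a covariance-model hit scores well below one supported by array, anti-repeat, interaction, and terminator evidence, which matches the intent of multi-evidence ranking.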

API Reference


Input: CrisprTracrRNAInput

sequences

List[string]

required

Nucleotide sequence(s) to predict tracrRNA from. Each sequence should contain a CRISPR locus. Labeled positionally (seq_0, seq_1, …); results are returned in input order.
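The positional labeling described above can be sketched in a few lines. This is a stand-alone illustration, not the toolkit's code: the helper name `label_sequences` and the accepted alphabet are assumptions made for the example.

```python
from typing import Dict, List

# ASSUMPTION: illustrative alphabet; the real input validation may differ.
VALID_BASES = set("ACGTUN")


def label_sequences(sequences: List[str]) -> Dict[str, str]:
    """Assign positional IDs (seq_0, seq_1, ...) in input order."""
    labeled: Dict[str, str] = {}
    for index, seq in enumerate(sequences):
        cleaned = seq.strip().upper()
        unexpected = set(cleaned) - VALID_BASES
        if not cleaned or unexpected:
            raise ValueError(
                f"seq_{index}: empty sequence or unexpected symbols {sorted(unexpected)}"
            )
        labeled[f"seq_{index}"] = cleaned
    return labeled


labeled = label_sequences(["acgtacgt", "GGGTTTAAACCC"])
print(list(labeled))
```

Because dictionaries preserve insertion order, iterating over `labeled` walks the sequences in the same order they were supplied, mirroring how results are returned.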


Config: CrisprTracrRNAConfig

model_type

enum

default:"II"

CRISPR model type. Available options: II, all

run_type

enum

default:"complete_run"

Pipeline mode. Available options: complete_run, model_run

num_workers

integer

Parallel workers across input sequences (defaults to 1).

anti_repeat_similarity_threshold

number

default:"0.7"

Minimum anti-repeat ↔ repeat similarity (0-1).

anti_repeat_coverage_threshold

number

default:"0.6"

Minimum anti-repeat alignment coverage (0-1).

weight_crispr_array_score

number

default:"0.5"

Ranking weight for CRISPR array confidence.

weight_anti_repeat_sim

number

default:"0.5"

Ranking weight for anti-repeat similarity.

weight_anti_repeat_coverage

number

default:"0.5"

Ranking weight for anti-repeat coverage.

weight_anti_sim_coverage

number

default:"0.5"

Ranking weight for similarity × coverage.

weight_interaction_score

number

default:"0.6"

Ranking weight for IntaRNA interaction energy.

weight_model_hit_score

number

default:"0.9"

Ranking weight for the covariance-model tail hit.

weight_terminator_hit_score

number

default:"0.9"

Ranking weight for erpin terminator score.

weight_consistency_orientation

number

default:"0.1"

Ranking weight for orientation consistency.

weight_consistency_anti_repeat_tail

number

default:"0.1"

Ranking weight for anti-repeat ↔ tail consistency.

weight_consistency_tail_terminator

number

default:"0.1"

Ranking weight for tail ↔ terminator consistency.

perform_type_v_anti_repeat_analysis

boolean

default:"False"

Type V (Cas12) anti-repeat search.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
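The constraints in this reference (enum values, 0-1 thresholds, verbosity coercion) can be summarized as a small validation sketch. `TracrConfigSketch` is a hypothetical stand-in that mirrors a subset of the documented defaults; it is not the real `CrisprTracrRNAConfig` class, and the real class may validate differently.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class TracrConfigSketch:
    """Hypothetical mirror of a subset of the documented config defaults."""
    model_type: str = "II"
    run_type: str = "complete_run"
    num_workers: int = 1
    anti_repeat_similarity_threshold: float = 0.7
    anti_repeat_coverage_threshold: float = 0.6
    perform_type_v_anti_repeat_analysis: bool = False
    verbose: Union[int, bool] = 0
    timeout: Optional[int] = 600  # None waits indefinitely

    def __post_init__(self) -> None:
        if self.model_type not in {"II", "all"}:
            raise ValueError(f"model_type must be 'II' or 'all', got {self.model_type!r}")
        if self.run_type not in {"complete_run", "model_run"}:
            raise ValueError(f"unknown run_type {self.run_type!r}")
        for name in ("anti_repeat_similarity_threshold", "anti_repeat_coverage_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        # Documented coercion: True becomes 1 and False becomes 0.
        self.verbose = int(self.verbose)
        if not 0 <= self.verbose <= 3:
            raise ValueError("verbose must be between 0 and 3")
        if self.num_workers < 1:
            raise ValueError("num_workers must be at least 1")


# Settings for screening tracr-bearing Type V loci in addition to Type II.
type_v_config = TracrConfigSketch(
    model_type="all",
    perform_type_v_anti_repeat_analysis=True,
    verbose=True,
)
print(type_v_config.verbose, type_v_config.timeout)
```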


Output: CrisprTracrRNAOutput

results

List[CrisprTracrRNASequenceResult]

One result per input sequence, each carrying all candidate hits that the upstream tool produced for that sequence (top-ranked first).

CrisprTracrRNASequenceResult

sequence_id

string

required

ID of the input sequence.

candidates

List[CrisprTracrRNAPrediction]

All candidate hits for this sequence, top-ranked first; empty when upstream found nothing.
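A minimal sketch of walking this output shape follows. It uses plain dictionaries as mock data rather than the real result objects, and the `score` key inside each candidate is a hypothetical placeholder because the fields of `CrisprTracrRNAPrediction` are not listed here. The only properties it relies on are the documented ones: one result per input sequence, candidates ordered top-ranked first, and an empty list when nothing was found.

```python
from typing import Any, Dict, List, Optional


def top_candidates(results: List[Dict[str, Any]]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map each sequence_id to its top-ranked candidate, or None if there is none."""
    top: Dict[str, Optional[Dict[str, Any]]] = {}
    for result in results:
        candidates = result["candidates"]
        # Candidates arrive top-ranked first, so index 0 is the best hit.
        top[result["sequence_id"]] = candidates[0] if candidates else None
    return top


# Mock data standing in for CrisprTracrRNAOutput.results (hypothetical values).
mock_results = [
    {"sequence_id": "seq_0", "candidates": [{"score": 2.1}, {"score": 0.8}]},
    {"sequence_id": "seq_1", "candidates": []},
]

top = top_candidates(mock_results)
for sequence_id, best in top.items():
    print(sequence_id, "no tracrRNA candidate" if best is None else best)
```

Handling the empty case explicitly matters in batch scans, since loci without a tracrRNA (for example Type I or Type III systems) are expected to produce results with no usable candidate.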

Applications

Use this to confirm and characterize Type II and Type V CRISPR-Cas loci, since a detected tracrRNA is the component that completes a functional Class 2 locus and distinguishes a Cas9 or Cas12 system from an unaccompanied CRISPR array. Pair it with minced on a confirmed array to recover the crRNA spacers, then design a single-guide RNA by fusing a spacer-bearing crRNA with the detected tracrRNA scaffold for genome-editing experiments. Run it across metagenomes and uncultured genomes to discover novel Cas9 or Cas12 systems whose tracrRNAs are too divergent to be caught by covariance-model search alone.

Usage Tips

Provide each CRISPR locus with at least 5 kb of flanking sequence on either side. The multi-evidence pipeline needs adjacent context to locate the Cas cassette and the downstream transcription terminator. Loci submitted as narrow windows lose those evidence channels and fall back to a covariance-model-only score.
model_type defaults to "II", which only screens for Cas9 systems. To also screen tracr-bearing Type V (Cas12b, Cas12c, Cas12e, …) loci, set model_type="all" and perform_type_v_anti_repeat_analysis=True. The Type V path is off by default because it is slower and irrelevant when only Cas9 loci are of interest.
Type I and Type III CRISPR systems do not use a tracrRNA. A complete_run on such a locus returns array context and Cas annotations but empty tracrRNA fields, with the ranking score reflecting only the partial evidence.
The ten weight_* ranking knobs interact. Sweep them together against a held-out positive and negative set rather than tuning a single weight in isolation, and keep upstream’s documented defaults when there is no specific objective to optimize for.
run_type="model_run" is the high-throughput pre-filter, not the final answer. It runs only the Infernal cmsearch step and returns candidates with E-values but none of the array, interaction, or terminator evidence, so re-run promising candidates through complete_run before drawing conclusions.
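The flanking-sequence tip can be applied with a small coordinate helper before submitting loci. This sketch assumes 0-based, half-open coordinates and a contig held in memory as a string; the function names are hypothetical and not part of the toolkit.

```python
from typing import Tuple


def flanked_window(locus_start: int, locus_end: int, contig_length: int,
                   flank: int = 5000) -> Tuple[int, int]:
    """Expand a locus by `flank` bases on each side, clamped to the contig.

    Coordinates are 0-based and half-open: [locus_start, locus_end).
    """
    if not 0 <= locus_start < locus_end <= contig_length:
        raise ValueError("locus coordinates must satisfy 0 <= start < end <= contig_length")
    start = max(0, locus_start - flank)
    end = min(contig_length, locus_end + flank)
    return start, end


def extract_with_flanks(contig: str, locus_start: int, locus_end: int,
                        flank: int = 5000) -> str:
    """Return the locus plus its flanking context as a single string."""
    start, end = flanked_window(locus_start, locus_end, len(contig), flank)
    return contig[start:end]


# A locus in the middle of a long contig keeps the full 5 kb on both sides.
print(flanked_window(12_000, 13_500, 100_000))
# A locus near the contig ends is clamped, so less context is available.
print(flanked_window(2_000, 3_000, 6_000))
```

When clamping shortens a flank, as in the second call, the Cas cassette or terminator may fall outside the submitted window, so a low score on such a locus deserves extra caution.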

Toolkit Notes

These apply to every CRISPRtracrRNA tool in this toolkit (crispr-tracr-rna).

Runs on CPU only. The pipeline drives Infernal, IntaRNA, fasta36, vmatch, Clustal Omega, BLAST, and erpin, all CPU-based programs. There is no GPU acceleration to enable.
Initial install pulls model archives from Google Drive. complete_run mode requires the CRISPRcasIdentifier ML and HMM archives, which the standalone install fetches once. Google Drive rate-limits anonymous fetches, so on a failed install retry after a minute or follow the upstream README to place the two archives in the CRISPRcasIdentifier directory by hand. After install the runtime needs no further network access.
num_workers parallelizes across input sequences, not within a sequence. Each worker runs the full pipeline in its own working directory to avoid file-name collisions between concurrent jobs. The default of 1 is single-process; set it explicitly when batch-scanning many loci. The wrapper caps the effective worker count at len(sequences), so over-provisioning is safe.
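The worker behavior described above (parallelism across sequences, one working directory per job, worker count capped at the number of sequences) can be sketched as follows. This is an illustration under stated assumptions, not the wrapper's implementation: it uses threads for simplicity, and `run_one` is a hypothetical placeholder that writes a FASTA file and returns the sequence length instead of invoking the pipeline.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Tuple


def effective_workers(num_workers: int, n_sequences: int) -> int:
    """Cap the worker count at the number of sequences (and at least 1)."""
    return max(1, min(num_workers, n_sequences))


def run_one(labeled: Tuple[str, str]) -> Tuple[str, int]:
    """Placeholder job: each call gets its own temporary working directory."""
    seq_id, sequence = labeled
    with tempfile.TemporaryDirectory(prefix=f"{seq_id}_") as workdir:
        fasta = Path(workdir) / "input.fasta"
        fasta.write_text(f">{seq_id}\n{sequence}\n")
        # The real pipeline would run here, reading and writing only inside workdir.
        return seq_id, len(sequence)


def run_batch(sequences: List[str], num_workers: int = 1) -> List[Tuple[str, int]]:
    """Fan sequences out across workers; results come back in input order."""
    labeled = [(f"seq_{i}", seq) for i, seq in enumerate(sequences)]
    workers = effective_workers(num_workers, len(labeled))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, labeled))  # map preserves input order


print(effective_workers(8, 3))
print(run_batch(["ACGTACGT", "GGCC", "TTTAAA"], num_workers=8))
```

Separate working directories are what make concurrent jobs safe here: two workers can both write `input.fasta` without overwriting each other.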

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.