MinCED
License: MinCED has a GPL-3.0 license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


ctSkennerton/minced
Mining CRISPRs in Environmental Datasets
122 stars
@software{skennerton2019minced,
  title={MinCED: Mining CRISPRs in Environmental Datasets},
  author={Skennerton, Connor T and Angly, Florent},
  year={2019},
  url={https://github.com/ctSkennerton/minced},
  note={Derived from the CRISPR Recognition Tool (CRT) by Bland et al., 2007}
}
proto-bio/proto-tools/proto_tools/tools/gene_annotation/minced
  • run_minced() - Detect CRISPR arrays in nucleotide sequences using MinCED.

Background

MinCED is a derivative of the CRISPR Recognition Tool (CRT) (Bland et al., 2007), maintained by Connor Skennerton. CRISPR arrays are blocks of short, near-identical direct repeats (typically 23 to 47 nt) separated by unique spacer sequences (typically 26 to 50 nt) that record fragments of past viral and plasmid infections; they form the heritable memory of the CRISPR-Cas adaptive immune system of bacteria and archaea. Internally, MinCED uses a k-mer seed-and-extend strategy. It scans for short exact k-mer matches that recur at a consistent spacing, then extends each seed bidirectionally to the actual repeat length, and finally validates the candidate by checking that the inter-repeat spacers fall within the configured length window. The algorithm runs on raw DNA, has linear time complexity in sequence length, and finishes in seconds on a typical 5 Mb bacterial genome on commodity CPU hardware.
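
The seed-and-extend idea described above can be illustrated with a toy sketch. This is not MinCED’s code. It is a simplified stand-in that assumes exact matching only, compares repeat copies pairwise, and makes no attempt at linear running time. The function name find_repeat_pairs and the seed size k=8 are illustrative assumptions; the length windows reuse the tool’s documented defaults.

```python
from collections import defaultdict


def find_repeat_pairs(seq, k=8, min_repeat=23, max_repeat=47,
                      min_spacer=26, max_spacer=50):
    """Toy seed-and-extend: return sorted (start_a, start_b, repeat_len) tuples.

    Hypothetical helper for illustration only; exact matches, pairwise.
    """
    # Two consecutive repeat copies start (repeat_len + spacer_len) apart.
    min_gap = min_repeat + min_spacer
    max_gap = max_repeat + max_spacer

    # Seed step: index every k-mer by its start positions (ascending order).
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    found = set()
    for hits in positions.values():
        for a, i in enumerate(hits):
            for j in hits[a + 1:]:
                gap = j - i
                if gap > max_gap:
                    break  # later hits are even farther away
                if gap < min_gap:
                    continue
                start_i, start_j, length = i, j, k
                # Extend step, leftwards, while both copies keep matching.
                while (start_i > 0 and length < gap
                       and seq[start_i - 1] == seq[start_j - 1]):
                    start_i -= 1
                    start_j -= 1
                    length += 1
                # Extend step, rightwards, without overlapping the second copy.
                while (start_j + length < len(seq) and length < gap
                       and seq[start_i + length] == seq[start_j + length]):
                    length += 1
                # Validate step: repeat and spacer must fit their windows.
                spacer_len = gap - length
                if (min_repeat <= length <= max_repeat
                        and min_spacer <= spacer_len <= max_spacer):
                    found.add((start_i, start_j, length))
    return sorted(found)
```

Every k-mer inside a shared repeat seeds the same pair, which is why the sketch collects results in a set. A perfectly periodic tandem repeat extends until the two copies touch, leaving no spacer, so the validation step rejects it.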

Learning Resources

  • ctSkennerton/minced (Connor Skennerton) - official repository with the canonical command-line flag surface, installation instructions, and example output.
  • PMC1924867 (CRT paper) (Bland et al.) - the full text of the algorithm description, including the seed-and-extend mechanism and the comparison against PatScan and PILER-CR.

Tools

MinCED CRISPR Array Detection (minced-crispr)

Detects CRISPR arrays in one or more nucleotide sequences. Returns, per input sequence, a list of CrisprArray objects; each carries an ordered list of CrisprRepeatSpacer units with the repeat’s start position, the repeat sequence, and the following spacer (the last unit has no spacer).
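
To make the result shape concrete, here is a minimal sketch using stand-in dataclasses. The class names RepeatSpacerSketch and ArraySketch and the field names start, repeat, spacer, and units are assumptions chosen for illustration; the real CrisprArray and CrisprRepeatSpacer attribute names may differ, so check the source before relying on them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepeatSpacerSketch:
    """Stand-in for one repeat-spacer unit (field names are assumed)."""
    start: int              # start position of the repeat
    repeat: str             # repeat sequence
    spacer: Optional[str]   # following spacer; None on the last unit


@dataclass
class ArraySketch:
    """Stand-in for one CRISPR array: an ordered list of units."""
    units: List[RepeatSpacerSketch] = field(default_factory=list)


def spacers_of(array: ArraySketch) -> List[str]:
    """Collect spacers in order, skipping the final unit that has none."""
    return [u.spacer for u in array.units if u.spacer is not None]


# Example array with three repeats and therefore two spacers.
example = ArraySketch(units=[
    RepeatSpacerSketch(start=100, repeat="GTTTTAGAGC", spacer="AAACCCGGGT"),
    RepeatSpacerSketch(start=120, repeat="GTTTTAGAGC", spacer="TTGACCATGA"),
    RepeatSpacerSketch(start=140, repeat="GTTTTAGAGC", spacer=None),
])
print(spacers_of(example))
```

An array with N repeats yields N - 1 spacers under this layout, because only the last unit lacks one.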

API Reference

Inputs
  • sequences (List[string], required) - Nucleotide sequence(s) to search for CRISPR arrays. Labeled positionally (seq_0, seq_1, …); results are returned in input order.

Config
  • min_num_repeats (integer, default: 3) - Minimum repeats per array.
  • min_repeat_length (integer, default: 23) - Minimum repeat length in nt.
  • max_repeat_length (integer, default: 47) - Maximum repeat length in nt. Must be ≥ min_repeat_length.
  • min_spacer_length (integer, default: 26) - Minimum spacer length in nt.
  • max_spacer_length (integer, default: 50) - Maximum spacer length in nt. Must be ≥ min_spacer_length.
  • verbose (integer, default: 0) - Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cpu") - Device to run the tool on.
  • timeout (integer, default: 600) - Maximum execution time in seconds. None waits indefinitely.
  • seed (integer) - Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs
  • results (List[MincedSequenceResult]) - Per-sequence CRISPR detection results.

Applications

Use this to confirm and catalog CRISPR loci across newly sequenced bacterial and archaeal genomes, or to mine spacer libraries from metagenomic assemblies for phage-host interaction studies. It also works as a pre-filter: run minced-crispr first to verify that a candidate contig actually carries a CRISPR array, and only then spend compute on the same locus with pyhmmer-hmmsearch for Cas effector domains and crispr-tracr-rna for tracrRNA. The spacer set returned for each array can then be aligned against phage or plasmid sequence databases to reconstruct the host’s immune history.
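
Aligning spacers against a phage or plasmid database usually starts from a FASTA file. The sketch below shows one way to write spacers out, assuming they have already been pulled from the results into plain lists of strings. The function name spacers_to_fasta and the header scheme (label, array index, spacer index) are hypothetical conventions, not part of the toolkit.

```python
from typing import List


def spacers_to_fasta(label: str, arrays: List[List[str]]) -> str:
    """Format spacers as FASTA text.

    label  : sequence label, e.g. the positional "seq_0"
    arrays : one list of spacer strings per detected array
    """
    lines = []
    for a, spacers in enumerate(arrays, start=1):
        for s, spacer in enumerate(spacers, start=1):
            # Header records where the spacer came from (assumed scheme).
            lines.append(f">{label}_array{a}_spacer{s}")
            lines.append(spacer)
    # End with a newline only when there is something to write.
    return "\n".join(lines) + ("\n" if lines else "")


print(spacers_to_fasta("seq_0", [["ACGT", "GGCC"], ["TTAA"]]), end="")
```

Keeping the array and spacer indices in the header lets alignment hits be traced back to their position in the array, which matters when reconstructing the order of past infections.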

Usage Tips

  • min_num_repeats controls the sensitivity-versus-specificity trade-off. The default of 3 balances both for typical bacterial and archaeal genomes. Lower it to 2 to catch partial or degraded arrays at the cost of more false positives, and raise it to 4 or more when only high-confidence arrays should pass through.
  • The 23 to 47 nt repeat and 26 to 50 nt spacer windows match canonical CRISPR loci. Widen max_repeat_length and max_spacer_length to detect atypical families such as Type IV-A or CRISPR systems with unusually long spacers. Lower min_repeat_length only when chasing partial repeats, because values below 20 nt start to pick up generic tandem repeats.
  • MinCED only locates the array; it does not identify Cas genes or classify the CRISPR system. Type assignment requires downstream Cas-effector annotation, typically pyhmmer-hmmsearch against curated Cas HMMs or a dedicated classifier such as CRISPRcasIdentifier.
  • Inverted length ranges are caught at config time. Setting max_repeat_length < min_repeat_length or max_spacer_length < min_spacer_length raises ValueError before the run starts, so the call fails fast instead of completing with an empty result set.
  • Spacer count is not an immunity-breadth metric. Multiple spacers in an array can target the same phage, and many spacers are degraded remnants of historical encounters, so the number of spacers overestimates how many distinct threats the host can recognize today.
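
The fail-fast behavior for inverted ranges described in the tips above can be pictured with a small stand-in. This is a sketch of the idea, not the toolkit’s config class: MincedParamsSketch is a hypothetical name, and only the field names, defaults, and the ValueError on inverted ranges come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MincedParamsSketch:
    """Hypothetical stand-in mirroring the documented length parameters."""
    min_num_repeats: int = 3
    min_repeat_length: int = 23
    max_repeat_length: int = 47
    min_spacer_length: int = 26
    max_spacer_length: int = 50

    def __post_init__(self):
        # Reject inverted windows at construction, before any run starts.
        if self.max_repeat_length < self.min_repeat_length:
            raise ValueError("max_repeat_length must be >= min_repeat_length")
        if self.max_spacer_length < self.min_spacer_length:
            raise ValueError("max_spacer_length must be >= min_spacer_length")


# A stricter setup: require four repeats and allow longer spacers.
strict = MincedParamsSketch(min_num_repeats=4, max_spacer_length=72)

try:
    MincedParamsSketch(min_repeat_length=30, max_repeat_length=25)
except ValueError as err:
    print(f"rejected: {err}")
```

Validating at construction means a typo in a range surfaces immediately, rather than as an empty result set that looks like a genome without CRISPR arrays.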

Toolkit Notes

These apply to every MinCED tool in this toolkit (minced-crispr).
  • Runs on CPU only. MinCED is a Java program; the standalone install bundles a Java runtime alongside the minced program. There is no GPU acceleration to enable, and runtime is seconds per typical bacterial genome.
  • Self-contained after install. The standalone setup.sh downloads the minced program once; subsequent runs need no network access and no model weights or reference databases.
  • Sequences are processed one at a time. The wrapper iterates over inputs.sequences sequentially rather than parallelizing across them. For large batches, run independent calls in parallel from the caller side.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
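
The caller-side parallelism suggested in the Toolkit Notes could look like the sketch below. The function detect_one is a placeholder standing in for a single-sequence tool call; its body and return value are invented for illustration. Whether the real tool can safely be invoked from several threads at once is an assumption to verify. A process pool is a drop-in alternative if it cannot.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def detect_one(sequence: str) -> Dict[str, int]:
    """Placeholder for one independent tool call (hypothetical).

    A real version would invoke the tool on a single sequence.
    """
    return {"length": len(sequence)}


def detect_many(sequences: List[str], max_workers: int = 4) -> List[Dict[str, int]]:
    """Run independent calls concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(detect_one, sequences))


print(detect_many(["ACGTACGT", "GGCC", "TTAATTAATT"]))
```

Threads are a reasonable first choice here because each call mostly waits on an external Java process, so the Python interpreter lock is not the bottleneck.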

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.