MinCED - Proto

License: MinCED has a GPL-3.0 license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


ctSkennerton/minced

Mining CRISPRs in Environmental Datasets


@software{skennerton2019minced,
  title={MinCED: Mining CRISPRs in Environmental Datasets},
  author={Skennerton, Connor T and Angly, Florent},
  year={2019},
  url={https://github.com/ctSkennerton/minced},
  note={Derived from the CRISPR Recognition Tool (CRT) by Bland et al., 2007}
}


proto-bio/proto-tools/proto_tools/tools/gene_annotation/minced


Function	Description
`run_minced()`	Detect CRISPR arrays in nucleotide sequences using MinCED

Background

MinCED is a derivative of the CRISPR Recognition Tool (CRT) (Bland et al., 2007), maintained by Connor Skennerton. CRISPR arrays are blocks of short, near-identical direct repeats (typically 23 to 47 nt) separated by unique spacer sequences (typically 26 to 50 nt) that record fragments of past viral and plasmid infections; they form the heritable memory of the CRISPR-Cas adaptive immune system of bacteria and archaea.

Internally, MinCED uses a k-mer seed-and-extend strategy. It scans for short exact k-mer matches that recur at a consistent spacing, then extends each seed bidirectionally to the actual repeat length, and finally validates the candidate by checking that the inter-repeat spacers fall within the configured length window. The algorithm runs on raw DNA, has linear time complexity in sequence length, and finishes in seconds on a typical 5 Mb bacterial genome on commodity CPU hardware.
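The seed-and-extend idea can be illustrated with a toy Python sketch. This is not the code MinCED or CRT actually runs: the function names, the k-mer size of 8, and the 49 to 97 nt seed gap (default minimum repeat plus minimum spacer, up to maximum repeat plus maximum spacer) are assumptions made for illustration. The dictionary lookup is a simplification, and the extension step does not cap the repeat at a maximum length.

```python
from typing import List, Set, Tuple


def find_seed_pairs(seq: str, k: int = 8, min_gap: int = 49,
                    max_gap: int = 97) -> List[Tuple[int, int]]:
    """Find positions (a, b) where the same k-mer recurs min_gap..max_gap nt apart."""
    seen = {}  # k-mer -> start positions seen so far
    pairs = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        for j in seen.get(kmer, []):
            if min_gap <= i - j <= max_gap:
                pairs.append((j, i))
        seen.setdefault(kmer, []).append(i)
    return pairs


def extend_pair(seq: str, a: int, b: int, k: int) -> Tuple[int, int, int]:
    """Grow an exact k-mer match at a and b in both directions.

    Returns (repeat_start_a, repeat_start_b, repeat_length).
    """
    left = 0
    while a - left - 1 >= 0 and seq[a - left - 1] == seq[b - left - 1]:
        left += 1
    right = k
    while b + right < len(seq) and seq[a + right] == seq[b + right]:
        right += 1
    return a - left, b - left, left + right


def candidate_repeats(seq: str, k: int = 8, min_spacer: int = 26,
                      max_spacer: int = 50) -> Set[Tuple[int, int, int]]:
    """Seed, extend, then keep pairs whose spacer length is inside the window."""
    kept = set()
    for a, b in find_seed_pairs(seq, k):
        start_a, start_b, length = extend_pair(seq, a, b, k)
        spacer_len = start_b - (start_a + length)
        if min_spacer <= spacer_len <= max_spacer:
            kept.add((start_a, start_b, length))
    return kept


# Toy input: three copies of a 28 nt repeat separated by two 30 nt spacers.
repeat = "GTTTTAGAGCTATGCTGTTTTGAATGGT"
spacer_1 = "A" + "C" * 28 + "G"
spacer_2 = "C" + "A" * 28 + "T"
toy = "AA" + repeat + spacer_1 + repeat + spacer_2 + repeat + "GG"

print(sorted(candidate_repeats(toy)))  # [(2, 60, 28), (60, 118, 28)]
```

Many seeds inside one repeat extend to the same coordinates, so collecting results in a set collapses them into one candidate per adjacent repeat pair.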

Learning Resources

ctSkennerton/minced (Connor Skennerton) - official repository with the canonical command-line flag surface, installation instructions, and example output.
PMC1924867 (CRT paper) (Bland et al.) - the full text of the algorithm description, including the seed-and-extend mechanism and the comparison against PatScan and PILER-CR.

Tools

MinCED CRISPR Array Detection (`minced-crispr`)

Detects CRISPR arrays in one or more nucleotide sequences. Returns, per input sequence, a list of CrisprArray objects; each carries an ordered list of CrisprRepeatSpacer units with the repeat’s start position, the repeat sequence, and the following spacer (the last unit has no spacer).

API Reference


Input: MincedInput

Field	Type	Description
sequences	List[string] (required)	Nucleotide sequence(s) to search for CRISPR arrays. Labeled positionally (seq_0, seq_1, …); results are returned in input order.


Config: MincedConfig

Parameter	Type	Default	Description
min_num_repeats	integer	3	Minimum repeats per array.
min_repeat_length	integer	23	Minimum repeat length in nt.
max_repeat_length	integer	47	Maximum repeat length in nt. Must be ≥ min_repeat_length.
min_spacer_length	integer	26	Minimum spacer length in nt.
max_spacer_length	integer	50	Maximum spacer length in nt. Must be ≥ min_spacer_length.
verbose	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device	string	"cpu"	Device to run the tool on.
timeout	integer	600	Maximum execution time in seconds. None waits indefinitely.
seed	integer	(none listed)	Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.


Output: MincedOutput

Field	Type	Description
results	List[MincedSequenceResult]	Per-sequence CRISPR detection results.

MincedSequenceResult

Field	Type	Description
sequence_id	string (required)	ID of the input sequence.
crispr_arrays	List[CrisprArray]	CRISPR arrays detected in this sequence.

Applications

Use this to confirm and catalog CRISPR loci across newly sequenced bacterial and archaeal genomes, or to mine spacer libraries from metagenomic assemblies for phage-host interaction studies. As a pre-filter, run minced-crispr first to verify that a candidate contig actually carries a CRISPR array before spending compute on downstream analysis of the same locus: pyhmmer-hmmsearch for Cas effector domains and crispr-tracr-rna for tracrRNA. The spacer set returned for each array can then be aligned against phage or plasmid sequence databases to reconstruct the host’s immune history.
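As a rough illustration of the spacer-mining step, the sketch below flattens detected arrays into FASTA records that an aligner could take as queries. The class names CrisprArray and CrisprRepeatSpacer come from the reference above, but the field names used here (`start`, `repeat`, `spacer`, `units`), the helper `spacers_to_fasta`, and the header format are hypothetical stand-ins, since the reference does not list them. Check the real objects before adapting this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CrisprRepeatSpacer:
    # Stand-in for the toolkit's unit type; field names are assumed.
    start: int
    repeat: str
    spacer: Optional[str]  # None for the last unit, which has no spacer


@dataclass
class CrisprArray:
    # Stand-in for the toolkit's array type; `units` is an assumed name.
    units: List[CrisprRepeatSpacer] = field(default_factory=list)


def spacers_to_fasta(sequence_id: str, arrays: List[CrisprArray]) -> str:
    """Render every spacer as a FASTA record, skipping units without a spacer."""
    lines = []
    for a_idx, array in enumerate(arrays):
        for s_idx, unit in enumerate(array.units):
            if unit.spacer:
                lines.append(f">{sequence_id}|array_{a_idx}|spacer_{s_idx}")
                lines.append(unit.spacer)
    return "\n".join(lines) + ("\n" if lines else "")


demo_array = CrisprArray(units=[
    CrisprRepeatSpacer(start=2, repeat="GTTTTAGAGC", spacer="ACCGGTTA"),
    CrisprRepeatSpacer(start=20, repeat="GTTTTAGAGC", spacer=None),
])
print(spacers_to_fasta("seq_0", [demo_array]), end="")
```

Encoding the sequence ID, array index, and spacer index in each header keeps every alignment hit traceable to its position in the array.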

Usage Tips

min_num_repeats controls the sensitivity-versus-specificity trade-off. The default of 3 balances both for typical bacterial and archaeal genomes. Lower it to 2 to catch partial or degraded arrays at the cost of more false positives, and raise it to 4 or more when only high-confidence arrays should pass through.
The 23 to 47 nt repeat and 26 to 50 nt spacer windows match canonical CRISPR loci. Widen max_repeat_length and max_spacer_length to detect atypical families such as Type IV-A or CRISPR systems with unusually long spacers. Lower min_repeat_length only when chasing partial repeats, because values below 20 nt start to pick up generic tandem repeats.
MinCED only locates the array; it does not identify Cas genes or classify the CRISPR system. Type assignment requires downstream Cas-effector annotation, typically pyhmmer-hmmsearch against curated Cas HMMs or a dedicated classifier such as CRISPRcasIdentifier.
Inverted length ranges are caught at config time. Setting max_repeat_length < min_repeat_length or max_spacer_length < min_spacer_length raises ValueError before the run starts, so the call fails fast instead of completing with an empty result set.
Spacer count is not an immunity-breadth metric. Multiple spacers in an array can target the same phage, and many spacers are degraded remnants of historical encounters, so the number of spacers overestimates how many distinct threats the host can recognize today.
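The fail-fast behavior for inverted ranges can be pictured with a small stand-in. The class below is a hypothetical sketch, not the toolkit's MincedConfig: it borrows the documented field names and defaults, and the error messages and the relaxed example values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MincedConfigSketch:
    """Illustrative stand-in that mirrors the documented length-window fields."""
    min_num_repeats: int = 3
    min_repeat_length: int = 23
    max_repeat_length: int = 47
    min_spacer_length: int = 26
    max_spacer_length: int = 50

    def __post_init__(self) -> None:
        # Reject inverted windows at construction time, before any run starts.
        if self.max_repeat_length < self.min_repeat_length:
            raise ValueError("max_repeat_length must be >= min_repeat_length")
        if self.max_spacer_length < self.min_spacer_length:
            raise ValueError("max_spacer_length must be >= min_spacer_length")


# A more permissive window (values chosen arbitrarily for the example).
relaxed = MincedConfigSketch(min_num_repeats=2, max_spacer_length=60)
print(relaxed)

# An inverted repeat window is rejected immediately.
try:
    MincedConfigSketch(min_repeat_length=30, max_repeat_length=25)
except ValueError as err:
    print(f"rejected: {err}")
```

Validating in the constructor means a typo in a length bound surfaces as an exception rather than as a silently empty result set.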

Toolkit Notes

These apply to every MinCED tool in this toolkit (minced-crispr).

Runs on CPU only. MinCED is a Java program; the standalone install bundles a Java runtime alongside the minced program. There is no GPU acceleration to enable, and runtime is seconds per typical bacterial genome.
Self-contained after install. The standalone setup.sh downloads the minced program once; subsequent runs need no network access and no model weights or reference databases.
Sequences are processed one at a time. The wrapper iterates over inputs.sequences sequentially rather than parallelizing across them. For large batches, run independent calls in parallel from the caller side.
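Caller-side parallelism for large batches might look like the sketch below. The names `run_in_parallel`, `detect_one`, and `fake_detect` are hypothetical. In practice `detect_one` would wrap a single-sequence call to the tool, whose exact signature is not shown here. A thread pool is assumed to be adequate on the premise that each call spends its time waiting on the external MinCED process; switch to a process pool if that does not hold in your setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_in_parallel(sequences: Sequence[str],
                    detect_one: Callable[[str], T],
                    max_workers: int = 4) -> List[T]:
    """Call detect_one once per sequence on a thread pool.

    Executor.map yields results in input order, matching the tool's own
    positional labeling (seq_0, seq_1, ...).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(detect_one, sequences))


def fake_detect(seq: str) -> int:
    """Placeholder for a real single-sequence call; returns the length."""
    return len(seq)


print(run_in_parallel(["ACGT", "ACGTACGT", "AC"], fake_detect))  # [4, 8, 2]
```

Because results come back in input order, downstream code can still pair each result with its sequence by index.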

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.

