Proto is not affiliated with the Steinegger Lab or the Söding Lab. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.
| Function | Description | Links |
|---|---|---|
| `run_mmseqs2_clustering()` | Perform sequence clustering using MMseqs2 to reduce redundancy. | Docs, Source |
| `run_mmseqs2_homology_search()` | Generate MSAs by searching protein sequences against MMseqs2-indexed databases (GPU by default). | Docs, Source |
| `run_mmseqs2_search_genomes()` | Execute the nucleotide genome-to-genome search workflow. | Docs, Source |
| `run_mmseqs2_search_proteins()` | Search protein sequences using MMseqs2 with per-sequence results. | Docs, Source |
Background
MMseqs2 (Steinegger and Söding, 2017) implements sequence-similarity search and clustering through a cascaded prefilter-align approach. Short k-mer matches between the query and the database are located first, surviving candidates are scored with an ungapped extension, and final hits are realigned with gapped Smith-Waterman alignment. Clustering uses greedy set cover over the alignment graph. The cascade reduces the search space by several orders of magnitude while retaining sensitivity comparable to BLAST, making analyses over databases with billions of sequences tractable on a single workstation. The GPU build (Kallenborn et al., 2025) accelerates the prefilter and alignment stages on NVIDIA Turing-generation or newer hardware. On top of the search engine, the ColabFold homology-search pipeline (Mirdita et al., 2022) iterates MMseqs2 searches against clustered reference databases such as UniRef30 to produce the multiple sequence alignments that AlphaFold-class structure predictors consume. This toolkit exposes that pipeline as `mmseqs2-homology-search` in addition to the more general search and clustering operations.
Learning Resources
- soedinglab/MMseqs2 (Söding and Steinegger labs). Official repository and the source of the `mmseqs` command-line program that this toolkit invokes.
- MMseqs2 wiki (Söding and Steinegger labs). The reference wiki for the command-line surface, including the workflow modules that the four registered tools wrap.
- ColabFold homology-search documentation (Steinegger and Ovchinnikov labs). Walks through the iterative MSA pipeline that `mmseqs2-homology-search` runs internally.
Tools
MMseqs2 Protein Search (mmseqs2-search-proteins)
Performs `mmseqs easy-search` of one or more protein query sequences against either a user-supplied target database or an inline list of target proteins. Returns the alignment hits per query, each with target identifier, percent identity, and E-value. The local execution mode runs on CPU by default and supports an opt-in GPU mode for searches against a prebuilt database with a GPU-padded index.
API Reference
Input: Mmseqs2SearchProteinsInput
- … (`seq_0`, `seq_1`, …) as `query_id` in the output; results are returned in input order.
- … `target_sequences`.
- … `mmseqs_db`.
Config: Mmseqs2SearchProteinsConfig
- `0` auto-detects all cores (the wrapper omits `--threads` since `mmseqs` rejects `--threads 0`).
- `0` = auto.
- … `"90G"`); `None` uses all system memory.
- … `cov_mode`.
- `0`, `1`, `2`, `3`, `4`, `5`
- … `--gpu 1`); requires a `.idx_pad` sibling on the target DB (built via `mmseqs makepaddedseqdb`).
- … `mmseqs easy-search` CLI tokens for niche flags (e.g. `["--alignment-mode", "3"]`).
- `True` is coerced to `1` and `False` to `0`.
- `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Output: Mmseqs2SearchProteinsOutput
Applications
This tool is appropriate for ad-hoc protein homology search at scale, ranking hits against a custom reference database, identifying functional homologs across a sequenced library, and any analysis in which the BLAST-style hit table is the deliverable. The MMseqs2 sensitivity model finds remote homologs that fall well below the sequence-identity range where standard BLAST searches lose signal.
Usage Tips
- Targets are specified via either `mmseqs_db` or `target_sequences`, but not both. Use `mmseqs_db` for a path to a FASTA file or a prebuilt MMseqs2 database when the target set is large or reused across calls. Use `target_sequences` for short inline lists.
- `sensitivity=5.7` is the wrapper default and matches upstream `easy-search`. Higher values recover more distant homologs at the cost of additional runtime. The accepted range is 1.0 to 7.5.
- `only_top_hits=True` (the default) returns only the best hit per query by percent identity. Set it to `False` to retain every hit that passes the configured thresholds.
- `use_gpu=True` requires a GPU-padded index alongside the target database. Build the index once with `mmseqs makepaddedseqdb <db> <db>.idx_pad`. The configuration validator hard-errors when the `.idx_pad` companion is missing or when `use_gpu=True` is combined with inline `target_sequences` (the GPU path does not accept inline targets).
- `extra_args` accepts verbatim `mmseqs easy-search` CLI tokens. Pass any flag not exposed as a typed field through this list (for example `["--alignment-mode", "3"]`). Tokens are appended after the typed flags.
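The rules in these tips can be sketched as a small command builder. This is an illustration only, not the wrapper's code: `build_easy_search_cmd`, its parameter names, and the `targets.fasta` placeholder are hypothetical. The positional order (`query target result tmp`) and the `-s`, `--threads`, and `--gpu` flags are those of upstream `mmseqs easy-search`; how the wrapper resolves the padded index path in GPU mode is not covered here.

```python
from pathlib import Path
from typing import Optional, Sequence


def build_easy_search_cmd(
    query_fasta: str,
    result_tsv: str,
    tmp_dir: str,
    mmseqs_db: Optional[str] = None,
    target_sequences: Optional[Sequence[str]] = None,
    sensitivity: float = 5.7,
    threads: int = 0,
    use_gpu: bool = False,
    extra_args: Sequence[str] = (),
) -> list[str]:
    """Hypothetical helper that mirrors the validation rules described above."""
    # Exactly one target source must be given.
    if (mmseqs_db is None) == (target_sequences is None):
        raise ValueError("provide exactly one of mmseqs_db or target_sequences")
    if not 1.0 <= sensitivity <= 7.5:
        raise ValueError("sensitivity must be between 1.0 and 7.5")
    if use_gpu:
        if target_sequences is not None:
            raise ValueError("use_gpu=True does not accept inline target_sequences")
        if not Path(f"{mmseqs_db}.idx_pad").exists():
            raise ValueError(f"missing GPU-padded index: {mmseqs_db}.idx_pad")

    # Assumption: inline targets are written to a FASTA file first (placeholder name).
    target = mmseqs_db if mmseqs_db is not None else "targets.fasta"
    cmd = ["mmseqs", "easy-search", query_fasta, target, result_tsv, tmp_dir,
           "-s", str(sensitivity)]
    if threads > 0:  # threads=0 means auto-detect, so the flag is omitted
        cmd += ["--threads", str(threads)]
    if use_gpu:
        cmd += ["--gpu", "1"]
    cmd += list(extra_args)  # verbatim tokens are appended after the typed flags
    return cmd
```

For example, `build_easy_search_cmd("q.fasta", "hits.m8", "tmp", mmseqs_db="db", extra_args=["--alignment-mode", "3"])` yields a token list that ends with the two verbatim tokens and contains no `--threads` flag.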
MMseqs2 Genome Search (mmseqs2-search-genomes)
Performs the full MMseqs2 nucleotide search pipeline against either a user-supplied target database or an inline list of target genomes. The tool builds query and target databases with `createdb`, runs `search`, and converts the result to the BLAST-style tabular schema with `convertalis`. Runs on CPU only.
API Reference
Input: Mmseqs2SearchGenomesInput
- … (`seq_0`, `seq_1`, …) as `query_id` in the output; results are returned in query order.
- … `target_db`.
- … `target_genomes`.
Config: Mmseqs2SearchGenomesConfig
- `0` auto-detects all cores (the wrapper omits `--threads` since `mmseqs` rejects `--threads 0`).
- … `cov_mode`.
- `0`, `1`, `2`, `3`, `4`, `5`
- `0`, `1`, `2`
- … `mmseqs search` CLI tokens for niche flags (e.g. `["--alignment-mode", "2"]`).
- `True` is coerced to `1` and `False` to `0`.
- `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Output: Mmseqs2SearchGenomesOutput
Applications
This tool is appropriate for genome-to-genome similarity analysis, locating homologous regions between assembled genomes, comparative genomics over closely related strains, and any nucleotide analog of the protein-search workflow.
Usage Tips
- Targets are specified via either `target_db` or `target_genomes`, but not both. Use `target_db` for a FASTA file or a prebuilt MMseqs2 database; use `target_genomes` for inline nucleotide sequences.
- `sensitivity=7.5` is the wrapper default for nucleotide search. This is a wrapper bias above the upstream MMseqs2 default of 5.7, chosen because nucleotide searches typically benefit from the higher sensitivity setting. The accepted range is 1.0 to 7.5.
- `strand=2` (both strands) is the wrapper default. Upstream defaults to forward strand only. Set `strand=1` to restrict to the forward strand or `strand=0` for reverse only.
- `extra_args` accepts verbatim `mmseqs search` CLI tokens. Tokens are appended after the typed flags.
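The `createdb`, `search`, `convertalis` sequence described for this tool can be sketched as a list of command lines. This is a rough illustration, not the wrapper's implementation: `genome_search_commands`, the database file names, and the use of `--search-type 3` (the upstream setting for nucleotide search) are assumptions, as is the mapping of `sensitivity` and `strand` onto the upstream `-s` and `--strand` flags.

```python
from typing import Sequence


def genome_search_commands(
    query_fasta: str,
    target_fasta: str,
    work_dir: str,
    sensitivity: float = 7.5,
    strand: int = 2,
    extra_args: Sequence[str] = (),
) -> list[list[str]]:
    """Hypothetical helper: return the four mmseqs invocations in run order."""
    if strand not in (0, 1, 2):
        raise ValueError("strand must be 0 (reverse), 1 (forward), or 2 (both)")
    # Illustrative file names inside the working directory.
    q_db = f"{work_dir}/queryDB"
    t_db = f"{work_dir}/targetDB"
    res_db = f"{work_dir}/resultDB"

    search = ["mmseqs", "search", q_db, t_db, res_db, f"{work_dir}/tmp",
              "--search-type", "3",  # assumption: nucleotide search type
              "-s", str(sensitivity),
              "--strand", str(strand)]
    search += list(extra_args)  # verbatim tokens go after the typed flags

    return [
        ["mmseqs", "createdb", query_fasta, q_db],
        ["mmseqs", "createdb", target_fasta, t_db],
        search,
        ["mmseqs", "convertalis", q_db, t_db, res_db, f"{work_dir}/result.m8"],
    ]
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` in order; keeping command construction separate from execution makes the flag handling easy to inspect without MMseqs2 installed.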
MMseqs2 Clustering (mmseqs2-clustering)
Performs `mmseqs cluster` over an inline list of sequences or a prebuilt MMseqs2 database and returns per-sequence cluster assignments. Each result records the cluster identifier and whether the sequence is the cluster representative. Runs on CPU only.
API Reference
Config: Mmseqs2ClusteringConfig
- … `cov_mode`.
- `0`, `1`, `2`, `3`, `4`, `5`
- `0`, `1`, `2`, `3`
- … `mmseqs cluster` CLI tokens for niche flags (e.g. `["--similarity-type", "2"]`).
- `True` is coerced to `1` and `False` to `0`.
- `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Output: Mmseqs2ClusteringOutput
Applications
This tool is appropriate for deduplicating a sequence set before downstream analysis, partitioning a protein library into functional families, selecting representative sequences from a redundant collection, and any analysis that benefits from a similarity-based grouping of sequences.
Usage Tips
- Inputs are specified via either `input_sequences` or `mmseqs_db`, but not both. Use `input_sequences` for inline sequences; use `mmseqs_db` for a prebuilt database that may be reused across calls.
- `min_seq_id=0.6` is the wrapper default. This is a wrapper bias above the upstream MMseqs2 default of 0.0, chosen as a reasonable starting point for grouping proteins into functional families. Set it higher (for example `0.95`) to remove near-duplicates, or lower (for example `0.3`) to group remote homologs.
- `cluster_mode=0` (set-cover) is the default greedy algorithm. Alternative modes are `1` (connected-component, BLASTclust-style) and `2` or `3` (greedy by length, CD-HIT-style).
- The cluster representative is the first sequence to cover the cluster during greedy set-cover. It is not necessarily the longest or most central sequence. Choose an alternative `cluster_mode` if a different representative-selection policy is needed.
- `extra_args` accepts verbatim `mmseqs cluster` CLI tokens. Tokens are appended after the typed flags.
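To make the representative-selection behavior concrete, here is a toy greedy set cover over a small similarity graph. It is a generic textbook sketch, not MMseqs2's implementation: the function name, the input format (a dict mapping each sequence ID to the IDs it aligns to above the thresholds), and the alphabetical tie-breaking are all assumptions made for illustration.

```python
def greedy_set_cover(neighbors: dict[str, set[str]]) -> dict[str, str]:
    """Map every sequence ID to the representative of its cluster.

    neighbors[s] holds the sequences that align to s above the thresholds.
    """
    unassigned = set(neighbors)
    assignment: dict[str, str] = {}
    while unassigned:
        # Pick the sequence covering the most still-unassigned sequences.
        # Iterating in sorted order makes ties resolve to the smallest ID.
        rep = max(
            sorted(unassigned),
            key=lambda s: len((neighbors[s] | {s}) & unassigned),
        )
        members = (neighbors[rep] | {rep}) & unassigned
        for member in members:
            assignment[member] = rep
        unassigned -= members
    return assignment


graph = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A"},
    "D": {"E"},
    "E": {"D"},
    "F": set(),
}
clusters = greedy_set_cover(graph)
# Per-sequence flag analogous to "is this sequence the cluster representative".
is_representative = {seq: rep == seq for seq, rep in clusters.items()}
```

In this example `A` becomes the representative of `A`, `B`, and `C` because it covers the most sequences in the first round, which illustrates why the representative need not be the longest sequence.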
MMseqs2 Homology Search (mmseqs2-homology-search)
Generates a multiple sequence alignment per query protein by iterating MMseqs2 searches against a registry-provisioned reference database using the ColabFold homology-search pipeline. Returns one MSA object per query, suitable as the MSA input to AlphaFold-class structure predictors. GPU execution is the default on supported hardware.
API Reference
Input: Mmseqs2HomologySearchInput
Config: Mmseqs2HomologySearchConfig
- `"local"` runs MMseqs2 against a registry-provisioned DB on disk; `"remote"` (the default) queries the ColabFold MSA API over the network and needs no local DB.
- Available options: `local`, `remote`
- `colabfold-envdb-202108`, `uniref30-2302`
- … `.idx_pad` index, an NVIDIA GPU (Turing+), and a Linux host.
- … `colabfold-envdb-202108` dataset provisioned. Default `False`. Does not affect cross-chain pairing.
- `"greedy"` pairs a species found in at least two chains; `"complete"` only pairs a species present in every chain. Ignored for singleton groups, and (remote-mode only) the API always uses its own greedy pairing.
- Available options: `greedy`, `complete`
- … `-s` override; ignored under `use_gpu=True`. `None` uses the dataset’s registered default.
- `None` auto-detects all cores.
- `True` is coerced to `1` and `False` to `0`.
- `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Output: Mmseqs2HomologySearchOutput
- … `Mmseqs2HomologySearchInput.queries`).
Applications
This tool is the proto-tools entry point for generating the MSA input to structure-prediction tools. It also drives coevolutionary analyses that identify covarying residue pairs as candidate spatial contacts, conservation analyses that highlight functionally important residues, and homolog mining for protein engineering and design pipelines.
Usage Tips
- The `dataset` field selects one registered reference database. The default is `uniref30-2302`. It is a scalar enum of the searchable ColabFold-style protein databases, so the proto-ui renders it as a dropdown; non-searchable or non-protein datasets are rejected by validation.
- GPU execution is the default. The configuration validator hard-errors on macOS and Windows (GPU search is Linux-only). Set `use_gpu=False` to force the CPU pipeline.
- The reference database must be provisioned once on the host machine before the first call. Run `python -m proto_tools.tools.sequence_alignment.mmseqs2.setup_databases <dataset>`, where the dataset key matches the value of `Mmseqs2HomologySearchConfig.dataset`. The wrapper does not auto-download databases at call time.
- Each query produces an `MSA` object or `None`. Always check `result.msas[i] is not None` before accessing alignment properties. The `num_homologs_found` list returns `0` for queries that produced no homologs. `MSA` objects serialise to A3M or FASTA through the standard export interface.
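The `None` check from the last tip might look like the following. The sketch uses stand-in classes so it runs without the toolkit: `StubMSA`, `StubResult`, and `usable_msas` are hypothetical names, and only the `msas` and `num_homologs_found` attributes come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubMSA:
    """Stand-in for the toolkit's MSA object (hypothetical shape)."""
    sequences: list[str]


@dataclass
class StubResult:
    """Stand-in exposing the two attributes named in the tips."""
    msas: list[Optional[StubMSA]]
    num_homologs_found: list[int]


def usable_msas(result: StubResult) -> dict[int, StubMSA]:
    """Return only the MSAs that exist, keyed by query index."""
    return {i: msa for i, msa in enumerate(result.msas) if msa is not None}


# Second query found no homologs, so its MSA slot is None.
result = StubResult(
    msas=[StubMSA(sequences=["MKVL", "MKIL"]), None],
    num_homologs_found=[1, 0],
)
for index, msa in usable_msas(result).items():
    print(f"query {index}: {len(msa.sequences)} aligned sequences")
```

Filtering up front keeps downstream code from touching alignment properties on a missing MSA, and the query index is preserved so results can still be matched back to the inputs.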
Toolkit Notes
These apply to every MMseqs2 tool in this toolkit (mmseqs2-search-proteins, mmseqs2-search-genomes, mmseqs2-clustering, mmseqs2-homology-search).
- All four tools share a single MMseqs2 installation. The local installation downloads the GPU-capable MMseqs2 build, which is a strict superset of the CPU-only build and runs CPU subcommands without enabling GPU code paths.
