License: MMseqs2 is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with the Steinegger Lab or the Söding Lab. This toolkit is open source and builds on the implementations produced by these labs. Product names, logos, and trademarks are the property of their respective owners.


soedinglab/MMseqs2
MMseqs2: ultra fast and sensitive search and clustering suite
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Martin Steinegger and Johannes Söding
Nature Biotechnology (2017)
@article{steinegger2017mmseqs2,
  title={MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets},
  author={Steinegger, Martin and S{\"o}ding, Johannes},
  journal={Nature Biotechnology},
  volume={35},
  number={11},
  pages={1026--1028},
  year={2017},
  publisher={Nature Publishing Group},
  doi={10.1038/nbt.3988}
}

@article{kallenborn2024mmseqs2gpu,
  title={GPU-accelerated homology search with MMseqs2},
  author={Kallenborn, Felix and Chacon, Alvaro and Hundt, Christian and Sirelkhatim, Hassan and Didi, Kieran and Cha, Sooyoung and Dallago, Christian and Mirdita, Milot and Schmidt, Bertil and Steinegger, Martin},
  journal={Nature Methods},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-025-02819-8}
}

@article{mirdita2022colabfold,
  title={ColabFold: making protein folding accessible to all},
  author={Mirdita, Milot and Sch{\"u}tze, Konstantin and Moriwaki, Yoshitaka and Heo, Lim and Ovchinnikov, Sergey and Steinegger, Martin},
  journal={Nature Methods},
  volume={19},
  number={6},
  pages={679--682},
  year={2022},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-022-01488-1}
}
proto-bio/proto-tools/proto_tools/tools/sequence_alignment/mmseqs2
Function: Description
  • run_mmseqs2_clustering(): Perform sequence clustering using MMseqs2 to reduce redundancy.
  • run_mmseqs2_homology_search(): Generate MSAs by searching protein sequences against MMseqs2-indexed databases (GPU by default).
  • run_mmseqs2_search_genomes(): Execute the nucleotide genome-to-genome search workflow.
  • run_mmseqs2_search_proteins(): Search protein sequences using MMseqs2 with per-sequence results.

Background

MMseqs2 (Steinegger and Söding, 2017) implements sequence-similarity search and clustering through a cascaded prefilter-align approach. Short k-mer matches between the query and the database are located first, surviving candidates are scored with an ungapped extension, and final hits are realigned with gapped Smith-Waterman alignment. Clustering uses greedy set cover over the alignment graph. The cascade reduces the search space by several orders of magnitude while retaining sensitivity comparable to BLAST, making analyses over databases with billions of sequences tractable on a single workstation. The GPU build (Kallenborn et al., 2025) accelerates the prefilter and alignment stages on NVIDIA Turing-generation or newer hardware. On top of the search engine, the ColabFold homology-search pipeline (Mirdita et al., 2022) iterates MMseqs2 searches against clustered reference databases such as UniRef30 to produce the multiple sequence alignments that AlphaFold-class structure predictors consume. This toolkit exposes that pipeline as mmseqs2-homology-search in addition to the more general search and clustering operations.
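The first two stages of the cascade described above can be illustrated with a toy sketch. It is not the MMseqs2 implementation: it assumes exact k-mer matches and a flat match/mismatch score, whereas MMseqs2 works with similar k-mers and substitution-matrix scores, and it omits the final gapped Smith-Waterman stage. All function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return (position, k-mer) pairs for every k-mer in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(database, k):
    """Map each k-mer to the (target_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for target_id, seq in database.items():
        for pos, kmer in kmers(seq, k):
            index[kmer].append((target_id, pos))
    return index


def prefilter(query, index, k, min_hits=2):
    """Stage 1 (simplified): keep (target, diagonal) pairs that collect
    at least min_hits exact k-mer matches on the same diagonal."""
    diagonal_hits = defaultdict(int)
    for q_pos, kmer in kmers(query, k):
        for target_id, t_pos in index.get(kmer, []):
            diagonal_hits[(target_id, t_pos - q_pos)] += 1
    return [key for key, hits in diagonal_hits.items() if hits >= min_hits]


def ungapped_score(query, target, diagonal, match=2, mismatch=-1):
    """Stage 2 (simplified): best-scoring contiguous segment along one
    diagonal, with no gaps allowed."""
    best = running = 0
    for q_pos in range(len(query)):
        t_pos = q_pos + diagonal
        if 0 <= t_pos < len(target):
            step = match if query[q_pos] == target[t_pos] else mismatch
            running = max(0, running + step)
            best = max(best, running)
    return best


# Toy data: only t1 shares k-mers with the query.
database = {"t1": "MKTAYIAKQRQISFVKSHFSRQ", "t2": "GGGGGGGGGGGGGGGG"}
query = "TAYIAKQRQ"

index = build_index(database, k=3)
candidates = prefilter(query, index, k=3)
scores = {
    (target_id, diagonal): ungapped_score(query, database[target_id], diagonal)
    for target_id, diagonal in candidates
}
```

Only the candidates that survive both stages would be passed on to gapped alignment, which is where the reduction in search space comes from.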

Learning Resources

  • soedinglab/MMseqs2 (Söding and Steinegger labs). Official repository and the source of the mmseqs command-line program that this toolkit invokes.
  • MMseqs2 wiki (Söding and Steinegger labs). The reference wiki for the command-line surface, including the workflow modules that the four registered tools wrap.
  • ColabFold homology-search documentation (Steinegger and Ovchinnikov labs). Walks through the iterative MSA pipeline that mmseqs2-homology-search runs internally.

Tools

MMseqs2 Clustering (mmseqs2-clustering)

Performs mmseqs cluster over an inline list of sequences or a prebuilt MMseqs2 database and returns per-sequence cluster assignments. Each result records the cluster identifier and whether the sequence is the cluster representative. Runs on CPU only.

API Reference

input_sequences
array
Inline sequences to cluster. Mutually exclusive with mmseqs_db.
mmseqs_db
string
Pre-built MMseqs2 DB (path/slug/AssetRef). Mutually exclusive with input_sequences.
sequence_ids
array
Optional IDs for inline sequences (defaults to seq_0, seq_1, …).
min_seq_id
number
default:"0.6"
Min identity (0.0-1.0) to share a cluster. Wrapper default 0.6 (upstream MMseqs2 = 0.0).
coverage
number
default:"0.8"
Minimum aligned-residue fraction (0.0-1.0); semantics depend on cov_mode.
cov_mode
enum
default:"0"
0=query AND target, 1=target, 2=query, 3-5=length-ratio variants. Available options: 0, 1, 2, 3, 4, 5
evalue
number
default:"0.001"
E-value threshold for the prefilter step.
cluster_mode
enum
default:"0"
0=Set-Cover (greedy), 1=Connected component (BLASTclust), 2-3=Greedy by length (CD-HIT). Available options: 0, 1, 2, 3
max_seqs
integer
default:"20"
Max prefilter results per query.
sensitivity
number
default:"4.0"
Prefilter sensitivity (1.0-7.5).
extra_args
List[string]
default:"[]"
Verbatim mmseqs cluster CLI tokens for niche flags (e.g. ["--similarity-type", "2"]).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
results
List[Mmseqs2ClusterResult]
required
List of clustering results, one per input sequence. The order matches the input sequences order.
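As a rough guide to how the typed parameters above relate to the underlying command line, the sketch below assembles an mmseqs cluster invocation from the documented wrapper defaults. The flag names (--min-seq-id, -c, --cov-mode, -e, --cluster-mode, --max-seqs, -s) are standard mmseqs cluster options, but the function build_cluster_command and the exact mapping the wrapper performs are assumptions made for illustration.

```python
from typing import List, Optional


def build_cluster_command(
    sequence_db: str,
    cluster_db: str,
    tmp_dir: str,
    min_seq_id: float = 0.6,
    coverage: float = 0.8,
    cov_mode: int = 0,
    evalue: float = 0.001,
    cluster_mode: int = 0,
    max_seqs: int = 20,
    sensitivity: float = 4.0,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: build an argv list for `mmseqs cluster`."""
    # Range checks mirror the ranges stated in the parameter descriptions.
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be between 0.0 and 1.0")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    if cov_mode not in range(6):
        raise ValueError("cov_mode must be one of 0-5")
    if cluster_mode not in range(4):
        raise ValueError("cluster_mode must be one of 0-3")
    if not 1.0 <= sensitivity <= 7.5:
        raise ValueError("sensitivity must be between 1.0 and 7.5")

    command = [
        "mmseqs", "cluster", sequence_db, cluster_db, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
        "-e", str(evalue),
        "--cluster-mode", str(cluster_mode),
        "--max-seqs", str(max_seqs),
        "-s", str(sensitivity),
    ]
    # Verbatim tokens go last, after the typed flags.
    command.extend(extra_args or [])
    return command


command = build_cluster_command(
    "seqDB", "cluDB", "tmp", extra_args=["--similarity-type", "2"]
)
```

Returning a list of tokens rather than a single string keeps the command safe to hand to subprocess.run without shell quoting.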

Applications

This tool is appropriate for deduplicating a sequence set before downstream analysis, partitioning a protein library into functional families, selecting representative sequences from a redundant collection, and any analysis that benefits from a similarity-based grouping of sequences.
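For the deduplication and representative-selection use cases, post-processing the per-sequence results might look like the sketch below. ClusterResult is a hypothetical stand-in for Mmseqs2ClusterResult: the field names sequence_id, cluster_id, and is_representative are assumptions, as is the convention that a cluster is identified by its representative's ID.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ClusterResult:
    """Hypothetical stand-in for Mmseqs2ClusterResult (field names assumed)."""
    sequence_id: str
    cluster_id: str
    is_representative: bool


def representatives(results):
    """Return the IDs of cluster representatives, in input order."""
    return [r.sequence_id for r in results if r.is_representative]


def members_by_cluster(results):
    """Group sequence IDs by the cluster they were assigned to."""
    clusters = defaultdict(list)
    for r in results:
        clusters[r.cluster_id].append(r.sequence_id)
    return dict(clusters)


# Made-up results for three inline sequences using the default IDs.
results = [
    ClusterResult("seq_0", "seq_0", True),
    ClusterResult("seq_1", "seq_0", False),
    ClusterResult("seq_2", "seq_2", True),
]

deduplicated = representatives(results)
clusters = members_by_cluster(results)
```

Keeping only the representatives yields a non-redundant set, while the grouped view is the starting point for per-family analysis.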

Usage Tips

  • Inputs are specified via either input_sequences or mmseqs_db, but not both. Use input_sequences for inline sequences; use mmseqs_db for a prebuilt database that may be reused across calls.
  • min_seq_id=0.6 is the wrapper default. It deliberately departs from the upstream MMseqs2 default of 0.0 and was chosen as a reasonable starting point for grouping proteins into functional families. Set it higher (for example 0.95) to remove near-duplicates, or lower (for example 0.3) to group remote homologs.
  • cluster_mode=0 (set-cover) is the default greedy algorithm. Alternative modes are 1 (connected-component, BLASTclust-style) and 2 or 3 (greedy by length, CD-HIT-style).
  • The cluster representative is the first sequence to cover the cluster during greedy set-cover. It is not necessarily the longest or most central sequence. Choose an alternative cluster_mode if a different representative-selection policy is needed.
  • extra_args accepts verbatim mmseqs cluster CLI tokens. Tokens are appended after the typed flags.
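To make the representative-selection behaviour of the set-cover mode more concrete, here is a minimal textbook greedy set cover over a toy similarity graph. It is a sketch under stated assumptions, not MMseqs2's implementation: the function greedy_set_cover, the alphabetical tie-breaking rule, and the example graph are all hypothetical.

```python
def greedy_set_cover(neighbors):
    """Assign every sequence to a representative.

    neighbors maps each sequence ID to the set of IDs it aligns to
    above the clustering thresholds (a toy alignment graph).
    """
    unassigned = set(neighbors)
    assignment = {}
    while unassigned:
        # Pick the unassigned sequence that covers the most still-unassigned
        # sequences; iterating in sorted order breaks ties by ID.
        rep = max(
            sorted(unassigned),
            key=lambda s: len((neighbors[s] | {s}) & unassigned),
        )
        covered = (neighbors[rep] | {rep}) & unassigned
        for seq_id in covered:
            assignment[seq_id] = rep
        unassigned -= covered
    return assignment


# Toy graph: "a" aligns to "b" and "c"; "d" and "e" align to each other.
neighbors = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": {"e"},
    "e": {"d"},
}
assignment = greedy_set_cover(neighbors)
```

In this sketch "a" becomes a representative because it covers the most sequences at the moment it is picked, regardless of its length, which is consistent with the note above that the representative is not necessarily the longest or most central sequence.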

Toolkit Notes

These notes apply to every MMseqs2 tool in this toolkit (mmseqs2-search-proteins, mmseqs2-search-genomes, mmseqs2-clustering, mmseqs2-homology-search).
  • All four tools share a single MMseqs2 installation. The local installation downloads the GPU-capable MMseqs2 build, which is a strict superset of the CPU-only build and runs CPU subcommands without enabling GPU code paths.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.