MMseqs2 - Proto

License: MMseqs2 is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with the Steinegger Lab or the Söding Lab. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.

soedinglab/MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Martin Steinegger and Johannes Söding

Nature Biotechnology (2017)

@article{steinegger2017mmseqs2,
  title={MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets},
  author={Steinegger, Martin and S{\"o}ding, Johannes},
  journal={Nature Biotechnology},
  volume={35},
  number={11},
  pages={1026--1028},
  year={2017},
  publisher={Nature Publishing Group},
  doi={10.1038/nbt.3988}
}

@article{kallenborn2024mmseqs2gpu,
  title={GPU-accelerated homology search with MMseqs2},
  author={Kallenborn, Felix and Chacon, Alvaro and Hundt, Christian and Sirelkhatim, Hassan and Didi, Kieran and Cha, Sooyoung and Dallago, Christian and Mirdita, Milot and Schmidt, Bertil and Steinegger, Martin},
  journal={Nature Methods},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-025-02819-8}
}

@article{mirdita2022colabfold,
  title={ColabFold: making protein folding accessible to all},
  author={Mirdita, Milot and Sch{\"u}tze, Konstantin and Moriwaki, Yoshitaka and Heo, Lim and Ovchinnikov, Sergey and Steinegger, Martin},
  journal={Nature Methods},
  volume={19},
  number={6},
  pages={679--682},
  year={2022},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-022-01488-1}
}

proto-bio/proto-tools/proto_tools/tools/sequence_alignment/mmseqs2

Function	Description
`run_mmseqs2_clustering()`	Perform sequence clustering using MMseqs2 to reduce redundancy.
`run_mmseqs2_homology_search()`	Generate MSAs by searching protein sequences against MMseqs2-indexed databases (GPU by default).
`run_mmseqs2_search_genomes()`	Execute the nucleotide genome-to-genome search workflow.
`run_mmseqs2_search_proteins()`	Search protein sequences using MMseqs2 with per-sequence results.

Background

MMseqs2 (Steinegger and Söding, 2017) implements sequence-similarity search and clustering through a cascaded prefilter-align approach. Short k-mer matches between the query and the database are located first, surviving candidates are scored with an ungapped extension, and final hits are realigned with gapped Smith-Waterman alignment. Clustering uses greedy set cover over the alignment graph. The cascade reduces the search space by several orders of magnitude while retaining sensitivity comparable to BLAST, making analyses over databases with billions of sequences tractable on a single workstation.

The GPU build (Kallenborn et al., 2025) accelerates the prefilter and alignment stages on NVIDIA Turing-generation or newer hardware.

On top of the search engine, the ColabFold homology-search pipeline (Mirdita et al., 2022) iterates MMseqs2 searches against clustered reference databases such as UniRef30 to produce the multiple sequence alignments that AlphaFold-class structure predictors consume. This toolkit exposes that pipeline as mmseqs2-homology-search in addition to the more general search and clustering operations.
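The prefilter idea described above can be illustrated with a toy sketch. It is a deliberate simplification and not the MMseqs2 implementation: it counts exact shared k-mers only, whereas MMseqs2 also matches similar k-mers, and the function names, the value of k, and the threshold are assumptions made for illustration.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query: str, targets: dict[str, str], k: int = 3,
              min_shared: int = 2) -> list[str]:
    """Shortlist targets that share at least `min_shared` exact k-mers with the query.

    Toy stand-in for the prefilter stage: only the shortlisted targets would be
    passed on to the (much more expensive) alignment stages.
    """
    query_kmers = kmers(query, k)
    shared = {tid: len(query_kmers & kmers(seq, k)) for tid, seq in targets.items()}
    ranked = sorted(shared.items(), key=lambda item: -item[1])
    return [tid for tid, count in ranked if count >= min_shared]


# Hypothetical example sequences.
targets = {
    "t1": "MKTAYIAKQRQISFVK",
    "t2": "GGGGSGGGGSGGGGS",
    "t3": "MKTAYIAKQR",
}
candidates = prefilter("MKTAYIAKQRQ", targets)
print(candidates)
```

In this example only t1 and t3 share k-mers with the query, so t2 is never aligned. That pruning is the source of the speed-up the cascade provides.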

Learning Resources

soedinglab/MMseqs2 (Söding and Steinegger labs). Official repository and the source of the mmseqs command-line program that this toolkit invokes.
MMseqs2 wiki (Söding and Steinegger labs). The reference wiki for the command-line surface, including the workflow modules that the four registered tools wrap.
ColabFold homology-search documentation (Steinegger and Ovchinnikov labs). Walks through the iterative MSA pipeline that mmseqs2-homology-search runs internally.

Tools

MMseqs2 Protein Search (`mmseqs2-search-proteins`)

Performs mmseqs easy-search of one or more protein query sequences against either a user-supplied target database or an inline list of target proteins. Returns the alignment hits per query, each with target identifier, percent identity, and E-value. The local execution mode runs on CPU by default and supports an opt-in GPU mode for searches against a prebuilt database with a GPU-padded index.

API Reference

Input: Mmseqs2SearchProteinsInput

query_sequences

List[string]

required

List of protein sequence strings (amino acid sequences) to search. Labeled positionally (seq_0, seq_1, …) as query_id in the output; results are returned in input order.

mmseqs_db

string

Target DB (path/slug/AssetRef). Mutually exclusive with target_sequences.

target_sequences

array

Inline target sequences. Mutually exclusive with mmseqs_db.

Config: Mmseqs2SearchProteinsConfig

threads

integer

default:"0"

CPU threads; 0 auto-detects all cores (the wrapper omits --threads since mmseqs rejects --threads 0).

split

integer

default:"0"

Split into N chunks to bound memory; 0 = auto.

split_memory_limit

string

Max prefilter memory per split (e.g. "90G"); None uses all system memory.

sensitivity

number

default:"5.7"

Prefilter sensitivity (1.0-7.5); higher = slower but finds more remote homologs.

evalue

number

default:"0.001"

E-value threshold for reported hits.

min_seq_id

number

default:"0.0"

Minimum sequence identity (0.0-1.0) for reported hits.

coverage

number

default:"0.0"

Minimum aligned-residue fraction (0.0-1.0); semantics depend on cov_mode.

cov_mode

enum

default:"0"

0=query AND target, 1=target, 2=query, 3-5=length-ratio variants. Available options: 0, 1, 2, 3, 4, 5

max_seqs

integer

default:"300"

Max prefilter results per query.

only_top_hits

boolean

default:"True"

Wrapper filter; keep only the best hit (highest pident) per query sequence.

use_gpu

boolean

default:"False"

Run MMseqs2-GPU (--gpu 1); requires a .idx_pad sibling on the target DB (built via mmseqs makepaddedseqdb).

extra_args

List[string]

default:"[]"

Verbatim mmseqs easy-search CLI tokens for niche flags (e.g. ["--alignment-mode", "3"]).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
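The interaction between coverage and cov_mode can be easier to follow as code. The following is a minimal sketch, not the wrapper's or MMseqs2's implementation. Modes 0 to 2 follow the field description above; the interpretation of the length-ratio modes 3 to 5 is an assumption based on the upstream MMseqs2 documentation, and the function and parameter names are hypothetical.

```python
def passes_coverage(aligned_query: int, aligned_target: int,
                    query_len: int, target_len: int,
                    coverage: float, cov_mode: int) -> bool:
    """Return True if a hit satisfies the coverage threshold under cov_mode."""
    query_cov = aligned_query / query_len      # fraction of query residues aligned
    target_cov = aligned_target / target_len   # fraction of target residues aligned
    if cov_mode == 0:   # query AND target
        return query_cov >= coverage and target_cov >= coverage
    if cov_mode == 1:   # target only
        return target_cov >= coverage
    if cov_mode == 2:   # query only
        return query_cov >= coverage
    # Length-ratio variants (assumed semantics, see lead-in).
    if cov_mode == 3:   # target length at least coverage * query length
        return target_len >= coverage * query_len
    if cov_mode == 4:   # query length at least coverage * target length
        return query_len >= coverage * target_len
    if cov_mode == 5:   # shorter sequence at least coverage * longer sequence
        return min(query_len, target_len) >= coverage * max(query_len, target_len)
    raise ValueError(f"unsupported cov_mode: {cov_mode}")


# A 100-residue query aligned over 80 residues to a 200-residue target.
print(passes_coverage(80, 80, 100, 200, coverage=0.8, cov_mode=2))
print(passes_coverage(80, 80, 100, 200, coverage=0.8, cov_mode=1))
```

With coverage=0.8 the same hit is accepted under cov_mode=2 (query coverage 0.8) and rejected under cov_mode=1 (target coverage 0.4), which is why the two fields should be chosen together.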

Output: Mmseqs2SearchProteinsOutput

results

List[Mmseqs2SequenceSearchResult]

required

List of search results, one per input sequence. The order matches the input sequences order.

Mmseqs2SequenceSearchResult

query_id

string

required

Identifier of the query sequence.

query_sequence

string

required

The input query sequence.

hits

List[Mmseqs2Hit]

All hits found for this query, sorted by pident descending.
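The only_top_hits filter described in the config can be pictured as a reduction over the hit table. The sketch below is illustrative only: it uses plain dictionaries with hypothetical keys (query_id, target_id, pident) in place of the real Mmseqs2Hit objects, whose exact field names are not shown on this page.

```python
def top_hit_per_query(hits: list[dict]) -> dict[str, dict]:
    """Keep the hit with the highest percent identity for each query."""
    best: dict[str, dict] = {}
    for hit in hits:
        current = best.get(hit["query_id"])
        if current is None or hit["pident"] > current["pident"]:
            best[hit["query_id"]] = hit
    return best


# Hypothetical hit table for two positional query labels.
hits = [
    {"query_id": "seq_0", "target_id": "A", "pident": 71.2},
    {"query_id": "seq_0", "target_id": "B", "pident": 93.5},
    {"query_id": "seq_1", "target_id": "C", "pident": 40.0},
]
best = top_hit_per_query(hits)
print({q: h["target_id"] for q, h in best.items()})
```

Because the selection is by percent identity rather than E-value or bit score, a short, highly identical hit can win over a longer alignment with a better score.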

Applications

This tool is appropriate for ad-hoc protein homology search at scale, ranking hits against a custom reference database, identifying functional homologs across a sequenced library, and any analysis in which the BLAST-style hit table is the deliverable. At high sensitivity settings, MMseqs2 recovers remote homologs with sensitivity comparable to BLAST at a much lower runtime cost.

Usage Tips

Targets are specified via either mmseqs_db or target_sequences, but not both. Use mmseqs_db for a path to a FASTA file or a prebuilt MMseqs2 database when the target set is large or reused across calls. Use target_sequences for short inline lists.
sensitivity=5.7 is the wrapper default and matches upstream easy-search. Higher values recover more distant homologs at the cost of additional runtime. The accepted range is 1.0 to 7.5.
only_top_hits=True (the default) returns only the best hit per query by percent identity. Set it to False to retain every hit that passes the configured thresholds.
use_gpu=True requires a GPU-padded index alongside the target database. Build the index once with mmseqs makepaddedseqdb <db> <db>.idx_pad. The configuration validator hard-errors when the .idx_pad companion is missing or when use_gpu=True is combined with inline target_sequences (the GPU path does not accept inline targets).
extra_args accepts verbatim mmseqs easy-search CLI tokens. Pass any flag not exposed as a typed field through this list (for example ["--alignment-mode", "3"]). Tokens are appended after the typed flags.
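The tips above can be summarised as a sketch of how typed fields might translate into an mmseqs easy-search command line. This is an assumption about the wrapper's behaviour, not its actual code: the helper names are hypothetical, while the flags themselves (-s, -e, --min-seq-id, -c, --cov-mode, --max-seqs, --threads, --gpu) are standard MMseqs2 options.

```python
from typing import Optional, Sequence


def validate_targets(mmseqs_db: Optional[str],
                     target_sequences: Optional[Sequence[str]],
                     use_gpu: bool = False) -> None:
    """Enforce the documented rules: exactly one target source, no GPU with inline targets."""
    if (mmseqs_db is None) == (target_sequences is None):
        raise ValueError("set exactly one of mmseqs_db or target_sequences")
    if use_gpu and target_sequences is not None:
        raise ValueError("use_gpu=True does not accept inline target_sequences")


def build_easy_search_cmd(query_fasta: str, target: str, out_m8: str, tmp_dir: str, *,
                          threads: int = 0, sensitivity: float = 5.7,
                          evalue: float = 1e-3, min_seq_id: float = 0.0,
                          coverage: float = 0.0, cov_mode: int = 0,
                          max_seqs: int = 300, use_gpu: bool = False,
                          extra_args: Sequence[str] = ()) -> list[str]:
    """Assemble an argument list suitable for subprocess.run (hypothetical helper)."""
    cmd = ["mmseqs", "easy-search", query_fasta, target, out_m8, tmp_dir,
           "-s", str(sensitivity), "-e", str(evalue),
           "--min-seq-id", str(min_seq_id),
           "-c", str(coverage), "--cov-mode", str(cov_mode),
           "--max-seqs", str(max_seqs)]
    if threads > 0:            # 0 means auto-detect, so the flag is omitted
        cmd += ["--threads", str(threads)]
    if use_gpu:
        cmd += ["--gpu", "1"]
    cmd += list(extra_args)    # verbatim tokens go last, after the typed flags
    return cmd


cmd = build_easy_search_cmd("query.fasta", "targetDB", "hits.m8", "tmp",
                            extra_args=["--alignment-mode", "3"])
print(" ".join(cmd))
```

Appending extra_args last means a flag repeated there is seen after its typed counterpart; which occurrence takes effect is determined by mmseqs itself and is not confirmed here.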

MMseqs2 Genome Search (`mmseqs2-search-genomes`)

Performs the full MMseqs2 nucleotide search pipeline against either a user-supplied target database or an inline list of target genomes. The tool builds query and target databases with createdb, runs search, and converts the result to the BLAST-style tabular schema with convertalis. Runs on CPU only.

API Reference

Input: Mmseqs2SearchGenomesInput

query_genomes

List[string]

required

List of nucleotide sequence strings (DNA/RNA) to use as queries. Labeled positionally (seq_0, seq_1, …) as query_id in the output; results are returned in query order.

target_genomes

array

Inline target genomes. Mutually exclusive with target_db.

target_db

string

Target FASTA or MMseqs2 DB stem (path/slug/AssetRef). Mutually exclusive with target_genomes.

Config: Mmseqs2SearchGenomesConfig

threads

integer

default:"0"

CPU threads; 0 auto-detects all cores (the wrapper omits --threads since mmseqs rejects --threads 0).

sensitivity

number

default:"7.5"

Prefilter sensitivity (1.0-7.5). Wrapper default 7.5 (upstream MMseqs2 = 5.7).

evalue

number

default:"0.001"

E-value threshold for reported hits.

min_seq_id

number

default:"0.0"

Minimum sequence identity (0.0-1.0) for reported hits.

coverage

number

default:"0.0"

Minimum aligned-residue fraction (0.0-1.0); semantics depend on cov_mode.

cov_mode

enum

default:"0"

0=query AND target, 1=target, 2=query, 3-5=length-ratio variants. Available options: 0, 1, 2, 3, 4, 5

max_seqs

integer

default:"300"

Max prefilter results per query.

strand

enum

default:"2"

0=reverse, 1=forward, 2=both. Wrapper default 2 (upstream MMseqs2 = 1). Available options: 0, 1, 2

extra_args

List[string]

default:"[]"

Verbatim mmseqs search CLI tokens for niche flags (e.g. ["--alignment-mode", "2"]).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Output: Mmseqs2SearchGenomesOutput

results

List[Mmseqs2SequenceSearchResult]

required

List of search results, one per input query genome. The order matches the input query genomes order.

Mmseqs2SequenceSearchResult

query_id

string

required

Identifier of the query sequence.

query_sequence

string

required

The input query sequence.

hits

List[Mmseqs2Hit]

All hits found for this query, sorted by pident descending.

Applications

This tool is appropriate for genome-to-genome similarity analysis, locating homologous regions between assembled genomes, comparative genomics over closely related strains, and any nucleotide analog of the protein-search workflow.

Usage Tips

Targets are specified via either target_db or target_genomes, but not both. Use target_db for a FASTA file or a prebuilt MMseqs2 database; use target_genomes for inline nucleotide sequences.
sensitivity=7.5 is the wrapper default for nucleotide search. The wrapper deliberately sets this above the upstream MMseqs2 default of 5.7 because nucleotide searches typically benefit from the higher sensitivity setting. The accepted range is 1.0 to 7.5.
strand=2 (both strands) is the wrapper default. Upstream defaults to forward strand only. Set strand=1 to restrict to the forward strand or strand=0 for reverse only.
extra_args accepts verbatim mmseqs search CLI tokens. Tokens are appended after the typed flags.
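The createdb, search, and convertalis sequence described for this tool can be sketched as a list of subprocess argument lists. This is an illustration under assumptions rather than the wrapper's code: the helper name and the working-directory layout are hypothetical, and the explicit --search-type 3 (nucleotide) flag is an assumption that this page does not confirm.

```python
from typing import Sequence


def genome_search_cmds(query_fasta: str, target_fasta: str, workdir: str, *,
                       sensitivity: float = 7.5, strand: int = 2,
                       extra_args: Sequence[str] = ()) -> list[list[str]]:
    """Return the four mmseqs invocations for a nucleotide search, in run order."""
    query_db = f"{workdir}/queryDB"
    target_db = f"{workdir}/targetDB"
    result_db = f"{workdir}/resultDB"
    tmp_dir = f"{workdir}/tmp"
    out_m8 = f"{workdir}/hits.m8"
    return [
        ["mmseqs", "createdb", query_fasta, query_db],
        ["mmseqs", "createdb", target_fasta, target_db],
        ["mmseqs", "search", query_db, target_db, result_db, tmp_dir,
         "--search-type", "3",              # assumed: force nucleotide search
         "-s", str(sensitivity),
         "--strand", str(strand),           # 0=reverse, 1=forward, 2=both
         *extra_args],                      # verbatim tokens after typed flags
        ["mmseqs", "convertalis", query_db, target_db, result_db, out_m8],
    ]


cmds = genome_search_cmds("queries.fna", "targets.fna", "work")
for step in cmds:
    print(" ".join(step))
```

Each inner list could be passed to subprocess.run(step, check=True) in order; when target_db already points at a prebuilt database, the second createdb step would presumably be skipped.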

MMseqs2 Clustering (`mmseqs2-clustering`)

Performs mmseqs cluster over an inline list of sequences or a prebuilt MMseqs2 database and returns per-sequence cluster assignments. Each result records the cluster identifier and whether the sequence is the cluster representative. Runs on CPU only.

API Reference

Input: Mmseqs2ClusteringInput

input_sequences

array

Inline sequences to cluster. Mutually exclusive with mmseqs_db.

mmseqs_db

string

Pre-built MMseqs2 DB (path/slug/AssetRef). Mutually exclusive with input_sequences.

sequence_ids

array

Optional IDs for inline sequences (defaults to seq_0, seq_1, …).

Config: Mmseqs2ClusteringConfig

min_seq_id

number

default:"0.6"

Min identity (0.0-1.0) to share a cluster. Wrapper default 0.6 (upstream MMseqs2 = 0.0).

coverage

number

default:"0.8"

Minimum aligned-residue fraction (0.0-1.0); semantics depend on cov_mode.

cov_mode

enum

default:"0"

0=query AND target, 1=target, 2=query, 3-5=length-ratio variants. Available options: 0, 1, 2, 3, 4, 5

evalue

number

default:"0.001"

E-value threshold for the prefilter step.

cluster_mode

enum

default:"0"

0=Set-Cover (greedy), 1=Connected component (BLASTclust), 2-3=Greedy by length (CD-HIT). Available options: 0, 1, 2, 3

max_seqs

integer

default:"20"

Max prefilter results per query.

sensitivity

number

default:"4.0"

Prefilter sensitivity (1.0-7.5).

extra_args

List[string]

default:"[]"

Verbatim mmseqs cluster CLI tokens for niche flags (e.g. ["--similarity-type", "2"]).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Output: Mmseqs2ClusteringOutput

results

List[Mmseqs2ClusterResult]

required

List of clustering results, one per input sequence. The order matches the input sequences order.

Mmseqs2ClusterResult

sequence_id

string

required

Identifier of the input sequence.

input_sequence

string

Original input sequence; None when the caller used mmseqs_db.

cluster_id

string

required

Identifier of the cluster (usually the representative’s ID).

is_representative

boolean

Whether this sequence is the cluster representative.

Applications

This tool is appropriate for deduplicating a sequence set before downstream analysis, partitioning a protein library into functional families, selecting representative sequences from a redundant collection, and any analysis that benefits from a similarity-based grouping of sequences.

Usage Tips

Inputs are specified via either input_sequences or mmseqs_db, but not both. Use input_sequences for inline sequences; use mmseqs_db for a prebuilt database that may be reused across calls.
min_seq_id=0.6 is the wrapper default. The wrapper deliberately sets this above the upstream MMseqs2 default of 0.0, as a reasonable starting point for grouping proteins into functional families. Set it higher (for example 0.95) to remove near-duplicates, or lower (for example 0.3) to group remote homologs.
cluster_mode=0 (set-cover) is the default greedy algorithm. Alternative modes are 1 (connected-component, BLASTclust-style) and 2 or 3 (greedy by length, CD-HIT-style).
The cluster representative is the first sequence to cover the cluster during greedy set-cover. It is not necessarily the longest or most central sequence. Choose an alternative cluster_mode if a different representative-selection policy is needed.
extra_args accepts verbatim mmseqs cluster CLI tokens. Tokens are appended after the typed flags.
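To make the representative-selection behaviour of cluster_mode=0 more concrete, here is a toy greedy set-cover over a similarity graph. It is a sketch of the general technique, not MMseqs2's implementation: the graph is given as a hypothetical adjacency mapping of sequences that passed the identity and coverage thresholds, and ties are broken alphabetically purely for determinism.

```python
def greedy_set_cover(neighbors: dict[str, set[str]]) -> dict[str, str]:
    """Map each sequence ID to the ID of its cluster representative."""
    unassigned = set(neighbors)
    assignment: dict[str, str] = {}
    while unassigned:
        # Pick the sequence that covers the most still-unassigned sequences.
        rep = max(sorted(unassigned),
                  key=lambda node: len((neighbors[node] | {node}) & unassigned))
        members = (neighbors[rep] | {rep}) & unassigned
        for member in members:
            assignment[member] = rep
        unassigned -= members
    return assignment


# Hypothetical similarity graph: b is linked to a, c, and d; e is a singleton.
graph = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b"},
    "e": set(),
}
clusters = greedy_set_cover(graph)
print(clusters)
```

Here b becomes the representative because it covers the most sequences, regardless of its length, which matches the note above that the representative is not necessarily the longest sequence.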

MMseqs2 Homology Search (`mmseqs2-homology-search`)

Generates a multiple sequence alignment per query protein using the ColabFold homology-search pipeline. In the default remote mode the tool queries the ColabFold MSA API over the network; in local mode it iterates MMseqs2 searches against a registry-provisioned reference database. Returns one MSA object per query, suitable as the MSA input to AlphaFold-class structure predictors. In local mode, GPU execution is the default on supported hardware.

API Reference

Input: Mmseqs2HomologySearchInput

queries

List[Mmseqs2HomologySearchQuery | List[Mmseqs2HomologySearchQuery]]

required

List of query groups, in input order.

Config: Mmseqs2HomologySearchConfig

search_mode

enum

default:"remote"

"local" runs MMseqs2 against a registry-provisioned DB on disk; "remote" (the default) queries the ColabFold MSA API over the network and needs no local DB. Available options: local, remote

dataset

enum

default:"uniref30-2302"

Local-only (ignored when remote). Registered key of the searchable reference database; one ColabFold protein DB. Available options: colabfold-envdb-202108, uniref30-2302

use_gpu

boolean

default:"True"

Local-only (ignored when remote). Run MMseqs2-GPU; requires a .idx_pad index, an NVIDIA GPU (Turing+), and a Linux host.

use_metagenomic_db

boolean

default:"False"

Include the metagenomic/environmental DB (ColabFoldDB envdb) to deepen unpaired MSAs. Works in both modes; local mode requires the colabfold-envdb-202108 dataset provisioned. Default False. Does not affect cross-chain pairing.

pairing_strategy

enum

default:"greedy"

Cross-chain pairing strategy for paired (multi-chain) groups. "greedy" pairs a species found in at least two chains; "complete" only pairs a species present in every chain. Ignored for singleton groups. In remote mode the API always uses its own greedy pairing, regardless of this setting. Available options: greedy, complete

sensitivity

number

Local-only (ignored when remote). MMseqs2 -s override; ignored under use_gpu=True. None uses the dataset’s registered default.

num_threads

integer

Local-only. CPU threads; None auto-detects all cores.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"3600"

Subprocess timeout in seconds. None waits indefinitely.

seed

integer
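The difference between the two pairing_strategy values can be shown with a small set-based sketch. It models only the rule stated in the field description (species present in at least two chains versus in every chain). The real pipeline pairs MSA rows by taxonomy and is more involved, and the function name and inputs here are hypothetical.

```python
def paired_species(chain_species: list[set[str]], strategy: str = "greedy") -> set[str]:
    """Return the species eligible for cross-chain pairing under a strategy."""
    if strategy not in ("greedy", "complete"):
        raise ValueError(f"unknown pairing strategy: {strategy}")
    if len(chain_species) < 2:
        return set()  # singleton groups have nothing to pair
    required = len(chain_species) if strategy == "complete" else 2
    all_species = set().union(*chain_species)
    return {species for species in all_species
            if sum(species in chain for chain in chain_species) >= required}


# Hypothetical species found among the homologs of each of three chains.
chains = [
    {"E. coli", "H. sapiens", "M. musculus"},
    {"E. coli", "H. sapiens"},
    {"E. coli"},
]
print(sorted(paired_species(chains, "greedy")))
print(sorted(paired_species(chains, "complete")))
```

For a two-chain group the two strategies coincide under this rule; they only diverge with three or more chains, where "complete" yields fewer but fully populated paired rows.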

Output: Mmseqs2HomologySearchOutput

results

List[Mmseqs2HomologySearchResult]

required

One result per input group (matches the order of Mmseqs2HomologySearchInput.queries).

Mmseqs2HomologySearchResult

sequence_ids

List[string]

required

Identifiers for the chains in this group.

msas

List[MSA]

required

Per-chain unpaired MSAs. None when no homologs were found beyond the query itself.

paired_msas

List[MSA]

required

Per-chain taxonomy-paired MSAs, row-aligned across the group’s chains. [None] for singleton groups (nothing to pair).

datasets_searched

List[string]

required

Registry keys of datasets hit for this group.

num_homologs_found

List[integer]

required

Number of homologs per chain (excludes the query itself).

Applications

This tool is the proto-tools entry point for generating the MSA input to structure-prediction tools. It also drives coevolutionary analyses that identify covarying residue pairs as candidate spatial contacts, conservation analyses that highlight functionally important residues, and homolog mining for protein engineering and design pipelines.

Usage Tips

The dataset field selects one registered reference database. The default is uniref30-2302. It is a scalar enum of the searchable ColabFold-style protein databases, so the proto-ui renders it as a dropdown; non-searchable or non-protein datasets are rejected by validation.
In local mode, GPU execution is the default (use_gpu=True). The configuration validator hard-errors on macOS and Windows (GPU search is Linux-only). Set use_gpu=False to force the CPU pipeline. The setting is ignored in remote mode.
In local mode, the reference database must be provisioned once on the host machine before the first call. Run python -m proto_tools.tools.sequence_alignment.mmseqs2.setup_databases <dataset>, where the dataset key matches the value of Mmseqs2HomologySearchConfig.dataset. The wrapper does not auto-download databases at call time. Remote mode needs no local database.
Each query produces an MSA object or None. Always check result.msas[i] is not None before accessing alignment properties. The num_homologs_found list returns 0 for queries that produced no homologs. MSA objects serialise to A3M or FASTA through the standard export interface.
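The None check recommended above might look like the following sketch. To stay self-contained it uses a hypothetical stand-in dataclass that mirrors the documented output fields (sequence_ids, msas, num_homologs_found) and represents each MSA as a plain list of strings; the real MSA class and its export methods are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInResult:
    """Hypothetical stand-in for Mmseqs2HomologySearchResult."""
    sequence_ids: list[str]
    msas: list[Optional[list[str]]]   # the real field holds MSA objects or None
    num_homologs_found: list[int]


def usable_msas(result: StandInResult) -> dict[str, list[str]]:
    """Return only the chains that have an MSA, keyed by sequence ID."""
    return {seq_id: msa
            for seq_id, msa in zip(result.sequence_ids, result.msas)
            if msa is not None}


# Chain B found no homologs, so its MSA entry is None and its count is 0.
result = StandInResult(
    sequence_ids=["chain_A", "chain_B"],
    msas=[["MKTAYIAK", "MKTAYLAK", "MRTAYIAK"], None],
    num_homologs_found=[2, 0],
)
msas = usable_msas(result)
print(list(msas))
```

Filtering up front keeps downstream code from dereferencing None, and the skipped chains can still be reported from num_homologs_found.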

Toolkit Notes

These apply to every MMseqs2 tool in this toolkit (mmseqs2-search-proteins, mmseqs2-search-genomes, mmseqs2-clustering, mmseqs2-homology-search).

All four tools share a single MMseqs2 installation. The local installation downloads the GPU-capable MMseqs2 build, which is a strict superset of the CPU-only build and runs CPU subcommands without enabling GPU code paths.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
