ColabFold Search

License: ColabFold Search is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with the Ovchinnikov Lab or the Steinegger Lab. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.

sokrypton/ColabFold

Making Protein folding accessible to all!

ColabFold: making protein folding accessible to all

Milot Mirdita, Konstantin Schütze, … Martin Steinegger

Nature Methods (2022)

@article{mirdita2022colabfold,
  title={ColabFold: making protein folding accessible to all},
  author={Mirdita, Milot and Sch{\"u}tze, Konstantin and Moriwaki, Yoshitaka and Heo, Lim and Ovchinnikov, Sergey and Steinegger, Martin},
  journal={Nature Methods},
  volume={19},
  number={6},
  pages={679--682},
  year={2022},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-022-01488-1}
}

proto-bio/proto-tools/proto_tools/tools/sequence_alignment/colabfold_search

Function	Description
`run_colabfold_search()`	Generate Multiple Sequence Alignments via ColabFold (local MMseqs2 DB or remote API)

Background

ColabFold (Mirdita et al., 2022) is an open-source pipeline that pairs the MMseqs2 (Many-against-Many sequence searching) engine with AlphaFold-class structure prediction. The homology-search step uses a three-stage cascade. Short k-mer matches between the query and the database are located first, surviving candidates are scored with an ungapped extension, and final hits are realigned with gapped Smith-Waterman alignment.

The pipeline produces per-query multiple sequence alignments (MSAs) that capture the evolutionary signal that AlphaFold and similar structure-prediction models rely on. Conservation patterns within an MSA reveal residues under structural or functional constraint, and covarying residue pairs identify spatial contacts.

This toolkit exposes the search step in two execution modes. Remote execution targets the public ColabFold MMseqs2 API operated by the upstream developers and requires no local database. Local execution runs the bundled colabfold_search command-line tool against a local MMseqs2 database, supporting much higher throughput and optional GPU acceleration. The local database is the UniRef30 clustered reference of UniProt, optionally augmented with a metagenomic environmental database, and must be provisioned once on the host machine.
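The three-stage cascade can be illustrated with a toy sketch. This is not MMseqs2's implementation: it assumes exact k-mer matches and a flat match/mismatch score, whereas the real engine works with substitution-matrix scoring and heavily optimized kernels. All function names and thresholds below are hypothetical.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal(query, target, k=3):
    """Stage 1: diagonal (target_pos - query_pos) with the most exact k-mer hits."""
    index = kmer_index(query, k)
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1
    return max(hits, key=hits.get) if hits else None


def ungapped_score(query, target, diagonal, match=2, mismatch=-1):
    """Stage 2: best-scoring ungapped segment along a single diagonal."""
    best = running = 0
    for i in range(len(query)):
        j = i + diagonal
        if 0 <= j < len(target):
            step = match if query[i] == target[j] else mismatch
            running = max(0, running + step)
            best = max(best, running)
    return best


def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Stage 3: gapped local alignment score with a linear gap penalty."""
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def cascade_search(query, database, k=3, min_ungapped=6):
    """Run the three stages; return (name, gapped_score) pairs, best first."""
    results = []
    for name, target in database.items():
        diagonal = best_diagonal(query, target, k)
        if diagonal is None:
            continue  # no shared k-mer: rejected by the prefilter
        if ungapped_score(query, target, diagonal) < min_ungapped:
            continue  # weak ungapped segment: rejected before the costly stage
        results.append((name, smith_waterman(query, target)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The point of the ordering is cost: the cheap k-mer and ungapped stages discard most database entries, so the quadratic-time gapped alignment only runs on the few surviving candidates.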

Learning Resources

sokrypton/ColabFold (Steinegger and Ovchinnikov labs). Official repository and the source of the colabfold_search command-line tool, plus the Google Colab notebooks that interactively expose the ColabFold pipeline.
ColabFold web service (Steinegger and Ovchinnikov labs). Hosted entry point to the ColabFold MSA-search and structure-prediction pipeline, useful for a quick browser-based run before scripting against the tool.

Tools

ColabFold MSA Search (`colabfold-search`)

Generates a multiple sequence alignment for each input protein sequence by searching reference databases for homologs. Remote execution submits the query to the public ColabFold MMseqs2 API. Local execution runs the bundled colabfold_search command-line tool against a local MMseqs2 database. The tool returns one result per query in input order, each carrying a list of per-chain MSA objects (one for an unpaired query; row-aligned per-chain MSAs for a paired group) that can be exported to A3M or FASTA. Inputs accept raw sequence strings (one unpaired query each), a nested list of sequences (one taxonomy-paired group), or ColabfoldSearchQuery objects.

API Reference

Input: ColabfoldSearchInput

queries

List[ColabfoldSearchQuery]

required

Search queries in original input order. Each query is unpaired (one chain) or paired (q.is_paired, two or more chains). Results are returned parallel to this list.

ColabfoldSearchQuery fields

sequences

List[string]

required

The chain sequence(s) for this query. One chain is an unpaired search; two or more is one taxonomy-paired group. A bare str is accepted and normalized to [str].
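The accepted input shapes can be pictured with a small normalization sketch. `Query` below is a hypothetical stand-in for ColabfoldSearchQuery, and `normalize_queries` is an illustrative helper, not a function of the toolkit; the real classes may validate more than this.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class Query:
    """Hypothetical stand-in for ColabfoldSearchQuery."""
    sequences: List[str]

    @property
    def is_paired(self) -> bool:
        # Two or more chains form one taxonomy-paired group.
        return len(self.sequences) >= 2


def normalize_queries(items: Sequence[Union[str, Sequence[str], Query]]) -> List[Query]:
    """Turn raw strings, nested lists, and Query objects into Query objects."""
    queries = []
    for item in items:
        if isinstance(item, Query):
            queries.append(item)
        elif isinstance(item, str):
            queries.append(Query([item]))  # bare str becomes a one-chain query
        else:
            queries.append(Query(list(item)))  # nested list becomes a paired group
    return queries


queries = normalize_queries(["MKTAYIAKQR", ["MKTAYIAKQR", "GSHMLEDPVA"]])
print([q.is_paired for q in queries])
```

Input order is preserved, which is what lets results be returned parallel to the query list.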

Config: ColabfoldSearchConfig

search_mode

enum

default:"remote"

"local" runs MMseqs2 against a downloaded DB; "remote" queries ColabFold’s MSA API.
Available options: local, remote

use_metagenomic_db

boolean

default:"False"

Include metagenomic/environmental DBs (ColabFoldDB envdb / SPIRE) in the search. Off for speed; upstream colabfold defaults this on (--use-env=1). Supported in both local and remote modes. Deepens the unpaired per-chain MSAs only; it does not affect cross-chain pairing.

pairing_strategy

enum

default:"greedy"

Cross-chain pairing strategy for paired (multi-chain) queries. "greedy" pairs a species found in at least two chains; "complete" only pairs a species present in every chain. "greedy" (the default) typically yields more paired rows and better predictions.
Available options: greedy, complete

output_dir

string

Directory where output MSA files are saved. An msas subdirectory is created to store A3M files, one per sequence ID. None resolves to $PROTO_HOME/colabfold_search.

msa_db_dir

string

Local mode only. Path to the MMseqs2 database directory provisioned by setup_databases.sh. None resolves to $PROTO_MODEL_CACHE/databases/uniref30_2302/. Deliberately kept outside output_dir so the run-time cleanup in _cleanup_default_output_dir_if_cache_empty cannot delete it.

database_name

string

default:"uniref30_2302_db"

Local mode only. MMseqs2 DB stem within msa_db_dir (matches the *.dbtype file).

sensitivity

number

MMseqs2 -s override (1.0-9.0). Local mode only. Ignored under use_gpu=True (colabfold_search forces ungapped prefilter and drops -s). When None on CPU, falls back to colabfold’s k-score path (matches the public MSA server).

num_threads

integer

Local mode only. CPU threads for parallel search. None auto-detects all available cores.

use_gpu

boolean

default:"False"

Local mode only. Run MMseqs2-GPU; requires an NVIDIA GPU (Turing+), a Linux host, and a *.idx_pad GPU index built via mmseqs makepaddedseqdb. Validators raise ValueError if set with search_mode="remote", on non-Linux platforms, or without the padded DB on disk.

extra_args

List[string]

default:"[]"

Local mode only. Verbatim colabfold_search CLI tokens appended after the typed flags (e.g. ["--max-accept", "500"]). Power-user escape hatch for flags not exposed as typed fields above.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"3600"

Subprocess timeout in seconds. Full database searches can take more than 10 minutes. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
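The validator behaviour described for search_mode, use_gpu, msa_db_dir, and database_name can be approximated as follows. This is a sketch of the documented rules only: `validate_config` is a hypothetical function, the `platform` argument exists purely to make the check testable, and the sketch expects msa_db_dir to be resolved already rather than reproducing the $PROTO_MODEL_CACHE default.

```python
import sys
from pathlib import Path
from typing import Optional


def validate_config(
    search_mode: str = "remote",
    use_gpu: bool = False,
    msa_db_dir: Optional[str] = None,
    database_name: str = "uniref30_2302_db",
    platform: Optional[str] = None,
) -> None:
    """Raise ValueError for the invalid combinations described above."""
    platform = sys.platform if platform is None else platform
    if search_mode not in ("local", "remote"):
        raise ValueError(f"unknown search_mode: {search_mode!r}")
    if use_gpu and search_mode == "remote":
        raise ValueError("use_gpu=True is only valid with search_mode='local'")
    if use_gpu and not platform.startswith("linux"):
        raise ValueError("use_gpu=True requires a Linux host")
    if search_mode == "remote":
        return  # remote mode needs no local database
    if msa_db_dir is None:
        raise ValueError("this sketch expects an already-resolved msa_db_dir")
    db_dir = Path(msa_db_dir)
    if not db_dir.is_dir():
        raise ValueError(f"database directory not found: {db_dir}")
    if not (db_dir / f"{database_name}.dbtype").is_file():
        raise ValueError(f"missing {database_name}.dbtype in {db_dir}")
    if use_gpu and not (db_dir / f"{database_name}.idx_pad").exists():
        raise ValueError(f"missing GPU-padded index {database_name}.idx_pad")
```

Failing at configuration time keeps a misconfigured run from starting a search that can take many minutes before it hits the same problem.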

Output: ColabfoldSearchOutput

results

List[ColabfoldSearchResult]

required

List of search results, one per input query. Each result contains the path to the generated A3M file and metadata. The order matches the input queries order.

ColabfoldSearchResult fields

query_sequences

List[string]

required

The query chain sequence(s) this result is for.

msas

List[MSA]

required

One MSA per chain (None for a chain with no homologs beyond the query row).

paired

boolean

Whether msas are taxonomy-paired (row-aligned across chains). False for unpaired queries, and for a paired query that fell back to unpaired MSAs because no cross-chain pairing was found.
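The difference between the "greedy" and "complete" pairing strategies, and the fallback to unpaired MSAs, can be sketched at the level of species sets. This is an illustration of the documented rule only, not the pairing code used by ColabFold; `paired_species` and the species labels are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple


def paired_species(chain_species: List[Set[str]], strategy: str = "greedy") -> Tuple[List[str], bool]:
    """Return the species eligible for pairing and whether pairing succeeded.

    chain_species holds one set of species identifiers per chain.
    """
    if strategy not in ("greedy", "complete"):
        raise ValueError(f"unknown pairing strategy: {strategy!r}")
    if len(chain_species) < 2:
        raise ValueError("pairing needs at least two chains")
    counts = Counter()
    for species in chain_species:
        counts.update(set(species))  # count each species once per chain
    # "complete": present in every chain; "greedy": present in at least two.
    required = len(chain_species) if strategy == "complete" else 2
    eligible = sorted(s for s, n in counts.items() if n >= required)
    # With no eligible species the result falls back to unpaired MSAs.
    return eligible, bool(eligible)


chains = [{"human", "mouse", "yeast"}, {"human", "mouse"}, {"human", "fly"}]
print(paired_species(chains, "greedy"))
print(paired_species(chains, "complete"))
```

For a two-chain query the two strategies select the same species; they only diverge from three chains upward, where "greedy" keeps species that cover a subset of the chains.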

Applications

The most common application is generating the MSA input to a structure-prediction tool such as AlphaFold or its open-source successors. MSAs also drive coevolutionary analyses that identify covarying residue pairs as candidate spatial contacts, conservation analyses that highlight functionally important residues, and homolog mining for protein engineering and design pipelines where the natural sequence neighbourhood of a query is informative.
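As one concrete form of conservation analysis, per-column Shannon entropy can be computed from aligned rows. The sketch below works on plain strings of equal length and is independent of the toolkit's MSA class, whose conservation statistics may be defined differently; `column_entropies` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List, Optional


def column_entropies(rows: List[str], gap: str = "-") -> List[Optional[float]]:
    """Shannon entropy in bits per alignment column, ignoring gap characters.

    Low entropy means a conserved column. Columns made only of gaps yield None.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same aligned length")
    entropies: List[Optional[float]] = []
    for col in range(length):
        residues = [row[col] for row in rows if row[col] != gap]
        if not residues:
            entropies.append(None)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(entropy)
    return entropies


rows = ["MKA", "MRA", "MK-", "MQA"]
print(column_entropies(rows))
```

In this toy alignment the first and last columns are fully conserved, while the middle column varies across three residue types.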

Usage Tips

Sequence identifiers must be unique across the input batch. The input validator rejects duplicate identifiers up front. Identifiers omitted from the input are auto-generated as seq_<sha256[:10]> and are guaranteed unique for distinct sequences.
Remote execution is the default and is appropriate for small batches. The public ColabFold MMseqs2 API is rate-limited by the upstream developers. High-throughput or batch workloads should use local execution to avoid being throttled.
Local execution requires a msa_db_dir pointing at a provisioned MMseqs2 database. The configuration validator hard-errors when the directory does not exist or does not contain the expected *.dbtype file for the configured database_name. See the local-database note in Toolkit Notes for the provisioning script.
sensitivity controls the MMseqs2 prefilter in local CPU execution. Higher values recover more distant homologs at the cost of additional runtime. Setting sensitivity has no effect when use_gpu=True, because the GPU path forces an ungapped prefilter.
use_gpu=True requires Linux and a GPU-padded database. The validator hard-errors on macOS or Windows, when paired with remote execution, or when the {database_name}.idx_pad file is missing from msa_db_dir. The padded database is built by the provisioning script described in Toolkit Notes.
use_metagenomic_db=True deepens the MSA by including environmental sequences but substantially increases search runtime. Use it only when the standard reference database returns a shallow alignment. Leave it False (the default) for routine searches.
result.msa is None when no homologs are detected. Always check result.msa is not None before accessing alignment properties. The num_homologs_found property returns 0 in that case.
extra_args accepts verbatim colabfold_search CLI tokens and applies only in local execution. Pass any CLI flag not exposed as a typed field through this list (for example ["--max-accept", "500"]). The remote API does not accept arbitrary CLI tokens, so extra_args is ignored when search_mode="remote" and the configuration validator emits a warning in that case.
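The identifier rule in the first tip can be sketched as follows. The source only gives the pattern seq_<sha256[:10]>; hashing the UTF-8 bytes of the raw sequence and taking the first ten hex characters is an assumption, and `auto_id` and `assign_ids` are hypothetical names.

```python
import hashlib
from typing import List, Optional, Sequence


def auto_id(sequence: str) -> str:
    """Derive an identifier of the form seq_<sha256[:10]> (assumed hex digest)."""
    digest = hashlib.sha256(sequence.encode("utf-8")).hexdigest()
    return "seq_" + digest[:10]


def assign_ids(sequences: Sequence[str], ids: Optional[Sequence[Optional[str]]] = None) -> List[str]:
    """Fill in missing identifiers and reject duplicates before any search runs."""
    given = list(ids) if ids is not None else [None] * len(sequences)
    if len(given) != len(sequences):
        raise ValueError("ids and sequences must have the same length")
    resolved = [i if i is not None else auto_id(s) for s, i in zip(sequences, given)]
    seen, duplicates = set(), set()
    for identifier in resolved:
        (duplicates if identifier in seen else seen).add(identifier)
    if duplicates:
        raise ValueError(f"duplicate sequence identifiers: {sorted(duplicates)}")
    return resolved


print(assign_ids(["MKTAYIAKQR", "GSHMLEDPVA"], ids=["chain_a", None]))
```

Because the identifier is derived from the sequence itself, submitting the same sequence twice without explicit identifiers produces a duplicate under this sketch and is rejected.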

Toolkit Notes

These apply to every ColabFold Search tool in this toolkit (colabfold-search).

Local execution requires a one-time UniRef30 database setup on the host machine. The bundled setup_databases.sh script downloads the UniRef30 MMseqs2 database, builds the standard index, and optionally builds the GPU-padded index. The fully indexed database occupies approximately 630 GB of disk space and the download alone is approximately 99 GB. The optional metagenomic environmental database adds approximately 110 GB. The wrapper does not provision the database automatically.
Outputs are returned as typed MSA objects. Each ColabfoldSearchResult carries an MSA object (or None when no homologs are found) along with the query identifier. MSA objects expose alignment dimensions and column-level conservation statistics, and serialise to A3M or FASTA through to_a3m_file and to_fasta_file.
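For readers unfamiliar with the A3M format, the sketch below parses A3M text and derives rows aligned to the query columns. It relies on the general A3M convention that lowercase letters mark insertions relative to the query; it is not the toolkit's to_a3m_file or to_fasta_file implementation, and both helper names are hypothetical.

```python
from typing import List, Tuple


def parse_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse A3M text into (header, sequence) records, skipping '#' comment lines."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def a3m_to_aligned(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop lowercase insertion states so each row matches the query's columns."""
    return [
        (header, "".join(ch for ch in seq if not ch.islower()))
        for header, seq in records
    ]


a3m = ">query\nMKTAY\n>hit1\nMK-AY\n>hit2\nMKgsTAY\n"
for header, row in a3m_to_aligned(parse_a3m(a3m)):
    print(header, row)
```

After insertions are removed, every row has the same length as the query, which is the form that column-wise statistics need.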

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.