Proto is not affiliated with the Ovchinnikov Lab or the Steinegger Lab. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.
Background
ColabFold (Mirdita et al., 2022) is an open-source pipeline that pairs the MMseqs2 (Many-against-Many sequence searching) engine with AlphaFold-class structure prediction. The homology-search step uses a three-stage cascade: short k-mer matches between the query and the database are located first, surviving candidates are scored with an ungapped extension, and final hits are realigned with gapped Smith-Waterman alignment. The pipeline produces per-query multiple sequence alignments (MSAs) that capture the evolutionary signal that AlphaFold and similar structure-prediction models rely on. Conservation patterns within an MSA reveal residues under structural or functional constraint, and covarying residue pairs identify spatial contacts.
This toolkit exposes the search step in two execution modes. Remote execution targets the public ColabFold MMseqs2 API operated by the upstream developers and requires no local database. Local execution runs the bundled colabfold_search command-line tool against a local MMseqs2 database, supporting much higher throughput and optional GPU acceleration. The local database is the UniRef30 clustered reference of UniProt, optionally augmented with a metagenomic environmental database. The local database must be provisioned once on the host machine.
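The three-stage cascade described above can be illustrated with a toy sketch. This is not MMseqs2's implementation: it assumes exact k-mer seeds and a flat match/mismatch score instead of similar k-mers and a substitution matrix, and every function name and threshold here is hypothetical. It only shows how each stage narrows the candidate set before the most expensive alignment runs.

```python
from collections import defaultdict


def kmer_index(db, k):
    """Map every k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def prefilter(query, index, k):
    """Stage 1: collect (target id, diagonal) pairs sharing an exact k-mer."""
    hits = set()
    for q in range(len(query) - k + 1):
        for seq_id, t in index.get(query[q:q + k], ()):
            hits.add((seq_id, t - q))
    return hits


def ungapped_score(query, target, diagonal, match=2, mismatch=-1):
    """Stage 2: best-scoring gap-free segment along one diagonal."""
    best = running = 0
    for q in range(len(query)):
        t = q + diagonal
        if 0 <= t < len(target):
            step = match if query[q] == target[t] else mismatch
            running = max(0, running + step)
            best = max(best, running)
    return best


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Stage 3: local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def search(query, db, k=3, min_ungapped=8):
    """Run the cascade and return (target id, gapped score), best first."""
    index = kmer_index(db, k)
    survivors = {
        seq_id
        for seq_id, diagonal in prefilter(query, index, k)
        if ungapped_score(query, db[seq_id], diagonal) >= min_ungapped
    }
    scored = [(seq_id, smith_waterman(query, db[seq_id])) for seq_id in survivors]
    return sorted(scored, key=lambda hit: (-hit[1], hit[0]))
```

Calling search(query, db) with db as a dict of identifier to sequence returns only the targets that survive both filters. A target that shares a single isolated k-mer with the query is seeded in stage 1 but dropped in stage 2, which is the point of ordering the stages from cheapest to most expensive.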
Learning Resources
- sokrypton/ColabFold (Steinegger and Ovchinnikov labs). Official repository and the source of the colabfold_search command-line tool, plus the Google Colab notebooks that interactively expose the ColabFold pipeline.
- ColabFold web service (Steinegger and Ovchinnikov labs). Hosted entry point to the ColabFold MSA-search and structure-prediction pipeline, useful for a quick browser-based run before scripting against the tool.
Tools
ColabFold MSA Search (colabfold-search)
Generates a multiple sequence alignment for each input protein sequence by searching reference databases for homologs. Remote execution submits the query to the public ColabFold MMseqs2 API. Local execution runs the bundled colabfold_search command-line tool against a local MMseqs2 database. The tool returns one result per query in input order, each carrying a list of per-chain MSA objects (one for an unpaired query; row-aligned per-chain MSAs for a paired group) that can be exported to A3M or FASTA. Inputs accept raw sequence strings (one unpaired query each), a nested list of sequences (one taxonomy-paired group), or ColabfoldSearchQuery objects.
API Reference
Input: ColabfoldSearchInput
q.is_paired, two or more chains). Results are returned parallel to this list.
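The accepted input shapes can be sketched as a small normalisation step. This is a minimal illustration and not the toolkit's code: QuerySketch is a hypothetical stand-in for ColabfoldSearchQuery (its real fields are not shown in this page), and hashing the chains joined with ":" is an assumption, since the source only gives the seq_<sha256[:10]> pattern for auto-generated identifiers.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySketch:
    """Hypothetical stand-in for ColabfoldSearchQuery; field names are assumed."""
    chains: tuple
    query_id: str

    @property
    def is_paired(self):
        # A paired group has two or more chains.
        return len(self.chains) >= 2


def auto_id(chains):
    """Build an identifier of the form seq_<sha256[:10]>.

    Assumption: the hash covers the chains joined with ":".
    """
    digest = hashlib.sha256(":".join(chains).encode("ascii")).hexdigest()
    return f"seq_{digest[:10]}"


def normalize(items):
    """Turn raw strings, nested lists, or query objects into query objects."""
    queries = []
    for item in items:
        if isinstance(item, QuerySketch):
            queries.append(item)
        elif isinstance(item, str):
            # A raw string is one unpaired query.
            queries.append(QuerySketch((item,), auto_id((item,))))
        else:
            # A nested list is one taxonomy-paired group.
            chains = tuple(item)
            queries.append(QuerySketch(chains, auto_id(chains)))
    ids = [q.query_id for q in queries]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate sequence identifiers: {duplicates}")
    return queries
```

Because the output list is built in input order, results computed from it can be returned parallel to the input. Submitting the same sequence twice without explicit identifiers produces the same auto-generated identifier and is rejected by the duplicate check in this sketch.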
Config: ColabfoldSearchConfig
"local" runs MMseqs2 against a downloaded DB; "remote" queries ColabFold’s MSA API.Available options: local, remote--use-env=1). Supported in both local and remote modes. Deepens the unpaired per-chain MSAs only; it does not affect cross-chain pairing."greedy" pairs a species found in at least two chains; "complete" only pairs a species present in every chain. "greedy" (the default) typically yields more paired rows and better predictions.Available options: greedy, completemsas subdirectory is created to store A3M files, one per sequence ID. None resolves to $PROTO_HOME/colabfold_search.setup_databases.sh. None resolves to $PROTO_MODEL_CACHE/databases/uniref30_2302/. Deliberately kept outside output_dir so the run-time cleanup in _cleanup_default_output_dir_if_cache_empty cannot delete it.msa_db_dir (matches the *.dbtype file).-s override (1.0-9.0). Local mode only. Ignored under use_gpu=True (colabfold_search forces ungapped prefilter and drops -s). When None on CPU, falls back to colabfold’s k-score path (matches the public MSA server).None auto-detects all available cores.*.idx_pad GPU index built via mmseqs makepaddedseqdb. Validators raise ValueError if set with search_mode="remote", on non-Linux platforms, or without the padded DB on disk.colabfold_search CLI tokens appended after the typed flags (e.g. ["--max-accept", "500"]). Power-user escape hatch for flags not exposed as typed fields above.True is coerced to 1 and False to 0.None waits indefinitely.BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.Output: ColabfoldSearchOutput
Output: ColabfoldSearchOutput
Applications
The most common application is generating the MSA input to a structure-prediction tool such as AlphaFold or its open-source successors. MSAs also drive coevolutionary analyses that identify covarying residue pairs as candidate spatial contacts, conservation analyses that highlight functionally important residues, and homolog mining for protein engineering and design pipelines where the natural sequence neighbourhood of a query is informative.
Usage Tips
- Sequence identifiers must be unique across the input batch. The input validator rejects duplicate identifiers up front. Identifiers omitted from the input are auto-generated as seq_<sha256[:10]> and are guaranteed unique for distinct sequences.
- Remote execution is the default and is appropriate for small batches. The public ColabFold MMseqs2 API is rate-limited by the upstream developers. High-throughput or batch workloads should use local execution to avoid being throttled.
- Local execution requires an msa_db_dir pointing at a provisioned MMseqs2 database. The configuration validator hard-errors when the directory does not exist or does not contain the expected *.dbtype file for the configured database_name. See the local-database note in Toolkit Notes for the provisioning script.
- sensitivity controls the MMseqs2 prefilter in local CPU execution. Higher values recover more distant homologs at the cost of additional runtime. Setting sensitivity has no effect when use_gpu=True, because the GPU path forces an ungapped prefilter.
- use_gpu=True requires Linux and a GPU-padded database. The validator hard-errors on macOS or Windows, when paired with remote execution, or when the {database_name}.idx_pad file is missing from msa_db_dir. The padded database is built by the provisioning script described in Toolkit Notes.
- use_metagenomic_db=True deepens the MSA by including environmental sequences but substantially increases search runtime. Use it only when the standard reference database returns a shallow alignment. Leave it False (the default) for routine searches.
- result.msa is None when no homologs are detected. Always check result.msa is not None before accessing alignment properties. The num_homologs_found property returns 0 in that case.
- extra_args accepts verbatim colabfold_search CLI tokens and applies only in local execution. Pass any CLI flag not exposed as a typed field through this list (for example ["--max-accept", "500"]). The remote API does not accept arbitrary CLI tokens, so extra_args is ignored when search_mode="remote" and the configuration validator emits a warning in that case.
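The validation rules in the tips above can be collected into one pre-flight check. The sketch below is an independent illustration, not the toolkit's validator: check_config is a hypothetical function, the file names {database_name}.dbtype and {database_name}.idx_pad are assumed from the wording above, and the default-path resolution for msa_db_dir is deliberately left out.

```python
import sys
import warnings
from pathlib import Path


def check_config(search_mode, msa_db_dir=None, database_name=None,
                 use_gpu=False, extra_args=(), platform=sys.platform):
    """Raise ValueError for invalid combinations; warn for ignored options."""
    if search_mode not in ("local", "remote"):
        raise ValueError(f"unknown search_mode: {search_mode!r}")

    if search_mode == "remote":
        if use_gpu:
            raise ValueError("use_gpu=True cannot be combined with remote execution")
        if extra_args:
            warnings.warn("extra_args is ignored when search_mode='remote'")
        return

    # Local execution: the database directory and its marker file must exist.
    db_dir = Path(msa_db_dir) if msa_db_dir is not None else None
    if db_dir is None or not db_dir.is_dir():
        raise ValueError(f"msa_db_dir does not exist: {msa_db_dir!r}")
    if not (db_dir / f"{database_name}.dbtype").is_file():
        raise ValueError(f"no {database_name}.dbtype file in {db_dir}")

    if use_gpu:
        if not platform.startswith("linux"):
            raise ValueError("use_gpu=True requires Linux")
        if not (db_dir / f"{database_name}.idx_pad").is_file():
            raise ValueError(f"no {database_name}.idx_pad file in {db_dir}")
```

Running a check like this before submitting a large batch surfaces a missing database or padded index immediately instead of partway through a long search. The platform argument exists only so the Linux rule can be exercised on any machine.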
Toolkit Notes
These apply to every ColabFold Search tool in this toolkit (colabfold-search).
- Local execution requires a one-time UniRef30 database setup on the host machine. The bundled setup_databases.sh script downloads the UniRef30 MMseqs2 database, builds the standard index, and optionally builds the GPU-padded index. The fully indexed database occupies approximately 630 GB of disk space and the download alone is approximately 99 GB. The optional metagenomic environmental database adds approximately 110 GB. The wrapper does not provision the database automatically.
- Outputs are returned as typed MSA objects. Each ColabfoldSearchResult carries an MSA object (or None when no homologs are found) along with the query identifier. MSA objects expose alignment dimensions and column-level conservation statistics, and serialise to A3M or FASTA through to_a3m_file and to_fasta_file.
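For readers who want to inspect an exported A3M file without the toolkit, the sketch below parses A3M text and computes a simple per-column conservation measure. It is a standalone illustration under stated assumptions: read_a3m and column_entropy are hypothetical helper names, Shannon entropy is one common choice of conservation statistic, and nothing here reproduces the statistics that the MSA objects themselves expose.

```python
import math
from collections import Counter


def read_a3m(text):
    """Parse A3M text into (header, aligned_row) pairs.

    In A3M, lowercase letters are insertions relative to the query.
    Dropping them leaves rows that all have the query's length.
    """
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return [
        (h, "".join(c for c in seq if not c.islower()))
        for h, seq in records
    ]


def column_entropy(rows):
    """Shannon entropy in bits per column, ignoring gap characters.

    Low entropy means a conserved column; 0.0 means a single residue type.
    """
    entropies = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        entropy = sum(n / total * math.log2(total / n) for n in counts.values())
        entropies.append(entropy)
    return entropies
```

Passing the aligned rows from read_a3m to column_entropy gives one value per query position. Columns with entropy near zero are candidates for residues under structural or functional constraint, subject to the alignment being deep enough for the estimate to be meaningful.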
