License: ColabFold Search is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with the Ovchinnikov Lab or the Steinegger Lab. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.


sokrypton/ColabFold
Making Protein folding accessible to all!
2.7k stars
ColabFold: making protein folding accessible to all
Milot Mirdita, Konstantin Schütze, … Martin Steinegger
Nature Methods (2022)
@article{mirdita2022colabfold,
  title={ColabFold: making protein folding accessible to all},
  author={Mirdita, Milot and Sch{\"u}tze, Konstantin and Moriwaki, Yoshitaka and Heo, Lim and Ovchinnikov, Sergey and Steinegger, Martin},
  journal={Nature Methods},
  volume={19},
  number={6},
  pages={679--682},
  year={2022},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-022-01488-1}
}
proto-bio/proto-tools/proto_tools/tools/sequence_alignment/colabfold_search
Function                  Description
run_colabfold_search()    Generate multiple sequence alignments via ColabFold (local MMseqs2 database or remote API).

Background

ColabFold (Mirdita et al., 2022) is an open-source pipeline that pairs the MMseqs2 (Many-against-Many sequence searching) engine with AlphaFold-class structure prediction. The homology-search step uses a three-stage cascade. Short k-mer matches between the query and the database are located first, surviving candidates are scored with an ungapped extension, and final hits are realigned with gapped Smith-Waterman alignment.

The pipeline produces per-query multiple sequence alignments (MSAs) that capture the evolutionary signal that AlphaFold and similar structure-prediction models rely on. Conservation patterns within an MSA reveal residues under structural or functional constraint, and covarying residue pairs identify spatial contacts.

This toolkit exposes the search step in two execution modes. Remote execution targets the public ColabFold MMseqs2 API operated by the upstream developers and requires no local database. Local execution runs the bundled colabfold_search command-line tool against a local MMseqs2 database, supporting much higher throughput and optional GPU acceleration. The local database is the UniRef30 clustered reference of UniProt, optionally augmented with a metagenomic environmental database. The local database must be provisioned once on the host machine.
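The three-stage cascade described above can be illustrated with a toy sketch. This is not MMseqs2's implementation: it assumes exact k-mer matches, a fixed match/mismatch score, and a linear gap penalty, and every function name and threshold below is hypothetical. It only shows how each stage filters candidates before the more expensive next stage runs.

```python
from collections import defaultdict

# Toy scoring scheme (assumption; real searches use a substitution matrix).
MATCH, MISMATCH, GAP = 2, -1, -2

def kmer_index(db, k):
    """Map every k-mer in the database to its (sequence name, position) occurrences."""
    index = defaultdict(list)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def kmer_seeds(query, index, k):
    """Stage 1: collect, per target, the diagonals (target_pos - query_pos) with a shared k-mer."""
    seeds = defaultdict(set)
    for q in range(len(query) - k + 1):
        for name, t in index.get(query[q:q + k], ()):
            seeds[name].add(t - q)
    return seeds

def ungapped_score(query, target, diagonal):
    """Stage 2: best-scoring ungapped segment along one diagonal."""
    best = run = 0
    for q in range(len(query)):
        t = q + diagonal
        if 0 <= t < len(target):
            run = max(0, run + (MATCH if query[q] == target[t] else MISMATCH))
            best = max(best, run)
    return best

def smith_waterman(a, b):
    """Stage 3: local gapped alignment score (linear gap penalty), kept row by row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur.append(max(0, prev[j - 1] + s, prev[j] + GAP, cur[j - 1] + GAP))
        best = max(best, max(cur))
        prev = cur
    return best

def search(query, db, k=3, min_ungapped=6):
    """Run the cascade; only targets that pass stages 1 and 2 reach Smith-Waterman."""
    index = kmer_index(db, k)
    hits = {}
    for name, diagonals in kmer_seeds(query, index, k).items():
        if max(ungapped_score(query, db[name], d) for d in diagonals) >= min_ungapped:
            hits[name] = smith_waterman(query, db[name])
    return hits
```

Calling `search(query, db)` with `db` as a dictionary of name-to-sequence strings returns a dictionary of gapped alignment scores for the targets that survived both filters. Targets sharing no k-mer with the query are never aligned, which is where the cascade saves most of its time.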

Learning Resources

  • sokrypton/ColabFold (Steinegger and Ovchinnikov labs). Official repository and the source of the colabfold_search command-line tool, plus the Google Colab notebooks that interactively expose the ColabFold pipeline.
  • ColabFold web service (Steinegger and Ovchinnikov labs). Hosted entry point to the ColabFold MSA-search and structure-prediction pipeline, useful for a quick browser-based run before scripting against the tool.

Tools

Toolkit Notes

These notes apply to every ColabFold Search tool in this toolkit (colabfold-search).
  • Local execution requires a one-time UniRef30 database setup on the host machine. The bundled setup_databases.sh script downloads the UniRef30 MMseqs2 database, builds the standard index, and optionally builds the GPU-padded index. The fully indexed database occupies approximately 630 GB of disk space and the download alone is approximately 99 GB. The optional metagenomic environmental database adds approximately 110 GB. The wrapper does not provision the database automatically.
  • Outputs are returned as typed MSA objects. Each ColabfoldSearchResult carries an MSA object (or None when no homologs are found) along with the query identifier. MSA objects expose alignment dimensions and column-level conservation statistics, and serialise to A3M or FASTA through to_a3m_file and to_fasta_file.
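To make the idea of column-level conservation concrete, here is a minimal sketch that works on raw A3M text rather than on the toolkit's MSA object. It does not use or reproduce the MSA class's own API or statistic: the function names are hypothetical, and the entropy-based score (1 minus Shannon entropy normalised by log2 of 20) is one common choice, assumed here for illustration. The only format fact relied on is that lowercase letters in A3M mark insertions relative to the query, so removing them leaves rows of equal length.

```python
import math
from collections import Counter

def parse_a3m(text):
    """Return (headers, rows) with lowercase insertion states removed from each row."""
    headers, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' header/comment lines
        if line.startswith(">"):
            headers.append(line[1:])
            rows.append("")
        elif rows:
            rows[-1] += "".join(c for c in line if not c.islower())
    return headers, rows

def column_conservation(rows):
    """Score each column in [0, 1]; 1.0 means a single residue type, gaps are ignored."""
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no signal
            continue
        n = len(residues)
        entropy = -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())
        scores.append(1.0 - entropy / math.log2(20))
    return scores
```

Columns that score close to 1.0 are the ones where nearly every homolog carries the same residue, which is the signal the Background section associates with structural or functional constraint.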
Example notebook: See the full working example for a copy-paste-ready walkthrough.
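Before starting the one-time database setup, it can help to check that the target volume has room for the sizes quoted in the notes above. The sketch below uses only the Python standard library and the approximate figures from this page (630 GB indexed, 110 GB for the optional environmental database). The function names and the 10% headroom factor are assumptions, and the page does not state whether the 99 GB download must coexist with the indexed database, so treat the result as a rough lower bound.

```python
import shutil

# Approximate sizes quoted in the toolkit notes, in gigabytes.
INDEXED_DB_GB = 630   # fully indexed UniRef30 database
ENV_DB_GB = 110       # optional metagenomic environmental database

def required_gb(include_env_db=False):
    """Approximate disk space needed for the provisioned database(s)."""
    return INDEXED_DB_GB + (ENV_DB_GB if include_env_db else 0)

def has_room(path, include_env_db=False, headroom=1.1):
    """True if the volume holding `path` has the required space plus headroom (assumed 10%)."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(include_env_db) * headroom
```

For example, `has_room("/data", include_env_db=True)` checks for roughly 740 GB plus headroom on the volume that holds `/data`; the path is a placeholder.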

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.