License: Chai-1 is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

Proto is not affiliated with Chai Discovery. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


chaidiscovery/chai-lab
Chai-1, SOTA model for biomolecular structure prediction
Chai-1: Decoding the molecular interactions of life
Chai Discovery
bioRxiv (2024)
@article{chaidiscovery2024chai1,
  title={Chai-1: Decoding the molecular interactions of life},
  author={{Chai Discovery}},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.10.615955},
  publisher={Cold Spring Harbor Laboratory}
}
proto-bio/proto-tools/proto_tools/tools/structure_prediction/chai1
| Function | Description |
| --- | --- |
| run_chai1() | Multi-modal structure prediction using Chai-1 (GPU) |

Background

Chai-1 (Chai Discovery, 2024) predicts the joint 3D structure of a biomolecular assembly from the sequences and chemical components it contains. It is a multi-modal foundation model that folds proteins together with small-molecule ligands, nucleic acids, glycans, and covalent modifications in a single model. Each protein chain can be conditioned on evolutionary signal, either through a multiple-sequence alignment (MSA) of related sequences or through embeddings from an ESM protein language model.

Internally, Chai-1 follows the all-atom co-folding approach popularized by AlphaFold3. It tokenizes the assembly the same way, with one token per amino-acid residue or nucleotide and one token per atom for ligands and modified residues. A trunk network then builds and refines token and pairwise representations, optionally conditioned on the MSA and ESM embeddings, and a diffusion module generates the all-atom coordinates by starting from noise and iteratively denoising into a structure. Several structures are sampled per input and ranked by an aggregate confidence score. Predicted confidence includes a per-atom predicted local distance difference test (pLDDT) for local reliability, a predicted aligned error (PAE) for the relative placement of any two tokens, and predicted template-modeling (pTM) and interface predicted template-modeling (ipTM) scores that summarize overall and interface accuracy.

The reference implementation is open-sourced by Chai Discovery at chaidiscovery/chai-lab, with both the code and the model weights released under the Apache-2.0 license for academic and commercial use, including drug discovery. Chai Discovery also runs the model as a hosted web platform at lab.chaidiscovery.com.

Tools

Chai-1 Structure Prediction (chai1-prediction)

Predicts the 3D structure of a biomolecular complex. Each input complex can combine protein, ligand, and glycan chains; the assembly is folded by Chai-1 and returned as a predicted Structure per complex with confidence metrics: average pLDDT, pTM, interface pTM, predicted aligned error, and an overall confidence score.

API Reference

Input
complexes
List[Complex]
required
List of complexes to predict structures for. Inherited from StructurePredictionInput. Each complex can contain multiple chains of proteins, ligands, and/or glycans. Total token count per complex must not exceed 2,048 (see Note below).
msas
array
Pre-computed MSAs, one entry per complex. Each entry is a ComplexMSAs (per-chain MSAs keyed by chain index); paired=True marks rows taxonomy-aligned across chains. Populated by preprocess() or supplied directly.
Configuration
use_esm_embeddings
boolean
default:"True"
Whether to use ESM (Evolutionary Scale Modeling) embeddings for improved predictions. ESM embeddings provide evolutionary context from large-scale protein language models, typically improving prediction quality. Independent of use_msa; both can be enabled together and Chai-1 conditions on the ESM embeddings and the MSA simultaneously. Default: True.
num_trunk_recycles
integer
default:"3"
Number of iterative refinement passes through the trunk network. Higher values produce more refined structures but increase computation time. Typical range: 0-10. Must be at least 0.
num_diffn_timesteps
integer
default:"200"
Number of denoising steps in the diffusion process. Higher values produce more refined structures but are slower. Typical
num_diffn_samples
integer
default:"5"
Number of independent structure samples to generate per complex via the diffusion process. Only the best sample (by confidence) is returned. Higher values explore more conformational space but increase computation time. Must be at least 1. Default: 5.
num_trunk_samples
integer
default:"1"
Number of independent trunk forward passes per diffusion sample. Increases diversity in structure generation. Must be at least 1. Default: 1.
low_memory
boolean
default:"True"
Stream MSA + template features per sample to reduce peak GPU memory at the cost of speed. Default: True.
recycle_msa_subsample
integer
default:"0"
Stochastically subsample MSA across recycles for diversity. 0 disables (default).
verbose
integer
default:"0"
Whether to print status messages during execution. Inherited from StructurePredictionConfig. Default: False.
device
string
default:"cuda"
Device to run the model on ("cuda", "cpu"). Inherited from StructurePredictionConfig. Default: "cuda".
timeout
integer
default:"1200"
Maximum execution time in seconds. None waits indefinitely. Default: 1200.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
include_pae_matrix
boolean
default:"False"
Whether to attach the full per-token PAE matrix to the output metrics (as pae). Inherited. Default: False.
use_msa
boolean
default:"True"
Whether to generate and use Multiple Sequence Alignments (MSAs) for protein chains using MMseqs2 homology search. Inherited from MSAStructurePredictionConfig. Default: True.
msa_search_config
Mmseqs2HomologySearchConfig
Configuration for MMseqs2 homology search (MSA generation). Only used when use_msa=True. Inherited from MSAStructurePredictionConfig. Default: None.
pair_heterocomplex_msas
boolean
default:"True"
Whether heterocomplex protein chains should use taxonomy-paired MSA generation. Inherited from MSAStructurePredictionConfig. Default: True.
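
The defaults and lower bounds listed above can be gathered in one place for quick reference. The sketch below is a hypothetical stand-in and not the toolkit's actual config class: the name Chai1ConfigSketch and its validate() method are assumptions made for illustration, and only the field names, defaults, and "must be at least" constraints are taken from the reference above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chai1ConfigSketch:
    """Hypothetical mirror of the documented defaults; not the toolkit's class."""

    use_esm_embeddings: bool = True
    use_msa: bool = True
    pair_heterocomplex_msas: bool = True
    num_trunk_recycles: int = 3
    num_diffn_timesteps: int = 200
    num_diffn_samples: int = 5
    num_trunk_samples: int = 1
    low_memory: bool = True
    recycle_msa_subsample: int = 0
    device: str = "cuda"
    timeout: Optional[int] = 1200
    seed: Optional[int] = None
    include_pae_matrix: bool = False

    def validate(self) -> None:
        """Raise ValueError if a documented lower bound is violated."""
        if self.num_trunk_recycles < 0:
            raise ValueError("num_trunk_recycles must be at least 0")
        if self.num_diffn_samples < 1:
            raise ValueError("num_diffn_samples must be at least 1")
        if self.num_trunk_samples < 1:
            raise ValueError("num_trunk_samples must be at least 1")


# A quicker preview run: fewer samples and recycles trade refinement for speed.
preview = Chai1ConfigSketch(num_diffn_samples=1, num_trunk_recycles=1, seed=0)
preview.validate()
```

Checking the bounds before submitting a job is a cheap way to catch a bad value before any GPU time is spent; how the real config class reports such errors is not covered here.
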
Output
structures
List[Structure]
required
Predicted structures, each carrying a :class:Chai1Metrics instance on .metrics.
Metrics (one set per structures item)
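| Metric | Type | Range | Availability |
| --- | --- | --- | --- |
| avg_plddt | float | 0.0 to 1.0 | always |
| ptm | float | 0.0 to 1.0 | always |
| iptm | float | 0.0 to 1.0 | always |
| avg_pae | float | ≥ 0.0 | always |
| pae | list[list[float]] | ≥ 0.0 | when include_pae_matrix=True |
| confidence_score | float | 0.0 to 1.0 | always |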

Applications

This tool predicts the structure of multi-component assemblies such as protein-ligand binding poses and glycosylated proteins, which makes it well suited to drug-discovery screening and modeling carbohydrate-decorated targets. For a multi-chain complex it also reports how confidently the chains are placed relative to one another: interface pTM (ipTM) gives a single 0-to-1 score for the overall inter-chain arrangement, and the cross-chain blocks of the PAE matrix show which inter-chain regions are positioned confidently versus uncertainly, so you can rank or filter predicted complexes before trusting a pose downstream.
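
The ranking and filtering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the helper names, the dictionary layout of each prediction, and the min_iptm cutoff are hypothetical, and the per-chain token indices are assumed to be known to the caller (how the toolkit exposes them is not covered here). Because PAE is generally not symmetric, the helper averages both cross-chain blocks.

```python
from typing import Dict, List, Sequence


def mean_cross_chain_pae(
    pae: Sequence[Sequence[float]],
    chain_a: Sequence[int],
    chain_b: Sequence[int],
) -> float:
    """Average PAE over both off-diagonal blocks between two chains."""
    if not chain_a or not chain_b:
        raise ValueError("both chains need at least one token index")
    values = [pae[i][j] for i in chain_a for j in chain_b]
    values += [pae[j][i] for i in chain_a for j in chain_b]
    return sum(values) / len(values)


def rank_by_iptm(predictions: List[Dict], min_iptm: float = 0.0) -> List[Dict]:
    """Keep predictions at or above min_iptm, highest ipTM first."""
    kept = [p for p in predictions if p["iptm"] >= min_iptm]
    return sorted(kept, key=lambda p: p["iptm"], reverse=True)


# Toy 4-token PAE matrix: tokens 0-1 belong to chain A, tokens 2-3 to chain B.
toy_pae = [
    [0.5, 1.0, 4.0, 6.0],
    [1.0, 0.5, 5.0, 7.0],
    [3.0, 5.0, 0.5, 1.0],
    [6.0, 8.0, 1.0, 0.5],
]
print(mean_cross_chain_pae(toy_pae, chain_a=[0, 1], chain_b=[2, 3]))  # 5.5

# Made-up ipTM values; the 0.3 cutoff is illustrative, not a recommendation.
toy_predictions = [
    {"name": "complex_a", "iptm": 0.42},
    {"name": "complex_b", "iptm": 0.81},
    {"name": "complex_c", "iptm": 0.15},
]
print([p["name"] for p in rank_by_iptm(toy_predictions, min_iptm=0.3)])
# ['complex_b', 'complex_a']
```

A low mean cross-chain PAE alongside a high ipTM suggests the chains are placed confidently relative to one another. The cross-chain helper needs the full matrix, so it applies only when include_pae_matrix=True.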

Usage Tips

  • Total length is capped at 2,048 tokens per complex (1 per amino-acid residue, 1 per heavy atom for ligands and glycans); longer inputs are rejected.
  • use_esm_embeddings defaults to True. Chai-1 conditions on embeddings from an ESM protein language model; they are used with or without an MSA.
  • use_msa defaults to True. A ColabFold search generates an MSA for each protein chain; set it False for single-sequence prediction, or attach precomputed MSAs to the input.
  • Sampling and refinement are configurable. num_diffn_samples (default 5) independent samples are drawn per complex and the best is kept by confidence_score; num_diffn_timesteps (default 200) sets the denoising steps and num_trunk_recycles (default 3) trades accuracy for runtime.
  • Confidence is reported as pLDDT, pTM, ipTM, PAE, and a confidence score. avg_plddt, the primary metric, is on a 0 to 1 scale; ipTM is meaningful only for multi-chain complexes. Set include_pae_matrix=True to attach the full per-token PAE matrix.
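
The token cap in the first tip can be checked before submitting a job. The sketch below applies the counting rule stated above (one token per amino-acid residue, one per heavy atom for ligands and glycans). The function names are hypothetical, and the heavy-atom counts are assumed to be supplied by the caller, for example from a cheminformatics library. Modified residues, which the model tokenizes per atom, are not handled.

```python
from typing import Sequence

MAX_TOKENS = 2048  # per-complex cap stated in the tips above


def estimate_tokens(
    protein_sequences: Sequence[str],
    ligand_heavy_atom_counts: Sequence[int],
) -> int:
    """One token per residue plus one per ligand or glycan heavy atom."""
    residue_tokens = sum(len(seq) for seq in protein_sequences)
    atom_tokens = sum(ligand_heavy_atom_counts)
    return residue_tokens + atom_tokens


def fits_token_budget(n_tokens: int, max_tokens: int = MAX_TOKENS) -> bool:
    """True when the complex does not exceed the cap."""
    return n_tokens <= max_tokens


# Two protein chains (300 and 150 residues) plus a ligand with 21 heavy atoms.
n_tokens = estimate_tokens(["A" * 300, "G" * 150], [21])
print(n_tokens)                     # 471
print(fits_token_budget(n_tokens))  # True
```

This is only an estimate for planning; the toolkit's own check remains authoritative on whether an input is rejected.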

Toolkit Notes

These apply to every Chai-1 tool in this toolkit (chai1-prediction).
  • Requires a GPU. Chai-1 runs through a PyTorch backend and needs an NVIDIA GPU; CPU execution is not practical. low_memory (default True) streams features per sample to reduce peak GPU memory at some cost in speed.
  • Protein, ligand, and glycan only. The Chai-1 model additionally supports DNA, RNA, and covalent modifications; this toolkit currently wraps protein, ligand, and glycan prediction. Use AlphaFold3, Boltz-2, or Protenix for nucleic-acid complexes.
  • Predictions are stochastic. Structures come from a diffusion process; set seed for reproducible sampling. recycle_msa_subsample and unseeded runs are intentionally non-deterministic.
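
The effect of seeding on best-of-N sampling can be illustrated with a toy stand-in. Nothing below calls Chai-1: the "samples" are random numbers standing in for per-sample confidence scores, and sample_best_of_n is a hypothetical name. It shows only that a fixed seed reproduces the same draw and the same selected sample, while seed=None does not promise this.

```python
import random
from typing import List, Optional, Tuple


def sample_best_of_n(
    n_samples: int = 5, seed: Optional[int] = None
) -> Tuple[int, float]:
    """Draw n toy confidence scores and return (index, score) of the best."""
    # With seed=None, Python seeds from OS entropy (or the clock),
    # so repeated calls are expected to differ.
    rng = random.Random(seed)
    scores: List[float] = [rng.random() for _ in range(n_samples)]
    best_index = max(range(n_samples), key=lambda i: scores[i])
    return best_index, scores[best_index]


first = sample_best_of_n(n_samples=5, seed=7)
second = sample_best_of_n(n_samples=5, seed=7)
print(first == second)  # True: same seed, same draw, same selected sample
```

On a GPU, seeded runs match only up to small floating-point noise, so real outputs are better compared with a tolerance than with exact equality.
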
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.