Chai-1 - Proto

License: Chai-1 is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

Proto is not affiliated with Chai Discovery. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


chaidiscovery/chai-lab

Chai-1, SOTA model for biomolecular structure prediction


Chai-1: Decoding the molecular interactions of life

Chai Discovery

bioRxiv (2024)


@article{chaidiscovery2024chai1,
  title={Chai-1: Decoding the molecular interactions of life},
  author={{Chai Discovery}},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.10.10.615955},
  publisher={Cold Spring Harbor Laboratory}
}


proto-bio/proto-tools/proto_tools/tools/structure_prediction/chai1


Function	Description
`run_chai1()`	Multi-modal structure prediction using Chai-1 (GPU)

Background

Chai-1 (Chai Discovery, 2024) predicts the joint 3D structure of a biomolecular assembly from the sequences and chemical components it contains. It is a multi-modal foundation model that folds proteins together with small-molecule ligands, nucleic acids, glycans, and covalent modifications in a single model. Each protein chain can be conditioned on evolutionary signal, either through a multiple-sequence alignment (MSA) of related sequences or through embeddings from an ESM protein language model.

Internally, Chai-1 follows the all-atom co-folding approach popularized by AlphaFold3. It tokenizes the assembly the same way, with one token per amino-acid residue or nucleotide and one token per atom for ligands and modified residues. A trunk network then builds and refines token and pairwise representations, optionally conditioned on the MSA and ESM embeddings, and a diffusion module generates the all-atom coordinates by starting from noise and iteratively denoising into a structure. Several structures are sampled per input and ranked by an aggregate confidence score. The confidence outputs include a per-atom predicted local distance difference test (pLDDT) for local reliability, a predicted aligned error (PAE) for the relative placement of any two tokens, and predicted template-modeling (pTM) and interface predicted template-modeling (ipTM) scores that summarize overall and interface accuracy.

The reference implementation is open-sourced by Chai Discovery at chaidiscovery/chai-lab, with both the code and the model weights released under the Apache-2.0 license for academic and commercial use, including drug discovery. Chai Discovery also runs the model as a hosted web platform at lab.chaidiscovery.com.

Learning Resources

chaidiscovery/chai-lab (Chai Discovery) - the official repository and inference code, linking the technical report and the hosted Chai Discovery web platform for running predictions in the browser.

Tools

Chai-1 Structure Prediction (`chai1-prediction`)

Predicts the 3D structure of a biomolecular complex. Each input complex can combine protein, ligand, and glycan chains; the assembly is folded by Chai-1 and returned as a predicted Structure per complex with confidence metrics: average pLDDT, pTM, interface pTM, predicted aligned error, and an overall confidence score.

API Reference


Input: Chai1Input

Field	Type	Required	Description
`complexes`	List[Complex]	yes	List of complexes to predict structures for. Inherited from StructurePredictionInput. Each complex can contain multiple chains of proteins, ligands, and/or glycans. Total token count per complex must not exceed 2,048 (see Usage Tips).
`msas`	array	no	Pre-computed MSAs, one entry per complex. Each entry is a ComplexMSAs (per-chain MSAs keyed by chain index); paired=True marks rows taxonomy-aligned across chains. Populated by preprocess() or supplied directly.

Complex

Field	Type	Required	Description
`chains`	List[Chain | Fragment]	yes	Chains in the complex, in input order.
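The 2,048-token cap can be checked before a complex is submitted. The following is a minimal sketch, assuming the counting rule given on this page (one token per amino-acid residue, one per heavy atom for ligands and glycans). `count_tokens`, `fits_budget`, and `MAX_TOKENS` are hypothetical helper names, not part of the toolkit, and the toolkit's own tokenizer remains authoritative.

```python
MAX_TOKENS = 2048  # per-complex cap stated on this page


def count_tokens(protein_sequences, heavy_atom_counts):
    """Estimate tokens: one per residue plus one per ligand/glycan heavy atom."""
    residue_tokens = sum(len(seq) for seq in protein_sequences)
    atom_tokens = sum(heavy_atom_counts)
    return residue_tokens + atom_tokens


def fits_budget(protein_sequences, heavy_atom_counts, limit=MAX_TOKENS):
    """True when the estimated token count does not exceed the limit."""
    return count_tokens(protein_sequences, heavy_atom_counts) <= limit


# Two protein chains (300 and 150 residues) plus one ligand with 21 heavy atoms.
chains = ["M" * 300, "G" * 150]
ligands = [21]
total = count_tokens(chains, ligands)
print(total, fits_budget(chains, ligands))
```

Running the check up front avoids spending an MSA search on an input that would be rejected for length.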


Config: Chai1Config

Field	Type	Default	Description
`use_esm_embeddings`	boolean	True	Whether to use ESM (Evolutionary Scale Modeling) embeddings for improved predictions. ESM embeddings provide evolutionary context from large-scale protein language models, typically improving prediction quality. Independent of use_msa; when both are enabled, Chai-1 conditions on the ESM embeddings and the MSA simultaneously.
`num_trunk_recycles`	integer	3	Number of iterative refinement passes through the trunk network. Higher values produce more refined structures but increase computation time. Typical range: 0-10. Must be at least 0.
`num_diffn_timesteps`	integer	200	Number of denoising steps in the diffusion process. Higher values produce more refined structures but are slower.
`num_diffn_samples`	integer	5	Number of independent structure samples to generate per complex via the diffusion process. Only the best sample (by confidence) is returned. Higher values explore more conformational space but increase computation time. Must be at least 1.
`num_trunk_samples`	integer	1	Number of independent trunk forward passes per diffusion sample. Increases diversity in structure generation. Must be at least 1.
`low_memory`	boolean	True	Stream MSA and template features per sample to reduce peak GPU memory at the cost of speed.
`recycle_msa_subsample`	integer	0	Stochastically subsample the MSA across recycles for diversity. 0 disables subsampling.
`verbose`	integer	0	Whether to print status messages during execution; 0 (the default) prints none. Inherited from StructurePredictionConfig.
`device`	string	"cuda"	Device to run the model on ("cuda" or "cpu"). Inherited from StructurePredictionConfig.
`timeout`	integer	1200	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	(none listed)	Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip the cache until seeded.
`include_pae_matrix`	boolean	False	Whether to attach the full per-token PAE matrix to the output metrics as pae. Inherited.
`use_msa`	boolean	True	Whether to generate and use Multiple Sequence Alignments (MSAs) for protein chains using MMseqs2 homology search. Inherited from MSAStructurePredictionConfig.
`msa_search_config`	Mmseqs2HomologySearchConfig	None	Configuration for MMseqs2 homology search (MSA generation). Only used when use_msa=True. Inherited from MSAStructurePredictionConfig.
`pair_heterocomplex_msas`	boolean	True	Whether heterocomplex protein chains should use taxonomy-paired MSA generation. Inherited from MSAStructurePredictionConfig.
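To make the documented defaults and lower bounds concrete, here is a minimal sketch of a stand-in dataclass that mirrors a subset of the fields above. `Chai1ConfigSketch` is a hypothetical name used for illustration only; it is not the toolkit's Chai1Config, and the real class may validate its fields differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chai1ConfigSketch:
    """Stand-in mirroring documented defaults; NOT the toolkit's Chai1Config."""
    use_esm_embeddings: bool = True
    use_msa: bool = True
    num_trunk_recycles: int = 3
    num_diffn_timesteps: int = 200
    num_diffn_samples: int = 5
    num_trunk_samples: int = 1
    low_memory: bool = True
    recycle_msa_subsample: int = 0
    device: str = "cuda"
    timeout: Optional[int] = 1200
    seed: Optional[int] = None
    include_pae_matrix: bool = False

    def __post_init__(self):
        # Enforce the lower bounds stated in the field descriptions.
        if self.num_trunk_recycles < 0:
            raise ValueError("num_trunk_recycles must be at least 0")
        if self.num_diffn_samples < 1:
            raise ValueError("num_diffn_samples must be at least 1")
        if self.num_trunk_samples < 1:
            raise ValueError("num_trunk_samples must be at least 1")


# A cheaper, seeded, single-sequence preset: fewer recycles and samples, no MSA.
fast = Chai1ConfigSketch(use_msa=False, num_trunk_recycles=1,
                         num_diffn_samples=1, seed=0)
print(fast)
```

Fewer recycles and samples reduce computation time at the cost of refinement and sampling diversity, as the field descriptions note.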


Output: Chai1Output

Field	Type	Required	Description
`structures`	List[Structure]	yes	Predicted structures, each carrying a Chai1Metrics instance on .metrics.

Structure

Field	Type	Required	Description
`structure`	string	yes	Raw structure content in PDB or CIF format.
`structure_format`	string	no	Format of the content string (auto-detected if omitted).
`b_factor_type`	BFactorType	no	What the B-factor column represents.
`source`	string	no	Optional source identifier (filepath or tool name).
`metrics`	Metrics	no	Associated metrics (e.g., pLDDT, pTM scores, per-chain lists, pairwise matrices). None values are stripped at construction.

Metrics (one set per structures item)

Metric	Type	Range	Availability
`avg_plddt`	float	0.0 to 1.0	always
`ptm`	float	0.0 to 1.0	always
`iptm`	float	0.0 to 1.0	always
`avg_pae`	float	≥ 0.0	always
`pae`	list[list[float]]	≥ 0.0	when include_pae_matrix=True
`confidence_score`	float	0.0 to 1.0	always
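Because avg_plddt is reported on a 0-to-1 scale, it may help to rescale it before comparing against the 0-to-100 bands commonly used for AlphaFold models. The sketch below assumes those AlphaFold-style cut-offs (90, 70, 50) carry over to Chai-1, which this page does not state. `plddt_band` is a hypothetical helper, the plain dict stands in for the Metrics object, and the numbers are illustrative rather than real model output.

```python
def plddt_band(avg_plddt):
    """Map a 0-to-1 avg_plddt onto AlphaFold-style confidence bands."""
    if not 0.0 <= avg_plddt <= 1.0:
        raise ValueError("avg_plddt is expected on a 0.0 to 1.0 scale")
    score = avg_plddt * 100.0  # rescale to the conventional 0-100 range
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


# Illustrative values only; a dict is used here in place of the Metrics object.
metrics = {"avg_plddt": 0.87, "ptm": 0.81, "iptm": 0.74,
           "avg_pae": 4.2, "confidence_score": 0.75}
# `pae` is present only when include_pae_matrix=True, so read it defensively.
pae = metrics.get("pae")
print(plddt_band(metrics["avg_plddt"]), pae is None)
```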

Applications

This tool predicts the structure of multi-component assemblies such as protein-ligand binding poses and glycosylated proteins, which makes it well suited to drug-discovery screening and modeling carbohydrate-decorated targets. For a multi-chain complex it also reports how confidently the chains are placed relative to one another: interface pTM (ipTM) gives a single 0-to-1 score for the overall inter-chain arrangement, and the cross-chain blocks of the PAE matrix show which inter-chain regions are positioned confidently versus uncertainly, so you can rank or filter predicted complexes before trusting a pose downstream.
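The cross-chain blocks of the PAE matrix can be summarized with a single number per chain pair. This is a minimal sketch, assuming the matrix was requested with include_pae_matrix=True and that a per-token chain assignment is available. The `chain_ids` list and the `mean_interchain_pae` helper are hypothetical, since this page does not describe how the token-to-chain mapping is exposed.

```python
def mean_interchain_pae(pae, chain_ids, chain_a, chain_b):
    """Average PAE over token pairs spanning chain_a and chain_b, both directions."""
    rows = [i for i, c in enumerate(chain_ids) if c == chain_a]
    cols = [j for j, c in enumerate(chain_ids) if c == chain_b]
    if not rows or not cols:
        raise ValueError("both chains must contain at least one token")
    values = [pae[i][j] for i in rows for j in cols]
    # PAE is not symmetric, so include the transposed block as well.
    values += [pae[j][i] for i in rows for j in cols]
    return sum(values) / len(values)


# Toy 4-token example: tokens 0-1 belong to chain "A", tokens 2-3 to chain "B".
chain_ids = ["A", "A", "B", "B"]
pae = [
    [0.5, 1.0, 8.0, 9.0],
    [1.0, 0.5, 7.0, 8.0],
    [6.0, 7.0, 0.5, 1.0],
    [9.0, 6.0, 1.0, 0.5],
]
print(mean_interchain_pae(pae, chain_ids, "A", "B"))  # 7.5
```

A low cross-chain mean alongside a high ipTM suggests the chains are placed confidently relative to one another; any threshold used for filtering is a judgment call for the project at hand.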

Usage Tips

Total length is capped at 2,048 tokens per complex (1 per amino-acid residue, 1 per heavy atom for ligands and glycans); longer inputs are rejected.
use_esm_embeddings defaults to True. Chai-1 conditions on embeddings from an ESM protein language model; they are used with or without an MSA.
use_msa defaults to True. A ColabFold search generates an MSA for each protein chain; set it False for single-sequence prediction, or attach precomputed MSAs to the input.
Sampling and refinement are configurable. num_diffn_samples (default 5) independent samples are drawn per complex and the best is kept by confidence_score; num_diffn_timesteps (default 200) sets the denoising steps and num_trunk_recycles (default 3) trades accuracy for runtime.
Confidence is reported as pLDDT, pTM, ipTM, PAE, and a confidence score. avg_plddt, the primary metric, is on a 0 to 1 scale; ipTM is meaningful only for multi-chain complexes. Set include_pae_matrix to attach the full per-token PAE matrix.
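The best-of-N selection described above happens inside the tool, which returns only the top sample. The sketch below shows the selection rule in isolation, assuming each sample carries a confidence_score. `best_sample` is a hypothetical helper, the tie-breaking rule (first maximum wins) is an assumption, and the scores are illustrative rather than real model output.

```python
def best_sample(samples):
    """Return (index, sample) for the sample with the highest confidence_score."""
    if not samples:
        raise ValueError("need at least one sample")
    index = max(range(len(samples)),
                key=lambda i: samples[i]["confidence_score"])
    return index, samples[index]


# Illustrative scores for num_diffn_samples=5; not real model output.
samples = [{"confidence_score": s} for s in (0.61, 0.72, 0.68, 0.72, 0.55)]
index, chosen = best_sample(samples)
print(index, chosen["confidence_score"])  # 1 0.72
```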

Toolkit Notes

These apply to every Chai-1 tool in this toolkit (chai1-prediction).

Requires a GPU. Chai-1 runs through a PyTorch backend and needs an NVIDIA GPU; CPU execution is not practical. low_memory (default True) streams features per sample to reduce peak GPU memory at some cost in speed.
Protein, ligand, and glycan only. The Chai-1 model additionally supports DNA, RNA, and covalent modifications; this toolkit currently wraps protein, ligand, and glycan prediction. Use AlphaFold3, Boltz-2, or Protenix for nucleic-acid complexes.
Predictions are stochastic. Structures come from a diffusion process; set seed for reproducible sampling. recycle_msa_subsample and unseeded runs are intentionally non-deterministic.
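Seeded runs are reproducible only up to small GPU float noise, so results are better compared with a tolerance than with exact equality. The following is a minimal sketch of such a comparison on scalar metrics. `metrics_approx_equal` is a hypothetical helper with assumed tolerances, not the toolkit's BaseToolOutput.approx_equal, and the metric values are illustrative.

```python
import math


def metrics_approx_equal(a, b, rel_tol=1e-3, abs_tol=1e-4):
    """True when both dicts share keys and every scalar agrees within tolerance."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=abs_tol)
               for k in a)


# Illustrative values: runs 1 and 2 share a seed, run 3 uses a different one.
run_1 = {"avg_plddt": 0.8712, "ptm": 0.8104, "iptm": 0.7421}
run_2 = {"avg_plddt": 0.8713, "ptm": 0.8104, "iptm": 0.7420}
run_3 = {"avg_plddt": 0.8120, "ptm": 0.7650, "iptm": 0.6900}
print(metrics_approx_equal(run_1, run_2), metrics_approx_equal(run_1, run_3))
```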

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
