Skip to main content
License: Protenix is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

Proto is not affiliated with ByteDance. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


bytedance/Protenix
bytedance/Protenix
Toward High-Accuracy Open-Source Biomolecular Structure Prediction.
1.7k stars
View repo
Protenix: An Open-Source Implementation of AlphaFold 3
ByteDance Research
bioRxiv (2025)
Read preprint
@article{bytedance2025protenix,
  title={Protenix: An Open-Source Implementation of AlphaFold 3},
  author={{ByteDance Research}},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.01.08.631967},
  publisher={Cold Spring Harbor Laboratory}
}
Copy citation
proto-bio/proto-tools/proto_tools/tools/structure_prediction/protenix
View source
Open Notebook
Open notebook
Coming soon!
Run this tool directly in Proto with no setup required.
FunctionDescription
run_protenix()Multi-modal structure prediction using Protenix (open-source AlphaFold3) (GPU) Docs Source

Background

Protenix (ByteDance Research, 2025) predicts the joint 3D structure of a biomolecular assembly from the sequences and chemical components it contains. It is a trainable, openly licensed reproduction of the AlphaFold3 architecture: like AlphaFold3, one model folds complexes that mix proteins, DNA, RNA, and small-molecule ligands, and predicts how those components are arranged relative to one another. Each protein chain can be paired with a multiple-sequence alignment (MSA) of evolutionarily related sequences, whose covariation patterns supply the evolutionary signal the model uses to place residues. Architecturally, Protenix follows AlphaFold3 rather than AlphaFold2: it carries a single representation of the input tokens and a pairwise representation over token pairs, refines them through a Pairformer trunk, and generates all-atom coordinates with a diffusion module that starts from noise and iteratively denoises into a structure, in place of AlphaFold2’s structure module. Several structures are sampled per random seed and ranked by a confidence score. Protenix is distributed in several sizes: full-parameter base models for highest accuracy, and lighter mini and tiny variants for faster, lower-memory prediction; the mini_esm and mini_ism variants replace the MSA with learned embeddings — from the ESM-2 protein language model or the ISM inverse-structure model, respectively — so they can fold without an alignment. Predicted confidence includes a per-residue predicted local distance difference test (pLDDT) for local reliability, a predicted aligned error (PAE) for the relative placement of any two tokens, a global predicted distance error (gPDE), and predicted template-modeling (pTM) and interface predicted template-modeling (ipTM) scores that summarize overall and interface accuracy. The reference implementation is open-sourced at bytedance/Protenix, with both the code and the model parameters released under the Apache-2.0 license for academic and commercial use. It was developed by ByteDance’s AI4Science team as a comprehensive reproduction of AlphaFold3, trained on comparable data to reach competitive accuracy across protein, nucleic-acid, and protein-ligand benchmarks.

Learning Resources

  • bytedance/Protenix (ByteDance) - the official repository, with a model card for each variant, benchmark results across protein, nucleic-acid, and ligand tasks, and a link to the hosted Protenix web server.

Tools

Protenix Structure Prediction (protenix-prediction)

Predicts the 3D structure of a biomolecular complex. Each input complex can combine protein, DNA, RNA, and ligand chains, with optional post-translational and nucleotide modifications; the assembly is folded by Protenix and returned as a predicted Structure per complex with confidence metrics: average pLDDT, pTM, interface pTM, per-chain and pairwise-chain scores, a global predicted distance error, and predicted aligned error.

API Reference

Source
complexes
List[Complex]
required
List of complexes to predict structures for. Inherited from StructurePredictionInput. Each complex can contain multiple chains of proteins, DNA, RNA, and/or ligands.
msas
array
Pre-computed MSAs, one entry per complex. Each entry is a ComplexMSAs (per-chain MSAs keyed by chain index); paired=True marks rows taxonomy-aligned across chains. Populated by preprocess() or supplied directly.
Source
model_name
enum
default:"protenix_base_default_v1.0.0"
Protenix model variant to use. Available models:Available options: protenix_base_default_v1.0.0, protenix_base_20250630_v1.0.0, protenix_base_default_v0.5.0, protenix_base_constraint_v0.5.0, protenix_mini_esm_v0.5.0, protenix_mini_ism_v0.5.0, protenix_mini_default_v0.5.0, protenix_tiny_default_v0.5.0, protenix-v2
seeds
List[integer]
default:"[0]"
Random seeds for structure sampling. Each seed produces num_diffusion_samples independent structure samples. Multiple seeds increase diversity of the sampled conformations. A single seed is sufficient for most use cases; more seeds may help for challenging docking tasks such as antibody-antigen complexes.
num_diffusion_samples
integer
default:"5"
Independent structure samples per seed; only the best by ranking score is returned. Higher = more thorough but slower. Default 5 (matches upstream).
num_diffusion_steps
integer
Denoising steps in the diffusion process. None uses the upstream schedule: 200 for base/constraint, 5 for mini/tiny. Default None.
num_pairformer_cycles
integer
Pairformer refinement passes through the model. None uses the upstream schedule: 10 for base/constraint, 4 for mini/tiny. Default None.
verbose
integer
default:"0"
Whether to print status messages during execution including MSA generation, model loading, and prediction progress. Inherited from StructurePredictionConfig. Default: False.
device
string
default:"cuda"
Device to run the model on (e.g., "cuda", "cpu"). Inherited from StructurePredictionConfig. Default: "cuda".
timeout
integer
default:"1200"
Maximum execution time in seconds. Base models need ~10-15 minutes on slower GPUs. None waits indefinitely. Default: 1200.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
include_pae_matrix
boolean
default:"False"
Attach pae (avg_pae always emitted). Default: False.
use_msa
boolean
default:"True"
Whether to generate and use Multiple Sequence Alignments (MSAs) for protein chains using MMseqs2 homology search. Inherited from MSAStructurePredictionConfig. Default: True.
msa_search_config
Mmseqs2HomologySearchConfig
Configuration for MMseqs2 homology search (MSA generation). Only used when use_msa=True. Inherited from MSAStructurePredictionConfig. Default: None.
pair_heterocomplex_msas
boolean
default:"True"
Whether heterocomplex protein chains should use taxonomy-paired MSA generation. Inherited from MSAStructurePredictionConfig. Default: True.
Source
structures
List[Structure]
required
Predicted structures, each carrying a :class:ProtenixMetrics instance on .metrics.
Metrics (one set per structures item)
MetricTypeRangeAvailability
confidence_scorefloatunboundedalways
ptmfloat0.0 to 1.0always
iptmfloat0.0 to 1.0always
avg_plddtfloat0.0 to 1.0always
gpdefloat≥ 0.0always
avg_paefloat≥ 0.0always
paelist[list[float]]≥ 0.0when include_pae_matrix=True
chain_ptmlist[float]0.0 to 1.0depends on model output
chain_plddtlist[float]0.0 to 1.0depends on model output
chain_pair_iptmlist[list[float]]0.0 to 1.0depends on model output
has_clashboolunboundeddepends on model output

Applications

This tool predicts the structure of multi-component assemblies such as protein-DNA and protein-RNA complexes, protein-ligand binding poses, and chains carrying modified residues. For a multi-chain complex it also reports how confidently the chains are placed relative to one another: interface pTM (ipTM) gives a single 0-to-1 score for the overall inter-chain arrangement, per-chain-pair ipTM scores each individual interface, and the cross-chain blocks of the PAE matrix show which specific inter-chain regions are positioned confidently versus uncertainly. These let you rank or filter predicted complexes and judge whether a docking pose or binding interface is reliable before trusting it downstream.

Usage Tips

  • model_name selects the accuracy/speed trade-off. The default protenix_base_default_v1.0.0 is the most accurate (10 Pairformer cycles, 200 diffusion steps); the mini and tiny variants are far faster with fewer cycles and steps, and protenix_mini_esm_v0.5.0 / protenix_mini_ism_v0.5.0 use protein language-model embeddings for MSA-free prediction.
  • protenix-v2 weights are gated by ByteDance. They are not currently distributed publicly; if you have a copy, drop it into the resolved weights directory before selecting model_name="protenix-v2". See notes/storage.md for path resolution.
  • use_msa defaults to True. A ColabFold search generates an MSA for each protein chain; set it False, attach precomputed MSAs, or use an ESM/ISM mini variant to skip alignments entirely.
  • Diffusion sampling is controlled by seeds and num_diffusion_samples. Protenix draws num_diffusion_samples (default 5) structures per seed and keeps the best by ranking score; the total number of candidates is len(seeds) times num_diffusion_samples. Setting seed overrides seeds with a single value for reproducibility.
  • num_pairformer_cycles and num_diffusion_steps trade accuracy for time. Defaults are checkpoint-aware: base and constraint variants use 10 cycles and 200 steps, while mini and tiny variants use 4 cycles and 5 steps, matching each checkpoint’s native upstream schedule. Override either field to apply a custom schedule regardless of model_name.
  • Confidence is reported as pLDDT, pTM, ipTM, gPDE, and PAE. confidence_score, the ranking score and primary metric, selects the best sample; avg_plddt is on a 0 to 1 scale and PAE and gPDE are in angstroms. has_clash flags steric clashes. Set include_pae_matrix to attach the full per-token PAE matrix.
  • Modified residues are supported. Protein PTMs and DNA/RNA modifications are passed through as CCD codes, as in AlphaFold3.

Toolkit Notes

These apply to every Protenix tool in this toolkit (protenix-prediction).
  • Requires a GPU. Protenix runs through a PyTorch backend and needs an NVIDIA GPU; base models are memory-intensive and slower, while mini and tiny variants run on more modest hardware. CPU execution is not practical.
  • Open AlphaFold3 reproduction. Unlike AlphaFold3, whose weights are gated and non-commercial, Protenix releases both code and weights under Apache-2.0 for academic and commercial use. Like Boltz-2 it follows the AlphaFold3 diffusion architecture, and additionally accepts modified residues.
  • Predictions are stochastic. Structures come from a diffusion process, so repeated runs vary unless sampling is seeded.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.