Protenix - Proto

License: Protenix is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

Proto is not affiliated with ByteDance. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.

bytedance/Protenix

Toward High-Accuracy Open-Source Biomolecular Structure Prediction.

1.7k stars

Protenix: An Open-Source Implementation of AlphaFold 3

ByteDance Research

bioRxiv (2025)

@article{bytedance2025protenix,
  title={Protenix: An Open-Source Implementation of AlphaFold 3},
  author={{ByteDance Research}},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.01.08.631967},
  publisher={Cold Spring Harbor Laboratory}
}

proto-bio/proto-tools/proto_tools/tools/structure_prediction/protenix

Function	Description
`run_protenix()`	Multi-modal structure prediction using Protenix (open-source AlphaFold3) (GPU)

Background

Protenix (ByteDance Research, 2025) predicts the joint 3D structure of a biomolecular assembly from the sequences and chemical components it contains. It is a trainable, openly licensed reproduction of the AlphaFold3 architecture: like AlphaFold3, one model folds complexes that mix proteins, DNA, RNA, and small-molecule ligands, and predicts how those components are arranged relative to one another. Each protein chain can be paired with a multiple-sequence alignment (MSA) of evolutionarily related sequences, whose covariation patterns supply the evolutionary signal the model uses to place residues.

Architecturally, Protenix follows AlphaFold3 rather than AlphaFold2: it carries a single representation of the input tokens and a pairwise representation over token pairs, refines them through a Pairformer trunk, and generates all-atom coordinates with a diffusion module that starts from noise and iteratively denoises into a structure, in place of AlphaFold2’s structure module. Several structures are sampled per random seed and ranked by a confidence score.

Protenix is distributed in several sizes: full-parameter base models for highest accuracy, and lighter mini and tiny variants for faster, lower-memory prediction. The mini_esm and mini_ism variants replace the MSA with learned embeddings (from the ESM-2 protein language model or the ISM inverse-structure model, respectively), so they can fold without an alignment.

Predicted confidence includes a per-residue predicted local distance difference test (pLDDT) for local reliability, a predicted aligned error (PAE) for the relative placement of any two tokens, a global predicted distance error (gPDE), and predicted template-modeling (pTM) and interface predicted template-modeling (ipTM) scores that summarize overall and interface accuracy.

The reference implementation is open-sourced at bytedance/Protenix, with both the code and the model parameters released under the Apache-2.0 license for academic and commercial use. It was developed by ByteDance’s AI4Science team as a comprehensive reproduction of AlphaFold3, trained on comparable data to reach competitive accuracy across protein, nucleic-acid, and protein-ligand benchmarks.

Learning Resources

bytedance/Protenix (ByteDance) - the official repository, with a model card for each variant, benchmark results across protein, nucleic-acid, and ligand tasks, and a link to the hosted Protenix web server.

Tools

Protenix Structure Prediction (`protenix-prediction`)

Predicts the 3D structure of a biomolecular complex. Each input complex can combine protein, DNA, RNA, and ligand chains, with optional post-translational and nucleotide modifications; the assembly is folded by Protenix and returned as a predicted Structure per complex with confidence metrics: average pLDDT, pTM, interface pTM, per-chain and pairwise-chain scores, a global predicted distance error, and predicted aligned error.

API Reference

Input: ProtenixInput

complexes

List[Complex]

required

List of complexes to predict structures for. Inherited from StructurePredictionInput. Each complex can contain multiple chains of proteins, DNA, RNA, and/or ligands.

Complex fields:

chains

List[Chain | Fragment]

required

Chains in the complex, in input order.

msas

array

Pre-computed MSAs, one entry per complex. Each entry is a ComplexMSAs (per-chain MSAs keyed by chain index); paired=True marks rows taxonomy-aligned across chains. Populated by preprocess() or supplied directly.

Config: ProtenixConfig

model_name

enum

default:"protenix_base_default_v1.0.0"

Protenix model variant to use. Available options: protenix_base_default_v1.0.0, protenix_base_20250630_v1.0.0, protenix_base_default_v0.5.0, protenix_base_constraint_v0.5.0, protenix_mini_esm_v0.5.0, protenix_mini_ism_v0.5.0, protenix_mini_default_v0.5.0, protenix_tiny_default_v0.5.0, protenix-v2

seeds

List[integer]

default:"[0]"

Random seeds for structure sampling. Each seed produces num_diffusion_samples independent structure samples. Multiple seeds increase diversity of the sampled conformations. A single seed is sufficient for most use cases; more seeds may help for challenging docking tasks such as antibody-antigen complexes.

num_diffusion_samples

integer

default:"5"

Independent structure samples per seed; only the best by ranking score is returned. Higher = more thorough but slower. Default 5 (matches upstream).

num_diffusion_steps

integer

Denoising steps in the diffusion process. None uses the upstream schedule: 200 for base/constraint, 5 for mini/tiny. Default None.

num_pairformer_cycles

integer

Pairformer refinement passes through the model. None uses the upstream schedule: 10 for base/constraint, 4 for mini/tiny. Default None.

verbose

integer

default:"0"

Whether to print status messages during execution, including MSA generation, model loading, and prediction progress. Inherited from StructurePredictionConfig. Default: 0 (off).

device

string

default:"cuda"

Device to run the model on (e.g., "cuda", "cpu"). Inherited from StructurePredictionConfig. Default: "cuda".

timeout

integer

default:"1200"

Maximum execution time in seconds. Base models need ~10-15 minutes on slower GPUs. None waits indefinitely. Default: 1200.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

include_pae_matrix

boolean

default:"False"

Attach the full PAE matrix as the pae metric. avg_pae is always emitted regardless of this setting. Default: False.

use_msa

boolean

default:"True"

Whether to generate and use Multiple Sequence Alignments (MSAs) for protein chains using MMseqs2 homology search. Inherited from MSAStructurePredictionConfig. Default: True.

msa_search_config

Mmseqs2HomologySearchConfig

Configuration for MMseqs2 homology search (MSA generation). Only used when use_msa=True. Inherited from MSAStructurePredictionConfig. Default: None.

pair_heterocomplex_msas

boolean

default:"True"

Whether heterocomplex protein chains should use taxonomy-paired MSA generation. Inherited from MSAStructurePredictionConfig. Default: True.
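
As a rough illustration of how the None defaults for num_pairformer_cycles and num_diffusion_steps could resolve, the sketch below mirrors a few documented fields in a plain dataclass and fills in the schedule stated above. ConfigSketch and resolve_schedule are hypothetical stand-ins written for this example, not the toolkit's ProtenixConfig. Reading the variant family from substrings of model_name is an assumption, and protenix-v2 is deliberately left unresolved because its schedule is not documented here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConfigSketch:
    """Hypothetical mirror of a few documented ProtenixConfig fields."""
    model_name: str = "protenix_base_default_v1.0.0"
    seeds: List[int] = field(default_factory=lambda: [0])
    num_diffusion_samples: int = 5
    num_diffusion_steps: Optional[int] = None    # None -> documented schedule
    num_pairformer_cycles: Optional[int] = None  # None -> documented schedule


def resolve_schedule(cfg: ConfigSketch) -> Tuple[int, int]:
    """Return (pairformer_cycles, diffusion_steps) with None values filled in."""
    name = cfg.model_name
    # Assumption: the variant family can be read from the model name.
    if "_mini_" in name or "_tiny_" in name:
        cycles, steps = 4, 5
    elif "_base_" in name:  # covers the base default and base constraint names
        cycles, steps = 10, 200
    else:
        raise ValueError(f"no documented schedule for {name!r}")
    # Explicit values take precedence over the per-checkpoint defaults.
    if cfg.num_pairformer_cycles is not None:
        cycles = cfg.num_pairformer_cycles
    if cfg.num_diffusion_steps is not None:
        steps = cfg.num_diffusion_steps
    return cycles, steps


print(resolve_schedule(ConfigSketch()))                                       # (10, 200)
print(resolve_schedule(ConfigSketch(model_name="protenix_mini_esm_v0.5.0")))  # (4, 5)
print(resolve_schedule(ConfigSketch(num_diffusion_steps=50)))                 # (10, 50)
```

The third call shows an override applied on top of a base checkpoint: only the field that was set changes, and the other keeps its documented default.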

Output: ProtenixOutput

structures

List[Structure]

required

Predicted structures, each carrying a ProtenixMetrics instance on .metrics.

Structure fields:

structure

string

required

Raw structure content in PDB or CIF format.

structure_format

string

Format of the content string (auto-detected if omitted).

b_factor_type

BFactorType

What the B-factor column represents.

source

string

Optional source identifier (filepath or tool name).

metrics

Metrics

Associated metrics (e.g., pLDDT, pTM scores, per-chain lists, pairwise matrices). None values are stripped at construction.

Metrics (one set per structures item)

Metric	Type	Range	Availability
`confidence_score`	float	unbounded	always
`ptm`	float	0.0 to 1.0	always
`iptm`	float	0.0 to 1.0	always
`avg_plddt`	float	0.0 to 1.0	always
`gpde`	float	≥ 0.0	always
`avg_pae`	float	≥ 0.0	always
`pae`	list[list[float]]	≥ 0.0	when include_pae_matrix=True
`chain_ptm`	list[float]	0.0 to 1.0	depends on model output
`chain_plddt`	list[float]	0.0 to 1.0	depends on model output
`chain_pair_iptm`	list[list[float]]	0.0 to 1.0	depends on model output
`has_clash`	bool	True or False	depends on model output
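
One way to use these metrics is to filter and rank a batch of predictions. The sketch below works on plain dictionaries keyed by the metric names in the table, so it does not depend on the toolkit's Structure or ProtenixMetrics classes. The helper names, the label key, the sample values, and the thresholds (0.6 for iptm, 0.7 for avg_plddt) are illustrative assumptions, not recommendations from Protenix, and the sort assumes that a higher confidence_score is better.

```python
from typing import Any, Dict, List


def passes_filters(m: Dict[str, Any],
                   min_iptm: float = 0.6,
                   min_avg_plddt: float = 0.7) -> bool:
    """Keep a prediction only if it is clash-free and above both thresholds."""
    # has_clash "depends on model output", so treat a missing key as no clash.
    if m.get("has_clash", False):
        return False
    return m["iptm"] >= min_iptm and m["avg_plddt"] >= min_avg_plddt


def rank_predictions(metrics_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Filter, then sort by confidence_score (assumed: higher is better)."""
    kept = [m for m in metrics_list if passes_filters(m)]
    return sorted(kept, key=lambda m: m["confidence_score"], reverse=True)


# Made-up metric values for four hypothetical complexes.
predictions = [
    {"label": "complex_a", "confidence_score": 0.81, "iptm": 0.78, "avg_plddt": 0.88, "has_clash": False},
    {"label": "complex_b", "confidence_score": 0.90, "iptm": 0.85, "avg_plddt": 0.91, "has_clash": True},
    {"label": "complex_c", "confidence_score": 0.55, "iptm": 0.42, "avg_plddt": 0.74},
    {"label": "complex_d", "confidence_score": 0.86, "iptm": 0.80, "avg_plddt": 0.90},
]

for m in rank_predictions(predictions):
    # avg_plddt is on a 0 to 1 scale; multiply by 100 for the familiar 0 to 100 view.
    print(m["label"], round(m["avg_plddt"] * 100))
```

Here complex_b is dropped for its clash flag and complex_c for its low iptm, leaving complex_d ahead of complex_a.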

Applications

This tool predicts the structure of multi-component assemblies such as protein-DNA and protein-RNA complexes, protein-ligand binding poses, and chains carrying modified residues. For a multi-chain complex it also reports how confidently the chains are placed relative to one another: interface pTM (ipTM) gives a single 0-to-1 score for the overall inter-chain arrangement, per-chain-pair ipTM scores each individual interface, and the cross-chain blocks of the PAE matrix show which specific inter-chain regions are positioned confidently versus uncertainly. These let you rank or filter predicted complexes and judge whether a docking pose or binding interface is reliable before trusting it downstream.
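
The cross-chain reading of the PAE matrix can be sketched as a block average. The example assumes you have the matrix as a list of lists (as the pae metric is typed) and a per-token list of chain labels. That token_chain_ids list is a hypothetical input for this example, since this page does not describe how tokens map to chains, and the small matrix is made-up data.

```python
from typing import Dict, List, Tuple


def chain_pair_mean_pae(pae: List[List[float]],
                        token_chain_ids: List[str]) -> Dict[Tuple[str, str], float]:
    """Average PAE over each (row chain, column chain) block of the matrix."""
    if len(pae) != len(token_chain_ids):
        raise ValueError("pae must have one row per token")
    sums: Dict[Tuple[str, str], float] = {}
    counts: Dict[Tuple[str, str], int] = {}
    for i, row in enumerate(pae):
        if len(row) != len(token_chain_ids):
            raise ValueError("pae must be square")
        for j, value in enumerate(row):
            key = (token_chain_ids[i], token_chain_ids[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy 4-token example: two tokens in chain A, two in chain B (values in angstroms).
pae = [
    [0.5, 1.0, 8.0, 9.0],
    [1.0, 0.5, 7.0, 10.0],
    [6.0, 8.0, 0.5, 1.5],
    [9.0, 9.0, 1.5, 0.5],
]
blocks = chain_pair_mean_pae(pae, ["A", "A", "B", "B"])
print(blocks[("A", "A")])  # 0.75
print(blocks[("A", "B")])  # 8.5
print(blocks[("B", "A")])  # 8.0
```

In this toy case the within-chain blocks are low and the cross-chain blocks are high, the pattern you would expect when each chain is folded confidently but their relative placement is uncertain. The (A, B) and (B, A) blocks are kept separate because a PAE matrix is not necessarily symmetric.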

Usage Tips

model_name selects the accuracy/speed trade-off. The default protenix_base_default_v1.0.0 is the most accurate (10 Pairformer cycles, 200 diffusion steps); the mini and tiny variants are far faster with fewer cycles and steps, and protenix_mini_esm_v0.5.0 / protenix_mini_ism_v0.5.0 use protein language-model embeddings for MSA-free prediction.
protenix-v2 weights are gated by ByteDance. They are not currently distributed publicly; if you have a copy, drop it into the resolved weights directory before selecting model_name="protenix-v2". See notes/storage.md for path resolution.
use_msa defaults to True. A ColabFold search generates an MSA for each protein chain. To skip the search, attach precomputed MSAs; to run without alignments entirely, set use_msa to False or use an ESM/ISM mini variant.
Diffusion sampling is controlled by seeds and num_diffusion_samples. Protenix draws num_diffusion_samples (default 5) structures per seed and keeps the best by ranking score; the total number of candidates is len(seeds) times num_diffusion_samples. Setting seed overrides seeds with a single value for reproducibility.
num_pairformer_cycles and num_diffusion_steps trade accuracy for time. Defaults are checkpoint-aware: base and constraint variants use 10 cycles and 200 steps, while mini and tiny variants use 4 cycles and 5 steps, matching each checkpoint’s native upstream schedule. Override either field to apply a custom schedule regardless of model_name.
Confidence is reported as pLDDT, pTM, ipTM, gPDE, and PAE. confidence_score, the ranking score and primary metric, selects the best sample; avg_plddt is on a 0 to 1 scale and PAE and gPDE are in angstroms. has_clash flags steric clashes. Set include_pae_matrix to attach the full per-token PAE matrix.
Modified residues are supported. Protein PTMs and DNA/RNA modifications are passed through as CCD codes, as in AlphaFold3.
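
The sampling arithmetic in the tips above can be sketched without running the model. In the example below, toy_score is a stand-in for the model's ranking score, and effective_seeds and select_best are hypothetical helpers written for this illustration. It also assumes that the single returned structure is the best across all seeds rather than the best per seed, and that a higher score is better.

```python
import random
from typing import Callable, List, Optional, Tuple


def effective_seeds(seeds: List[int], seed: Optional[int]) -> List[int]:
    """Mirror the documented rule that a set `seed` replaces `seeds`."""
    return [seed] if seed is not None else list(seeds)


def toy_score(seed: int, sample_idx: int) -> float:
    """Stand-in for a ranking score; deterministic for a given (seed, sample)."""
    return random.Random(seed * 1000 + sample_idx).random()


def select_best(seeds: List[int],
                num_diffusion_samples: int,
                score_fn: Callable[[int, int], float]) -> Tuple[int, Tuple[float, int, int]]:
    """Score every (seed, sample) candidate and return the count and the best one."""
    candidates = [
        (score_fn(s, k), s, k)
        for s in seeds
        for k in range(num_diffusion_samples)
    ]
    return len(candidates), max(candidates)


n_candidates, (score, seed, sample_idx) = select_best(
    effective_seeds([0, 1, 2], None), 5, toy_score
)
print(n_candidates)      # 15, i.e. len(seeds) * num_diffusion_samples
print(seed, sample_idx)  # which (seed, sample) pair scored highest
```

Because the toy score depends only on the seed and sample index, repeating the call gives the same winner, which parallels the note that seeded runs are reproducible up to small GPU float noise.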

Toolkit Notes

These apply to every Protenix tool in this toolkit (protenix-prediction).

Requires a GPU. Protenix runs through a PyTorch backend and needs an NVIDIA GPU; base models are memory-intensive and slower, while mini and tiny variants run on more modest hardware. CPU execution is not practical.
Open AlphaFold3 reproduction. Unlike AlphaFold3, whose weights are gated and non-commercial, Protenix releases both code and weights under Apache-2.0 for academic and commercial use. Like Boltz-2 it follows the AlphaFold3 diffusion architecture, and additionally accepts modified residues.
Predictions are stochastic. Structures come from a diffusion process, so repeated runs vary unless sampling is seeded.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.