License: ProteinMPNN is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with Institute for Protein Design. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


dauparas/ProteinMPNN
Code for the ProteinMPNN paper
1.7k stars
Robust deep learning–based protein sequence design using ProteinMPNN
Justas Dauparas, Ivan Anishchenko, … Neville Bethel
Science (2022)
@article{dauparas2022proteinmpnn,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.add2187}
}
proto-bio/proto-tools/proto_tools/tools/inverse_folding/proteinmpnn
| Function | Description |
| --- | --- |
| run_proteinmpnn_gradient() | Compute ProteinMPNN structure-conditioned perplexity gradient for relaxed protein sequences (GPU) |
| run_proteinmpnn_sample() | Sample protein sequences using ProteinMPNN (GPU) |
| run_proteinmpnn_score() | Score protein sequences using ProteinMPNN (GPU) |

Background

ProteinMPNN (Dauparas et al., 2022) solves the inverse-folding problem: given a fixed protein backbone (the 3D coordinates of its N, C-alpha, C, and O atoms), predict an amino-acid sequence that will fold into that structure. It is the inverse of structure prediction and a core step in protein design, where a backbone is proposed first and a sequence that encodes it is designed afterwards.

Internally, ProteinMPNN encodes the backbone as a graph: each residue is a node connected to its 48 nearest neighbors in space, with edges featurized by inter-atomic distances between the backbone atoms (including a virtual C-beta). A message-passing encoder turns this geometry into node and edge representations, and a decoder then generates the sequence autoregressively. ProteinMPNN is trained with a random decoding order rather than a fixed N-to-C order, so at inference any order can be used and arbitrary subsets of positions can be held fixed while the rest are designed in full structural context.

The model was trained on protein structures from the Protein Data Bank. During training, a small amount of Gaussian noise was added to the backbone coordinates so the model is robust to imperfect, non-crystal backbones; this slightly lowers native-sequence recovery but yields sequences that more reliably fold to the intended structure. On native backbones it recovers roughly 52% of the native sequence on average, compared with roughly 33% for physically based Rosetta design.

ProteinMPNN designs have been experimentally validated by X-ray crystallography and cryo-electron microscopy, and ProteinMPNN rescued designs of monomers, cyclic homo-oligomers, nanoparticles, and target-binding proteins that had failed when made with Rosetta or AlphaFold.
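The neighbor graph described in this section can be illustrated with a small sketch. It is a simplification under stated assumptions, not ProteinMPNN's own featurization code: it measures distances between C-alpha atoms only, excludes the residue itself from its neighbor list, and uses a hypothetical helper name `knn_graph`.

```python
import math

def knn_graph(ca_coords, k=48):
    """For each residue, return the indices of its k nearest residues
    by C-alpha distance (the residue itself is excluded, an assumption)."""
    n = len(ca_coords)
    neighbors = []
    for i in range(n):
        # Distance from residue i to every other residue.
        dists = [(math.dist(ca_coords[i], ca_coords[j]), j)
                 for j in range(n) if j != i]
        dists.sort()
        # Short chains simply get fewer than k neighbors.
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Toy backbone: five C-alpha atoms on a line, 3.8 angstroms apart.
ca = [(3.8 * i, 0.0, 0.0) for i in range(5)]
graph = knn_graph(ca, k=2)
```

With `k=2` the middle residue is linked to its two sequence neighbors. On a real backbone with `k=48`, the neighbor list also picks up residues that are distant in sequence but close in space.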

Tools

ProteinMPNN Sampling (proteinmpnn-sample)

Designs new sequences for a given backbone. Each input structure is encoded once and decoded into one or more candidate sequences, each returned with a perplexity and the sequence recovery against the structure’s original sequence.
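Sequence recovery is commonly computed as the fraction of positions at which the designed residue matches the original one. The sketch below assumes that definition; the function name `sequence_recovery` and the choice of which positions are compared are illustrative assumptions, not the toolkit's documented behavior.

```python
def sequence_recovery(designed, original, positions=None):
    """Fraction of compared positions where `designed` matches `original`.

    positions: optional 1-based positions to compare; all positions if None.
    """
    if len(designed) != len(original):
        raise ValueError("sequences must have the same length")
    if positions is None:
        idx = list(range(len(original)))
    else:
        idx = [p - 1 for p in positions]  # convert 1-based to 0-based
    if not idx:
        raise ValueError("no positions to compare")
    matches = sum(designed[i] == original[i] for i in idx)
    return matches / len(idx)

original = "MKTAYIAK"
designed = "MKSAYLAK"  # differs at positions 3 and 6
recovery = sequence_recovery(designed, original)
```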

API Reference

Inputs

  • inputs (List[InverseFoldingStructureInput], required). Per-structure inputs, each containing a structure plus optional chains_to_redesign and fixed_positions selections.

Config

  • model_choice (enum, default "proteinmpnn"). Model weights. "proteinmpnn" is ColabDesign’s default v_48_020 (medium training noise). The v_48_* variants are the same architecture trained at different noise levels (002 / 010 / 030). "abmpnn" is antibody-optimized; "soluble" is soluble-protein-trained. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble.
  • backbone_noise (number, default 0.0). Gaussian noise (Å) added to backbone coordinates before each forward pass.
  • excluded_amino_acids (array). One-letter codes of amino acids to exclude.
  • verbose (integer, default 0). Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default "cuda"). Device to run the model on. Options include "cuda" (NVIDIA GPU), "cpu" (CPU execution), or specific GPU devices like "cuda:0".
  • timeout (integer, default 600). Maximum execution time in seconds. None waits indefinitely.
  • seed (integer). Random seed to use for sampling.
  • num_sequences_per_structure (integer, default 1). Total number of sequences to generate per input structure.
  • batch_size (integer). Number of sequences to process simultaneously on GPU. Defaults to num_sequences_per_structure.
  • temperature (number, default 0.1). Controls randomness in sampling from logits.

Outputs

  • design_sets (List[ProteinMPNNDesignSet], required). One ProteinMPNNDesignSet per input structure, in input order. Entry i holds all complexes for input structure i.

Applications

Use this to redesign or stabilize a natural protein, or to generate sequences for a de novo backbone (for example one from RFdiffusion). The standard design loop is to sample many sequences per backbone, rank by perplexity, and validate the top candidates with a structure predictor.
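The ranking step of that design loop can be sketched as follows. The `Candidate` type, the `top_candidates` helper, and the perplexity values are made up for illustration; real values would come from the sampling output. The sketch also drops duplicate sequences, which low-temperature sampling tends to produce.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    sequence: str
    perplexity: float

def top_candidates(candidates, n):
    """Return the n distinct sequences with the lowest perplexity, best first."""
    best = {}
    for c in candidates:
        # Keep the lowest-perplexity entry for each distinct sequence.
        if c.sequence not in best or c.perplexity < best[c.sequence].perplexity:
            best[c.sequence] = c
    return sorted(best.values(), key=lambda c: c.perplexity)[:n]

# Illustrative values only.
samples = [
    Candidate("MKTAYIAK", 3.4),
    Candidate("MKSAYLAK", 2.9),
    Candidate("MRTAYIAR", 4.1),
    Candidate("MKSAYLAK", 3.1),  # duplicate sequence, worse score
    Candidate("MKTGYIAK", 3.0),
]
shortlist = top_candidates(samples, 2)
```

The shortlist is what would then be passed to a structure predictor for validation.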

Usage Tips

  • temperature (default 0.1) controls diversity. Lower values are greedier and stay close to the single most likely sequence, while higher values sample more varied sequences. A value near 0.0 behaves like an argmax, and the temperature must be at least 0.
  • Lower batch_size if you hit GPU memory limits. It defaults to num_sequences_per_structure, so every requested sequence is generated in one forward pass. For large requests or long backbones this can exhaust GPU memory, and a smaller batch_size trades speed for lower memory.
  • model_choice selects the weights. The default proteinmpnn is v_48_020. The v_48_002, v_48_010, and v_48_030 variants are trained with increasing backbone noise, which makes designs more robust and diverse at the cost of native-sequence recovery. abmpnn is antibody-tuned. Use soluble when the design must be water-soluble, because the default model tends to place hydrophobic residues on membrane-like surfaces whereas soluble is retrained with transmembrane proteins excluded.
  • fixed_positions is counted from 1, not 0. Listing a position keeps that residue at its input identity, which is how you preserve catalytic or interface residues while redesigning everything else.
  • excluded_amino_acids forbids residue types everywhere. Use it to keep unwanted residues out of every design, for example ["C"] to avoid introducing cysteines.
  • backbone_noise (default 0.0) and seed. backbone_noise adds Gaussian noise in angstroms to the input backbone. Small values such as 0.02 increase diversity at some cost in recovery. Set seed for reproducible sampling.
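The interaction between temperature and excluded_amino_acids can be illustrated for a single position. This is a generic sketch of temperature-scaled sampling, not the toolkit's implementation: the 20-letter `ALPHABET`, the `sample_residue` helper, and the toy logits are assumptions made for the example.

```python
import math
import random

# Assumed 20-letter alphabet; the tool's own vocabulary may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def sample_residue(logits, temperature=0.1, excluded=(), rng=random):
    """Sample one residue from per-residue logits at a given temperature."""
    if temperature < 0.0:
        raise ValueError("temperature must be at least 0")
    allowed = [i for i, aa in enumerate(ALPHABET) if aa not in excluded]
    if not allowed:
        raise ValueError("every amino acid is excluded")
    if temperature == 0.0:
        # Zero temperature: deterministic argmax over the allowed residues.
        return ALPHABET[max(allowed, key=lambda i: logits[i])]
    scaled = [logits[i] / temperature for i in allowed]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return ALPHABET[rng.choices(allowed, weights=weights, k=1)[0]]

# Toy logits that favor cysteine, then leucine.
logits = [0.0] * 20
logits[ALPHABET.index("C")] = 2.0
logits[ALPHABET.index("L")] = 1.5

greedy = sample_residue(logits, temperature=0.0)
no_cys = sample_residue(logits, temperature=0.0, excluded=["C"])
rng = random.Random(0)  # seeded generator for reproducible draws
diverse = [sample_residue(logits, temperature=1.0, excluded=["C"], rng=rng)
           for _ in range(50)]
```

Dividing by a small temperature sharpens the distribution toward the top-scoring residue, while excluded residues get zero probability at any temperature.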

ProteinMPNN Scoring (proteinmpnn-score)

Evaluates how well existing sequences fit a structure. Each (sequence, structure) pair is scored under ProteinMPNN’s structure-conditioned likelihood, returning log-likelihood, average log-likelihood, and perplexity, with optional per-position logits.

API Reference

Inputs

  • sequence_structure_pairs (List[SequenceStructurePair], required). List of sequence-structure pairs to score. Each pair contains a sequence, a structure, and optional per-pair fixed_positions excluded from the scoring metrics.

Config

  • return_logits (boolean, default False). Whether to include per-position logits in the output. When True, returns logits for each sequence. When False, only returns metrics (saves memory and serialization time).
  • model_choice (enum, default "proteinmpnn"). Model weights. "proteinmpnn" is ColabDesign’s default v_48_020 (medium training noise). The v_48_* variants are the same architecture trained at different noise levels (002 / 010 / 030). "abmpnn" is antibody-optimized; "soluble" is soluble-protein-trained. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble.
  • verbose (integer, default 0). Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default "cuda"). Device to run the model on. Options include "cuda" (NVIDIA GPU), "cpu" (CPU execution).
  • timeout (integer, default 600). Maximum execution time in seconds. None waits indefinitely.
  • seed (integer). Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs

  • scores (List[InverseFoldingScoringMetrics], required). List of scoring outputs, one per input sequence-structure pair. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields.

Metrics (one set per scores item)

| Metric | Type | Range | Availability |
| --- | --- | --- | --- |
| log_likelihood | float | ≤ 0.0 | always |
| avg_log_likelihood | float | ≤ 0.0 | always |
| perplexity | float | ≥ 1.0 | always |

Applications

Use this to rank candidate sequences or point mutations by structural compatibility without generating new ones: compare designs, assess the effect of a substitution, or filter a library before experimental testing. Lower perplexity indicates a better structure-sequence fit.
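The three metrics are related in the usual language-model way, which a short sketch makes concrete. The formulas are the standard definitions (sum of per-position log-probabilities, their mean, and the exponential of the negative mean); they are consistent with the documented ranges but are an assumption about the toolkit's exact computation. The helper name and the probabilities are illustrative.

```python
import math

def scoring_metrics(per_position_log_probs):
    """Standard likelihood metrics from per-position log-probabilities."""
    log_likelihood = sum(per_position_log_probs)
    avg_log_likelihood = log_likelihood / len(per_position_log_probs)
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": math.exp(-avg_log_likelihood),
    }

# Illustrative probabilities: the mutant has one less likely residue.
wild_type = scoring_metrics([math.log(0.5)] * 4)
mutant = scoring_metrics(
    [math.log(0.5), math.log(0.5), math.log(0.25), math.log(0.5)]
)
delta = mutant["perplexity"] - wild_type["perplexity"]
```

A positive `delta` means the substitution fits the structure worse than the reference sequence under the model.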

Usage Tips

  • Set fixed_positions per (sequence, structure) pair to score only part of a chain. It lives on each input pair as a {chain: [positions]} selection, not in the config. Listed positions are skipped when computing log-likelihood and perplexity, so the score reflects just the residues you care about instead of the whole sequence. NOTE: Positions are per chain and counted from 1, not 0, to match biological residue selection conventions.
  • return_logits (default False) has a size trade-off. Enabling it returns a per-position (sequence length x 21) logit array per sequence for residue-level analysis. That array dominates output size and memory for long sequences or large batches, so leave it off unless you need it.
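How 1-based fixed_positions interact with per-position logits can be sketched for a single chain. This is an illustration under assumptions, not the toolkit's code: `masked_avg_log_likelihood` is a hypothetical helper, and the toy logits have two columns instead of the 21 described above to keep the numbers easy to follow.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]

def masked_avg_log_likelihood(logits, target_columns, fixed_positions=()):
    """Average log-probability of the target residues, skipping fixed positions.

    fixed_positions uses 1-based numbering within the chain.
    """
    skip = {p - 1 for p in fixed_positions}  # convert to 0-based indices
    scored = [
        log_softmax(row)[col]
        for i, (row, col) in enumerate(zip(logits, target_columns))
        if i not in skip
    ]
    if not scored:
        raise ValueError("every position is fixed")
    return sum(scored) / len(scored)

# Three positions, two toy residue classes.
logits = [[0.0, 0.0], [math.log(3.0), 0.0], [0.0, 0.0]]
targets = [0, 0, 1]
full = masked_avg_log_likelihood(logits, targets)
without_pos2 = masked_avg_log_likelihood(logits, targets, fixed_positions=[2])
```

Position 2 is the confident one here, so excluding it lowers the average log-likelihood of what remains.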

ProteinMPNN Gradient (proteinmpnn-gradient)

Exposes ProteinMPNN as a differentiable structure-conditioned objective: given a relaxed (L, 20) sequence distribution and a backbone, it returns the mean negative log-likelihood and its gradient with respect to the input logits, for use as a loss in gradient-based or MCMC sequence optimization.
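Because the 20 columns of the relaxed input are read by position in the order ACDEFGHIKLMNPQRSTVWY, arrays produced in another ordering need their columns permuted first. The sketch below shows one way to do that with plain lists; `reorder_columns` and the alternative ordering `OTHER` are hypothetical, chosen only for the example.

```python
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"

def reorder_columns(matrix, source_order, target_order=CANONICAL):
    """Permute the columns of an L x 20 matrix from source_order to target_order."""
    if sorted(source_order) != sorted(target_order):
        raise ValueError("orderings must contain the same letters")
    column_of = {aa: i for i, aa in enumerate(source_order)}
    perm = [column_of[aa] for aa in target_order]
    return [[row[j] for j in perm] for row in matrix]

# Hypothetical upstream ordering (alphabetical by three-letter code).
OTHER = "ARNDCQEGHILKMFPSTWYV"

# One row where the value in each column equals its index in OTHER.
row = [float(i) for i in range(20)]
canonical_rows = reorder_columns([row], OTHER)
```

The same permutation applies in reverse when mapping a returned gradient back to the upstream ordering.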

API Reference

Inputs

  • logits (List[array], required). Relaxed sequence state, shape L x 20 in canonical amino-acid order ACDEFGHIKLMNPQRSTVWY.
  • structure (Structure, required). Backbone structure to condition ProteinMPNN on.
  • chains_to_redesign (ChainSelection). Chains to score/design. If None, all chains in structure are used.
  • fixed_positions (ResidueSelection). Per-chain positions excluded from the perplexity objective.
  • temperature (number). Optional softmax temperature. When set, applies softmax(input / temperature) before evaluating the relaxed sequence. When None, the input is used as-is.

Config

  • model_choice (enum, default "proteinmpnn"). ProteinMPNN weight variant. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble.
  • use_ste (boolean, default True). Use hard one-hot forward pass with soft-probability gradients.
  • compute_gradient (boolean, default True). Return gradients when true; run forward scoring only when false.
  • verbose (integer, default 0). Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default "cuda"). Device for ProteinMPNN execution.
  • timeout (integer, default 600). Maximum execution time in seconds. None waits indefinitely.
  • seed (integer). Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs

  • gradient (array). Gradient w.r.t. input logits. None when compute_gradient=False.
  • loss (number, required). Mean negative log-likelihood over ProteinMPNN-scored positions.
  • metrics (Dict[string, any]). Log-likelihood, perplexity, sequence length, and objective details.
  • vocab (List[string], required). Canonical amino-acid column ordering for logits and gradients.

Applications

Use this when ProteinMPNN is one term in a larger optimization over a continuous sequence representation (for example combined with other structure or property objectives), rather than for standalone sampling. Set compute_gradient=False for forward-only NLL scoring, such as ranking MCMC proposals.
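Ranking MCMC proposals with a forward-only loss typically means a Metropolis accept/reject step. The sketch below shows that step with a stand-in objective: `toy_loss` is a made-up function that takes the place of a forward-only call (compute_gradient=False), and the proposal scheme, temperature, and helper names are assumptions for illustration.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def metropolis_accept(current_loss, proposed_loss, temperature, rng=random):
    """Always accept improvements; accept worse proposals with Boltzmann probability."""
    if proposed_loss <= current_loss:
        return True
    return rng.random() < math.exp(-(proposed_loss - current_loss) / temperature)

def toy_loss(sequence, target="MKTAYIAK"):
    # Stand-in for the forward-only loss; NOT a ProteinMPNN call.
    return sum(a != b for a, b in zip(sequence, target)) / len(target)

def mcmc(start, steps, temperature=0.05, seed=0):
    rng = random.Random(seed)
    current, current_loss = start, toy_loss(start)
    best, best_loss = current, current_loss
    for _ in range(steps):
        # Propose a single-residue substitution at a random position.
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(ALPHABET) + current[pos + 1:]
        loss = toy_loss(proposal)
        if metropolis_accept(current_loss, loss, temperature, rng):
            current, current_loss = proposal, loss
            if loss < best_loss:
                best, best_loss = proposal, loss
    return best, best_loss

start = "AAAAAAAA"
best, best_loss = mcmc(start, steps=200)
```

Only the scalar loss is needed for the accept/reject decision, which is why skipping the gradient makes each proposal cheaper.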

Usage Tips

  • logits columns must be in the order ACDEFGHIKLMNPQRSTVWY. The columns are read by position, so a different amino-acid ordering silently produces the wrong gradient. An optional temperature runs softmax(logits / T) first. Leave it unset to use the logits as they are.
  • compute_gradient (default True). Returns the gradient of the mean negative log-likelihood with respect to logits. Set False for forward-only scoring (loss only, gradient is None), for example to cheaply rank MCMC proposals.
  • use_ste (default True) selects the straight-through estimator: the forward pass uses a hard one-hot sequence, while the backward pass uses soft-probability gradients. Set False for fully soft blended embeddings, which are smoother but biased.
  • fixed_positions is counted from 1 and is left out of the objective. Positions you list are excluded from both the loss and its gradient, so set it to optimize only the residues you are designing.
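The straight-through idea behind use_ste can be written out by hand for one position. This is the common textbook formulation, offered as a sketch under assumptions rather than the toolkit's implementation: the forward pass feeds a one-hot argmax downstream, and the backward pass routes the downstream gradient through the softmax Jacobian as if the soft probabilities had been used. The helper names and the downstream gradient values are hypothetical.

```python
import math

def softmax(row, temperature=1.0):
    """Softmax of one row of logits, optionally temperature-scaled."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ste_forward(row):
    """Return the hard one-hot used downstream and the soft probabilities."""
    probs = softmax(row)
    top = max(range(len(probs)), key=probs.__getitem__)
    hard = [1.0 if i == top else 0.0 for i in range(len(probs))]
    return hard, probs

def ste_backward(probs, grad_wrt_input):
    """Gradient w.r.t. the logits, treating the one-hot as if it were probs.

    Uses the softmax Jacobian: dL/dz_j = p_j * (g_j - sum_k p_k * g_k).
    """
    dot = sum(p * g for p, g in zip(probs, grad_wrt_input))
    return [p * (g - dot) for p, g in zip(probs, grad_wrt_input)]

row = [2.0, 0.0, 0.0]            # toy logits over three classes
downstream = [-1.0, 0.5, 0.5]    # made-up gradient from the downstream loss
hard, probs = ste_forward(row)
grad = ste_backward(probs, downstream)
```

The resulting gradient sums to zero across the row, as expected for softmax logits, and here it favors raising the logit of the class that is already selected.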

Toolkit Notes

These apply to every ProteinMPNN tool in this toolkit (proteinmpnn-sample, proteinmpnn-score, proteinmpnn-gradient).
  • GPU recommended; CPU works but is slower. ProteinMPNN is a small model and runs on CPU, but a GPU is far faster when sampling or scoring many sequences. Model weights (a few hundred MB across variants) download automatically on first use.
  • Reproducibility. proteinmpnn-sample and proteinmpnn-gradient are stochastic; set seed for reproducible runs.
  • Multi-chain sequences are "/"-delimited. Designs spanning multiple chains are returned as a single string with chains separated by / (for example "MASCQT/EVQLVE").
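Splitting a multi-chain design back into per-chain sequences is a one-line operation. The sketch below wraps it in a hypothetical `split_chains` helper; the default chain labels A, B, C are an assumption for the example, and real chain IDs should come from the input structure.

```python
def split_chains(design, chain_ids=None):
    """Split a "/"-delimited design into a {chain_id: sequence} mapping."""
    parts = design.split("/")
    if chain_ids is None:
        # Hypothetical default labels; use the structure's chain IDs in practice.
        chain_ids = [chr(ord("A") + i) for i in range(len(parts))]
    if len(chain_ids) != len(parts):
        raise ValueError("number of chain IDs does not match number of chains")
    return dict(zip(chain_ids, parts))

chains = split_chains("MASCQT/EVQLVE")
```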
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.