ProteinMPNN - Proto

License: ProteinMPNN is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with the Institute for Protein Design. This toolkit is open source and builds on the implementation produced by that organization. Product names, logos, and trademarks are the property of their respective owners.


dauparas/ProteinMPNN

Code for the ProteinMPNN paper

1.7k stars

Robust deep learning-based protein sequence design using ProteinMPNN

Justas Dauparas, Ivan Anishchenko, … Neville Bethel

Science (2022)

@article{dauparas2022proteinmpnn,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.add2187}
}

proto-bio/proto-tools/proto_tools/tools/inverse_folding/proteinmpnn

Function	Description
`run_proteinmpnn_gradient()`	Compute ProteinMPNN structure-conditioned perplexity gradient for relaxed protein sequences (GPU)
`run_proteinmpnn_sample()`	Sample protein sequences using ProteinMPNN (GPU)
`run_proteinmpnn_score()`	Score protein sequences using ProteinMPNN (GPU)

Background

ProteinMPNN (Dauparas et al., 2022) solves the inverse-folding problem: given a fixed protein backbone (the 3D coordinates of its N, C-alpha, C, and O atoms), predict an amino-acid sequence that will fold into that structure. It is the inverse of structure prediction and a core step in protein design, where a backbone is proposed first and a sequence that encodes it is designed afterwards.

Internally, ProteinMPNN encodes the backbone as a graph: each residue is a node connected to its 48 nearest neighbors in space, with edges featurized by inter-atomic distances between the backbone atoms (including a virtual C-beta). A message-passing neural network encoder turns this geometry into node and edge representations, and a decoder then generates the sequence autoregressively. ProteinMPNN is trained with a random decoding order rather than a fixed N-to-C order, so at inference any order can be used and arbitrary subsets of positions can be held fixed while the rest are designed in full structural context.

It was trained on protein structures from the Protein Data Bank. During training, a small amount of Gaussian noise was added to the backbone coordinates so the model is robust to imperfect, non-crystal backbones; this slightly lowers native-sequence recovery but yields sequences that more reliably fold to the intended structure. On native backbones it recovers roughly 52% of the native sequence on average, compared with roughly 33% for physically based Rosetta design. ProteinMPNN designs have been experimentally validated by X-ray crystallography and cryo-electron microscopy, and ProteinMPNN rescued monomers, cyclic homo-oligomers, nanoparticles, and target-binding proteins that had failed when designed with Rosetta or AlphaFold.
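
The nearest-neighbor graph described in the Background can be sketched in a few lines. This is only an illustration, not ProteinMPNN's implementation: it uses C-alpha coordinates alone and plain Euclidean distance, and the helper name `knn_edges` and the toy coordinates are made up for the example. ProteinMPNN itself uses 48 neighbors and richer edge features computed over all backbone atoms.

```python
import math

def knn_edges(ca_coords, k=48):
    """Return, for each residue, the indices of its k nearest residues in space.

    ca_coords: list of (x, y, z) C-alpha coordinates, one per residue.
    Hypothetical helper for illustration only.
    """
    n = len(ca_coords)
    k = min(k, n - 1)  # a residue cannot have more than n - 1 neighbors
    neighbors = []
    for i in range(n):
        # Sort every other residue by its distance to residue i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(ca_coords[i], ca_coords[j]),
        )
        neighbors.append(others[:k])
    return neighbors

# Toy backbone: five residues on a straight line, 3.8 angstroms apart.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
edges = knn_edges(coords, k=2)
print(edges[0])  # neighbors of the first residue
```

Each residue's neighbor list is what a message-passing layer would aggregate over; in the real model the edges additionally carry distance features between backbone atoms.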

Learning Resources

Sequence Design with ProteinMPNN - a video walkthrough of using ProteinMPNN for fixed-backbone protein sequence design.
MPNN - ML for protein sequence design - a talk on the message-passing machine-learning approach behind ProteinMPNN.

Tools

ProteinMPNN Sampling (`proteinmpnn-sample`)

Designs new sequences for a given backbone. Each input structure is encoded once and decoded into one or more candidate sequences, each returned with a perplexity and the sequence recovery against the structure’s original sequence.
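
As a rough guide to the sequence recovery reported with each design, the sketch below assumes recovery is the fraction of positions where the designed residue matches the original one, for two single-chain sequences of equal length. The function name `sequence_recovery` and the example sequences are hypothetical; the tool's own calculation may differ, for instance in how it treats fixed positions or multiple chains.

```python
def sequence_recovery(designed, original):
    """Fraction of positions where designed and original residues are identical.

    Assumes both sequences are non-empty and the same length (illustrative helper).
    """
    if not original or len(designed) != len(original):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == o for d, o in zip(designed, original))
    return matches / len(original)

# Two of the eight positions differ (T->S and I->L).
recovery = sequence_recovery("MKSAYLAK", "MKTAYIAK")
print(recovery)
```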

API Reference

Input: InverseFoldingInput

Field	Type	Required	Description
`inputs`	List[InverseFoldingStructureInput]	yes	Per-structure inputs, each containing a structure plus optional chains_to_redesign and fixed_positions selections.

InverseFoldingStructureInput fields

Field	Type	Required	Description
`chains_to_redesign`	ChainSelection	no	Chains to redesign. None means redesign every chain in the structure. Accepts shorthand "A" or ["A", "B"] at construction.
`fixed_positions`	ResidueSelection	no	Per-chain positions whose residue identity is held fixed during design (1-indexed). Accepts shorthand {"A": [1, 2, 3]} at construction.
`structure`	Structure	yes	Protein structure. Accepts a file path, raw PDB/CIF content string, Structure object, or a dict in the shape produced by Structure.model_dump(mode='json').

Config: ProteinMPNNSampleConfig

Field	Type	Default	Description
`model_choice`	enum	"proteinmpnn"	Model weights. "proteinmpnn" is ColabDesign's default v_48_020 (medium training noise). The v_48_* variants are the same architecture trained at different noise levels (002 / 010 / 030). "abmpnn" is antibody-optimized; "soluble" is soluble-protein-trained. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble.
`backbone_noise`	number	0.0	Gaussian noise (Å) added to backbone coordinates before each forward pass.
`excluded_amino_acids`	array	-	One-letter codes of amino acids to exclude.
`verbose`	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
`device`	string	"cuda"	Device to run the model on. Options include "cuda" (NVIDIA GPU), "cpu" (CPU execution), or a specific GPU device such as "cuda:0".
`timeout`	integer	600	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	-	Random seed to use for sampling.
`num_sequences_per_structure`	integer	1	Total number of sequences to generate per input structure.
`batch_size`	integer	-	Number of sequences to process simultaneously on GPU. Defaults to num_sequences_per_structure.
`temperature`	number	0.1	Controls randomness in sampling from logits.

Output: ProteinMPNNSampleOutput

Field	Type	Required	Description
`design_sets`	List[ProteinMPNNDesignSet]	yes	One ProteinMPNNDesignSet per input structure, in input order. Entry i holds all complexes for input structure i.

ProteinMPNNDesignSet fields

Field	Type	Required	Description
`complexes`	List[ProteinMPNNDesign]	yes	The complexes generated for one input structure, each a complete multi-chain complex with per-design metrics.

Applications

Use this to redesign or stabilize a natural protein, or to generate sequences for a de novo backbone (for example one from RFdiffusion). The standard design loop is to sample many sequences per backbone, rank by perplexity, and validate the top candidates with a structure predictor.
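
The ranking step of that design loop might look like the sketch below. It deliberately uses plain dictionaries rather than the toolkit's output classes: the keys `sequence` and `perplexity`, the helper `top_designs`, and all values are stand-ins invented for illustration, so adapt the field access to whatever the sampled designs actually expose.

```python
def top_designs(designs, n=2):
    """Return the n designs with the lowest perplexity, best first."""
    return sorted(designs, key=lambda d: d["perplexity"])[:n]

# Made-up candidates standing in for sampled designs of one backbone.
candidates = [
    {"sequence": "MKTAYIAK", "perplexity": 3.4},
    {"sequence": "MKSAYLAK", "perplexity": 2.1},
    {"sequence": "MRTAYIAR", "perplexity": 2.8},
]

shortlist = top_designs(candidates, n=2)
for design in shortlist:
    print(design["sequence"], design["perplexity"])
```

The shortlisted sequences are the ones worth passing to a structure predictor for validation.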

Usage Tips

temperature (default 0.1) controls diversity. Lower values are greedier and stay close to the single most likely sequence, while higher values sample more varied sequences. A value near 0.0 behaves like an argmax; negative values are not allowed.
Lower batch_size if you hit GPU memory limits. It defaults to num_sequences_per_structure, so every requested sequence is generated in one forward pass. For large requests or long backbones this can exhaust GPU memory, and a smaller batch_size trades speed for lower memory.
model_choice selects the weights. The default proteinmpnn is v_48_020. The v_48_002, v_48_010, and v_48_030 variants are trained with increasing backbone noise, which makes designs more robust and diverse at the cost of native-sequence recovery. abmpnn is antibody-tuned. Use soluble when the design must be water-soluble, because the default model tends to place hydrophobic residues on membrane-like surfaces whereas soluble is retrained with transmembrane proteins excluded.
fixed_positions is counted from 1, not 0. Listing a position keeps that residue at its input identity, which is how you preserve catalytic or interface residues while redesigning everything else.
excluded_amino_acids forbids residue types everywhere. Use it to keep unwanted residues out of every design, for example ["C"] to avoid introducing cysteines.
backbone_noise (default 0.0) and seed control input perturbation and reproducibility. backbone_noise adds Gaussian noise, in angstroms, to the input backbone coordinates; small values such as 0.02 increase diversity at some cost in recovery. Set seed for reproducible sampling.
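
To make the effect of temperature and excluded_amino_acids concrete, here is a minimal sketch of temperature-scaled sampling at a single position. It is not the toolkit's sampler: the function `sample_residue`, the 20-letter alphabet ordering, and the toy logits are assumptions chosen for the example.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order for the toy logits

def sample_residue(logits, temperature=0.1, excluded=(), rng=random):
    """Sample one residue from per-position logits (illustrative only)."""
    allowed = [i for i, aa in enumerate(ALPHABET) if aa not in excluded]
    if temperature <= 1e-6:
        # Near-zero temperature: pick the most likely allowed residue (argmax).
        return ALPHABET[max(allowed, key=lambda i: logits[i])]
    scaled = [logits[i] / temperature for i in allowed]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return ALPHABET[rng.choices(allowed, weights=weights, k=1)[0]]

# Toy logits: cysteine is most likely, alanine second.
logits = [0.0] * 20
logits[ALPHABET.index("C")] = 5.0
logits[ALPHABET.index("A")] = 2.0

greedy = sample_residue(logits, temperature=0.0)
no_cys = sample_residue(logits, temperature=0.0, excluded=["C"])
print(greedy, no_cys)
```

Dividing the logits by a small temperature sharpens the distribution toward the top residue, while excluding a residue type simply removes it from the candidates before sampling.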

ProteinMPNN Scoring (`proteinmpnn-score`)

Evaluates how well existing sequences fit a structure. Each (sequence, structure) pair is scored under ProteinMPNN’s structure-conditioned likelihood, returning log-likelihood, average log-likelihood, and perplexity, with optional per-position logits.

API Reference

Input: ProteinMPNNScoringInput

Field	Type	Required	Description
`sequence_structure_pairs`	List[SequenceStructurePair]	yes	List of sequence-structure pairs to score. Each pair contains a sequence, a structure, and optional per-pair fixed_positions excluded from the scoring metrics.

SequenceStructurePair fields

Field	Type	Required	Description
`sequence`	string	yes	Protein sequence to score against the structure.
`structure`	Structure	yes	Protein structure to score the sequence against.
`fixed_positions`	ResidueSelection	no	Per-chain 1-indexed positions excluded from the aggregate scoring metrics. Accepts {"A": [1, 2]}.

Config: ProteinMPNNScoringConfig

Field	Type	Default	Description
`return_logits`	boolean	False	Whether to include per-position logits in the output. When True, returns logits for each sequence. When False, only returns metrics (saves memory and serialization time).
`model_choice`	enum	"proteinmpnn"	-
`verbose`	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
`device`	string	"cuda"	Device to run the model on. Options include "cuda" (NVIDIA GPU) or "cpu" (CPU execution).
`timeout`	integer	600	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	-	Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output: InverseFoldingScoringOutput

Field	Type	Required	Description
`scores`	List[InverseFoldingScoringMetrics]	yes	List of scoring outputs, one per input sequence-structure pair. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields.

InverseFoldingScoringMetrics fields

Field	Type	Required	Description
`logits`	array	no	Per-position logits array (seq_len, vocab_size). None unless the tool returns logits.
`vocab`	array	no	Token ordering for logits.
`primary_metric`	string	no	Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

Metrics (one set per scores item)

Metric	Type	Range	Availability
`log_likelihood`	float	≤ 0.0	always
`avg_log_likelihood`	float	≤ 0.0	always
`perplexity`	float	≥ 1.0	always

Applications

Use this to rank candidate sequences or point mutations by structural compatibility without generating new ones: compare designs, assess the effect of a substitution, or filter a library before experimental testing. Lower perplexity indicates a better structure-sequence fit.

Usage Tips

Set fixed_positions per (sequence, structure) pair to score only part of a chain. It lives on each input pair as a {chain: [positions]} selection, not in the config. Listed positions are skipped when computing log-likelihood and perplexity, so the score reflects just the residues you care about instead of the whole sequence. NOTE: Positions are per chain and counted from 1, not 0, to match biological residue selection conventions.
return_logits (default False) has a size trade-off. Enabling it returns a per-position (sequence length x 21) logit array per sequence for residue-level analysis. That array dominates output size and memory for long sequences or large batches, so leave it off unless you need it.
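
The three scoring metrics and the effect of fixed_positions can be related with a short sketch. It assumes the conventional definitions, in which the average log-likelihood is the total divided by the number of scored positions and perplexity is exp(-average log-likelihood); that is consistent with the ranges in the metrics table but is not taken from the tool's source. The helper `score_metrics` and the input values are hypothetical, and the sketch handles a single chain only.

```python
import math

def score_metrics(log_probs, fixed_positions=()):
    """Aggregate per-residue log-probabilities, skipping 1-indexed fixed positions.

    log_probs: log-probability the model assigns to each observed residue, in chain order.
    """
    skip = set(fixed_positions)
    kept = [lp for pos, lp in enumerate(log_probs, start=1) if pos not in skip]
    if not kept:
        raise ValueError("no positions left to score")
    log_likelihood = sum(kept)
    avg_log_likelihood = log_likelihood / len(kept)
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": math.exp(-avg_log_likelihood),
    }

# Position 2 is a poor fit (probability 0.05); the others have probability 0.5.
per_residue = [math.log(0.5), math.log(0.05), math.log(0.5)]
whole_chain = score_metrics(per_residue)
without_pos2 = score_metrics(per_residue, fixed_positions=[2])
print(whole_chain["perplexity"], without_pos2["perplexity"])
```

Excluding position 2 removes its contribution entirely, so the remaining score reflects only the residues left in the selection.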

ProteinMPNN Gradient (`proteinmpnn-gradient`)

Exposes ProteinMPNN as a differentiable structure-conditioned objective: given a relaxed (L, 20) sequence distribution and a backbone, it returns the mean negative log-likelihood and its gradient with respect to the input logits, for use as a loss in gradient-based or MCMC sequence optimization.

API Reference

Input: ProteinMPNNGradientInput

Field	Type	Required	Default	Description
`logits`	List[array]	yes	-	Relaxed sequence state, shape L x 20 in canonical amino-acid order ACDEFGHIKLMNPQRSTVWY.
`structure`	Structure	yes	-	Backbone structure to condition ProteinMPNN on.
`chains_to_redesign`	ChainSelection	no	-	Chains to score/design. If None, all chains in structure are used.
`fixed_positions`	ResidueSelection	no	-	Per-chain positions excluded from the perplexity objective.
`temperature`	number	no	-	Optional softmax temperature. When set, applies softmax(input / temperature) before evaluating the relaxed sequence. When None, the input is used as-is.

Structure fields

Field	Type	Required	Default	Description
`structure`	string	yes	-	Raw structure content in PDB or CIF format.
`structure_format`	string	no	-	Format of the content string (auto-detected if omitted).
`b_factor_type`	BFactorType	no	"unspecified"	What the B-factor column represents.
`source`	string	no	-	Optional source identifier (filepath or tool name).
`metrics`	Metrics	no	-	Associated metrics (e.g., pLDDT, pTM scores, per-chain lists, pairwise matrices). None values are stripped at construction.

Config: ProteinMPNNGradientConfig

Field	Type	Default	Description
`model_choice`	enum	"proteinmpnn"	ProteinMPNN weight variant. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble.
`use_ste`	boolean	True	Use hard one-hot forward pass with soft-probability gradients.
`compute_gradient`	boolean	True	Return gradients when true; run forward scoring only when false.
`verbose`	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
`device`	string	"cuda"	Device for ProteinMPNN execution.
`timeout`	integer	600	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	-	-

Output: ProteinMPNNGradientOutput

Field	Type	Required	Description
`gradient`	array	no	Gradient w.r.t. input logits. None when compute_gradient=False.
`loss`	number	yes	Mean negative log-likelihood over ProteinMPNN-scored positions.
`metrics`	Dict[string, any]	no	Log-likelihood, perplexity, sequence length, and objective details.
`vocab`	List[string]	yes	Canonical amino-acid column ordering for logits and gradients.

Applications

Use this when ProteinMPNN is one term in a larger optimization over a continuous sequence representation (for example combined with other structure or property objectives), rather than for standalone sampling. Set compute_gradient=False for forward-only NLL scoring, such as ranking MCMC proposals.

Usage Tips

logits columns must be in the order ACDEFGHIKLMNPQRSTVWY. The columns are read by position, so a different amino-acid ordering silently produces the wrong gradient. An optional temperature runs softmax(logits / T) first. Leave it unset to use the logits as they are.
compute_gradient (default True) returns the gradient of the mean negative log-likelihood with respect to logits. Set it to False for forward-only scoring (loss only, gradient is None), for example to cheaply rank MCMC proposals.
use_ste (default True) selects the straight-through estimator: a hard one-hot sequence in the forward pass with soft-probability gradients in the backward pass. Set it to False for fully soft blended embeddings, which are smoother but biased.
fixed_positions is counted from 1 and is left out of the objective. Positions you list are excluded from both the loss and its gradient, so set it to optimize only the residues you are designing.
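
Because the column order matters, it can help to reorder logits explicitly before passing them in. The sketch below shows one way to do that, plus the softmax(logits / T) transform that the optional temperature describes. The helpers `reorder_columns` and `softmax_with_temperature` and the alternative ordering `ALT_ORDER` are illustrative assumptions, not part of the toolkit.

```python
import math

CANONICAL = "ACDEFGHIKLMNPQRSTVWY"  # order expected by the gradient tool

def reorder_columns(rows, source_order, target_order=CANONICAL):
    """Permute each row's columns from source_order into target_order."""
    if sorted(source_order) != sorted(target_order):
        raise ValueError("orderings must contain the same amino acids")
    index = {aa: i for i, aa in enumerate(source_order)}
    return [[row[index[aa]] for aa in target_order] for row in rows]

def softmax_with_temperature(row, temperature):
    """Compute softmax(row / temperature) with max-subtraction for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in row]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical upstream ordering (alphabetical by three-letter code).
ALT_ORDER = "ARNDCQEGHILKMFPSTWYV"
row = [0.0] * 20
row[ALT_ORDER.index("R")] = 1.0  # this position favors arginine

reordered = reorder_columns([row], ALT_ORDER)
probs = softmax_with_temperature(reordered[0], temperature=0.5)
print(CANONICAL[max(range(20), key=probs.__getitem__)])
```

Checking that the favored residue survives the permutation, as the final line does, is a cheap guard against the silent misordering described above.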

Toolkit Notes

These apply to every ProteinMPNN tool in this toolkit (proteinmpnn-sample, proteinmpnn-score, proteinmpnn-gradient).

GPU recommended; CPU works but is slower. ProteinMPNN is a small model and runs on CPU, but a GPU is far faster when sampling or scoring many sequences. Model weights (a few hundred MB across variants) download automatically on first use.
Reproducibility. proteinmpnn-sample and proteinmpnn-gradient are stochastic; set seed for reproducible runs.
Multi-chain sequences are "/"-delimited. Designs spanning multiple chains are returned as a single string with chains separated by / (for example "MASCQT/EVQLVE").
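
Splitting a multi-chain design back into per-chain sequences is a one-liner; the sketch below wraps it in a small helper. The function `split_chains` is hypothetical, and labeling the chains A, B, ... in order is an assumption: in practice the chain order should be taken from the input structure.

```python
def split_chains(design, chain_ids=None):
    """Split a "/"-delimited design into a {chain_id: sequence} mapping."""
    sequences = design.split("/")
    if chain_ids is None:
        # Assumed default labels: A, B, C, ... in order of appearance.
        chain_ids = [chr(ord("A") + i) for i in range(len(sequences))]
    if len(chain_ids) != len(sequences):
        raise ValueError("number of chain IDs must match number of chains")
    return dict(zip(chain_ids, sequences))

chains = split_chains("MASCQT/EVQLVE")
print(chains)
```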

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
