Proto is not affiliated with Institute for Protein Design. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.
| Function | Description | Links |
|---|---|---|
| run_proteinmpnn_gradient() | Compute ProteinMPNN structure-conditioned perplexity gradient for relaxed protein sequences (GPU) | Docs Source |
| run_proteinmpnn_sample() | Sample protein sequences using ProteinMPNN (GPU) | Docs Source |
| run_proteinmpnn_score() | Score protein sequences using ProteinMPNN (GPU) | Docs Source |
Background
ProteinMPNN (Dauparas et al., 2022) solves the inverse-folding problem: given a fixed protein backbone (the 3D coordinates of its N, C-alpha, C, and O atoms), predict an amino-acid sequence that will fold into that structure. It is the inverse of structure prediction and a core step in protein design, where a backbone is proposed first and a sequence that encodes it is designed afterwards.
Internally, ProteinMPNN encodes the backbone as a graph: each residue is a node connected to its 48 nearest neighbors in space, with edges featurized by inter-atomic distances between the backbone atoms (including a virtual C-beta). A message-passing neural network encoder turns this geometry into node and edge representations, and a decoder then generates the sequence autoregressively. ProteinMPNN is trained with a random decoding order rather than a fixed N-to-C order, so at inference any order can be used and arbitrary subsets of positions can be held fixed while the rest are designed in full structural context.
It was trained on protein structures from the Protein Data Bank. During training, a small amount of Gaussian noise was added to the backbone coordinates so the model is robust to imperfect, non-crystal backbones; this slightly lowers native-sequence recovery but yields sequences that more reliably fold to the intended structure. On native backbones it recovers roughly 52% of the native sequence on average, compared with roughly 33% for physically based Rosetta design. ProteinMPNN designs have been experimentally validated by X-ray crystallography and cryo-electron microscopy, and ProteinMPNN rescued monomers, cyclic homo-oligomers, nanoparticles, and target-binding proteins that had failed when designed with Rosetta or AlphaFold.
Learning Resources
- Sequence Design with ProteinMPNN - a video walkthrough of using ProteinMPNN for fixed-backbone protein sequence design.
- MPNN - ML for protein sequence design - a talk on the message-passing machine-learning approach behind ProteinMPNN.
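To make the graph construction described in the Background more concrete, here is a minimal NumPy sketch, not the toolkit's implementation. It places a virtual C-beta from the N, C-alpha, and C coordinates and builds a k-nearest-neighbor graph. The function names are hypothetical, the virtual C-beta coefficients are the ones used in the public ProteinMPNN code, and picking neighbors by C-alpha distance (with each residue counted as its own nearest neighbor) is an assumption about that code rather than something this page states.

```python
import numpy as np

def virtual_cbeta(n, ca, c):
    """Place an idealized C-beta from backbone N, CA, C coordinates, each of shape (L, 3)."""
    b = ca - n
    c_vec = c - ca
    a = np.cross(b, c_vec)
    # Coefficients taken from the public ProteinMPNN code (ideal tetrahedral geometry).
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * c_vec + ca

def knn_graph(ca, k=48):
    """Return neighbor indices and distances, each of shape (L, min(k, L)).

    Assumption: neighbors are chosen by CA-CA distance and each residue
    appears as its own first neighbor (distance 0).
    """
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    k_eff = min(k, len(ca))
    idx = np.argsort(dist, axis=1)[:, :k_eff]
    return idx, np.take_along_axis(dist, idx, axis=1)

# Toy backbone: 60 residues spaced 3.8 angstroms apart along the x axis.
ca = np.stack([3.8 * np.arange(60), np.zeros(60), np.zeros(60)], axis=1)
n = ca + np.array([-1.2, 0.8, 0.0])
c = ca + np.array([1.3, 0.8, 0.0])
cb = virtual_cbeta(n, ca, c)
neighbors, distances = knn_graph(ca, k=48)
```

Edge features in the real model are then computed from distances between the N, C-alpha, C, O, and virtual C-beta atoms of each residue and its neighbors; that featurization is omitted here.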
Tools
ProteinMPNN Sampling (proteinmpnn-sample)
Designs new sequences for a given backbone. Each input structure is encoded once and decoded into one or more candidate sequences, each returned with a perplexity and the sequence recovery against the structure’s original sequence.
API Reference
Input: InverseFoldingInput
chains_to_redesign and fixed_positions selections.
Config: ProteinMPNNSampleConfig
- model_choice: "proteinmpnn" is ColabDesign’s default v_48_020 (medium training noise). The v_48_* variants are the same architecture trained at different noise levels (002 / 010 / 030). "abmpnn" is antibody-optimized; "soluble" is soluble-protein-trained. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble
- True is coerced to 1 and False to 0.
- None waits indefinitely.
Output: ProteinMPNNSampleOutput
ProteinMPNNDesignSet per input structure, in input order. Entry i holds all complexes for input structure i.
Applications
Use this to redesign or stabilize a natural protein, or to generate sequences for a de novo backbone (for example one from RFdiffusion). The standard design loop is to sample many sequences per backbone, rank by perplexity, and validate the top candidates with a structure predictor.
Usage Tips
- `temperature` (default `0.1`) controls diversity. Lower values are greedier and stay close to the single most likely sequence, while higher values sample more varied sequences. A value near `0.0` behaves like an argmax, and the temperature must be at least `0`.
- Lower `batch_size` if you hit GPU memory limits. It defaults to `num_sequences_per_structure`, so every requested sequence is generated in one forward pass. For large requests or long backbones this can exhaust GPU memory, and a smaller `batch_size` trades speed for lower memory.
- `model_choice` selects the weights. The default `proteinmpnn` is `v_48_020`. The `v_48_002`, `v_48_010`, and `v_48_030` variants are trained with increasing backbone noise, which makes designs more robust and diverse at the cost of native-sequence recovery. `abmpnn` is antibody-tuned. Use `soluble` when the design must be water-soluble, because the default model tends to place hydrophobic residues on membrane-like surfaces whereas `soluble` is retrained with transmembrane proteins excluded.
- `fixed_positions` is counted from 1, not 0. Listing a position keeps that residue at its input identity, which is how you preserve catalytic or interface residues while redesigning everything else.
- `excluded_amino_acids` forbids residue types everywhere. Use it to keep unwanted residues out of every design, for example `["C"]` to avoid introducing cysteines.
- `backbone_noise` (default `0.0`) and `seed`. `backbone_noise` adds Gaussian noise in angstroms to the input backbone. Small values such as `0.02` increase diversity at some cost in recovery. Set `seed` for reproducible sampling.
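The interaction of `temperature`, `fixed_positions`, `excluded_amino_acids`, and `seed` can be illustrated with a toy sampler. This is a sketch under stated assumptions, not the toolkit's code: the helper names are hypothetical, and the per-position logits are treated as given and independent, whereas ProteinMPNN decodes autoregressively so each position's logits depend on residues already chosen.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def sample_residue(logits, temperature, excluded, rng):
    """Sample one residue from temperature-scaled logits (hypothetical helper)."""
    logits = np.asarray(logits, dtype=float).copy()
    for aa in excluded:
        logits[ALPHABET.index(aa)] = -np.inf  # forbidden types get zero probability
    if temperature <= 1e-8:
        return ALPHABET[int(np.argmax(logits))]  # temperature near 0 acts like argmax
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return ALPHABET[rng.choice(len(ALPHABET), p=probs)]

def design_sequence(native, logits, temperature=0.1, fixed_positions=(),
                    excluded_amino_acids=(), seed=None):
    """Redesign every position except the 1-indexed fixed_positions."""
    rng = np.random.default_rng(seed)
    fixed = {p - 1 for p in fixed_positions}  # 1-indexed selection -> 0-indexed
    out = []
    for i, row in enumerate(logits):
        if i in fixed:
            out.append(native[i])  # keep the input identity
        else:
            out.append(sample_residue(row, temperature, excluded_amino_acids, rng))
    return "".join(out)

# Toy inputs: random logits standing in for model output.
native = "MKTAYIAKQR"
toy_logits = np.random.default_rng(0).normal(size=(len(native), len(ALPHABET)))
designed = design_sequence(native, toy_logits, temperature=0.1,
                           fixed_positions=[1, 3], excluded_amino_acids=["C"], seed=7)
```

With the same `seed` the call returns the same sequence, and positions 1 and 3 always keep `M` and `T` from the input.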
ProteinMPNN Scoring (proteinmpnn-score)
Evaluates how well existing sequences fit a structure. Each (sequence, structure) pair is scored under ProteinMPNN’s structure-conditioned likelihood, returning log-likelihood, average log-likelihood, and perplexity, with optional per-position logits.
API Reference
Input: ProteinMPNNScoringInput
fixed_positions excluded from the scoring metrics.
Config: ProteinMPNNScoringConfig
- return_logits: when True, returns logits for each sequence. When False, only returns metrics (saves memory and serialization time). Default: False.
- model_choice: "proteinmpnn" is ColabDesign’s default v_48_020 (medium training noise). The v_48_* variants are the same architecture trained at different noise levels (002 / 010 / 030). "abmpnn" is antibody-optimized; "soluble" is soluble-protein-trained. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble
- True is coerced to 1 and False to 0.
- "cuda" (NVIDIA GPU), "cpu" (CPU execution). Default: "cuda".
- None waits indefinitely.
- BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: InverseFoldingScoringOutput
Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields.
Metrics (per scores item)
| Metric | Type | Range | Availability |
|---|---|---|---|
| log_likelihood | float | ≤ 0.0 | always |
| avg_log_likelihood | float | ≤ 0.0 | always |
| perplexity | float | ≥ 1.0 | always |
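The ranges in the table follow from how the three metrics usually relate. The sketch below assumes the standard definitions: natural-log per-position probabilities, their sum as the log-likelihood, and perplexity as the exponential of the negative average. The toolkit's exact conventions are not stated on this page, and `summarize_scores` is a hypothetical name.

```python
import math

def summarize_scores(per_position_log_probs):
    """Derive sequence-level metrics from per-residue log-probabilities.

    Assumes natural-log probabilities of the observed residue at each
    scored position (each value <= 0).
    """
    if not per_position_log_probs:
        raise ValueError("need at least one scored position")
    log_likelihood = sum(per_position_log_probs)                  # <= 0
    avg_log_likelihood = log_likelihood / len(per_position_log_probs)
    perplexity = math.exp(-avg_log_likelihood)                    # >= 1
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": perplexity,
    }

# A model that spreads probability uniformly over 20 amino acids
# has perplexity 20 regardless of sequence length.
uniform = summarize_scores([math.log(1 / 20)] * 5)
```

Because perplexity depends only on the average, it is comparable across sequences of different length, whereas `log_likelihood` grows more negative as the sequence gets longer.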
Applications
Use this to rank candidate sequences or point mutations by structural compatibility without generating new ones: compare designs, assess the effect of a substitution, or filter a library before experimental testing. Lower perplexity indicates a better structure-sequence fit.
Usage Tips
- Set `fixed_positions` per (sequence, structure) pair to score only part of a chain. It lives on each input pair as a `{chain: [positions]}` selection, not in the config. Listed positions are skipped when computing log-likelihood and perplexity, so the score reflects just the residues you care about instead of the whole sequence. NOTE: Positions are per chain and counted from 1, not 0, to match biological residue selection conventions.
- `return_logits` (default `False`) has a size trade-off. Enabling it returns a per-position `(sequence length x 21)` logit array per sequence for residue-level analysis. That array dominates output size and memory for long sequences or large batches, so leave it off unless you need it.
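As an illustration of the 1-indexed, per-chain `{chain: [positions]}` convention, the sketch below turns such a selection into a flat mask and averages log-probabilities over the positions that remain. Both helpers are hypothetical, and the assumption that chains are concatenated in dictionary order is made only for this example.

```python
def scored_mask(chain_lengths, fixed_positions):
    """Return one bool per residue: True if the residue contributes to the score.

    chain_lengths:   {"A": 6, "B": 6}, chains assumed concatenated in dict order
    fixed_positions: {"A": [1, 2]}, 1-indexed within each chain
    """
    mask = []
    for chain, length in chain_lengths.items():
        fixed = set(fixed_positions.get(chain, []))
        out_of_range = sorted(p for p in fixed if not 1 <= p <= length)
        if out_of_range:
            raise ValueError(f"chain {chain}: positions {out_of_range} outside 1..{length}")
        mask.extend(pos not in fixed for pos in range(1, length + 1))
    return mask

def masked_avg_log_likelihood(log_probs, mask):
    """Average per-residue log-probabilities over scored positions only."""
    kept = [lp for lp, keep in zip(log_probs, mask) if keep]
    if not kept:
        raise ValueError("every position is fixed; nothing left to score")
    return sum(kept) / len(kept)

mask = scored_mask({"A": 6, "B": 6}, {"A": [1, 2], "B": [6]})
avg = masked_avg_log_likelihood([-1.0] * 11 + [-5.0], mask)
```

Here the last residue of chain B has a poor log-probability, but because it is listed in `fixed_positions` it does not affect the average.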
ProteinMPNN Gradient (proteinmpnn-gradient)
Exposes ProteinMPNN as a differentiable structure-conditioned objective: given a relaxed (L, 20) sequence distribution and a backbone, it returns the mean negative log-likelihood and its gradient with respect to the input logits, for use as a loss in gradient-based or MCMC sequence optimization.
API Reference
Input: ProteinMPNNGradientInput
- L x 20 in canonical amino-acid order ACDEFGHIKLMNPQRSTVWY.
- None, all chains in structure are used.
- softmax(input / temperature) before evaluating the relaxed sequence. When None, the input is used as-is.
Config: ProteinMPNNGradientConfig
- model_choice. Available options: proteinmpnn, v_48_002, v_48_010, v_48_030, abmpnn, soluble
- True is coerced to 1 and False to 0.
- None waits indefinitely.
- BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: ProteinMPNNGradientOutput
compute_gradient=False.
Applications
Use this when ProteinMPNN is one term in a larger optimization over a continuous sequence representation (for example combined with other structure or property objectives), rather than for standalone sampling. Set compute_gradient=False for forward-only NLL scoring, such as ranking MCMC proposals.
Usage Tips
- `logits` columns must be in the order `ACDEFGHIKLMNPQRSTVWY`. The columns are read by position, so a different amino-acid ordering silently produces the wrong gradient. An optional `temperature` runs `softmax(logits / T)` first. Leave it unset to use the logits as they are.
- `compute_gradient` (default `True`). Returns the gradient of the mean negative log-likelihood with respect to `logits`. Set `False` for forward-only scoring (`loss` only, `gradient` is `None`), for example to cheaply rank MCMC proposals.
- `use_ste` (default `True`) sets the forward pass. Straight-through: a hard one-hot in the forward pass with soft-probability gradients in the backward pass. Set `False` for fully soft blended embeddings, smoother but biased.
- `fixed_positions` is counted from 1 and is left out of the objective. Positions you list are excluded from both the loss and its gradient, so set it to optimize only the residues you are designing.
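The straight-through idea and the effect of `fixed_positions` on the gradient can be shown with a toy objective in NumPy. This is a sketch only: `toy_loss_and_grad` is a hypothetical function, and it scores the relaxed sequence against fixed per-position log-probabilities, whereas in ProteinMPNN those log-probabilities themselves depend on the input sequence through the autoregressive decoder.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_loss_and_grad(logits, log_probs, fixed_positions=(), temperature=1.0, use_ste=True):
    """Mean negative log-likelihood of a relaxed sequence and its gradient w.r.t. logits.

    logits, log_probs: (L, 20) arrays, columns in ACDEFGHIKLMNPQRSTVWY order.
    fixed_positions:   1-indexed positions excluded from loss and gradient.
    """
    free = np.ones(logits.shape[0], dtype=bool)
    free[np.array([p - 1 for p in fixed_positions], dtype=int)] = False
    n_free = free.sum()
    probs = softmax(logits, temperature)
    if use_ste:
        # Forward pass sees a hard one-hot sequence.
        x = np.eye(probs.shape[1])[probs.argmax(axis=1)]
    else:
        # Forward pass sees the soft blend of amino acids.
        x = probs
    loss = -(x * log_probs)[free].sum() / n_free
    # d(loss)/dx, zeroed at fixed positions.
    g = -log_probs / n_free * free[:, None]
    # Backward pass always uses the softmax Jacobian (STE treats x as if it were probs).
    grad = probs * (g - (probs * g).sum(axis=1, keepdims=True)) / temperature
    return loss, grad

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(5, 20))
toy_log_probs = np.log(softmax(rng.normal(size=(5, 20))))
soft_loss, soft_grad = toy_loss_and_grad(toy_logits, toy_log_probs,
                                         fixed_positions=[2], use_ste=False)
ste_loss, ste_grad = toy_loss_and_grad(toy_logits, toy_log_probs,
                                       fixed_positions=[2], use_ste=True)
```

In this toy, the two modes return the same gradient because the objective is linear in the relaxed sequence; only the reported loss differs. With the real model the gradients would differ as well, since the straight-through gradient is evaluated at the hard sequence.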
Toolkit Notes
These apply to every ProteinMPNN tool in this toolkit (proteinmpnn-sample, proteinmpnn-score, proteinmpnn-gradient).
- GPU recommended; CPU works but is slower. ProteinMPNN is a small model and runs on CPU, but a GPU is far faster when sampling or scoring many sequences. Model weights (a few hundred MB across variants) download automatically on first use.
- Reproducibility. `proteinmpnn-sample` and `proteinmpnn-gradient` are stochastic; set `seed` for reproducible runs.
- Multi-chain sequences are "/"-delimited. Designs spanning multiple chains are returned as a single string with chains separated by `/` (for example `"MASCQT/EVQLVE"`).
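A small helper for splitting a "/"-delimited design back into per-chain sequences might look like the following. The function name and the default chain labels (A, B, C, ...) are assumptions for illustration; in practice the chain IDs should come from the input structure.

```python
def split_chains(designed, chain_ids=None):
    """Split a "/"-delimited multi-chain sequence into {chain_id: sequence}."""
    parts = designed.split("/")
    if chain_ids is None:
        # Assumed default labels; real chain IDs come from the input structure.
        chain_ids = [chr(ord("A") + i) for i in range(len(parts))]
    if len(chain_ids) != len(parts):
        raise ValueError(f"{len(parts)} chains in sequence but {len(chain_ids)} chain IDs given")
    return dict(zip(chain_ids, parts))

chains = split_chains("MASCQT/EVQLVE")
heavy_light = split_chains("MASCQT/EVQLVE", chain_ids=["H", "L"])
```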
