LigandMPNN - Proto

License: LigandMPNN is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with the Institute for Protein Design. This toolkit is open source and builds on the implementation produced by that organization. Product names, logos, and trademarks are the property of their respective owners.

dauparas/LigandMPNN

Atomic context-conditioned protein sequence design using LigandMPNN

Justas Dauparas, Gyu Rie Lee, … David Baker

Nat. Methods (2025)

@ARTICLE{Dauparas2025-eg,
  title     = "Atomic context-conditioned protein sequence design using
               {LigandMPNN}",
  author    = "Dauparas, Justas and Lee, Gyu Rie and Pecoraro, Robert and An,
               Linna and Anishchenko, Ivan and Glasscock, Cameron and Baker,
               David",
  journal   = "Nat. Methods",
  publisher = "Springer Science and Business Media LLC",
  volume    =  22,
  number    =  4,
  pages     = "717--723",
  doi       = "10.1038/s41592-025-02626-1",
  month     =  apr,
  year      =  2025,
  language  = "en"
}

proto-bio/proto-tools/proto_tools/tools/inverse_folding/ligandmpnn

Function	Description
`run_ligandmpnn_sample()`	Sample protein sequences using LigandMPNN (GPU)
`run_ligandmpnn_score()`	Score protein sequences using LigandMPNN (GPU)

Background

LigandMPNN (Dauparas et al., 2025) solves the inverse-folding problem for biomolecular assemblies: given a protein backbone together with the non-protein atoms around it, it predicts an amino-acid sequence compatible with that environment. It is a direct extension of ProteinMPNN, which sees only protein backbone atoms and is therefore blind to the bound ligands, nucleic acids, and metals that strongly shape which residues fit. Internally, LigandMPNN keeps ProteinMPNN’s message-passing design model and adds a second graph over the non-protein atoms. Residues and nearby ligand atoms exchange messages, and the model reads each atom’s chemical element, which is what lets it reason about coordinating a metal or packing against a large or unusual ligand. It generates the sequence autoregressively and can also produce sidechain conformations so binding interactions can be inspected directly. On native backbones it recovers roughly 63% of the native residues that contact small molecules, 51% of those contacting nucleotides, and 78% of those coordinating metals. The reference implementation is maintained by the Institute for Protein Design at dauparas/LigandMPNN.

Learning Resources

Introducing LigandMPNN (Institute for Protein Design) - an accessible overview of what LigandMPNN adds over ProteinMPNN and when to use it.

Tools

LigandMPNN Sampling (`ligandmpnn-sample`)

Designs new sequences for a backbone in the presence of its non-protein context. Each input structure is encoded once, with any ligand, nucleotide, or metal atoms included, and then decoded into one or more candidate sequences, each reported with a perplexity and a sequence-recovery score.

API Reference

Input: InverseFoldingInput

Field	Type	Required	Description
`inputs`	List[InverseFoldingStructureInput]	yes	Per-structure inputs, each containing a structure plus optional chains_to_redesign and fixed_positions selections.

InverseFoldingStructureInput fields

Field	Type	Required	Description
`structure`	Structure	yes	Protein structure. Accepts a file path, raw PDB/CIF content string, Structure object, or a dict in the shape produced by Structure.model_dump(mode='json').
`chains_to_redesign`	ChainSelection	no	Chains to redesign. None means redesign every chain in the structure. Accepts shorthand "A" or ["A", "B"] at construction.
`fixed_positions`	ResidueSelection	no	Per-chain positions whose residue identity is held fixed during design (1-indexed). Accepts shorthand {"A": [1, 2, 3]} at construction.

Config: LigandMPNNSampleConfig

Parameter	Type	Default	Description
`model_type`	string	"ligand_mpnn"	LigandMPNN variant to load.
`ligand_mpnn_use_atom_context`	boolean	True	Whether ligand-aware variants encode ligand atom context.
`ligand_mpnn_use_side_chain_context`	boolean	False	Whether to condition on fixed-residue sidechain atoms.
`ligand_mpnn_cutoff_for_score`	number	8.0	Ligand-residue distance cutoff (Å) for the ligand-interface recovery score.
`excluded_amino_acids`	array	not listed	One-letter codes of amino acids to exclude.
`verbose`	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
`device`	string	"cuda"	Device to run the model on: "cuda" (NVIDIA GPU), "cpu" (CPU execution), or a specific GPU device such as "cuda:0".
`timeout`	integer	600	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	not listed	Random seed to use for sampling.
`num_sequences_per_structure`	integer	1	Total number of sequences to generate per input structure.
`batch_size`	integer	num_sequences_per_structure	Number of sequences to process simultaneously on GPU.
`temperature`	number	0.1	Controls randomness in sampling from logits; lower values make sampling more deterministic.
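
To make the temperature and excluded_amino_acids settings concrete, the following is a minimal sketch of temperature-scaled sampling from one position's logits. It is an illustration only: the function name sample_residue, the 20-letter alphabet ordering, and the masking approach are assumptions made for this example, not the toolkit's internal code.

```python
import math
import random

# Hypothetical alphabet ordering, chosen for illustration only.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def sample_residue(logits, temperature=0.1, excluded="", rng=None):
    """Sample one residue letter from logits after temperature scaling."""
    rng = rng or random.Random()
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Excluded residues get -inf so their probability is exactly zero.
    scaled = [
        -math.inf if aa in excluded else logit / temperature
        for aa, logit in zip(ALPHABET, logits)
    ]
    peak = max(scaled)
    if peak == -math.inf:
        raise ValueError("every residue is excluded")
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ALPHABET, weights=weights, k=1)[0]
```

At a low temperature such as the default 0.1, small logit differences are amplified and the top-scoring residue is chosen almost every time; raising the temperature flattens the distribution and yields more diverse sequences.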

Output: LigandMPNNSampleOutput

Field	Type	Required	Description
`design_sets`	List[LigandMPNNDesignSet]	yes	One LigandMPNNDesignSet per input structure, in input order.

LigandMPNNDesignSet fields

Field	Type	Required	Description
`complexes`	List[LigandMPNNDesign]	yes	The complexes generated for one input structure, each a complete multi-chain complex with recovery metrics.
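
As a rough illustration of what a ligand-interface recovery metric measures, the sketch below selects residues within a distance cutoff of any non-protein atom and reports the fraction whose designed identity matches the native one. The function name, the use of one representative coordinate per residue, and the inclusive cutoff are all assumptions for this example; the toolkit may select interface residues differently.

```python
import math


def interface_recovery(native_seq, designed_seq, residue_coords,
                       ligand_coords, cutoff=8.0):
    """Fraction of ligand-proximal positions where design matches native.

    residue_coords: one (x, y, z) per residue (for example the CA atom).
    ligand_coords:  one (x, y, z) per non-protein atom.
    Returns None when no residue lies within the cutoff.
    """
    if not (len(native_seq) == len(designed_seq) == len(residue_coords)):
        raise ValueError("sequences and coordinates must have equal length")
    # A residue is at the interface if any ligand atom is within the cutoff.
    interface = [
        i for i, xyz in enumerate(residue_coords)
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords)
    ]
    if not interface:
        return None
    matches = sum(native_seq[i] == designed_seq[i] for i in interface)
    return matches / len(interface)
```

With the default cutoff of 8.0 Å, only residues close to the ligand contribute, so a design can have high overall recovery yet low interface recovery, or the reverse.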

Applications

Use this to design or redesign binding sites, enzyme active sites, nucleic-acid-binding interfaces, and metal-coordination sites, where the identity of nearby non-protein atoms determines which residues work. It is the right choice over backbone-only ProteinMPNN whenever a ligand, cofactor, nucleic acid, or metal is part of the target.

Usage Tips

Keep ligand_mpnn_use_atom_context enabled. It defaults to True and is the whole point of LigandMPNN: it encodes the surrounding ligand, nucleotide, and metal atoms. Turning it off makes the model effectively ligand-blind, close to plain ProteinMPNN.
Set ligand_mpnn_use_side_chain_context to True to honor a fixed motif. It conditions on the sidechain atoms of fixed residues, which helps when redesigning around a preserved catalytic or binding motif. It defaults to False.
fixed_positions is counted from 1, not 0, to match biological residue selection conventions. Listed positions keep their input residue, and chains or atoms you do not redesign still act as context rather than being removed.
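
The tips above can be sketched as a small input-assembly helper. This is a hedged illustration that uses plain dicts mirroring the documented field names; build_structure_input and the file name complex.pdb are hypothetical, and how the result is handed to run_ligandmpnn_sample() is not shown because the call signature is not given here.

```python
def build_structure_input(structure, chains_to_redesign=None,
                          fixed_positions=None):
    """Assemble one per-structure input as a plain dict.

    Mirrors the documented InverseFoldingStructureInput fields. This helper
    is hypothetical and is not part of the toolkit.
    """
    fixed_positions = fixed_positions or {}
    for chain, positions in fixed_positions.items():
        # Positions are counted from 1, so 0 or negatives signal a 0-based bug.
        bad = [p for p in positions if not isinstance(p, int) or p < 1]
        if bad:
            raise ValueError(
                f"chain {chain}: positions are 1-indexed, got {bad}")
    if isinstance(chains_to_redesign, str):
        chains_to_redesign = [chains_to_redesign]  # "A" shorthand -> ["A"]
    return {
        "structure": structure,
        "chains_to_redesign": chains_to_redesign,
        "fixed_positions": {
            chain: sorted(set(pos)) for chain, pos in fixed_positions.items()
        },
    }


# Redesign chain A around a preserved motif at positions 12 and 45.
payload = {
    "inputs": [
        build_structure_input("complex.pdb", "A", {"A": [45, 12, 45]}),
    ],
    "config": {
        "ligand_mpnn_use_atom_context": True,        # keep ligand awareness on
        "ligand_mpnn_use_side_chain_context": True,  # honor the fixed motif
        "num_sequences_per_structure": 8,
        "temperature": 0.1,
        "seed": 0,
    },
}
```

Validating positions before submission catches the common off-by-one mistake of passing 0-based indices from array code.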

LigandMPNN Scoring (`ligandmpnn-score`)

Evaluates how well existing sequences fit a structure and its non-protein context, returning log-likelihood-based metrics with optional per-position logits.

API Reference

Input: LigandMPNNScoringInput

Field	Type	Required	Description
`sequence_structure_pairs`	List[SequenceStructurePair]	yes	Sequence and structure pairs to score; each pair may carry per-pair fixed_positions excluded from the metrics.

SequenceStructurePair fields

Field	Type	Required	Description
`sequence`	string	yes	Protein sequence to score against the structure.
`structure`	Structure	yes	Protein structure to score the sequence against.
`fixed_positions`	ResidueSelection	no	Per-chain 1-indexed positions excluded from the aggregate scoring metrics. Accepts {"A": [1, 2]}.

Config: LigandMPNNScoringConfig

Parameter	Type	Default	Description
`return_logits`	boolean	False	Whether to include per-position logits.
`scoring_mode`	enum	"single_aa"	Single-position or autoregressive scoring mode. Available options: single_aa, autoregressive.
`verbose`	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
`device`	string	"cuda"	Device to run the model on.
`timeout`	integer	600	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	not listed	Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output: InverseFoldingScoringOutput

Field	Type	Required	Description
`scores`	List[InverseFoldingScoringMetrics]	yes	List of scoring outputs, one per input sequence-structure pair. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields.

InverseFoldingScoringMetrics fields

Field	Type	Required	Description
`logits`	array	no	Per-position logits array (seq_len, vocab_size). None unless the tool returns logits.
`vocab`	array	no	Token ordering for logits.
`primary_metric`	string	no	Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

Metrics (one set per scores item)

Metric	Type	Range	Availability
`log_likelihood`	float	≤ 0.0	always
`avg_log_likelihood`	float	≤ 0.0	always
`perplexity`	float	≥ 1.0	always
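
The three metrics are related in the usual way for likelihood-based scores, and the documented ranges follow from that relationship. The sketch below shows one way to derive them from per-position logits, skipping 1-indexed fixed positions. It assumes perplexity is the exponential of the negative average log-likelihood, which is the standard definition; the toolkit's exact normalization is not stated here, and the function name is hypothetical.

```python
import math


def sequence_metrics(logits, vocab, sequence, fixed_positions=()):
    """Derive log_likelihood, avg_log_likelihood and perplexity.

    logits: one row per position, each row of length len(vocab).
    fixed_positions: 1-indexed positions left out of the aggregate.
    """
    skip = set(fixed_positions)
    log_probs = []
    for pos, (row, aa) in enumerate(zip(logits, sequence), start=1):
        if pos in skip:
            continue
        # Log-softmax: chosen logit minus the log-sum-exp of the row.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        log_probs.append(row[vocab.index(aa)] - log_z)
    if not log_probs:
        raise ValueError("no positions left to score")
    total = sum(log_probs)
    avg = total / len(log_probs)
    return {
        "log_likelihood": total,          # always <= 0
        "avg_log_likelihood": avg,        # always <= 0
        "perplexity": math.exp(-avg),     # always >= 1
    }
```

Because every log-probability is at most 0, both likelihood values are non-positive and perplexity is at least 1, with 1 meaning the model assigned probability 1 to every scored residue.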

Applications

Use this to rank designs or assess mutations near ligands, nucleic acids, or metals, where backbone-only scoring would miss the very interactions that matter. Lower perplexity indicates a better fit to the structure and its bound environment.

Usage Tips

scoring_mode changes what the score means. single_aa (the default) scores each position from its own conditional probability and is order-independent, which is what you usually want for ranking. autoregressive scores along one seed-determined decoding order, so it depends on the seed.
fixed_positions excludes residues from the aggregate score. Set it per (sequence, structure) input pair as a {chain: [positions]} selection counted from 1, not 0, to match biological residue selection conventions, so the score reflects only the residues you care about.
return_logits (default False) has a size trade-off. Enabling it adds a per-position logit array per sequence for residue-level analysis, which dominates output size and memory for long sequences, so leave it off unless you need it.
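
To gauge the return_logits trade-off before enabling it, a back-of-the-envelope estimate is usually enough. The sketch below multiplies out the documented (seq_len, vocab_size) shape; the vocabulary size of 21 and 4 bytes per value are assumptions for illustration, so check the returned vocab field and array dtype for the real numbers.

```python
def logits_bytes(seq_len, vocab_size=21, bytes_per_value=4, num_sequences=1):
    """Estimate raw storage for per-position logits across sequences."""
    return seq_len * vocab_size * bytes_per_value * num_sequences


# 1,000 sequences of 500 residues each.
print(logits_bytes(500, num_sequences=1000) / 1e6, "MB")  # prints "42.0 MB"
```

The scalar metrics for the same batch amount to three floats per sequence, which is why the logits dominate output size.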

Toolkit Notes

These apply to every LigandMPNN tool in this toolkit (ligandmpnn-sample, ligandmpnn-score).

A GPU is recommended. LigandMPNN is a small message-passing model that also runs on CPU, but a GPU is much faster when designing or scoring many sequences.
The non-protein context must be in the input structure. LigandMPNN only conditions on ligands, nucleotides, or metals that are present in the supplied structure; if they are absent, it behaves like backbone-only ProteinMPNN.
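
A quick pre-flight check can confirm that a structure actually carries non-protein context before it is submitted. The sketch below scans PDB-format text using the standard fixed-column layout and is a rough heuristic, not part of the toolkit: the function name and the residue-name sets are assumptions, it does not handle mmCIF, and modified amino acids stored as HETATM records (for example MSE) would be reported as context.

```python
WATER = {"HOH", "WAT", "DOD"}
NUCLEOTIDES = {"A", "C", "G", "U", "DA", "DC", "DG", "DT"}


def non_protein_context(pdb_text):
    """Return the set of non-protein residue names found in PDB text."""
    found = set()
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        # Residue name occupies columns 18-20 of a PDB coordinate record.
        res_name = line[17:20].strip()
        if record == "HETATM" and res_name not in WATER:
            found.add(res_name)  # ligands, cofactors, metal ions
        elif record == "ATOM" and res_name in NUCLEOTIDES:
            found.add(res_name)  # DNA or RNA residues
    return found
```

An empty result suggests the ligand, nucleic acid, or metal was stripped during preprocessing, in which case the design would proceed without it.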

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.