This program establishes a baseline for a single backbone: it samples sequences for a fixed structure with ProteinMPNN, selects the lowest-perplexity design, and predicts that design’s structural ensemble with BioEmu. It illustrates inverse folding followed by ensemble prediction. The full script sweeps four backbones; this walkthrough runs one backbone at full sampling, with 200 ProteinMPNN sequences and a 100-conformation BioEmu ensemble. It requires a GPU.
Runtime: this walkthrough runs real models on a GPU and takes several minutes to complete. The first run is slower because it builds the tool environment and downloads model weights.

The backbone

The design target is a single chain of a cached PDB structure. The bundled pdb_cache supplies the backbone file; PDB points at 6au6.pdb and CHAIN selects chain A. ProteinMPNN is an inverse-folding model: it reads this fixed three-dimensional backbone and proposes amino acid sequences predicted to fold into it, redesigning chain A while the geometry stays put.
python
import proto_language
from pathlib import Path

# The bundled example data lives in examples/data, one level up from this notebook.
DATA = Path.cwd().parent / "data"
PDB = str(DATA / "pdb_cache" / "6au6.pdb")
CHAIN = "A"
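
Because DATA is derived from the current working directory, launching the notebook from a different folder would only fail later, once a model tries to read the structure. A minimal sketch of an early check follows; require_file is a hypothetical helper written for this illustration and is not part of proto_language or proto_tools.

```python
from pathlib import Path


def require_file(path: Path) -> Path:
    """Return path unchanged if it is an existing file, else raise early.

    Hypothetical helper for illustration; not part of proto_language.
    """
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected a structure file at {path}; "
            "check the working directory before starting a GPU run."
        )
    return path
```

Wrapping the path as `str(require_file(DATA / "pdb_cache" / "6au6.pdb"))` would surface a wrong working directory before any model is loaded.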

Sample sequences with ProteinMPNN

ProteinMPNNGenerator proposes sequences conditioned on the backbone. The config passes the structure through InverseFoldingStructureInput with chains_to_redesign=[CHAIN], so only chain A is redesigned, and sets temperature=0.1. For this generator, temperature controls sampling randomness on a scale from 0 to 1: values near 0 are nearly deterministic, and values near 1 sample in proportion to the model’s predicted probabilities. The Segment length is read from the chain’s own sequence, and seeding 200 proposal slots makes the generator emit 200 sequences from the single structure. ToolInstance.persist() keeps one warm worker cached and reuses it across all 200 calls instead of starting a new worker per sample. Each proposal carries a perplexity in its generator metadata; lower perplexity means the model assigned the sequence a higher average per-residue probability, and the lowest-perplexity design is selected here.
python
import copy
from proto_tools import InverseFoldingStructureInput
from proto_tools.utils.tool_instance import ToolInstance
from proto_language.core import Segment
from proto_language.generator import ProteinMPNNGenerator, ProteinMPNNGeneratorConfig

N_SAMPLES = 200

config = ProteinMPNNGeneratorConfig(
    structure_inputs=InverseFoldingStructureInput(structure=PDB, chains_to_redesign=[CHAIN]),
    temperature=0.1,
)
proteinmpnn = ProteinMPNNGenerator(config)

seq_len = len(config.structure_inputs[0].structure.get_chain_sequence(CHAIN))
segment = Segment(length=seq_len, sequence_type="protein")
segment.proposal_sequences = [copy.deepcopy(segment.original_sequence) for _ in range(N_SAMPLES)]
proteinmpnn.assign(segment)

# ProteinMPNN samples one sequence per worker call (batch_size is 1), so keep a
# single warm worker alive across all samples instead of paying startup per sample.
with ToolInstance.persist():
    proteinmpnn.sample()

# Pick the lowest-perplexity design.
best = min(segment.proposal_sequences, key=lambda s: s._generator_metadata["proteinmpnn"]["perplexity"])
print(f"best perplexity: {best._generator_metadata['proteinmpnn']['perplexity']:.3f}")
print(f"best sequence:   {best.sequence}")
best perplexity: 2.265
best sequence:   MKTYKLLLLGISRSGKSTILRQFRILYKDGFGGEEERERLRRVVLDDLRTAVSTIVAAMPKLDPPVALADPALQPDVDYVLATRDVPDPAYPPEDFERMARLAADAGFQAALERRHETDLIDSAPYFLARIDRIRQPDYVPTTEDLLRAVDPEPGLKEIEFEKDGITYKVYDVSGAEKERKKWPEYFKDVDAIIFVVDASAFDETTSEDKKTNVLQASLDLFEEIWTHPDLKDVPIVLFLNKVSDLRARVLAGRYDIADYFPEFADYELPADAKPEPGEDPAVARARYFIRDLFMRIAEKAGNEKRFVYPFFVDATDVENMKKVLDKVFEILEELEERKKELP
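
The two quantities used above, sampling temperature and perplexity, can be illustrated with a small standalone sketch. It assumes the common conventions that temperature divides the logits before a softmax and that perplexity is the exponential of the mean negative log-probability of the chosen residues; how the generator computes its reported values internally is not shown in this walkthrough. The function names and the toy logits are made up for the example.

```python
import math


def temperature_probs(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(chosen_probs):
    """exp of the mean negative log-probability of the chosen residues."""
    mean_nll = -sum(math.log(p) for p in chosen_probs) / len(chosen_probs)
    return math.exp(mean_nll)


# Toy logits for three candidate residues at one position (made-up numbers).
logits = [2.0, 1.0, 0.5]
cold = temperature_probs(logits, 0.1)  # sharply peaked on the top residue
warm = temperature_probs(logits, 1.0)  # the unscaled softmax distribution

# A sequence whose residues each had probability 1/20 has perplexity 20:
# the model was as uncertain as a uniform pick among 20 amino acids.
uniform_ppl = perplexity([1 / 20] * 10)
```

Under these assumptions, a perplexity near 2.3 would mean the model was about as uncertain as a uniform choice among two to three residues per position, which is why lower values are preferred when ranking designs.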

Predict the structural ensemble

run_bioemu samples conformations of the chosen sequence, approximating its structural ensemble rather than a single static fold. BioEmuInput receives the design through its complexes argument as a single-chain protein, and BioEmuConfig sets num_samples=100 (the number of conformations to sample per sequence), batch_size=100, and an output_dir for the raw BioEmu output files. The run writes its ensemble to the temporary directory, which is deleted when the with block exits, and then prints a completion message.
python
import tempfile
from proto_tools import BioEmuConfig, BioEmuInput, run_bioemu

with tempfile.TemporaryDirectory() as out_dir:
    result = run_bioemu(
        BioEmuInput(complexes=[str(best.sequence)]),
        BioEmuConfig(num_samples=100, batch_size=100, output_dir=out_dir),
    )
print("BioEmu ensemble prediction complete")
BioEmu ensemble prediction complete
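
To keep the raw ensemble files past the end of the with block, they need to be copied out before tempfile.TemporaryDirectory cleans up. A minimal sketch of that pattern follows, using only the standard library; run_and_keep is a hypothetical helper, and the placeholder file stands in for the BioEmu output, whose actual file names are not listed in this walkthrough.

```python
import shutil
import tempfile
from pathlib import Path


def run_and_keep(keep_dir: Path) -> Path:
    """Copy everything from a temporary output directory into keep_dir.

    Returns the path of the temporary directory, which no longer exists
    once this function returns.
    """
    with tempfile.TemporaryDirectory() as out_dir:
        scratch = Path(out_dir)
        # Stand-in for run_bioemu(...); the file name is a placeholder,
        # since this walkthrough does not list BioEmu's real output files.
        (scratch / "placeholder_output.txt").write_text("ensemble data")
        # Copy while the temporary directory still exists.
        shutil.copytree(scratch, keep_dir, dirs_exist_ok=True)
    return scratch
```

An alternative with the same effect is to pass a persistent directory as output_dir in the first place; the temporary directory is the better fit when only the returned result is needed.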

Next Steps

Using Generators

The inverse-folding generator family.

Protein Hunter

Inverse folding inside a design cycle.