BioEmu - Proto

License: BioEmu is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with Microsoft Research. This toolkit is open source and builds on the implementation produced by Microsoft Research. Product names, logos, and trademarks are the property of their respective owners.


microsoft/bioemu

Inference code for scalable emulation of protein equilibrium ensembles with generative deep learning


Scalable emulation of protein equilibrium ensembles with generative deep learning

Sarah Lewis, Tim Hempel, … Frank Noé

Science (2025)


@article{lewis2025bioemu,
  title={Scalable emulation of protein equilibrium ensembles with generative deep learning},
  author={Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew Y K and Satorras, Victor Garc{\'i}a and Abdin, Osama and Veeling, Bastiaan S and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Foster, Adam E and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Sordillo, Roberto and Tomioka, Ryota and Clementi, Cecilia and No{\'e}, Frank},
  journal={Science},
  volume={389},
  number={6761},
  pages={eadv9817},
  year={2025},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adv9817}
}


proto-bio/proto-tools/proto_tools/tools/structure_dynamics/bioemu


Function	Description
`run_bioemu()`	Protein conformational ensemble sampling using BioEmu (GPU)

Background

A protein in solution is not a single fixed shape. It fluctuates among many conformations, and this flexibility underlies catalysis, allosteric regulation, and molecular recognition. Characterizing this ensemble experimentally is difficult. Simulating it physically with molecular dynamics is accurate but computationally demanding, since the timescales of biologically relevant motions can require enormous amounts of simulation.

BioEmu (Lewis et al., 2025) approaches the problem with a diffusion-based generative model that learns to emulate protein equilibrium ensembles directly. Starting from noise, the model iteratively denoises protein backbone coordinates conditioned on a sequence embedding, producing thousands of statistically independent structures per hour on a single GPU. The conditioning sequence embedding is derived from a multiple sequence alignment (MSA), so each sequence is first searched against sequence databases to assemble its alignment.

The published model was trained on a large corpus of molecular dynamics simulation alongside static structures and experimental protein stability measurements. It reproduces functional motions such as cryptic pocket formation, local unfolding, and domain rearrangements while approximating relative free energies.

Learning Resources

BioEmu repository (Microsoft Research) - the reference implementation, model checkpoints, and usage examples.

Tools

Conformational Ensemble Sampling (`bioemu-sample`)

Samples a conformational ensemble of protein backbone structures for one or more single-chain protein sequences. Each sequence yields an independent ensemble whose members represent distinct conformations drawn from the model’s learned equilibrium distribution.

API Reference


Input: BioEmuInput

complexes

List[Complex]

required

Protein complexes to sample. BioEmu supports monomer-only inputs, so each complex must contain one protein chain.

Complex fields:

chains

List[Chain | Fragment]

required

Chains in the complex, in input order.

msas

array

Pre-computed MSAs, one entry per complex. Each entry maps chain index to its MSA. BioEmu is single-chain, so only chain index 0 is read. Populated by preprocess() or supplied directly. Default: None.
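
The input constraints described here (one protein chain per complex) and under Usage Tips (standard residues only, with a warning beyond roughly 500 residues) can be checked before a job is submitted. The following is a minimal, standard-library sketch of such a pre-check. `check_monomer`, `STANDARD_AA`, and `LENGTH_WARNING_THRESHOLD` are hypothetical names written for illustration; they are not part of proto_tools or BioEmu, and the toolkit's own validation may differ in detail.

```python
import warnings

# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
# Approximate length beyond which the documentation says a warning is raised.
LENGTH_WARNING_THRESHOLD = 500


def check_monomer(chains):
    """Validate a list of chain sequences (plain strings) for one complex.

    Returns the single upper-cased sequence, or raises ValueError.
    This is an illustrative stand-in, not the toolkit's validator.
    """
    if len(chains) != 1:
        raise ValueError(f"expected exactly one chain, got {len(chains)}")
    sequence = chains[0].strip().upper()
    if not sequence:
        raise ValueError("sequence is empty")
    bad = sorted(set(sequence) - STANDARD_AA)
    if bad:
        raise ValueError(f"non-standard residues: {''.join(bad)}")
    if len(sequence) > LENGTH_WARNING_THRESHOLD:
        warnings.warn(
            f"sequence has {len(sequence)} residues; "
            "sample quality may drop and cost may rise at this length"
        )
    return sequence


print(check_monomer(["mkvlaagiw"]))
```

Running a check like this locally catches rejected inputs before any MSA search or GPU time is spent.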


Config: BioEmuConfig

num_samples

integer

default:"500"

Number of conformations to sample per input sequence.

model_name

enum

default:"bioemu-v1.1"

Checkpoint variant: v1.0 matches the earlier preprint, v1.1 matches the Science paper, and v1.2 adds extended MD and folding free-energy training data. Available options: bioemu-v1.0, bioemu-v1.1, bioemu-v1.2

filter_samples

boolean

default:"True"

Drop unphysical samples (steric clashes, chain breaks).

batch_size

integer

default:"10"

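Batch size for a 100-residue sequence (upstream’s batch_size_100). For a sequence of length L, the effective batch size is batch_size * (100 / L) ** 2, so longer sequences run in smaller batches.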

denoiser_type

enum

default:"dpm"

Sampler algorithm: dpm runs 50 deterministic steps; heun is stochastic. Available options: dpm, heun

denoiser_config

string

Path to a custom denoiser/steering YAML; overrides denoiser_type when set.

msa_host_url

string

Override the ColabFold MMseqs2 MSA server URL.

cache_embeds_dir

string

Directory to cache MSA embeddings across runs.

cache_so3_dir

string

Directory to cache SO3 precomputations across runs.

output_dir

string

Optional directory for raw BioEmu outputs.

msa_search_config

Mmseqs2HomologySearchConfig

MMseqs2 homology search config (MSA generation). Defaults are used when None.

verbose

integer

default:"0"

Verbose logging toggle (inherited).

device

string

default:"cuda"

Inference device (inherited).

timeout

integer

default:"3600"

Maximum execution time in seconds.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

include_pae_matrix

boolean

default:"False"

Inherited but unused (no PAE in conformational sampling).
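
The batch_size scaling rule above determines how many batches a run needs, which in turn drives wall-clock time against the timeout. The sketch below works through that arithmetic. Truncating to an integer and flooring at one sample per batch are assumptions made for illustration, and `effective_batch_size` and `num_batches` are hypothetical helper names, not toolkit functions.

```python
import math


def effective_batch_size(batch_size, seq_len):
    """Scale the length-100 batch size by (100 / L) ** 2.

    Integer truncation and the minimum of 1 are assumptions of this sketch.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    scaled = int(batch_size * (100 / seq_len) ** 2)
    return max(1, scaled)


def num_batches(num_samples, batch_size, seq_len):
    """Number of batches needed to draw num_samples conformations."""
    return math.ceil(num_samples / effective_batch_size(batch_size, seq_len))


# Defaults from the config table: num_samples=500, batch_size=10.
for length in (50, 100, 200, 400):
    print(length, effective_batch_size(10, length), num_batches(500, 10, length))
```

Under these assumptions, a 200-residue sequence with the default settings samples 2 structures per batch and needs 250 batches, while a 400-residue sequence falls to one structure per batch. Raising batch_size on a GPU with more memory reduces the batch count proportionally.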


Output: BioEmuOutput

ensembles

List[StructureEnsemble]

required

Generated ensembles, one per input complex.

StructureEnsemble fields:

structures

List[Structure]

required

List of sampled conformational structures. Each Structure represents a single backbone conformation from the ensemble.

sequence

string

required

The input protein sequence.
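
Because each ensemble carries only the structures that survived filtering, it can be useful to compare the returned count with the number requested. The sketch below shows one way to do that. `StructureStub` and `EnsembleStub` are hypothetical stand-ins that mirror only the two documented fields (structures and sequence), and the counts are toy data, not BioEmu results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructureStub:
    """Stand-in for one sampled backbone conformation (hypothetical)."""
    label: str


@dataclass
class EnsembleStub:
    """Stand-in mirroring the documented fields: sequence and structures."""
    sequence: str
    structures: List[StructureStub] = field(default_factory=list)


def retained_fraction(ensemble, num_samples):
    """Fraction of the requested samples present in the returned ensemble."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return len(ensemble.structures) / num_samples


# Toy data: pretend 450 of 500 requested samples passed filtering.
ensembles = [EnsembleStub("MKV", [StructureStub(f"s{i}") for i in range(450)])]
for ens in ensembles:
    print(ens.sequence, len(ens.structures), retained_fraction(ens, 500))
```

A low retained fraction is a cue to request more samples or to inspect the raw output with filter_samples disabled.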

Applications

Surveying the conformational flexibility of a protein, including the relative populations of folded and alternative states.
Revealing functional motions such as cryptic pocket opening, local unfolding, and domain rearrangements that a single predicted structure does not show.
Generating a structural ensemble for downstream analysis such as clustering into metastable states or estimating per-residue flexibility.
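
As an illustration of the per-residue flexibility analysis mentioned above, the following sketch computes a root-mean-square fluctuation (RMSF) per residue from an ensemble of coordinates. It assumes one representative atom per residue (for example the alpha carbon) and that all conformations have already been superimposed on a common frame; neither step is shown. `per_residue_rmsf` is a hypothetical helper and the coordinates are toy values.

```python
import math


def per_residue_rmsf(ensemble):
    """RMSF per residue for coordinates indexed as [sample][residue] -> (x, y, z).

    Assumes the conformations are already aligned to a common frame.
    """
    n_samples = len(ensemble)
    if n_samples == 0:
        raise ValueError("ensemble is empty")
    n_res = len(ensemble[0])
    rmsf = []
    for r in range(n_res):
        points = [conf[r] for conf in ensemble]
        # Mean position of this residue across the ensemble.
        mean = [sum(p[k] for p in points) / n_samples for k in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((p[k] - mean[k]) ** 2 for k in range(3)) for p in points
        ) / n_samples
        rmsf.append(math.sqrt(msd))
    return rmsf


# Toy ensemble: residue 0 is fixed, residue 1 moves by 2 units along y.
toy = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.8, 2.0, 0.0)],
]
print(per_residue_rmsf(toy))
```

Residues with large RMSF values are candidates for flexible loops or locally unfolding regions, although the result depends on how the ensemble was aligned.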

Usage Tips

The input must be a single-chain monomer of standard amino acids. Multi-chain complexes, non-protein chains, and non-standard residues are rejected, and sequences beyond roughly 500 residues raise a warning because quality degrades and cost grows with length.
num_samples sets the size of the ensemble. A few tens of samples give a quick read on conformational diversity, while several hundred or more give the coverage needed to estimate state populations or free-energy differences.
filter_samples removes unphysical structures. Leaving it enabled drops samples with steric clashes or broken chain geometry, so the returned ensemble may hold fewer structures than num_samples requested. Disabling it returns the raw samples for inspection.
model_name selects the checkpoint. The default bioemu-v1.1 matches the published Science paper. bioemu-v1.2 is trained on additional molecular dynamics and folding free-energy data and is preferable when folding-state thermodynamics matter, while bioemu-v1.0 reproduces the earlier preprint.
denoiser_config enables physical steering. Pointing it at a steering configuration biases sampling toward more physically plausible structures and overrides denoiser_type, which otherwise selects the deterministic dpm or stochastic heun sampler.
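
To make the link between ensemble size and free-energy estimates concrete, the sketch below converts state populations into a free-energy difference with the Boltzmann relation ΔG = -RT ln(p_B / p_A). It assumes each sample has already been assigned to a state by some method of the reader's choosing (for example a fraction-of-native-contacts cutoff or clustering), which is not shown. `free_energy_difference` is a hypothetical helper and the counts are toy data.

```python
import math

# Gas constant in kcal / (mol K).
R_KCAL = 0.0019872041


def free_energy_difference(count_a, count_b, temperature=300.0):
    """Free energy of state B relative to state A, in kcal/mol.

    Uses dG = -R T ln(p_B / p_A); the ratio of populations equals the
    ratio of sample counts, so the total number of samples cancels.
    """
    if count_a <= 0 or count_b <= 0:
        raise ValueError("both states need at least one sample")
    return -R_KCAL * temperature * math.log(count_b / count_a)


# Toy data: 90 samples labeled folded, 10 labeled unfolded.
dg_unfold = free_energy_difference(count_a=90, count_b=10)
print(f"{dg_unfold:.2f} kcal/mol")
```

A state that holds only a handful of samples gives a noisy ratio, which is why several hundred samples or more are suggested when populations or free-energy differences are the goal.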

Toolkit Notes

A multiple sequence alignment is always required. Each sequence is searched against the ColabFold MMseqs2 server during preprocessing to build its alignment, unless an alignment is supplied directly on the input, so network access is needed when alignments are not provided.
Sampling is stochastic and seeded. Results depend on the configured seed, so repeating a run with the same seed reproduces the ensemble while changing it explores new conformations.
Output is backbone only, and sampling runs on GPU. The model returns backbone coordinates without side chains and requires a CUDA GPU, since diffusion sampling is impractical on CPU.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.