License: BioEmu is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with Microsoft Research. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


microsoft/bioemu
Inference code for scalable emulation of protein equilibrium ensembles with generative deep learning
Scalable emulation of protein equilibrium ensembles with generative deep learning
Sarah Lewis, Tim Hempel, … Frank Noé
Science (2025)
@article{lewis2025bioemu,
  title={Scalable emulation of protein equilibrium ensembles with generative deep learning},
  author={Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew Y K and Satorras, Victor Garc{\'i}a and Abdin, Osama and Veeling, Bastiaan S and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Foster, Adam E and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Sordillo, Roberto and Tomioka, Ryota and Clementi, Cecilia and No{\'e}, Frank},
  journal={Science},
  volume={389},
  number={6761},
  pages={eadv9817},
  year={2025},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adv9817}
}
proto-bio/proto-tools/proto_tools/tools/structure_dynamics/bioemu
Function        Description
run_bioemu()    Protein conformational ensemble sampling using BioEmu (GPU)

Background

A protein in solution is not a single fixed shape. It fluctuates among many conformations, and this flexibility underlies catalysis, allosteric regulation, and molecular recognition. Characterizing this ensemble experimentally is difficult, and physically simulating it with molecular dynamics is accurate but computationally demanding, since the timescales of biologically relevant motions can require enormous amounts of simulation. BioEmu (Lewis et al., 2025) approaches the problem with a diffusion-based generative model that learns to emulate protein equilibrium ensembles directly. Starting from noise, the model iteratively denoises protein backbone coordinates conditioned on a sequence embedding, producing thousands of statistically independent structures per hour on a single graphics processing unit. The published model was trained on a large corpus of molecular dynamics simulation alongside static structures and experimental protein stability measurements, and it reproduces functional motions such as cryptic pocket formation, local unfolding, and domain rearrangements while approximating relative free energies. The conditioning sequence embedding is derived from a multiple sequence alignment, so each sequence is first searched against sequence databases to assemble its alignment.

Learning Resources

  • BioEmu repository (Microsoft Research) - the reference implementation, model checkpoints, and usage examples.

Tools

Conformational Ensemble Sampling (bioemu-sample)

Samples a conformational ensemble of protein backbone structures for one or more single-chain protein sequences. Each sequence yields an independent ensemble whose members represent distinct conformations drawn from the model’s learned equilibrium distribution.

API Reference

complexes
List[Complex]
required
Protein complexes to sample. BioEmu supports monomer-only inputs, so each complex must contain one protein chain.
msas
array
Pre-computed MSAs, one entry per complex. Each entry maps chain index to its MSA. BioEmu is single-chain, so only chain index 0 is read. Populated by preprocess() or supplied directly. Default: None.
num_samples
integer
default:"500"
Number of conformations to sample per input sequence.
model_name
enum
default:"bioemu-v1.1"
Checkpoint variant. bioemu-v1.1 corresponds to the Science paper; bioemu-v1.2 adds extended molecular dynamics and folding free-energy training data. Available options: bioemu-v1.0, bioemu-v1.1, bioemu-v1.2
filter_samples
boolean
default:"True"
Drop unphysical samples (steric clashes, chain breaks).
batch_size
integer
default:"10"
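Corresponds to upstream BioEmu's batch_size_100, the batch size for a 100-residue sequence. The effective batch size is batch_size * (100 / L) ** 2, where L is the sequence length.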
denoiser_type
enum
default:"dpm"
Sampler algorithm: dpm takes 50 deterministic steps; heun is stochastic. Available options: dpm, heun
denoiser_config
string
Path to a custom denoiser/steering YAML; overrides denoiser_type when set.
msa_host_url
string
Override the ColabFold MMseqs2 MSA server URL.
cache_embeds_dir
string
Directory to cache MSA embeddings across runs.
cache_so3_dir
string
Directory to cache SO3 precomputations across runs.
output_dir
string
Optional directory for raw BioEmu outputs.
msa_search_config
Mmseqs2HomologySearchConfig
MMseqs2 homology search config (MSA generation). Defaults are used when None.
verbose
integer
default:"0"
Verbose logging toggle (inherited).
device
string
default:"cuda"
Inference device (inherited).
timeout
integer
default:"3600"
Maximum execution time in seconds. Default: 3600.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
include_pae_matrix
boolean
default:"False"
Inherited but unused (no PAE in conformational sampling).
Outputs
ensembles
List[StructureEnsemble]
required
Generated ensembles, one per input complex.
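
The batch_size scaling described above can be worked through numerically. The sketch below is a minimal illustration of the documented formula batch_size * (100 / L) ** 2; the function name effective_batch_size is hypothetical, and both the integer truncation and the floor of one sample per batch are assumptions rather than confirmed toolkit behavior.

```python
def effective_batch_size(batch_size: int, seq_len: int) -> int:
    """Scale the 100-residue batch size by (100 / L) ** 2, as documented above."""
    if batch_size < 1 or seq_len < 1:
        raise ValueError("batch_size and seq_len must be positive")
    # Assumption: the scaled value is truncated to an integer.
    scaled = int(batch_size * (100 / seq_len) ** 2)
    # Assumption: at least one sample is processed per batch.
    return max(1, scaled)


if __name__ == "__main__":
    # Show how the default batch_size of 10 scales with sequence length.
    for length in (50, 100, 200, 500):
        print(length, effective_batch_size(10, length))
```

Under these assumptions the default batch_size of 10 gives 40 samples per batch for a 50-residue sequence and 2 for a 200-residue sequence, which shows why longer sequences need many more batches for the same num_samples.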

Applications

  • Surveying the conformational flexibility of a protein, including the relative populations of folded and alternative states.
  • Revealing functional motions such as cryptic pocket opening, local unfolding, and domain rearrangements that a single predicted structure does not show.
  • Generating a structural ensemble for downstream analysis such as clustering into metastable states or estimating per-residue flexibility.
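
As a rough illustration of the first application, the relative population of a folded state can be converted into a two-state folding free energy, with an error bar that shrinks as the ensemble grows. The sketch below assumes each sample has already been labeled folded or not by some user-chosen criterion (the is_folded list is hypothetical input, not a toolkit output), that samples are statistically independent, and that a simple two-state picture at 300 K applies. The function name is illustrative only.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)


def folding_free_energy(is_folded, temperature_k=300.0):
    """Return (folded fraction, dG_fold, standard error of dG_fold) in kcal/mol.

    dG_fold = G_folded - G_unfolded = -RT * ln(p_folded / p_unfolded),
    so a negative value means the folded state is favored.
    """
    n = len(is_folded)
    n_folded = sum(bool(x) for x in is_folded)
    if n == 0 or n_folded in (0, n):
        raise ValueError("need at least one folded and one unfolded sample")
    p = n_folded / n
    rt = R_KCAL_PER_MOL_K * temperature_k
    dg = -rt * math.log(p / (1.0 - p))
    # Binomial standard error of p, propagated through the log-odds
    # (delta method): se(dG) = RT / sqrt(n * p * (1 - p)).
    se_dg = rt / math.sqrt(n * p * (1.0 - p))
    return p, dg, se_dg


if __name__ == "__main__":
    small = [True] * 27 + [False] * 3      # 30 samples, 90% folded
    large = [True] * 450 + [False] * 50    # 500 samples, 90% folded
    print(folding_free_energy(small))
    print(folding_free_energy(large))
```

Both example ensembles give the same free-energy estimate, but the standard error for 500 samples is roughly a quarter of that for 30, so the larger ensemble is the one to trust for population estimates.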

Usage Tips

  • The input must be a single-chain monomer of standard amino acids. Multi-chain complexes, non-protein chains, and non-standard residues are rejected, and sequences beyond roughly 500 residues raise a warning because quality and cost both degrade with length.
  • num_samples sets the size of the ensemble. A few tens of samples give a quick read on conformational diversity, while several hundred or more give the coverage needed to estimate state populations or free-energy differences.
  • filter_samples removes unphysical structures. Leaving it enabled drops samples with steric clashes or broken chain geometry, so the returned ensemble may hold fewer structures than num_samples requested. Disabling it returns the raw samples for inspection.
  • model_name selects the checkpoint. The default bioemu-v1.1 matches the published Science paper. bioemu-v1.2 is trained on additional molecular dynamics and folding free-energy data and is preferable when folding-state thermodynamics matter, while bioemu-v1.0 reproduces the earlier preprint.
  • denoiser_config enables physical steering. Pointing it at a steering configuration biases sampling toward more physically plausible structures and overrides denoiser_type, which otherwise selects the deterministic dpm or stochastic heun sampler.
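
The input constraints in the first tip can be checked before a job is submitted. The following is a hypothetical pre-flight helper, not part of the toolkit: the name check_monomer_sequence, the uppercase normalization, and the exact threshold of 500 are assumptions drawn from the "roughly 500 residues" guidance above.

```python
import warnings

STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
LENGTH_WARNING_THRESHOLD = 500  # assumption: "roughly 500 residues" per the tip


def check_monomer_sequence(sequence: str) -> str:
    """Return the normalized sequence, or raise ValueError if it is unusable."""
    normalized = sequence.strip().upper()  # assumption: case-insensitive input
    if not normalized:
        raise ValueError("sequence is empty")
    invalid = sorted(set(normalized) - STANDARD_AMINO_ACIDS)
    if invalid:
        # Chain separators such as ':' also land here, so multi-chain
        # strings are rejected along with non-standard residues.
        raise ValueError(f"non-standard characters: {''.join(invalid)}")
    if len(normalized) > LENGTH_WARNING_THRESHOLD:
        warnings.warn(
            f"sequence has {len(normalized)} residues; quality and cost "
            f"degrade beyond about {LENGTH_WARNING_THRESHOLD}",
            stacklevel=2,
        )
    return normalized


if __name__ == "__main__":
    print(check_monomer_sequence("mkvlaagivg"))
```

Running a check like this locally avoids spending a preprocessing round trip on a sequence that would be rejected anyway.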

Toolkit Notes

  • A multiple sequence alignment is always required. Each sequence is searched against the ColabFold MMseqs2 server during preprocessing to build its alignment, unless an alignment is supplied directly on the input, so network access is needed when alignments are not provided.
  • Sampling is stochastic and seeded. Results depend on the configured seed, so repeating a run with the same seed reproduces the ensemble while changing it explores new conformations.
  • Output is backbone only and runs on GPU. The model returns backbone coordinates without side chains and requires a CUDA GPU, since diffusion sampling is impractical on CPU.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.