Proto is not affiliated with Microsoft Research. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.
Background
A protein in solution is not a single fixed shape. It fluctuates among many conformations, and this flexibility underlies catalysis, allosteric regulation, and molecular recognition. Characterizing this ensemble experimentally is difficult, and physically simulating it with molecular dynamics is accurate but computationally demanding, since the timescales of biologically relevant motions can require enormous amounts of simulation. BioEmu (Lewis et al., 2025) approaches the problem with a diffusion-based generative model that learns to emulate protein equilibrium ensembles directly. Starting from noise, the model iteratively denoises protein backbone coordinates conditioned on a sequence embedding, producing thousands of statistically independent structures per hour on a single graphics processing unit. The published model was trained on a large corpus of molecular dynamics simulation alongside static structures and experimental protein stability measurements, and it reproduces functional motions such as cryptic pocket formation, local unfolding, and domain rearrangements while approximating relative free energies. The conditioning sequence embedding is derived from a multiple sequence alignment, so each sequence is first searched against sequence databases to assemble its alignment.
Learning Resources
- BioEmu repository (Microsoft Research) - the reference implementation, model checkpoints, and usage examples.
Tools
Conformational Ensemble Sampling (bioemu-sample)
Samples a conformational ensemble of protein backbone structures for one or more single-chain protein sequences. Each sequence yields an independent ensemble whose members represent distinct conformations drawn from the model’s learned equilibrium distribution.
API Reference
Input: BioEmuInput
- … 0 is read. Populated by preprocess() or supplied directly. Default: None.
Config: BioEmuConfig
- … bioemu-v1.0, bioemu-v1.1, bioemu-v1.2
- … batch_size_100; effective batch is batch_size * (100 / L) ** 2.
- … dpm is 50 deterministic steps; heun is stochastic. Available options: dpm, heun
- … denoiser_type when set. … None.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
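The length scaling of the batch size given by the formula batch_size * (100 / L) ** 2 can be sketched as follows. This is a minimal illustration, assuming the scaled value is truncated to an integer and clamped to at least one sample per batch; the function name effective_batch_size is hypothetical and not part of the toolkit.

```python
def effective_batch_size(batch_size_100: int, seq_len: int) -> int:
    """Scale a batch size defined for 100 residues to a sequence of length seq_len.

    Follows batch_size * (100 / L) ** 2. The integer truncation and the
    minimum of 1 are assumptions made for this sketch.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    scaled = batch_size_100 * (100 / seq_len) ** 2
    return max(1, int(scaled))


# Longer sequences get quadratically smaller batches.
for length in (50, 100, 200, 500):
    print(length, effective_batch_size(10, length))
```

The quadratic term means that doubling the sequence length cuts the number of structures processed per batch to about a quarter, which is worth keeping in mind when budgeting large ensembles for long proteins.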
Output: BioEmuOutput
Applications
- Surveying the conformational flexibility of a protein, including the relative populations of folded and alternative states.
- Revealing functional motions such as cryptic pocket opening, local unfolding, and domain rearrangements that a single predicted structure does not show.
- Generating a structural ensemble for downstream analysis such as clustering into metastable states or estimating per-residue flexibility.
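Per-residue flexibility is commonly summarized as the root-mean-square fluctuation (RMSF) of each residue around its mean position. The sketch below assumes the ensemble has already been superposed onto a common frame and is supplied as nested lists with one (x, y, z) point per residue per sample. The function name and data layout are illustrative only and do not reflect the BioEmuOutput format.

```python
import math


def per_residue_rmsf(ensemble):
    """Return one RMSF value per residue.

    ensemble: list of samples, each a list of (x, y, z) tuples, one per residue.
    Assumes all samples have the same length and are already superposed.
    """
    n_samples = len(ensemble)
    if n_samples == 0:
        raise ValueError("ensemble is empty")
    n_residues = len(ensemble[0])
    rmsf = []
    for i in range(n_residues):
        points = [sample[i] for sample in ensemble]
        # Mean position of residue i across the ensemble.
        mean = [sum(p[k] for p in points) / n_samples for k in range(3)]
        # Mean squared deviation of residue i from its mean position.
        msd = sum(
            sum((p[k] - mean[k]) ** 2 for k in range(3)) for p in points
        ) / n_samples
        rmsf.append(math.sqrt(msd))
    return rmsf


# Toy ensemble: residue 0 is rigid, residue 1 moves along x.
toy = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(per_residue_rmsf(toy))
```

Plain lists keep the sketch dependency-free; for ensembles of hundreds of structures an array library would be the more practical choice.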
Usage Tips
- The input must be a single-chain monomer of standard amino acids. Multi-chain complexes, non-protein chains, and non-standard residues are rejected, and sequences beyond roughly 500 residues raise a warning because sample quality degrades and compute cost grows with length.
- num_samples sets the size of the ensemble. A few tens of samples give a quick read on conformational diversity, while several hundred or more give the coverage needed to estimate state populations or free-energy differences.
- filter_samples removes unphysical structures. Leaving it enabled drops samples with steric clashes or broken chain geometry, so the returned ensemble may hold fewer structures than num_samples requested. Disabling it returns the raw samples for inspection.
- model_name selects the checkpoint. The default bioemu-v1.1 matches the published Science paper. bioemu-v1.2 is trained on additional molecular dynamics and folding free-energy data and is preferable when folding-state thermodynamics matter, while bioemu-v1.0 reproduces the earlier preprint.
- denoiser_config enables physical steering. Pointing it at a steering configuration biases sampling toward more physically plausible structures and overrides denoiser_type, which otherwise selects the deterministic dpm or stochastic heun sampler.
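The input constraints in the first tip can be approximated with a small pre-check before a sequence is submitted. This is a sketch only: the function name check_sequence, the exact 500-residue cutoff, the chain-separator characters, and the use of Python warnings are all assumptions, and the toolkit's own validation may differ.

```python
import warnings

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
LENGTH_WARNING_THRESHOLD = 500  # the tip says "roughly 500"; exact cutoff assumed


def check_sequence(sequence: str) -> str:
    """Validate a single-chain sequence of the 20 standard amino acids."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence is empty")
    # ':' and '/' are common chain separators in multi-chain inputs (assumed here).
    if ":" in seq or "/" in seq:
        raise ValueError("multi-chain input is not supported")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-standard residues: {''.join(invalid)}")
    if len(seq) > LENGTH_WARNING_THRESHOLD:
        warnings.warn(
            f"sequence has {len(seq)} residues; quality may drop and cost rises"
        )
    return seq


print(check_sequence("mkvlaagiv"))
```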
Toolkit Notes
- A multiple sequence alignment is always required. Each sequence is searched against the ColabFold MMseqs2 server during preprocessing to build its alignment, unless an alignment is supplied directly on the input, so network access is needed when alignments are not provided.
- Sampling is stochastic and seeded. Results depend on the configured seed, so repeating a run with the same seed reproduces the ensemble while changing it explores new conformations.
- Output is backbone only, and sampling runs on GPU. The model returns backbone coordinates without side chains and requires a CUDA GPU, since diffusion sampling is impractical on CPU.
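The seeding note can be illustrated with a toy sampler and a cache key that includes the seed. Everything here is hypothetical: toy_sample stands in for the model using Python's random module and ignores the sequence content, and cache_key is just one possible way to make the seed part of a key, returning None for unseeded runs so that caching is skipped.

```python
import hashlib
import json
import random


def toy_sample(sequence, num_samples, seed=None):
    """Stand-in for stochastic sampling: one random number per requested sample."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_samples)]


def cache_key(sequence, config):
    """Build a deterministic key from the inputs, or None when the run is unseeded."""
    if config.get("seed") is None:
        return None  # unseeded runs are not reproducible, so they are not cached
    payload = json.dumps({"sequence": sequence, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same seed reproduces the draw; a different seed gives a new one.
same = toy_sample("MKV", 3, seed=0) == toy_sample("MKV", 3, seed=0)
different = toy_sample("MKV", 3, seed=0) != toy_sample("MKV", 3, seed=1)
print(same, different)
```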
