Proto is not affiliated with Meta AI or Biohub. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.
| Function | Description | |
|---|---|---|
| run_esm2_embeddings() | Extract protein sequence embeddings and logits using ESM2 (GPU) | Docs Source |
| run_esm2_gradient() | Compute ESM2 masked pseudo-log-likelihood gradient for relaxed protein sequences (GPU) | Docs Source |
| run_esm2_sample() | Sample masked positions in protein sequences using ESM2 language model (GPU) | Docs Source |
| run_esm2_score() | Score protein sequences using ESM2 language model (GPU) | Docs Source |
Background
In 2023, Lin et al. introduced ESM-2, a family of Transformer encoders trained with a BERT-style masked-language-modeling (MLM) objective. Training used UniRef50, a clustered subset of UniProt covering roughly 65 million unique protein sequences. A central focus of the paper was the impact of scale, treated as the experimental variable across six model checkpoints spanning more than three orders of magnitude in parameter count (8M, 35M, 150M, 650M, 3B, and 15B). Unlike autoregressive language models, which predict each token from preceding context only, MLM lets every residue attend to its full sequence context in both directions. At each training step a randomly generated mask covers 15% of input residues and replaces those tokens with a <mask> symbol. The model is then trained to predict the original amino acid from the surrounding bidirectional context. No structural, functional, or alignment supervision is used.
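The masking step described above can be sketched in a few lines. This is a simplified illustration, not the ESM-2 training code: the function name `mask_residues` and its parameters are hypothetical, and the sketch always substitutes the mask token, ignoring any further corruption rules a real training pipeline may apply.

```python
import random

MASK_TOKEN = "<mask>"


def mask_residues(sequence, mask_fraction=0.15, seed=None):
    """Replace a random subset of residues with MASK_TOKEN.

    Returns the token list after masking and the sorted masked positions.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Mask at least one residue, but never more than the sequence holds.
    n_mask = min(len(sequence), max(1, round(len(sequence) * mask_fraction)))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    chosen = set(positions)
    tokens = [MASK_TOKEN if i in chosen else aa for i, aa in enumerate(sequence)]
    return tokens, positions


tokens, positions = mask_residues("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
# The training target is the original residue at each masked position.
targets = {i: "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"[i] for i in positions}
```

The model only ever sees `tokens`; the loss is computed against `targets`, which is why no structural or functional labels are needed.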
ESM-2 has since become a de facto sequence representation model for protein engineering. Its direct successor, ESM3 (Hayes et al., 2025), extends the recipe at EvolutionaryScale into a multimodal generative model that jointly handles sequence, structure, and function tracks via discrete diffusion. ESM-2 remains among the lightest and most widely deployed protein language models. Within this toolkit, the 650M checkpoint (esm2_t33_650M_UR50D) is a standard quality/speed tradeoff and is the default for every tool.
Tools
ESM2 Sampling (esm2-sample)
Selects positions to mutate via a specifiable masking strategy, replaces them with <mask>, and resamples from ESM-2’s predicted distribution. Two decoding modes are available. The single_pass mode fills every masked position in one forward pass with independent draws. The iterative_refinement mode instead runs a MaskGIT-style multi-round commit loop. Each round of that loop uses a cosine or linear unmask schedule with optional temperature annealing. To target specific positions directly, pre-mask them yourself with _ in the input string. The tool will then fill exactly those positions and skip the masking strategy entirely.
API Reference
Input: ESM2SampleInput
_ at positions to sample. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.
Config: ESM2SampleConfig
Checkpoints: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
Decoding modes: single_pass, iterative_refinement
Unmask schedules: cosine, linear
random, entropy
True is coerced to 1 and False to 0.
None waits indefinitely.
BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: ESM2SampleOutput
Applications
This tool drives guided point mutation, variant generation, and infilling at designable sites for protein engineering work. Resampling masked positions from a protein language model is the core operation behind directed-evolution proposals and antibody affinity maturation, which was demonstrated at experimental scale in Hie et al., 2024. It is also the inner loop of MaskGIT-style iterative refinement schemes adapted from image generation (Chang et al., 2022) for biological sequences.
Usage Tips
- iterative_refinement produces more coherent joint samples than single_pass. It is a multi-round MaskGIT-style commit loop (each round uses a cosine or linear unmask schedule) and is roughly num_steps× slower than the one-shot single_pass mode. Default to it whenever you mask more than a handful of sites.
- masking_strategy controls which positions get masked before sampling. See the masking strategy README for the available selection methods and tuning knobs. As an alternative to passing a strategy, pre-mask exact positions yourself with _ directly in the input string and the masking strategy is skipped entirely.
- temperature scales the per-position logits before sampling. Values of 0.5 to 0.7 yield conservative mutations close to the input; values above 1.0 broaden exploration of the model’s distribution.
- Long-range coherence is weak. ESM-2 has no global coherence beyond its local context window, so very long-range dependencies between distant residues are not well captured even in iterative mode.
- ESM-2 was trained as a masked language model, not with a generative objective. Resampling masked positions works for local edits, but the model was optimized for representation rather than de novo generation. For generative workloads (large-scale infilling, sequence design), ESM3 adds an explicit generative training objective and is the better fit.
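To make the schedule and temperature knobs concrete, here is a minimal sketch of how a cosine or linear unmask schedule and temperature-scaled sampling could look. It follows the cosine schedule described for MaskGIT (Chang et al., 2022) and is an assumption about the general technique, not this toolkit's implementation; `masked_after_round` and `sample_with_temperature` are hypothetical names.

```python
import math
import random


def masked_after_round(n_masked, step, num_steps, schedule="cosine"):
    """Number of positions still masked after round `step` (1-indexed)."""
    if step >= num_steps:
        return 0  # the final round commits everything that is left
    frac = step / num_steps
    if schedule == "cosine":
        ratio = math.cos(math.pi / 2 * frac)
    elif schedule == "linear":
        ratio = 1.0 - frac
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return math.floor(n_masked * ratio)


def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Draw one index from softmax(logits / temperature)."""
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With 20 masked positions and four rounds, the cosine schedule leaves 18, 14, 7, and 0 positions masked after successive rounds, while the linear schedule leaves 15, 10, 5, and 0. The cosine schedule therefore commits few positions early, when context is sparse, and many late. Lower temperatures concentrate the draw on the highest-logit residue, which matches the conservative behavior noted above.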
ESM2 Scoring (esm2-score)
Computes the masked-language-model pseudo-perplexity for each input sequence. Each position is masked individually, and the model’s log-probability of the true amino acid under bidirectional context is recorded. The per-position scores are then aggregated into per-sequence log-likelihood, average log-likelihood, and perplexity metrics.
API Reference
Input: ESM2ScoringInput
Sequences longer than 1022 residues raise ValueError.
Config: ESM2ScoringConfig
Checkpoints: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
None waits indefinitely.
BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: MaskedModelScoringOutput
Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields that carry raw model outputs when requested.
Metrics (per scores item):

| Metric | Type | Range | Availability |
|---|---|---|---|
| log_likelihood | float | ≤ 0.0 | always |
| avg_log_likelihood | float | ≤ 0.0 | always |
| perplexity | float | ≥ 1.0 | always |
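The three metrics are tied together by a simple aggregation, which also explains their ranges. The sketch below assumes natural-log probabilities and the conventional definition of perplexity as the exponential of the negative average log-likelihood; `aggregate_scores` is a hypothetical helper, and the per-position log-probabilities would come from the model with each position masked in turn.

```python
import math

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def aggregate_scores(sequence, position_log_probs):
    """Aggregate per-position log-probabilities of the true residues.

    Positions holding non-canonical residues (X, B, Z, ...) are skipped in
    both the sum and the position count.
    """
    kept = [lp for aa, lp in zip(sequence, position_log_probs) if aa in CANONICAL]
    if not kept:
        raise ValueError("no canonical residues to score")
    log_likelihood = sum(kept)                 # always <= 0.0
    avg_log_likelihood = log_likelihood / len(kept)
    perplexity = math.exp(-avg_log_likelihood)  # always >= 1.0
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": perplexity,
    }
```

Because every log-probability is at most zero, the sum and the average are at most zero, and the perplexity is at least one. A perplexity near 1 means the model is nearly certain of every residue; a perplexity near 20 means it does little better than a uniform guess over the canonical amino acids.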
Applications
ESM-2 pseudo-perplexity is a standard fitness proxy when ranking variants, filtering generated sequences for naturalness, or comparing engineered constructs against wild type. The masked log-likelihood difference between wild-type and mutant residues at a position is also a canonical zero-shot baseline for variant-effect prediction.
Usage Tips
- Pseudo-perplexity is a relative score, not an absolute fitness. It is measured against ESM-2’s training distribution, UniRef50 (the natural proteins it saw during pretraining), so it can be biased toward protein families that are more heavily represented there. The metric is also sensitive to length, so it is most useful for comparing closely related sequences of similar length.
- Ambiguous residues are excluded. Perplexity is computed only over the 20 canonical amino acids; X, B, Z, and similar are dropped from both the log-likelihood sum and the position count.
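The zero-shot variant-effect baseline mentioned under Applications reduces to a difference of two log-probabilities at the masked site. The sketch below uses a toy three-residue vocabulary with made-up logits purely for illustration; in practice the logits would be the model's output at the mutated position with that position masked. Both function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw logits (dict: residue -> logit) into log-probabilities."""
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {aa: v - log_norm for aa, v in logits.items()}


def variant_effect_score(site_logits, wild_type, mutant):
    """log p(mutant) - log p(wild type) at one masked position.

    Negative values mean the model prefers the wild-type residue.
    """
    log_probs = log_softmax(site_logits)
    return log_probs[mutant] - log_probs[wild_type]


# Toy logits at a single masked site (illustrative values, not model output).
toy_logits = {"A": 2.0, "G": 1.0, "P": -1.0}
score_a_to_g = variant_effect_score(toy_logits, wild_type="A", mutant="G")
```

Since the normalizing constant cancels in the difference, the score depends only on the two logits involved, which is why this baseline is a relative rather than absolute measure.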
ESM2 Gradient (esm2-gradient)
Computes the gradient of the mean masked negative log-likelihood with respect to a relaxed (L, 20) input distribution over the canonical amino-acid order ACDEFGHIKLMNPQRSTVWY. The ESM-2 weights are kept frozen throughout. The relaxed distribution is mixed against ESM-2’s per-residue token embeddings to form a soft input. Each amino-acid position is then masked in turn, and a per-chunk backward pass accumulates the gradient. An optional Straight-Through Estimator runs the forward on hard one-hot tokens while still routing gradients through the soft probabilities.
API Reference
Input: ESM2GradientInput
(L, 20) in canonical amino-acid order ACDEFGHIKLMNPQRSTVWY. L must be ≤ 1022 (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.
softmax(input / temperature) before computing the gradient. When None (default), the input is used as-is.
Config: ESM2GradientConfig
Checkpoints: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
False, uses soft blended embeddings directly.
False for forward-only log-likelihood scoring.
None selects the backend default.
True is coerced to 1 and False to 0.
None waits indefinitely.
BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: ESM2GradientOutput
None when compute_gradient=False.
Applications
This tool exposes ESM-2 as a differentiable, structure-free protein-language-model loss for use inside MCMC, gradient descent, or any other optimization loop over relaxed protein sequences. It is most often used as a naturalness prior in continuous design pipelines, including latent Bayesian optimization frameworks and discrete walk-jump sampling approaches for de novo protein design.
Usage Tips
- temperature controls how the raw input is converted into a distribution. With a value set, the tool applies softmax(logits / T) before the forward pass; leave it None (the default) if the input already sums to 1 per position.
- use_ste enables the Straight-Through Estimator. The forward then runs on hard one-hot tokens while gradients still route through the soft probabilities, giving stronger guidance toward discrete sequences. Leave it off for smooth optimization over the relaxed simplex.
- compute_gradient toggles whether the backward pass runs. When set to False, the gradient field is None, but loss and metrics (log-likelihood, perplexity, and so on) are still populated. This is useful for ranking MCMC proposals without paying the backward cost.
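The forward-side operations named above (temperature softmax, soft embedding mixing, and the hard one-hot input used under a straight-through estimator) can be sketched without any deep-learning framework. This is an illustration under stated assumptions, not the toolkit's code: all three function names are hypothetical, the embedding table is a stand-in for ESM-2's token embeddings, and the gradient routing itself needs an autograd framework and is not shown.

```python
import math

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # canonical order of the 20 columns


def softmax_rows(logits, temperature=1.0):
    """Row-wise softmax(logits / temperature) over an (L, 20) nested list."""
    out = []
    for row in logits:
        scaled = [x / temperature for x in row]
        peak = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scaled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def soft_embed(probs, embedding_table):
    """Mix token embeddings by probability: (L, 20) x (20, d) -> (L, d)."""
    dim = len(embedding_table[0])
    return [
        [sum(p * embedding_table[k][j] for k, p in enumerate(row)) for j in range(dim)]
        for row in probs
    ]


def harden(probs):
    """One-hot argmax per row: the forward input a straight-through
    estimator would use in place of the soft probabilities."""
    hard = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)
        hard.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return hard
```

With soft inputs, each position's embedding is a weighted average of the 20 residue embeddings, so the loss is smooth in the probabilities. With hard inputs, the model sees a genuine discrete sequence, which is the motivation for routing gradients through the soft probabilities instead.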
Toolkit Notes
These apply to every ESM-2 tool in this toolkit (esm2-embedding, esm2-sample, esm2-score, esm2-gradient).
- Different ESM-2 checkpoints produce different embedding sizes. Downstream tasks built on one checkpoint will not transfer to another without re-fitting; pick one and stick with it for an analysis.
- Smaller checkpoints run faster. The 150M and 35M variants are significantly faster than the 650M default, at some cost in representation quality.
- Max sequence length is 1022 residues. ESM-2’s positional encoding caps inputs at 1022 residues, and the tools raise ValueError on longer inputs rather than truncating.
- batch_size controls memory usage across the toolkit. Lower it if you OOM; raise it for short-sequence throughput. One nuance: for esm2-score, batch_size counts masked variants pooled across all input sequences rather than sequences themselves (each input contributes L masked variants).
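The length cap and the esm2-score batching nuance can be combined into a quick planning estimate. The sketch below assumes that masked variants are packed into batches of at most batch_size, so the number of forward passes is the ceiling of the total divided by batch_size; that packing rule and the helper name `count_score_batches` are assumptions for illustration, not documented toolkit behavior.

```python
import math

MAX_LEN = 1022  # ESM-2 positional-encoding cap stated above


def count_score_batches(sequences, batch_size):
    """Estimate masked variants and forward batches for a scoring run.

    Each sequence of length L contributes L masked variants (one per
    position), pooled across all inputs before batching.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for seq in sequences:
        if len(seq) > MAX_LEN:
            # Mirror the documented behavior: reject rather than truncate.
            raise ValueError(f"sequence of length {len(seq)} exceeds {MAX_LEN}")
    total_variants = sum(len(seq) for seq in sequences)
    return total_variants, math.ceil(total_variants / batch_size)


variants, batches = count_score_batches(["A" * 100, "C" * 250], batch_size=64)
```

Here two sequences of 100 and 250 residues yield 350 masked variants, which at a batch_size of 64 comes to 6 batches under the assumed packing rule. Scoring cost therefore grows with total residue count, not with the number of sequences.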
