License: ESM2 is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with Meta AI and Biohub. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.


facebookresearch/esm
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
Evolutionary-scale prediction of atomic-level protein structure with a language model
Zeming Lin, Halil Akin, … Yaniv Shmueli
Science (2023)
@article{lin2023esm2,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
  journal={Science},
  volume={379},
  number={6637},
  pages={1123--1130},
  year={2023},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.ade2574}
}
proto-bio/proto-tools/proto_tools/tools/masked_models/esm2
Function | Description
run_esm2_embeddings() | Extract protein sequence embeddings and logits using ESM2 (GPU)
run_esm2_gradient() | Compute ESM2 masked pseudo-log-likelihood gradient for relaxed protein sequences (GPU)
run_esm2_sample() | Sample masked positions in protein sequences using ESM2 language model (GPU)
run_esm2_score() | Score protein sequences using ESM2 language model (GPU)

Background

In 2023, Lin et al. introduced ESM-2, a family of Transformer encoders trained with a BERT-style masked-language-modeling (MLM) objective. Training used UniRef50, a clustered subset of UniProt covering roughly 65 million unique protein sequences. A central focus of the paper was the impact of scale, which was treated as the experimental variable across six model checkpoints spanning more than three orders of magnitude (8M, 35M, 150M, 650M, 3B, and 15B parameters).

Unlike autoregressive language models, which predict each token from preceding context only, MLM lets every residue attend to its full sequence context in both directions. At each training step a randomly generated mask covers 15% of input residues and replaces those tokens with a <mask> symbol. The model is then trained to predict the original amino acid from the surrounding bidirectional context. No structural, functional, or alignment supervision is used.

ESM-2 has since become a de facto standard sequence representation model for protein engineering. Its direct successor, ESM3 (Hayes et al., 2025), developed at EvolutionaryScale, extends the recipe into a multimodal generative model that jointly handles sequence, structure, and function tracks via discrete diffusion. ESM-2 remains among the lightest and most widely deployed protein language models. Within this toolkit, the 650M checkpoint (esm2_t33_650M_UR50D) is a standard quality/speed tradeoff and is the default for every tool.
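
The masking step described above can be sketched in a few lines. This is a simplified illustration rather than the training code: the function name, the seed handling, and the example sequence are made up for the sketch, and it only performs the <mask> replacement the text describes (the original BERT recipe additionally leaves some selected tokens unchanged or swaps in a random token).

```python
import random

MASK = "<mask>"

def mask_residues(sequence, mask_rate=0.15, seed=0):
    """Replace about `mask_rate` of the residues with <mask>.

    Returns (tokens, targets), where targets maps each masked position
    to the original residue the model would be asked to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # ground truth for the MLM loss
        tokens[pos] = MASK
    return tokens, targets

# Hypothetical 40-residue example sequence.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
tokens, targets = mask_residues(sequence)
```

Because the mask is redrawn at every step, the same sequence yields different prediction targets across training steps.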

Tools

ESM2 Embeddings (esm2-embedding)

Runs a single forward pass over ESM-2 to extract contextualized per-residue hidden states. The hidden states are mean-pooled across valid positions to produce a fixed-length sequence descriptor. Per-position 20-way amino-acid logits over the canonical order ACDEFGHIKLMNPQRSTVWY are also returned on request.
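
As a rough illustration of the mean-pooling step, the sketch below averages per-residue vectors over the positions flagged as valid. The function name and the toy data are assumptions for the example; which positions the tool treats as valid (for instance, whether special tokens or padding are excluded) is not specified here.

```python
def mean_pool(hidden_states, valid_mask):
    """Average per-residue vectors over positions where valid_mask is True."""
    kept = [vec for vec, ok in zip(hidden_states, valid_mask) if ok]
    if not kept:
        raise ValueError("no valid positions to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy input: 4 positions x 3 dims; the last position stands in for padding.
hidden = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [9.0, 9.0, 9.0]]
mask = [True, True, True, False]
pooled = mean_pool(hidden, mask)
```

The pooled vector has the same width as the hidden states regardless of sequence length, which is what makes it usable as a fixed-length descriptor.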

API Reference

Source
sequences
List[string]
required
Protein sequence(s) to process. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.
Source
model_checkpoint
enum
default:"esm2_t33_650M_UR50D"
ESM2 weights variant. Sizes range from 8M (320-dim, fastest) to 15B (5120-dim, highest quality). The 650M variant offers a good speed/quality trade-off. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
return_logits
boolean
default:"False"
Include per-position logits in the output (large; disable to save memory).
repr_layer
integer
default:"-1"
Transformer layer index for embeddings. -1 selects the last (top) layer; uses HuggingFace hidden_states indexing where 0 is the embedding-layer output and N is transformer layer N.
verbose
integer
default:"0"
Print status messages during model execution.
device
string
default:"cuda"
Device to run the model on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
batch_size
integer
default:"1"
Number of sequences to process in parallel. Larger batches improve throughput but require more GPU memory.
Source
results
List[SequenceEmbedding]
required
Per-sequence embedding results. Each SequenceEmbedding contains:

Applications

The mean-pooled embedding is a standard learned protein representation for downstream supervised tasks like clustering, classification, and regression on protein properties. The same embeddings also power similarity search through cosine similarity on the mean vector. The per-position logits support variant-effect screening by comparing wild-type and mutant log-probabilities at each position. The underlying attention maps are themselves rich enough to recover residue-residue contacts without explicit supervision.
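
Similarity search over mean-pooled vectors reduces to a cosine ranking. The sketch below assumes the embeddings have already been extracted into plain lists of floats; the function names and the toy library are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, library):
    """Return library keys sorted from most to least similar to `query`."""
    return sorted(
        library,
        key=lambda name: cosine_similarity(query, library[name]),
        reverse=True,
    )

# Toy 2-dimensional "embeddings"; real ones have hundreds to thousands of dims.
library = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
ranking = rank_by_similarity([1.0, 0.0], library)
```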

Usage Tips

  • The last transformer layer carries the richest bidirectional context. repr_layer chooses which layer to read for the mean-pooled embedding; the default -1 selects the last layer and is the standard pick for downstream classification, regression, and variant-effect work. Earlier layers can outperform the top on certain probes (contact prediction is the canonical example).
  • Per-position logits are large and slow to materialize. Enabling return_logits adds a seq_len × 20 float tensor per sequence to the output, dominating wall time on long inputs. Leave it False unless you actually need the per-position distribution.
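
To make the repr_layer indexing concrete, the sketch below maps a layer argument onto an index into a hidden_states tuple with N + 1 entries (embedding output plus N transformer layers). The helper is hypothetical, and treating negative values like Python sequence indices is an assumption beyond the documented -1 case.

```python
def resolve_repr_layer(repr_layer, num_transformer_layers):
    """Map repr_layer onto an index into a hidden_states tuple of length N + 1."""
    num_entries = num_transformer_layers + 1  # index 0 is the embedding-layer output
    index = repr_layer if repr_layer >= 0 else num_entries + repr_layer
    if not 0 <= index < num_entries:
        raise IndexError(f"repr_layer {repr_layer} is out of range for {num_entries} entries")
    return index

# esm2_t33_650M_UR50D has 33 transformer layers, so -1 resolves to index 33.
last_layer = resolve_repr_layer(-1, 33)
embedding_output = resolve_repr_layer(0, 33)
```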

ESM2 Sampling (esm2-sample)

Selects positions to mutate via a specifiable masking strategy, replaces them with <mask>, and resamples from ESM-2’s predicted distribution. Two decoding modes are available. The single_pass mode fills every masked position in one forward pass with independent draws. The iterative_refinement mode instead runs a MaskGIT-style multi-round commit loop. Each round of that loop uses a cosine or linear unmask schedule with optional temperature annealing. To target specific positions directly, pre-mask them yourself with _ in the input string. The tool will then fill exactly those positions and skip the masking strategy entirely.
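
Two pieces of the description above lend themselves to a small sketch: pre-masking positions with _ and the per-round commit counts of an unmask schedule. The helper names are hypothetical, positions are assumed to be 0-based, and the cosine form follows the schedule commonly used in MaskGIT; the toolkit's exact schedule may differ.

```python
import math

def premask(sequence, positions, mask_char="_"):
    """Return the sequence with the given 0-based positions replaced by `_`."""
    chars = list(sequence)
    for pos in positions:
        chars[pos] = mask_char
    return "".join(chars)

def masked_remaining(step, num_steps, num_masked, schedule="cosine"):
    """Positions still masked after `step` rounds (1-based)."""
    progress = step / num_steps
    if schedule == "cosine":
        fraction = math.cos(math.pi / 2 * progress)
    elif schedule == "linear":
        fraction = 1.0 - progress
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    return math.floor(fraction * num_masked)

def commits_per_round(num_masked, num_steps, schedule="cosine"):
    """How many masked positions are committed in each round."""
    counts, previous = [], num_masked
    for step in range(1, num_steps + 1):
        remaining = masked_remaining(step, num_steps, num_masked, schedule)
        counts.append(previous - remaining)
        previous = remaining
    return counts

masked_input = premask("MKTAYIAK", [2, 5])
cosine_counts = commits_per_round(10, 4, "cosine")
linear_counts = commits_per_round(10, 4, "linear")
```

Under this form, the cosine schedule commits few positions in early rounds and more later, while the linear schedule spreads commits roughly evenly.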

API Reference

Source
sequences
List[string]
required
Protein sequence(s) with _ at positions to sample. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.
Source
masking_strategy
MaskingStrategy
Positions to mask before sampling.
model_checkpoint
enum
default:"esm2_t33_650M_UR50D"
ESM2 weights variant. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
sampling_method
enum
default:"single_pass"
“single_pass” fills every mask in one forward pass; “iterative_refinement” runs a MaskGIT-style loop driven by the five knobs below. Available options: single_pass, iterative_refinement
temperature
number
default:"1.0"
Softmax temperature.
top_p
number
default:"1.0"
Nucleus threshold (iterative only).
num_steps
integer
default:"20"
Refinement steps (iterative only).
schedule
enum
default:"cosine"
Unmask schedule (iterative only). Available options: cosine, linear
strategy
enum
default:"random"
Per-round commit selection (iterative only). Available options: random, entropy
temperature_annealing
boolean
default:"True"
Anneal toward 0 across rounds (iterative only).
return_logits
boolean
default:"False"
Include per-position logits.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cuda"
Device to run on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
batch_size
integer
default:"1"
Sequences per GPU forward pass.
Source
logits
array
Per-position logits for each sequence. Shape is (num_sequences, seq_len, vocab_size=20). Only present if return_logits=True in config.
sequences
List[string]
required
Sampled or mutated protein sequences. Each sequence is a string of amino acid characters and is a modified version of the input sequence with masked positions changed to model-predicted alternatives.

Applications

This tool drives guided point mutation, variant generation, and infilling at designable sites for protein engineering work. Resampling masked positions from a protein language model is the core operation behind directed-evolution proposals and antibody affinity maturation, which was demonstrated at experimental scale in Hie et al., 2024. It is also the inner loop of MaskGIT-style iterative refinement schemes adapted from image generation (Chang et al., 2022) for biological sequences.

Usage Tips

  • iterative_refinement produces more coherent joint samples than single_pass. It is a multi-round MaskGIT-style commit loop (each round uses a cosine or linear unmask schedule) and is roughly num_steps× slower than the one-shot single_pass mode. Default to it whenever you mask more than a handful of sites.
  • masking_strategy controls which positions get masked before sampling. See the masking strategy README for the available selection methods and tuning knobs. As an alternative to passing a strategy, pre-mask exact positions yourself with _ directly in the input string and the masking strategy is skipped entirely.
  • temperature scales the per-position logits before sampling. Values of 0.5 to 0.7 yield conservative mutations close to the input; values above 1.0 broaden exploration of the model’s distribution.
  • Long-range coherence is weak. ESM-2 samples each position from a per-position distribution rather than from a joint model of the whole sequence, so dependencies between distant residues are not well captured even in iterative mode.
  • ESM-2 was trained as a masked language model, not with a generative objective. Resampling masked positions works for local edits, but the model was optimized for representation rather than de novo generation. For generative workloads (large-scale infilling, sequence design), ESM3 adds an explicit generative training objective and is the better fit.
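
The effect of temperature and top_p on a single masked position can be sketched as follows. This is a generic temperature-scaled softmax with nucleus filtering, not the toolkit's sampler; the function names are made up for the example, and the 20-way logits row is assumed to follow the canonical order ACDEFGHIKLMNPQRSTVWY.

```python
import math
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample_residue(logits, temperature=1.0, top_p=1.0, rng=None):
    """Draw one amino acid from a 20-way logits row."""
    rng = rng or random
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]

cold = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
hot = softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0)
```

Lower temperatures concentrate mass on the top-scoring residue (cold[0] exceeds hot[0]), which is why they keep samples closer to the model's most likely choice.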

ESM2 Scoring (esm2-score)

Computes the masked-language-model pseudo-perplexity for each input sequence. Each position is masked individually, and the model’s log-probability of the true amino acid under bidirectional context is recorded. The per-position scores are then aggregated into per-sequence log-likelihood, average log-likelihood, and perplexity metrics.
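
The bookkeeping behind these metrics is sketched below: one masked variant per position, then aggregation of the per-position log-probabilities. The model call itself is omitted. The helper names are hypothetical, and the use of natural logarithms with perplexity defined as exp of the negative average log-likelihood is the standard convention, assumed here rather than confirmed for this toolkit.

```python
import math

MASK = "<mask>"

def masked_variants(sequence):
    """Yield (position, tokens) with exactly one position masked per variant."""
    tokens = list(sequence)
    for i in range(len(tokens)):
        yield i, tokens[:i] + [MASK] + tokens[i + 1:]

def aggregate(per_position_log_probs):
    """Combine per-position log-probabilities into sequence-level metrics."""
    n = len(per_position_log_probs)
    if n == 0:
        raise ValueError("need at least one scored position")
    log_likelihood = sum(per_position_log_probs)
    avg_log_likelihood = log_likelihood / n
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": math.exp(-avg_log_likelihood),
    }

variants = list(masked_variants("MKT"))
# If the model assigned probability 0.5 to every true residue:
metrics = aggregate([math.log(0.5)] * 4)
```

With these definitions a model that assigns probability 0.5 to each true residue has perplexity 2, and a perfect model has perplexity 1.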

API Reference

Source
sequences
List[string]
required
Protein sequence(s) to score. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.
Source
model_checkpoint
enum
default:"esm2_t33_650M_UR50D"
ESM2 weights variant. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
verbose
integer
default:"0"
Print status messages during scoring.
device
string
default:"cuda"
Device to run the model on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
batch_size
integer
default:"1"
Masked variants per forward pass, pooled across all input sequences. Larger batches improve throughput but use more memory.
return_logits
boolean
default:"False"
Include per-position logits in the output (large; disable to save memory).
Source
scores
List[MaskedModelScoringMetrics]
required
List of scoring outputs, one per input sequence. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields that carry raw model outputs when requested.
Metrics (one set per scores item)
Metric | Type | Range | Availability
log_likelihood | float | ≤ 0.0 | always
avg_log_likelihood | float | ≤ 0.0 | always
perplexity | float | ≥ 1.0 | always

Applications

ESM2 pseudo-perplexity is a standard fitness proxy when ranking variants, filtering generated sequences for naturalness, or comparing engineered constructs against wild type. The same masked log-likelihood difference between wild-type and mutant residues is a canonical zero-shot baseline for variant-effect prediction.
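
The wild-type versus mutant comparison can be written as a difference of log-probabilities at one position. In the sketch below, the logits row is assumed to come from a forward pass with that position masked and to follow the canonical order ACDEFGHIKLMNPQRSTVWY; the function names and toy logits are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over one logits row."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def mutation_score(position_logits, wild_type, mutant):
    """log p(mutant) - log p(wild type) at one position; negative favors wild type."""
    log_probs = log_softmax(position_logits)
    return log_probs[AMINO_ACIDS.index(mutant)] - log_probs[AMINO_ACIDS.index(wild_type)]

# Toy logits: the model prefers A (2.0) over V (1.0) at this position.
logits = [0.0] * 20
logits[AMINO_ACIDS.index("A")] = 2.0
logits[AMINO_ACIDS.index("V")] = 1.0
score = mutation_score(logits, wild_type="A", mutant="V")
```

Because the normalizer cancels, the score equals the raw logit difference, which is about -1.0 in this toy case.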

Usage Tips

  • Pseudo-perplexity is a relative score, not an absolute fitness. It is measured against ESM-2’s training distribution (UniRef50, the natural proteins it saw during pretraining), so it can be biased toward protein families that are more heavily represented there. The metric is also sensitive to length, so it is most useful for comparing closely related sequences of similar length.
  • Ambiguous residues are excluded. Perplexity is computed only over the 20 canonical amino acids; X, B, Z, and similar are dropped from both the log-likelihood sum and the position count.
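
The exclusion rule in the last tip affects both the numerator and the denominator of the average. A minimal sketch, assuming per-position log-probabilities are already available and using hypothetical helper names:

```python
import math

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

def canonical_positions(sequence):
    """Indices of residues that belong to the 20 canonical amino acids."""
    return [i for i, aa in enumerate(sequence) if aa in CANONICAL]

def perplexity_over_canonical(sequence, log_probs):
    """Perplexity over canonical positions only.

    Entries of log_probs at ambiguous positions (X, B, Z, ...) are ignored,
    and those positions do not count toward the average.
    """
    kept = canonical_positions(sequence)
    if not kept:
        raise ValueError("sequence has no canonical residues to score")
    total = sum(log_probs[i] for i in kept)
    return math.exp(-total / len(kept))

sequence = "MKXAB"
log_probs = [math.log(0.5), math.log(0.5), 0.0, math.log(0.5), 0.0]
ppl = perplexity_over_canonical(sequence, log_probs)
```

Here only positions 0, 1, and 3 are scored, so the average is taken over three residues rather than five.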

ESM2 Gradient (esm2-gradient)

Computes the gradient of the mean masked negative log-likelihood with respect to a relaxed (L, 20) input distribution over the canonical amino-acid order ACDEFGHIKLMNPQRSTVWY. The ESM-2 weights are kept frozen throughout. The relaxed distribution is mixed against ESM-2’s per-residue token embeddings to form a soft input. Each amino-acid position is then masked in turn, and a per-chunk backward pass accumulates the gradient. An optional Straight-Through Estimator runs the forward pass on hard one-hot tokens while still routing gradients through the soft probabilities.
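
The soft-input construction amounts to a probability-weighted mix of token embeddings, one row per position. The sketch below uses a toy 20 x 2 table in place of ESM-2's real embeddings, and the helper names are hypothetical. It shows only the forward inputs; the backward pass needs an autograd framework, where the straight-through trick is commonly written as hard + soft - soft.detach().

```python
def soft_embed(probs, embedding_table):
    """Mix an (L, 20) distribution against a (20, D) embedding table -> (L, D)."""
    dim = len(embedding_table[0])
    return [
        [sum(p * embedding_table[a][d] for a, p in enumerate(row)) for d in range(dim)]
        for row in probs
    ]

def hard_one_hot(row):
    """One-hot vector at the arg-max of a probability row (first index wins ties)."""
    best = max(range(len(row)), key=lambda a: row[a])
    return [1.0 if a == best else 0.0 for a in range(len(row))]

# Toy stand-in for the frozen token-embedding table: 20 amino acids x 2 dims.
embedding_table = [[float(a), float(a % 3)] for a in range(20)]

# Relaxed sequence of length 2.
relaxed = [[0.0] * 20 for _ in range(2)]
relaxed[0][0], relaxed[0][2] = 0.5, 0.5
relaxed[1][5], relaxed[1][7] = 0.9, 0.1

# Default mode: blended embeddings feed the model directly.
soft_input = soft_embed(relaxed, embedding_table)

# STE mode: the forward pass sees hard one-hot rows instead.
ste_forward_input = soft_embed([hard_one_hot(r) for r in relaxed], embedding_table)
```

In the soft mode the model sees blends that correspond to no real residue, whereas the STE forward input always equals an actual token embedding.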

API Reference

Source
logits
List[array]
required
Relaxed protein sequence state with shape (L, 20) in canonical amino-acid order ACDEFGHIKLMNPQRSTVWY. L must be ≤ 1022 (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.
temperature
number
Optional softmax temperature. When set, applies softmax(input / temperature) before computing the gradient. When None (default), the input is used as-is.
Source
model_checkpoint
enum
default:"esm2_t33_650M_UR50D"
ESM2 weights variant. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D
use_ste
boolean
default:"False"
Straight-Through Estimator: hard one-hot in the forward pass with gradients flowing through soft probabilities. When False, uses soft blended embeddings directly.
compute_gradient
boolean
default:"True"
Run backward pass and return gradient. Set False for forward-only log-likelihood scoring.
batch_size
integer
AA positions per forward pass for batched PLL. None selects the backend default.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cuda"
Device to run the model on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
gradient
array
Gradient w.r.t. input logits, or None when compute_gradient=False.
loss
number
required
Mean negative log-likelihood over AA positions.
metrics
Dict[string, any]
Log-likelihood, perplexity, sequence length, and objective details.
vocab
List[string]
required
Amino-acid column ordering for the input logits.

Applications

This tool exposes ESM-2 as a differentiable, structure-free protein-language-model loss for use inside MCMC, gradient descent, or any other optimization loop over relaxed protein sequences. It is most often used as a naturalness prior in continuous design pipelines, including latent Bayesian optimization frameworks and discrete walk-jump sampling approaches for de novo protein design.

Usage Tips

  • temperature controls how the raw input is converted into a distribution. With a value set, the tool applies softmax(logits / T) before the forward pass; leave it None (the default) if the input already sums to 1 per position.
  • use_ste enables the Straight-Through Estimator. The forward pass then runs on hard one-hot tokens while gradients still route through the soft probabilities, giving stronger guidance toward discrete sequences. Leave it off for smooth optimization over the relaxed simplex.
  • compute_gradient toggles whether the backward pass runs. When set to False, the gradient field is None, but loss and metrics (log-likelihood, perplexity, and so on) are still populated. Useful for ranking MCMC proposals without paying the backward cost.
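
A returned loss and gradient would typically be consumed by an outer optimization loop. The sketch below shows plain gradient descent over (L, 20) logits; toy_loss_and_gradient is a stand-in quadratic objective used only so the example runs without a model, and it is not the tool's interface.

```python
def toy_loss_and_gradient(logits):
    """Placeholder objective: 0.5 * sum(x^2), whose gradient is x itself.

    In a real pipeline this callable would wrap a call to the gradient tool
    and return its loss and gradient fields.
    """
    loss = 0.5 * sum(x * x for row in logits for x in row)
    gradient = [list(row) for row in logits]
    return loss, gradient

def descend(logits, loss_and_gradient, steps=10, learning_rate=0.1):
    """Run gradient descent; return final logits and the loss at each step."""
    current = [list(row) for row in logits]
    history = []
    for _ in range(steps):
        loss, gradient = loss_and_gradient(current)
        history.append(loss)
        current = [
            [x - learning_rate * g for x, g in zip(row, grad_row)]
            for row, grad_row in zip(current, gradient)
        ]
    return current, history

# Two positions x 20 amino acids, arbitrary starting logits.
start = [[float(a % 4) for a in range(20)] for _ in range(2)]
final_logits, loss_history = descend(start, toy_loss_and_gradient)
```

For proposal ranking, where only the loss matters, the same loop structure applies with the gradient update removed, which mirrors the compute_gradient=False case.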

Toolkit Notes

These apply to every ESM-2 tool in this toolkit (esm2-embedding, esm2-sample, esm2-score, esm2-gradient).
  • Different ESM-2 checkpoints produce different embedding sizes. Downstream models fitted on one checkpoint’s embeddings will not transfer to another without re-fitting; pick one and stick with it for an analysis.
  • Smaller checkpoints run faster. The 150M and 35M variants are significantly faster than the 650M default, at some cost in representation quality.
  • Max sequence length is 1022 residues. ESM-2’s positional encoding caps inputs at 1022 residues, and the tools raise ValueError on longer inputs rather than truncating.
  • batch_size controls memory usage across the toolkit. Lower it if you OOM; raise it for short-sequence throughput. One nuance: for esm2-score, batch_size counts masked variants pooled across all input sequences rather than sequences themselves (each input contributes L masked variants).
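
The batch_size nuance for esm2-score is easiest to see as arithmetic. The sketch below estimates the number of forward passes under two simplifying assumptions: every residue contributes one masked variant (the exclusion of ambiguous residues is ignored) and a final partial batch counts as one pass. The helper names are hypothetical.

```python
import math

MAX_RESIDUES = 1022

def check_lengths(sequences):
    """Mirror the documented behavior: over-length inputs raise ValueError."""
    for i, seq in enumerate(sequences):
        if len(seq) > MAX_RESIDUES:
            raise ValueError(
                f"sequence {i} has {len(seq)} residues; the limit is {MAX_RESIDUES}"
            )

def score_forward_passes(sequences, batch_size):
    """Estimated forward passes when masked variants are pooled across inputs."""
    check_lengths(sequences)
    total_variants = sum(len(seq) for seq in sequences)  # L variants per input
    return math.ceil(total_variants / batch_size)

# 100 + 50 masked variants at 32 per pass.
passes = score_forward_passes(["A" * 100, "C" * 50], batch_size=32)
```

Under these assumptions the two inputs need 5 passes, so scoring cost grows with total residue count rather than with the number of sequences.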
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.