ESM2 - Proto

License: ESM2 is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

Proto is not affiliated with Meta AI or Biohub. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.

facebookresearch/esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

Evolutionary-scale prediction of atomic-level protein structure with a language model

Zeming Lin, Halil Akin, … Yaniv Shmueli

Science (2023)

@article{lin2023esm2,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
  journal={Science},
  volume={379},
  number={6637},
  pages={1123--1130},
  year={2023},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.ade2574}
}

proto-bio/proto-tools/proto_tools/tools/masked_models/esm2

Function	Description
`run_esm2_embeddings()`	Extract protein sequence embeddings and logits using ESM2 (GPU)
`run_esm2_gradient()`	Compute ESM2 masked pseudo-log-likelihood gradient for relaxed protein sequences (GPU)
`run_esm2_sample()`	Sample masked positions in protein sequences using ESM2 language model (GPU)
`run_esm2_score()`	Score protein sequences using ESM2 language model (GPU)

Background

In 2023, Lin et al. introduced ESM-2, a family of Transformer encoders trained with a BERT-style masked-language-modeling (MLM) objective. Training used UniRef50, a clustered subset of UniProt covering roughly 65 million unique protein sequences. A central focus of the paper was the impact of scale, treated as the experimental variable across six model checkpoints spanning more than three orders of magnitude (8M, 35M, 150M, 650M, 3B, and 15B parameters).

Unlike autoregressive language models, which predict each token from preceding context only, MLM lets every residue attend to its full sequence context in both directions. At each training step, a randomly generated mask covers 15% of input residues and replaces those tokens with a <mask> symbol. The model is then trained to predict the original amino acid from the surrounding bidirectional context. No structural, functional, or alignment supervision is used.

ESM-2 has since become a de facto sequence representation model for protein engineering. Its direct successor, ESM3 (Hayes et al., 2025), developed at EvolutionaryScale, extends the recipe into a multimodal generative model that jointly handles sequence, structure, and function tracks via discrete diffusion. ESM-2 remains one of the lightest and most widely deployed protein language models. Within this toolkit, the 650M checkpoint (esm2_t33_650M_UR50D) is a standard quality/speed tradeoff and is the default for every tool.
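
The masking step described above can be illustrated with a minimal sketch. It is a simplification rather than the training code: the function name `mask_for_mlm`, the fixed mask count (instead of a per-residue random draw), and the example sequence are all assumptions made for illustration.

```python
import random

MASK = "<mask>"

def mask_for_mlm(sequence, mask_fraction=0.15, seed=0):
    """Replace about mask_fraction of residues with <mask>.

    Returns (tokens, targets), where targets maps each masked
    0-indexed position to the original residue the model must predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return tokens, targets

# Hypothetical example sequence (33 residues).
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens, targets = mask_for_mlm(sequence)
```

The training loss is then computed only at the positions listed in `targets`, which is what makes the objective self-supervised: the labels come from the sequence itself.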

Tools

ESM2 Embeddings (`esm2-embedding`)

Runs a single forward pass over ESM-2 to extract contextualized per-residue hidden states. The hidden states are mean-pooled across valid positions to produce a fixed-length sequence descriptor. Per-position 20-way amino-acid logits over the canonical order ACDEFGHIKLMNPQRSTVWY are also returned on request.
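
As a rough illustration of the pooling step, the sketch below averages per-residue vectors over the positions marked valid by an attention mask. The helper name `mean_pool` and the toy 2-dimensional vectors are hypothetical; whether special tokens count as valid positions in the toolkit is not stated here, so the sketch simply trusts the mask it is given.

```python
def mean_pool(hidden_states, attention_mask):
    """Average per-residue vectors over positions where attention_mask == 1."""
    valid = [vec for vec, keep in zip(hidden_states, attention_mask) if keep == 1]
    if not valid:
        raise ValueError("attention_mask selects no positions")
    dim = len(valid[0])
    return [sum(vec[d] for vec in valid) / len(valid) for d in range(dim)]

# Toy example: three positions, the last one is padding and is ignored.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(hidden, mask)
```

Because padding is excluded, the pooled vector has the same length regardless of sequence length, which is what makes it usable as a fixed-length descriptor.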

API Reference

Input: ESM2EmbeddingsInput

sequences

List[string]

required

Protein sequence(s) to process. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.

Config: ESM2EmbeddingsConfig

model_checkpoint

enum

default:"esm2_t33_650M_UR50D"

ESM2 weights variant. Sizes range from 8M (320-dim, fastest) to 15B (5120-dim, highest quality). The 650M variant offers a good speed/quality trade-off. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D

return_logits

boolean

default:"False"

Include per-position logits in the output (large; disable to save memory).

repr_layer

integer

default:"-1"

Transformer layer index for embeddings. -1 selects the last (top) layer; uses HuggingFace hidden_states indexing where 0 is the embedding-layer output and N is transformer layer N.

verbose

integer

default:"0"

Print status messages during model execution.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

batch_size

integer

default:"1"

Number of sequences to process in parallel. Larger batches improve throughput but require more GPU memory.

Output: ESM2EmbeddingsOutput

results

List[SequenceEmbedding]

required

Per-sequence embedding results. Each SequenceEmbedding contains:

mean_embedding

List[number]

required

Mean-pooled embedding vector for one sequence.

attention_mask

List[integer]

required

Binary mask indicating valid positions (1) vs padding (0).

logits

array

Optional per-position amino acid logits for one sequence.

projection

Projection2D

Optional 2D coordinate from a UMAP projection of all embeddings in the same call. Populated when n_sequences >= 4; None otherwise (single-point or 2-3-point UMAP is meaningless).

Applications

The mean-pooled embedding is a standard learned protein representation for downstream supervised tasks like clustering, classification, and regression on protein properties. The same embeddings also power similarity search through cosine similarity on the mean vector. The per-position logits support variant-effect screening by comparing wild-type and mutant log-probabilities at each position. The underlying attention maps are themselves rich enough to recover residue-residue contacts without explicit supervision.
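
A similarity search over mean-pooled embeddings might look like the following sketch. The function names and the three toy vectors are hypothetical, and real ESM-2 embeddings have hundreds to thousands of dimensions rather than three.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, library):
    """Return library keys sorted from most to least similar to query."""
    return sorted(
        library,
        key=lambda name: cosine_similarity(query, library[name]),
        reverse=True,
    )

# Toy library of mean embeddings keyed by a made-up sequence name.
library = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
ranked = rank_by_similarity([1.0, 0.0, 0.0], library)
```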

Usage Tips

The last transformer layer carries the richest bidirectional context. repr_layer chooses which layer to read for the mean-pooled embedding; the default -1 selects the last layer and is the standard pick for downstream classification, regression, and variant-effect work. Earlier layers can outperform the top on certain probes (contact prediction is the canonical example).
Per-position logits are large and slow to materialize. Enabling return_logits adds a seq_len × 20 float tensor per sequence to the output, dominating wall time on long inputs. Leave it False unless you actually need the per-position distribution.
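
The repr_layer indexing convention can be made concrete with a small sketch. The helper `resolve_repr_layer` is hypothetical, and it assumes Python-style negative indexing over a hidden_states list of length N + 1 (embedding-layer output plus N transformer layers); only the -1 case is documented above.

```python
def resolve_repr_layer(repr_layer, num_transformer_layers):
    """Map repr_layer to an index into a hidden_states list of length N + 1."""
    num_states = num_transformer_layers + 1  # index 0 is the embedding-layer output
    index = repr_layer if repr_layer >= 0 else num_states + repr_layer
    if not 0 <= index < num_states:
        raise IndexError(f"repr_layer {repr_layer} is out of range for {num_states} states")
    return index

# esm2_t33_650M_UR50D has 33 transformer layers, so -1 resolves to index 33.
last_layer = resolve_repr_layer(-1, 33)
embedding_output = resolve_repr_layer(0, 33)
```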

ESM2 Sampling (`esm2-sample`)

Selects positions to mutate via a specifiable masking strategy, replaces them with <mask>, and resamples from ESM-2’s predicted distribution. Two decoding modes are available. The single_pass mode fills every masked position in one forward pass with independent draws. The iterative_refinement mode instead runs a MaskGIT-style multi-round commit loop. Each round of that loop uses a cosine or linear unmask schedule with optional temperature annealing. To target specific positions directly, pre-mask them yourself with _ in the input string. The tool will then fill exactly those positions and skip the masking strategy entirely.
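
Pre-masking by hand can be sketched as below. The helper `premask` is hypothetical and only mimics a uniform random selection that leaves 1-indexed fixed positions untouched; it is not the toolkit's masking implementation.

```python
import random

def premask(sequence, num_mutations, fixed_positions=(), seed=0):
    """Replace num_mutations randomly chosen residues with '_'.

    fixed_positions are 1-indexed and are never masked.
    """
    fixed = set(fixed_positions)
    designable = [i for i in range(len(sequence)) if (i + 1) not in fixed]
    if num_mutations > len(designable):
        raise ValueError("not enough designable positions to mask")
    rng = random.Random(seed)
    chosen = set(rng.sample(designable, num_mutations))
    return "".join("_" if i in chosen else aa for i, aa in enumerate(sequence))

# Keep the first three residues fixed and mask three of the remaining seven.
masked = premask("MKTAYIAKQR", num_mutations=3, fixed_positions=[1, 2, 3])
```

A string produced this way can be passed as an input sequence, in which case the tool fills exactly the `_` positions.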

API Reference

Input: ESM2SampleInput

sequences

List[string]

required

Protein sequence(s) with _ at positions to sample. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.

Config: ESM2SampleConfig

masking_strategy

MaskingStrategy

Positions to mask before sampling.

Each MaskingStrategy contains:

method

enum

default:"random"

Scoring method for position selection. "random": uniform random, "entropy": highest model uncertainty, "max-logit": lowest model confidence. Available options: random, entropy, max-logit

num_mutations

integer

Exact number of positions to mask per sequence.

mask_fraction

number

Fraction of designable positions to mask (e.g. 0.15 for ~15%).

fixed_positions

array

1-indexed positions that must NOT be masked. Applied uniformly to all sequences.

temperature

number

default:"1.0"

Temperature for position selection. < 1.0 is greedy, 1.0 uses scores as-is, > 1.0 is more uniform.

model_name

string

Which masked model to use for scoring. Defaults to the sampling tool’s model when unset.

model_checkpoint

string

Model checkpoint override (uses tool default if None).

model_checkpoint

enum

default:"esm2_t33_650M_UR50D"

ESM2 weights variant. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D

sampling_method

enum

default:"single_pass"

“single_pass” fills every mask in one forward pass; “iterative_refinement” runs an iterative MaskGIT-style loop driven by the five knobs below. Available options: single_pass, iterative_refinement

temperature

number

default:"1.0"

Softmax temperature.

top_p

number

default:"1.0"

Nucleus threshold (iterative only).

num_steps

integer

default:"20"

Refinement steps (iterative only).

schedule

enum

default:"cosine"

Unmask schedule (iterative only). Available options: cosine, linear

strategy

enum

default:"random"

Per-round commit selection (iterative only). Available options: random, entropy

temperature_annealing

boolean

default:"True"

Anneal toward 0 across rounds (iterative only).

return_logits

boolean

default:"False"

Include per-position logits.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

batch_size

integer

default:"1"

Sequences per GPU forward pass.

Output: ESM2SampleOutput

logits

array

Per-position logits for each sequence. Shape is (num_sequences, seq_len, vocab_size=20). Only present if return_logits=True in config.

sequences

List[string]

required

Sampled or mutated protein sequences. Each sequence is a string of amino acid characters and is a modified version of the input sequence with masked positions changed to model-predicted alternatives.

Applications

This tool drives guided point mutation, variant generation, and infilling at designable sites for protein engineering work. Resampling masked positions from a protein language model is the core operation behind directed-evolution proposals and antibody affinity maturation, which was demonstrated at experimental scale in Hie et al., 2024. It is also the inner loop of MaskGIT-style iterative refinement schemes adapted from image generation (Chang et al., 2022) for biological sequences.
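
To give a feel for how a MaskGIT-style unmask schedule paces the commit loop, the sketch below counts how many positions remain masked after each round. It follows the schedule shape from the MaskGIT paper (mask ratio cos(π/2 · t/T) for cosine, 1 − t/T for linear); the exact formula used by this toolkit is not documented here, so treat the numbers as illustrative and the function names as hypothetical.

```python
import math

def remaining_masked(num_masked, num_steps, schedule="cosine"):
    """Number of positions still masked after each of num_steps rounds."""
    counts = []
    for step in range(1, num_steps + 1):
        progress = step / num_steps
        if schedule == "cosine":
            ratio = math.cos(math.pi / 2 * progress)
        elif schedule == "linear":
            ratio = 1.0 - progress
        else:
            raise ValueError(f"unknown schedule: {schedule}")
        counts.append(math.floor(num_masked * ratio))
    return counts

def commits_per_round(num_masked, num_steps, schedule="cosine"):
    """Number of positions committed (unmasked) in each round."""
    left = [num_masked] + remaining_masked(num_masked, num_steps, schedule)
    return [before - after for before, after in zip(left, left[1:])]

cosine_commits = commits_per_round(20, 4, "cosine")
linear_commits = commits_per_round(20, 4, "linear")
```

Under these assumptions the cosine schedule commits few positions early, when context is sparse, and more in later rounds, while the linear schedule commits the same number every round.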

Usage Tips

iterative_refinement produces more coherent joint samples than single_pass. It is a multi-round MaskGIT-style commit loop (each round uses a cosine or linear unmask schedule) and is roughly num_steps× slower than the one-shot single_pass mode. Prefer it whenever you mask more than a handful of sites; the configured default remains single_pass.
masking_strategy controls which positions get masked before sampling. See the masking strategy README for the available selection methods and tuning knobs. As an alternative to passing a strategy, pre-mask exact positions yourself with _ directly in the input string and the masking strategy is skipped entirely.
temperature scales the per-position logits before sampling. Values of 0.5 to 0.7 yield conservative mutations close to the input; values above 1.0 broaden exploration of the model’s distribution.
Long-range coherence is weak. Masked positions are sampled from per-position conditional distributions rather than a single joint distribution, so dependencies between distant residues are not reliably captured even in iterative mode.
ESM-2 was trained as a masked language model, not with a generative objective. Resampling masked positions works for local edits, but the model was optimized for representation rather than de novo generation. For generative workloads (large-scale infilling, sequence design), ESM3 adds an explicit generative training objective and is the better fit.
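
The effect of temperature and top_p on a single masked position can be sketched as follows. This is a generic temperature-scaled softmax with nucleus filtering, not the toolkit's sampler; the function names and the example logits are made up for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample_residue(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one amino acid from 20-way logits in canonical order."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]

# Made-up logits that strongly favor glutamate (index 3 in canonical order).
example_logits = [0.0] * 20
example_logits[3] = 10.0
residue = sample_residue(example_logits, temperature=0.5, top_p=0.5, rng=random.Random(0))
```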

ESM2 Scoring (`esm2-score`)

Computes the masked-language-model pseudo-perplexity for each input sequence. Each position is masked individually, and the model’s log-probability of the true amino acid under bidirectional context is recorded. The per-position scores are then aggregated into per-sequence log-likelihood, average log-likelihood, and perplexity metrics.
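
The aggregation step can be sketched as below, assuming natural-log probabilities and the usual definition of pseudo-perplexity as the exponential of the negative average log-likelihood. The function name is hypothetical; the dictionary keys mirror the metric names reported by the tool.

```python
import math

def aggregate_scores(position_log_probs):
    """Combine per-position log-probabilities into sequence-level metrics.

    position_log_probs[i] is log p(true residue at i | sequence with i masked).
    """
    if not position_log_probs:
        raise ValueError("need at least one scored position")
    log_likelihood = sum(position_log_probs)
    avg_log_likelihood = log_likelihood / len(position_log_probs)
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": math.exp(-avg_log_likelihood),
    }

# If the model assigns probability 0.5 to every true residue, perplexity is 2.
scores = aggregate_scores([math.log(0.5)] * 4)
```

A perplexity of 1 would mean every true residue was predicted with certainty, and a uniform guess over 20 amino acids corresponds to a perplexity of 20.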

API Reference

Input: ESM2ScoringInput

sequences

List[string]

required

Protein sequence(s) to score. Each must be ≤ 1022 residues (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.

Config: ESM2ScoringConfig

model_checkpoint

enum

default:"esm2_t33_650M_UR50D"

ESM2 weights variant. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D

verbose

integer

default:"0"

Print status messages during scoring.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

batch_size

integer

default:"1"

Masked variants per forward pass, pooled across all input sequences. Larger batches improve throughput but use more memory.

return_logits

boolean

default:"False"

Include per-position logits in the output (large; disable to save memory).

Output: MaskedModelScoringOutput

scores

List[MaskedModelScoringMetrics]

required

List of scoring outputs, one per input sequence. Each entry is a Metrics subclass with scalar metrics (accessed via score.perplexity or score["perplexity"]) plus declared logits / vocab fields that carry raw model outputs when requested.

logits

array

Per-position logits array (seq_len, vocab_size). None unless return_logits=True.

vocab

array

Token ordering for logits.

primary_metric

string

Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

Metrics (one set per scores item)

Metric	Type	Range	Availability
`log_likelihood`	float	≤ 0.0	always
`avg_log_likelihood`	float	≤ 0.0	always
`perplexity`	float	≥ 1.0	always

Applications

ESM2 pseudo-perplexity is a standard fitness proxy when ranking variants, filtering generated sequences for naturalness, or comparing engineered constructs against wild type. The same masked log-likelihood difference between wild-type and mutant residues is a canonical zero-shot baseline for variant-effect prediction.
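
The wild-type versus mutant comparison reduces to a log-odds difference at the mutated position. The sketch below assumes 20-way logits in the canonical order ACDEFGHIKLMNPQRSTVWY taken from a forward pass with that position masked; the function names and the example logits are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def variant_score(position_logits, wild_type, mutant):
    """log p(mutant) - log p(wild_type); negative means the model prefers wild type."""
    log_probs = log_softmax(position_logits)
    return log_probs[AMINO_ACIDS.index(mutant)] - log_probs[AMINO_ACIDS.index(wild_type)]

# Made-up logits where alanine is favored over glycine at this position.
example_logits = [0.0] * 20
example_logits[AMINO_ACIDS.index("A")] = 2.0
example_logits[AMINO_ACIDS.index("G")] = 0.5
score = variant_score(example_logits, wild_type="A", mutant="G")
```

Because the normalizer cancels, the score equals the raw logit difference (here 0.5 − 2.0), so it does not depend on the logits assigned to the other 18 amino acids.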

Usage Tips

Pseudo-perplexity is a relative score, not an absolute fitness. It is measured against ESM-2’s training distribution (UniRef50, the natural proteins seen during pretraining), so it can be biased toward protein families that are heavily represented there. The metric is also sensitive to length, so it is most useful for comparing closely related sequences of similar length.
Ambiguous residues are excluded. Perplexity is computed only over the 20 canonical amino acids; X, B, Z, and similar are dropped from both the log-likelihood sum and the position count.

ESM2 Gradient (`esm2-gradient`)

Computes the gradient of the mean masked negative log-likelihood with respect to a relaxed (L, 20) input distribution over the canonical amino-acid order ACDEFGHIKLMNPQRSTVWY. The ESM-2 weights are kept frozen throughout. The relaxed distribution is mixed against ESM-2’s per-residue token embeddings to form a soft input. Each amino-acid position is then masked in turn, and a per-chunk backward pass accumulates the gradient. An optional Straight-Through Estimator runs the forward on hard one-hot tokens while still routing gradients through the soft probabilities.
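
The forward side of this relaxation can be sketched without an autograd library. The example below shows only how a relaxed distribution is mixed against a token embedding table, and what the hard one-hot input looks like when a Straight-Through Estimator is used; it does not compute gradients. The function names and the toy 3-token, 2-dimensional embedding table are assumptions for illustration (the real table has 20 amino-acid rows).

```python
def soft_embed(probs, embedding_table):
    """Mix token embeddings by probability: (L, V) x (V, d) -> (L, d)."""
    dim = len(embedding_table[0])
    return [
        [sum(p * embedding_table[a][d] for a, p in enumerate(row)) for d in range(dim)]
        for row in probs
    ]

def harden(probs):
    """Replace each row with a one-hot vector at its argmax (STE forward input)."""
    hard = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)
        hard.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return hard

# Toy vocabulary of 3 tokens with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft_input = soft_embed([[0.5, 0.5, 0.0]], table)          # blend of tokens 0 and 1
hard_input = soft_embed(harden([[0.2, 0.7, 0.1]]), table)  # exactly token 1
```

In an autograd framework, a Straight-Through Estimator is commonly written so that the forward value equals the hard input while the backward pass sees the soft one; how this toolkit implements it internally is not shown here.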

API Reference

Input: ESM2GradientInput

logits

List[array]

required

Relaxed protein sequence state with shape (L, 20) in canonical amino-acid order ACDEFGHIKLMNPQRSTVWY. L must be ≤ 1022 (ESM-2’s positional-encoding cap); over-length inputs raise ValueError.

temperature

number

Optional softmax temperature. When set, applies softmax(input / temperature) before computing the gradient. When None (default), the input is used as-is.

Config: ESM2GradientConfig

model_checkpoint

enum

default:"esm2_t33_650M_UR50D"

ESM2 weights variant. Available options: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, esm2_t36_3B_UR50D, esm2_t48_15B_UR50D

use_ste

boolean

default:"False"

Straight-Through Estimator: hard one-hot in the forward pass with gradients flowing through soft probabilities. When False, uses soft blended embeddings directly.

compute_gradient

boolean

default:"True"

Run backward pass and return gradient. Set False for forward-only log-likelihood scoring.

batch_size

integer

AA positions per forward pass for batched PLL. None selects the backend default.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Output: ESM2GradientOutput

gradient

array

Gradient w.r.t. input logits, or None when compute_gradient=False.

loss

number

required

Mean negative log-likelihood over AA positions.

metrics

Dict[string, any]

Log-likelihood, perplexity, sequence length, and objective details.

vocab

List[string]

required

Amino-acid column ordering for the input logits.

Applications

This tool exposes ESM-2 as a differentiable, structure-free protein-language-model loss for use inside MCMC, gradient descent, or any other optimization loop over relaxed protein sequences. It is most often used as a naturalness prior in continuous design pipelines, including latent Bayesian optimization frameworks and discrete walk-jump sampling approaches for de novo protein design.

Usage Tips

temperature controls how the raw input is converted into a distribution. With a value set, the tool applies softmax(logits / T) before the forward pass; leave it None (the default) if the input already sums to 1 per position.
use_ste enables the Straight-Through Estimator. The forward then runs on hard one-hot tokens while gradients still route through the soft probabilities, giving stronger guidance toward discrete sequences. Leave it off for smooth optimization over the relaxed simplex.
compute_gradient toggles whether the backward pass runs. When set to False, the gradient field is None, but loss and metrics (log-likelihood, perplexity, and so on) are still populated. Useful for ranking MCMC proposals without paying the backward cost.
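
One way the forward-only loss could be used to rank MCMC proposals is a standard Metropolis acceptance test, sketched below. This is a generic textbook rule, not something the toolkit provides: the function name, the sampling temperature, and the idea of feeding it the loss field from two forward-only calls are all assumptions.

```python
import math
import random

def metropolis_accept(current_loss, proposed_loss, temperature=1.0, rng=random):
    """Accept a proposal that lowers the loss; accept a worse one with
    probability exp(-(proposed_loss - current_loss) / temperature)."""
    delta = proposed_loss - current_loss
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

# A proposal with lower mean negative log-likelihood is always accepted.
accepted = metropolis_accept(current_loss=2.0, proposed_loss=1.5)
```

Since only the scalar loss is compared, each proposal costs one forward-only evaluation and no backward pass.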

Toolkit Notes

These apply to every ESM-2 tool in this toolkit (esm2-embedding, esm2-sample, esm2-score, esm2-gradient).

Different ESM-2 checkpoints produce different embedding sizes. Downstream tasks built on one checkpoint will not transfer to another without re-fitting; pick one and stick with it for an analysis.
Smaller checkpoints run faster. The 150M and 35M variants are significantly faster than the 650M default, at some cost in representation quality.
Max sequence length is 1022 residues. ESM-2’s positional encoding caps inputs at 1022 residues, and the tools raise ValueError on longer inputs rather than truncating.
batch_size controls memory usage across the toolkit. Lower it if you OOM; raise it for short-sequence throughput. One nuance: for esm2-score, batch_size counts masked variants pooled across all input sequences rather than sequences themselves (each input contributes L masked variants).
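
The batch_size nuance for esm2-score can be turned into a quick estimate of how many forward passes a scoring call needs. The sketch assumes each input of length L contributes exactly L masked variants, as stated above, and ignores any positions the tool may skip; the function name is hypothetical.

```python
import math

def score_forward_passes(sequence_lengths, batch_size):
    """Estimate forward passes when masked variants are pooled across sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total_variants = sum(sequence_lengths)  # one masked variant per residue
    return math.ceil(total_variants / batch_size)

# Three sequences of 120, 300, and 80 residues give 500 masked variants.
passes_default = score_forward_passes([120, 300, 80], batch_size=1)
passes_batched = score_forward_passes([120, 300, 80], batch_size=64)
```

Under these assumptions, raising batch_size from 1 to 64 cuts the estimate from 500 forward passes to 8, at the price of holding 64 masked variants in GPU memory at once.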

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.