ESM C (Cambrian)

License: ESM C (Cambrian) is distributed under a custom license (the Cambrian Open License Agreement), which restricts commercial use and may require explicit attribution. Refer to the license for full terms.

Proto is not affiliated with Biohub. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.

GitHub repository: evolutionaryscale/esm (2.3k stars)

HuggingFace model: EvolutionaryScale/esmc-300m-2024-12

@misc{evolutionaryscale2024esmcambrian,
  title={ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning},
  author={{EvolutionaryScale Team}},
  year={2024},
  url={https://www.evolutionaryscale.ai/blog/esm-cambrian},
  note={Blog post}
}

Tool source: proto-bio/proto-tools/proto_tools/tools/masked_models/esmc

Function	Description
`run_esmc_embeddings()`	Extract protein sequence embeddings and logits using ESM C (Cambrian) (GPU)

Background

ESM C (EvolutionaryScale, 2024) is a protein language model trained with the masked language modeling objective: during training, residues are hidden at random and the model learns to predict the original amino acid from the surrounding residues on both sides. For each residue it produces a contextual numerical representation (an embedding), along with per-position scores (logits) over the 20 standard amino acids. ESM C is distributed in the same esm software package as ESM3, but does not include ESM3’s structure track or sequence-generation capability; it provides only embeddings and per-position scores. Two openly licensed model sizes are wrapped here: esmc_300m (embedding size 960, Cambrian Open License, commercial use permitted) and esmc_600m (embedding size 1152, Cambrian Non-Commercial License, research and internal use only). A larger 6B-parameter ESM C model is available only through EvolutionaryScale’s hosted Forge service and is not exposed by this wrapper.
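To make the per-position scores concrete, here is a minimal sketch of turning one position's logits into probabilities with a softmax. The `AMINO_ACIDS` ordering and the helper names are assumptions for illustration only; the column order of the logits returned by this wrapper is not stated here and should be checked against the tool output.

```python
import math

# Assumed column order, for illustration only. The wrapper's actual ordering
# of the 20 standard amino acids may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Convert one position's raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_residue(logits):
    """Return the most probable amino acid and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return AMINO_ACIDS[best], probs[best]


# Toy logits for a single position: index 9 ("L" in the assumed order) is favoured.
position_logits = [0.0] * 20
position_logits[9] = 3.0
residue, prob = top_residue(position_logits)
```

Logits are unnormalized, so only their differences within a position matter; the softmax makes them comparable across positions.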

Tools

ESM C Embeddings (`esmc-embedding`)

Runs each input sequence through ESM C once and averages the per-residue representations, excluding the start and end tokens and any padding, into a single fixed-length vector per sequence. Per-position scores (logits) over the 20 standard amino acids are also returned when requested.
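The pooling step described above can be sketched roughly as follows. This is an illustration, not the toolkit's implementation: `mean_pool` is a hypothetical helper, and it assumes right-padded inputs laid out as start token, residues, end token, padding, with an attention mask that marks the start and end tokens as valid (1).

```python
def mean_pool(per_token, attention_mask):
    """Average per-residue vectors, skipping start/end tokens and padding.

    per_token: list of token vectors, ordered [start, residues..., end, pad...]
    attention_mask: 1 for real tokens (start and end included), 0 for padding
    """
    n_real = sum(attention_mask)          # start + residues + end
    residues = per_token[1:n_real - 1]    # drop the start and end tokens
    if not residues:
        raise ValueError("sequence has no residues to pool")
    dim = len(residues[0])
    return [sum(vec[d] for vec in residues) / len(residues) for d in range(dim)]


# Toy example: two residues with 2-dimensional vectors, plus one padding slot.
per_token = [
    [9.0, 9.0],  # start token (ignored)
    [1.0, 2.0],  # residue 1
    [3.0, 4.0],  # residue 2
    [9.0, 9.0],  # end token (ignored)
    [0.0, 0.0],  # padding (ignored)
]
attention_mask = [1, 1, 1, 1, 0]
pooled = mean_pool(per_token, attention_mask)
```

Because special tokens and padding are excluded, the pooled vector for a sequence should not depend on how much padding its batch happened to need.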

API Reference

Input: MaskedModelInput

sequences

List[string]

required

Protein sequence(s) to process. Can be provided as:

Config: ESMCEmbeddingsConfig

model_checkpoint

enum

default:"esmc_300m"

ESM C weights variant. "esmc_300m" is under the Cambrian Open License (commercial use permitted with attribution); "esmc_600m" is under the Cambrian Non-Commercial License (research/internal use only). Available options: esmc_300m, esmc_600m

return_logits

boolean

default:"False"

Include per-position logits in the output (large; disable to save memory).

repr_layer

integer

default:"-1"

Transformer layer index for embeddings. -1 returns the post-norm final-layer output (outputs.embeddings); other indices select from the pre-norm per-block outputs (outputs.hidden_states). The valid range is checkpoint-dependent (esmc_300m: 30 layers, esmc_600m: 36 layers).

verbose

integer

default:"0"

Print status messages during model execution.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

batch_size

integer

default:"1"

Number of sequences to process in parallel. Larger batches improve throughput but require more GPU memory.

Output: ESMCEmbeddingsOutput

results

List[SequenceEmbedding]

required

Per-sequence embedding results. Each SequenceEmbedding contains:

mean_embedding

List[number]

required

Mean-pooled embedding vector for one sequence.

attention_mask

List[integer]

required

Binary mask indicating valid positions (1) vs padding (0).

logits

array

Optional per-position amino acid logits for one sequence.

projection

Projection2D

Optional 2D coordinate from a UMAP projection of all embeddings in the same call. Populated when n_sequences >= 4; None otherwise (single-point or 2-3-point UMAP is meaningless).

Applications

The averaged embedding is a learned numerical representation of a protein, suitable for machine-learning tasks such as clustering, classification, and property prediction, and for similarity search by comparing these vectors (for example with cosine similarity). The optional per-position scores give the model’s predicted amino-acid preference at each site, useful for conservation analysis and for examining the model’s expectations at specific positions. ESM C is embedding-focused, so it is the lighter-weight choice when you need embeddings or per-position scores but not sequence generation or scoring.
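As a rough illustration of the similarity search mentioned above, the sketch below ranks a few candidate vectors against a query by cosine similarity. The function names and the toy 2-dimensional vectors are hypothetical; in practice the inputs would be the mean_embedding vectors returned for each sequence.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, similarity) pairs sorted from most to least similar.

    candidates: dict mapping a sequence name to its embedding vector.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for mean-pooled embeddings.
query = [1.0, 0.0]
candidates = {"a": [2.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_by_similarity(query, candidates)
```

Cosine similarity ignores vector length, so "a" ranks first here even though it is twice as long as the query. For large collections, a vectorized or indexed search would be the more practical choice.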

Usage Tips

model_checkpoint selects the model size. esmc_300m (the default) has embedding size 960 and esmc_600m has embedding size 1152. The two checkpoints carry different licenses (see Toolkit Notes), so esmc_300m is the choice for any commercial use.
repr_layer selects which internal model layer the embedding is taken from. The default -1 uses the final layer; other values select earlier layers.
Per-position scores are large. Enabling return_logits adds an array of size (sequence length by 20) per sequence, which dominates runtime and memory for long inputs. Leave it set to False unless you need the per-position scores.
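To give a sense of scale for the return_logits tip, here is a back-of-the-envelope estimate comparing the raw size of the logits with the size of the pooled embedding. It assumes 4-byte (float32) values and counts residues only; the actual footprint depends on how the wrapper stores and serializes results, so treat the numbers as a lower bound. Both helper names are hypothetical.

```python
def logits_bytes(seq_lengths, n_classes=20, bytes_per_value=4):
    """Raw size of per-position logits for a set of sequences."""
    return sum(length * n_classes * bytes_per_value for length in seq_lengths)


def embedding_bytes(n_sequences, dim=960, bytes_per_value=4):
    """Raw size of one mean-pooled vector per sequence (960 for esmc_300m)."""
    return n_sequences * dim * bytes_per_value


# One 1,000-residue protein with the default esmc_300m checkpoint.
logit_size = logits_bytes([1000])   # 1000 * 20 * 4 bytes
pooled_size = embedding_bytes(1)    # 960 * 4 bytes
ratio = logit_size / pooled_size
```

Under these assumptions the logits for a 1,000-residue sequence are roughly 20 times larger than its pooled embedding, and the gap grows linearly with sequence length.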

Toolkit Notes

These apply to every ESM C tool in this toolkit (esmc-embedding).

ESM C shares the Biohub esm environment with ESM3. Both are distributed in the same esm package and use a single shared on-disk environment (biohub_esm); installing either tool installs the environment for both.
The license depends on the model size. esmc_300m (the default) is under the Cambrian Open License, with commercial use permitted subject to the naming and attribution requirement; esmc_600m is under the Cambrian Non-Commercial License and must not be used commercially. The 6B model is available only through EvolutionaryScale’s hosted Forge service and is not wrapped here.
batch_size controls memory usage. Lower it if you run out of GPU memory; raise it to process short sequences faster. For repeated single-batch calls, use ToolInstance.persist_tool("esmc") to keep the model loaded in memory between calls; for multi-GPU or large-batch runs, prefer ToolPool.
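One common way to get more out of a given batch_size is to group sequences of similar length, so that less of each batch is padding. The sketch below is a generic illustration of that idea, not a description of what this toolkit does internally; `length_sorted_batches` is a hypothetical helper, and whether the wrapper already orders inputs this way is not stated here.

```python
def length_sorted_batches(sequences, batch_size):
    """Group sequence indices into batches of similar length.

    Returns a list of batches, each a list of indices into `sequences`,
    so results can be mapped back to the original input order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Toy input: two short and two longer sequences, interleaved.
sequences = ["MKT", "MKTAYIAKQR", "MK", "MKTAYIAK"]
batches = length_sorted_batches(sequences, batch_size=2)
```

Here the two short sequences end up in one batch and the two longer ones in another. Keeping the indices rather than the sequences themselves makes it straightforward to restore the original order afterwards.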

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
