Proto is not affiliated with Salesforce Research. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.
Background
ProGen2 (Nijkamp et al., 2023) is a family of autoregressive protein language models trained with a next-token prediction objective: during training the model learns to predict the next residue given all preceding residues. The family spans progen2-small (151 million parameters) up to progen2-xlarge (6.4 billion parameters). The checkpoints were trained on different protein collections because the paper found that the training-data distribution has a large and sometimes counterintuitive effect on downstream performance. Most checkpoints are trained on natural proteins drawn from UniRef90 and the BFD metagenomic set; progen2-BFD90 uses the BFD90 collection, and progen2-oas is trained on antibody sequences from the Observed Antibody Space database.
The autoregressive training objective instills two primary capabilities. First, new candidate protein sequences can be sampled from a starting prompt via the predicted next-residue distributions. Second, the model can be used to score existing protein sequences, as the likelihood the model assigns to a sequence is shown in the paper to provide a proxy zero-shot fitness score or measure of plausibility with no additional task-specific training.
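Both capabilities follow from the chain-rule factorization used by any autoregressive model. The following is a sketch in notation that does not come from the source: $x_i$ is the $i$-th residue, $L$ the sequence length, and $\theta$ the model parameters; special start and stop tokens are ignored for simplicity.

```latex
\begin{aligned}
p_\theta(x_1, \dots, x_L) &= \prod_{i=1}^{L} p_\theta(x_i \mid x_{<i}) \\
\log p_\theta(x_1, \dots, x_L) &= \sum_{i=1}^{L} \log p_\theta(x_i \mid x_{<i})
\end{aligned}
```

Sampling draws each $x_i$ from its conditional distribution in turn, while scoring evaluates the same conditionals on a fixed sequence and sums their logarithms.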
Tools
ProGen2 Sampling (progen2-sample)
Generates protein sequences by autoregressive sampling. Given one or more prompt sequences, the model extends each prompt one amino acid at a time, drawing each residue from the model’s predicted distribution under the configured temperature, top_p, and top_k settings, until a stop token is produced or max_new_tokens new residues have been generated (default 256).
API Reference
Input: CausalModelSampleInput
Config: ProGen2SampleConfig
model_checkpoint options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge
- top_k: 0 disables and uses top-p only.
- … 1/2) from output.
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
- prepend_prompt: … False, only newly generated tokens are returned.
Applications
This tool performs de novo protein design, generating novel sequences that resemble natural proteins conditioned on a prompt such as a starting motif or partial domain. The antibody-trained progen2-oas checkpoint specifically targets antibody and immune-repertoire generation.
Usage Tips
- Generated output is trimmed by default. Generated sequences are cut at the first stop token with the start/stop sentinels removed (truncate_at_stop and strip_special_tokens, both True); set them to False to keep the raw model output.
- Sampling defaults are conservative. temperature defaults to 0.2 and top_p to 0.95, which keep generations close to natural-looking sequences; raise temperature for more diverse but riskier designs. top_k defaults to 0, which disables top-k truncation so only nucleus (top_p) sampling is applied.
- max_new_tokens bounds the generated length. It caps newly generated residues (default 256), separate from the prompt length.
- Output includes the prompt by default. prepend_prompt=True (the toolkit default) returns the prompt joined to its continuation; set it to False to receive only the newly generated residues.
- Generated sequences are candidates. Validate them with downstream tools (for example structure prediction, function annotation, or homology search) before drawing biological conclusions.
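To make the interaction of temperature, top_k, and top_p concrete, here is a minimal, generic sketch of drawing one token from a vector of logits. It is not the toolkit's implementation: the function name sample_next, the toy logits, and the order in which the filters are applied are all assumptions made for illustration.

```python
import math
import random


def sample_next(logits, temperature=0.2, top_k=0, top_p=0.95, rng=random):
    """Draw one token index from raw logits (illustrative, not the toolkit's code)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k == 0 leaves the ranking untouched, so only top_p applies.
    if top_k > 0:
        order = order[:top_k]

    # Nucleus (top-p): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the surviving weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
cold = {sample_next(toy_logits, temperature=0.2, rng=rng) for _ in range(200)}
warm = {sample_next(toy_logits, temperature=5.0, top_p=1.0, rng=rng) for _ in range(200)}
```

With these toy logits, the low-temperature setting concentrates more than 95% of the probability on the top token, so the nucleus contains a single token and every draw returns it. The high-temperature run with top_p=1.0 flattens the distribution and spreads draws across several tokens, which is the diversity-versus-risk trade-off described in the tips above.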
ProGen2 Scoring (progen2-score)
Scores existing protein sequences using ProGen2. For each sequence it computes the model’s predicted probability of every residue given the preceding residues and aggregates these into a log-likelihood, an average log-likelihood per residue, and a perplexity. Perplexity is fully determined by the average log-likelihood, computed as exp(-avg_log_likelihood), but is the conventionally reported metric. Optionally returns the per-position logits and the token vocabulary.
API Reference
Input: CausalModelScoringInput
Config: ProGen2ScoringConfig
model_checkpoint options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: CausalModelScoringOutput
Metrics subclass with scalar metrics (log_likelihood, avg_log_likelihood, perplexity) and optional per-position _pp-suffixed list extras; logits and vocab are declared fields for raw model outputs.
… scores item)
| Metric | Type | Range | Availability |
|---|---|---|---|
| log_likelihood | float | ≤ 0.0 | always |
| avg_log_likelihood | float | ≤ 0.0 | always |
| perplexity | float | ≥ 1.0 | always |
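The relationship between the three metrics can be sketched in a few lines of Python. This is an illustration under assumptions, not the toolkit's code: score_from_probs is a hypothetical helper, the input probabilities are made up, and how the real tool treats start/stop tokens when averaging is not stated in this document.

```python
import math


def score_from_probs(residue_probs):
    """Aggregate per-residue probabilities into the three scalar metrics."""
    if not residue_probs:
        raise ValueError("need at least one residue probability")
    log_probs = [math.log(p) for p in residue_probs]
    log_likelihood = sum(log_probs)                        # always <= 0
    avg_log_likelihood = log_likelihood / len(log_probs)   # always <= 0
    perplexity = math.exp(-avg_log_likelihood)             # always >= 1
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": perplexity,
    }


# Made-up probabilities for a four-residue sequence.
toy = score_from_probs([0.5, 0.25, 0.5, 0.25])
```

Because every probability is at most 1, each log term is at most 0, which is why the table lists both log-likelihood metrics as ≤ 0.0 and perplexity as ≥ 1.0.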
Applications
This tool gives a zero-shot measure of how consistent a protein sequence is with ProGen2’s training distribution, which is used in the paper as a proxy-fitness predictor without additional task-specific training. It can be used to rank or filter candidate sequences (including the output of progen2-sample), to compare variants of a sequence, or to flag sequences far from the model’s training distribution.
Usage Tips
- Compare length-normalized scores within one checkpoint. Total log_likelihood scales with sequence length, so use perplexity or avg_log_likelihood when comparing sequences of different lengths. Different checkpoints learn different distributions that are not calibrated to a common scale, so scores from different model_checkpoint values are hard to compare directly. A lower perplexity means the sequence is more consistent with that checkpoint’s training distribution.
- return_logits defaults to False. Leave it off unless you need the per-position distributions, since the logits tensor is large (sequence length by the token vocabulary).
- A domain-matched checkpoint is not automatically better for scoring. The ProGen2 paper found the antibody-specific progen2-oas checkpoint underperformed the universal checkpoints on antibody fitness prediction, so a universal checkpoint (such as the default progen2-large) is often the safer choice for scoring.
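The length effect in the first tip can be seen with a small, self-contained example. The per-residue log-probabilities below are invented for illustration and the names candidates, total, and average are hypothetical; real values would come from the scoring output of a single checkpoint.

```python
# Hypothetical per-residue log-probabilities for two candidates of different length.
candidates = {
    "short_variant": [-1.2, -0.9, -1.1],
    "long_variant": [-0.6, -0.5, -0.7, -0.6, -0.5, -0.7, -0.6, -0.8],
}


def total(log_probs):
    """Total log-likelihood: grows more negative with every added residue."""
    return sum(log_probs)


def average(log_probs):
    """Length-normalized log-likelihood per residue."""
    return sum(log_probs) / len(log_probs)


# Higher (less negative) is better under both criteria.
by_total = sorted(candidates, key=lambda name: total(candidates[name]), reverse=True)
by_average = sorted(candidates, key=lambda name: average(candidates[name]), reverse=True)
```

Ranking by the total favors the short sequence simply because it has fewer terms to sum, while ranking by the per-residue average favors the long sequence, whose individual residues the model finds more plausible. The two orderings disagree, which is why length-normalized metrics are the safer basis for comparing sequences of different lengths.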
Toolkit Notes
These apply to every ProGen2 tool in this toolkit (progen2-sample, progen2-score).
- Requires a GPU; memory scales with checkpoint size. The larger checkpoints, up to progen2-xlarge at 6.4 billion parameters, need substantially more GPU memory than progen2-small. CPU execution is not practical.
- batch_size trades memory for throughput across both tools. It sets how many prompts (progen2-sample) or sequences (progen2-score) are processed per GPU forward pass. Raise it for higher throughput on many short sequences; lower it (default 1) if generation or scoring runs out of GPU memory.
- model_checkpoint selects the training distribution. The default progen2-large and the small, medium, base, and xlarge checkpoints are trained on broad natural-protein collections (UniRef90 and BFD); progen2-BFD90 is trained on the BFD90 set and progen2-oas on antibody sequences from the Observed Antibody Space. The choice of model has performance implications for both sampling and scoring.
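As a rough picture of what batch_size controls, the sketch below splits a list of inputs into groups that would each go through one forward pass. The batched helper and the example prompts are hypothetical; the toolkit's own batching logic is not shown in this document and may differ, for example in how it pads or orders sequences.

```python
def batched(items, batch_size=1):
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = ["MKT", "MAS", "MSE", "MLK", "MGS"]  # made-up prompt prefixes
batches = list(batched(prompts, batch_size=2))
one_by_one = list(batched(prompts))  # default of 1: one input per pass
```

With batch_size=2 the five prompts need three passes instead of five, at the cost of holding two sequences in GPU memory at once. Lowering batch_size reverses that trade.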

Salesforce Research