License: ProGen2 is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

Proto is not affiliated with Salesforce Research. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


enijkamp/progen2
Official release of the ProGen models
59 stars
ProGen2: Exploring the boundaries of protein language models
Erik Nijkamp, Jeffrey A Ruffolo, … Ali Madani
Cell Systems (2023)
@article{nijkamp2023progen2,
  title={ProGen2: Exploring the boundaries of protein language models},
  author={Nijkamp, Erik and Ruffolo, Jeffrey A and Weinstein, Eli N and Naik, Nikhil and Madani, Ali},
  journal={Cell Systems},
  volume={14},
  number={11},
  pages={968--978},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.cels.2023.10.002}
}
proto-bio/proto-tools/proto_tools/tools/causal_models/progen2
| Function | Description |
| --- | --- |
| run_progen2_sample() | Sample protein sequences using ProGen2 language model (GPU) |
| run_progen2_score() | Score protein sequences using ProGen2 language model (GPU) |

Background

ProGen2 (Nijkamp et al., 2023) is a family of autoregressive protein language models trained with a next-token prediction objective: during training the model learns to predict the next residue given all preceding residues. The family spans progen2-small (151 million parameters) up to progen2-xlarge (6.4 billion parameters).

The checkpoints were trained on different protein collections because the paper found that the training-data distribution has a large and sometimes counterintuitive effect on downstream performance. Most checkpoints are trained on natural proteins drawn from UniRef90 and the BFD metagenomic set; progen2-BFD90 uses the BFD90 collection, and progen2-oas is trained on antibody sequences from the Observed Antibody Space database.

The autoregressive training objective gives the model two primary capabilities. First, new candidate protein sequences can be sampled from a starting prompt by drawing from the predicted next-residue distributions. Second, the model can score existing protein sequences: the paper shows that the likelihood the model assigns to a sequence serves as a zero-shot proxy for fitness, or as a measure of plausibility, with no additional task-specific training.
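Both capabilities rest on the standard autoregressive factorization of a sequence's probability. As a sketch in generic notation (the symbols x, L, and p are illustrative and not taken from the toolkit), the log-likelihood of a sequence x of length L is the sum of the per-residue conditional log-probabilities:

```latex
\log p(x) = \sum_{i=1}^{L} \log p\left(x_i \mid x_{<i}\right)
```

Sampling draws each residue in turn from the conditional distribution on the right-hand side, while scoring evaluates the sum for a sequence that is already known.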

Tools

ProGen2 Sampling (progen2-sample)

Generates protein sequences by autoregressive sampling. Given one or more prompt sequences, the model extends each prompt one amino acid at a time, drawing each residue from the model’s predicted distribution under the configured temperature, top_p, and top_k settings, until a stop token is produced or max_new_tokens new residues have been generated (default 256).
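The loop described above can be sketched in plain Python. This is an illustrative sketch only, not the toolkit's implementation: `sample_next_token`, `generate`, and the `next_logits` callback are hypothetical names, and the order in which top-k and top-p are applied is an assumption.

```python
import math
import random


def sample_next_token(logits, temperature=0.2, top_p=0.95, top_k=0, rng=None):
    """Draw one token index from raw logits using temperature, top-k, and top-p."""
    rng = rng or random.Random()
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k truncation; 0 disables it, matching the documented default.
    if top_k > 0:
        ranked = ranked[:top_k]
    # Nucleus (top-p) truncation: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def generate(next_logits, prompt_tokens, stop_token, max_new_tokens=256, **sampling):
    """Extend a prompt until the stop token appears or the token budget is spent."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = sample_next_token(next_logits(tokens), **sampling)
        tokens.append(token)
        if token == stop_token:
            break
    return tokens
```

Here `next_logits` stands in for a model forward pass that maps the tokens so far to one logit per vocabulary entry. With the low default temperature, the most probable residue usually dominates, which is why the defaults behave conservatively.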

API Reference

Source
prompts
List[string]
required
Prompt sequences to condition generation on. Can be provided as a single string or a list of strings.
Source
model_checkpoint
enum
default:"progen2-large"
ProGen2 weights variant. Sizes range from 151M (small) to 6.4B (xlarge). Available options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge
local_path
string
Override the default download with a local weights directory.
top_k
integer
default:"0"
Top-k truncation; 0 disables and uses top-p only.
max_new_tokens
integer
default:"256"
Maximum number of new tokens to generate per prompt (excludes prompt).
truncate_at_stop
boolean
default:"True"
Truncate generated sequences at the first stop token.
strip_special_tokens
boolean
default:"True"
Strip ProGen2 start/stop sentinel tokens (1/2) from output.
return_logits
boolean
default:"False"
Include per-position logits in the output.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cuda"
Device to run the model on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
prepend_prompt
boolean
default:"True"
Include the input prompt at the start of each generated sequence; when False, only newly generated tokens are returned.
temperature
number
default:"0.2"
Softmax temperature; lower values are more deterministic.
top_p
number
default:"0.95"
Nucleus sampling threshold over per-position token probabilities.
batch_size
integer
default:"1"
Number of prompts to process simultaneously on GPU.
Source
logits
array
Per-position logits for each generated sequence (shape: [n_outputs, generated_len, vocab_size]).
sequences
List[string]
required
Generated protein sequences.

Applications

This tool performs de novo protein design, generating novel sequences that resemble natural proteins conditioned on a prompt such as a starting motif or partial domain. The antibody-trained progen2-oas checkpoint targets antibody and immune-repertoire generation specifically.

Usage Tips

  • Generated output is trimmed by default. Generated sequences are cut at the first stop token and the start/stop sentinels are removed (truncate_at_stop and strip_special_tokens, both True by default); set either to False to keep that part of the raw model output.
  • Sampling defaults are conservative. temperature defaults to 0.2 and top_p to 0.95, which keep generations close to natural-looking sequences; raise temperature for more diverse but riskier designs. top_k defaults to 0, which disables top-k truncation so only nucleus (top_p) sampling is applied.
  • max_new_tokens bounds the generated length. It caps newly generated residues (default 256), separate from the prompt length.
  • Output includes the prompt by default. prepend_prompt=True (the toolkit default) returns the prompt joined to its continuation; set it to False to receive only the newly generated residues.
  • Generated sequences are candidates. Validate them with downstream tools (for example structure prediction, function annotation, or homology search) before drawing biological conclusions.
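To make the interaction of the three output flags concrete, the following is a minimal sketch of the post-processing they describe. The `postprocess` helper is hypothetical, and the order of operations is an assumption; only the flag names and the 1/2 sentinel tokens come from the parameter reference above.

```python
START, STOP = "1", "2"  # start/stop sentinels as listed under strip_special_tokens


def postprocess(prompt, continuation, truncate_at_stop=True,
                strip_special_tokens=True, prepend_prompt=True):
    """Hypothetical helper mimicking the documented output flags."""
    if truncate_at_stop and STOP in continuation:
        # Keep everything up to and including the first stop token.
        continuation = continuation[: continuation.index(STOP) + 1]
    # Optionally join the prompt to the newly generated residues.
    text = prompt + continuation if prepend_prompt else continuation
    if strip_special_tokens:
        # Remove the sentinel characters, leaving only amino-acid letters.
        text = text.replace(START, "").replace(STOP, "")
    return text


# Raw model output for the prompt "1MKT" (made-up residues for illustration).
print(postprocess("1MKT", "AYIA2GGG"))
print(postprocess("1MKT", "AYIA2GGG", prepend_prompt=False))
print(postprocess("1MKT", "AYIA2GGG", truncate_at_stop=False, strip_special_tokens=False))
```

The first call reflects the defaults (prompt plus trimmed continuation), the second returns only new residues, and the third keeps the raw output including sentinels and anything after the stop token.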

ProGen2 Scoring (progen2-score)

Scores existing protein sequences using ProGen2. For each sequence it computes the model’s predicted probability of every residue given the preceding residues and aggregates these into a log-likelihood, an average log-likelihood per residue, and a perplexity. Perplexity is fully determined by the average log-likelihood, computed as exp(-avg_log_likelihood); it is included because it is the conventionally reported metric. Optionally returns the per-position logits and the token vocabulary.
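The relationship between the three metrics can be sketched as follows. This is an illustration rather than the toolkit's code: `score_from_log_probs` is a hypothetical helper, and it assumes natural-log probabilities with one entry per scored residue (whether the toolkit counts sentinel tokens in the average is not stated here).

```python
import math


def score_from_log_probs(token_log_probs):
    """Aggregate per-residue natural-log probabilities into the three metrics."""
    if not token_log_probs:
        raise ValueError("need at least one per-residue log-probability")
    # Total log-likelihood: sum of conditional log-probabilities.
    log_likelihood = sum(token_log_probs)
    # Length-normalized log-likelihood.
    avg_log_likelihood = log_likelihood / len(token_log_probs)
    # Perplexity is a deterministic function of the average.
    perplexity = math.exp(-avg_log_likelihood)
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": perplexity,
    }


# A model that is uniformly unsure among 20 amino acids at every position.
uniform = [math.log(1 / 20)] * 5
print(score_from_log_probs(uniform))
```

For the uniform case the perplexity comes out at approximately 20, which matches the intuition that perplexity is the effective number of equally likely choices per position.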

API Reference

Source
sequences
List[string]
required
Sequences to score. Can be provided as a single string or a list of strings.
Source
model_checkpoint
enum
default:"progen2-large"
ProGen2 weights variant. Available options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge
local_path
string
Override the default download with a local weights directory.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cuda"
Device to run the model on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
batch_size
integer
default:"1"
Number of sequences to process simultaneously on GPU.
return_logits
boolean
default:"False"
Include per-position logits in the output.
Source
scores
List[CausalModelScoringMetrics]
required
List of scoring outputs, one per input sequence. Each entry is a Metrics subclass with scalar metrics (log_likelihood, avg_log_likelihood, perplexity) and optional per-position _pp-suffixed list extras; logits and vocab are declared fields for raw model outputs.
Metrics (one set per scores item)
| Metric | Type | Range | Availability |
| --- | --- | --- | --- |
| log_likelihood | float | ≤ 0.0 | always |
| avg_log_likelihood | float | ≤ 0.0 | always |
| perplexity | float | ≥ 1.0 | always |

Applications

This tool gives a zero-shot measure of how consistent a protein sequence is with ProGen2’s training distribution, which is used in the paper as a proxy-fitness predictor without additional task-specific training. It can be used to rank or filter candidate sequences (including the output of progen2-sample), to compare variants of a sequence, or to flag sequences far from the model’s training distribution.

Usage Tips

  • Compare length-normalized scores within one checkpoint. Total log_likelihood scales with sequence length, so use perplexity or avg_log_likelihood when comparing sequences of different lengths. Different checkpoints learn different distributions that are not calibrated to a common scale, so scores from different model_checkpoint values are hard to compare directly. A lower perplexity means the sequence is more consistent with that checkpoint’s training distribution.
  • return_logits defaults to False. Leave it off unless you need the per-position distributions, since the logits tensor is large (sequence length by the token vocabulary).
  • A domain-matched checkpoint is not automatically better for scoring. The ProGen2 paper found the antibody-specific progen2-oas checkpoint underperformed the universal checkpoints on antibody fitness prediction, so a universal checkpoint (such as the default progen2-large) is often the safer choice for scoring.
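As an illustration of ranking and filtering with length-normalized scores, the sketch below sorts candidates by perplexity. It assumes each score is available as a plain dict with a "perplexity" key; the real entries are CausalModelScoringMetrics objects, so the access pattern, the `rank_by_perplexity` name, and the example values are all hypothetical.

```python
def rank_by_perplexity(sequences, scores, max_perplexity=None):
    """Return (sequence, perplexity) pairs, most model-consistent first."""
    if len(sequences) != len(scores):
        raise ValueError("expected one score per sequence")
    pairs = [(seq, score["perplexity"]) for seq, score in zip(sequences, scores)]
    if max_perplexity is not None:
        # Drop candidates the model finds too surprising.
        pairs = [pair for pair in pairs if pair[1] <= max_perplexity]
    # Lower perplexity means closer to the checkpoint's training distribution.
    return sorted(pairs, key=lambda pair: pair[1])


# Made-up candidates and scores, all assumed to come from the same checkpoint.
candidates = ["AAA", "BBB", "CCC"]
scores = [{"perplexity": 12.0}, {"perplexity": 4.5}, {"perplexity": 8.0}]
print(rank_by_perplexity(candidates, scores, max_perplexity=10.0))
```

All scores passed to one call should come from the same model_checkpoint, since scores from different checkpoints are not calibrated to a common scale.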

Toolkit Notes

These apply to every ProGen2 tool in this toolkit (progen2-sample, progen2-score).
  • Requires a GPU; memory scales with checkpoint size. The larger checkpoints, up to progen2-xlarge at 6.4 billion parameters, need substantially more GPU memory than progen2-small. CPU execution is not practical.
  • batch_size trades memory for throughput across both tools. It sets how many prompts (progen2-sample) or sequences (progen2-score) are processed per GPU forward pass. Raise it for higher throughput on many short sequences; lower it (default 1) if generation or scoring runs out of GPU memory.
  • model_checkpoint selects the training distribution. The default progen2-large and the small, medium, base, and xlarge checkpoints are trained on broad natural-protein collections (UniRef90 and BFD); progen2-BFD90 is trained on the BFD90 set and progen2-oas on antibody sequences from the Observed Antibody Space. The choice of model has performance implications for both sampling and scoring.
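The effect of batch_size described above can be pictured as simple chunking of the input list. The sketch below is an assumption about the general idea, not the toolkit's batching code, and `batched` is a hypothetical helper name.

```python
def batched(items, batch_size=1):
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Five inputs with batch_size=2 give three forward passes: 2 + 2 + 1.
for batch in batched(["A", "B", "C", "D", "E"], batch_size=2):
    print(batch)
```

Larger chunks mean fewer forward passes but more sequences resident in GPU memory at once, which is the trade-off the note describes.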
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.