ProGen2 - Proto

License: ProGen2 is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

Proto is not affiliated with Salesforce Research. This toolkit is open source and builds on the implementation released by Salesforce Research. Product names, logos, and trademarks are the property of their respective owners.

enijkamp/progen2

Official release of the ProGen models

ProGen2: Exploring the boundaries of protein language models

Erik Nijkamp, Jeffrey A Ruffolo, … Ali Madani

Cell Systems (2023)

@article{nijkamp2023progen2,
  title={ProGen2: Exploring the boundaries of protein language models},
  author={Nijkamp, Erik and Ruffolo, Jeffrey A and Weinstein, Eli N and Naik, Nikhil and Madani, Ali},
  journal={Cell Systems},
  volume={14},
  number={11},
  pages={968--978},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.cels.2023.10.002}
}

proto-bio/proto-tools/proto_tools/tools/causal_models/progen2

Function	Description
`run_progen2_sample()`	Sample protein sequences using the ProGen2 language model (GPU)
`run_progen2_score()`	Score protein sequences using the ProGen2 language model (GPU)

Background

ProGen2 (Nijkamp et al., 2023) is a family of autoregressive protein language models trained with a next-token prediction objective: during training, the model learns to predict the next residue given all preceding residues. The family spans progen2-small (151 million parameters) up to progen2-xlarge (6.4 billion parameters).

The checkpoints were trained on different protein collections because the paper found that the training-data distribution has a large and sometimes counterintuitive effect on downstream performance. Most checkpoints are trained on natural proteins drawn from UniRef90 and the BFD metagenomic set; progen2-BFD90 uses the BFD90 collection, and progen2-oas is trained on antibody sequences from the Observed Antibody Space database.

The autoregressive training objective gives the model two primary capabilities. First, new candidate protein sequences can be sampled from a starting prompt by drawing from the predicted next-residue distributions. Second, the model can score existing protein sequences: the paper shows that the likelihood the model assigns to a sequence serves as a zero-shot proxy for fitness or plausibility, with no additional task-specific training.

Tools

ProGen2 Sampling (`progen2-sample`)

Generates protein sequences by autoregressive sampling. Given one or more prompt sequences, the model extends each prompt one amino acid at a time, drawing each residue from the model’s predicted distribution under the configured temperature, top_p, and top_k settings, until a stop token is produced or max_new_tokens new residues have been generated (default 256).
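The interaction of temperature, top_k, and top_p at a single position can be sketched in plain Python. This is a minimal illustration, not the toolkit's implementation: the function names, the toy four-token vocabulary, and the order of operations (temperature, then top-k, then top-p) are assumptions that follow the common convention for these filters.

```python
import math
import random


def filter_distribution(logits, temperature=0.2, top_k=0, top_p=0.95):
    """Return a renormalized {token: prob} dict after temperature, top-k, top-p.

    Hypothetical helper for illustration; `logits` maps token -> raw score.
    """
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Top-k: keep only the k most probable tokens; 0 disables this step.
    if top_k > 0:
        ranked = ranked[:top_k]
        mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / mass) for tok, prob in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy logits over four residues (made-up numbers, not model output).
toy_logits = {"A": 2.0, "L": 1.5, "G": 0.5, "W": -1.0}
cold = filter_distribution(toy_logits, temperature=0.2)  # sharp distribution
warm = filter_distribution(toy_logits, temperature=1.0)  # flatter distribution
token = sample_token(toy_logits, random.Random(0), temperature=1.0, top_k=2, top_p=1.0)
```

In this toy case the low temperature concentrates the distribution enough that the nucleus keeps only two residues, while temperature 1.0 keeps three, which is the sense in which lower values are more deterministic.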

API Reference

Input: CausalModelSampleInput

prompts

List[string]

required

Prompt sequences to condition generation on. Can be provided as a single string or a list of strings.

Config: ProGen2SampleConfig

model_checkpoint

enum

default:"progen2-large"

ProGen2 weights variant. Sizes range from 151M (small) to 6.4B (xlarge). Available options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge

local_path

string

Override the default download with a local weights directory.

top_k

integer

default:"0"

Top-k truncation; 0 disables and uses top-p only.

max_new_tokens

integer

default:"256"

Maximum number of new tokens to generate per prompt (excludes prompt).

truncate_at_stop

boolean

default:"True"

Truncate generated sequences at the first stop token.

strip_special_tokens

boolean

default:"True"

Strip ProGen2 start/stop sentinel tokens (1/2) from output.

return_logits

boolean

default:"False"

Include per-position logits in the output.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

prepend_prompt

boolean

default:"True"

Include the input prompt at the start of each generated sequence; when False, only newly generated tokens are returned.

temperature

number

default:"0.2"

Softmax temperature; lower values are more deterministic.

top_p

number

default:"0.95"

Nucleus sampling threshold over per-position token probabilities.

batch_size

integer

default:"1"

Number of prompts to process simultaneously on GPU.

Output: ProGen2SampleOutput

logits

array

Per-position logits for each generated sequence (shape: [n_outputs, generated_len, vocab_size]).

sequences

List[string]

required

Generated protein sequences.

Applications

This tool performs de novo protein design, generating novel sequences that resemble natural proteins conditioned on a prompt such as a starting motif or partial domain. The antibody-trained progen2-oas checkpoint targets antibody and immune-repertoire generation specifically.

Usage Tips

Generated output is trimmed by default. Generated sequences are cut at the first stop token and the start/stop sentinels are removed (truncate_at_stop and strip_special_tokens both default to True); set both to False to keep the raw model output.
Sampling defaults are conservative. temperature defaults to 0.2 and top_p to 0.95, which keep generations close to natural-looking sequences; raise temperature for more diverse but riskier designs. top_k defaults to 0, which disables top-k truncation so only nucleus (top_p) sampling is applied.
max_new_tokens bounds the generated length. It caps newly generated residues (default 256), separate from the prompt length.
Output includes the prompt by default. prepend_prompt=True (the toolkit default) returns the prompt joined to its continuation; set it to False to receive only the newly generated residues.
Generated sequences are candidates. Validate them with downstream tools (for example structure prediction, function annotation, or homology search) before drawing biological conclusions.
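How the three output switches combine can be sketched as plain string post-processing. This is a hypothetical illustration rather than the toolkit's code: the `postprocess` helper, the made-up sequences, and the choice to keep the stop token when truncating (so that stripping remains an independent switch) are all assumptions. The sentinel characters "1" and "2" follow the strip_special_tokens description above.

```python
def postprocess(raw_continuation, prompt, *, truncate_at_stop=True,
                strip_special_tokens=True, prepend_prompt=True,
                start_token="1", stop_token="2"):
    """Hypothetical post-processing of one raw continuation string."""
    text = raw_continuation
    if truncate_at_stop and stop_token in text:
        # Drop everything after the first stop token; the token itself is kept
        # here and only removed by the stripping step below (an assumption).
        text = text[: text.index(stop_token) + 1]
    if prepend_prompt:
        text = prompt + text
    if strip_special_tokens:
        text = text.replace(start_token, "").replace(stop_token, "")
    return text


# Made-up prompt and raw continuation, for illustration only.
prompt = "1MKT"
raw = "AYIAG2LLKV"

default_output = postprocess(raw, prompt)
continuation_only = postprocess(raw, prompt, prepend_prompt=False)
raw_output = postprocess(raw, prompt, truncate_at_stop=False,
                         strip_special_tokens=False)
```

With the defaults, the toy input yields the prompt plus the residues generated before the stop token; disabling both trimming switches returns the sentinels and any residues emitted after the stop token.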

ProGen2 Scoring (`progen2-score`)

Scores existing protein sequences using ProGen2. For each sequence, it computes the model’s predicted probability of every residue given the preceding residues and aggregates these into a log-likelihood, an average log-likelihood per residue, and a perplexity. Perplexity is fully determined by the average log-likelihood, as exp(-avg_log_likelihood), but is reported because it is the conventional metric. Optionally returns the per-position logits and the token vocabulary.
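The relationship between the three scalar metrics can be shown with a small worked example. This sketch assumes natural logarithms and averaging over exactly the positions supplied; the `score_from_probs` helper and the probabilities are made up for illustration, and the toolkit may differ in details such as whether terminal tokens are counted.

```python
import math


def score_from_probs(residue_probs):
    """Aggregate per-residue conditional probabilities into scalar metrics.

    Hypothetical helper: residue_probs[i] stands for p(x_i | x_<i).
    """
    log_likelihood = sum(math.log(p) for p in residue_probs)
    avg_log_likelihood = log_likelihood / len(residue_probs)
    return {
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": math.exp(-avg_log_likelihood),
    }


# Four toy positions; real values would come from the model.
metrics = score_from_probs([0.5, 0.25, 0.5, 0.25])
```

Because every probability is at most 1, both log-likelihood values are at most 0 and perplexity is at least 1, which matches the ranges listed in the metrics table.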

API Reference

Input: CausalModelScoringInput

sequences

List[string]

required

Sequences to score. Can be provided as a single string or a list of strings.

Config: ProGen2ScoringConfig

model_checkpoint

enum

default:"progen2-large"

ProGen2 weights variant. Available options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge

local_path

string

Override the default download with a local weights directory.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run the model on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

batch_size

integer

default:"1"

Number of sequences to process simultaneously on GPU.

return_logits

boolean

default:"False"

Include per-position logits in the output.

Output: CausalModelScoringOutput

scores

List[CausalModelScoringMetrics]

required

List of scoring outputs, one per input sequence. Each entry is a Metrics subclass with scalar metrics (log_likelihood, avg_log_likelihood, perplexity) and optional per-position _pp-suffixed list extras; logits and vocab are declared fields for raw model outputs.

CausalModelScoringMetrics

logits

array

Per-position logits array (seq_len, vocab_size). None unless return_logits=True.

vocab

array

Token ordering for logits.

primary_metric

string

Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

Metrics (one set per scores item)

Metric	Type	Range	Availability
`log_likelihood`	float	≤ 0.0	always
`avg_log_likelihood`	float	≤ 0.0	always
`perplexity`	float	≥ 1.0	always

Applications

This tool gives a zero-shot measure of how consistent a protein sequence is with ProGen2’s training distribution, which is used in the paper as a proxy-fitness predictor without additional task-specific training. It can be used to rank or filter candidate sequences (including the output of progen2-sample), to compare variants of a sequence, or to flag sequences far from the model’s training distribution.

Usage Tips

Compare length-normalized scores within one checkpoint. Total log_likelihood scales with sequence length, so use perplexity or avg_log_likelihood when comparing sequences of different lengths. Different checkpoints learn different distributions that are not calibrated to a common scale, so scores from different model_checkpoint values are hard to compare directly. A lower perplexity means the sequence is more consistent with that checkpoint’s training distribution.
return_logits defaults to False. Leave it off unless you need the per-position distributions, since the logits tensor is large (sequence length by the token vocabulary).
A domain-matched checkpoint is not automatically better for scoring. The ProGen2 paper found the antibody-specific progen2-oas checkpoint underperformed the universal checkpoints on antibody fitness prediction, so a universal checkpoint (such as the default progen2-large) is often the safer choice for scoring.
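Why length normalization matters when ranking can be seen with two toy candidates. The sketch below uses made-up per-residue probabilities rather than real model scores, and `toy_metrics` is a hypothetical helper that mirrors the documented metric names.

```python
import math


def toy_metrics(name, length, per_residue_prob):
    """Build metrics for a sequence whose residues all have the same probability."""
    log_likelihood = length * math.log(per_residue_prob)
    avg_log_likelihood = log_likelihood / length
    return {
        "name": name,
        "log_likelihood": log_likelihood,
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": math.exp(-avg_log_likelihood),
    }


candidates = [
    toy_metrics("short_poor", length=10, per_residue_prob=0.3),
    toy_metrics("long_good", length=100, per_residue_prob=0.5),
]

# Total log-likelihood favors the short sequence simply because it has fewer terms.
best_by_total = max(candidates, key=lambda m: m["log_likelihood"])["name"]
# Perplexity (lower is better) compares the per-residue fit instead.
best_by_perplexity = min(candidates, key=lambda m: m["perplexity"])["name"]
```

The two rankings disagree here: the total picks the short sequence despite its worse per-residue fit, while perplexity picks the long one. The comparison is only meaningful when all candidates were scored with the same model_checkpoint.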

Toolkit Notes

These apply to every ProGen2 tool in this toolkit (progen2-sample, progen2-score).

Requires a GPU; memory scales with checkpoint size. The larger checkpoints, up to progen2-xlarge at 6.4 billion parameters, need substantially more GPU memory than progen2-small. CPU execution is not practical.
batch_size trades memory for throughput across both tools. It sets how many prompts (progen2-sample) or sequences (progen2-score) are processed per GPU forward pass. Raise it for higher throughput on many short sequences; lower it (default 1) if generation or scoring runs out of GPU memory.
model_checkpoint selects the training distribution. The default progen2-large and the small, medium, base, and xlarge checkpoints are trained on broad natural-protein collections (UniRef90 and BFD); progen2-BFD90 is trained on the BFD90 set and progen2-oas on antibody sequences from the Observed Antibody Space. The choice of model has performance implications for both sampling and scoring.
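The memory and batching notes above can be made concrete with a rough calculation. This is a back-of-the-envelope sketch, not a measurement: it assumes 2 bytes per parameter (16-bit weights), counts weights only, and ignores activations and any generation cache, so real usage will be higher. Both helper functions are hypothetical.

```python
def weights_gib(n_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB (assumes 16-bit weights)."""
    return n_params * bytes_per_param / 2**30


def chunk(items, batch_size=1):
    """Split inputs into consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


small_gib = weights_gib(151e6)    # progen2-small, 151 million parameters
xlarge_gib = weights_gib(6.4e9)   # progen2-xlarge, 6.4 billion parameters

# Five made-up prompts processed two at a time need three forward-pass groups.
batches = chunk(["MKT", "MAS", "MGL", "MSE", "MTR"], batch_size=2)
```

Under these assumptions the weights alone differ by a factor of roughly 40 between the smallest and largest checkpoints, which is why a batch_size that works for progen2-small may not fit for progen2-xlarge.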

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
