ProGen2 Protein Language Model
License: ProGen2 is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

This generator is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


proto-bio/proto-language/proto_language/generator/progen2_generator.py
@article{nijkamp2023progen2,
  title={ProGen2: Exploring the boundaries of protein language models},
  author={Nijkamp, Erik and Ruffolo, Jeffrey A and Weinstein, Eli N and Naik, Nikhil and Madani, Ali},
  journal={Cell Systems},
  volume={14},
  number={11},
  pages={968--978},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.cels.2023.10.002}
}
Protein sequence generator using the ProGen2 autoregressive language model.

API Reference

Config: ProGen2GeneratorConfig
Configuration object for ProGen2Generator. This class defines configuration parameters for the ProGen2 generator, which uses the ProGen2 protein language model to autoregressively generate protein sequences from prompt sequences. Models are loaded from HuggingFace: https://huggingface.co/hugohrban/
prompts
List[string]
required
Prompt sequences for protein sequence generation
model_checkpoint
enum
default:"progen2-large"
ProGen2 model variant to load (e.g. progen2-large). Options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge
local_path
string
Path to local model weights
device
string
default:"cuda"
GPU device to run ProGen2 on (e.g. ‘cuda’ or ‘cuda:0’).
temperature
number
default:"0.2"
Sharpness of sampling. Below 1 favors high-probability tokens; above 1 increases diversity.
top_p
number
default:"0.95"
Nucleus sampling cumulative probability cutoff. 1.0 disables nucleus sampling.
top_k
integer
default:"0"
At each step, restrict sampling to the k most probable tokens. Set to 0 to disable top-k truncation.
truncate_at_stop
boolean
default:"True"
Whether to truncate sequences at stop tokens
strip_special_tokens
boolean
default:"True"
Whether to strip start and stop tokens from final output
prepend_prompt
boolean
default:"True"
Whether to prepend prompt to generation
batch_size
integer
default:"1"
Number of sequences to process simultaneously on GPU
verbose
boolean
default:"False"
Whether to print verbose output
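
The three sampling parameters above (temperature, top_k, top_p) interact at every generation step. The following is a minimal sketch of that interaction in plain Python, not the generator's own code: the function names `candidate_tokens` and `sample_token` are hypothetical, and the order of operations (temperature, then top-k, then top-p) is an assumption that follows common practice.

```python
import math
import random


def candidate_tokens(logits, temperature=0.2, top_k=0, top_p=0.95):
    """Return the token indices that survive filtering, most probable first.

    Illustrative only: names and ordering of the filters are assumptions.
    """
    # Temperature scaling: dividing by T < 1 sharpens the distribution,
    # dividing by T > 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative
    # probability reaches top_p (1.0 disables the filter).
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    return order


def sample_token(logits, temperature=0.2, top_k=0, top_p=0.95, rng=None):
    """Draw one token index from the filtered distribution."""
    rng = rng or random.Random()
    kept = candidate_tokens(logits, temperature, top_k, top_p)
    # Weights are the temperature-scaled logits of the surviving tokens;
    # random.choices normalizes them internally.
    peak = max(logits[i] / temperature for i in kept)
    weights = [math.exp(logits[i] / temperature - peak) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With the documented defaults (temperature 0.2, top_p 0.95, top_k 0), a clear favorite among the logits tends to dominate the nucleus, which is why low temperatures give conservative, low-diversity sequences. Raising the temperature toward 1 widens the set of surviving tokens.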

Usage

python
from proto_language.generator import ProGen2Generator, ProGen2GeneratorConfig
from proto_language.core import Segment

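config = ProGen2GeneratorConfig(
    # `prompts` is the only required parameter; "MKT" is a placeholder prompt.
    prompts=["MKT"],
    # All other parameters fall back to the defaults listed above.
)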

generator = ProGen2Generator(config)

segment = Segment(length=100, sequence_type="protein")
generator.assign(segment)
generator.sample()
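
The output-shaping flags in the config (truncate_at_stop, strip_special_tokens, prepend_prompt) can be pictured as a small post-processing step applied to each raw generation. The sketch below is an assumption about how such flags might combine, not the generator's actual implementation: the helper `postprocess` is hypothetical, and the start and stop symbols "1" and "2" are assumed here to mirror ProGen2-style terminus tokens.

```python
START_TOKEN = "1"  # assumption: marker for the start of a sequence
STOP_TOKEN = "2"   # assumption: marker for the end of a sequence


def postprocess(prompt, continuation,
                truncate_at_stop=True,
                strip_special_tokens=True,
                prepend_prompt=True):
    """Shape one raw continuation into the final output string (hypothetical helper)."""
    text = continuation

    # Cut everything after the first stop token, keeping the token itself
    # so that stripping remains a separate, independent step.
    if truncate_at_stop:
        stop = text.find(STOP_TOKEN)
        if stop != -1:
            text = text[: stop + 1]

    # Put the prompt back in front of the generated continuation.
    if prepend_prompt:
        text = prompt + text

    # Remove start and stop markers from the final string.
    if strip_special_tokens:
        text = text.replace(START_TOKEN, "").replace(STOP_TOKEN, "")

    return text


# Example: the prompt "1MKT" followed by a continuation that overruns the stop token.
print(postprocess("1MKT", "AYIAK2GGG"))
```

Keeping truncation and stripping as separate steps means each flag can be toggled on its own, for example to inspect the raw markers while still discarding text after the stop token.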

Metadata

Property: Value
Key: progen2
Class: ProGen2Generator
Category: autoregressive
Input Type: prompt
Uses GPU: True
Supported Sequence Types: protein
Allows Empty Start: False