ProGen2 Protein Language Model

License: ProGen2 is open source and free for academic and commercial use under a BSD-3-Clause license. Please refer to the license for full terms.

This generator is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.

proto-bio/proto-language/proto_language/generator/progen2_generator.py

@article{nijkamp2023progen2,
  title={ProGen2: Exploring the boundaries of protein language models},
  author={Nijkamp, Erik and Ruffolo, Jeffrey A and Weinstein, Eli N and Naik, Nikhil and Madani, Ali},
  journal={Cell Systems},
  volume={14},
  number={11},
  pages={968--978},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.cels.2023.10.002}
}

Protein sequence generator that uses the ProGen2 autoregressive language model.

API Reference

Config: ProGen2GeneratorConfig

Configuration object for ProGen2Generator. This class defines configuration parameters for the ProGen2 generator, which uses the ProGen2 protein language model to autoregressively generate protein sequences from prompt sequences. Models are loaded from HuggingFace: https://huggingface.co/hugohrban/

For detailed information on ProGen2, see:

HuggingFace: https://huggingface.co/hugohrban/
GitHub: https://github.com/hugohrban/ProGen2-finetuning
Original GitHub: https://github.com/enijkamp/progen2
Original paper: https://www.cell.com/cell-systems/fulltext/S2405-4712(23)00272-7

prompts

List[string]

required

Prompt sequences for protein sequence generation

model_checkpoint

enum

default:"progen2-large"

ProGen2 model variant to load (e.g. progen2-large). Options: progen2-small, progen2-medium, progen2-base, progen2-oas, progen2-large, progen2-BFD90, progen2-xlarge
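
As a rough illustration of how a checkpoint name could be checked before loading, the sketch below validates a name against the options listed above and builds a HuggingFace repository id. The helper `resolve_repo_id` and the `namespace/checkpoint` naming pattern are assumptions for illustration, not part of the documented API.

```python
# Checkpoint names as listed in the model_checkpoint options.
CHECKPOINTS = (
    "progen2-small",
    "progen2-medium",
    "progen2-base",
    "progen2-oas",
    "progen2-large",
    "progen2-BFD90",
    "progen2-xlarge",
)
DEFAULT_CHECKPOINT = "progen2-large"


def resolve_repo_id(model_checkpoint=DEFAULT_CHECKPOINT, namespace="hugohrban"):
    """Hypothetical helper: validate the name and build a repo id.

    Assumes repositories are named '<namespace>/<checkpoint>'.
    """
    if model_checkpoint not in CHECKPOINTS:
        options = ", ".join(CHECKPOINTS)
        raise ValueError(f"Unknown checkpoint {model_checkpoint!r}; options: {options}")
    return f"{namespace}/{model_checkpoint}"


repo_id = resolve_repo_id("progen2-small")
```

Rejecting unknown names up front gives a clearer error than a failed download would.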

local_path

string

Path to local model weights

device

string

default:"cuda"

GPU device to run ProGen2 on (e.g. "cuda" or "cuda:0").

temperature

number

default:"0.2"

Sampling temperature, which controls how sharply the token distribution is peaked. Values below 1 favor high-probability tokens; values above 1 increase diversity.
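
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. It shows the general technique only; the function name and the example logits are illustrative, and the generator's internal implementation may differ.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, 0.2)    # the documented default
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

In this example the top token's probability is highest at temperature 0.2 and lowest at 2.0, which is why low temperatures give more conservative samples.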

top_p

number

default:"0.95"

Nucleus sampling cumulative probability cutoff. 1.0 disables nucleus sampling.
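
The nucleus cutoff can be sketched as follows: keep the smallest set of most probable tokens whose cumulative probability reaches `top_p`, then renormalize. This is a plain-Python illustration of the general technique with made-up probabilities; `top_p_filter` is a hypothetical name and library implementations may handle ties and edge cases differently.

```python
def top_p_filter(probs, top_p):
    """Zero out tokens outside the nucleus and renormalize the rest."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical token probabilities
nucleus = top_p_filter(probs, 0.9)     # drops the least probable token
everything = top_p_filter(probs, 1.0)  # keeps every token
```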

top_k

integer

default:"0"

At each step, restrict sampling to the k most probable tokens. Set to 0 to disable top-k truncation.
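
A minimal sketch of top-k truncation, assuming the convention described above that 0 disables it. The function name and probabilities are illustrative only.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most probable tokens and renormalize."""
    if top_k <= 0 or top_k >= len(probs):
        return list(probs)  # 0 (or a k covering every token) leaves the distribution unchanged
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical token probabilities
top_two = top_k_filter(probs, 2)
unchanged = top_k_filter(probs, 0)
```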

truncate_at_stop

boolean

default:"True"

Whether to truncate sequences at stop tokens

strip_special_tokens

boolean

default:"True"

Whether to strip start and stop tokens from final output

prepend_prompt

boolean

default:"True"

Whether to prepend the prompt to the generated sequence
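
One way the three post-processing flags (truncate_at_stop, strip_special_tokens, prepend_prompt) could interact is sketched below. It assumes the start and stop tokens are the characters "1" and "2", as in the original ProGen2 repository, and that the steps run in the order shown. The function `postprocess` is hypothetical and the generator's actual order of operations is not documented here.

```python
START_TOKEN = "1"  # assumption: N-terminus marker used by ProGen2
STOP_TOKEN = "2"   # assumption: C-terminus marker used by ProGen2


def postprocess(prompt, continuation, truncate_at_stop=True,
                strip_special_tokens=True, prepend_prompt=True):
    """Hypothetical post-processing of one generated continuation."""
    text = continuation
    if truncate_at_stop:
        stop = text.find(STOP_TOKEN)
        if stop != -1:
            text = text[: stop + 1]  # discard everything after the first stop token
    if prepend_prompt:
        text = prompt + text
    if strip_special_tokens:
        text = text.replace(START_TOKEN, "").replace(STOP_TOKEN, "")
    return text


full = postprocess("1MKT", "AYIAK2GGG")
raw = postprocess("1MKT", "AYIAK2GGG", strip_special_tokens=False)
tail_only = postprocess("1MKT", "AYIAK2GGG", prepend_prompt=False)
```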

batch_size

integer

default:"1"

Number of sequences to process simultaneously on GPU
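
As an illustration of what batching means here, the sketch below splits a list of prompts into groups of at most `batch_size`. Whether the generator batches across prompts in exactly this way is an assumption, and `batched` is a hypothetical helper.

```python
def batched(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = ["MKT", "MAS", "MGS", "MEE", "MLK"]  # illustrative prompts
batches = list(batched(prompts, 2))
```

Larger batches generally trade GPU memory for throughput, so the default of 1 is the most memory-conservative setting.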

verbose

boolean

default:"False"

Whether to print verbose output

Usage

python

from proto_language.generator import ProGen2Generator, ProGen2GeneratorConfig
from proto_language.core import Segment

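config = ProGen2GeneratorConfig(
    prompts=["MKT"],  # required; illustrative prompt, replace with your own
    # Configure other parameters here (e.g. model_checkpoint, temperature)
)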

generator = ProGen2Generator(config)

segment = Segment(length=100, sequence_type="protein")
generator.assign(segment)
generator.sample()

Metadata

Property	Value
Key	`progen2`
Class	`ProGen2Generator`
Category	`autoregressive`
Input Type	`prompt`
Uses GPU	`True`
Supported Sequence Types	`protein`
Allows Empty Start	`False`
