Generators

Generators propose candidate sequences during optimization. Constraints define the requirements and optimizers orchestrate the search; generators determine where new candidates come from. At the start of every optimization step, each generator takes the current sequences in its Segment, applies its strategy (random mutation, a protein language model, or structure-conditioned design), and fills the proposal_sequences pool for the optimizer to evaluate.

Generator Categories

Proto organizes generators by how they produce sequences. The three most common categories are below; a fourth, gradient-based generation (PositionWeightGenerator), produces differentiable position weights for the Gradient optimizer. Each category makes different trade-offs between speed, biological realism, and required prior knowledge.

Mutation
Autoregressive
Inverse Folding

Refine existing sequences by modifying selected positions.

Mutation generators start from an existing sequence and introduce changes, either uniformly at random or guided by a protein language model’s uncertainty estimates. Most require a starting sequence (ESM2Generator, for example, raises an error if the segment has none); the random generators (RandomProteinGenerator, RandomNucleotideGenerator) are the exception and initialize one automatically when none is provided.

When to use: A starting sequence is available and the goal is to refine it. This is the most common category for iterative optimization.

python

from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
)
from proto_tools.transforms.masking import MaskingStrategy

generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=5))
)

See the Generator Reference for all available mutation generators and their configuration options.
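
To make the idea of uniform random mutation concrete, here is a minimal, library-free sketch. It is not Proto's implementation: the names mutate_random and propose are hypothetical, and the choice to always substitute a different base at each selected position is an assumption made for this illustration.

```python
import random

DNA_ALPHABET = "ACGT"


def mutate_random(sequence, num_mutations, rng):
    """Return a copy of `sequence` with `num_mutations` distinct positions changed."""
    if num_mutations > len(sequence):
        raise ValueError("num_mutations cannot exceed the sequence length")
    # Pick distinct positions so the number of edits is exactly num_mutations.
    positions = rng.sample(range(len(sequence)), num_mutations)
    residues = list(sequence)
    for pos in positions:
        # Assumption for this sketch: draw only from the other three bases,
        # so every selected position really changes.
        choices = [base for base in DNA_ALPHABET if base != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def propose(sequence, num_mutations, num_proposals, seed=0):
    """Build a small pool of mutated candidates from one starting sequence."""
    rng = random.Random(seed)
    return [mutate_random(sequence, num_mutations, rng) for _ in range(num_proposals)]


current = "ACGT" * 5
proposals = propose(current, num_mutations=5, num_proposals=3)
```

Each entry in proposals differs from current at exactly five positions, which mirrors what num_mutations=5 suggests in the configuration above.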

Generate sequences from scratch, token by token (left-to-right).

Autoregressive generators use large language models trained on biological sequences to generate entirely new sequences. They don’t need a starting sequence; they create one from a prompt or from nothing.

When to use: Novel sequences are needed (especially long DNA), or the goal is to sample from the model’s learned distribution over natural sequences.

Autoregressive generators overwrite any existing sequences in the segment: input sequences that are already present are replaced.

python

from proto_language.generator import (
    Evo2Generator, Evo2GeneratorConfig
)

generator = Evo2Generator(
    Evo2GeneratorConfig(
        prompts=["ACGT"],
        model_checkpoint="evo2_7b",
        temperature=1.0,
    )
)

See the Generator Reference for all available autoregressive generators and their configuration options.
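
The following toy sketch shows what left-to-right sampling with a temperature looks like in general. It does not call Evo2 or Proto; toy_next_token_logits is a hypothetical stand-in for a real model, and keeping the prompt as the prefix of the output is an assumption of this sketch only.

```python
import math
import random

ALPHABET = "ACGT"


def toy_next_token_logits(prefix):
    """Hypothetical stand-in for a language model: favors repeating the last base."""
    if not prefix:
        return [0.0, 0.0, 0.0, 0.0]
    return [2.0 if base == prefix[-1] else 0.0 for base in ALPHABET]


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; temperature must be greater than zero."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def generate(prompt, length, temperature, seed=0):
    """Extend `prompt` one token at a time until the sequence reaches `length`."""
    rng = random.Random(seed)
    sequence = prompt
    while len(sequence) < length:
        probs = softmax_with_temperature(toy_next_token_logits(sequence), temperature)
        sequence += rng.choices(ALPHABET, weights=probs, k=1)[0]
    return sequence


sample = generate("ACGT", length=24, temperature=1.0)
```

Lower temperatures concentrate probability on the highest-scoring token, so samples become more deterministic; higher temperatures flatten the distribution and increase diversity.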

Design sequences that fold into a target 3D structure.

Inverse folding generators solve the inverse protein design problem: given a desired backbone structure (as a PDB file), predict the amino acid sequence most likely to fold into that shape. They don’t need a starting sequence; unknown positions are initialized as X.

When to use: A target protein structure (from experiment or computational design) is available, and the goal is to find sequences that fold into it.

python

from proto_language.generator import (
    ProteinMPNNGenerator, ProteinMPNNGeneratorConfig
)

generator = ProteinMPNNGenerator(
    ProteinMPNNGeneratorConfig(
        structure_inputs="target.pdb",
        temperature=0.1,
    )
)

See the Generator Reference for all available inverse folding generators and their configuration options.

Assigning Generators to Segments

Before an optimizer can use a generator, it must be assigned to a specific Segment. This tells the generator which part of the construct to modify.

python

from proto_language.core import Segment
from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
)
from proto_tools.transforms.masking import MaskingStrategy

# Create a segment
segment = Segment(length=100, sequence_type="dna")

# Create and assign the generator
generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=5))
)
generator.assign(segment)

The assign() method validates compatibility:

- The segment’s sequence_type must be supported by the generator.
- Ligand segments cannot have generators assigned (they are fixed).

python

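from proto_language.core import Segment
# Import path assumed to match the other generators on this page
from proto_language.generator import ESM2Generator, ESM2GeneratorConfig

# This works: ESM2 supports protein
protein_segment = Segment(length=100, sequence_type="protein")
esm2_gen = ESM2Generator(ESM2GeneratorConfig())
esm2_gen.assign(protein_segment)

# This raises ValueError: ESM2 doesn't support DNA
dna_segment = Segment(length=100, sequence_type="dna")
esm2_gen.assign(dna_segment)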
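
The two compatibility rules can be pictured with a small, self-contained sketch. ToySegment, ToyGenerator, and try_assign are hypothetical names invented for this illustration, and representing a ligand as sequence_type="ligand" is an assumption; Proto's actual checks and error messages may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class ToySegment:
    length: int
    sequence_type: str  # "protein", "dna", or "ligand" in this sketch


@dataclass
class ToyGenerator:
    name: str
    supported_types: FrozenSet[str]
    segment: Optional[ToySegment] = None

    def assign(self, segment: ToySegment) -> None:
        # Rule 1: ligand segments are fixed, so no generator may target them.
        if segment.sequence_type == "ligand":
            raise ValueError("ligand segments are fixed and cannot have a generator")
        # Rule 2: the generator must support the segment's sequence type.
        if segment.sequence_type not in self.supported_types:
            raise ValueError(f"{self.name} does not support {segment.sequence_type!r}")
        self.segment = segment


def try_assign(generator, segment):
    """Return 'ok' or the error message, so both outcomes are easy to inspect."""
    try:
        generator.assign(segment)
        return "ok"
    except ValueError as error:
        return str(error)


protein_only = ToyGenerator("toy-protein-lm", frozenset({"protein"}))
results = {
    kind: try_assign(protein_only, ToySegment(100, kind))
    for kind in ("protein", "dna", "ligand")
}
```

Failing early at assignment time means a mismatched generator is caught before any optimization steps run.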

Multiple Generators

In multi-segment constructs, different generators can be assigned to different segments. Each generator independently proposes candidates for its assigned segment:

python

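from proto_language.core import Segment, Construct
from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
)
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_tools.transforms.masking import MaskingStrategy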

# Two segments with different generation strategies
promoter = Segment(length=200, sequence_type="dna", label="promoter")
coding_seq = Segment(length=300, sequence_type="dna", label="cds")

construct = Construct([promoter, coding_seq])

# Assign different generators to each segment
gen_promoter = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=20))
)
gen_cds = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=6))
)

gen_promoter.assign(promoter)
gen_cds.assign(coding_seq)

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[gen_promoter, gen_cds],
    constraints=[...],
    config=MCMCOptimizerConfig(num_steps=500, num_results=5, proposals_per_result=10),
)

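Use different mutation counts for different segments. Conserved regions (like coding sequences) benefit from fewer mutations per step, while exploratory regions (like promoters) can tolerate more. In the example above, the promoter changes up to 20 of its 200 positions per proposal (10%), while the coding sequence changes up to 6 of its 300 (2%).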

GPU Memory & Batch Size

GPU generators process multiple proposal sequences per forward pass. The batch_size config parameter controls how many sequences are sent to the GPU at once. All generators default to batch_size=1 (sequential processing); increase it to enable batching. The framework splits the full set of proposals into chunks of batch_size and processes each chunk on the GPU. For example, if the optimizer requests 50 proposals and batch_size=16, the generator runs 4 forward passes (16 + 16 + 16 + 2).
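
The chunking arithmetic can be checked with a few lines of plain Python. This is only a sketch of the splitting described above; split_into_batches is a hypothetical helper, not a function from the framework.

```python
def split_into_batches(proposals, batch_size):
    """Split a list of proposals into consecutive chunks of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        proposals[start:start + batch_size]
        for start in range(0, len(proposals), batch_size)
    ]


proposals = [f"seq_{index}" for index in range(50)]
batches = split_into_batches(proposals, batch_size=16)
batch_sizes = [len(batch) for batch in batches]  # [16, 16, 16, 2]
```

With the default batch_size=1 the same 50 proposals would take 50 forward passes, which is why raising batch_size helps throughput as long as GPU memory allows.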

If GPU out-of-memory (OOM) errors occur, reduce the generator’s batch_size in its config. This is especially common with long sequences or large models. See the Generator Reference for per-generator configuration details and available parameters.

Next Steps

Constraints: Quality requirements that sequences must satisfy

Optimizers: Learn how optimizers coordinate generators and constraints

Tools: Explore the bioinformatics tools that power generators

Generator Reference: Full API reference for each generator
