Skip to main content
Making Biology Programmable
proto-bio/proto-language

Proto

Proto is a high-level programming language for designing DNA, RNA, and protein sequences. The required properties of a sequence are expressed as constraints; generators propose candidate sequences and optimizers search for candidates that satisfy them.

Overview

Biological sequence design is typically multi-objective: a single sequence must meet several requirements at once. A designed protein may need to fold to a target structure, bind a target, express in a host organism, and remain soluble. A coding sequence may need a controlled GC content, codon usage suited to its host, no long homopolymer runs, and the absence of specified restriction sites. Proto represents each requirement as a separate constraint and optimizes against the full set rather than a single objective. A design is specified declaratively. The sequence regions to be designed are defined as segments; a generator is assigned to each region to propose candidates; constraints score how well each candidate meets a requirement; and one or more optimizers search sequence space to minimize the combined score.
python
from proto_language.core import Segment, Construct, Constraint, Program
from proto_language.generator import RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.constraint import gc_content_constraint, max_homopolymer_constraint
from proto_tools.transforms.masking import MaskingStrategy

# Define a 200bp DNA sequence to optimize
dna = Segment(length=200, sequence_type="dna")
construct = Construct(segments=[dna])

# Generator: random point mutations to explore sequence space
gen = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3)))
gen.assign(dna)

# Constraints: what the sequence must satisfy
constraints = [
    Constraint(inputs=[dna], function=gc_content_constraint,
               function_config={"min_gc": 45, "max_gc": 55}, weight=1.0),
    Constraint(inputs=[dna], function=max_homopolymer_constraint,
               function_config={"max_length": 5}, threshold=0.0),
]

# Optimize with MCMC
optimizer = MCMCOptimizer(
    constructs=[construct], generators=[gen], constraints=constraints,
    config=MCMCOptimizerConfig(num_steps=500, num_results=5),
)

program = Program(optimizers=[optimizer], num_results=5)
program.run()

# Results: 5 optimized sequences ranked by quality
for seq in construct.joined_sequences:
    print(seq.sequence)

How It Works

1

Segments and Constructs

Segments are contiguous sequence regions to be designed; they are grouped into Constructs. A segment is initialized either from a target length or from an existing sequence.
python
promoter = Segment(length=50, sequence_type="dna")
cds = Segment(sequence="ATGAAA...", sequence_type="dna")
gene = Construct(segments=[promoter, cds])
2

Generators

A generator is assigned to a segment and proposes new sequences on each iteration. Generators range from random mutation to protein language models such as ESM2, ESM3, and ProteinMPNN.
python
mutation_gen = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=2))
)
mutation_gen.assign(promoter)
3

Constraints

A constraint scores how well each proposal meets a requirement, from 0.0 (perfect) to 1.0 (worst). It uses either weight for soft scoring or threshold for hard pass/fail filtering.
python
gc = Constraint(
    inputs=[promoter],
    function=gc_content_constraint,
    function_config={"min_gc": 40, "max_gc": 60},
    weight=1.0,
)
4

Optimizers and Programs

An optimizer searches sequence space to minimize the total constraint score. A Program chains several optimizers into a multi-stage pipeline, for example broad exploration followed by fine-tuning.
python
program = Program(optimizers=[optimizer], num_results=5)
program.run()
results = construct.joined_sequences  # Ranked by quality

Architecture

The framework has the following components, which form an optimization loop:
1. Define2. Generate3. Evaluate4. OptimizeSegments(sequence regions)Constructs(multi-segment units)Generators(mutation, ESM2, Evo2,ProteinMPNN, …)Tools(ESMFold, Boltz, Enformer,MMseqs2, …)Constraints(GC%, pLDDT, binding,expression, …)Optimizers(MCMC, Rejection Sampling,Beam Search, …)5. Results(ranked sequences)iterate
1. Define2. Generate3. Evaluate4. OptimizeSegments(sequence regions)Constructs(multi-segment units)Generators(mutation, ESM2, Evo2,ProteinMPNN, …)Tools(ESMFold, Boltz, Enformer,MMseqs2, …)Constraints(GC%, pLDDT, binding,expression, …)Optimizers(MCMC, Rejection Sampling,Beam Search, …)5. Results(ranked sequences)iterate
ComponentPurposeExamples
SegmentsContiguous sequence regions to design200bp promoter, 100aa protein domain, variable CDR loop
ConstructsMulti-segment containersPromoter + CDS + terminator, multi-chain protein complex
GeneratorsPropose new proposal sequences each iterationRandom mutation, protein language models, inverse folding, autoregressive DNA/protein models
ConstraintsScore how well sequences meet requirements (0 = perfect, 1 = worst)Sequence composition, protein structure, RNA splicing, functional annotation, and more
OptimizersSearch algorithms that minimize the total constraint scoreMCMC, Rejection Sampling, Beam Search, Gradient descent, Cycling
ProgramsMulti-stage optimizer pipelinesRejection Sampling exploration then MCMC fine-tuning

Applications

Proteins can be designed for predicted structural properties. ESM2 or ProteinMPNN generate proposals, which are scored by ESMFold or Boltz2 for folding confidence, by TM-score for structural similarity, and by additional quality metrics.
python
from proto_language.core import Segment, Construct, Constraint, Program
from proto_language.generator import ESM2Generator, ESM2GeneratorConfig
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.constraint import (
    structure_plddt_constraint, balanced_aa_constraint,
)

protein = Segment(length=80, sequence_type="protein")
construct = Construct(segments=[protein])

from proto_tools.transforms.masking import MaskingStrategy
gen = ESM2Generator(ESM2GeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3)))
gen.assign(protein)

constraints = [
    # High predicted structure confidence
    Constraint(inputs=[protein], function=structure_plddt_constraint,
               function_config={"structure_tool": "esmfold"}, weight=2.0),
    # Balanced amino acid composition
    Constraint(inputs=[protein], function=balanced_aa_constraint,
               function_config={}, weight=1.0),
]

optimizer = MCMCOptimizer(
    constructs=[construct], generators=[gen], constraints=constraints,
    config=MCMCOptimizerConfig(num_steps=200, num_results=5),
)
Program(optimizers=[optimizer], num_results=5).run()

Key Features

Declarative Design

Sequences are specified by the properties they must satisfy rather than by a search procedure. Constraints define the requirements; the optimizer performs the search.

Composable Components

Generators, constraints, and optimizers combine freely. Multi-stage pipelines chain broad exploration with targeted refinement.

Integrated ML Models

Built-in support for protein language models, structure predictors, inverse-folding models, and genomic deep-learning models.

Bioinformatics Tools

Tools for structure prediction, sequence search, motif analysis, splicing prediction, and annotation are callable as constraints.

Multi-Objective Optimization

Competing requirements are balanced through weighted scoring and hard threshold filters across any number of constraints.

CPU and GPU

Lightweight generators and constraints run on CPU; structure prediction, language models, and genomic deep learning run on GPU when available.

Get Started

Installation

Install Proto on CPU or GPU, using pip or conda.

Quickstart

A step-by-step, runnable tutorial for a first design.

Core Concepts

Reference on segments, constructs, generators, constraints, optimizers, and programs.