Making Biology Programmable

Proto

Proto is a high-level programming language for designing DNA, RNA, and protein sequences. The required properties of a sequence are expressed as constraints; generators propose candidate sequences and optimizers search for candidates that satisfy them.

Overview

Biological sequence design is typically multi-objective: a single sequence must meet several requirements at once. A designed protein may need to fold to a target structure, bind a target, express in a host organism, and remain soluble. A coding sequence may need a controlled GC content, codon usage suited to its host, no long homopolymer runs, and the absence of specified restriction sites. Proto represents each requirement as a separate constraint and optimizes against the full set rather than a single objective.

A design is specified declaratively. The sequence regions to be designed are defined as segments; a generator is assigned to each region to propose candidates; constraints score how well each candidate meets a requirement; and one or more optimizers search sequence space to minimize the combined score.

python

from proto_language.core import Segment, Construct, Constraint, Program
from proto_language.generator import RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.constraint import gc_content_constraint, max_homopolymer_constraint
from proto_tools.transforms.masking import MaskingStrategy

# Define a 200bp DNA sequence to optimize
dna = Segment(length=200, sequence_type="dna")
construct = Construct(segments=[dna])

# Generator: random point mutations to explore sequence space
gen = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3)))
gen.assign(dna)

# Constraints: what the sequence must satisfy
constraints = [
    Constraint(inputs=[dna], function=gc_content_constraint,
               function_config={"min_gc": 45, "max_gc": 55}, weight=1.0),
    Constraint(inputs=[dna], function=max_homopolymer_constraint,
               function_config={"max_length": 5}, threshold=0.0),
]

# Optimize with MCMC
optimizer = MCMCOptimizer(
    constructs=[construct], generators=[gen], constraints=constraints,
    config=MCMCOptimizerConfig(num_steps=500, num_results=5),
)

program = Program(optimizers=[optimizer], num_results=5)
program.run()

# Results: 5 optimized sequences ranked by quality
for seq in construct.joined_sequences:
    print(seq.sequence)

How It Works

Segments and Constructs

Segments are contiguous sequence regions to be designed; they are grouped into Constructs. A segment is initialized either from a target length or from an existing sequence.

python

promoter = Segment(length=50, sequence_type="dna")
cds = Segment(sequence="ATGAAA...", sequence_type="dna")
gene = Construct(segments=[promoter, cds])

Generators

A generator is assigned to a segment and proposes new sequences on each iteration. Generators range from random mutation to learned models, such as the protein language models ESM2 and ESM3 and the inverse-folding model ProteinMPNN.

python

mutation_gen = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=2))
)
mutation_gen.assign(promoter)
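
To make the idea of a mutation-based proposal concrete, here is a minimal standalone sketch of a random point-mutation step. It does not use Proto and is not Proto's implementation; the function name `propose_point_mutations` and the choice to mutate distinct positions are assumptions made for illustration only.

```python
import random

DNA_ALPHABET = "ACGT"


def propose_point_mutations(sequence: str, num_mutations: int, rng: random.Random) -> str:
    """Return a copy of `sequence` with `num_mutations` distinct positions changed.

    Hypothetical helper for illustration; not part of the Proto API.
    """
    # Pick distinct positions so each mutation lands on a different base.
    positions = rng.sample(range(len(sequence)), k=min(num_mutations, len(sequence)))
    bases = list(sequence)
    for pos in positions:
        # Choose only among bases that differ from the current one,
        # so every selected position really changes.
        alternatives = [b for b in DNA_ALPHABET if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the example is reproducible
parent = "ATGAAACCCGGGTTTATGAA"
child = propose_point_mutations(parent, num_mutations=2, rng=rng)
num_changed = sum(a != b for a, b in zip(parent, child))
print(parent)
print(child, f"({num_changed} positions changed)")
```

A model-based generator would replace the uniform choice of replacement bases with samples from a learned distribution, while keeping the same "take a sequence, return a modified sequence" shape.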

Constraints

A constraint scores how well each proposal meets a requirement, from 0.0 (perfect) to 1.0 (worst). Each constraint is configured with either a weight, which makes it a soft term in the combined score, or a threshold, which makes it a hard pass/fail filter.

python

gc = Constraint(
    inputs=[promoter],
    function=gc_content_constraint,
    function_config={"min_gc": 40, "max_gc": 60},
    weight=1.0,
)
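
The 0.0 to 1.0 scoring convention can be illustrated with a small standalone sketch. The functions below are hypothetical stand-ins written for this example, not the bodies of `gc_content_constraint` or `max_homopolymer_constraint`; the linear penalty outside the GC window and the rule "a hard constraint passes when its score is at or below the threshold" are both assumptions.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in the sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


def gc_window_score(sequence: str, min_gc: float, max_gc: float) -> float:
    """0.0 inside [min_gc, max_gc]; rises linearly to 1.0 at 0% or 100% GC."""
    gc = gc_percent(sequence)
    if min_gc <= gc <= max_gc:
        return 0.0
    if gc < min_gc:
        return (min_gc - gc) / min_gc
    return (gc - max_gc) / (100.0 - max_gc)


def longest_homopolymer(sequence: str) -> int:
    """Length of the longest run of a single repeated base."""
    longest = run = 0
    previous = None
    for base in sequence:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def homopolymer_score(sequence: str, max_length: int) -> float:
    """Binary score: 0.0 if no run exceeds max_length, otherwise 1.0."""
    return 0.0 if longest_homopolymer(sequence) <= max_length else 1.0


def evaluate(sequence: str) -> tuple[float, bool]:
    """Combine one soft (weighted) and one hard (threshold) constraint."""
    weight, threshold = 1.0, 0.0  # assumed semantics, mirroring the example above
    soft_total = weight * gc_window_score(sequence, min_gc=40, max_gc=60)
    passes_hard = homopolymer_score(sequence, max_length=5) <= threshold
    return soft_total, passes_hard


print(evaluate("ATGCGCATTTTTTTGCAC"))  # contains a run of seven T bases
print(evaluate("ATGCATGCATGCATGCATGC"))
```

Keeping every constraint on the same 0 to 1 scale is what lets unrelated requirements be summed with weights without one of them dominating purely because of its units.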

Optimizers and Programs

An optimizer searches sequence space to minimize the total constraint score. A Program chains several optimizers into a multi-stage pipeline, for example broad exploration followed by fine-tuning.

python

program = Program(optimizers=[optimizer], num_results=5)
program.run()
results = construct.joined_sequences  # Ranked by quality
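
For readers unfamiliar with MCMC-style search, the following is a generic Metropolis sketch that minimizes a score over DNA sequences. It is a textbook version written for this page, not the internals of `MCMCOptimizer`; the toy `gc_score`, the single-base `mutate` step, and the `temperature` parameter are all assumptions.

```python
import math
import random


def gc_score(sequence: str, target: float = 0.5) -> float:
    """Toy score in [0, 1]: scaled distance of the GC fraction from `target`."""
    gc = sum(base in "GC" for base in sequence) / len(sequence)
    return abs(gc - target) / max(target, 1.0 - target)


def mutate(sequence: str, rng: random.Random) -> str:
    """Change one randomly chosen position to a different base."""
    pos = rng.randrange(len(sequence))
    new_base = rng.choice([b for b in "ACGT" if b != sequence[pos]])
    return sequence[:pos] + new_base + sequence[pos + 1:]


def metropolis(start, score, num_steps, temperature, rng):
    """Minimize `score` with Metropolis acceptance; return the best state seen."""
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(num_steps):
        candidate = mutate(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept a worse candidate with
        # probability exp(-delta / temperature) to escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


rng = random.Random(42)  # fixed seed for reproducibility
start = "A" * 40         # worst case for a 50% GC target
best, best_score = metropolis(start, gc_score, num_steps=500, temperature=0.05, rng=rng)
print(best, round(best_score, 3))
```

Lower temperatures make the search greedier; higher temperatures accept more uphill moves and explore more broadly, which is one reason a pipeline might run a broad stage before a fine-tuning stage.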

Architecture

The framework has the following components, which form an optimization loop:

Component	Purpose	Examples
Segments	Contiguous sequence regions to design	200bp promoter, 100aa protein domain, variable CDR loop
Constructs	Multi-segment containers	Promoter + CDS + terminator, multi-chain protein complex
Generators	Propose new candidate sequences each iteration	Random mutation, protein language models, inverse folding, autoregressive DNA/protein models
Constraints	Score how well sequences meet requirements (0 = perfect, 1 = worst)	Sequence composition, protein structure, RNA splicing, functional annotation, and more
Optimizers	Search algorithms that minimize the total constraint score	MCMC, Rejection Sampling, Beam Search, Gradient Descent, Cycling
Programs	Multi-stage optimizer pipelines	Rejection Sampling exploration then MCMC fine-tuning
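
The way these components could chain into a multi-stage pipeline is sketched below in plain Python. This is an illustrative model only and does not reflect how `Program` is implemented: the stage functions, the "sample many, keep the best few" reading of rejection sampling, and the use of a simple hill climb as a stand-in for the fine-tuning stage are assumptions.

```python
import random
from typing import Callable, List

Score = Callable[[str], float]
Stage = Callable[[List[str], Score, random.Random], List[str]]


def gc_score(sequence: str) -> float:
    """Toy score in [0, 1]: distance of the GC fraction from 50%."""
    gc = sum(base in "GC" for base in sequence) / len(sequence)
    return abs(gc - 0.5) / 0.5


def sampling_stage(num_samples: int, num_results: int) -> Stage:
    """Exploration: draw random sequences and keep the best-scoring few."""
    def stage(population, score, rng):
        length = len(population[0])
        samples = list(population)
        for _ in range(num_samples):
            samples.append("".join(rng.choice("ACGT") for _ in range(length)))
        return sorted(samples, key=score)[:num_results]
    return stage


def hill_climb_stage(num_steps: int) -> Stage:
    """Refinement: accept single-base changes that do not worsen the score."""
    def stage(population, score, rng):
        refined = []
        for seq in population:
            for _ in range(num_steps):
                pos = rng.randrange(len(seq))
                candidate = seq[:pos] + rng.choice("ACGT") + seq[pos + 1:]
                if score(candidate) <= score(seq):
                    seq = candidate
            refined.append(seq)
        return sorted(refined, key=score)
    return stage


def run_pipeline(stages: List[Stage], start: List[str], score: Score, rng) -> List[str]:
    """Feed the output population of each stage into the next one."""
    population = start
    for stage in stages:
        population = stage(population, score, rng)
    return population


rng = random.Random(7)  # fixed seed for reproducibility
start = ["A" * 30]
results = run_pipeline([sampling_stage(200, 5), hill_climb_stage(100)], start, gc_score, rng)
for seq in results:
    print(seq, round(gc_score(seq), 3))
```

The point of the sketch is the data flow: each stage consumes a ranked population and returns a new one, so stages with different search behavior can be swapped or reordered without changing the scoring code.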

Applications

Protein Design

Proteins can be designed for predicted structural properties. ESM2 or ProteinMPNN generate proposals, which are scored by ESMFold or Boltz2 for folding confidence, by TM-score for structural similarity, and by additional quality metrics.

python

from proto_language.core import Segment, Construct, Constraint, Program
from proto_language.generator import ESM2Generator, ESM2GeneratorConfig
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.constraint import (
    structure_plddt_constraint, balanced_aa_constraint,
)
from proto_tools.transforms.masking import MaskingStrategy

protein = Segment(length=80, sequence_type="protein")
construct = Construct(segments=[protein])

gen = ESM2Generator(ESM2GeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3)))
gen.assign(protein)

constraints = [
    # High predicted structure confidence
    Constraint(inputs=[protein], function=structure_plddt_constraint,
               function_config={"structure_tool": "esmfold"}, weight=2.0),
    # Balanced amino acid composition
    Constraint(inputs=[protein], function=balanced_aa_constraint,
               function_config={}, weight=1.0),
]

optimizer = MCMCOptimizer(
    constructs=[construct], generators=[gen], constraints=constraints,
    config=MCMCOptimizerConfig(num_steps=200, num_results=5),
)
Program(optimizers=[optimizer], num_results=5).run()

DNA Optimization

DNA sequences can be optimized for synthesis and expression: GC content, homopolymer runs, restriction sites, and promoter strength are controlled simultaneously.

python

from proto_language.core import Segment, Construct, Constraint, Program
from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig,
)
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.constraint import (
    gc_content_constraint, max_homopolymer_constraint,
)
from proto_tools.transforms.masking import MaskingStrategy

gene = Segment(length=300, sequence_type="dna")
construct = Construct(segments=[gene])

gen = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3)))
gen.assign(gene)

constraints = [
    Constraint(inputs=[gene], function=gc_content_constraint,
               function_config={"min_gc": 40, "max_gc": 60}, weight=1.0),
    Constraint(inputs=[gene], function=max_homopolymer_constraint,
               function_config={"max_length": 4}, threshold=0.0),
]

optimizer = MCMCOptimizer(
    constructs=[construct], generators=[gen], constraints=constraints,
    config=MCMCOptimizerConfig(num_steps=1000, num_results=10),
)
Program(optimizers=[optimizer], num_results=10).run()

RNA Engineering

RNA sequences can be designed for target secondary structures, splice-site properties, or regulatory motifs, combining sequence-level constraints with structure predictions.

python

from proto_language.core import Segment, Construct, Constraint, Program
from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig,
)
from proto_language.optimizer import RejectionSamplingOptimizer, RejectionSamplingOptimizerConfig
from proto_language.constraint import (
    gc_content_constraint, rna_property_similarity_constraint,
)
from proto_tools.transforms.masking import MaskingStrategy

rna = Segment(length=150, sequence_type="rna")
construct = Construct(segments=[rna])

gen = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=2)))
gen.assign(rna)

constraints = [
    Constraint(inputs=[rna], function=gc_content_constraint,
               function_config={"min_gc": 40, "max_gc": 55}, weight=1.0),
    Constraint(inputs=[rna], function=rna_property_similarity_constraint,
               function_config={"reference_sequence": "GGG" + "A" * 144 + "CCC"},
               weight=2.0),
]

optimizer = RejectionSamplingOptimizer(
    constructs=[construct], generators=[gen], constraints=constraints,
    config=RejectionSamplingOptimizerConfig(num_samples=500, num_results=10),
)
Program(optimizers=[optimizer], num_results=10).run()

Key Features

Declarative Design

Sequences are specified by the properties they must satisfy rather than by a search procedure. Constraints define the requirements; the optimizer performs the search.

Composable Components

Generators, constraints, and optimizers combine freely. Multi-stage pipelines chain broad exploration with targeted refinement.

Integrated ML Models

Built-in support for protein language models, structure predictors, inverse-folding models, and genomic deep-learning models.

Bioinformatics Tools

Tools for structure prediction, sequence search, motif analysis, splicing prediction, and annotation are callable as constraints.

Multi-Objective Optimization

Competing requirements are balanced through weighted scoring and hard threshold filters across any number of constraints.

CPU and GPU

Lightweight generators and constraints run on CPU; structure prediction, language models, and genomic deep learning run on GPU when available.

Get Started

Installation

Install Proto on CPU or GPU, using pip or conda.

Quickstart

A step-by-step, runnable tutorial for a first design.

Core Concepts

Reference on segments, constructs, generators, constraints, optimizers, and programs.
