Proto translates a set of biological requirements into optimized sequences. Without a framework, sequence design is a manual loop: a sequence is designed by hand, prediction tools are run one at a time, the results are evaluated, and the sequence is revised over many iterations. Proto automates this loop: requirements are declared once as constraints, an optimizer searches sequence space for sequences that satisfy them, and the output is a set of sequences ranked by score.

The Pipeline

Every design follows the same five-stage pipeline:
Define (Segments & Constructs) → Generate (propose candidates) → Evaluate (score with constraints) → Select (keep the best) → Results (optimized sequences), with Generate, Evaluate, and Select repeating on each iteration.

Core Components

Sequence

The fundamental data unit: a biological string (DNA, RNA, protein, or ligand) with type validation and rich metadata.

Segment

A region to design. Maintains dual pools of proposal and result sequences during optimization.

Construct

An ordered collection of Segments representing a complete biological design, like a gene expression cassette.

Generator

Proposes new sequences through mutation, autoregressive generation, inverse folding, or gradient-based design.

Constraint

Scores how well sequences meet requirements. Returns 0.0 (perfect) to 1.0 (worst violation).

Optimizer

Orchestrates the generate-evaluate-select loop using search strategies such as MCMC or beam search.
A Program chains multiple Optimizers together for multi-stage pipelines (e.g., coarse exploration followed by fine-tuning). See Programs.
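The idea of chaining stages can be illustrated without the library. The sketch below is not Proto's `Program` implementation: `hill_climb`, `run_stages`, and both scoring functions are hypothetical names, and a greedy search stands in for whatever optimizer each stage would really use. It only shows the shape of "coarse exploration followed by fine-tuning", where each stage starts from the previous stage's results.

```python
import random

def gc_penalty(seq):
    """Cheap toy score: 0.0 at 50% GC, 1.0 at 0% or 100% GC."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    return abs(gc - 0.5) * 2

def run_penalty(seq):
    """Stand-in for a costlier check: penalise homopolymer runs longer than 3."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return max(0, longest - 3) / len(seq)

def fine_score(seq):
    """Second-stage score: equal-weight mix of both toy penalties (an assumption)."""
    return 0.5 * gc_penalty(seq) + 0.5 * run_penalty(seq)

def hill_climb(seeds, score, num_steps, rng):
    """One stage: greedy single-base search starting from the best seed."""
    best = min(seeds, key=score)
    for _ in range(num_steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice("ACGT") + best[pos + 1:]
        if score(candidate) <= score(best):
            best = candidate
    return [best]

def run_stages(seeds, stages):
    """Run stages in order; each stage's results seed the next one."""
    for stage in stages:
        seeds = stage(seeds)
    return seeds

rng = random.Random(1)
start = ["A" * 40]
coarse = lambda seeds: hill_climb(seeds, gc_penalty, 500, rng)
fine = lambda seeds: hill_climb(seeds, fine_score, 500, rng)
results = run_stages(start, [coarse, fine])
```

The first stage uses only the cheap score, so the second stage spends its steps on candidates that already satisfy the coarse requirement.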

Data Flow

1. Define the design space

Segments specify the regions to be designed, for example a 100 bp promoter, a 300-residue enzyme, or an existing sequence to optimize. They are combined into a Construct representing the full biological unit.
python
promoter = Segment(length=100, sequence_type="dna", label="promoter")
cds = Segment(length=900, sequence_type="dna", label="cds")
construct = Construct([promoter, cds])

2. Assign generators to segments

Each Generator is assigned to a Segment and proposes candidate sequences. Different generators use different strategies: random mutation, model-guided generation, or inverse folding.
python
generator = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig())
generator.assign(promoter)
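
To make "random mutation" concrete, here is a minimal, library-independent sketch of a point-mutation proposal step. The function `propose_point_mutants` and its parameters are illustrative assumptions, not part of Proto's API, and `RandomNucleotideGenerator` may work differently.

```python
import random

DNA_ALPHABET = "ACGT"

def propose_point_mutants(parent, num_proposals, rng):
    """Return `num_proposals` copies of `parent`, each with one base substituted."""
    proposals = []
    for _ in range(num_proposals):
        pos = rng.randrange(len(parent))
        # Choose a base that differs from the current one so every proposal is a real change.
        new_base = rng.choice([b for b in DNA_ALPHABET if b != parent[pos]])
        proposals.append(parent[:pos] + new_base + parent[pos + 1:])
    return proposals

rng = random.Random(0)
parent = "ACGTACGTAC"
mutants = propose_point_mutants(parent, num_proposals=5, rng=rng)
```

Passing the random number generator in explicitly keeps the proposals reproducible for a given seed.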

3. Define constraints with scoring functions

Constraints evaluate sequences and return a score from 0.0 (perfect) to 1.0 (worst). They can operate on individual segments or across multiple segments.
python
gc_constraint = Constraint(
    inputs=[promoter],
    function=gc_content_constraint,
    function_config={"min_gc": 45, "max_gc": 55}
)
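
As an illustration of the 0.0 to 1.0 convention, the sketch below scores GC content against a target window. It is an assumed implementation written for this page, not the source of `gc_content_constraint`: the names `gc_percent` and `gc_window_score` and the linear scaling outside the window are choices made here for clarity.

```python
def gc_percent(seq):
    """GC content of a DNA string, in percent."""
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)

def gc_window_score(seq, min_gc, max_gc):
    """Return 0.0 inside [min_gc, max_gc], rising linearly to 1.0 at 0% or 100% GC."""
    gc = gc_percent(seq)
    if min_gc <= gc <= max_gc:
        return 0.0
    if gc < min_gc:
        return (min_gc - gc) / min_gc          # reaches 1.0 when gc == 0
    return (gc - max_gc) / (100.0 - max_gc)    # reaches 1.0 when gc == 100

in_window = gc_window_score("AACCGGTTAC", min_gc=45, max_gc=55)   # 50% GC
slightly_low = gc_window_score("ATGCATGCAT", min_gc=45, max_gc=55)  # 40% GC
worst_case = gc_window_score("AAAAAAAAAA", min_gc=45, max_gc=55)  # 0% GC
```

A graded score like this gives the optimizer a direction to move in: a sequence at 40% GC scores better than one at 0%, even though both violate the window.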

4. Configure and run the optimizer

The Optimizer drives the search loop: generate proposals, score them with constraints, select the best. It repeats until a stopping criterion is reached.
python
optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint],
    config=MCMCOptimizerConfig(num_steps=1000)
)
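
The generate, evaluate, select loop can be sketched as a small Metropolis-style search. This is a generic illustration of MCMC-flavoured sequence search, not the internals of `MCMCOptimizer`: the acceptance rule, the `temperature` parameter, and all function names are assumptions.

```python
import math
import random

def gc_score(seq, min_gc=45.0, max_gc=55.0):
    """Toy constraint: 0.0 inside the GC window, rising linearly toward 1.0 outside it."""
    gc = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    if gc < min_gc:
        return (min_gc - gc) / min_gc
    if gc > max_gc:
        return (gc - max_gc) / (100.0 - max_gc)
    return 0.0

def mutate(seq, rng):
    """Generate: substitute one randomly chosen position with a random base."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice("ACGT") + seq[pos + 1:]

def metropolis_search(start, score_fn, num_steps, temperature, rng):
    current, current_score = start, score_fn(start)
    best, best_score = current, current_score
    for _ in range(num_steps):
        candidate = mutate(current, rng)          # generate
        candidate_score = score_fn(candidate)     # evaluate
        delta = candidate_score - current_score
        # Select: always accept improvements, occasionally accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
        # Track the best sequence seen, independent of where the chain wanders.
        if current_score < best_score:
            best, best_score = current, current_score
    return best, best_score

rng = random.Random(42)
start = "A" * 100
best, best_score = metropolis_search(
    start, gc_score, num_steps=1000, temperature=0.01, rng=rng
)
```

Accepting some worse moves lets the chain leave local optima, while the separately tracked `best` ensures that exploration never discards the best sequence found.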

5. Retrieve optimized sequences

The optimizer is wrapped in a Program, and calling program.run() executes the search. After it completes, the best sequences are stored in each Segment’s result_sequences pool, and the Construct’s joined_sequences gives the full concatenated results.
python
program = Program(optimizers=[optimizer], num_results=1)
program.run()

for seq in construct.joined_sequences:
    print(seq.sequence)
    print(seq.metadata["segments"])  # Per-segment constraint scores

The Dual Pool Architecture

A key design decision in Proto is that each Segment maintains two separate sequence pools. This separation lets optimizers explore freely without losing the best solutions found so far.
Diagram: inside a SEGMENT, the Generator (samples) proposes into proposal_sequences (working space for proposals). The proposals are scored by Constraints (evaluates) and ranked by the Optimizer (selects), which promotes the best to result_sequences (best sequences found). The result pool seeds the next round.

Proposal Pool

  • Purpose: Workspace for the optimizer
  • Populated by: Generators (mutations, new proposals)
  • Consumed by: Constraints (scoring) and Optimizer (selection)
  • Lifecycle: Rebuilt every optimization step

Result Pool

  • Purpose: Best results found so far
  • Populated by: Optimizer (after ranking proposals)
  • Consumed by: User (final output), next stage in a Program
  • Lifecycle: Persists across optimization steps and stages
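
The two lifecycles above can be sketched with plain lists. `ToySegment`, `run_step`, and the toy generator and score are hypothetical stand-ins, and letting existing results compete with new proposals during ranking is an assumption of this sketch rather than documented Proto behavior.

```python
from dataclasses import dataclass, field

@dataclass
class ToySegment:
    """Hypothetical stand-in for a Segment; holds the two pools as plain lists."""
    proposal_sequences: list = field(default_factory=list)
    result_sequences: list = field(default_factory=list)

def run_step(segment, propose, score, num_results):
    # Proposal pool: rebuilt from scratch each step, seeded by the current results.
    segment.proposal_sequences = propose(segment.result_sequences)
    # Rank results and proposals together; lower score is better, ties break alphabetically.
    candidates = set(segment.result_sequences) | set(segment.proposal_sequences)
    ranked = sorted(candidates, key=lambda s: (score(s), s))
    # Result pool: persists across steps and keeps only the best seen so far.
    segment.result_sequences = ranked[:num_results]

def toy_score(seq):
    """Toy constraint: 0.0 when exactly half the bases are G or C."""
    gc = sum(base in "GC" for base in seq)
    return abs(gc - len(seq) / 2) / (len(seq) / 2)

def toy_propose(seeds):
    """Toy generator: every single-position substitution to G."""
    return [s[:i] + "G" + s[i + 1:] for s in seeds for i in range(len(s))]

segment = ToySegment(result_sequences=["AAAA"])
for _ in range(3):
    run_step(segment, toy_propose, toy_score, num_results=2)
```

Because the proposal pool is overwritten on every step, a round of poor proposals cannot displace sequences already held in the result pool.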

Design Patterns

The simplest pattern: one segment, one constraint, one optimizer, used for a single design objective.

Example: Design a 100 bp DNA promoter with 50-60% GC content.
python
from proto_language.core import Segment, Construct
from proto_language import Program, Constraint
from proto_language.generator import RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
from proto_language.constraint import gc_content_constraint
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig

segment = Segment(length=100, sequence_type="dna", label="promoter")
construct = Construct([segment])

generator = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig())
generator.assign(segment)

constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 50, "max_gc": 60}
)

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[constraint],
    config=MCMCOptimizerConfig(num_steps=1000)
)

program = Program(optimizers=[optimizer], num_results=1)
program.run()

Choosing a Pattern

Diagram (decision tree):
  • Number of regions being designed? Multiple: Multi-Segment (cross-segment constraints if regions interact). One: continue to the next question.
  • How many objectives? One: Single Constraint (simplest pattern). Multiple: continue to the next question.
  • Coarse-to-fine search needed? No: Multi-Constraint (weights balance competing objectives). Yes: Multi-Stage (cheap constraints first, expensive later).
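
Read as code, the three questions above amount to a short function. The function name, its parameters, and the returned labels are illustrative only; nothing like it is claimed to exist in Proto.

```python
def choose_pattern(num_regions, num_objectives, coarse_to_fine):
    """Map the three decision questions to a pattern name."""
    if num_regions > 1:
        # Several regions: cross-segment constraints apply if the regions interact.
        return "Multi-Segment"
    if num_objectives <= 1:
        return "Single Constraint"
    # Several objectives on one region: stage the search only if coarse-to-fine is needed.
    return "Multi-Stage" if coarse_to_fine else "Multi-Constraint"

promoter_only = choose_pattern(num_regions=1, num_objectives=1, coarse_to_fine=False)
```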

Next Steps

Sequences

The fundamental data unit

Segments

Design regions and dual pools

Constraints

Scoring functions for design objectives

Quickstart

A first design, end to end