Quickstart

This tutorial builds a complete optimization pipeline that designs a 100bp DNA sequence optimized for two properties simultaneously:

GC content between 60-70% (higher than the typical ~50%, useful for thermostable organisms)
No homopolymer runs longer than 4bp (avoids synthesis errors and polymerase stalling)

It demonstrates the core Proto workflow end to end.

Overview

The result is a 100bp DNA sequence with verified properties:

Designed sequence (100bp):
GAAGCGGCCGTCACCGGCTGCGAGCTCGGAGTTACGATATGGACTTGCCGCGTGCTCCTCAGTCGGTACCGTCCTGTGGACGGAACGGCTGCTCGCGTCT

GC content: 65.0%
Longest homopolymer: 2bp

The exact sequence will differ because the process is stochastic, but the properties will be within the target ranges.
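
The two properties reported above can be checked independently of Proto. The following is a minimal sketch in plain Python; the helper names gc_percent and longest_homopolymer are illustrative and are not part of the Proto API.

```python
from itertools import groupby


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq.upper()) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base (0 for an empty string)."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


# The example sequence shown in the Overview
designed = "GAAGCGGCCGTCACCGGCTGCGAGCTCGGAGTTACGATATGGACTTGCCGCGTGCTCCTCAGTCGGTACCGTCCTGTGGACGGAACGGCTGCTCGCGTCT"

print(f"GC content: {gc_percent(designed):.1f}%")
print(f"Longest homopolymer: {longest_homopolymer(designed)}bp")
```

Running the same two helpers on your own results is a quick way to confirm that a design landed inside the target ranges.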

Prerequisites

Proto must be installed first, including the proto-tools submodule that the examples import.

This tutorial uses only CPU-based constraints. No GPU required.

Step-by-Step

Define the sequence

Every design starts with Segments and Constructs. A Segment is a contiguous region to be designed. A Construct groups one or more Segments into a single design unit.

python

from proto_language.core import Segment, Construct

# Create a 100bp DNA segment (random starting sequence)
segment = Segment(
    length=100,
    sequence_type="dna",
    label="my_sequence",
)

# Wrap it in a Construct
construct = Construct(segments=[segment])

total_length = sum(s.sequence_length for s in construct.segments)
print(f"Created: {total_length}bp {construct.sequence_type} construct")

Why Constructs? Real genetic designs often have multiple parts: a promoter, a coding sequence, a terminator. Each part is a Segment with its own generator and constraints. The Construct joins them for the optimizer.

Set up the generator

A Generator proposes new candidate sequences at each optimization step. RandomNucleotideGenerator introduces random point mutations; it is a baseline mutation generator for DNA/RNA sequence-level optimization.

python

from proto_language.generator import (
    RandomNucleotideGenerator,
    RandomNucleotideGeneratorConfig,
)
from proto_tools.transforms.masking import MaskingStrategy

# Mutate 3 random positions per step
generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3))
)

# Link the generator to our segment
generator.assign(segment)

Choosing num_mutations: Lower values (1-2) make small, conservative changes, good for fine-tuning. Higher values (5-10) make bigger jumps, good for escaping local optima. For a 100bp sequence, 3 mutations per step is a reasonable starting point.
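
To make the effect of num_mutations concrete, here is a minimal sketch of a point-mutation proposal in plain Python. It is not Proto's RandomNucleotideGenerator; the mutate and hamming helpers are hypothetical, and the choice to always substitute a different base is an assumption made so that the number of changed positions is easy to see.

```python
import random

DNA_ALPHABET = "ACGT"


def mutate(seq: str, num_mutations: int, rng: random.Random) -> str:
    """Return a copy of seq with num_mutations distinct positions changed."""
    if num_mutations > len(seq):
        raise ValueError("num_mutations cannot exceed the sequence length")
    bases = list(seq)
    for pos in rng.sample(range(len(seq)), num_mutations):
        # Assumption: always pick a base different from the current one
        bases[pos] = rng.choice([b for b in DNA_ALPHABET if b != bases[pos]])
    return "".join(bases)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
parent = "".join(rng.choice(DNA_ALPHABET) for _ in range(100))
for k in (1, 3, 10):
    child = mutate(parent, k, rng)
    print(f"num_mutations={k}: child differs from parent at {hamming(parent, child)} positions")
```

With 1 mutation the child is almost identical to its parent, so scores change slowly; with 10 mutations a tenth of the sequence changes in one step, which moves further but is more likely to undo earlier progress.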

Define constraints

Constraints score how well each proposed sequence meets a requirement. By convention a constraint returns a score between 0.0 (perfect) and 1.0 (worst), and the optimizer minimizes these scores.
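
As an illustration of the 0.0 to 1.0 convention, the sketch below scores GC content against a target window. It is not the implementation of gc_content_constraint; the function name gc_score and the linear penalty outside the window are assumptions chosen for clarity.

```python
def gc_score(seq: str, min_gc: float, max_gc: float) -> float:
    """Return 0.0 inside [min_gc, max_gc], rising linearly to 1.0 at 0% or 100% GC."""
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    if gc < min_gc:
        # Distance below the window, scaled so that 0% GC scores 1.0
        return (min_gc - gc) / min_gc
    if gc > max_gc:
        # Distance above the window, scaled so that 100% GC scores 1.0
        return (gc - max_gc) / (100.0 - max_gc)
    return 0.0


in_range = "G" * 65 + "A" * 35   # 65% GC, inside the 60-70 window
too_low = "G" * 50 + "A" * 50    # 50% GC, below the window
print(gc_score(in_range, 60, 70))
print(gc_score(too_low, 60, 70))
print(gc_score("A" * 100, 60, 70))
```

A graded score like this gives the optimizer a direction to move in: a sequence at 55% GC scores better than one at 50%, even though neither is in range yet.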

python

from proto_language.core import Constraint
from proto_language.constraint import (
    gc_content_constraint,
    max_homopolymer_constraint,
)

# Soft constraint: GC content between 60-70%
# weight=1.0 means this score contributes directly to the energy
gc_constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 60, "max_gc": 70},
    weight=1.0,
    label="gc_content",
)

# Hard filter: reject any sequence with homopolymers > 4bp
# threshold=0.0 means the score must be exactly 0 (no violations) to pass
homopolymer_filter = Constraint(
    inputs=[segment],
    function=max_homopolymer_constraint,
    function_config={"max_length": 4},
    threshold=0.0,
    label="homopolymer",
)

Weights vs. thresholds: two modes of constraint evaluation

weight (soft): The constraint score is multiplied by the weight and added to the total energy. Higher weight = more importance. The optimizer tries to minimize total energy.
threshold (hard filter): Proposals with scores above the threshold are rejected outright. Use this for non-negotiable requirements. A constraint cannot have both weight and threshold.
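
The two modes described above can be sketched in a few lines of plain Python. This is a toy model, not Proto's Constraint class: ToyConstraint, evaluate, and the two scoring functions are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Optional


@dataclass
class ToyConstraint:
    score: Callable[[str], float]
    weight: Optional[float] = None      # soft mode
    threshold: Optional[float] = None   # hard-filter mode

    def __post_init__(self) -> None:
        if self.weight is not None and self.threshold is not None:
            raise ValueError("a constraint cannot have both weight and threshold")


def evaluate(seq: str, constraints: list) -> Optional[float]:
    """Return the total energy, or None if any hard filter rejects seq."""
    energy = 0.0
    for c in constraints:
        s = c.score(seq)
        if c.threshold is not None:
            if s > c.threshold:
                return None            # rejected outright
        elif c.weight is not None:
            energy += c.weight * s     # weighted contribution to the energy
    return energy


def gc_distance(seq: str) -> float:
    """Absolute distance of the GC fraction from 0.65."""
    return abs(sum(b in "GC" for b in seq) / len(seq) - 0.65)


def run_longer_than_4(seq: str) -> float:
    """1.0 if any homopolymer run exceeds 4 bases, else 0.0."""
    longest = max(len(list(run)) for _, run in groupby(seq))
    return 1.0 if longest > 4 else 0.0


constraints = [
    ToyConstraint(gc_distance, weight=2.0),
    ToyConstraint(run_longer_than_4, threshold=0.0),
]
print(evaluate("GC" * 10, constraints))   # passes the filter, energy from GC only
print(evaluate("AAAAAGC", constraints))   # rejected by the filter
```

The design point is that a filtered proposal never competes on energy at all, whereas a weighted constraint only makes a proposal less attractive.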

Configure the optimizer

The Optimizer searches sequence space to minimize total constraint scores. MCMC (Markov Chain Monte Carlo) is a general-purpose default; it iteratively proposes mutations, always accepts improvements, and accepts worse proposals with a probability that shrinks as the temperature falls.

python

from proto_language.optimizer import (
    MCMCOptimizer,
    MCMCOptimizerConfig,
)

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint, homopolymer_filter],
    config=MCMCOptimizerConfig(
        num_steps=1000,         # Run 1000 MCMC iterations
        num_results=5,          # Maintain 5 parallel trajectories
        proposals_per_result=10,  # Proposals per trajectory each step
        max_temperature=2.0,    # Start warm (more exploration)
        min_temperature=0.001,  # End cold (greedy refinement)
    ),
)

The MCMC optimizer uses simulated annealing. It starts at a high temperature (readily accepting worse solutions to escape local optima) and gradually cools (becoming nearly greedy). The num_results parameter runs multiple independent trajectories in parallel, increasing the chance of finding good solutions.
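
The annealing behavior can be illustrated with a generic Metropolis acceptance rule and a cooling schedule. This is a sketch under stated assumptions, not Proto's internals: the geometric schedule and the temperature and accept helpers are hypothetical, and the schedule MCMCOptimizer actually uses may differ.

```python
import math
import random


def temperature(step: int, num_steps: int, max_temperature: float, min_temperature: float) -> float:
    """Geometric interpolation from max_temperature (first step) to min_temperature (last step)."""
    if num_steps <= 1:
        return min_temperature
    frac = step / (num_steps - 1)
    return max_temperature * (min_temperature / max_temperature) ** frac


def accept(current_energy: float, proposed_energy: float, temp: float, rng: random.Random) -> bool:
    """Metropolis rule: always accept improvements, sometimes accept worse proposals."""
    if proposed_energy <= current_energy:
        return True
    return rng.random() < math.exp((current_energy - proposed_energy) / temp)


# Probability of accepting a proposal that is 0.1 worse, at the start and at the end
for step in (0, 999):
    t = temperature(step, 1000, max_temperature=2.0, min_temperature=0.001)
    p = math.exp(-0.1 / t)
    print(f"step {step}: temperature={t:.4f}, acceptance probability={p:.3g}")
```

At the starting temperature of 2.0 a proposal that is 0.1 worse is accepted about 95% of the time, while at 0.001 the probability is effectively zero, which is why the end of the run behaves greedily.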

Run the program

A Program orchestrates one or more optimizers. For this tutorial, we have a single stage. Call run() and retrieve results from the construct.

python

import itertools

from proto_language.core import Program

program = Program(optimizers=[optimizer], num_results=5)
program.run()

# Results are stored in the construct's result sequences
print(f"\nTop {len(construct.joined_sequences)} designed sequences:\n")
for i, seq in enumerate(construct.joined_sequences):
    # Verify GC content manually
    gc_count = sum(1 for nt in seq.sequence if nt in "GC")
    gc_pct = 100 * gc_count / len(seq.sequence)

    # Check homopolymers
    max_run = max(
        (sum(1 for _ in group) for _, group in itertools.groupby(seq.sequence)),
        default=0,
    )

    print(f"  [{i+1}] {seq.sequence[:50]}...")
    print(f"      GC: {gc_pct:.1f}% | Longest homopolymer: {max_run}bp")
    print()

Complete Runnable Script

Copy this entire block and run it:

python

from proto_language.core import (
    Segment, Construct, Constraint, Program,
)
from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig,
)
from proto_language.optimizer import (
    MCMCOptimizer, MCMCOptimizerConfig,
)
from proto_language.constraint import (
    gc_content_constraint, max_homopolymer_constraint,
)
from proto_tools.transforms.masking import MaskingStrategy

# 1. Define the sequence
segment = Segment(length=100, sequence_type="dna", label="my_sequence")
construct = Construct(segments=[segment])

# 2. Set up generator (3 random mutations per step)
generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3))
)
generator.assign(segment)

# 3. Define constraints
gc_constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 60, "max_gc": 70},
    weight=1.0,
    label="gc_content",
)
homopolymer_filter = Constraint(
    inputs=[segment],
    function=max_homopolymer_constraint,
    function_config={"max_length": 4},
    threshold=0.0,
    label="homopolymer",
)

# 4. Configure MCMC optimizer
optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint, homopolymer_filter],
    config=MCMCOptimizerConfig(
        num_steps=1000,
        num_results=5,
        proposals_per_result=10,
        max_temperature=2.0,
        min_temperature=0.001,
    ),
)

# 5. Run!
program = Program(optimizers=[optimizer], num_results=5)
program.run()

# 6. Inspect results
import itertools

for i, seq in enumerate(construct.joined_sequences):
    gc_count = sum(1 for nt in seq.sequence if nt in "GC")
    gc_pct = 100 * gc_count / len(seq.sequence)
    max_run = max(sum(1 for _ in g) for _, g in itertools.groupby(seq.sequence))

    print(f"Sequence {i+1}: {seq.sequence}")
    print(f"  GC content: {gc_pct:.1f}%  |  Longest homopolymer: {max_run}bp\n")

Variations

Tighter GC Range
Multi-Segment Construct
Multi-Stage Pipeline

Tighter GC Range: make GC content more precise by narrowing the target range and increasing optimization steps:

python

# Target roughly 65% GC (narrow 64-66% window)
gc_constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 64, "max_gc": 66},
    weight=2.0,  # Higher weight = more importance
    label="gc_content",
)

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint, homopolymer_filter],
    config=MCMCOptimizerConfig(
        num_steps=2000,   # More steps for tighter target
        num_results=10,       # More parallel trajectories
        proposals_per_result=10,  # Proposals per trajectory each step
    ),
)

Multi-Segment Construct: design a construct with a fixed promoter and a variable coding region:

python

# Fixed promoter (will not be mutated)
promoter = Segment(
    sequence="TTGACAATTAATCATCGAACTAGTTAACTAGTACGCAAGTTCACGTAA",
    sequence_type="dna",
    label="promoter",
)

# Variable coding region (this is what we optimize)
cds = Segment(length=300, sequence_type="dna", label="cds")

# Construct joins both
construct = Construct(segments=[promoter, cds])

# Only assign generator to the variable segment
gen = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=5))
)
gen.assign(cds)

# Constraint on the coding region only
gc = Constraint(
    inputs=[cds],
    function=gc_content_constraint,
    function_config={"min_gc": 45, "max_gc": 55},
    weight=1.0,
)
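
The idea that only the variable segment is mutated can be sketched without Proto. In the toy model below, ToySegment, its designable flag, and the helper functions are hypothetical; Proto expresses the same intent by assigning a generator only to the segment that should change.

```python
import random
from dataclasses import dataclass


@dataclass
class ToySegment:
    label: str
    sequence: str
    designable: bool  # False for fixed parts such as the promoter


def designable_positions(segments: list) -> list:
    """Indices in the joined sequence that belong to designable segments."""
    positions, offset = [], 0
    for seg in segments:
        if seg.designable:
            positions.extend(range(offset, offset + len(seg.sequence)))
        offset += len(seg.sequence)
    return positions


def mutate_joined(segments: list, num_mutations: int, rng: random.Random) -> str:
    """Join all segments, then mutate only positions inside designable segments."""
    joined = list("".join(seg.sequence for seg in segments))
    for pos in rng.sample(designable_positions(segments), num_mutations):
        joined[pos] = rng.choice([b for b in "ACGT" if b != joined[pos]])
    return "".join(joined)


rng = random.Random(0)
promoter = ToySegment("promoter", "TTGACAATTAATCATCGAACTAGTTAACTAGTACGCAAGTTCACGTAA", designable=False)
cds = ToySegment("cds", "".join(rng.choice("ACGT") for _ in range(300)), designable=True)

mutated = mutate_joined([promoter, cds], num_mutations=5, rng=rng)
print("promoter unchanged:", mutated.startswith(promoter.sequence))
```

However many steps are run, the promoter prefix of the joined sequence stays identical because no mutation position is ever drawn from it.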

Multi-Stage Pipeline: chain optimizers, running broad Rejection Sampling exploration first and then fine-tuning with MCMC. Both optimizers must reference the same construct objects. This mirrors examples/scripts/toy-multiple-optimizers.py.

python

from proto_language.optimizer import (
    RejectionSamplingOptimizer, RejectionSamplingOptimizerConfig,
    MCMCOptimizer, MCMCOptimizerConfig,
)

segment = Segment(length=100, sequence_type="dna", label="target")
construct = Construct(segments=[segment])

# Stage 1: Broad exploration with Rejection Sampling
gen1 = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=5))
)
gen1.assign(segment)
gc1 = Constraint(
    inputs=[segment], function=gc_content_constraint,
    function_config={"min_gc": 60, "max_gc": 70}, weight=1.0,
)
stage1 = RejectionSamplingOptimizer(
    constructs=[construct], generators=[gen1], constraints=[gc1],
    config=RejectionSamplingOptimizerConfig(num_samples=500, num_results=10),
)

# Stage 2: Fine-tune the best with MCMC
gen2 = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=1))
)
gen2.assign(segment)
gc2 = Constraint(
    inputs=[segment], function=gc_content_constraint,
    function_config={"min_gc": 60, "max_gc": 70}, weight=1.0,
)
stage2 = MCMCOptimizer(
    constructs=[construct], generators=[gen2], constraints=[gc2],
    config=MCMCOptimizerConfig(num_steps=500, num_results=5),
)

# Run both stages sequentially
program = Program(optimizers=[stage1, stage2], num_results=5)
program.run()

Important: Each optimizer in a program must have its own generator and constraint instances; do not share them between optimizers. All optimizers must, however, reference the same construct objects so that results pass between stages.
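
The hand-off between stages can be pictured with a small stand-alone sketch: a sampling stage keeps its best candidates, and a refinement stage starts from them. Everything here is an assumption made for illustration (the function names, the GC penalty, and the greedy refinement standing in for MCMC); it does not reproduce how RejectionSamplingOptimizer or MCMCOptimizer work internally.

```python
import random


def gc_penalty(seq: str, lo: float = 0.60, hi: float = 0.70) -> float:
    """Distance of the GC fraction from the [lo, hi] window (0.0 when inside)."""
    gc = sum(b in "GC" for b in seq) / len(seq)
    return max(lo - gc, 0.0, gc - hi)


def sampling_stage(num_samples: int, num_results: int, length: int, rng: random.Random) -> list:
    """Draw random sequences and keep the num_results best by penalty."""
    samples = ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(num_samples)]
    return sorted(samples, key=gc_penalty)[:num_results]


def refine_stage(seeds: list, num_steps: int, num_results: int, rng: random.Random) -> list:
    """Greedy single-base refinement of each seed (a zero-temperature stand-in for MCMC)."""
    refined = []
    for seq in seeds:
        for _ in range(num_steps):
            pos = rng.randrange(len(seq))
            candidate = seq[:pos] + rng.choice("ACGT") + seq[pos + 1:]
            if gc_penalty(candidate) <= gc_penalty(seq):
                seq = candidate
        refined.append(seq)
    return sorted(refined, key=gc_penalty)[:num_results]


rng = random.Random(0)
stage1_results = sampling_stage(num_samples=500, num_results=10, length=100, rng=rng)
stage2_results = refine_stage(stage1_results, num_steps=500, num_results=5, rng=rng)

print("best penalty after stage 1:", gc_penalty(stage1_results[0]))
print("best penalty after stage 2:", gc_penalty(stage2_results[0]))
```

In the sketch the shared state is simply the list passed from one function to the next; in Proto that role is played by the construct objects that every optimizer references.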

Key Concepts

Concept	What It Does	In This Tutorial
Segment	A contiguous sequence region to design	100bp DNA starting from random
Construct	Groups segments into a design unit	Single segment wrapper
Generator	Proposes mutations each iteration	3 random point mutations per step
Constraint	Scores sequence quality (0 = perfect)	GC content (weighted) + homopolymer (filter)
Optimizer	Searches for optimal sequences	MCMC with simulated annealing, 5 trajectories
Program	Orchestrates optimizer pipeline	Single-stage MCMC

Next Steps

Core Concepts

How segments, generators, constraints, and optimizers interact under the hood.

Symmetric Protein Design

Design proteins with structure prediction constraints using ESMFold, ESM2, and ProteinMPNN.

Available Constraints

Browse all 50+ built-in constraints: from GC content to protein folding to splice site prediction.

Worked Examples

Runnable example programs on GitHub: declarative specs in examples/jsons/ (start with toy.json) and Python pipelines in examples/scripts/ (toy.py, protein_hunter.py, toy-multiple-optimizers.py).