Quickstart

This tutorial builds a complete optimization pipeline that designs a 100bp DNA sequence optimized for two properties simultaneously:
  • GC content between 60-70% (higher than the typical ~50%, useful for thermostable organisms)
  • No homopolymer runs longer than 4bp (avoids synthesis errors and polymerase stalling)
It demonstrates the core Proto workflow end to end.
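
Both target properties can be checked in plain Python, independently of Proto. The sketch below is illustrative only: the helper names `gc_percent` and `longest_homopolymer` are hypothetical and are not part of the Proto API.

```python
from itertools import groupby


def gc_percent(seq: str) -> float:
    """Percentage of G or C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    gc = sum(1 for nt in seq.upper() if nt in ("G", "C"))
    return 100.0 * gc / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base (0 if seq is empty)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


example = "GGCCAAAATT"
print(gc_percent(example))           # 4 of the 10 bases are G or C
print(longest_homopolymer(example))  # the AAAA run is the longest
```

A sequence meets the tutorial's goals when `60 <= gc_percent(seq) <= 70` and `longest_homopolymer(seq) <= 4`.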

Overview

The result is a 100bp DNA sequence with verified properties:
Designed sequence (100bp):
GAAGCGGCCGTCACCGGCTGCGAGCTCGGAGTTACGATATGGACTTGCCGCGTGCTCCTCAGTCGGTACCGTCCTGTGGACGGAACGGCTGCTCGCGTCT

GC content: 65.0%
Longest homopolymer: 2bp
The exact sequence will differ from run to run because the optimization is stochastic, but the GC content and longest homopolymer should fall within the target ranges.

Prerequisites

Proto must be installed first, including the proto-tools submodule that the examples import.
This tutorial uses only CPU-based constraints. No GPU required.

Step-by-Step

1

Define the sequence

Every design starts with Segments and Constructs. A Segment is a contiguous region to be designed. A Construct groups one or more Segments into a single design unit.
python
from proto_language.core import Segment, Construct

# Create a 100bp DNA segment (random starting sequence)
segment = Segment(
    length=100,
    sequence_type="dna",
    label="my_sequence",
)

# Wrap it in a Construct
construct = Construct(segments=[segment])

total_length = sum(s.sequence_length for s in construct.segments)
print(f"Created: {total_length}bp {construct.sequence_type} construct")
Why Constructs? Real genetic designs often have multiple parts: a promoter, a coding sequence, a terminator. Each part is a Segment with its own generator and constraints. The Construct joins them for the optimizer.
2

Set up the generator

A Generator proposes new candidate sequences at each optimization step. RandomNucleotideGenerator introduces random point mutations; it is a baseline mutation generator for DNA/RNA sequence-level optimization.
python
from proto_language.generator import (
    RandomNucleotideGenerator,
    RandomNucleotideGeneratorConfig,
)
from proto_tools.transforms.masking import MaskingStrategy

# Mutate 3 random positions per step
generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3))
)

# Link the generator to our segment
generator.assign(segment)
Choosing num_mutations: Lower values (1-2) make small, conservative changes, good for fine-tuning. Higher values (5-10) make bigger jumps, good for escaping local optima. For a 100bp sequence, 3 mutations per step is a reasonable starting point.
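
To get a feel for what a step size of 3 means on a 100bp sequence, here is a standalone sketch of a point-mutation proposal. It is not Proto's implementation: `point_mutate` is a hypothetical helper, and it assumes every mutated position changes to a different base, which RandomNucleotideGenerator may or may not do.

```python
import random

NUCLEOTIDES = "ACGT"


def point_mutate(seq: str, num_mutations: int, rng: random.Random) -> str:
    """Return a copy of seq with num_mutations distinct positions changed."""
    positions = rng.sample(range(len(seq)), k=num_mutations)
    bases = list(seq)
    for pos in positions:
        # Assumption of this sketch: always pick a base different from the current one.
        bases[pos] = rng.choice([nt for nt in NUCLEOTIDES if nt != bases[pos]])
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
parent = "ACGT" * 25    # 100bp starting sequence
child = point_mutate(parent, num_mutations=3, rng=rng)
changed = sum(a != b for a, b in zip(parent, child))
print(f"{changed} of {len(parent)} positions changed")
```

With 3 mutations, each proposal differs from its parent at 3% of positions, so GC content can move by at most 3 percentage points per step.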
3

Define constraints

Constraints score how well each proposed sequence meets a requirement. By convention, a constraint returns a score between 0.0 (perfect) and 1.0 (worst), and the optimizer minimizes these scores.
python
from proto_language.core import Constraint
from proto_language.constraint import (
    gc_content_constraint,
    max_homopolymer_constraint,
)

# Soft constraint: GC content between 60-70%
# weight=1.0 means this score contributes directly to the energy
gc_constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 60, "max_gc": 70},
    weight=1.0,
    label="gc_content",
)

# Hard filter: reject any sequence with homopolymers > 4bp
# threshold=0.0 means the score must be exactly 0 (no violations) to pass
homopolymer_filter = Constraint(
    inputs=[segment],
    function=max_homopolymer_constraint,
    function_config={"max_length": 4},
    threshold=0.0,
    label="homopolymer",
)
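
To make the 0.0-to-1.0 scoring convention concrete, the sketch below shows one way a GC-window score could be computed. This is an assumption for illustration: `gc_window_score` is a hypothetical function, and the built-in gc_content_constraint may shape its penalty differently.

```python
def gc_window_score(seq: str, min_gc: float, max_gc: float) -> float:
    """Score a non-empty sequence: 0.0 inside [min_gc, max_gc], rising linearly to 1.0."""
    gc_pct = 100.0 * sum(1 for nt in seq.upper() if nt in ("G", "C")) / len(seq)
    if gc_pct < min_gc:
        return (min_gc - gc_pct) / min_gc            # reaches 1.0 at 0% GC
    if gc_pct > max_gc:
        return (gc_pct - max_gc) / (100.0 - max_gc)  # reaches 1.0 at 100% GC
    return 0.0


in_window = "GC" * 30 + "AT" * 20   # 60% GC
too_low = "GC" * 15 + "AT" * 35     # 30% GC
print(gc_window_score(in_window, 60, 70))  # inside the window, so no penalty
print(gc_window_score(too_low, 60, 70))    # halfway between the window edge and 0% GC
```

A graded score like this gives the optimizer a direction to move in, which a pass/fail check would not.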
Weights vs. thresholds: two modes of constraint evaluation
  • weight (soft): The constraint score is multiplied by the weight and added to the total energy. Higher weight = more importance. The optimizer tries to minimize total energy.
  • threshold (hard filter): Proposals with scores above the threshold are rejected outright. Use this for non-negotiable requirements. A constraint cannot have both weight and threshold.
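
The two modes can be sketched as a small evaluation routine. This is a simplified model of the behavior described above, not Proto's code: `Scored` and `total_energy` are hypothetical names, and the sketch simply ignores entries that set neither a weight nor a threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scored:
    """One constraint's score for a proposal, plus how that score is used."""

    label: str
    score: float
    weight: Optional[float] = None     # soft: contributes weight * score to the energy
    threshold: Optional[float] = None  # hard: proposal is rejected if score > threshold

    def __post_init__(self) -> None:
        if self.weight is not None and self.threshold is not None:
            raise ValueError(f"{self.label}: weight and threshold are mutually exclusive")


def total_energy(results: list[Scored]) -> Optional[float]:
    """Return the weighted energy, or None if any hard filter rejects the proposal."""
    energy = 0.0
    for r in results:
        if r.threshold is not None:
            if r.score > r.threshold:
                return None  # rejected outright, regardless of the soft scores
        elif r.weight is not None:
            energy += r.weight * r.score
    return energy


passing = [Scored("gc_content", 0.25, weight=1.0), Scored("homopolymer", 0.0, threshold=0.0)]
rejected = [Scored("gc_content", 0.0, weight=1.0), Scored("homopolymer", 0.2, threshold=0.0)]
print(total_energy(passing))   # only the weighted GC score contributes
print(total_energy(rejected))  # the homopolymer filter fails, so there is no energy
```

Note that the second proposal is rejected even though its GC score is perfect: a hard filter is checked independently of the weighted terms.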
4

Configure the optimizer

The Optimizer searches sequence space to minimize total constraint scores. MCMC (Markov Chain Monte Carlo) is a general-purpose default: it iteratively proposes mutations, always accepts improvements, and accepts some worse proposals early on while the temperature is high.
python
from proto_language.optimizer import (
    MCMCOptimizer,
    MCMCOptimizerConfig,
)

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint, homopolymer_filter],
    config=MCMCOptimizerConfig(
        num_steps=1000,         # Run 1000 MCMC iterations
        num_results=5,          # Maintain 5 parallel trajectories
        proposals_per_result=10,  # Proposals per trajectory each step
        max_temperature=2.0,    # Start warm (more exploration)
        min_temperature=0.001,  # End cold (greedy refinement)
    ),
)
MCMC uses simulated annealing. It starts at high temperature (accepting worse solutions to escape local optima) and gradually cools (becoming greedy). The num_results parameter runs multiple independent trajectories in parallel, increasing the chance of finding good solutions.
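
The effect of temperature can be illustrated with the textbook Metropolis acceptance rule and a geometric cooling schedule. Both are assumptions here: this page does not specify which schedule or acceptance rule MCMCOptimizer uses, and the function names below are hypothetical.

```python
import math


def temperature(step: int, num_steps: int, t_max: float, t_min: float) -> float:
    """Geometric cooling from t_max at step 0 down to t_min at the final step."""
    if num_steps <= 1:
        return t_min
    frac = step / (num_steps - 1)
    return t_max * (t_min / t_max) ** frac


def accept_probability(current: float, proposed: float, temp: float) -> float:
    """Metropolis rule: always accept improvements, sometimes accept worse energies."""
    if proposed <= current:
        return 1.0
    return math.exp(-(proposed - current) / temp)


hot = temperature(0, 1000, t_max=2.0, t_min=0.001)
cold = temperature(999, 1000, t_max=2.0, t_min=0.001)

# The same slightly worse proposal (energy 0.5 -> 0.6) at both ends of the run:
print(f"hot  T={hot:.3f}: accept with p={accept_probability(0.5, 0.6, hot):.3f}")
print(f"cold T={cold:.3f}: accept with p={accept_probability(0.5, 0.6, cold):.3g}")
```

At the starting temperature the worse proposal is accepted about 95% of the time. At the final temperature the probability is effectively zero, which is what makes the end of the run greedy.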
5

Run the program

A Program orchestrates one or more optimizers. For this tutorial, we have a single stage. Call run() and retrieve results from the construct.
python
import itertools

from proto_language.core import Program

program = Program(optimizers=[optimizer], num_results=5)
program.run()

# Results are stored in the construct's result sequences
print(f"\nTop {len(construct.joined_sequences)} designed sequences:\n")
for i, seq in enumerate(construct.joined_sequences):
    # Verify GC content manually
    gc_count = sum(1 for nt in seq.sequence if nt in "GC")
    gc_pct = 100 * gc_count / len(seq.sequence)

    # Check homopolymers
    max_run = max(
        (sum(1 for _ in group) for _, group in itertools.groupby(seq.sequence)),
        default=0,
    )

    print(f"  [{i+1}] {seq.sequence[:50]}...")
    print(f"      GC: {gc_pct:.1f}% | Longest homopolymer: {max_run}bp")
    print()

Complete Runnable Script

Copy this entire block and run it:
python
from proto_language.core import (
    Segment, Construct, Constraint, Program,
)
from proto_language.generator import (
    RandomNucleotideGenerator, RandomNucleotideGeneratorConfig,
)
from proto_language.optimizer import (
    MCMCOptimizer, MCMCOptimizerConfig,
)
from proto_language.constraint import (
    gc_content_constraint, max_homopolymer_constraint,
)
from proto_tools.transforms.masking import MaskingStrategy

# 1. Define the sequence
segment = Segment(length=100, sequence_type="dna", label="my_sequence")
construct = Construct(segments=[segment])

# 2. Set up generator (3 random mutations per step)
generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3))
)
generator.assign(segment)

# 3. Define constraints
gc_constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 60, "max_gc": 70},
    weight=1.0,
    label="gc_content",
)
homopolymer_filter = Constraint(
    inputs=[segment],
    function=max_homopolymer_constraint,
    function_config={"max_length": 4},
    threshold=0.0,
    label="homopolymer",
)

# 4. Configure MCMC optimizer
optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint, homopolymer_filter],
    config=MCMCOptimizerConfig(
        num_steps=1000,
        num_results=5,
        proposals_per_result=10,
        max_temperature=2.0,
        min_temperature=0.001,
    ),
)

# 5. Run!
program = Program(optimizers=[optimizer], num_results=5)
program.run()

# 6. Inspect results
import itertools

for i, seq in enumerate(construct.joined_sequences):
    gc_count = sum(1 for nt in seq.sequence if nt in "GC")
    gc_pct = 100 * gc_count / len(seq.sequence)
    max_run = max(
        (sum(1 for _ in g) for _, g in itertools.groupby(seq.sequence)),
        default=0,  # matches the step 5 snippet; avoids an error on an empty sequence
    )

    print(f"Sequence {i+1}: {seq.sequence}")
    print(f"  GC content: {gc_pct:.1f}%  |  Longest homopolymer: {max_run}bp\n")

Variations

Make GC content more precise by narrowing the target range and increasing optimization steps:
python
# Target about 65% GC (narrow 64-66% window)
gc_constraint = Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 64, "max_gc": 66},
    weight=2.0,  # Higher weight = more importance
    label="gc_content",
)

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc_constraint, homopolymer_filter],
    config=MCMCOptimizerConfig(
        num_steps=2000,   # More steps for tighter target
        num_results=10,       # More parallel trajectories
        proposals_per_result=10,  # Proposals per trajectory each step
    ),
)
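
A quick calculation shows why the narrower window benefits from more steps. The sketch assumes the window is inclusive at both ends, which this page does not state; `allowed_gc_counts` is a hypothetical helper.

```python
def allowed_gc_counts(length: int, min_gc: int, max_gc: int) -> list[int]:
    """G/C counts whose percentage of `length` lies within [min_gc, max_gc]."""
    # Integer arithmetic avoids floating-point edge cases at the window boundaries.
    return [n for n in range(length + 1) if min_gc * length <= 100 * n <= max_gc * length]


wide = allowed_gc_counts(100, 60, 70)
narrow = allowed_gc_counts(100, 64, 66)
print(len(wide), len(narrow))  # 11 3
```

For a 100bp sequence, the 60-70% window admits 11 distinct G/C counts, while the 64-66% window admits only 3, so the optimizer has a much smaller target to hit.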

Key Concepts

| Concept | What It Does | In This Tutorial |
| --- | --- | --- |
| Segment | A contiguous sequence region to design | 100bp DNA starting from random |
| Construct | Groups segments into a design unit | Single segment wrapper |
| Generator | Proposes mutations each iteration | 3 random point mutations per step |
| Constraint | Scores sequence quality (0 = perfect) | GC content (weighted) + homopolymer (filter) |
| Optimizer | Searches for optimal sequences | MCMC with simulated annealing, 5 trajectories |
| Program | Orchestrates optimizer pipeline | Single-stage MCMC |

Next Steps

Core Concepts

How segments, generators, constraints, and optimizers interact under the hood.

Symmetric Protein Design

Design proteins with structure prediction constraints using ESMFold, ESM2, and ProteinMPNN.

Available Constraints

Browse all 50+ built-in constraints: from GC content to protein folding to splice site prediction.

Worked Examples

Runnable example programs on GitHub: declarative specs in examples/jsons/ (start with toy.json) and Python pipelines in examples/scripts/ (toy.py, protein_hunter.py, toy-multiple-optimizers.py).