Every design starts with Segments and Constructs. A Segment is a contiguous region to be designed. A Construct groups one or more Segments into a single design unit.
python
from proto_language.core import Segment, Construct# Create a 100bp DNA segment (random starting sequence)segment = Segment( length=100, sequence_type="dna", label="my_sequence",)# Wrap it in a Constructconstruct = Construct(segments=[segment])total_length = sum(s.sequence_length for s in construct.segments)print(f"Created: {total_length}bp {construct.sequence_type} construct")
Why Constructs? Real genetic designs often have multiple parts: a promoter, a coding sequence, a terminator. Each part is a Segment with its own generator and constraints. The Construct joins them for the optimizer.
2
Set up the generator
A Generator proposes new candidate sequences at each optimization step. RandomNucleotideGenerator introduces random point mutations; it is a baseline mutation generator for DNA/RNA sequence-level optimization.
python
from proto_language.generator import ( RandomNucleotideGenerator, RandomNucleotideGeneratorConfig,)from proto_tools.transforms.masking import MaskingStrategy# Mutate 3 random positions per stepgenerator = RandomNucleotideGenerator( RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=3)))# Link the generator to our segmentgenerator.assign(segment)
Choosing num_mutations: Lower values (1-2) make small, conservative changes, good for fine-tuning. Higher values (5-10) make bigger jumps, good for escaping local optima. For a 100bp sequence, 3 mutations per step is a reasonable starting point.
3
Define constraints
Constraints score how well each proposal sequence meets a requirement. By convention a constraint returns a score between 0.0 (perfect) and 1.0 (worst), and the optimizer minimizes these scores.
python
from proto_language.core import Constraintfrom proto_language.constraint import ( gc_content_constraint, max_homopolymer_constraint,)# Soft constraint: GC content between 60-70%# weight=1.0 means this score contributes directly to the energygc_constraint = Constraint( inputs=[segment], function=gc_content_constraint, function_config={"min_gc": 60, "max_gc": 70}, weight=1.0, label="gc_content",)# Hard filter: reject any sequence with homopolymers > 4bp# threshold=0.0 means the score must be exactly 0 (no violations) to passhomopolymer_filter = Constraint( inputs=[segment], function=max_homopolymer_constraint, function_config={"max_length": 4}, threshold=0.0, label="homopolymer",)
Weights vs. thresholds: two modes of constraint evaluation
weight (soft): The constraint score is multiplied by the weight and added to the total energy. Higher weight = more importance. The optimizer tries to minimize total energy.
threshold (hard filter): Proposals with scores above the threshold are rejected outright. Use this for non-negotiable requirements. A constraint cannot have both weight and threshold.
4
Configure the optimizer
The Optimizer searches sequence space to minimize total constraint scores. MCMC (Markov Chain Monte Carlo) is a general-purpose default; it iteratively proposes mutations and accepts improvements.
python
from proto_language.optimizer import ( MCMCOptimizer, MCMCOptimizerConfig,)optimizer = MCMCOptimizer( constructs=[construct], generators=[generator], constraints=[gc_constraint, homopolymer_filter], config=MCMCOptimizerConfig( num_steps=1000, # Run 1000 MCMC iterations num_results=5, # Maintain 5 parallel trajectories proposals_per_result=10, # Proposals per trajectory each step max_temperature=2.0, # Start warm (more exploration) min_temperature=0.001, # End cold (greedy refinement) ),)
MCMC uses simulated annealing. It starts at high temperature (accepting worse solutions to escape local optima) and gradually cools (becoming greedy). The num_results parameter runs multiple independent trajectories in parallel, increasing the chance of finding good solutions.
5
Run the program
A Program orchestrates one or more optimizers. For this tutorial, we have a single stage. Call run() and retrieve results from the construct.
python
import itertoolsfrom proto_language.core import Programprogram = Program(optimizers=[optimizer], num_results=5)program.run()# Results are stored in the construct's result sequencesprint(f"\nTop {len(construct.joined_sequences)} designed sequences:\n")for i, seq in enumerate(construct.joined_sequences): # Verify GC content manually gc_count = sum(1 for nt in seq.sequence if nt in "GC") gc_pct = 100 * gc_count / len(seq.sequence) # Check homopolymers max_run = max( (sum(1 for _ in group) for _, group in itertools.groupby(seq.sequence)), default=0, ) print(f" [{i+1}] {seq.sequence[:50]}...") print(f" GC: {gc_pct:.1f}% | Longest homopolymer: {max_run}bp") print()
Make GC content more precise by narrowing the target range and increasing optimization steps:
python
# Target exactly 65% GC (narrow 64-66% window)gc_constraint = Constraint( inputs=[segment], function=gc_content_constraint, function_config={"min_gc": 64, "max_gc": 66}, weight=2.0, # Higher weight = more importance label="gc_content",)optimizer = MCMCOptimizer( constructs=[construct], generators=[generator], constraints=[gc_constraint, homopolymer_filter], config=MCMCOptimizerConfig( num_steps=2000, # More steps for tighter target num_results=10, # More parallel trajectories proposals_per_result=10, # Proposals per trajectory each step ),)
Design a construct with a fixed promoter and a variable coding region:
python
# Fixed promoter (will not be mutated)promoter = Segment( sequence="TTGACAATTAATCATCGAACTAGTTAACTAGTACGCAAGTTCACGTAA", sequence_type="dna", label="promoter",)# Variable coding region (this is what we optimize)cds = Segment(length=300, sequence_type="dna", label="cds")# Construct joins bothconstruct = Construct(segments=[promoter, cds])# Only assign generator to the variable segmentgen = RandomNucleotideGenerator( RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=5)))gen.assign(cds)# Constraint on the coding region onlygc = Constraint( inputs=[cds], function=gc_content_constraint, function_config={"min_gc": 45, "max_gc": 55}, weight=1.0,)
Chain optimizers: broad Rejection Sampling exploration first, then fine-tuning with MCMC. Both optimizers must reference the same construct objects. This mirrors examples/scripts/toy-multiple-optimizers.py.
Important: Each optimizer in a program must have its own generator and constraint instances; do not share them. They must all reference the same construct objects so results pass between stages.
How segments, generators, constraints, and optimizers interact under the hood.
Symmetric Protein Design
Design proteins with structure prediction constraints using ESMFold, ESM2, and ProteinMPNN.
Available Constraints
Browse all 50+ built-in constraints: from GC content to protein folding to splice site prediction.
Worked Examples
Runnable example programs on GitHub: declarative specs in examples/jsons/ (start with toy.json) and Python pipelines in examples/scripts/ (toy.py, protein_hunter.py, toy-multiple-optimizers.py).