Multi-Stage DNA Optimization

A program may run several optimizers in sequence on the same construct, so the result of one stage becomes the starting point of the next. This program optimizes a 20 bp DNA insert in two stages: rejection sampling first makes large moves to enrich GC content into a broad 70-100% window, then MCMC makes small moves to refine it into a tight 80-90% window.

The shared construct

A Segment is the stretch of sequence being designed and a Construct groups the segments that make up one molecule. Here a single 20 bp DNA segment (length=20, sequence_type="dna", label="insert") starts empty, leaving all positions open for the optimizers to fill. Both stages reference these same insert and construct objects by identity, so the sequences a stage accepts carry over as the starting point for the next.

python

from proto_language.core import Segment, Construct

insert = Segment(length=20, sequence_type="dna", label="insert")
construct = Construct([insert])

Stage 1: rejection sampling

RejectionSamplingOptimizer draws independent proposals and keeps the ones with the lowest energy; each proposal batch starts fresh, with no state carried between draws. gen1 is a RandomNucleotideGenerator whose MaskingStrategy(num_mutations=10) mutates exactly ten of the twenty positions per call, and because the segment starts empty, its first call fills in the initial random sequence. The single gc_enrich constraint (gc_content_constraint with min_gc=70, max_gc=100) scores 0 when GC content falls inside that broad window and penalizes deviation below it. The config draws num_samples=10 proposals in total and retains the top num_results=3 by lowest energy, which are handed off as stage two's starting pool.

python

from proto_tools.transforms.masking import MaskingStrategy
from proto_language.core import Constraint
from proto_language.constraint import gc_content_constraint
from proto_language.generator import RandomNucleotideGenerator, RandomNucleotideGeneratorConfig
from proto_language.optimizer import RejectionSamplingOptimizer, RejectionSamplingOptimizerConfig

gen1 = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=10))
)
gen1.assign(insert)

stage1 = RejectionSamplingOptimizer(
    constructs=[construct],
    generators=[gen1],
    constraints=[Constraint(inputs=[insert], function=gc_content_constraint,
                            function_config={"min_gc": 70, "max_gc": 100}, label="gc_enrich")],
    config=RejectionSamplingOptimizerConfig(num_samples=10, num_results=3),
)
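
To make the selection step concrete, here is a minimal, library-free sketch of rejection sampling against a GC window. It is not how RejectionSamplingOptimizer is implemented: the helper names (gc_percent, window_energy, rejection_sample) are hypothetical, the linear penalty outside the window is an assumption, and proposals are drawn as fully random sequences rather than through a masking strategy.

```python
import random


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def window_energy(seq: str, min_gc: float, max_gc: float) -> float:
    """0 inside [min_gc, max_gc]; otherwise the distance to the nearest edge.

    The linear penalty is an assumption made for this sketch.
    """
    gc = gc_percent(seq)
    if gc < min_gc:
        return min_gc - gc
    if gc > max_gc:
        return gc - max_gc
    return 0.0


def rejection_sample(length, num_samples, num_results, min_gc, max_gc, rng):
    """Draw independent random sequences and keep the lowest-energy ones."""
    proposals = [
        "".join(rng.choice("ACGT") for _ in range(length))
        for _ in range(num_samples)
    ]
    proposals.sort(key=lambda s: window_energy(s, min_gc, max_gc))
    return proposals[:num_results]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pool = rejection_sample(20, num_samples=10, num_results=3,
                        min_gc=70, max_gc=100, rng=rng)
for seq in pool:
    print(seq, gc_percent(seq), window_energy(seq, 70, 100))
```

The retained pool is sorted by energy, so its first entry is the candidate closest to (or inside) the 70-100% window.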

Stage 2: MCMC refinement

MCMCOptimizer runs Metropolis-Hastings with simulated annealing: at each step it mutates the current sequence, scores the proposals, and accepts improvements outright while accepting worse proposals with probability exp(-dE / T), where dE is the increase in energy and T is the temperature, which anneals down from max_temperature. Here gen2 uses MaskingStrategy(num_mutations=1), so each move changes a single base, and the gc_refine constraint (min_gc=80, max_gc=90) rewards the tighter window. The config runs one trajectory (num_results=1) for num_steps=10 steps, drawing proposals_per_result=20 proposals per step and keeping the best by energy before the accept/reject decision, with max_temperature=2.0 as the starting temperature.

python

from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig

gen2 = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=1))
)
gen2.assign(insert)

stage2 = MCMCOptimizer(
    constructs=[construct],
    generators=[gen2],
    constraints=[Constraint(inputs=[insert], function=gc_content_constraint,
                            function_config={"min_gc": 80, "max_gc": 90}, label="gc_refine")],
    config=MCMCOptimizerConfig(num_results=1, proposals_per_result=20, num_steps=10, max_temperature=2.0),
)
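
The accept/reject rule described above can be sketched in plain Python. This is an illustration, not the MCMCOptimizer implementation: the helper names are hypothetical, the linear cooling schedule is an assumption (the source only says the temperature anneals down from max_temperature), and the energy is an assumed linear distance to the GC window in percentage points.

```python
import math
import random


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def window_energy(seq: str, min_gc: float, max_gc: float) -> float:
    """0 inside the window, else distance to the nearest edge (assumed penalty)."""
    gc = gc_percent(seq)
    return max(min_gc - gc, gc - max_gc, 0.0)


def metropolis_accept(d_energy: float, temperature: float, rng) -> bool:
    """Accept improvements outright; accept worse moves with prob exp(-dE / T)."""
    if d_energy <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-d_energy / temperature)


def mutate_one(seq: str, rng) -> str:
    """Change exactly one position to a different base."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]


def anneal(start, num_steps, proposals_per_step, max_temperature,
           min_gc, max_gc, rng):
    current = start
    current_e = window_energy(current, min_gc, max_gc)
    for step in range(num_steps):
        # Assumed linear cooling; the library's real schedule may differ.
        temperature = max_temperature * (1 - step / num_steps)
        candidates = [mutate_one(current, rng) for _ in range(proposals_per_step)]
        best = min(candidates, key=lambda s: window_energy(s, min_gc, max_gc))
        best_e = window_energy(best, min_gc, max_gc)
        if metropolis_accept(best_e - current_e, temperature, rng):
            current, current_e = best, best_e
    return current


rng = random.Random(0)
start = "GCGCGCGCGCGCGCATATAT"  # 14 of 20 bases are G or C, i.e. 70% GC
result = anneal(start, num_steps=10, proposals_per_step=20,
                max_temperature=2.0, min_gc=80, max_gc=90, rng=rng)
print(result, gc_percent(result))
```

Under this assumed penalty, a single-base change on a 20 bp insert shifts GC content by 5 percentage points, so a move that worsens the energy by 5 at T = 2.0 is accepted with probability exp(-2.5), roughly 0.08.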

Run both stages

The Program runs its optimizers in the order listed: stage1, then stage2. Because both share the same construct, the three high-GC candidates that rejection sampling retains seed the MCMC trajectory, so the final design reflects both passes: enriched by the first, refined by the second.

python

from proto_language.core import Program

program = Program(optimizers=[stage1, stage2], num_results=1)
program.run()

Inspect the result

The final design is the construct’s joined sequence. Per-segment results live under metadata["segments"][<label>], and each constraint’s diagnostics sit under the constraints entry keyed by the label set above. Reading gc_content back out from the stage-two gc_refine entry confirms the design lands inside the 80-90% GC window.

python

best = program.constructs[0].joined_sequences[0]
gc_pct = best.metadata["segments"]["insert"]["constraints"]["gc_refine"]["data"]["gc_content"]
print(f"final sequence: {best.sequence}")
print(f"GC content:     {gc_pct:.1f}%")

final sequence: GGTCCGCCGCGGTGCACCCG
GC content:     85.0%
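
As an independent check that does not rely on the library's metadata, the GC percentage of the printed sequence can be recomputed by counting bases. The helper name gc_percent is hypothetical.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


final_sequence = "GGTCCGCCGCGGTGCACCCG"  # sequence printed in the output above
gc = gc_percent(final_sequence)
print(f"{gc:.1f}%")  # 17 of 20 bases are G or C -> 85.0%
```

The result, 85.0%, falls inside the 80-90% window targeted by gc_refine.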

Next Steps

Using Optimizers

DNA Sequence Optimization