DNA Sequence Optimization

This is the smallest complete proto-language program. It drives a 100 bp synthetic DNA insert toward a balanced 40-60% GC content with no long single-base runs, using Markov chain Monte Carlo. Every program is assembled from the same six pieces, introduced in turn below: a Segment, a Construct, a Generator, one or more Constraint objects, an Optimizer, and a Program.

Define the construct

A Segment is the stretch of sequence being designed; a Construct groups the segments that make up one molecule. Here there is a single segment: passing length=100 with no starting sequence leaves all 100 positions open for the optimizer to fill, sequence_type="dna" restricts them to A, C, G, and T, and label="insert" is the name its results are filed under. The Construct wraps that one segment.

python

from proto_language.core import Segment, Construct

# One variable 100 bp DNA segment.
insert = Segment(length=100, sequence_type="dna", label="insert")
construct = Construct([insert])

Assign a generator

The generator proposes new sequences for the optimizer to score. RandomNucleotideGenerator substitutes random bases at masked positions, and because the segment starts empty it also fills in the initial random sequence. generator.assign(insert) binds the generator to the segment it will mutate.

python

from proto_language.generator import RandomNucleotideGenerator, RandomNucleotideGeneratorConfig

generator = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig())
generator.assign(insert)
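
To make the proposal step concrete, here is a minimal plain-Python sketch of random substitution at masked positions. It does not call proto_language: the propose helper and its mask argument are hypothetical names chosen for illustration, and the library's actual masking and sampling behavior may differ.

```python
import random

DNA_ALPHABET = "ACGT"


def propose(sequence: str, mask: list[int], rng: random.Random) -> str:
    """Return a copy of `sequence` with one masked position resampled.

    Hypothetical helper for illustration; not part of proto_language.
    """
    bases = list(sequence)
    position = rng.choice(mask)                  # pick one mutable position
    bases[position] = rng.choice(DNA_ALPHABET)   # may redraw the same base
    return "".join(bases)


rng = random.Random(0)
# An empty segment is filled by drawing every position at random.
start = "".join(rng.choice(DNA_ALPHABET) for _ in range(100))
# With every position masked, any of the 100 bases can be substituted.
candidate = propose(start, mask=list(range(100)), rng=rng)
differences = sum(a != b for a, b in zip(start, candidate))
print(len(candidate), differences)
```

Each call changes at most one base, so consecutive proposals stay close to the current sequence.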

Add constraints

Constraints score how well a sequence meets the design objective; the optimizer searches for sequences that raise these scores. Two built-ins suffice here. gc_content_constraint with min_gc=40 and max_gc=60 rewards sequences whose GC content falls in that range, and max_homopolymer_constraint with max_length=5 penalizes any single-base run longer than five bases. Each Constraint lists the segment it reads in inputs and carries a label, which is the key its scores appear under in the result metadata.

python

from proto_language.core import Constraint
from proto_language.constraint import gc_content_constraint, max_homopolymer_constraint

gc = Constraint(
    inputs=[insert],
    function=gc_content_constraint,
    function_config={"min_gc": 40, "max_gc": 60},
    label="gc_content",
)
no_homopolymers = Constraint(
    inputs=[insert],
    function=max_homopolymer_constraint,
    function_config={"max_length": 5},
    label="no_homopolymers",
)
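
As a rough illustration of the two quantities these constraints read, the sketch below computes GC percentage and the longest single-base run in plain Python. The function names gc_percent and longest_run are assumptions made for this example; how the built-in constraints turn these measurements into scores is not shown here.

```python
from itertools import groupby


def gc_percent(sequence: str) -> float:
    """Percentage of bases that are G or C (sequence must be non-empty)."""
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


def longest_run(sequence: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max(len(list(group)) for _, group in groupby(sequence))


example = "ATGGGGGGCA"
print(gc_percent(example))   # 7 of the 10 bases are G or C
print(longest_run(example))  # the run of six G's is the longest
```

Under the settings above, this toy sequence would miss both targets: its GC content is above 60%, and its run of six G's exceeds max_length=5.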

Configure and run the optimizer

The MCMCOptimizer ties the construct, generator, and constraints together and runs Metropolis-Hastings. At each step it generates proposals, scores them, and accepts or rejects each one: improvements are always kept, and worse proposals are accepted with a probability that falls as the temperature anneals. The config runs num_steps=100 steps along a single trajectory (num_results=1, proposals_per_result=1), with max_temperature=1.0 as the starting temperature. The optional custom_logging callable receives the step number and current outputs; here it prints the GC content and current sequence every 20 steps. The Program runs the optimizer and collects the result.

python

from proto_language.core import Program
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig


def log_progress(step: int, outputs) -> None:
    """Print the GC content and current sequence every 20 steps."""
    if step % 20 != 0:
        return
    seq = outputs[0].proposal_sequences[0]
    # Diagnostics are keyed by the constraint label ("gc_content") set above.
    data = seq.metadata.get("constraints", {}).get("gc_content", {}).get("data", {})
    gc_pct = data.get("gc_content")
    tail = f"GC {gc_pct:5.1f}%" if gc_pct is not None else ""
    print(f"step {step:3d} | {tail} | {seq.sequence}")


optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc, no_homopolymers],
    config=MCMCOptimizerConfig(
        num_results=1,
        proposals_per_result=1,
        num_steps=100,
        max_temperature=1.0,
    ),
    custom_logging=log_progress,
)

program = Program(optimizers=[optimizer], num_results=1)
program.run()

step  20 | GC  54.0% | TCAGATTTAGATTGGGAGCTACCTTCTGCGCTGAGATCTGGGCGGACGTACTATGGTGGGCGTGCAGCGCGGGACTGGGAGGTTAGAAACTTATGAAGGA
step  40 | GC  50.0% | TGCATAGTGATGCGACCGTACGTGTGTATCCAATTGGTAAGCTCCAGAAGACGCTGCGAGTCGAGAGGCGCTCTACGTTTCTAAAAGTACACACACTACC
step  60 | GC  53.0% | CAGAAGATGGAGACCTATTCTTTGGAGCCAGGGTCGGCATATGCACGGGTCGGGCATTAGAGGTAAAGTGGGAATTTGGCCAGCTGTACTCGTCACAGCA
step  80 | GC  50.0% | ATCTGAGAGCGGTCACCCAGAAATCTATGATGATGTCTCTTTGCGAATTACTACCTCCTAGACGCTCCCGCCCTATGCCCTCATGGCGTTTTCATCACTC
step 100 | GC  51.0% | GCTTGTAGGCCGAAACTCAGAGGGTTTTCTCGCTATAGTCCCTGGACGCTGTAGGTTAATAGCGATAGGAAAGGGGGCAGCTGGGCTAATGCGAATGAAA
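
The accept-or-reject rule can be sketched independently of the library. The example below uses the textbook Metropolis criterion for symmetric proposals with a linear cooling schedule. Both the schedule and the assumption that a higher score is better are illustrative choices: the exact annealing schedule and score scaling used by MCMCOptimizer are not described in this tutorial.

```python
import math
import random


def temperature(step: int, num_steps: int, max_temperature: float) -> float:
    """Linear cooling from max_temperature toward zero (assumed schedule)."""
    return max_temperature * (1.0 - step / num_steps)


def accept(current: float, proposed: float, temp: float, rng: random.Random) -> bool:
    """Metropolis rule for a score where higher is better."""
    if proposed >= current:
        return True       # improvements are always kept
    if temp <= 0.0:
        return False      # fully cooled: behaves like greedy search
    # Worse proposals pass with probability exp(score change / temperature).
    return rng.random() < math.exp((proposed - current) / temp)


rng = random.Random(0)
for step in (0, 50, 90):
    temp = temperature(step, num_steps=100, max_temperature=1.0)
    p_worse = math.exp(-0.2 / temp)  # chance of accepting a 0.2 score drop
    print(f"step {step:3d} | T {temp:.2f} | P(accept) {p_worse:.3f}")
```

In this sketch the chance of accepting a proposal that scores 0.2 lower falls from about 0.82 at a temperature of 1.0 to about 0.14 at a temperature of 0.1, which is why early steps explore freely and late steps mostly keep improvements.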

Inspect the result

The final design is the construct’s joined sequence. Per-segment results live under metadata["segments"][<label>], and each constraint’s diagnostics sit under the constraints entry keyed by the label set above. Reading gc_content and max_homopolymer_length back out confirms the design lands inside the 40-60% GC window with no run longer than five bases.

python

best = program.constructs[0].joined_sequences[0]
scores = best.metadata["segments"]["insert"]["constraints"]

print(f"final sequence: {best.sequence}")
print(f"GC content:     {scores['gc_content']['data']['gc_content']:.1f}%")
print(f"longest run:    {scores['no_homopolymers']['data']['max_homopolymer_length']} bp")

final sequence: GCTTGTAGGCCGAAACTCAGAGGGTTTTCTCGCTATAGTCCCTGGACGCTGTAGGTTAATAGCGATAGGAAAGGGGGCAGCTGGGCTAATGCGAATGAAA
GC content:     51.0%
longest run:    5 bp

Next Steps

Using Constraints

Score sequences with built-in and custom constraints.

Using Generators

Propose candidate sequences.

Using Optimizers

Run and chain optimizers.

Multi-Stage DNA Optimization

The same objective refined across two optimizer stages.
