Quantify design requirements with built-in and custom constraint functions
In proto-language, a constraint is a scoring function that quantifies how well a sequence
satisfies a design requirement. During optimization, each candidate sequence is evaluated
by every constraint in a program, and the optimizer searches for the sequences that
minimize the resulting scores.

This guide describes the constraint interface, the built-in constraints, the definition and
registration of custom constraints, and the validation of a constraint prior to
optimization. A small DNA design task serves as the running example: the construction of a
100 bp insert whose composition is balanced enough to synthesize reliably and to express
efficiently.
A constraint operates on a batch of proposals rather than on a single sequence. It receives
a list of sequence tuples, list[tuple[Sequence, ...]], containing one tuple per proposal,
and returns a list of ConstraintOutput objects, one per proposal. Each ConstraintOutput
carries a numeric score together with optional metadata.

Scores are bounded to the interval [0.0, 1.0], and their polarity is fixed throughout the
framework: a score of 0.0 denotes a perfectly satisfied requirement, and a score of 1.0
denotes a maximally violated one. Because every constraint adheres to this convention,
constraints of different types may be combined freely, and a single optimizer can minimize
them simultaneously.
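The shape of the interface can be sketched without the framework installed. In the sketch below, `Seq` and `Output` are hypothetical stand-ins for `Sequence` and `ConstraintOutput`, and `fraction_not_a` is a toy constraint invented for illustration; only the batch-in, list-out shape and the score polarity are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Seq:
    """Hypothetical stand-in for the framework's Sequence type."""
    sequence: str


@dataclass
class Output:
    """Hypothetical stand-in for ConstraintOutput: a score plus optional metadata."""
    score: float
    metadata: dict[str, Any] = field(default_factory=dict)


def fraction_not_a(input_sequences: list[tuple[Seq, ...]], config: dict) -> list[Output]:
    """Toy constraint: 0.0 when every base is 'A', 1.0 when none is."""
    results = []
    # One tuple per proposal; this constraint reads a single sequence from each tuple.
    for (seq,) in input_sequences:
        s = seq.sequence.upper()
        non_a = sum(1 for base in s if base != "A")
        score = non_a / max(len(s), 1)
        results.append(Output(score=score, metadata={"non_a": non_a}))
    return results


batch = [(Seq("AAAA"),), (Seq("AACC"),), (Seq("CCGG"),)]
outputs = fraction_not_a(batch, config={})
scores = [o.score for o in outputs]  # one score per proposal, lower is better
```

The function returns exactly one output per proposal, in the same order as the batch, which is what lets scores from different constraints be lined up proposal by proposal.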
The basic unit of design is the Segment, a contiguous region of sequence with a defined
type and length. A Construct groups the segments that together form a single molecule.
The example below defines a single variable 100 bp DNA segment.
python
from proto_language.core import Segment, Construct

# A Segment is a stretch of sequence with a fixed type and length.
dna = Segment(length=100, sequence_type="dna", label="insert")

# A Construct groups the segments that form one molecule.
construct = Construct([dna])
Many common sequence properties are available as built-in constraints. A Constraint
object binds a scoring function to specific inputs: inputs identifies the segments it
reads, function_config supplies its parameters, and label names its diagnostics in the
output.

A synthesizable, well-expressed insert is subject to two requirements: a balanced GC
content, constrained here to 40-60%, and the absence of long single-base runs, which are
prone to synthesis errors and can impede polymerases. These requirements correspond to the
gc_content_constraint and max_homopolymer_constraint built-ins.
python
from proto_language.core import Constraint
from proto_language.constraint import gc_content_constraint, max_homopolymer_constraint

# Keep the GC content within a synthesizable, expression-friendly window.
gc = Constraint(
    inputs=[dna],
    function=gc_content_constraint,
    function_config={"min_gc": 40, "max_gc": 60},
    label="gc_content",
)

# Penalize single-base runs longer than five nucleotides.
no_homopolymers = Constraint(
    inputs=[dna],
    function=max_homopolymer_constraint,
    function_config={"max_length": 5},
    label="no_homopolymers",
)
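The guide does not state how the built-ins turn a GC percentage or a run length into a score, so the following is only a plain-Python sketch of one plausible scoring shape. The function names and the linear penalty are assumptions made for illustration; only the parameter names `min_gc`, `max_gc`, and `max_length` mirror the configuration above.

```python
from itertools import groupby


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    s = seq.upper()
    return 100.0 * sum(1 for b in s if b in "GC") / max(len(s), 1)


def gc_window_score(seq: str, min_gc: float = 40.0, max_gc: float = 60.0) -> float:
    """Assumed shape: 0.0 inside the window, rising linearly to 1.0 at 0% or 100% GC."""
    gc = gc_percent(seq)
    if gc < min_gc:
        distance, span = min_gc - gc, min_gc
    elif gc > max_gc:
        distance, span = gc - max_gc, 100.0 - max_gc
    else:
        return 0.0
    return min(distance / span, 1.0)


def longest_run(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(group)) for _, group in groupby(seq.upper())), default=0)


def homopolymer_score(seq: str, max_length: int = 5) -> float:
    """Assumed shape: 0.0 up to max_length, then proportional to the excess run length."""
    excess = max(longest_run(seq) - max_length, 0)
    return min(excess / max(len(seq) - max_length, 1), 1.0)
```

Both functions return 0.0 when the requirement is met and approach 1.0 as the violation grows, which is the property that matters for combining them; the actual built-ins may use a different curve.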
A constraint does not act in isolation; an optimizer proposes sequences, which the
constraints then score. The configuration below is the minimal optimization loop, described
in detail in Using Optimizers. The relevant result
here is the information that the constraints report once the run has completed.
python
from proto_language.generator import (
    RandomNucleotideGenerator,
    RandomNucleotideGeneratorConfig,
)
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.core import Program

# The generator proposes new bases at masked positions on the segment.
generator = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig())
generator.assign(dna)

# MCMC accepts or rejects each proposal according to its combined constraint score.
optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc, no_homopolymers],
    config=MCMCOptimizerConfig(
        num_results=1,
        proposals_per_result=1,
        num_steps=100,
        max_temperature=1.0,
    ),
)

# A Program runs one or more optimizers and collects the results.
program = Program(optimizers=[optimizer], num_results=1)
program.run()
Each constraint records its score and diagnostics for every segment, indexed first by the
segment label and then by the constraint label. These values are retrieved from the result
sequence through metadata["segments"].
python
# The joined sequence concatenates every segment in the construct.
best = program.constructs[0].joined_sequences[0]

# Scores are indexed by the segment label ("insert"), then the constraint label.
# The result is named gc_result so that the gc Constraint defined earlier is not overwritten.
gc_result = best.metadata["segments"]["insert"]["constraints"]["gc_content"]

print(f"final sequence: {best.sequence}")
print(f"GC content: {gc_result['data']['gc_content']:.1f}% (constraint score {gc_result['score']:.2f})")
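The nested lookup can be generalized to report every constraint recorded for a segment. The sketch below runs on a hand-written dictionary shaped like the lookup above; the numbers are invented, the helper names are hypothetical, and it assumes that every constraint entry carries a `score` key in the same way the `gc_content` entry does.

```python
# Hypothetical metadata shaped like the lookup above; the values are made up.
metadata = {
    "segments": {
        "insert": {
            "constraints": {
                "gc_content": {"score": 0.0, "data": {"gc_content": 52.0}},
                "no_homopolymers": {"score": 0.1, "data": {}},
            }
        }
    }
}


def constraint_scores(metadata: dict, segment_label: str) -> dict[str, float]:
    """Collect {constraint label: score} for one segment."""
    constraints = metadata["segments"][segment_label]["constraints"]
    return {label: entry["score"] for label, entry in constraints.items()}


def violated(metadata: dict, segment_label: str, tolerance: float = 0.0) -> list[str]:
    """Labels of the constraints whose score exceeds the tolerance, sorted by name."""
    scores = constraint_scores(metadata, segment_label)
    return sorted(label for label, score in scores.items() if score > tolerance)
```

Calling `violated(best.metadata, "insert")` on a real result would then list the requirements that the final sequence still fails, provided the real metadata follows the same layout.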
When no built-in constraint expresses the property of interest, a constraint can be defined
as a function that conforms to the interface. The simplest form requires neither a decorator
nor a configuration class; a function that accepts the batch and returns one
ConstraintOutput per proposal is sufficient.

The following example is biologically motivated. Spurious in-frame ATG codons within an
insert can initiate off-target translation and reduce expression of the intended product.
The constraint penalizes such internal start codons, disregarding the legitimate first
codon.
python
from proto_language.core import Sequence, ConstraintOutput


def penalize_internal_atg(input_sequences, config):
    """Penalize in-frame internal ATG start codons. 0.0 = none, 1.0 = every codon is ATG."""
    results = []
    for (seq,) in input_sequences:
        s = seq.sequence.upper()
        # Read codons in frame, skipping the legitimate first codon.
        codons = [s[i : i + 3] for i in range(3, len(s) - 2, 3)]
        internal = sum(1 for c in codons if c == "ATG")
        # Normalize to [0, 1] so the score combines with other constraints.
        score = min(internal / max(len(codons), 1), 1.0)
        results.append(ConstraintOutput(score=score, metadata={"internal_atg": internal}))
    return results


# The bare function is passed directly to a Constraint; registration is not required.
no_internal_atg = Constraint(
    inputs=[dna],
    function=penalize_internal_atg,
    function_config={},
    label="no_internal_atg",
)
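The frame arithmetic in this constraint is the part most prone to off-by-one errors, and it can be checked in isolation. The helper below is a hypothetical extraction of the same slicing expression, shown on plain strings so that it runs without the framework.

```python
def in_frame_codons(s: str, skip_first: bool = True) -> list[str]:
    """Split a sequence into complete in-frame codons, optionally skipping the first."""
    start = 3 if skip_first else 0
    # Stopping at len(s) - 2 drops a trailing partial codon of one or two bases.
    return [s[i : i + 3] for i in range(start, len(s) - 2, 3)]


# An in-frame internal ATG is seen; the trailing "CC" is ignored.
in_frame = in_frame_codons("ATGAAAATGCC")

# The ATG that starts at index 4 is out of frame, so it never appears as a codon.
out_of_frame = in_frame_codons("ATGCATGAAA")
```

Here `in_frame` contains the codons `AAA` and `ATG`, whereas `out_of_frame` contains `CAT` and `GAA`: only start codons aligned with the reading frame of the first codon are counted.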
The inline form is appropriate for single use. A constraint intended for repeated use is
registered with the @constraint decorator and a BaseConfig subclass. Registration
supplies a configuration schema, with defaults, bounds, and descriptions; makes the
constraint’s parameters tunable; and allows the constraint to be instantiated from the
registry by key, in the same manner as the built-in constraints.

Configuration classes derive from BaseConfig and declare their fields with ConfigField
rather than with Pydantic’s Field.
python
from proto_language.utils.base import BaseConfig, ConfigField
from proto_language.constraint import constraint


class InternalStartCodonConfig(BaseConfig):
    """Configuration for the internal start-codon penalty."""

    start_codons: list = ConfigField(
        default=["ATG"],
        title="Start Codons",
        description="Start codons to penalize when they appear in-frame after the first codon",
    )


@constraint(
    key="no-internal-start-codon",
    label="No Internal Start Codons",
    config=InternalStartCodonConfig,
    description="Penalize in-frame internal start codons that can initiate off-target translation",
    uses_gpu=False,
    tools_called=[],
    category="sequence_composition",
    supported_sequence_types=["dna"],
)
def no_internal_start_codon(input_sequences, config):
    results = []
    for (seq,) in input_sequences:
        s = seq.sequence.upper()
        # The penalized codons are now configurable through the config object.
        codons = [s[i : i + 3] for i in range(3, len(s) - 2, 3)]
        internal = sum(1 for c in codons if c in config.start_codons)
        score = min(internal / max(len(codons), 1), 1.0)
        results.append(
            ConstraintOutput(score=score, metadata={"internal_start_codons": internal})
        )
    return results
A constraint with an inverted polarity or an indexing error can direct an entire run toward
the wrong objective without raising an error. Before it is incorporated into an optimizer, a
constraint should be evaluated directly on sequences whose expected scores are known: a
clean insert should score near 0.0, and a degenerate one near 1.0.
python
# Two reference sequences with known answers: one clean, one degenerate.
clean = Sequence(sequence="ATGAAAGCGATTATTGGTCTGGGTGCTTATCCGCAGTTT", sequence_type="dna")
repeats = Sequence(sequence="ATGATGATGATGATGATGATGATGATGATGATGATGATG", sequence_type="dna")

# Call the constraint directly, outside any optimizer.
config = InternalStartCodonConfig(start_codons=["ATG"])
scores = no_internal_start_codon([(clean,), (repeats,)], config)

print(f"clean insert: {scores[0].score:.3f} ({scores[0].metadata['internal_start_codons']} internal ATG)")
print(f"ATG repeat: {scores[1].score:.3f} ({scores[1].metadata['internal_start_codons']} internal ATG)")

# Every score must be bounded, and the clean insert must score better.
assert all(0.0 <= s.score <= 1.0 for s in scores)
assert scores[0].score < scores[1].score
The final score should be clamped to [0.0, 1.0] using min(max(raw, 0.0), 1.0), and the
values from which it was derived should be retained in metadata. The score directs the
optimizer, whereas the metadata records the basis for that score and can be inspected
without re-evaluating the constraint.
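As a minimal sketch of this advice, the helper below clamps a raw value and keeps the unclamped value alongside it. The function names and the `raw` metadata key are illustrative choices, and a plain dictionary stands in for `ConstraintOutput`.

```python
def clamp01(raw: float) -> float:
    """Clamp a raw penalty to the interval [0.0, 1.0]."""
    return min(max(raw, 0.0), 1.0)


def scored(raw: float) -> dict:
    """Pair the clamped score with the raw value it was derived from."""
    return {"score": clamp01(raw), "metadata": {"raw": raw}}


# A raw penalty of 1.7 saturates at 1.0, but the metadata still records how far over it was.
over = scored(1.7)
under = scored(-0.2)
inside = scored(0.25)
```

Keeping the raw value means that two proposals which both saturate at 1.0 can still be told apart when the results are inspected afterwards.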
Scores must be finite and within [0.0, 1.0]. Returning a bare float in place of a
ConstraintOutput raises a TypeError, and an unbounded or NaN score causes the
optimizer to behave erratically, because proposals can no longer be compared.
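These conditions can be checked mechanically before a constraint is handed to an optimizer. The following sketch uses a hypothetical `Output` stand-in for `ConstraintOutput` and a hypothetical `check_outputs` helper; it is not part of the framework. Note that the clamp `min(max(raw, 0.0), 1.0)` returns NaN unchanged in Python, so finiteness has to be tested separately.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Output:
    """Hypothetical stand-in for ConstraintOutput."""
    score: float
    metadata: dict = field(default_factory=dict)


def check_outputs(outputs: list, batch_size: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    if len(outputs) != batch_size:
        problems.append(f"expected {batch_size} outputs, got {len(outputs)}")
    for i, out in enumerate(outputs):
        # A bare float has no .score attribute, so it is reported rather than accepted.
        score = getattr(out, "score", None)
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"output {i}: no numeric score")
        elif not math.isfinite(score):
            problems.append(f"output {i}: score is not finite")
        elif not 0.0 <= score <= 1.0:
            problems.append(f"output {i}: score {score} is outside [0.0, 1.0]")
    return problems
```

Running `check_outputs(my_constraint(batch, config), len(batch))` on a few reference batches and asserting that the result is empty catches these failures before they can distort a run.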
The polarity convention is strict: 0.0 is perfect and 1.0 is worst. A constraint for which a
larger value is preferable must invert its score before returning it; otherwise it will
oppose every other constraint in the program.
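One way to perform the inversion is to rescale the quantity so that its best value maps to 0.0 and its worst to 1.0. The helper below is a generic sketch with a hypothetical name; the example quantity, a predicted expression level between 0 and 100, is invented for illustration.

```python
def to_penalty(value: float, best: float, worst: float) -> float:
    """Map a value onto the 0.0-is-best scale, given its best and worst possible values."""
    if best == worst:
        raise ValueError("best and worst must differ")
    raw = (best - value) / (best - worst)
    # Clamp so that values beyond either end stay inside [0.0, 1.0].
    return min(max(raw, 0.0), 1.0)


# Higher expression is preferable, so 100 is the best value and 0 the worst.
high_expression = to_penalty(100.0, best=100.0, worst=0.0)
mid_expression = to_penalty(75.0, best=100.0, worst=0.0)
no_expression = to_penalty(0.0, best=100.0, worst=0.0)
```

With this mapping a proposal that improves the underlying quantity always lowers the score, so the constraint pulls in the same direction as the rest of the program.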