Using Constraints

In proto-language, a constraint is a scoring function that quantifies how well a sequence satisfies a design requirement. During optimization, each candidate sequence is evaluated by every constraint in a program, and the optimizer searches for the sequences that minimize the resulting scores. This guide describes the constraint interface, the built-in constraints, the definition and registration of custom constraints, and the validation of a constraint prior to optimization. A small DNA design task serves as the running example: the construction of a 100 bp insert whose composition is balanced enough to synthesize reliably and to express efficiently.

The contract

A constraint operates on a batch of proposals rather than on a single sequence. It receives a list of sequence tuples, list[tuple[Sequence, ...]], containing one tuple per proposal, and returns a list of ConstraintOutput objects, one per proposal. Each ConstraintOutput carries a numeric score together with optional metadata.

Figure: a batch of sequences passes through a constraint function, which maps each proposal to a score between 0.0 (perfect) and 1.0 (worst); optimizers drive the scores toward 0.0.

Scores are bounded to the interval [0.0, 1.0], and their polarity is fixed throughout the framework: a score of 0.0 denotes a perfectly satisfied requirement, and a score of 1.0 denotes a maximally violated one. Because every constraint adheres to this convention, constraints of different types may be combined freely, and a single optimizer can minimize them simultaneously.
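
The interface can be illustrated without the library installed. The sketch below uses stand-in Sequence and ConstraintOutput dataclasses that carry only the attributes this guide relies on; they are assumptions made for illustration, not the classes from proto_language.core. The toy constraint fraction_not_a is likewise hypothetical and exists only to show the batch shape and the score polarity.

```python
from dataclasses import dataclass, field


# Stand-ins for proto_language.core.Sequence and ConstraintOutput, reduced to
# the attributes this guide uses. They are assumptions, not the real classes.
@dataclass
class Sequence:
    sequence: str
    sequence_type: str = "dna"


@dataclass
class ConstraintOutput:
    score: float
    metadata: dict = field(default_factory=dict)


def fraction_not_a(input_sequences, config):
    """Toy constraint: 0.0 when every base is A, 1.0 when no base is A."""
    outputs = []
    for (seq,) in input_sequences:  # one tuple per proposal, one segment per tuple
        s = seq.sequence.upper()
        a_count = s.count("A")
        score = 1.0 - a_count / max(len(s), 1)
        outputs.append(ConstraintOutput(score=score, metadata={"a_count": a_count}))
    return outputs


batch = [(Sequence("AAAA"),), (Sequence("ACGT"),), (Sequence("CCGG"),)]
outputs = fraction_not_a(batch, config=None)
scores = [o.score for o in outputs]
print(scores)  # [0.0, 0.75, 1.0]
```

The function returns exactly one output per proposal, in the same order as the batch, and the best proposal is the one with the lowest score.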

Defining the sequence

The basic unit of design is the Segment, a contiguous region of sequence with a defined type and length. A Construct groups the segments that together form a single molecule. The example below defines a single variable 100 bp DNA segment.

python

from proto_language.core import Segment, Construct

# A Segment is a stretch of sequence with a fixed type and length.
dna = Segment(length=100, sequence_type="dna", label="insert")

# A Construct groups the segments that form one molecule.
construct = Construct([dna])

Applying the built-in constraints

Many common sequence properties are available as built-in constraints. A Constraint object binds a scoring function to specific inputs: inputs identifies the segments it reads, function_config supplies its parameters, and label names its diagnostics in the output. A synthesizable, well-expressed insert is subject to two requirements: a balanced GC content, constrained here to 40-60%, and the absence of long single-base runs, which are prone to synthesis errors and can impede polymerases. These requirements correspond to the gc_content_constraint and max_homopolymer_constraint built-ins.

python

from proto_language.core import Constraint
from proto_language.constraint import gc_content_constraint, max_homopolymer_constraint

# Keep the GC content within a synthesizable, expression-friendly window.
gc = Constraint(
    inputs=[dna],
    function=gc_content_constraint,
    function_config={"min_gc": 40, "max_gc": 60},
    label="gc_content",
)

# Penalize single-base runs longer than five nucleotides.
no_homopolymers = Constraint(
    inputs=[dna],
    function=max_homopolymer_constraint,
    function_config={"max_length": 5},
    label="no_homopolymers",
)
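
The two quantities behind these requirements are simple to compute by hand, which helps when interpreting the diagnostics. The following is a plain-Python sketch of the measurements only; the helper names gc_percent and longest_homopolymer are hypothetical, and how the built-in constraints convert these values into a score in [0.0, 1.0] is not shown here.

```python
from itertools import groupby


def gc_percent(seq: str) -> float:
    """Percentage of bases that are G or C (0.0 for an empty sequence)."""
    s = seq.upper()
    if not s:
        return 0.0
    return 100.0 * sum(base in "GC" for base in s) / len(s)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


example = "ATGCGCAAAAAATTGC"
print(gc_percent(example))           # 37.5
print(longest_homopolymer(example))  # 6
```

With the configuration used in this guide, this example sequence would violate both requirements: its GC content lies below 40%, and its run of six A bases exceeds the limit of five.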

Running the optimization

A constraint does not act in isolation; an optimizer proposes sequences, which the constraints then score. The configuration below is the minimal optimization loop, described in detail in Using Optimizers. What matters for this guide is the information that the constraints report once the run has completed.

python

from proto_language.generator import (
    RandomNucleotideGenerator,
    RandomNucleotideGeneratorConfig,
)
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_language.core import Program

# The generator proposes new bases at masked positions on the segment.
generator = RandomNucleotideGenerator(RandomNucleotideGeneratorConfig())
generator.assign(dna)

# MCMC accepts or rejects each proposal according to its combined constraint score.
optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[gc, no_homopolymers],
    config=MCMCOptimizerConfig(
        num_results=1,
        proposals_per_result=1,
        num_steps=100,
        max_temperature=1.0,
    ),
)

# A Program runs one or more optimizers and collects the results.
program = Program(optimizers=[optimizer], num_results=1)
program.run()
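
To make the accept-or-reject step concrete, the sketch below shows a textbook Metropolis acceptance rule applied to a single combined score. It is an illustration under stated assumptions, not the implementation of MCMCOptimizer: the function name metropolis_accept is hypothetical, and how the library combines per-constraint scores or schedules its temperature is not specified in this guide.

```python
import math
import random


def metropolis_accept(current, proposed, temperature, rng):
    """Always accept an improvement; accept a worse score with probability exp(-delta / T).

    Scores follow the framework polarity (lower is better).
    The temperature must be positive.
    """
    delta = proposed - current
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


rng = random.Random(0)
print(metropolis_accept(0.5, 0.2, temperature=1.0, rng=rng))   # True: the score improved
print(metropolis_accept(0.2, 0.5, temperature=1e-9, rng=rng))  # False: acceptance probability underflows to 0.0
```

A rule of this kind depends on scores being comparable across proposals, which is why every constraint must return finite values on the same bounded scale.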

Each constraint records its score and diagnostics for every segment, indexed first by the segment label and then by the constraint label. These values are retrieved from the result sequence through metadata["segments"].

python

# The joined sequence concatenates every segment in the construct.
best = program.constructs[0].joined_sequences[0]

# Scores are indexed by the segment label ("insert"), then the constraint label.
# A separate name keeps the `gc` Constraint defined earlier from being overwritten.
gc_result = best.metadata["segments"]["insert"]["constraints"]["gc_content"]

print(f"final sequence: {best.sequence}")
print(f"GC content:     {gc_result['data']['gc_content']:.1f}%  (constraint score {gc_result['score']:.2f})")
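
When a program has several constraints, it can be convenient to collect every score in one pass. The sketch below operates on a hand-written dictionary whose nesting mirrors the access path used above (segments, then segment label, then constraints, then constraint label, with score and data keys). The values are invented for illustration, and the helper constraint_scores is hypothetical.

```python
# Hypothetical metadata: the nesting follows the access path shown in this
# guide; the numeric values are invented for illustration only.
metadata = {
    "segments": {
        "insert": {
            "constraints": {
                "gc_content": {"score": 0.0, "data": {"gc_content": 52.0}},
                "no_homopolymers": {"score": 0.25, "data": {}},
            }
        }
    }
}


def constraint_scores(metadata):
    """Flatten the nested metadata into {(segment_label, constraint_label): score}."""
    rows = {}
    for segment_label, segment in metadata["segments"].items():
        for constraint_label, entry in segment["constraints"].items():
            rows[(segment_label, constraint_label)] = entry["score"]
    return rows


for (segment_label, constraint_label), score in constraint_scores(metadata).items():
    print(f"{segment_label:10s} {constraint_label:18s} {score:.2f}")
```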

Defining a custom constraint

When no built-in constraint expresses the property of interest, a constraint can be defined as a function that conforms to the interface. The simplest form requires neither a decorator nor a configuration class; a function that accepts the batch and returns one ConstraintOutput per proposal is sufficient. The following example is biologically motivated. Spurious in-frame ATG codons within an insert can initiate off-target translation and reduce expression of the intended product. The constraint penalizes such internal start codons, disregarding the legitimate first codon.

python

from proto_language.core import Sequence, ConstraintOutput


def penalize_internal_atg(input_sequences, config):
    """Penalize in-frame internal ATG start codons. 0.0 = none, 1.0 = every internal codon is ATG."""
    results = []

    for (seq,) in input_sequences:
        s = seq.sequence.upper()

        # Read codons in frame, skipping the legitimate first codon.
        codons = [s[i : i + 3] for i in range(3, len(s) - 2, 3)]
        internal = sum(1 for c in codons if c == "ATG")

        # Normalize to [0, 1] so the score combines with other constraints.
        score = min(internal / max(len(codons), 1), 1.0)
        results.append(ConstraintOutput(score=score, metadata={"internal_atg": internal}))

    return results


# The bare function is passed directly to a Constraint; registration is not required.
no_internal_atg = Constraint(
    inputs=[dna],
    function=penalize_internal_atg,
    function_config={},
    label="no_internal_atg",
)

Registering a constraint

The inline form is appropriate for single use. A constraint intended for repeated use is registered with the @constraint decorator and a BaseConfig subclass. Registration supplies a configuration schema, with defaults, bounds, and descriptions; makes the constraint’s parameters tunable; and allows the constraint to be instantiated from the registry by key, in the same manner as the built-in constraints. Configuration classes derive from BaseConfig and declare their fields with ConfigField rather than with Pydantic’s Field.

python

from proto_language.utils.base import BaseConfig, ConfigField
from proto_language.constraint import constraint


class InternalStartCodonConfig(BaseConfig):
    """Configuration for the internal start-codon penalty."""

    start_codons: list[str] = ConfigField(
        default=["ATG"],
        title="Start Codons",
        description="Start codons to penalize when they appear in-frame after the first codon",
    )


@constraint(
    key="no-internal-start-codon",
    label="No Internal Start Codons",
    config=InternalStartCodonConfig,
    description="Penalize in-frame internal start codons that can initiate off-target translation",
    uses_gpu=False,
    tools_called=[],
    category="sequence_composition",
    supported_sequence_types=["dna"],
)
def no_internal_start_codon(input_sequences, config):
    results = []

    for (seq,) in input_sequences:
        s = seq.sequence.upper()

        # The penalized codons are now configurable through the config object.
        codons = [s[i : i + 3] for i in range(3, len(s) - 2, 3)]
        internal = sum(1 for c in codons if c in config.start_codons)

        score = min(internal / max(len(codons), 1), 1.0)
        results.append(
            ConstraintOutput(score=score, metadata={"internal_start_codons": internal})
        )

    return results

Validating a constraint

A constraint with an inverted polarity or an indexing error can direct an entire run toward the wrong objective without raising an error. Before it is incorporated into an optimizer, a constraint should be evaluated directly on sequences whose expected scores are known: a clean insert should score near 0.0, and a degenerate one near 1.0.

python

# Two reference sequences with known answers: one clean, one degenerate.
clean = Sequence(sequence="ATGAAAGCGATTATTGGTCTGGGTGCTTATCCGCAGTTT", sequence_type="dna")
repeats = Sequence(sequence="ATGATGATGATGATGATGATGATGATGATGATGATGATG", sequence_type="dna")

# Call the constraint directly, outside any optimizer.
config = InternalStartCodonConfig(start_codons=["ATG"])
scores = no_internal_start_codon([(clean,), (repeats,)], config)

print(f"clean insert: {scores[0].score:.3f}  ({scores[0].metadata['internal_start_codons']} internal ATG)")
print(f"ATG repeat:   {scores[1].score:.3f}  ({scores[1].metadata['internal_start_codons']} internal ATG)")

# The scores must be bounded, and the clean insert must score better.
assert 0.0 <= scores[0].score <= 1.0
assert scores[0].score < scores[1].score
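
The same checks can be gathered into a small reusable helper that inspects any batch of outputs. This is a sketch, assuming only that each output exposes a numeric score attribute; the ConstraintOutput class below is a stand-in for the library class, and check_outputs is a hypothetical name.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ConstraintOutput:  # stand-in for proto_language.core.ConstraintOutput
    score: float
    metadata: dict = field(default_factory=dict)


def check_outputs(outputs, batch_size):
    """Return a list of problems found; an empty list means the batch is well-formed."""
    problems = []
    if len(outputs) != batch_size:
        problems.append(f"expected {batch_size} outputs, got {len(outputs)}")
    for i, out in enumerate(outputs):
        score = getattr(out, "score", None)
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"output {i}: score is not a number")
        elif not math.isfinite(score):
            problems.append(f"output {i}: score is not finite")
        elif not 0.0 <= score <= 1.0:
            problems.append(f"output {i}: score {score} is outside [0.0, 1.0]")
    return problems


good = [ConstraintOutput(0.0), ConstraintOutput(1.0)]
bad = [ConstraintOutput(float("nan")), ConstraintOutput(1.5), 0.3]

print(check_outputs(good, batch_size=2))  # []
for problem in check_outputs(bad, batch_size=3):
    print(problem)
```

Such a helper verifies the shape and range of the outputs but not their polarity; the comparison between a clean and a degenerate reference sequence remains necessary for that.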

Practical considerations

The final score should be clamped to [0.0, 1.0] using min(max(raw, 0.0), 1.0), and the values from which it was derived should be retained in metadata. The score directs the optimizer, whereas the metadata records the basis for that score and can be inspected without re-evaluating the constraint.

Scores must be finite and within [0.0, 1.0]. Returning a bare float in place of a ConstraintOutput raises a TypeError, and an unbounded or NaN score causes the optimizer to behave erratically, because proposals can no longer be compared.

The polarity convention is strict: 0.0 is perfect and 1.0 is worst. A constraint for which a larger value is preferable must invert its score before returning it; otherwise it will oppose every other constraint in the program.
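
Clamping and inversion can be combined in one small helper. The sketch below is illustrative; the name to_score and its higher_is_better flag are hypothetical. Because min(max(raw, 0.0), 1.0) passes NaN through unchanged, the helper rejects NaN explicitly rather than returning an unusable score.

```python
import math


def to_score(raw, higher_is_better=False):
    """Map a raw value onto the framework convention: 0.0 is perfect, 1.0 is worst."""
    if math.isnan(raw):
        raise ValueError("raw value is NaN and cannot be converted to a score")
    bounded = min(max(raw, 0.0), 1.0)  # clamp to [0.0, 1.0]
    return 1.0 - bounded if higher_is_better else bounded


print(to_score(1.7))                          # 1.0 (clamped from above)
print(to_score(-0.2))                         # 0.0 (clamped from below)
print(to_score(0.25, higher_is_better=True))  # 0.75 (inverted)
```

A quantity such as a predicted expression level, already scaled to [0.0, 1.0] with larger values preferred, would pass through to_score with higher_is_better=True before being placed in a ConstraintOutput, with the raw value retained in metadata.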

Next Steps

Using Generators

Generate the candidate sequences that constraints evaluate.

Using Optimizers

Search sequence space to minimize constraint scores.

Constraints concept

How constraints relate to the rest of the model.

Constraint reference

The complete catalog of built-in constraints.
