Constraints

A constraint encodes a biological requirement, such as a GC-content range, protein folding confidence, structural similarity, or motif presence, as a scoring function that the optimizer minimizes. Every constraint answers one question about a proposal sequence: how far is it from the requirement? The answer is a score between 0.0 (perfect) and 1.0 (worst). The optimizer combines all constraint scores into a single energy value and searches for sequences that minimize it.

The Scoring Model

Proto uses a unified scoring model where all constraints return values on the same [0.0, 1.0] scale:

    Perfect                                    Worst
    |------------------------------------------|
    0.0              0.5                      1.0
    GC content      GC 5% outside           GC 20% outside
    in range         target range             target range

The optimizer combines scores into a single energy:

Energy = Σ_i (weight_i x score_i)

Lower energy is better. A sequence with energy 0.0 satisfies all constraints perfectly.

python

# Example: Three constraints with different weights
# If a proposal scores 0.1 on GC, 0.3 on structure, 0.0 on homopolymer:
#
# Energy = (1.0 x 0.1) + (2.0 x 0.3) + (1.0 x 0.0) = 0.7
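The same arithmetic can be checked in a few lines of plain Python. This is a minimal sketch of the weighted sum only; the helper name total_energy and the dictionary keys are illustrative and are not part of Proto's API.

```python
def total_energy(scores, weights):
    """Weighted sum of constraint scores; lower is better."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same constraints")
    return sum(weights[name] * scores[name] for name in scores)


# Values taken from the worked example above.
scores = {"gc": 0.1, "structure": 0.3, "homopolymer": 0.0}
weights = {"gc": 1.0, "structure": 2.0, "homopolymer": 1.0}

energy = total_energy(scores, weights)
print(round(energy, 6))  # 0.7
```

Doubling the structure weight makes a 0.3 structure score cost as much as a 0.6 score would at weight 1.0, which is how weights express trade-offs.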

Two Modes: Scoring vs Filtering

Constraints operate in one of two mutually exclusive modes:

Scoring Mode (Soft)

Uses weight to control relative importance. Returns a float score that contributes to the total energy.

Guides optimization toward better solutions
Allows trade-offs between constraints
Default mode (weight=1.0)

python

# Target: GC content near 50-60%
Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 50, "max_gc": 60},
    weight=2.0,  # 2x importance
)

Filter Mode (Hard)

Uses threshold to create a binary pass/fail gate: a proposal passes when score <= threshold, otherwise it is rejected.

Proposals that fail are immediately rejected
Rejected proposals skip all scoring constraints
Saves compute on expensive evaluations

python

# "Homopolymers MUST be <= 4bp"
Constraint(
    inputs=[segment],
    function=max_homopolymer_constraint,
    function_config={"max_length": 4},
    threshold=0.0,  # score must be <= 0.0
)

weight and threshold are mutually exclusive. A constraint is either a weighted scorer OR a binary filter, never both. Setting both raises a ValueError.
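The two rules above (mutual exclusivity, and pass when score <= threshold) can be sketched as a small stand-alone class. This is an illustration under stated assumptions, not Proto's implementation: ModeSketch is a hypothetical name, and using None to mean "not set" is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModeSketch:
    """Hypothetical stand-in for the weight/threshold part of a constraint."""

    weight: Optional[float] = None      # None means "not set" in this sketch
    threshold: Optional[float] = None

    def __post_init__(self):
        # A constraint is a weighted scorer OR a binary filter, never both.
        if self.weight is not None and self.threshold is not None:
            raise ValueError("weight and threshold are mutually exclusive")

    @property
    def is_filter(self) -> bool:
        return self.threshold is not None

    def passes(self, score: float) -> bool:
        """Filters pass when score <= threshold; scorers never reject."""
        if not self.is_filter:
            return True
        return score <= self.threshold


homopolymer_filter = ModeSketch(threshold=0.0)
print(homopolymer_filter.passes(0.0))   # True
print(homopolymer_filter.passes(0.25))  # False

try:
    ModeSketch(weight=2.0, threshold=0.0)
except ValueError as err:
    print(err)  # weight and threshold are mutually exclusive
```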

The Evaluation Pipeline

When the optimizer calls score_energy(), constraints are evaluated in a specific order designed to reject bad proposals early and save expensive computation.

Key insight: Filter constraints run before scoring constraints. Rejected proposals skip scoring entirely, including expensive steps such as GPU-based structure prediction. Use cheap filters to screen out bad proposals before expensive scoring runs.
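The filter-then-score ordering can be sketched in plain Python. Everything here is an assumption made for illustration: score_energy_sketch, the use of plain strings as proposals, and returning None for rejected proposals are not taken from Proto. Only the ordering (filters first, scorers on survivors only) comes from the description above.

```python
from typing import Callable, List, Optional, Tuple

ScoreFn = Callable[[List[str]], List[float]]


def score_energy_sketch(
    proposals: List[str],
    filters: List[Tuple[ScoreFn, float]],   # (function, threshold)
    scorers: List[Tuple[ScoreFn, float]],   # (function, weight)
) -> List[Optional[float]]:
    """Return one energy per proposal, or None if a filter rejected it."""
    alive = list(range(len(proposals)))
    # Stage 1: filters run first and shrink the set of surviving proposals.
    for fn, threshold in filters:
        if not alive:
            break
        scores = fn([proposals[i] for i in alive])
        alive = [i for i, s in zip(alive, scores) if s <= threshold]
    energies: List[Optional[float]] = [None] * len(proposals)
    for i in alive:
        energies[i] = 0.0
    # Stage 2: scorers only ever see the survivors, as one batch per scorer.
    for fn, weight in scorers:
        if not alive:
            break
        scores = fn([proposals[i] for i in alive])
        for i, s in zip(alive, scores):
            energies[i] += weight * s
    return energies


def homopolymer_filter(seqs):
    """Toy filter: 0.0 when no run exceeds 4 bases, 1.0 otherwise."""
    def longest_run(s):
        best = run = 0
        prev = None
        for ch in s:
            run = run + 1 if ch == prev else 1
            prev = ch
            best = max(best, run)
        return best
    return [0.0 if longest_run(s) <= 4 else 1.0 for s in seqs]


seen_by_scorer = []

def expensive_scorer(seqs):
    """Toy scorer that records which proposals it was asked to score."""
    seen_by_scorer.extend(seqs)
    return [0.2 for _ in seqs]


energies = score_energy_sketch(
    ["ACGTACGT", "AAAAAAGT"],
    filters=[(homopolymer_filter, 0.0)],
    scorers=[(expensive_scorer, 2.0)],
)
print(energies)        # [0.4, None]
print(seen_by_scorer)  # ['ACGTACGT']
```

The second proposal contains a run of six A's, so the filter rejects it and the scorer never sees it.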

GPU memory and constraints: Constraint functions receive all passing proposals in a single batch call (the full List[Tuple[Sequence, ...]]), not one at a time. Unlike generators, constraints have no framework-level batch_size parameter; GPU memory management is handled internally by each tool. For example, ESMFold splits by total residue count (max_batch_residues), while Boltz2 and AlphaFold3 process complexes sequentially. Memory usage can be controlled via tool-specific config fields (e.g., max_batch_residues for ESMFold) rather than a constraint-level batch size.
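To make the idea of a residue budget concrete, here is a minimal greedy batching sketch. It is not ESMFold's or Proto's actual batching code; batch_by_residues is a hypothetical helper that only borrows the max_batch_residues name from the config field mentioned above.

```python
from typing import List


def batch_by_residues(sequences: List[str], max_batch_residues: int) -> List[List[str]]:
    """Greedily group sequences so each batch stays within the residue budget.

    A single sequence longer than the budget still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_residues = 0
    for seq in sequences:
        # Start a new batch when adding this sequence would exceed the budget.
        if current and current_residues + len(seq) > max_batch_residues:
            batches.append(current)
            current, current_residues = [], 0
        current.append(seq)
        current_residues += len(seq)
    if current:
        batches.append(current)
    return batches


batches = batch_by_residues(
    ["A" * 120, "A" * 80, "A" * 150, "A" * 40],
    max_batch_residues=200,
)
print([[len(s) for s in b] for b in batches])  # [[120, 80], [150, 40]]
```

A smaller budget gives more, smaller batches and lower peak memory at the cost of more forward passes.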

Creating Constraints

python

from proto_language.core import Constraint
from proto_language.constraint import gc_content_constraint

constraint = Constraint(
    inputs=[segment],                        # Segments to evaluate
    function=gc_content_constraint,          # The scoring function
    function_config={"min_gc": 50, "max_gc": 60},  # Config (dict or Pydantic model)
    label="gc_content",                      # Optional label for tracking
    weight=1.0,                              # Relative importance (scoring mode)
)

Key Parameters

inputs (List[Segment], required)

The segments this constraint evaluates. Single-segment constraints get one sequence per proposal. Multi-segment constraints get a tuple of sequences (one per segment), enabling cross-segment evaluations like protein-protein interactions.

function (Callable, required)

The scoring function: either a built-in from the constraint registry or a custom function. Must accept (input_sequences: list[tuple[Sequence, ...]], config) and return list[ConstraintOutput], one per proposal, each with a score in [0.0, 1.0].

function_config (dict | BaseModel, required)

Configuration for the scoring function. Can be a dictionary (auto-validated against the function’s Pydantic config class) or a Pydantic model instance.

label (str, default: function.__name__)

Label used for metadata tracking and result export. Defaults to the function name.

weight (float, default: 1.0)

Multiplier for the raw score in energy calculation. Only used in scoring mode. Mutually exclusive with threshold.

threshold (float, default: None)

If set, converts this to a filter constraint. Proposals with score <= threshold pass; others are rejected. Mutually exclusive with weight.

Constraint Categories

Proto provides built-in constraints organized by what they measure:

Sequence Composition

DNA/RNA sequence properties: GC content, k-mer frequency, homopolymer runs, and more.

Protein Quality

Protein sanity checks: length, complexity, amino acid balance, and more.

Protein Structure

3D folding and structural similarity: folding confidence, structural similarity, binding strength, and more.

Sequence Annotation

Functional element detection: motif search, promoter strength, sequence similarity, and more.

RNA Secondary Structure

RNA folding patterns: structure similarity, property matching, and more.

RNA Splicing

Splicing prediction and tissue specificity.

Sequence Alignment

Alignment scoring: alignment quality metrics.

See the Constraint Reference for the full list of built-in constraints and their configuration options.

Common Constraint Patterns

The three patterns below cover a DNA construct, a protein design workflow, and a multi-segment design.

A typical DNA construct optimization with sequence composition constraints:

python

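from proto_language.core import Constraint
from proto_language.constraint import (
    gc_content_constraint,
    max_homopolymer_constraint,
    kmer_frequency_constraint,
    specific_kmer_constraint,
)

# `segment` is a DNA Segment defined earlier in your script.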

constraints = [
    # Soft: optimize GC content toward 45-55%
    Constraint(
        inputs=[segment],
        function=gc_content_constraint,
        function_config={"min_gc": 45, "max_gc": 55},
        weight=1.0,
    ),
    # Hard filter: no homopolymer runs > 4bp
    Constraint(
        inputs=[segment],
        function=max_homopolymer_constraint,
        function_config={"max_length": 4},
        threshold=0.0,
    ),
    # Soft: keep any single 6-mer below 5% frequency
    Constraint(
        inputs=[segment],
        function=kmer_frequency_constraint,
        function_config={"k": 6, "min_value": 0.0, "max_value": 0.05},
        weight=0.5,
    ),
    # Hard filter: must NOT contain EcoRI site (GAATTC frequency must be 0)
    Constraint(
        inputs=[segment],
        function=specific_kmer_constraint,
        function_config={"kmer": "GAATTC", "min_value": 0.0, "max_value": 0.0},
        threshold=0.0,
    ),
]
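When debugging why a candidate passes or fails these composition constraints, it can help to compute the underlying quantities directly. The helpers below are plain Python written for this illustration (gc_percent, longest_homopolymer, and contains_site are not Proto functions), and they report raw measurements rather than Proto's normalized scores.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    best = run = 0
    prev = ""
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        best = max(best, run)
    return best


def contains_site(seq: str, site: str = "GAATTC") -> bool:
    """True if the recognition site occurs anywhere on the given strand."""
    return site in seq.upper()


candidate = "ATGCGCGAATTCAAAAAGCTAGC"
print(round(gc_percent(candidate), 1))  # 43.5
print(longest_homopolymer(candidate))   # 5
print(contains_site(candidate))         # True
```

This candidate would miss the 45-55% GC target slightly, and would be rejected by both hard filters: it has a 5 bp homopolymer run and contains an EcoRI site.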

A protein design workflow with structure prediction constraints:

python

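from proto_language.core import Constraint
from proto_language.constraint import (
    structure_plddt_constraint,
    structure_rmsd_constraint,
    protein_complexity_constraint,
    protein_repetitiveness_constraint,
)

# `protein_segment` is a protein Segment defined earlier in your script.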

constraints = [
    # Hard filter: reject badly folded proteins early.
    # The pLDDT constraint returns 1 - normalized pLDDT (lower is better),
    # so threshold=0.3 keeps only high-confidence folds.
    Constraint(
        inputs=[protein_segment],
        function=structure_plddt_constraint,
        function_config={"structure_tool": "esmfold"},
        threshold=0.3,
    ),
    # Soft: optimize toward target structure (most important)
    Constraint(
        inputs=[protein_segment],
        function=structure_rmsd_constraint,
        function_config={
            "target_structure": "target.pdb",
            "structure_tool": "esmfold",
        },
        weight=3.0,  # 3x importance
    ),
    # Soft: avoid low-complexity sequences
    Constraint(
        inputs=[protein_segment],
        function=protein_complexity_constraint,
        function_config={},
        weight=0.5,
    ),
    # Soft: avoid repetitive regions
    Constraint(
        inputs=[protein_segment],
        function=protein_repetitiveness_constraint,
        function_config={},
        weight=0.5,
    ),
]
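To see what threshold=0.3 means in pLDDT terms, the conversion can be worked through explicitly. This sketch assumes that "normalized pLDDT" means pLDDT divided by 100 (pLDDT is reported on a 0-100 scale); the helper names are illustrative and the exact normalization used by structure_plddt_constraint is not shown in this section.

```python
def plddt_to_score(plddt: float) -> float:
    """Constraint-style score, assuming score = 1 - pLDDT / 100."""
    return 1.0 - plddt / 100.0


def min_plddt_for_threshold(threshold: float) -> float:
    """Lowest pLDDT that still satisfies score <= threshold."""
    return 100.0 * (1.0 - threshold)


print(round(min_plddt_for_threshold(0.3), 6))  # 70.0
print(round(plddt_to_score(85.0), 6))          # 0.15 (passes a 0.3 threshold)
print(round(plddt_to_score(55.0), 6))          # 0.45 (rejected)
```

Under that assumption, the filter above keeps proposals with a pLDDT of roughly 70 or higher.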

Constraints that evaluate multiple segments together (e.g., protein-protein interactions):

python

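from proto_language.core import Constraint
from proto_language.constraint import (
    boltz_binding_strength_constraint,
    structure_plddt_constraint,
)

# Assumes Segment has already been imported in your script.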

binder = Segment(length=80, sequence_type="protein", label="binder")
target = Segment(length=150, sequence_type="protein", label="target")

constraints = [
    # Evaluate binding between two protein segments
    Constraint(
        inputs=[binder, target],  # Multi-segment input
        function=boltz_binding_strength_constraint,
        function_config={},
        weight=2.0,
    ),
    # Each segment also gets its own folding quality check
    Constraint(
        inputs=[binder],
        function=structure_plddt_constraint,
        function_config={"structure_tool": "esmfold"},
        weight=1.0,
    ),
]

Metadata Propagation

After evaluation, constraints write detailed results back to each sequence’s metadata. This shows why a sequence got its score:

python

# After running the optimizer...
sequence = segment.result_sequences[0]

# Access constraint metadata
gc_data = sequence.metadata["constraints"]["gc_content_constraint"]

# Standard fields
gc_data["score"]           # 0.12  (raw score before weighting)
gc_data["weight"]          # 1.0
gc_data["weighted_score"]  # 0.12  (score x weight)

# Custom data from the scoring function
gc_data["data"]["gc_content"]  # 52.3  (the actual GC percentage)

Full Metadata Structure

The metadata structure varies by constraint mode and number of input segments.

Single-segment scoring constraint:

python

sequence.metadata["constraints"]["gc_content_constraint"] = {
    "score": 0.12,
    "weight": 1.0,
    "weighted_score": 0.12,
    "data": {
        "gc_content": 52.3
    }
}

Multi-segment constraint (additional linking info):

python

protein_a.metadata["constraints"]["binding_constraint"] = {
    "score": 0.05,
    "weight": 2.0,
    "weighted_score": 0.10,
    "input_segments": ["construct_0.binder", "construct_0.target"],
    "position_in_inputs": 0,
    "data": {
        "binding_energy": -8.2
    }
}

The data field contains constraint-specific metrics that vary by function. It exposes the actual measured values (GC percentage, pLDDT score, RMSD in angstroms) rather than just the normalized score.
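Because the metadata is a plain nested dictionary, it is easy to rank constraints by how much each contributes to a sequence's energy. The helper below is a sketch written for this page: rank_contributions is not a Proto function, the input dictionary is hand-written to mirror the structures shown above, and skipping entries without a weighted_score is an assumption about how filter-mode entries might look.

```python
def rank_contributions(constraint_metadata: dict) -> list:
    """Return (label, score, weighted_score) tuples, largest contribution first."""
    rows = []
    for label, entry in constraint_metadata.items():
        # Assumption: entries lacking "weighted_score" do not contribute to energy.
        if "weighted_score" not in entry:
            continue
        rows.append((label, entry["score"], entry["weighted_score"]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Stand-in for sequence.metadata["constraints"], using the values shown above.
example_metadata = {
    "binding_constraint": {"score": 0.05, "weight": 2.0, "weighted_score": 0.10},
    "gc_content_constraint": {"score": 0.12, "weight": 1.0, "weighted_score": 0.12},
}

ranked = rank_contributions(example_metadata)
for label, score, weighted in ranked:
    print(f"{label}: score={score:.2f} weighted={weighted:.2f}")
# gc_content_constraint: score=0.12 weighted=0.12
# binding_constraint: score=0.05 weighted=0.10

total = sum(weighted for _, _, weighted in ranked)
print(round(total, 2))  # 0.22
```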

Custom Constraints

A constraint function can be defined without using the registry decorator:

python

from proto_language import ConstraintOutput
from proto_language.core import Constraint


def my_custom_constraint(input_sequences, config) -> list[ConstraintOutput]:
    """Score sequences by how close their length is to a target."""
    results = []
    for (seq,) in input_sequences:  # Single-segment: unpack 1-tuple
        actual = len(seq.sequence)
        target = config["target_length"]
        deviation = abs(actual - target) / target
        # Scoring functions return one ConstraintOutput per input: score in
        # [0.0, 1.0] (0 = perfect) plus optional diagnostic metadata.
        results.append(ConstraintOutput(score=min(deviation, 1.0), metadata={"length": actual}))
    return results

constraint = Constraint(
    inputs=[segment],
    function=my_custom_constraint,
    function_config={"target_length": 200},
)

Custom constraint functions must return a list[ConstraintOutput], one per input tuple, each with a score in [0.0, 1.0] (0 = perfect) plus optional metadata. ConstraintOutput is imported with from proto_language import ConstraintOutput. Returning bare floats raises a TypeError at evaluation, and a non-finite score (NaN or infinity) raises a ValueError.
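The same pattern extends to multi-segment constraints, where each input tuple holds one sequence per segment. The sketch below is self-contained so it can run without Proto installed: the Sequence and ConstraintOutput classes are minimal stand-ins that mirror only the fields used on this page, and length_ratio_constraint with its max_ratio config key is a hypothetical example. In real code, use Proto's own classes instead of the stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """Stand-in: only the .sequence attribute is used here."""
    sequence: str


@dataclass
class ConstraintOutput:
    """Stand-in mirroring the score + metadata fields described above."""
    score: float
    metadata: dict = field(default_factory=dict)


def length_ratio_constraint(input_sequences, config):
    """Penalize binders that are too long relative to their target."""
    max_ratio = config["max_ratio"]
    results = []
    for binder, target in input_sequences:  # Two segments: unpack a 2-tuple
        if not target.sequence:
            # An empty target cannot be evaluated; treat it as worst case.
            results.append(ConstraintOutput(score=1.0, metadata={"length_ratio": None}))
            continue
        ratio = len(binder.sequence) / len(target.sequence)
        excess = max(0.0, ratio - max_ratio)
        # Clamp to [0.0, 1.0] so the score stays on the shared scale.
        score = min(excess / max_ratio, 1.0)
        results.append(ConstraintOutput(score=score, metadata={"length_ratio": ratio}))
    return results


outputs = length_ratio_constraint(
    [
        (Sequence("M" * 80), Sequence("M" * 160)),   # ratio 0.5, within limit
        (Sequence("M" * 150), Sequence("M" * 100)),  # ratio 1.5, far over
    ],
    {"max_ratio": 0.5},
)
print([round(o.score, 3) for o in outputs])  # [0.0, 1.0]
```

With a Constraint, this function would be passed inputs=[binder, target] so that each proposal arrives as a (binder, target) tuple in that order.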

Next Steps

Optimizers

Learn how optimizers use constraints to search for optimal sequences

Tools

The bioinformatics tools that power constraint evaluation

Constraint Reference

Full API reference for every built-in constraint

Programs

Compose multi-stage pipelines with progressive constraints
