Constraints

A constraint encodes a biological requirement, such as a GC-content range, protein folding confidence, structural similarity, or motif presence, as a scoring function that the optimizer minimizes. Every constraint answers one question about a proposal sequence: how far is it from the requirement? The answer is a score between 0.0 (perfect) and 1.0 (worst). The optimizer combines all constraint scores into a single energy value and searches for sequences that minimize it.

The Scoring Model

Proto uses a unified scoring model where all constraints return values on the same [0.0, 1.0] scale:
    Perfect                                    Worst
    |------------------------------------------|
    0.0              0.5                      1.0
    GC content      GC 5% outside           GC 20% outside
    in range         target range             target range
The optimizer combines scores into a single energy:
Energy = Σ_i (weight_i × score_i)
Lower energy is better. A sequence with energy 0.0 satisfies all constraints perfectly.
python
# Example: Three constraints with different weights
# If a proposal scores 0.1 on GC, 0.3 on structure, 0.0 on homopolymer:
#
# Energy = (1.0 x 0.1) + (2.0 x 0.3) + (1.0 x 0.0) = 0.7
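The weighted sum in the example above can be reproduced in a few lines of plain Python. This is a minimal sketch of the arithmetic only, not Proto's implementation; the energy helper and the dictionary labels are hypothetical names chosen for illustration.

```python
def energy(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of constraint scores; lower is better."""
    return sum(weights[label] * score for label, score in scores.items())


# Scores and weights taken from the worked example above.
scores = {"gc": 0.1, "structure": 0.3, "homopolymer": 0.0}
weights = {"gc": 1.0, "structure": 2.0, "homopolymer": 1.0}

total = energy(scores, weights)
print(round(total, 3))
```

Doubling the structure weight makes a 0.3 structure score cost as much as a 0.6 score on either of the other constraints, which is how weights express relative importance.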

Two Modes: Scoring vs Filtering

Constraints operate in one of two mutually exclusive modes:

Scoring Mode (Soft)

Uses weight to control relative importance. Returns a float score that contributes to the total energy.
  • Guides optimization toward better solutions
  • Allows trade-offs between constraints
  • Default mode (weight=1.0)
python
# Target: GC content near 50-60%
Constraint(
    inputs=[segment],
    function=gc_content_constraint,
    function_config={"min_gc": 50, "max_gc": 60},
    weight=2.0,  # 2x importance
)

Filter Mode (Hard)

Uses threshold to create a binary pass/fail gate: a proposal passes when score <= threshold, otherwise it is rejected.
  • Proposals that fail are immediately rejected
  • Rejected proposals skip all scoring constraints
  • Saves compute on expensive evaluations
python
# "Homopolymers MUST be <= 4bp"
Constraint(
    inputs=[segment],
    function=max_homopolymer_constraint,
    function_config={"max_length": 4},
    threshold=0.0,  # score must be <= 0.0
)
weight and threshold are mutually exclusive. A constraint is either a weighted scorer OR a binary filter, never both. Setting both raises a ValueError.
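The mode rules can be pictured with a small stand-in class. This is a sketch under assumptions, not the Constraint class itself: ModeSpec, is_filter, and passes are hypothetical names, and treating an unset weight as None (resolved to 1.0 when no threshold is given) is one possible way to reconcile the default weight with the exclusivity rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModeSpec:
    """Hypothetical stand-in for the weight/threshold part of a constraint."""

    weight: Optional[float] = None
    threshold: Optional[float] = None

    def __post_init__(self) -> None:
        # A constraint is either a weighted scorer or a binary filter.
        if self.weight is not None and self.threshold is not None:
            raise ValueError("weight and threshold are mutually exclusive")
        # Assumption: with neither set, fall back to scoring mode, weight 1.0.
        if self.weight is None and self.threshold is None:
            self.weight = 1.0

    @property
    def is_filter(self) -> bool:
        return self.threshold is not None

    def passes(self, score: float) -> bool:
        """Filter-mode gate: a proposal passes when score <= threshold."""
        if not self.is_filter:
            raise ValueError("passes() only applies in filter mode")
        return score <= self.threshold


soft = ModeSpec(weight=2.0)      # scoring mode
hard = ModeSpec(threshold=0.0)   # filter mode
print(soft.is_filter, hard.is_filter, hard.passes(0.0), hard.passes(0.25))
```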

The Evaluation Pipeline

When the optimizer calls score_energy(), constraints are evaluated in a specific order designed to reject bad proposals early and save expensive computation:
Evaluation order (from the pipeline diagram):
  • All proposals enter the pipeline
  • Filter 1: Homopolymer <= 4bp. Fail: rejected (energy = inf). Pass: continue
  • Filter 2: Forbid GATC (specific k-mer). Fail: rejected (energy = inf). Pass: continue
  • Score 1: GC Content (weight=1.0)
  • Score 2: Structure pLDDT (weight=2.0)
  • Energy = Σ (weight × score)
Key insight: Filter constraints run before scoring constraints, so rejected proposals skip expensive scoring (such as GPU-based structure prediction) entirely. Use cheap filters to screen out bad proposals before the expensive scoring runs.
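The filter-then-score order can be sketched in plain Python. This is an illustration under assumptions, not the body of score_energy(): score_energy_sketch, the toy batch functions, and the linear GC penalty are all hypothetical, and sequences are plain strings rather than Proto objects.

```python
import math
from typing import Callable

# Hypothetical batch function: one score in [0.0, 1.0] per sequence.
BatchFn = Callable[[list[str]], list[float]]


def longest_run(seq: str) -> int:
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def homopolymer_filter(batch: list[str]) -> list[float]:
    return [0.0 if longest_run(s) <= 4 else 1.0 for s in batch]


def forbid_gatc(batch: list[str]) -> list[float]:
    return [1.0 if "GATC" in s else 0.0 for s in batch]


scored_batch_sizes: list[int] = []  # records how many proposals reach scoring


def gc_score(batch: list[str]) -> list[float]:
    scored_batch_sizes.append(len(batch))
    scores = []
    for s in batch:
        gc = (s.count("G") + s.count("C")) / len(s)
        distance = max(0.4 - gc, gc - 0.6, 0.0)   # toy target range: 40-60%
        scores.append(min(distance / 0.2, 1.0))   # toy linear penalty
    return scores


def score_energy_sketch(
    proposals: list[str],
    filters: list[tuple[BatchFn, float]],   # (function, threshold)
    scorers: list[tuple[BatchFn, float]],   # (function, weight)
) -> list[float]:
    energies = [0.0] * len(proposals)
    alive = list(range(len(proposals)))
    # Filters run first; each one sees only the proposals still alive.
    for fn, threshold in filters:
        scores = fn([proposals[i] for i in alive])
        survivors = []
        for i, score in zip(alive, scores):
            if score <= threshold:
                survivors.append(i)
            else:
                energies[i] = math.inf   # rejected
        alive = survivors
    # Scorers run once, as a single batch over the survivors.
    for fn, weight in scorers:
        scores = fn([proposals[i] for i in alive])
        for i, score in zip(alive, scores):
            energies[i] += weight * score
    return energies


proposals = ["ATGCATGCAT", "AAAAAGCGCG", "ATGATCGCAT", "ATGCATGATA"]
energies = score_energy_sketch(
    proposals,
    filters=[(homopolymer_filter, 0.0), (forbid_gatc, 0.0)],
    scorers=[(gc_score, 1.0)],
)
print(energies, scored_batch_sizes)
```

In this toy run two of the four proposals are rejected by the filters, so the scorer is called once with a batch of two. With a GPU-backed scorer in that position, the saving is the point of ordering filters first.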
GPU memory and constraints: Constraint functions receive all passing proposals in a single batch call (the full List[Tuple[Sequence, ...]]), not one at a time. Unlike generators, constraints have no framework-level batch_size parameter; GPU memory management is handled internally by each tool. For example, ESMFold splits by total residue count (max_batch_residues), while Boltz2 and AlphaFold3 process complexes sequentially. Memory usage can be controlled via tool-specific config fields (e.g., max_batch_residues for ESMFold) rather than a constraint-level batch size.
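To make the idea of splitting by total residue count concrete, here is one plausible greedy scheme. It is a sketch under assumptions and is not taken from ESMFold or from Proto's tool wrappers: split_by_residues is a hypothetical helper, and only the max_batch_residues name comes from the text above.

```python
def split_by_residues(sequences: list[str], max_batch_residues: int) -> list[list[str]]:
    """Greedily pack sequences into sub-batches whose total length fits the budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_residues = 0
    for seq in sequences:
        # Start a new sub-batch when adding this sequence would exceed the budget.
        if current and current_residues + len(seq) > max_batch_residues:
            batches.append(current)
            current, current_residues = [], 0
        current.append(seq)
        current_residues += len(seq)
    if current:
        batches.append(current)
    return batches


sequences = ["A" * 30, "A" * 40, "A" * 50, "A" * 120]
batches = split_by_residues(sequences, max_batch_residues=100)
print([[len(s) for s in batch] for batch in batches])
```

In this sketch a sequence longer than the budget still gets a sub-batch of its own, so nothing is dropped; a real tool may handle that case differently.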

Creating Constraints

python
from proto_language.core import Constraint
from proto_language.constraint import gc_content_constraint

constraint = Constraint(
    inputs=[segment],                        # Segments to evaluate
    function=gc_content_constraint,          # The scoring function
    function_config={"min_gc": 50, "max_gc": 60},  # Config (dict or Pydantic model)
    label="gc_content",                      # Optional label for tracking
    weight=1.0,                              # Relative importance (scoring mode)
)

Key Parameters

inputs (List[Segment], required)
The segments this constraint evaluates. Each proposal reaches the function as a tuple of sequences, one per segment: a 1-tuple for single-segment constraints, and a longer tuple for multi-segment constraints, which enables cross-segment evaluations such as protein-protein interactions.
function (Callable, required)
The scoring function from the constraint registry. Must accept (input_sequences: list[tuple[Sequence, ...]], config) and return list[ConstraintOutput], one per proposal, each with a score in [0.0, 1.0].
function_config (dict | BaseModel, required)
Configuration for the scoring function. Can be a dictionary (auto-validated against the function’s Pydantic config class) or a Pydantic model instance.
label (str, default: function.__name__)
Label used for metadata tracking and result export. Defaults to the function name.
weight (float, default: 1.0)
Multiplier for the raw score in the energy calculation. Only used in scoring mode. Mutually exclusive with threshold.
threshold (float, default: None)
If set, turns this constraint into a filter. Proposals with score <= threshold pass; others are rejected. Mutually exclusive with weight.
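The input shape for a multi-segment constraint can be illustrated with a toy two-segment function. This is a sketch under assumptions: Seq and Output are hypothetical stand-ins for Proto's Sequence and ConstraintOutput so the example runs on its own, and combined_length_constraint with its max_total_length option is invented for illustration, not a built-in.

```python
from dataclasses import dataclass, field


@dataclass
class Seq:
    """Hypothetical stand-in for Proto's Sequence."""
    sequence: str


@dataclass
class Output:
    """Hypothetical stand-in for ConstraintOutput."""
    score: float
    metadata: dict = field(default_factory=dict)


def combined_length_constraint(input_sequences, config):
    """Toy constraint: penalize pairs whose combined length exceeds a budget."""
    budget = config["max_total_length"]
    results = []
    # One tuple per proposal, one element per input segment.
    for binder, target in input_sequences:
        total = len(binder.sequence) + len(target.sequence)
        excess = max(0, total - budget)
        results.append(
            Output(score=min(excess / budget, 1.0), metadata={"total_length": total})
        )
    return results


pairs = [
    (Seq("A" * 40), Seq("A" * 50)),   # combined 90: within budget
    (Seq("A" * 80), Seq("A" * 70)),   # combined 150: 50 over budget
]
outputs = combined_length_constraint(pairs, {"max_total_length": 100})
print([o.score for o in outputs])
```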

Constraint Categories

Proto provides built-in constraints organized by what they measure:

Sequence Composition

DNA/RNA sequence properties: GC content, k-mer frequency, homopolymer runs, and more.

Protein Quality

Protein sanity checks: length, complexity, amino acid balance, and more.

Protein Structure

3D folding and structural similarity: folding confidence, structural similarity, binding strength, and more.

Sequence Annotation

Functional element detection: motif search, promoter strength, sequence similarity, and more.

RNA Secondary Structure

RNA folding patterns: structure similarity, property matching, and more.

RNA Splicing

Splicing prediction and tissue specificity.

Sequence Alignment

Alignment scoring: alignment quality metrics.
See the Constraint Reference for the full list of built-in constraints and their configuration options.

Common Constraint Patterns

A typical DNA construct optimization with sequence composition constraints:
python
from proto_language.constraint import (
    gc_content_constraint,
    max_homopolymer_constraint,
    kmer_frequency_constraint,
    specific_kmer_constraint,
)

constraints = [
    # Soft: optimize GC content toward 45-55%
    Constraint(
        inputs=[segment],
        function=gc_content_constraint,
        function_config={"min_gc": 45, "max_gc": 55},
        weight=1.0,
    ),
    # Hard filter: no homopolymer runs > 4bp
    Constraint(
        inputs=[segment],
        function=max_homopolymer_constraint,
        function_config={"max_length": 4},
        threshold=0.0,
    ),
    # Soft: keep any single 6-mer below 5% frequency
    Constraint(
        inputs=[segment],
        function=kmer_frequency_constraint,
        function_config={"k": 6, "min_value": 0.0, "max_value": 0.05},
        weight=0.5,
    ),
    # Hard filter: must NOT contain EcoRI site (GAATTC frequency must be 0)
    Constraint(
        inputs=[segment],
        function=specific_kmer_constraint,
        function_config={"kmer": "GAATTC", "min_value": 0.0, "max_value": 0.0},
        threshold=0.0,
    ),
]

Metadata Propagation

After evaluation, constraints write detailed results back to each sequence’s metadata. This shows why a sequence got its score:
python
# After running the optimizer...
sequence = segment.result_sequences[0]

# Access constraint metadata
gc_data = sequence.metadata["constraints"]["gc_content_constraint"]

# Standard fields
gc_data["score"]           # 0.12  (raw score before weighting)
gc_data["weight"]          # 1.0
gc_data["weighted_score"]  # 0.12  (score x weight)

# Custom data from the scoring function
gc_data["data"]["gc_content"]  # 52.3  (the actual GC percentage)
The metadata structure varies by constraint mode and number of input segments.
Single-segment scoring constraint:
python
sequence.metadata["constraints"]["gc_content_constraint"] = {
    "score": 0.12,
    "weight": 1.0,
    "weighted_score": 0.12,
    "data": {
        "gc_content": 52.3
    }
}
Multi-segment constraint (additional linking info):
python
protein_a.metadata["constraints"]["binding_constraint"] = {
    "score": 0.05,
    "weight": 2.0,
    "weighted_score": 0.10,
    "input_segments": ["construct_0.binder", "construct_0.target"],
    "position_in_inputs": 0,
    "data": {
        "binding_energy": -8.2
    }
}
The data field contains constraint-specific metrics that vary by function. It exposes the actual measured values (GC percentage, pLDDT score, RMSD in angstroms) rather than just the normalized score.
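Because the metadata is a plain nested dictionary, it is easy to rank constraints by how much each contributes to a sequence's energy. The helper below is a minimal sketch: summarize_constraints is a hypothetical name, the sample dictionary copies the values shown above, and entries without a weighted_score are skipped because the layout for filter-mode constraints is not shown here.

```python
def summarize_constraints(metadata: dict) -> list[tuple[str, float, float]]:
    """Return (label, score, weighted_score) rows, largest contribution first."""
    rows = []
    for label, entry in metadata.get("constraints", {}).items():
        if "weighted_score" not in entry:
            continue  # assumption: such entries do not contribute to energy
        rows.append((label, entry["score"], entry["weighted_score"]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample metadata mirroring the structures shown above.
metadata = {
    "constraints": {
        "binding_constraint": {
            "score": 0.05,
            "weight": 2.0,
            "weighted_score": 0.10,
            "data": {"binding_energy": -8.2},
        },
        "gc_content_constraint": {
            "score": 0.12,
            "weight": 1.0,
            "weighted_score": 0.12,
            "data": {"gc_content": 52.3},
        },
    }
}

rows = summarize_constraints(metadata)
for label, score, weighted in rows:
    print(f"{label}: score={score} weighted={weighted}")
```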

Custom Constraints

A constraint function can be defined without using the registry decorator:
python
from proto_language import ConstraintOutput
from proto_language.core import Constraint

def my_custom_constraint(input_sequences, config) -> list[ConstraintOutput]:
    """Score sequences by how close their length is to a target."""
    results = []
    for (seq,) in input_sequences:  # Single-segment: unpack 1-tuple
        actual = len(seq.sequence)
        target = config["target_length"]
        deviation = abs(actual - target) / target
        # Scoring functions return one ConstraintOutput per input: score in
        # [0.0, 1.0] (0 = perfect) plus optional diagnostic metadata.
        results.append(ConstraintOutput(score=min(deviation, 1.0), metadata={"length": actual}))
    return results

constraint = Constraint(
    inputs=[segment],
    function=my_custom_constraint,
    function_config={"target_length": 200},
)
Custom constraint functions must return a list[ConstraintOutput], one per input tuple, each with a score in [0.0, 1.0] (0 = perfect) plus optional metadata. Import the type with from proto_language import ConstraintOutput. Returning bare floats raises a TypeError at evaluation time, and a non-finite score raises a ValueError.
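Before wiring a custom function into an optimizer run, it can help to check its output against this contract in a unit test. The checker below is a sketch under assumptions: check_outputs is a hypothetical helper, Output is a stand-in for ConstraintOutput so the example runs without Proto installed, and the exceptions only mirror the error types described above rather than reproducing Proto's own validation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Output:
    """Hypothetical stand-in for ConstraintOutput."""
    score: float
    metadata: dict = field(default_factory=dict)


def check_outputs(outputs: list, n_inputs: int) -> None:
    """Raise if outputs break the contract: one in-range, finite score per input."""
    if len(outputs) != n_inputs:
        raise ValueError(f"expected {n_inputs} outputs, got {len(outputs)}")
    for out in outputs:
        if not isinstance(out, Output):
            raise TypeError(f"expected Output, got {type(out).__name__}")
        if not math.isfinite(out.score):
            raise ValueError("score must be finite")
        if not 0.0 <= out.score <= 1.0:
            raise ValueError(f"score {out.score} is outside [0.0, 1.0]")


check_outputs([Output(0.2), Output(0.0, {"length": 200})], n_inputs=2)  # passes silently
```

The same checks catch the two mistakes named above: returning bare floats and returning NaN or infinite scores.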

Next Steps

Optimizers

Learn how optimizers use constraints to search for optimal sequences

Tools

The bioinformatics tools that power constraint evaluation

Constraint Reference

Full API reference for every built-in constraint

Programs

Compose multi-stage pipelines with progressive constraints
