Segments

A Segment is a region of biological sequence to be designed or optimized. It is analogous to a gene annotation on a genome map: a bounded region with a specific function (a promoter, a coding sequence, a linker domain) that the framework fills with an optimal sequence. Each Segment maintains two pools of sequences that the optimization loop uses: one for exploring proposals, and one for preserving the best results found so far.

Creating Segments

Defining a Segment by length alone creates an entirely new sequence for that region from scratch. The framework fills it during optimization.
python
from proto_language.core import Segment

# Design a new 100bp DNA promoter
promoter = Segment(
    length=100,
    sequence_type="dna",
    label="promoter"
)

# Design a 250-residue protein enzyme
enzyme = Segment(
    length=250,
    sequence_type="protein",
    label="enzyme"
)
The Segment starts with an empty sequence. The Generator will populate it with proposals at the start of optimization.
Either length or sequence must be provided, but not both. If a sequence is provided, the length is inferred automatically.
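The length/sequence rule can be illustrated with a small stand-alone sketch. The function `resolve_length` below is hypothetical and not part of `proto_language`; it only mirrors the rule stated above, and the real constructor may raise a different exception type or message.

```python
from typing import Optional


def resolve_length(length: Optional[int] = None, sequence: Optional[str] = None) -> int:
    """Hypothetical helper: exactly one of length/sequence must be given."""
    if (length is None) == (sequence is None):
        # Either both were given or neither was given.
        raise ValueError("Provide exactly one of 'length' or 'sequence'.")
    if sequence is not None:
        # Length is inferred from the provided sequence.
        return len(sequence)
    return length


print(resolve_length(length=100))           # 100
print(resolve_length(sequence="ATGCATGC"))  # 8
```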

Sequence Types

| Sequence type | Valid characters | Example |
| --- | --- | --- |
| DNA | A C G T | Segment(length=100, sequence_type="dna") |
| RNA | A C G U | Segment(length=50, sequence_type="rna") |
| Protein | 20 standard amino acids | Segment(length=300, sequence_type="protein") |
| Ligand | SMILES syntax (RDKit-validated) | Segment(sequence="CCO", sequence_type="ligand") |
Ligand segments must be initialized with a sequence (a SMILES string), not just a length. SMILES strings cannot be generated at random, because an arbitrary string of SMILES characters rarely describes a chemically valid molecule.
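As a rough illustration of the fixed-alphabet sequence types, the sketch below checks a string against a per-type character set. The `ALPHABETS` mapping and `is_valid` function are illustrative names, not the library's API. Ligand validation is left out because it depends on RDKit parsing rather than a fixed alphabet.

```python
# Illustrative alphabets for the fixed-alphabet types; not taken from proto_language.
ALPHABETS = {
    "dna": set("ACGT"),
    "rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),  # 20 standard amino acids
}


def is_valid(sequence: str, sequence_type: str) -> bool:
    """Return True if every character belongs to the alphabet for the type."""
    return set(sequence) <= ALPHABETS[sequence_type]


print(is_valid("ATGC", "dna"))  # True
print(is_valid("AUGC", "dna"))  # False: U is not a DNA character
print(is_valid("AUGC", "rna"))  # True
```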

The Dual Pool Model

Each Segment maintains two separate lists of Sequence objects that serve different purposes during optimization:
[Diagram: inside a Segment, the Generator fills proposal_sequences (working space: many proposals being explored) with new proposals, which are scored by Constraints. The optimizer selects the best proposals into result_sequences (results space: best sequences found so far), which seed the next generation.]

proposal_sequences

The working space. Generators fill this pool with new proposals each step. Constraints score every proposal. The Optimizer ranks them and decides which survive.
  • Rebuilt every optimization step
  • Can contain many sequences (e.g., 100 proposals)
  • Internal to the optimization loop

result_sequences

The results space. The Optimizer promotes the best proposals here. This pool persists across optimization steps, and when using multi-stage Programs, carries results from one stage to the next.
  • Persists across steps and stages
  • Contains the top-K best sequences
  • User-facing output after optimization
python
# During optimization
segment.proposal_sequences   # List[Sequence] - current proposals
segment.result_sequences    # List[Sequence] - best found so far

# After optimization, read results from the result pool
for seq in segment.result_sequences:
    print(f"Sequence: {seq.sequence[:30]}...")
    print(f"Scores: {seq.metadata['constraints']}")
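To make the interplay of the two pools concrete, here is a toy loop written with plain strings instead of Sequence objects. Everything in it is an assumption made for illustration: the GC-fraction score stands in for Constraints, single-position mutation stands in for a Generator, and ranking the proposals together with the current results is only one plausible selection policy, not necessarily what the framework's Optimizer does.

```python
import random

random.seed(0)  # make the toy run reproducible

ALPHABET = "ACGT"
TOP_K = 3           # size of the result pool (assumed)
NUM_PROPOSALS = 20  # size of the proposal pool (assumed)


def gc_fraction(seq: str) -> float:
    """Stand-in constraint: score a sequence by its GC fraction."""
    return sum(base in "GC" for base in seq) / len(seq)


def mutate(seq: str) -> str:
    """Stand-in generator: redraw the base at one random position."""
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(ALPHABET) + seq[i + 1:]


# The result pool persists across steps; start it from random sequences.
result_sequences = [
    "".join(random.choice(ALPHABET) for _ in range(12)) for _ in range(TOP_K)
]

for step in range(10):
    # The proposal pool is rebuilt every step, seeded from the current results.
    proposal_sequences = [
        mutate(random.choice(result_sequences)) for _ in range(NUM_PROPOSALS)
    ]
    # Drop duplicates (order-preserving), rank by score, and keep the top K.
    candidates = list(dict.fromkeys(proposal_sequences + result_sequences))
    result_sequences = sorted(candidates, key=gc_fraction, reverse=True)[:TOP_K]

for seq in result_sequences:
    print(seq, round(gc_fraction(seq), 2))
```

The proposal list is thrown away and rebuilt on every pass, while the result list is the only state carried from one step to the next, which matches the roles described for the two pools.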

Properties Reference

| Property | Type | Description |
| --- | --- | --- |
| proposal_sequences | List[Sequence] | Current proposals (working space) |
| result_sequences | List[Sequence] | Best sequences found (results space) |
| num_proposals | int | Number of sequences in the proposal pool |
| num_results | int | Number of sequences in the result pool |
| sequence_type | SequenceType | "dna", "rna", "protein", or "ligand" (read-only) |
| valid_chars | Optional[Set[str]] | Allowed characters for this segment (read-only) |
| sequence_length | int | Expected length of sequences in this segment |
| original_sequence | Sequence | The original sequence provided at construction (read-only) |
| has_original_sequence | bool | True if created with a sequence (vs. just a length) |
| populated_sequences | bool | Whether segment has sequences from input or prior optimization |
| proposals_populated | bool | Whether proposal pool has non-empty sequences |
| is_ligand | bool | Whether this is a ligand segment (ligands cannot be mutated) |
| label | Optional[str] | Identifier for this segment (auto-assigned if not provided) |
| construct_label | Optional[str] | Label of the parent Construct (set by Program) |

Labels

Labels identify segments in multi-segment designs and appear in constraint metadata, so per-segment results are attributable to a named region.
python
promoter = Segment(length=100, sequence_type="dna", label="promoter")
cds = Segment(length=900, sequence_type="dna", label="cds")
terminator = Segment(length=50, sequence_type="dna", label="terminator")
If a label is not provided, segments are auto-labeled based on their position in the Construct: segment_0, segment_1, etc.
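The position-based naming can be sketched as follows. The helper `assign_labels` is hypothetical and works on plain label values rather than Segment objects. It assumes the index in the auto-generated name is the segment's position in the Construct, so an unlabeled segment in the second slot becomes segment_1 even when the first segment has an explicit label.

```python
from typing import List, Optional


def assign_labels(labels: List[Optional[str]]) -> List[str]:
    """Fill missing labels with segment_<position>, keeping explicit ones."""
    return [
        label if label is not None else f"segment_{i}"
        for i, label in enumerate(labels)
    ]


print(assign_labels(["promoter", None, None]))
# ['promoter', 'segment_1', 'segment_2']
```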

Custom Valid Characters

Restrict the allowed characters for specialized applications:
python
# AT-rich region (no G or C allowed)
at_rich = Segment(
    length=50,
    sequence_type="dna",
    valid_chars={"A", "T"},
    label="at_rich_spacer"
)

# Reduced amino acid alphabet for combinatorial libraries
combinatorial = Segment(
    length=100,
    sequence_type="protein",
    valid_chars={"A", "G", "S", "T", "N", "D", "E", "K"},
    label="library_region"
)
The valid_chars constraint is enforced during validation and used by Generators to only propose valid characters.
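The two uses of valid_chars, restricting what gets proposed and rejecting anything outside the set, can be sketched in a few lines. The functions `propose` and `validate` are stand-ins written for this example and are not the framework's Generator or validation code.

```python
import random
from typing import Set


def propose(length: int, valid_chars: Set[str], rng: random.Random) -> str:
    """Stand-in generator: draw each position from the allowed characters only."""
    choices = sorted(valid_chars)  # sorted so a given seed is reproducible
    return "".join(rng.choice(choices) for _ in range(length))


def validate(sequence: str, valid_chars: Set[str]) -> None:
    """Stand-in validation: reject any character outside the allowed set."""
    invalid = set(sequence) - valid_chars
    if invalid:
        raise ValueError(f"Invalid characters: {sorted(invalid)}")


rng = random.Random(42)
spacer = propose(50, {"A", "T"}, rng)
validate(spacer, {"A", "T"})  # passes silently
print(spacer)

try:
    validate("ATGA", {"A", "T"})
except ValueError as err:
    print(err)  # Invalid characters: ['G']
```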

Iteration and Indexing

Segments support direct iteration and indexing. Both operate on the result pool (result_sequences), not on the proposal pool:
python
# Iterate over result sequences
for sequence in segment:
    print(sequence.sequence)

# Index directly
best = segment[0]
print(best.sequence)
print(best.metadata["constraints"])

# Count results
print(f"{segment.num_results} results, {segment.num_proposals} proposals")
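One common way to get this behavior in Python is to delegate `__iter__` and `__getitem__` to the underlying list. The `PoolView` class below is a hypothetical stand-in that holds plain strings instead of Sequence objects. It shows the general pattern only and makes no claim about how Segment is implemented.

```python
from typing import Iterator, List


class PoolView:
    """Hypothetical stand-in: iteration and indexing go to the result pool."""

    def __init__(self, result_sequences: List[str]) -> None:
        self.result_sequences = result_sequences

    def __iter__(self) -> Iterator[str]:
        # Iterating the object iterates the result pool.
        return iter(self.result_sequences)

    def __getitem__(self, index: int) -> str:
        # Indexing the object indexes the result pool.
        return self.result_sequences[index]


view = PoolView(["ATGC", "GGCC"])
print(list(view))  # ['ATGC', 'GGCC']
print(view[0])     # ATGC
```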

Creation Patterns

python
# Optimize the middle region while keeping flanks fixed
flank_5 = Segment(sequence="ATGCATGC", sequence_type="dna", label="5_flank")
variable = Segment(length=100, sequence_type="dna", label="variable")
flank_3 = Segment(sequence="GCATGCAT", sequence_type="dna", label="3_flank")

Serialization

Segments serialize to dictionaries, preserving both pools and all metadata:
python
data = segment.to_dict()
# {
#     "original_sequence": { ... },
#     "sequence_length": 100,
#     "proposal_sequences": [{ ... }, { ... }],
#     "result_sequences": [{ ... }],
#     "sequence_type": "dna",
#     "valid_chars": ["A", "C", "G", "T"],
#     "label": "promoter"
# }
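A dictionary of this shape can be written to JSON and read back with the standard library, assuming it contains only JSON-compatible values as the example suggests. The dictionary below is hand-written to follow the documented keys. The nested sequence entries are left as empty placeholders because their fields are not shown here.

```python
import json

# Hand-written dictionary following the documented keys (placeholder values).
data = {
    "original_sequence": {},   # placeholder; real contents not shown in the docs
    "sequence_length": 100,
    "proposal_sequences": [],  # placeholder
    "result_sequences": [],    # placeholder
    "sequence_type": "dna",
    "valid_chars": ["A", "C", "G", "T"],
    "label": "promoter",
}

text = json.dumps(data, indent=2)  # dictionary -> JSON text
restored = json.loads(text)        # JSON text -> dictionary

print(restored == data)  # True

# valid_chars is stored as a list; convert to a set for membership checks.
allowed = set(restored["valid_chars"])
print("G" in allowed)    # True
```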

Next Steps

  • Constructs: Combine Segments into complete biological designs
  • Generators: How Generators propose new sequences for Segments
  • Sequences: The data model underlying each pool entry
  • Overview: See how Segments fit into the full architecture