Segments

A Segment is a region of biological sequence to be designed or optimized. It is analogous to a gene annotation on a genome map: a bounded region with a specific function (a promoter, a coding sequence, a linker domain) that the framework fills with an optimal sequence. Each Segment maintains two pools of sequences that the optimization loop uses: one for exploring proposals, and one for preserving the best results found so far.

Creating Segments

From Length (design from scratch)

Designs a sequence for the region from scratch: you specify only the length, and the framework fills in the sequence during optimization.

python

from proto_language.core import Segment

# Design a new 100bp DNA promoter
promoter = Segment(
    length=100,
    sequence_type="dna",
    label="promoter"
)

# Design a 250-residue protein enzyme
enzyme = Segment(
    length=250,
    sequence_type="protein",
    label="enzyme"
)

The Segment starts with an empty sequence. The Generator will populate it with proposals at the start of optimization.

From Sequence (optimize existing)

Starts from an existing sequence to be improved. The provided sequence becomes the starting point for the Generator.

python

from proto_language.core import Segment

# Optimize an existing promoter
promoter = Segment(
    sequence="TTGACATAAATACCACTGGCGGTGATACTGAGCAC",
    sequence_type="dna",
    label="lac_promoter"
)

# Optimize a known protein (here, a GFP fragment)
gfp = Segment(
    sequence="MSKGEELFTGVVPILVELDGDVNGHKFSVSG",
    sequence_type="protein",
    label="gfp_fragment"
)

# Length is inferred from the sequence
promoter.sequence_length  # 35
gfp.sequence_length       # 31

Either length or sequence must be provided, but not both. If a sequence is provided, the length is inferred automatically.
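The either-or rule can be pictured as a small validation step. The sketch below is standalone and does not use proto_language: `resolve_length` is a hypothetical helper, and the exception type and message are assumptions rather than the library's documented behavior.

```python
from typing import Optional


def resolve_length(length: Optional[int] = None, sequence: Optional[str] = None) -> int:
    """Return the segment length, requiring exactly one of the two arguments."""
    if (length is None) == (sequence is None):
        # True when neither argument is given and also when both are given.
        raise ValueError("Provide exactly one of `length` or `sequence`.")
    if sequence is not None:
        return len(sequence)  # length is inferred from the sequence
    return length


print(resolve_length(length=100))                  # 100
print(resolve_length(sequence="TTGACATAAATACC"))   # 14
```

Comparing the two `is None` checks for equality catches both invalid cases in one branch, which keeps the rule easy to read.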

Sequence Types

Sequence type	Valid characters	Example
DNA	`A` `C` `G` `T`	`Segment(length=100, sequence_type="dna")`
RNA	`A` `C` `G` `U`	`Segment(length=50, sequence_type="rna")`
Protein	20 standard amino acids	`Segment(length=300, sequence_type="protein")`
Ligand	SMILES syntax (RDKit-validated)	`Segment(sequence="CCO", sequence_type="ligand")`

Ligand segments must be initialized with a sequence (a SMILES string), not just a length. Unlike nucleotide or protein sequences, SMILES strings cannot be generated by sampling characters at random, because the result must describe a chemically valid molecule.
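To make the difference between alphabet-based types and ligands concrete, here is a minimal sketch of a per-character check. It is illustrative only: `ALPHABETS` and `invalid_positions` are hypothetical names that mirror the table above, not part of proto_language, and the ligand branch simply refuses rather than calling a chemistry toolkit.

```python
from typing import Dict, List, Set

# Hypothetical alphabets mirroring the table above; not imported from the library.
ALPHABETS: Dict[str, Set[str]] = {
    "dna": set("ACGT"),
    "rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),  # 20 standard amino acids
}


def invalid_positions(sequence: str, sequence_type: str) -> List[int]:
    """Return indices of characters that fall outside the alphabet for this type."""
    if sequence_type == "ligand":
        # SMILES validity is a property of the whole string, not of single
        # characters, so a character check cannot decide it.
        raise ValueError("Ligands need a SMILES parser, not a character check.")
    allowed = ALPHABETS[sequence_type]
    return [i for i, ch in enumerate(sequence) if ch not in allowed]


print(invalid_positions("ACGT", "dna"))  # []
print(invalid_positions("ACGU", "dna"))  # [3]
```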

The Dual Pool Model

Each Segment maintains two separate lists of Sequence objects that serve different purposes during optimization:

proposal_sequences

The working space. Generators fill this pool with new proposals each step. Constraints score every proposal. The Optimizer ranks them and decides which survive.

Rebuilt every optimization step
Can contain many sequences (e.g., 100 proposals)
Internal to the optimization loop

result_sequences

The results space. The Optimizer promotes the best proposals here. This pool persists across optimization steps, and when using multi-stage Programs, carries results from one stage to the next.

Persists across steps and stages
Contains the top-K best sequences
User-facing output after optimization

python

# During optimization
segment.proposal_sequences   # List[Sequence] - current proposals
segment.result_sequences    # List[Sequence] - best found so far

# After optimization, read results from the result pool
for seq in segment.result_sequences:
    print(f"Sequence: {seq.sequence[:30]}...")
    print(f"Scores: {seq.metadata['constraints']}")
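The interplay between the two pools can be sketched with plain Python lists. This is a toy loop under stated assumptions, not the framework's Optimizer: `gc_fraction` stands in for a Constraint score, `TOP_K` is a made-up setting, and ranking proposals together with earlier results is one plausible promotion policy.

```python
import random

rng = random.Random(0)  # fixed seed so the toy run is repeatable
TOP_K = 3


def gc_fraction(seq: str) -> float:
    """Toy score standing in for a Constraint: fraction of G or C characters."""
    return sum(ch in "GC" for ch in seq) / len(seq)


result_pool = []  # persists across steps
for step in range(5):
    # The proposal pool is rebuilt from scratch on every step.
    proposal_pool = ["".join(rng.choice("ACGT") for _ in range(20)) for _ in range(10)]
    # Rank new proposals together with earlier results; keep only the top K.
    candidates = set(result_pool) | set(proposal_pool)
    ranked = sorted(candidates, key=lambda s: (gc_fraction(s), s), reverse=True)
    result_pool = ranked[:TOP_K]
    print(f"step {step}: {len(proposal_pool)} proposals, best GC = {gc_fraction(result_pool[0]):.2f}")
```

Because earlier results compete with each new batch, the best score in the result pool never decreases from one step to the next, while the proposal pool is discarded every time.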

Properties Reference

Pool properties

Property	Type	Description
`proposal_sequences`	`List[Sequence]`	Current proposals (working space)
`result_sequences`	`List[Sequence]`	Best sequences found (results space)
`num_proposals`	`int`	Number of sequences in the proposal pool
`num_results`	`int`	Number of sequences in the result pool

Sequence properties

Property	Type	Description
`sequence_type`	`SequenceType`	`"dna"`, `"rna"`, `"protein"`, or `"ligand"` (read-only)
`valid_chars`	`Optional[Set[str]]`	Allowed characters for this segment (read-only)
`sequence_length`	`int`	Expected length of sequences in this segment
`original_sequence`	`Sequence`	The original sequence provided at construction (read-only)
`has_original_sequence`	`bool`	`True` if created with a sequence (vs. just a length)

State properties

Property	Type	Description
`populated_sequences`	`bool`	Whether segment has sequences from input or prior optimization
`proposals_populated`	`bool`	Whether proposal pool has non-empty sequences
`is_ligand`	`bool`	Whether this is a ligand segment (ligands cannot be mutated)
`label`	`Optional[str]`	Identifier for this segment (auto-assigned if not provided)
`construct_label`	`Optional[str]`	Label of the parent Construct (set by Program)

Labels

Labels identify segments in multi-segment designs and appear in constraint metadata, so per-segment results are attributable to a named region.

python

promoter = Segment(length=100, sequence_type="dna", label="promoter")
cds = Segment(length=900, sequence_type="dna", label="cds")
terminator = Segment(length=50, sequence_type="dna", label="terminator")

If a label is not provided, segments are auto-labeled based on their position in the Construct: segment_0, segment_1, etc.
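The auto-labeling rule can be sketched as follows. `auto_label` is a hypothetical helper, and it assumes the index counts every position in the Construct, including segments that already carry a label; the library may number unlabeled segments differently.

```python
from typing import List, Optional


def auto_label(labels: List[Optional[str]]) -> List[str]:
    """Fill missing labels with segment_<position>, keeping explicit labels."""
    return [
        label if label is not None else f"segment_{index}"
        for index, label in enumerate(labels)
    ]


print(auto_label(["promoter", None, None]))
# ['promoter', 'segment_1', 'segment_2']
```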

Custom Valid Characters

Restrict the allowed characters for specialized applications:

python

# AT-rich region (no G or C allowed)
at_rich = Segment(
    length=50,
    sequence_type="dna",
    valid_chars={"A", "T"},
    label="at_rich_spacer"
)

# Reduced amino acid alphabet for combinatorial libraries
combinatorial = Segment(
    length=100,
    sequence_type="protein",
    valid_chars={"A", "G", "S", "T", "N", "D", "E", "K"},
    label="library_region"
)

The valid_chars restriction is enforced during validation, and Generators use it so that they only propose allowed characters.
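As a rough picture of how a restricted alphabet shapes proposals, the sketch below samples a random sequence from a set of allowed characters. `random_proposal` is a hypothetical function written for this page; real Generators may sample in a very different way.

```python
import random
from typing import Set


def random_proposal(length: int, valid_chars: Set[str], rng: random.Random) -> str:
    """Draw `length` characters uniformly from the allowed alphabet."""
    alphabet = sorted(valid_chars)  # sets are unordered; sort for reproducibility
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(42)
spacer = random_proposal(50, {"A", "T"}, rng)
print(len(spacer), set(spacer) <= {"A", "T"})  # 50 True
```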

Iteration and Indexing

Segments support direct iteration and indexing. Both operate on the result pool (result_sequences), not on the proposal pool:

python

# Iterate over result sequences
for sequence in segment:
    print(sequence.sequence)

# Index directly
best = segment[0]
print(best.sequence)
print(best.metadata["constraints"])

# Count results
print(f"{segment.num_results} results, {segment.num_proposals} proposals")
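For readers unfamiliar with how an object supports `for` loops and `[]` indexing, the following sketch shows the standard Python protocol methods delegating to a result list. `ResultView` is a toy stand-in invented for this example and holds plain strings; it is not the Segment class.

```python
from typing import Iterator, List


class ResultView:
    """Toy stand-in: iteration and indexing delegate to the result pool."""

    def __init__(self, result_sequences: List[str]) -> None:
        self.result_sequences = result_sequences

    def __iter__(self) -> Iterator[str]:
        return iter(self.result_sequences)

    def __getitem__(self, index: int) -> str:
        return self.result_sequences[index]


view = ResultView(["ATGC", "GGCC"])
print(view[0])     # ATGC
print(list(view))  # ['ATGC', 'GGCC']
```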

Creation Patterns

python

# Optimize the middle region while keeping flanks fixed
flank_5 = Segment(sequence="ATGCATGC", sequence_type="dna", label="5_flank")
variable = Segment(length=100, sequence_type="dna", label="variable")
flank_3 = Segment(sequence="GCATGCAT", sequence_type="dna", label="3_flank")
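To see what this pattern implies for the full-length product, the sketch below joins two fixed flanks around a variable region using plain strings. It assumes the segments are concatenated in the order listed, which this page does not specify; `assemble` is a hypothetical helper.

```python
import random

FLANK_5 = "ATGCATGC"
FLANK_3 = "GCATGCAT"


def assemble(variable: str) -> str:
    """Join the fixed flanks around a candidate for the variable region."""
    return FLANK_5 + variable + FLANK_3


rng = random.Random(1)
variable = "".join(rng.choice("ACGT") for _ in range(100))
full = assemble(variable)
print(len(full))                                         # 116
print(full.startswith(FLANK_5), full.endswith(FLANK_3))  # True True
```

Only the 100 positions in the middle differ between candidates; the 8 bases on each side are identical in every assembled sequence.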

Serialization

Segments serialize to dictionaries, preserving both pools and all metadata:

python

data = segment.to_dict()
# {
#     "original_sequence": { ... },
#     "sequence_length": 100,
#     "proposal_sequences": [{ ... }, { ... }],
#     "result_sequences": [{ ... }],
#     "sequence_type": "dna",
#     "valid_chars": ["A", "C", "G", "T"],
#     "label": "promoter"
# }
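A dictionary of this shape can be written to JSON with the standard library. The sketch below uses a hand-built dictionary that mimics the keys shown above (without `original_sequence`, and with empty pools) instead of calling `to_dict()`, so it runs without the framework. Converting `valid_chars` back to a set after loading is an assumption about how a caller might want to use it.

```python
import json

# Hand-built stand-in for the output of to_dict(); values are illustrative.
segment_dict = {
    "sequence_length": 100,
    "proposal_sequences": [],
    "result_sequences": [],
    "sequence_type": "dna",
    "valid_chars": sorted({"A", "C", "G", "T"}),  # a set is not JSON-serializable
    "label": "promoter",
}

text = json.dumps(segment_dict, indent=2)
restored = json.loads(text)
restored["valid_chars"] = set(restored["valid_chars"])  # back to a set for lookups
print(restored["label"], restored["sequence_length"])   # promoter 100
```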

Next Steps

Constructs

Combine Segments into complete biological designs

Generators

How Generators propose new sequences for Segments

Sequences

The data model underlying each pool entry

Overview

See how Segments fit into the full architecture