Sequences

A Sequence is the most fundamental object in Proto. It wraps a biological string (DNA, RNA, protein, or ligand) with type validation, character enforcement, and a metadata system that tracks constraint scores and optimization history. Sequences are rarely created directly; they are created and managed by Segments during optimization. Understanding the Sequence data model is still essential for interpreting results and working with the metadata system.

What is a Sequence?

A Sequence bundles three things together:

The string is the raw biological sequence: nucleotides, amino acids, or a SMILES representation of a small molecule.

The type determines which characters are valid and how the sequence is validated.

Metadata tracks constraint scores, system properties, and any custom data attached.

Creating Sequences

Direct Creation

python

from proto_language.core import Sequence

# DNA sequence
dna = Sequence(
    sequence="ATGCGATCGATCGATCG",
    sequence_type="dna"
)

# Protein sequence
protein = Sequence(
    sequence="MKFLILLFNILCLFPVLAAD",
    sequence_type="protein"
)

# With custom metadata
annotated = Sequence(
    sequence="AGGAGGTTTTTATG",
    sequence_type="dna",
    metadata={"source": "E. coli K-12", "region": "rbs"}
)

Through Segments (typical)

Most Sequences are created implicitly when a Segment is built. The Segment manages dual pools of Sequences during optimization.

python

from proto_language.core import Segment

# Segment creates internal Sequence objects
seg = Segment(sequence="ATGCGATCG", sequence_type="dna")

# Access the Sequence objects
seg.result_sequences[0].sequence      # "ATGCGATCG"
seg.result_sequences[0].sequence_type  # "dna"
seg.result_sequences[0].metadata       # {...}

Sequence Types

Type	Valid characters	Typical uses
DNA	`A` `C` `G` `T`	Promoters, coding sequences, regulatory elements, synthetic gene circuits
RNA	`A` `C` `G` `U`	mRNA, guide RNAs, ribozymes, aptamers, tRNA scaffolds
Protein	The 20 standard amino acids (`A C D E F G H I K L M N P Q R S T V W Y`)	Enzymes, antibodies, structural proteins, peptide therapeutics
Ligand	SMILES syntax (validated by RDKit)	Small molecules, drug proposals, metabolites

Character validation is enforced on creation and mutation. Invalid characters produce a warning but do not terminate the program, allowing flexible handling of edge cases like ambiguity codes.
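The warn-but-continue behavior can be illustrated with a small standalone sketch. It does not use Proto itself: `ALPHABETS` and `find_invalid_chars` are hypothetical names invented for this example, and ligand validation is left out because it needs a SMILES parser.

```python
import warnings

# Hypothetical alphabets mirroring the table above; not Proto's own constants.
ALPHABETS = {
    "dna": set("ACGT"),
    "rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def find_invalid_chars(sequence: str, sequence_type: str) -> set:
    """Return characters outside the alphabet, warning instead of raising."""
    invalid = set(sequence) - ALPHABETS[sequence_type]
    if invalid:
        warnings.warn(
            f"Invalid {sequence_type} characters: {sorted(invalid)}",
            stacklevel=2,
        )
    return invalid


# "N" is an IUPAC ambiguity code: a warning is recorded, execution continues.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unexpected = find_invalid_chars("ATGNCG", "dna")

print(unexpected)   # {'N'}
print(len(caught))  # 1
```

Returning the offending characters rather than raising lets the caller decide whether an ambiguity code is acceptable in context.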

Custom Valid Characters

The character set can be restricted for specialized applications:

python

# Only allow purines
purine_only = Sequence(
    sequence="AAGGAAGG",
    sequence_type="dna",
    valid_chars={"A", "G"}
)

# Reduced amino acid alphabet (e.g., for directed evolution libraries)
reduced = Sequence(
    sequence="AGSTNDE",
    sequence_type="protein",
    valid_chars={"A", "G", "S", "T", "N", "D", "E"}
)

The Metadata System

Every Sequence carries a metadata dictionary that tracks system properties, constraint scores, and custom user data. This is how optimization results are communicated.

Metadata Structure

python

seq.metadata
# {
#     "sequence": "ATGCGATCG...",           # System: current sequence string
#     "sequence_length": 100,                # System: length of the sequence
#     "constraints": {                       # System: constraint results
#         "GC Content": {
#             "score": 0.02,                 # Raw constraint score (0.0 = perfect)
#             "weight": 1.0,                 # Constraint weight
#             "weighted_score": 0.02,        # score * weight
#             "data": {                      # Constraint-specific data
#                 "gc_content": 52.0
#             }
#         },
#         "Structure pLDDT": {
#             "score": 0.15,
#             "weight": 2.0,
#             "weighted_score": 0.30,
#             "data": {
#                 "plddt": 85.2
#             }
#         }
#     },
#     "generators": {},                      # System: per-generator metadata (always present)
#     "source": "E. coli K-12",              # User: custom metadata
# }

System-managed keys (protected)

These keys are automatically maintained by the framework. If they are set manually, they are overridden by the computed values, and the framework logs a warning at construction.

Key	Type	Description
`sequence`	`str`	The current sequence string (kept in sync with the `.sequence` property)
`sequence_length`	`int`	Length of the current sequence
`constraints`	`dict`	Nested dictionary of constraint results, keyed by constraint label
`generators`	`dict`	Nested dictionary of per-generator metadata, keyed by generator label
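The override behavior for protected keys might look roughly like the following sketch. The key names come from the table above; `build_metadata` and the logger name are hypothetical, and resetting `constraints` and `generators` to empty dictionaries is an assumption made only for illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger("metadata_sketch")

# Key names taken from the table above; the function itself is hypothetical.
PROTECTED_KEYS = ("sequence", "sequence_length", "constraints", "generators")


def build_metadata(sequence: str, user_metadata: Optional[dict] = None) -> dict:
    """Merge user metadata with computed values; computed values always win."""
    user_metadata = dict(user_metadata or {})
    clashes = [key for key in PROTECTED_KEYS if key in user_metadata]
    if clashes:
        logger.warning("Overriding protected metadata keys: %s", clashes)
    # Keep only the user's own keys, then add the computed system keys.
    merged = {k: v for k, v in user_metadata.items() if k not in PROTECTED_KEYS}
    merged.update(
        sequence=sequence,
        sequence_length=len(sequence),
        constraints={},  # assumption: starts empty before any optimization
        generators={},
    )
    return merged


meta = build_metadata("ATGCGA", {"sequence_length": 999, "source": "E. coli K-12"})
print(meta["sequence_length"], meta["source"])  # 6 E. coli K-12
```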

Constraint metadata

After optimization, each constraint writes its results into metadata["constraints"][label]:

Field	Type	Description
`score`	`float`	Raw constraint score: 0.0 (perfect) to 1.0 (worst violation)
`weight`	`float`	Weight assigned to this constraint
`weighted_score`	`float`	`score * weight`; used for energy calculation
`data`	`Any`	Constraint-specific data (e.g., actual GC content, pLDDT value, structure)

python

# Access constraint results after optimization
for seq in segment.result_sequences:
    gc_score = seq.metadata["constraints"]["GC Content"]["score"]
    gc_actual = seq.metadata["constraints"]["GC Content"]["data"]["gc_content"]
    print(f"GC score: {gc_score:.3f}, actual GC: {gc_actual:.1f}%")
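To compare candidates, the per-constraint entries can be reduced to a single number. The sketch below works on a plain dictionary shaped like `metadata["constraints"]`. It assumes the aggregate is a simple sum of `score * weight`; the text only says the weighted score is used for energy calculation, so the real formula may differ. Both helper names are hypothetical.

```python
# Plain dict shaped like the "constraints" entry shown under Metadata Structure.
constraints = {
    "GC Content": {"score": 0.02, "weight": 1.0, "weighted_score": 0.02},
    "Structure pLDDT": {"score": 0.15, "weight": 2.0, "weighted_score": 0.30},
}


def total_weighted_score(constraints: dict) -> float:
    """Sum score * weight over all constraints (assumed aggregation)."""
    return sum(entry["score"] * entry["weight"] for entry in constraints.values())


def worst_constraint(constraints: dict) -> str:
    """Label of the constraint contributing the largest weighted score."""
    return max(
        constraints,
        key=lambda label: constraints[label]["score"] * constraints[label]["weight"],
    )


print(f"{total_weighted_score(constraints):.2f}")  # 0.32
print(worst_constraint(constraints))               # Structure pLDDT
```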

User-defined metadata

Any custom metadata can be attached when creating a Sequence. This metadata persists through optimization.

python

seq = Sequence(
    sequence="ATGCGATCG",
    sequence_type="dna",
    metadata={
        "source": "E. coli K-12",
        "experiment_id": "EXP-2024-001",
        "notes": "Wild-type promoter region"
    }
)

seq.metadata["source"]  # "E. coli K-12"

Do not use the reserved keys sequence, sequence_length, constraints, generators, logits, or structure for custom metadata. They are system-managed (the first four are recomputed and override provided values; logits and structure are first-class Sequence fields), and the framework logs a warning at construction if any of the first four are set.

Working with Sequences

String-Like Operations

Sequences support common string operations:

python

seq = Sequence(sequence="ATGCGATCGATCG", sequence_type="dna")

# Length
len(seq)           # 13

# String representation
str(seq)           # "ATGCGATCGATCG"

# Indexing and slicing
seq[0]             # "A"
seq[3:6]           # "CGA"
seq[-3:]           # "TCG"
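String-like behavior of this kind is usually provided through Python's special methods. The following is a minimal sketch of that pattern, not Proto's implementation: `MiniSequence` is a hypothetical stand-in, and it returns plain strings from indexing and slicing, matching the values shown above.

```python
class MiniSequence:
    """Hypothetical stand-in for Sequence; only the string-like protocol."""

    def __init__(self, sequence: str) -> None:
        self.sequence = sequence

    def __len__(self) -> int:
        return len(self.sequence)

    def __str__(self) -> str:
        return self.sequence

    def __getitem__(self, key):
        # Delegating to str handles ints, negative indices, and slices alike.
        return self.sequence[key]


mini = MiniSequence("ATGCGATCGATCG")
print(len(mini), mini[0], mini[3:6], mini[-3:])  # 13 A CGA TCG
```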

Mutating Sequences

The .sequence property is settable. When it is updated, the metadata is automatically kept in sync:

python

seq = Sequence(sequence="ATGCGA", sequence_type="dna")

seq.sequence = "TTGCGA"  # Validated and metadata updated
seq.metadata["sequence"]         # "TTGCGA"
seq.metadata["sequence_length"]  # 6

In practice, Sequences are rarely mutated directly. Generators handle mutation during optimization.
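The sync between the string and its metadata is the kind of thing a property setter handles well. The sketch below shows the general pattern under that assumption. `SyncedSequence` is a hypothetical class, and validation is omitted to keep the example short.

```python
class SyncedSequence:
    """Hypothetical sketch: a settable sequence that keeps metadata in sync."""

    def __init__(self, sequence: str) -> None:
        self.metadata: dict = {}
        self.sequence = sequence  # routed through the setter below

    @property
    def sequence(self) -> str:
        return self._sequence

    @sequence.setter
    def sequence(self, value: str) -> None:
        self._sequence = value
        # Refresh the derived keys on every assignment.
        self.metadata["sequence"] = value
        self.metadata["sequence_length"] = len(value)


synced = SyncedSequence("ATGCGA")
synced.sequence = "TTGCGAAA"
print(synced.metadata)  # {'sequence': 'TTGCGAAA', 'sequence_length': 8}
```

Routing the constructor through the same setter means the derived keys cannot drift out of date, whichever path changes the string.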

Serialization

Sequences serialize to dictionaries for storage or transfer:

python

data = seq.to_dict()
# {
#     "sequence": "ATGCGATCG",
#     "sequence_type": "dna",
#     "valid_chars": ["A", "C", "G", "T"],
#     "metadata": { ... },      # user-provided metadata only
#     "constraints": { ... },   # per-constraint results (top-level sibling)
#     "generators": { ... },    # per-generator metadata (top-level sibling)
# }
# Pass include_logits=True or include_structure=True to add those keys.
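A dictionary with this layout can be written to disk or sent over the wire with the standard `json` module. The sketch below uses a hand-written dictionary that mirrors the layout above instead of calling `to_dict()`. Whether real output is always JSON-safe is an assumption, since constraint-specific `data` entries could hold arbitrary objects.

```python
import json

# Hand-written dict mirroring the to_dict() layout; values are illustrative.
data = {
    "sequence": "ATGCGATCG",
    "sequence_type": "dna",
    "valid_chars": ["A", "C", "G", "T"],
    "metadata": {"source": "E. coli K-12"},
    "constraints": {},
    "generators": {},
}

encoded = json.dumps(data, indent=2, sort_keys=True)  # str, ready to store or send
restored = json.loads(encoded)

print(restored == data)  # True
```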

Automatic Type Detection

The framework can infer sequence type from the characters present. This is useful when working with sequences of unknown origin:

python

from proto_language.core.sequence import detect_sequence_type

detect_sequence_type("ATGCGATCG")        # "dna"
detect_sequence_type("AUGCGAUCG")        # "rna"
detect_sequence_type("MKFLILLFNILC")     # "protein"
detect_sequence_type("CCO")              # "ligand" (ethanol SMILES)

Detection priority is DNA > RNA > Protein > Ligand. When a string is valid under more than one type, the highest-priority match wins: "ACGT" is valid as both DNA and protein, so it is detected as DNA.
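A priority-ordered check of this kind can be sketched in a few lines. This is not the source of `detect_sequence_type`: `guess_sequence_type` is a hypothetical function that assumes uppercase input and treats "ligand" as a fallback, without the SMILES validation a real implementation would need.

```python
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")

# Checked in priority order; the first alphabet covering every character wins.
PRIORITY = (("dna", DNA), ("rna", RNA), ("protein", PROTEIN))


def guess_sequence_type(sequence: str) -> str:
    """Return the highest-priority type whose alphabet covers the string."""
    if not sequence:
        raise ValueError("cannot infer a type from an empty string")
    chars = set(sequence)
    for name, alphabet in PRIORITY:
        if chars <= alphabet:
            return name
    return "ligand"  # fallback only; no SMILES validation in this sketch


print(guess_sequence_type("ACGT"))          # dna
print(guess_sequence_type("AUGCGAUCG"))     # rna
print(guess_sequence_type("MKFLILLFNILC"))  # protein
print(guess_sequence_type("CCO"))           # ligand
```

A check based on characters alone will misclassify some inputs. "CC" is a valid SMILES string for ethane, for example, but it is also a valid DNA string and would be labeled DNA here, so passing the type explicitly is safer whenever it is known.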

Next Steps

Segments

How Segments manage pools of Sequences during optimization

Constructs

Combining Segments into complete biological designs

Constraints

How constraints write scores into Sequence metadata

Overview

See the full architecture and design patterns