Sequences

A Sequence is the most fundamental object in Proto. It wraps a biological string (DNA, RNA, protein, or ligand) with type validation, character enforcement, and a metadata system that tracks constraint scores and optimization history. Sequences are rarely created directly; they are created and managed by Segments during optimization. Understanding the Sequence data model is still essential for interpreting results and working with the metadata system.

What is a Sequence?

A Sequence bundles three things together:
[Diagram: a Sequence made up of String (ATGCGATCG…), Type (dna, rna, protein, or ligand), and Metadata (scores, length, constraints)]
The string is the raw biological sequence: nucleotides, amino acids, or a SMILES representation of a small molecule.
The type determines which characters are valid and how the sequence is validated.
Metadata tracks constraint scores, system properties, and any custom data attached.

Creating Sequences

python
from proto_language.core import Sequence

# DNA sequence
dna = Sequence(
    sequence="ATGCGATCGATCGATCG",
    sequence_type="dna"
)

# Protein sequence
protein = Sequence(
    sequence="MKFLILLFNILCLFPVLAAD",
    sequence_type="protein"
)

# With custom metadata
annotated = Sequence(
    sequence="AGGAGGTTTTTATG",
    sequence_type="dna",
    metadata={"source": "E. coli K-12", "region": "rbs"}
)

Sequence Types

| Type | Valid characters | Typical uses |
| --- | --- | --- |
| DNA | A C G T | Promoters, coding sequences, regulatory elements, synthetic gene circuits |
| RNA | A C G U | mRNA, guide RNAs, ribozymes, aptamers, tRNA scaffolds |
| Protein | The 20 standard amino acids (A C D E F G H I K L M N P Q R S T V W Y) | Enzymes, antibodies, structural proteins, peptide therapeutics |
| Ligand | SMILES syntax (validated by RDKit) | Small molecules, drug proposals, metabolites |

Characters are validated on creation and on every mutation. Invalid characters produce a warning but do not terminate the program, which allows flexible handling of edge cases such as ambiguity codes.

Custom Valid Characters

The character set can be restricted for specialized applications:
python
# Only allow purines
purine_only = Sequence(
    sequence="AAGGAAGG",
    sequence_type="dna",
    valid_chars={"A", "G"}
)

# Reduced amino acid alphabet (e.g., for directed evolution libraries)
reduced = Sequence(
    sequence="AGSTNDE",
    sequence_type="protein",
    valid_chars={"A", "G", "S", "T", "N", "D", "E"}
)
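
How a restricted alphabet combines with the warn-but-continue validation described above can be pictured with a small standalone sketch. It does not use Proto's internals: `check_characters` and `DEFAULT_ALPHABETS` are hypothetical names introduced for illustration, and the real framework may word or route its warnings differently.

```python
import warnings

# Hypothetical alphabets mirroring the Sequence Types table (not Proto's constants).
DEFAULT_ALPHABETS = {
    "dna": set("ACGT"),
    "rna": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def check_characters(sequence, sequence_type, valid_chars=None):
    """Return the sorted invalid characters, warning instead of raising."""
    if valid_chars is not None:
        allowed = set(valid_chars)  # caller-supplied restricted alphabet
    else:
        allowed = DEFAULT_ALPHABETS[sequence_type]
    invalid = sorted(set(sequence) - allowed)
    if invalid:
        warnings.warn(
            f"{sequence_type} sequence contains invalid characters: {invalid}"
        )
    return invalid


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "N" is an ambiguity code: flagged, but execution continues.
    ambiguous = check_characters("ATGNCG", "dna")
    # With a purine-only alphabet, C and T become invalid as well.
    off_alphabet = check_characters("AAGGCT", "dna", valid_chars={"A", "G"})

print(ambiguous, off_alphabet, len(caught))
```

Returning the offending characters alongside the warning is a design choice of this sketch; it lets calling code decide whether a flagged sequence should be kept or discarded.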

The Metadata System

Every Sequence carries a metadata dictionary that tracks system properties, constraint scores, and custom user data. This is how optimization results are communicated.

Metadata Structure

python
seq.metadata
# {
#     "sequence": "ATGCGATCG...",           # System: current sequence string
#     "sequence_length": 100,                # System: length of the sequence
#     "constraints": {                       # System: constraint results
#         "GC Content": {
#             "score": 0.02,                 # Raw constraint score (0.0 = perfect)
#             "weight": 1.0,                 # Constraint weight
#             "weighted_score": 0.02,        # score * weight
#             "data": {                      # Constraint-specific data
#                 "gc_content": 52.0
#             }
#         },
#         "Structure pLDDT": {
#             "score": 0.15,
#             "weight": 2.0,
#             "weighted_score": 0.30,
#             "data": {
#                 "plddt": 85.2
#             }
#         }
#     },
#     "generators": {},                      # System: per-generator metadata (always present)
#     "source": "E. coli K-12",              # User: custom metadata
# }
The following system keys are maintained automatically by the framework. If any of them is set manually, the supplied value is overridden by the computed one, and the framework logs a warning at construction.

| Key | Type | Description |
| --- | --- | --- |
| sequence | str | The current sequence string (kept in sync with the .sequence property) |
| sequence_length | int | Length of the current sequence |
| constraints | dict | Nested dictionary of constraint results, keyed by constraint label |
| generators | dict | Nested dictionary of per-generator metadata, keyed by generator label |

After optimization, each constraint writes its results into metadata["constraints"][label]:

| Field | Type | Description |
| --- | --- | --- |
| score | float | Raw constraint score: 0.0 (perfect) to 1.0 (worst violation) |
| weight | float | Weight assigned to this constraint |
| weighted_score | float | score * weight; used for energy calculation |
| data | Any | Constraint-specific data (e.g., actual GC content, pLDDT value, structure) |

python
# Access constraint results after optimization
for seq in segment.result_sequences:
    gc_score = seq.metadata["constraints"]["GC Content"]["score"]
    gc_actual = seq.metadata["constraints"]["GC Content"]["data"]["gc_content"]
    print(f"GC score: {gc_score:.3f}, actual GC: {gc_actual:.1f}%")
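
Because each constraint entry stores its score, weight, and weighted_score, summary figures can be recomputed from the metadata alone. The sketch below works on a plain dictionary shaped like the metadata example above. Treating the total as a simple sum is an assumption made for illustration (the documentation only states that weighted_score is used for energy calculation), and `total_weighted_score` is a hypothetical helper, not a Proto function.

```python
import math


def total_weighted_score(metadata):
    """Sum weighted_score over every constraint entry in a metadata dict."""
    constraints = metadata.get("constraints", {})
    return sum(entry["weighted_score"] for entry in constraints.values())


# Plain dict copied from the metadata example (no Proto objects needed).
example_metadata = {
    "constraints": {
        "GC Content": {"score": 0.02, "weight": 1.0, "weighted_score": 0.02},
        "Structure pLDDT": {"score": 0.15, "weight": 2.0, "weighted_score": 0.30},
    }
}

total = total_weighted_score(example_metadata)

# Each weighted_score should equal score * weight, up to floating-point error.
consistent = all(
    math.isclose(entry["weighted_score"], entry["score"] * entry["weight"])
    for entry in example_metadata["constraints"].values()
)
print(f"total weighted score: {total:.2f}, consistent: {consistent}")
```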
Any custom metadata can be attached when creating a Sequence. This metadata persists through optimization.
python
seq = Sequence(
    sequence="ATGCGATCG",
    sequence_type="dna",
    metadata={
        "source": "E. coli K-12",
        "experiment_id": "EXP-2024-001",
        "notes": "Wild-type promoter region"
    }
)

seq.metadata["source"]  # "E. coli K-12"
Do not use the reserved keys sequence, sequence_length, constraints, generators, logits, or structure for custom metadata. They are system-managed (the first four are recomputed and override provided values; logits and structure are first-class Sequence fields), and the framework logs a warning at construction if any of the first four are set.
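
One way to avoid clashes is to check a custom metadata dictionary against the reserved names before constructing a Sequence. The sketch below is a standalone illustration: `RESERVED_KEYS` and `find_reserved_keys` are hypothetical names, not part of Proto, and the key list is copied from the note above.

```python
# Reserved key names taken from the note above.
RESERVED_KEYS = {
    "sequence", "sequence_length", "constraints",
    "generators", "logits", "structure",
}


def find_reserved_keys(custom_metadata):
    """Return the reserved keys present in a user metadata dict, sorted."""
    return sorted(RESERVED_KEYS & set(custom_metadata))


user_metadata = {"source": "E. coli K-12", "constraints": {"note": "manual"}}
clashes = find_reserved_keys(user_metadata)
if clashes:
    print(f"Rename these keys before creating the Sequence: {clashes}")
```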

Working with Sequences

String-Like Operations

Sequences support common string operations:
python
seq = Sequence(sequence="ATGCGATCGATCG", sequence_type="dna")

# Length
len(seq)           # 13

# String representation
str(seq)           # "ATGCGATCGATCG"

# Indexing and slicing
seq[0]             # "A"
seq[3:6]           # "CGA"
seq[-3:]           # "TCG"

Mutating Sequences

The .sequence property is settable. When it is updated, the metadata is automatically kept in sync:
python
seq = Sequence(sequence="ATGCGA", sequence_type="dna")

seq.sequence = "TTGCGA"  # Validated and metadata updated
seq.metadata["sequence"]         # "TTGCGA"
seq.metadata["sequence_length"]  # 6
In practice, Sequences are rarely mutated directly. Generators handle mutation during optimization.
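
The synchronization described above matches a common Python pattern: a property setter that refreshes derived fields on every assignment. The toy class below illustrates that pattern only. `MiniSequence` is a hypothetical stand-in, it omits character validation, and it should not be read as how Proto implements Sequence.

```python
class MiniSequence:
    """Toy stand-in for illustration only; not Proto's Sequence class."""

    def __init__(self, sequence, metadata=None):
        self.metadata = dict(metadata or {})
        self.sequence = sequence  # goes through the setter below

    @property
    def sequence(self):
        return self._sequence

    @sequence.setter
    def sequence(self, value):
        self._sequence = value
        # Refresh the derived keys on every assignment so they cannot go stale.
        self.metadata["sequence"] = value
        self.metadata["sequence_length"] = len(value)

    def __len__(self):
        return len(self._sequence)

    def __getitem__(self, index):
        return self._sequence[index]

    def __str__(self):
        return self._sequence


mini = MiniSequence("ATGCGA", metadata={"source": "E. coli K-12"})
mini.sequence = "TTGCGATT"
print(mini.metadata["sequence"], mini.metadata["sequence_length"], mini[-3:])
```

Routing the constructor through the same setter means there is a single place where the derived keys are written, which is what keeps them consistent with the string.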

Serialization

Sequences serialize to dictionaries for storage or transfer:
python
data = seq.to_dict()
# {
#     "sequence": "ATGCGATCG",
#     "sequence_type": "dna",
#     "valid_chars": ["A", "C", "G", "T"],
#     "metadata": { ... },      # user-provided metadata only
#     "constraints": { ... },   # per-constraint results (top-level sibling)
#     "generators": { ... },    # per-generator metadata (top-level sibling)
# }
# Pass include_logits=True or include_structure=True to add those keys.
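
A dictionary of this shape can be written to JSON with the standard library. The sketch below uses a hand-written dictionary that mirrors the output shown above rather than a real to_dict() result, and it assumes every value is JSON-compatible. Constraint data is typed as Any, so entries such as structures or logits may need their own conversion first.

```python
import json

# Dictionary shaped like the to_dict() output shown above, written by hand.
data = {
    "sequence": "ATGCGATCG",
    "sequence_type": "dna",
    "valid_chars": ["A", "C", "G", "T"],
    "metadata": {"source": "E. coli K-12"},
    "constraints": {
        "GC Content": {
            "score": 0.02,
            "weight": 1.0,
            "weighted_score": 0.02,
            "data": {"gc_content": 52.0},
        },
    },
    "generators": {},
}

text = json.dumps(data, indent=2, sort_keys=True)  # serialize to a JSON string
restored = json.loads(text)                        # parse it back into a dict

print(restored == data)
```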

Automatic Type Detection

The framework can infer sequence type from the characters present. This is useful when working with sequences of unknown origin:
python
from proto_language.core.sequence import detect_sequence_type

detect_sequence_type("ATGCGATCG")        # "dna"
detect_sequence_type("AUGCGAUCG")        # "rna"
detect_sequence_type("MKFLILLFNILC")     # "protein"
detect_sequence_type("CCO")              # "ligand" (ethanol SMILES)
Detection priority is DNA > RNA > Protein > Ligand. Ambiguous sequences (e.g., “ACGT” could be DNA or protein) resolve to the more specific type, which is the one that comes first in this order (DNA in this example).
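
The priority rule can be sketched as a first-match search over alphabets. This is an illustration under stated assumptions, not the library's `detect_sequence_type`: `detect_type_sketch` is a hypothetical name, matching is case-sensitive, and anything that fits no alphabet is labeled a ligand without any SMILES parsing.

```python
# Alphabets from the Sequence Types table, listed in priority order.
PRIORITY = [
    ("dna", set("ACGT")),
    ("rna", set("ACGU")),
    ("protein", set("ACDEFGHIKLMNPQRSTVWY")),
]


def detect_type_sketch(sequence):
    """Return the first type whose alphabet covers every character."""
    chars = set(sequence)
    for name, alphabet in PRIORITY:
        if chars and chars <= alphabet:
            return name
    # Assumption: everything else is treated as a ligand (no SMILES check here).
    return "ligand"


for candidate in ["ATGCGATCG", "AUGCGAUCG", "MKFLILLFNILC", "CCO", "ACGT"]:
    print(candidate, "->", detect_type_sketch(candidate))
```

Under this first-match rule a short SMILES string made only of nucleotide letters, such as "CC", would be reported as DNA, so passing sequence_type explicitly is the safer choice when the type is known.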

Next Steps

Segments: How Segments manage pools of Sequences during optimization
Constructs: Combining Segments into complete biological designs
Constraints: How constraints write scores into Sequence metadata
Overview: See the full architecture and design patterns