Sequences
A Sequence is the most fundamental object in Proto. It wraps a biological string (DNA, RNA, protein, or ligand) with type validation, character enforcement, and a metadata system that tracks constraint scores and optimization history. Sequences are rarely created directly; they are created and managed by Segments during optimization. Understanding the Sequence data model is still essential for interpreting results and working with the metadata system.What is a Sequence?
A Sequence bundles three things together:The string is the raw biological sequence: nucleotides, amino acids, or a SMILES representation of a small molecule.
The type determines which characters are valid and how the sequence is validated. Metadata tracks constraint scores, system properties, and any custom data attached.
Creating Sequences
- Direct Creation
- Through Segments (typical)
python
Sequence Types
| Type | Valid characters | Typical uses |
|---|---|---|
| DNA | A C G T | Promoters, coding sequences, regulatory elements, synthetic gene circuits |
| RNA | A C G U | mRNA, guide RNAs, ribozymes, aptamers, tRNA scaffolds |
| Protein | The 20 standard amino acids (A C D E F G H I K L M N P Q R S T V W Y) | Enzymes, antibodies, structural proteins, peptide therapeutics |
| Ligand | SMILES syntax (validated by RDKit) | Small molecules, drug proposals, metabolites |
Character validation is enforced on creation and mutation. Invalid characters produce a warning but do not terminate the program, allowing flexible handling of edge cases like ambiguity codes.
Custom Valid Characters
The character set can be restricted for specialized applications:python
The Metadata System
Every Sequence carries a metadata dictionary that tracks system properties, constraint scores, and custom user data. This is how optimization results are communicated.Metadata Structure
python
System-managed keys (protected)
System-managed keys (protected)
These keys are automatically maintained by the framework. If they are set manually, they are overridden by the computed values, and the framework logs a warning at construction.
| Key | Type | Description |
|---|---|---|
sequence | str | The current sequence string (kept in sync with the .sequence property) |
sequence_length | int | Length of the current sequence |
constraints | dict | Nested dictionary of constraint results, keyed by constraint label |
generators | dict | Nested dictionary of per-generator metadata, keyed by generator label |
Constraint metadata
Constraint metadata
After optimization, each constraint writes its results into
metadata["constraints"][label]:| Field | Type | Description |
|---|---|---|
score | float | Raw constraint score: 0.0 (perfect) to 1.0 (worst violation) |
weight | float | Weight assigned to this constraint |
weighted_score | float | score * weight; used for energy calculation |
data | Any | Constraint-specific data (e.g., actual GC content, pLDDT value, structure) |
python
User-defined metadata
User-defined metadata
Any custom metadata can be attached when creating a Sequence. This metadata persists through optimization.
python
Working with Sequences
String-Like Operations
Sequences support common string operations:python
Mutating Sequences
The.sequence property is settable. When it is updated, the metadata is automatically kept in sync:
python
In practice, Sequences are rarely mutated directly. Generators handle mutation during optimization.
Serialization
Sequences serialize to dictionaries for storage or transfer:Automatic Type Detection
The framework can infer sequence type from the characters present. This is useful when working with sequences of unknown origin:python
Detection priority is DNA > RNA > Protein > Ligand. Ambiguous sequences (e.g., “ACGT” could be DNA or protein) default to the more specific type.
Next Steps
Segments
How Segments manage pools of Sequences during optimization
Constructs
Combining Segments into complete biological designs
Constraints
How constraints write scores into Sequence metadata
Overview
See the full architecture and design patterns