Intron Design

This program designs a synthetic intron that splices efficiently when inserted into a gene. The variable intron core sits between fixed GT (donor) and AG (acceptor) dinucleotides and is embedded in a 1 kb target window with 4 kb of plasmid context on each side, which is the input format the SpliceTransformer model expects. The constraints reward strong donor and acceptor usage and a tissue-specific splicing preference; an MCMCOptimizer searches the intron core. The full script optimizes a 301 bp intron in an mScarlet reporter across several plasmid contexts for thousands of steps. This walkthrough uses a short 60 bp intron, one context, and three steps so it runs quickly. SpliceTransformer requires a GPU.

Runtime: this walkthrough runs real models on a GPU and takes several minutes to complete. The first run is slower because it builds the tool environment and downloads model weights.

Build the splicing window

SpliceTransformer scores each position of a 1 kb target sequence given 4 kb of flanking context on each side (a 9 kb window in total). This cell assembles that input. A fresh intron core is generated with random.choices and wrapped in the fixed GT donor and AG acceptor dinucleotides, then centered in the TARGET_LENGTH (1000 bp) target with plasmid sequence filling the remaining positions. donor_pos and acceptor_pos are the zero-indexed positions the constraints will score: SpliceTransformer scores the donor at the base just before GT and the acceptor at the base just after AG. random.seed(0) makes the generated core reproducible.

python

import random
from pathlib import Path

random.seed(0)
plasmid = (Path.cwd().parent / "data" / "intron_plasmid_context.txt").read_text().strip().replace("\n", "")

TARGET_LENGTH = 1000
CONTEXT_LENGTH = 4000
INTRON_LENGTH = 60  # GT + 56 bp core + AG

# A fresh intron core flanked by the fixed splice-site dinucleotides.
intron_seq = "GT" + "".join(random.choices("ACGT", k=INTRON_LENGTH - 4)) + "AG"

# Center the intron in the 1 kb target, filling the rest with plasmid sequence.
donor_start = (TARGET_LENGTH - INTRON_LENGTH) // 2
left_fill = plasmid[:donor_start]
right_fill = plasmid[donor_start : TARGET_LENGTH - INTRON_LENGTH]
target = left_fill + intron_seq + right_fill
assert len(target) == TARGET_LENGTH

# 4 kb of context on each side (the plasmid is reused as flanking sequence here).
# Tile the plasmid enough times that a short plasmid still yields a full 4 kb.
tiled = plasmid * (-(-CONTEXT_LENGTH // len(plasmid)))
left_context = plasmid[:CONTEXT_LENGTH] if len(plasmid) >= CONTEXT_LENGTH else tiled[:CONTEXT_LENGTH]
right_context = plasmid[-CONTEXT_LENGTH:] if len(plasmid) >= CONTEXT_LENGTH else tiled[:CONTEXT_LENGTH]

# SpliceTransformer scores the donor just before GT and the acceptor just after AG.
donor_pos = [donor_start - 1]
acceptor_pos = [donor_start + INTRON_LENGTH]
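
As a sanity check on the index arithmetic above, the following sketch rebuilds the target from a stand-in plasmid and confirms where the scored positions fall. `toy_plasmid` and `build_target` are hypothetical names used only for illustration; the real run reads the plasmid from the data file. Joining the contexts and target into one string is an assumption about how the 9 kb window is laid out, not a statement about SpliceTransformer's internals.

```python
import random

TARGET_LENGTH = 1000
CONTEXT_LENGTH = 4000
INTRON_LENGTH = 60  # GT + 56 bp core + AG


def build_target(plasmid: str, rng: random.Random) -> tuple[str, int]:
    """Return the 1 kb target and the zero-indexed start of the GT donor."""
    core = "".join(rng.choices("ACGT", k=INTRON_LENGTH - 4))
    intron_seq = "GT" + core + "AG"
    donor_start = (TARGET_LENGTH - INTRON_LENGTH) // 2
    left_fill = plasmid[:donor_start]
    right_fill = plasmid[donor_start : TARGET_LENGTH - INTRON_LENGTH]
    return left_fill + intron_seq + right_fill, donor_start


rng = random.Random(0)
# Assumption: a random 5 kb stand-in for the real plasmid context file.
toy_plasmid = "".join(rng.choices("ACGT", k=5000))
target, donor_start = build_target(toy_plasmid, rng)

donor_pos = donor_start - 1                 # last base before GT
acceptor_pos = donor_start + INTRON_LENGTH  # first base after AG

assert len(target) == TARGET_LENGTH
assert target[donor_pos + 1 : donor_pos + 3] == "GT"
assert target[acceptor_pos - 2 : acceptor_pos] == "AG"

# Flanking context plus target gives the full 9 kb input window.
window = toy_plasmid[:CONTEXT_LENGTH] + target + toy_plasmid[-CONTEXT_LENGTH:]
assert len(window) == 2 * CONTEXT_LENGTH + TARGET_LENGTH
print(donor_pos, acceptor_pos, len(window))  # 469 530 9000
```

With a 60 bp intron centered in a 1000 bp target, the donor is scored at index 469 and the acceptor at index 530, regardless of the plasmid content.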

Segments

A Segment is a stretch of sequence; a Construct groups the segments that make up one molecule. The target is split into three segments, each carrying a starting sequence sliced from target and sequence_type="dna": a fixed left_flank ending in the GT donor, the variable intron core, and a fixed right_flank beginning with the AG acceptor. Only the intron segment is assigned a generator below, so the flanks (and the splice-site dinucleotides) stay fixed while the core is designed. The label on each segment is the key its results are filed under.

python

from proto_language.core import Segment, Construct

left_flank = Segment(sequence=target[: donor_start + 2], sequence_type="dna", label="left_flank")
intron = Segment(sequence=target[donor_start + 2 : donor_start + INTRON_LENGTH - 2],
                 sequence_type="dna", label="intron")
right_flank = Segment(sequence=target[donor_start + INTRON_LENGTH - 2 :], sequence_type="dna", label="right_flank")

construct = Construct([left_flank, intron, right_flank])
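
The three slices can be checked on plain strings without the library. The sketch below uses a hypothetical helper, `split_target`, and a toy 100 bp target to show that the slices partition the target exactly, with GT kept in the left flank and AG kept in the right flank.

```python
def split_target(target: str, donor_start: int, intron_length: int) -> tuple[str, str, str]:
    """Split a target into (left flank incl. GT, intron core, right flank incl. AG)."""
    core_start = donor_start + 2
    core_end = donor_start + intron_length - 2
    return target[:core_start], target[core_start:core_end], target[core_end:]


# Toy 100 bp target with a 20 bp intron starting at index 40 (illustrative values).
toy_target = "A" * 40 + "GT" + "C" * 16 + "AG" + "T" * 40
left, core, right = split_target(toy_target, donor_start=40, intron_length=20)

assert left.endswith("GT")      # donor stays in the fixed left flank
assert right.startswith("AG")   # acceptor stays in the fixed right flank
assert len(core) == 16          # only the core is free to change
assert left + core + right == toy_target
```

Because the dinucleotides live in the flanks, any generator bound to the core alone cannot disturb the splice sites.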

Generator and splicing constraints

The generator proposes new sequences for the optimizer to score. RandomNucleotideGenerator substitutes random bases at masked positions; MaskingStrategy(num_mutations=2) sets the number of positions mutated per call to exactly two. generator.assign(intron) binds the generator to the core segment, so only the core is mutated.

Constraints score how well a sequence meets the design objective, and the optimizer searches for sequences that raise those scores. Both constraints here read all three segments through inputs=[left_flank, intron, right_flank], which they concatenate into the 1 kb target. splice_transformer_intron_boundary scores donor and acceptor prediction at the donor_pos and acceptor_pos positions. splice_transformer_specificity scores tissue-specific splice site usage at those same positions (splice_pos is the list concatenation donor_pos + acceptor_pos), here with tissue="BRAIN" and direction="max" to maximize brain splicing. Each constraint carries the 4 kb left_context and right_context that SpliceTransformer requires, and a label that keys its scores in the result metadata.

python

from proto_tools.transforms.masking import MaskingStrategy
from proto_language.core import Constraint
from proto_language.constraint import splice_transformer_intron_boundary, splice_transformer_specificity
from proto_language.generator import RandomNucleotideGenerator, RandomNucleotideGeneratorConfig

generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=2))
)
generator.assign(intron)

boundary = Constraint(
    inputs=[left_flank, intron, right_flank],
    function=splice_transformer_intron_boundary,
    function_config={"left_context": left_context, "right_context": right_context,
                     "donor_pos": donor_pos, "acceptor_pos": acceptor_pos},
    label="splice_boundary",
)
specificity = Constraint(
    inputs=[left_flank, intron, right_flank],
    function=splice_transformer_specificity,
    function_config={"left_context": left_context, "right_context": right_context,
                     "splice_pos": donor_pos + acceptor_pos, "tissue": "BRAIN", "direction": "max"},
    label="splice_brain_max",
)
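
To make the proposal step concrete, here is a minimal plain-Python sketch of a two-substitution move. It is a stand-in, not the implementation of RandomNucleotideGenerator: `mutate_exactly` is a hypothetical function, and drawing the replacement from the three other bases (so every chosen position really changes) is an assumption the library may not share.

```python
import random


def mutate_exactly(seq: str, num_mutations: int, rng: random.Random) -> str:
    """Substitute exactly `num_mutations` distinct positions with a different base."""
    positions = rng.sample(range(len(seq)), k=num_mutations)
    bases = list(seq)
    for pos in positions:
        # Assumption: draw from the three other bases so each chosen position changes.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


rng = random.Random(0)
core = "ACGT" * 14  # 56 bp stand-in for the intron core
proposal = mutate_exactly(core, num_mutations=2, rng=rng)

changed = [i for i, (a, b) in enumerate(zip(core, proposal)) if a != b]
assert len(proposal) == len(core)
assert len(changed) == 2
```

Small moves like this keep each proposal close to the current sequence, which suits a search that accepts or rejects one step at a time.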

Run the search

The MCMCOptimizer ties the construct, generator, and constraints together and runs Metropolis-Hastings. At each step it generates proposals, scores them against the two splice constraints, and accepts or rejects them: improvements are always kept, and worse proposals are accepted with a probability that falls as the temperature anneals from max_temperature (1.0) to min_temperature (0.001) over num_steps. Here num_steps=3 runs a brief demonstration; the intron core evolves while the donor and acceptor sites stay fixed in the flanks. The Program runs the optimizer and collects the result.

python

from proto_language.core import Program
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[boundary, specificity],
    config=MCMCOptimizerConfig(num_steps=3, max_temperature=1.0, min_temperature=0.001),
)

program = Program(optimizers=[optimizer], num_results=1)
program.run()
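
The accept/reject rule described above can be sketched in a few lines of plain Python. Everything here is illustrative: the geometric temperature schedule, the `temperature` and `accept` helpers, and the GC-fraction toy score are assumptions chosen to keep the example self-contained, and MCMCOptimizer may use a different schedule or score scaling.

```python
import math
import random


def temperature(step: int, num_steps: int, t_max: float, t_min: float) -> float:
    """Geometric interpolation from t_max (first step) to t_min (last step)."""
    if num_steps <= 1:
        return t_max
    frac = step / (num_steps - 1)
    return t_max * (t_min / t_max) ** frac


def accept(current: float, proposed: float, temp: float, rng: random.Random) -> bool:
    """Metropolis rule for a score that is being maximized."""
    if proposed >= current:
        return True  # improvements are always kept
    return rng.random() < math.exp((proposed - current) / temp)


def gc_fraction(seq: str) -> float:
    """Toy objective standing in for the splice constraints."""
    return sum(base in "GC" for base in seq) / len(seq)


rng = random.Random(0)
seq = "AT" * 28  # 56 bp starting core
score = gc_fraction(seq)
num_steps = 200
for step in range(num_steps):
    temp = temperature(step, num_steps, t_max=1.0, t_min=0.001)
    pos = rng.randrange(len(seq))
    proposal = seq[:pos] + rng.choice("ACGT") + seq[pos + 1 :]
    new_score = gc_fraction(proposal)
    if accept(score, new_score, temp, rng):
        seq, score = proposal, new_score

print(f"final GC fraction: {score:.2f}")
```

At high temperature the chain accepts most worse moves and explores broadly; as the temperature approaches its minimum, the rule becomes close to greedy hill climbing.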

Inspect the result

The designed intron core is read back from intron.result_sequences[0], the optimized sequence for that segment. program.constructs[0].joined_sequences[0] is the full cassette with the flanks rejoined; its length confirms the assembled target is still 1000 bp, the window SpliceTransformer expects.

python

designed = program.constructs[0].joined_sequences[0]
print(f"designed intron core: {intron.result_sequences[0].sequence}")
print(f"full cassette length: {len(designed.sequence)} bp")

designed intron core: TTCCGCTCCGTGCCGCTTCTCGTGCACGGTCTCTGAGCTGACTACTAGATTCACGT
full cassette length: 1000 bp
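
A result like the one printed above can be checked with a few plain-string assertions. The sketch below uses a hypothetical helper, `check_core`, on the core copied from that output; it validates only length and alphabet, not splicing quality.

```python
INTRON_LENGTH = 60  # GT + 56 bp core + AG

# Core copied from the printed output above.
designed_core = "TTCCGCTCCGTGCCGCTTCTCGTGCACGGTCTCTGAGCTGACTACTAGATTCACGT"


def check_core(core: str, intron_length: int = INTRON_LENGTH) -> str:
    """Validate a designed core and return the full intron with GT/AG restored."""
    if len(core) != intron_length - 4:
        raise ValueError(f"expected {intron_length - 4} bp core, got {len(core)}")
    if set(core) - set("ACGT"):
        raise ValueError("core contains non-ACGT characters")
    return "GT" + core + "AG"


full_intron = check_core(designed_core)
assert len(full_intron) == INTRON_LENGTH
assert full_intron.startswith("GT") and full_intron.endswith("AG")
```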

Next Steps

Using Constraints

Multi-Stage DNA Optimization