Intron Design with AlphaGenome

This program designs a synthetic intron and scores it with two complementary models. A SpliceTransformer constraint reads the donor and acceptor sites in their plasmid context, and an AlphaGenome constraint predicts how strongly those splice sites are used once the cassette is inserted into a genomic safe-harbor locus. The GT and AG boundary dinucleotides stay fixed while a random-nucleotide generator mutates the intron core, and an MCMCOptimizer keeps the proposals that raise the combined splice signal. The full script sweeps several plasmid contexts and safe-harbor loci and balances on-target against off-target cell types. This walkthrough uses one plasmid context, one safe-harbor locus (AAVS1), a neural-cell ontology term, and two MCMC steps so it runs on a single GPU in a few minutes.

Runtime: this walkthrough runs real models on a GPU and takes several minutes to complete. The first run is slower because it builds the tool environment and downloads model weights.

The intron cassette

The cassette is assembled by the base intron-design helpers: get_initial_intron seeds a GT…AG intron and process_splice_transformer_input centers it in a 1 kb target with 4 kb of plasmid context on each side (the 4 kb is SpliceTransformer’s required left/right CONTEXT_LENGTH). The cassette config records the inputs: intron_length=301 (the GT, a 297 bp core, and the AG, as the inline comment notes), the plasmid context and gene-sequence files, and the gene insertion position. The helper returns left_context, right_context, the concatenated target_seq, and the donor_start / acceptor_end offsets. Those offsets slice the target into three Segment objects (left flank, designable intron core, right flank) wrapped in a Construct, with the GT and AG dinucleotides held inside the flanks so only the core is editable. donor_eval and acceptor_eval mark the positions each scorer reads, just before the GT and just after the AG.

python

import sys
from pathlib import Path
from types import SimpleNamespace

# The AlphaGenome variant builds on the base intron-design helpers in examples/scripts.
sys.path.insert(0, str(Path.cwd().parents[1]))
from examples.scripts.program_intron_design import get_initial_intron, process_splice_transformer_input

DATA = Path.cwd().parent / "data"

cassette = SimpleNamespace(
    initialization="random",
    intron_length=301,                      # GT + 297 bp core + AG
    plasmid_context_path=str(DATA / "plasmid_context_cmv_20260308.txt"),
    gene_sequence_path=str(DATA / "mscarlet_ires_zsgreen.txt"),
    gene_insertion_pos=159 * 3,
)

initial_intron = get_initial_intron(cassette)
left_context, right_context, target_seq, _, _, donor_start, acceptor_end = process_splice_transformer_input(
    initial_intron, cassette
)

from proto_language.core import Segment, Construct

left_flank = Segment(sequence=target_seq[: donor_start + 2], sequence_type="dna", label="left_flank")
intron = Segment(sequence=target_seq[donor_start + 2 : acceptor_end - 1], sequence_type="dna", label="intron")
right_flank = Segment(sequence=target_seq[acceptor_end - 1 :], sequence_type="dna", label="right_flank")
construct = Construct([left_flank, intron, right_flank])

# Donor is scored just before the GT, acceptor just after the AG.
donor_eval, acceptor_eval = donor_start - 1, acceptor_end + 1
splice_pos = [donor_eval, acceptor_eval]
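
The slicing arithmetic above can be checked on a toy sequence. The sketch below is an illustration only, not part of the pipeline: the two 20-base exon strings and every `toy_` name are made up, and it assumes, as the slices in the walkthrough imply, that `donor_start` is the index of the G in GT and `acceptor_end` is the index of the G in AG.

```python
# Toy check of the three-way split; all toy_* names and sequences are hypothetical.
toy_core = "C" * 297                       # stand-in for the designable core
toy_intron = "GT" + toy_core + "AG"        # 2 + 297 + 2 = 301 bases
toy_upstream = "ACGTACGTACGTACGTACGA"      # made-up 20-base left exon
toy_downstream = "TTGCATGCATGCATGCATGC"    # made-up 20-base right exon
toy_target = toy_upstream + toy_intron + toy_downstream

toy_donor_start = len(toy_upstream)                       # index of the G in GT
toy_acceptor_end = toy_donor_start + len(toy_intron) - 1  # index of the G in AG

# Same slices as the walkthrough: GT stays in the left flank, AG in the right flank.
toy_left = toy_target[: toy_donor_start + 2]
toy_mid = toy_target[toy_donor_start + 2 : toy_acceptor_end - 1]
toy_right = toy_target[toy_acceptor_end - 1 :]

# Scoring positions: last exon base before GT, first exon base after AG.
toy_donor_eval, toy_acceptor_eval = toy_donor_start - 1, toy_acceptor_end + 1

print(len(toy_mid), toy_left[-2:], toy_right[:2])
```

Because the boundary dinucleotides live in the flank segments, any edit confined to the middle segment cannot disturb them, whatever the generator proposes.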

The generator and the SpliceTransformer boundary

A RandomNucleotideGenerator proposes mutations at masked positions. Its MaskingStrategy is set to num_mutations=1, the exact number of positions to change per proposal, so each step edits a single base; generator.assign(intron) binds it to the intron core, leaving the GT/AG boundaries in the flanks untouched. The first constraint, splice_transformer_intron_boundary, concatenates the three segments into the 1 kb target and runs SpliceTransformer over the target plus its 4 kb flanks. It reads the donor probability at donor_pos and the acceptor probability at acceptor_pos; the score it returns is a boundary penalty, 1 - mean(donor, acceptor) probability, so a more recognizable donor and acceptor lowers the penalty.

python

from proto_language.core import Constraint
from proto_language.constraint import splice_transformer_intron_boundary
from proto_language.generator import MaskingStrategy, RandomNucleotideGenerator, RandomNucleotideGeneratorConfig

generator = RandomNucleotideGenerator(
    RandomNucleotideGeneratorConfig(masking_strategy=MaskingStrategy(num_mutations=1))
)
generator.assign(intron)

boundary = Constraint(
    inputs=[left_flank, intron, right_flank],
    function=splice_transformer_intron_boundary,
    function_config={
        "left_context": left_context,
        "right_context": right_context,
        "donor_pos": [donor_eval],
        "acceptor_pos": [acceptor_eval],
        "splice_transformer_config": {"device": "cuda"},
    },
    label="splice_boundary",
)
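
To make the direction of the boundary score concrete, here is a minimal sketch of the 1 - mean(donor, acceptor) penalty described above. The function name `boundary_penalty` and the probabilities are hypothetical; the real constraint obtains its probabilities from SpliceTransformer, and its internals may differ from this arithmetic.

```python
def boundary_penalty(donor_prob: float, acceptor_prob: float) -> float:
    """Return 1 - mean(donor, acceptor); lower means better-recognized sites."""
    return 1.0 - (donor_prob + acceptor_prob) / 2.0


# Made-up probabilities for illustration only.
weak_sites = boundary_penalty(0.10, 0.30)    # poorly recognized donor and acceptor
strong_sites = boundary_penalty(0.90, 0.80)  # well recognized donor and acceptor

print(f"weak: {weak_sites:.2f}, strong: {strong_sites:.2f}")
```

A penalty of 0 would mean both sites are predicted with probability 1, and a penalty of 1 would mean neither site is recognized at all.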

The AlphaGenome splice-site-usage constraint

The second constraint concatenates the same three segments into the target, wraps it with the cassette_left_context / cassette_right_context, and integrates that cassette into the center of genomic_context, here the AAVS1 safe-harbor locus read from alphagenome_context_aavs1.txt. AlphaGenome then predicts splice-site usage, and the constraint reads it at the donor and acceptor positions in splice_pos for the ontology term CL:0002319 (neural cell). With direction="max" the score is 1 - mean(usage), so higher predicted usage at those positions lowers the score. device="cuda" runs the model on the GPU.

python

from proto_language.constraint.rna_splicing.alphagenome_splice_site_usage import (
    AlphaGenomeSpliceSiteUsageConfig,
    alphagenome_splice_site_usage,
)

genomic_context = (DATA / "alphagenome_context_aavs1.txt").read_text().strip().upper()

alphagenome = Constraint(
    inputs=[left_flank, intron, right_flank],
    function=alphagenome_splice_site_usage,
    function_config=AlphaGenomeSpliceSiteUsageConfig(
        genomic_context=genomic_context,
        cassette_left_context=left_context,
        cassette_right_context=right_context,
        ontology_terms=["CL:0002319"],   # neural cell
        splice_pos=splice_pos,
        direction="max",
        device="cuda",
    ),
    label="alphagenome_ssu_brain_max",
)
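
The integration step can be pictured with plain strings. The sketch below assumes the cassette is inserted at the midpoint of the genomic context and that target-relative positions are shifted by the insertion offset plus the left cassette context. Whether the real constraint inserts or replaces bases, and how it remaps `splice_pos` internally, is not stated in this walkthrough, so treat `integrate_at_center`, `usage_score`, and all toy values as hypothetical.

```python
def integrate_at_center(genomic_context: str, cassette: str) -> tuple[str, int]:
    """Insert the cassette at the context midpoint; return (sequence, insertion offset)."""
    mid = len(genomic_context) // 2
    return genomic_context[:mid] + cassette + genomic_context[mid:], mid


def usage_score(usages: list[float]) -> float:
    """Score for direction="max": 1 - mean(usage), so higher usage scores lower."""
    return 1.0 - sum(usages) / len(usages)


# Toy sequences: a 10-base "locus" and a cassette of left context + target + right context.
toy_genome = "A" * 10
toy_left, toy_target, toy_right = "CC", "TTGTTT", "GG"
integrated, offset = integrate_at_center(toy_genome, toy_left + toy_target + toy_right)

# A position given relative to the target moves by the offset and the left context length.
toy_splice_pos = [2]  # index of the "G" inside toy_target
shifted = [offset + len(toy_left) + p for p in toy_splice_pos]

print(integrated, shifted, usage_score([0.6, 0.8]))
```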

The search

MCMCOptimizer runs Metropolis-Hastings: at each of num_steps=2 steps it generates a mutated intron core, scores it with both constraints, and accepts or rejects the proposal under a cooling temperature. Two steps are enough to exercise the pipeline, not to converge on a good design. num_results=1 and proposals_per_result=1 keep a single trajectory with one proposal per step, and clear_tool_cache=True clears the tool cache each iteration. Because both scorers call GPU tools, DeviceManager.get_instance().configure(allow_multiple_per_device=True) permits multiple tool instances on one device so SpliceTransformer and AlphaGenome can both stay resident. The ToolInstance.persist() block auto-caches each tool on first use, reuses the warm worker for the rest of the run, and frees GPU memory on exit. The Program runs the optimizer and collects the result.

python

from proto_language.core import Program
from proto_language.optimizer import MCMCOptimizer, MCMCOptimizerConfig
from proto_tools.utils import DeviceManager
from proto_tools.utils.tool_instance import ToolInstance

optimizer = MCMCOptimizer(
    constructs=[construct],
    generators=[generator],
    constraints=[boundary, alphagenome],
    config=MCMCOptimizerConfig(num_steps=2, num_results=1, proposals_per_result=1),
    clear_tool_cache=True,
)

DeviceManager.get_instance().configure(allow_multiple_per_device=True)
program = Program(optimizers=[optimizer], num_results=1)

with ToolInstance.persist():
    program.run()
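
For readers new to the search loop, the following is a minimal textbook Metropolis sketch with a cooling temperature, using single-base proposals to mirror num_mutations=1. It is not MCMCOptimizer's implementation: the geometric cooling schedule, the best-state tracking, and the `toy_energy` function (a stand-in for the combined constraint scores) are all assumptions made for illustration.

```python
import math
import random


def toy_energy(core: str) -> float:
    """Hypothetical stand-in for the combined constraint score: fraction of non-T bases."""
    return 1.0 - core.count("T") / len(core)


def metropolis(core: str, num_steps: int, t_start: float = 1.0, t_end: float = 0.01, seed: int = 0):
    """Single-trajectory Metropolis search; returns the best core seen and its energy."""
    rng = random.Random(seed)
    energy = toy_energy(core)
    best, best_energy = core, energy
    for step in range(num_steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        frac = step / max(num_steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        # Propose exactly one base change at a random position.
        pos = rng.randrange(len(core))
        new_base = rng.choice([b for b in "ACGT" if b != core[pos]])
        proposal = core[:pos] + new_base + core[pos + 1 :]
        new_energy = toy_energy(proposal)
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            core, energy = proposal, new_energy
            if energy < best_energy:
                best, best_energy = core, energy
    return best, best_energy


start_core = "ACGA" * 10
best_core, best_energy = metropolis(start_core, num_steps=500)
print(f"start energy: {toy_energy(start_core):.3f}, best energy: {best_energy:.3f}")
```

As the temperature falls, the chance of accepting a worse proposal shrinks, so the chain explores early and settles late; with only two steps, as in this walkthrough, almost none of that behavior is visible.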

Inspect the result

program.energy_scores reports the final-stage objective energy, where lower values indicate better solutions; this combines the two constraint scores for the best trajectory. intron.result_sequences[0].sequence is the designed intron core, the only editable segment; the print shows its first 60 bases.

python

print(f"objective energy: {program.energy_scores[0]:.4f}")
print(f"designed intron core: {intron.result_sequences[0].sequence[:60]}...")

objective energy: 1.9970
designed intron core: TATGCCAGTGTAGGTTAGTAACCTTATTATTACTTATAGTATCCGCATAATAGCACTTCA...

Next Steps

Intron Design

Cell-Type-Specific DNA