Proto is not affiliated with UT Southwestern Medical Center or St. Jude Children’s Research Hospital. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.
Background
In 2024, Dudnyk et al. introduced Puffin, a deep learning model that explains transcription initiation in the human genome at basepair resolution from sequence alone. The model is trained against five transcription initiation assays (FANTOM CAGE, ENCODE CAGE, ENCODE RAMPAGE, GRO-cap, PRO-cap), each predicted on both strands. The output is a per-base 10-channel signal that can be interpreted as ln(count_scale_signal + 1).
Puffin is structurally constrained: its first convolutional layer plays the role of a learned motif filter bank, and the model exposes per-base activation and contribution scores for nine promoter motifs (CREB, ETS, NFY, NRF1, SP, TATA, U1_snRNP, YY1, ZNF143) on each strand. A tenth Long Inr filter is used internally by the model to construct the per-base initiator-effect track but is not exposed per-motif. The minimum input is 651 bp because the model uses 325 bp of padding on each side of the predicted output span.
The wrapper accepts raw DNA strings only. The upstream coordinate / region / FASTA-file CLI modes (which require an hg38 reference) are intentionally not exposed, so callers extract genomic sequences themselves.
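Because callers supply the sequence themselves, a small pre-flight check can save a failed call. The sketch below is not part of the toolkit: PADDING, MIN_LENGTH, validate_sequence, and window_for_span are hypothetical names chosen for illustration. It checks the alphabet and the 651 bp minimum stated above, and works out which slice of a longer caller-held sequence must be passed so that predictions cover a desired span. Upper-casing the input is an assumption; this page does not say whether the wrapper accepts lowercase letters.

```python
PADDING = 325                  # bases consumed on each side, per the text above
MIN_LENGTH = 2 * PADDING + 1   # 651 bp: shortest input that yields one output base
ALLOWED = set("ACGTN")


def validate_sequence(seq: str) -> str:
    """Return the upper-cased sequence, or raise ValueError if it cannot be scored."""
    seq = seq.upper()  # assumption: case handling by the wrapper is not documented
    bad = set(seq) - ALLOWED
    if bad:
        raise ValueError(f"unsupported characters: {sorted(bad)}")
    if len(seq) < MIN_LENGTH:
        raise ValueError(f"need at least {MIN_LENGTH} bp, got {len(seq)}")
    return seq


def window_for_span(start: int, end: int, source_len: int) -> tuple[int, int]:
    """Input slice [lo, hi) needed to get predictions for positions [start, end)."""
    if end <= start:
        raise ValueError("span must be non-empty")
    lo, hi = start - PADDING, end + PADDING
    if lo < 0 or hi > source_len:
        raise ValueError("span is too close to the edge of the source sequence")
    return lo, hi


# Example: predictions for 100 bases starting at position 1000 of a longer string.
source = "ACGT" * 500                      # 2000 bp stand-in for caller-extracted DNA
lo, hi = window_for_span(1000, 1100, len(source))
model_input = validate_sequence(source[lo:hi])
print(len(model_input), len(model_input) - 2 * PADDING)  # prints: 750 100
```

Positions here are 0-based offsets into the caller's own string; mapping them to genomic coordinates is left to the caller.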
Tools
Puffin Prediction (puffin-prediction)
Runs a single forward pass through Puffin and returns per-base predictions across all 10 transcription-initiation channels (5 assays × 2 strands) at single-base resolution.
API Reference
Input: PuffinPredictionInput
A, C, G, T, N are accepted.
Config: PuffinPredictionConfig
True is coerced to 1 and False to 0.
None waits indefinitely.
BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this tool to score transcription start sites, rank candidate promoters, or measure the per-base effect of variants and edits across five capped-5’-end assays in one call. The fast path is the right choice when the question is how much signal a sequence produces rather than why.
Usage Tips
- Per-base output length is len(sequence) - 650. The model uses 325 bp of padding on each side; output coordinates run from 325 to len(sequence) - 325 in the input frame.
- Channel order is mirrored across strands. The first 5 channels are FANTOM_CAGE+ → PRO_CAP+; the next 5 are PRO_CAP- → FANTOM_CAGE-. Index by name via TRACK_NAMES.index(...) rather than memorizing positions.
- Outputs are in log scale. Treat predicted values as ln(count_scale_signal + 1). To compare two sequences, subtract; the difference is already in log space.
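The three tips above reduce to a few lines of arithmetic. The following is a minimal sketch using only the standard library; it does not call the toolkit. The TRACK_NAMES list here is a reconstruction, assuming the channels between the endpoints named above follow the assay order used elsewhere on this page, and the helper names (output_length, input_position, to_count_scale, log_difference) are hypothetical.

```python
import math

PADDING = 325
ASSAYS = ["FANTOM_CAGE", "ENCODE_CAGE", "ENCODE_RAMPAGE", "GRO_CAP", "PRO_CAP"]

# Assumed reconstruction of TRACK_NAMES: forward strand in assay order,
# then reverse strand in mirrored order (PRO_CAP- first, FANTOM_CAGE- last).
TRACK_NAMES = [a + "+" for a in ASSAYS] + [a + "-" for a in reversed(ASSAYS)]


def output_length(seq_len: int) -> int:
    """Number of predicted bases for an input of seq_len bases."""
    return seq_len - 2 * PADDING


def input_position(output_index: int) -> int:
    """0-based position in the input sequence for a given output index."""
    return output_index + PADDING


def to_count_scale(log_value: float) -> float:
    """Invert ln(x + 1) to recover the count-scale value."""
    return math.expm1(log_value)


def log_difference(ref: list[float], alt: list[float]) -> list[float]:
    """Per-base alt minus ref; inputs are already on the ln(x + 1) scale."""
    return [a - r for r, a in zip(ref, alt)]


print(output_length(651))              # prints: 1
print(TRACK_NAMES.index("PRO_CAP-"))   # prints: 5
print(input_position(0))               # prints: 325
```

A difference of two ln(x + 1) values equals ln((alt + 1) / (ref + 1)), so exponentiating it gives a ratio of shifted counts rather than a raw fold change.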
Puffin Interpretation (puffin-interpretation)
Runs Puffin’s gradient-based decomposition for one chosen target assay and strand. Returns the per-base prediction for that target, 18 motif-activation tracks, 18 motif-effect tracks, and per-base basepair-contribution scores both as an aggregate and decomposed two ways (contribution to the predicted signal per motif, and contribution to each motif’s activation per basepair; 18 tracks each). Summed motif, initiator, trinucleotide, and total-effect tracks are also returned.
API Reference
Input: PuffinInterpretationInput
A, C, G, T, N are accepted.
Config: PuffinInterpretationConfig
target_signal: FANTOM_CAGE, ENCODE_CAGE, ENCODE_RAMPAGE, GRO_CAP, PRO_CAP
reverse_strand: If True, decompose the reverse-strand prediction for the chosen target instead of the forward strand.
True is coerced to 1 and False to 0.
None waits indefinitely.
BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: PuffinInterpretationOutput
Applications
Use this tool to ask which motif drives a transcription start site, how a variant changes a motif activation, or how initiator and trinucleotide context shape the predicted signal. It is substantially slower than puffin-prediction because it computes per-base gradient contributions, so reach for it for mechanistic follow-up on specific sequences rather than for bulk scoring.
Usage Tips
- target_signal picks which assay’s prediction is decomposed. Choose the one closest to the biological question; CAGE/RAMPAGE measure capped mRNA 5’ ends, while GRO-cap/PRO-cap measure nascent transcription.
- reverse_strand selects which strand head to interpret. Defaults to forward; run the tool twice on the same input, once per strand, to analyze divergent or antisense promoters.
- Motif dicts use strand-suffixed keys. Access motif_activations["TATA+"] and motif_activations["TATA-"], never the bare motif name. MOTIF_NAMES lists the 9 motif stems.
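The strand-suffixed key convention can be illustrated without running the model. The sketch below builds the 18 keys from the 9 motif stems named in the Background section and scans a dict of per-base tracks for the strongest activation. It uses toy data in place of real output; the order of MOTIF_NAMES, the assumption that each track is a plain sequence of floats, and the helpers strand_keys and strongest_motif are all hypothetical.

```python
# Stems as listed in the Background section; the toolkit's own ordering
# of MOTIF_NAMES is an assumption here.
MOTIF_NAMES = ["CREB", "ETS", "NFY", "NRF1", "SP", "TATA", "U1_snRNP", "YY1", "ZNF143"]


def strand_keys(stems=MOTIF_NAMES):
    """Expand motif stems into strand-suffixed keys such as 'TATA+' and 'TATA-'."""
    return [f"{stem}{strand}" for stem in stems for strand in "+-"]


def strongest_motif(motif_activations):
    """Return (key, index, value) for the single largest per-base activation."""
    best = None
    for key, track in motif_activations.items():
        for index, value in enumerate(track):
            if best is None or value > best[2]:
                best = (key, index, value)
    return best


# Toy stand-in for a motif_activations dict: 18 tracks of 3 bases each.
toy = {key: [0.0, 0.0, 0.0] for key in strand_keys()}
toy["TATA+"][1] = 2.5
toy["SP-"][2] = 1.0

print(len(toy))              # prints: 18
print(strongest_motif(toy))  # prints: ('TATA+', 1, 2.5)
```

Looking up a bare stem such as toy["TATA"] raises KeyError in this sketch, which mirrors the tip above: always include the strand suffix.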
Toolkit Notes
These apply to every Puffin tool in this toolkit (puffin-prediction, puffin-interpretation).
- GPU recommended but not required. Both tools run on CPU; puffin-interpretation is materially slower than puffin-prediction on either device because it backpropagates through every output position and motif.
- Sequence input only. The upstream CLI’s coordinate / region / FASTA-file modes require an hg38.fa reference and are intentionally not wrapped; callers extract DNA themselves and pass it as a string.
- Both tools share one persistent worker. They dispatch against the same puffin toolkit and load the Puffin model once per worker process; switching between prediction and interpretation does not reload weights.
Infrastructure Guides
The following guides cover how to run tools efficiently and at scale.
Tool Persistence
Device Management
Parallel Execution
Cloud Inference
Additional Information
References
- Dudnyk, K., Cai, D., Shi, C., Xu, J., Zhou, J. Sequence basis of transcription initiation in the human genome. Science 384, eadj0116 (2024). DOI: 10.1126/science.adj0116
- Upstream repository: jzhoulab/puffin
