License: Puffin is distributed under a custom license (UTSW Academic Software License) that restricts commercial use and may require explicit attribution. Please refer to the license for full terms.

Proto is not affiliated with UT Southwestern Medical Center or St. Jude Children’s Research Hospital. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.


jzhoulab/puffin
deep learning-inspired explainable sequence model for transcription initiation
puffin.zhoulab.io
Sequence basis of transcription initiation in the human genome
Kseniia Dudnyk, Donghong Cai, … Jian Zhou
Science (2024)
Sequence basis of transcription initiation in human genome
Kseniia Dudnyk, Chenlai Shi and Jian Zhou
bioRxiv (2023)
@article{dudnyk2024puffin,
  title={Sequence basis of transcription initiation in the human genome},
  author={Dudnyk, Kseniia and Cai, Donghong and Shi, Chenlai and Xu, Jian and Zhou, Jian},
  journal={Science},
  volume={384},
  number={6694},
  year={2024},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adj0116},
  url={https://www.science.org/doi/10.1126/science.adj0116}
}
proto-bio/proto-tools/proto_tools/tools/sequence_scoring/puffin
Function | Description
run_puffin_interpretation() | Motif-level interpretation of transcription initiation with Puffin (GPU)
run_puffin_prediction() | Basepair-resolution transcription initiation prediction with Puffin (GPU)

Background

In 2024, Dudnyk et al. introduced Puffin, a deep learning model that explains transcription initiation in the human genome at basepair resolution from sequence alone. The model is trained against five transcription initiation assays (FANTOM CAGE, ENCODE CAGE, ENCODE RAMPAGE, GRO-cap, PRO-cap), each predicted on both strands. The output is a per-base 10-channel signal that can be interpreted as ln(count_scale_signal + 1). Puffin is structurally constrained: its first convolutional layer plays the role of a learned motif filter bank, and the model exposes per-base activation and contribution scores for nine promoter motifs (CREB, ETS, NFY, NRF1, SP, TATA, U1_snRNP, YY1, ZNF143) on each strand. A tenth Long Inr filter is used internally by the model to construct the per-base initiator-effect track but is not exposed per-motif. The minimum input is 651 bp because the model uses 325 bp of padding on each side of the predicted output span. The wrapper accepts raw DNA strings; the upstream coordinate / region / FASTA-file CLI modes (which require an hg38 reference) are intentionally not exposed and callers extract genomic sequences themselves.
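
The length rule above can be checked before a sequence is submitted. The following is a minimal sketch and not part of the toolkit: check_sequence and scored_span are hypothetical helper names, the 0-based, end-exclusive coordinate convention is an assumption, and the check assumes uppercase input because the text does not say whether lowercase bases are accepted.

```python
import re

PADDING = 325                   # bp of context consumed on each side (from the text above)
MIN_LENGTH = 2 * PADDING + 1    # 651 bp: the padding plus one scored base
_VALID = re.compile(r"[ACGTN]+")


def check_sequence(seq: str) -> int:
    """Validate one sequence and return how many positions would be scored.

    Hypothetical helper: it mirrors the documented constraints only.
    """
    if len(seq) < MIN_LENGTH:
        raise ValueError(f"need at least {MIN_LENGTH} bp, got {len(seq)}")
    if not _VALID.fullmatch(seq):
        raise ValueError("only A, C, G, T, N are accepted")
    return len(seq) - 2 * PADDING


def scored_span(seq_len: int) -> range:
    """Input-frame positions covered by the output (assumed 0-based, end-exclusive)."""
    return range(PADDING, seq_len - PADDING)
```

A 651 bp input therefore yields exactly one scored position, and each additional input base adds one more.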

Tools

Puffin Prediction (puffin-prediction)

Runs a single forward pass through Puffin and returns per-base predictions across all 10 transcription-initiation channels (5 assays × 2 strands) at single-base resolution.

API Reference

Inputs
sequences (List[string], required): DNA sequence(s) at least 651 bp long. A single string is normalized to a one-item list. Only A, C, G, T, N are accepted.

Configuration
verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device used for inference.
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs
results (List[PuffinPredictionResult], required): Per-sequence prediction results.
track_names (List[string], required): Names of the 10 output channels in order.

Applications

Use this tool to score transcription start sites, rank candidate promoters, or measure the per-base effect of variants and edits across five capped-5’-end assays in one call. Because it needs only a single forward pass, it is the right choice when the question is how much signal a sequence produces rather than why.
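
For variant scoring, the reference and edited sequences must have the same length so their per-base outputs line up, and the edited position should fall inside the scored span rather than in the 325 bp of padding. The sketch below illustrates that bookkeeping only. The helper names substitute and output_index are hypothetical, and positions are assumed to be 0-based.

```python
from typing import Optional

PADDING = 325  # bp of context consumed on each side of the scored span


def substitute(seq: str, pos: int, alt: str) -> str:
    """Return a copy of seq with the base at 0-based position pos replaced by alt."""
    if alt not in {"A", "C", "G", "T"}:
        raise ValueError("alt must be a single base: A, C, G, or T")
    if not 0 <= pos < len(seq):
        raise IndexError("pos is outside the sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def output_index(seq_len: int, pos: int) -> Optional[int]:
    """Map an input position to its index in the per-base output.

    Returns None when the position lies in the padding and is not scored.
    """
    if PADDING <= pos < seq_len - PADDING:
        return pos - PADDING
    return None
```

Both sequences would then be passed to the prediction tool, and the two outputs compared at matching indices.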

Usage Tips

  • Per-base output length is len(sequence) - 650. The model uses 325 bp of padding on each side; output coordinates run from 325 to len(sequence) - 325 in the input frame.
  • Channel order is mirrored across strands. The first 5 channels are FANTOM_CAGE+ → PRO_CAP+; the next 5 are PRO_CAP- → FANTOM_CAGE-. Index by name via TRACK_NAMES.index(...) rather than memorizing positions.
  • Outputs are in log scale. Treat predicted values as ln(count_scale_signal + 1). To compare two sequences, subtract one prediction from the other; the difference is already in log space.
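
The channel layout and log-scale tips can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the TRACK_NAMES list below is rebuilt by hand, assuming the assays appear in the same order as the target_signal options and that names take the form assay plus strand suffix. In practice, prefer the track_names field returned by the tool. The helpers channel, to_counts, and fold_change are hypothetical names.

```python
import math

ASSAYS = ["FANTOM_CAGE", "ENCODE_CAGE", "ENCODE_RAMPAGE", "GRO_CAP", "PRO_CAP"]

# Assumed layout: forward strand in assay order, then reverse strand mirrored.
TRACK_NAMES = [a + "+" for a in ASSAYS] + [a + "-" for a in reversed(ASSAYS)]


def channel(name: str) -> int:
    """Look up a channel index by name instead of memorizing positions."""
    return TRACK_NAMES.index(name)


def to_counts(log_value: float) -> float:
    """Invert ln(count + 1) to recover a count-scale value."""
    return math.expm1(log_value)


def fold_change(ref_log: float, alt_log: float) -> float:
    """Ratio of (count + 1) values implied by a difference in log space."""
    return math.exp(alt_log - ref_log)
```

Because the outputs are ln(count + 1), the subtraction gives a log ratio of the shifted counts, not of the raw counts; the two are close only when counts are well above 1.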

Puffin Interpretation (puffin-interpretation)

Runs Puffin’s gradient-based decomposition for one chosen target assay and strand. It returns the per-base prediction for that target, 18 motif-activation tracks and 18 motif-effect tracks (9 motifs × 2 strands), and per-base basepair-contribution scores. The contribution scores are reported as an aggregate and decomposed two ways, with 18 tracks each: contribution to the predicted signal per motif, and contribution to each motif’s activation per basepair. Summed motif, initiator, trinucleotide, and total-effect tracks are also returned.

API Reference

Inputs
sequences (List[string], required): DNA sequence(s) at least 651 bp long. A single string is normalized to a one-item list. Only A, C, G, T, N are accepted.

Configuration
target_signal (enum, default: "FANTOM_CAGE"): Which transcription initiation assay’s predictions to decompose into motif and basepair contributions. Available options: FANTOM_CAGE, ENCODE_CAGE, ENCODE_RAMPAGE, GRO_CAP, PRO_CAP
reverse_strand (boolean, default: False): If True, decompose the reverse-strand prediction for the chosen target instead of the forward strand.
verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device used for inference.
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs
results (List[PuffinInterpretationResult], required): Per-sequence interpretation results.
target_signal (string, required): Target signal selected for interpretation.
reverse_strand (boolean, required): Whether the reverse-strand head was used.
motif_names (List[string], required): The 9 learned motif names (without strand suffix) exposed in the per-motif tracks. Strand-suffixed names appear as keys in each result’s motif-keyed dicts.

Applications

Use this tool to ask which motif drives a transcription start site, how a variant changes a motif activation, or how initiator and trinucleotide context shape the predicted signal. It is substantially slower than puffin-prediction because it computes per-base gradient contributions, so reach for it for mechanistic follow-up on specific sequences rather than for bulk scoring.

Usage Tips

  • target_signal picks which assay’s prediction is decomposed. Choose the one closest to the biological question; CAGE/RAMPAGE measure capped mRNA 5’ ends, while GRO-cap/PRO-cap measure nascent transcription.
  • reverse_strand selects which strand head to interpret. It defaults to the forward strand; to analyze divergent or antisense promoters, run the tool twice on the same input, once with reverse_strand=False and once with reverse_strand=True.
  • Motif dicts use strand-suffixed keys. Access motif_activations["TATA+"] and motif_activations["TATA-"], never the bare motif name. MOTIF_NAMES lists the 9 motif stems.
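
The strand-suffixed keys and the two-pass pattern for both strands can be sketched as follows. Everything here is illustrative: stranded_keys, top_motif_at, and interpret_both_strands are hypothetical helpers, the motif dicts are assumed to map each key to an indexable per-base sequence of floats, and run_fn stands in for run_puffin_interpretation, whose exact call signature is not shown on this page. Only the parameter names sequences, target_signal, and reverse_strand come from the API reference.

```python
MOTIF_NAMES = ["CREB", "ETS", "NFY", "NRF1", "SP", "TATA", "U1_snRNP", "YY1", "ZNF143"]


def stranded_keys(motif_names):
    """Expand the 9 motif stems into the 18 strand-suffixed dict keys."""
    return [f"{motif}{strand}" for motif in motif_names for strand in "+-"]


def top_motif_at(tracks, index):
    """Return (key, value) for the track with the largest absolute value at index.

    tracks is assumed to be a dict such as motif_activations, keyed by "TATA+" etc.
    """
    key = max(tracks, key=lambda k: abs(tracks[k][index]))
    return key, tracks[key][index]


def interpret_both_strands(run_fn, sequence, target_signal="FANTOM_CAGE"):
    """Call an interpretation function once per strand head.

    run_fn is injected so this sketch does not assume an import path;
    keyword-style calling is an assumption.
    """
    return {
        "+": run_fn(sequences=[sequence], target_signal=target_signal, reverse_strand=False),
        "-": run_fn(sequences=[sequence], target_signal=target_signal, reverse_strand=True),
    }
```

Passing the function in as an argument keeps the sketch independent of how the toolkit is imported, and makes it easy to substitute a stub when testing downstream analysis code.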

Toolkit Notes

These apply to every Puffin tool in this toolkit (puffin-prediction, puffin-interpretation).
  • GPU recommended but not required. Both tools run on CPU; puffin-interpretation is materially slower than puffin-prediction on either device because it backpropagates through every output position and motif.
  • Sequence input only. The upstream CLI’s coordinate / region / FASTA-file modes require an hg38.fa reference and are intentionally not wrapped; callers extract DNA themselves and pass it as a string.
  • Both tools share one persistent worker. They dispatch against the same puffin toolkit and load the Puffin model once per worker process; switching between prediction and interpretation does not reload weights.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.

Additional Information

  • Dudnyk, K., Cai, D., Shi, C., Xu, J., Zhou, J. Sequence basis of transcription initiation in the human genome. Science 384, eadj0116 (2024). DOI: 10.1126/science.adj0116
  • Upstream repository: jzhoulab/puffin