Puffin - Proto

License: Puffin is distributed under a custom license (the UTSW Academic Software License), which restricts commercial use and may require explicit attribution. Please refer to the license for full terms.

Proto is not affiliated with UT Southwestern Medical Center or St. Jude Children’s Research Hospital. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.

jzhoulab/puffin

deep learning-inspired explainable sequence model for transcription initiation

Sequence basis of transcription initiation in the human genome

Kseniia Dudnyk, Donghong Cai, … Jian Zhou

Science (2024)

Sequence basis of transcription initiation in human genome

Kseniia Dudnyk, Chenlai Shi and Jian Zhou

bioRxiv (2023)

@article{dudnyk2024puffin,
  title={Sequence basis of transcription initiation in the human genome},
  author={Dudnyk, Kseniia and Cai, Donghong and Shi, Chenlai and Xu, Jian and Zhou, Jian},
  journal={Science},
  volume={384},
  number={6694},
  year={2024},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adj0116},
  url={https://www.science.org/doi/10.1126/science.adj0116}
}

proto-bio/proto-tools/proto_tools/tools/sequence_scoring/puffin

Function	Description
`run_puffin_interpretation()`	Motif-level interpretation of transcription initiation with Puffin (GPU)
`run_puffin_prediction()`	Basepair-resolution transcription initiation prediction with Puffin (GPU)

Background

In 2024, Dudnyk et al. introduced Puffin, a deep learning model that predicts and explains transcription initiation in the human genome at basepair resolution from sequence alone. The model is trained against five transcription initiation assays (FANTOM CAGE, ENCODE CAGE, ENCODE RAMPAGE, GRO-cap, PRO-cap), each predicted on both strands, so the output is a per-base 10-channel signal. Each value can be interpreted as ln(count_scale_signal + 1).

Puffin is structurally constrained: its first convolutional layer plays the role of a learned motif filter bank, and the model exposes per-base activation and contribution scores for nine promoter motifs (CREB, ETS, NFY, NRF1, SP, TATA, U1_snRNP, YY1, ZNF143) on each strand. A tenth filter, Long Inr, is used internally to construct the per-base initiator-effect track but is not exposed as a per-motif track.

The minimum input is 651 bp because the model uses 325 bp of padding on each side of the predicted output span, so a 651 bp input yields a single output position. The wrapper accepts raw DNA strings only; the upstream coordinate, region, and FASTA-file CLI modes (which require an hg38 reference) are intentionally not exposed, so callers must extract genomic sequences themselves.

Tools

Puffin Prediction (`puffin-prediction`)

Runs a single forward pass through Puffin and returns per-base predictions across all 10 transcription-initiation channels (5 assays × 2 strands) at single-base resolution.

API Reference

Input: PuffinPredictionInput

sequences (List[string], required): DNA sequence(s) at least 651 bp long. A single string is normalized to a one-item list. Only A, C, G, T, N are accepted.
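
The input rules above (minimum length, accepted alphabet, single string normalized to a list) can be checked before dispatching a job. The following is a minimal sketch, not the wrapper's own validation code: `normalize_sequences` is a hypothetical helper, and treating lowercase bases as invalid is an assumption, since the page does not say how case is handled.

```python
MIN_LENGTH = 651        # minimum input length stated above
ALLOWED = set("ACGTN")  # accepted alphabet stated above (uppercase assumed)


def normalize_sequences(sequences):
    """Hypothetical pre-flight check mirroring the documented input rules."""
    # A single string is treated as a one-item list.
    if isinstance(sequences, str):
        sequences = [sequences]
    checked = []
    for i, seq in enumerate(sequences):
        if len(seq) < MIN_LENGTH:
            raise ValueError(f"sequence {i}: length {len(seq)} < {MIN_LENGTH}")
        unsupported = set(seq) - ALLOWED
        if unsupported:
            raise ValueError(
                f"sequence {i}: unsupported characters {sorted(unsupported)}"
            )
        checked.append(seq)
    return checked
```

Running a check like this locally surfaces malformed inputs before a worker is started.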

Config: PuffinPredictionConfig

verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device used for inference.
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output: PuffinPredictionOutput

results (List[PuffinPredictionResult], required): Per-sequence prediction results.

PuffinPredictionResult:

sequence (string, required): Input DNA sequence that was scored.
sequence_length (integer, required): Length of the input sequence.
output_length (integer, required): Number of per-base output positions (= sequence_length - 650).
output_start (integer, required): 0-based sequence coordinate of the first per-base output position (always 325).
output_end (integer, required): 0-based exclusive end of the per-base output span in the input sequence (= sequence_length - 325).
predictions (List[array], required): Per-base predictions with shape [output_length, 10]. Channel order matches TRACK_NAMES.
track_names (List[string], required): Names of the 10 output channels in order.
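
The relationship between sequence_length, output_start, output_end, and output_length is simple arithmetic around the 325 bp padding. The sketch below restates it in plain Python; `output_window` and `input_position` are hypothetical helper names used only for illustration and are not part of the toolkit.

```python
PADDING = 325  # bases consumed on each side of the output span


def output_window(sequence_length):
    """Return (output_start, output_end, output_length) for an input length."""
    if sequence_length < 2 * PADDING + 1:
        raise ValueError("input must be at least 651 bp")
    output_start = PADDING                    # 0-based, inclusive
    output_end = sequence_length - PADDING    # 0-based, exclusive
    return output_start, output_end, output_end - output_start


def input_position(output_index):
    """Map a row index of the per-base output to a 0-based input coordinate."""
    return PADDING + output_index


# The shortest valid input produces exactly one output position.
print(output_window(651))   # (325, 326, 1)
print(output_window(1000))  # (325, 675, 350)
```

Row i of a per-base output therefore describes input base 325 + i.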

Applications

Use this tool to score transcription start sites, rank candidate promoters, or measure the per-base effect of variants and edits across five capped-5’-end assays in one call. Because it needs only a single forward pass, it is the right choice when the question is how much signal a sequence produces rather than why.

Usage Tips

Per-base output length is len(sequence) - 650. The model uses 325 bp of padding on each side; output coordinates run from 325 to len(sequence) - 325 in the input frame.
Channel order is mirrored across strands. The first 5 channels are FANTOM_CAGE+ → PRO_CAP+; the next 5 are PRO_CAP- → FANTOM_CAGE-. Index by name via TRACK_NAMES.index(...) rather than memorizing positions.
Outputs are in log scale. Treat predicted values as ln(count_scale_signal + 1). To compare two sequences, subtract one prediction from the other; the difference is already in log space, i.e. a log ratio of the (count_scale_signal + 1) values.
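
The channel-order and log-scale tips can be illustrated together. In this sketch the `track_names` list is rebuilt locally under an assumption: only the first, fifth, sixth, and last names are stated above, and the middle entries are assumed to follow the assay order used elsewhere on this page. In real use, prefer the track_names returned with the output. The helper names are hypothetical.

```python
import math

# Assumed assay order (taken from the target_signal options on this page).
ASSAYS = ["FANTOM_CAGE", "ENCODE_CAGE", "ENCODE_RAMPAGE", "GRO_CAP", "PRO_CAP"]

# Forward-strand channels first, then reverse-strand channels in mirrored order.
track_names = [a + "+" for a in ASSAYS] + [a + "-" for a in reversed(ASSAYS)]


def channel_track(predictions, names, assay, strand):
    """Pull one channel out of a [output_length, 10] nested list by name."""
    idx = names.index(assay + strand)
    return [row[idx] for row in predictions]


def to_count_scale(log_value):
    """Invert ln(count + 1): expm1(x) computes exp(x) - 1."""
    return math.expm1(log_value)


def log_difference(ref_track, alt_track):
    """Per-base alt minus ref; already a log-space effect size."""
    return [alt - ref for ref, alt in zip(ref_track, alt_track)]
```

Looking up channels by name keeps downstream code correct even if the mirrored ordering is easy to misremember.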

Puffin Interpretation (`puffin-interpretation`)

Runs Puffin’s gradient-based decomposition for one chosen target assay and strand. It returns the per-base prediction for that target, 18 motif-activation tracks and 18 motif-effect tracks (9 motifs × 2 strands), and basepair-contribution scores. The contribution scores are given both as a single aggregate track and decomposed two ways, with 18 tracks each: the contribution of each basepair to the predicted signal through each motif, and the contribution of each basepair to each motif’s activation. Summed motif, initiator, trinucleotide, and total-effect tracks are also returned.

API Reference

Input: PuffinInterpretationInput

sequences (List[string], required): DNA sequence(s) at least 651 bp long. A single string is normalized to a one-item list. Only A, C, G, T, N are accepted.

Config: PuffinInterpretationConfig

target_signal (enum, default: "FANTOM_CAGE"): Which transcription initiation assay’s predictions to decompose into motif and basepair contributions. Available options: FANTOM_CAGE, ENCODE_CAGE, ENCODE_RAMPAGE, GRO_CAP, PRO_CAP.
reverse_strand (boolean, default: False): If True, decompose the reverse-strand prediction for the chosen target instead of the forward strand.
verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device used for inference.
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed (see the seed field of PuffinPredictionConfig).

Output: PuffinInterpretationOutput

results (List[PuffinInterpretationResult], required): Per-sequence interpretation results.

PuffinInterpretationResult:

sequence (string, required): Input DNA sequence that was scored.
sequence_length (integer, required): Length of the input sequence.
output_length (integer, required): Number of per-base output positions (= sequence_length - 650).
output_start (integer, required): 0-based sequence coordinate of the first per-base output position (always 325).
output_end (integer, required): 0-based exclusive end of the per-base output span in the input sequence (= sequence_length - 325).
prediction (List[number], required): Predicted transcription initiation signal for the selected target_signal and strand, length output_length.
motif_activations (Dict[string, List[number]], required): Per-base motif activation scores keyed by strand-suffixed motif name (e.g. "TATA+").
motif_effects (Dict[string, List[number]], required): Per-base motif effect scores keyed by strand-suffixed motif name.
sum_motif_effects (List[number], required): Per-base sum of motif effects across non-initiator motifs.
sum_initiator_effects (List[number], required): Per-base sum of initiator-motif effects (centered to zero mean).
sum_trinucleotide_effects (List[number], required): Per-base sum of trinucleotide sequence effects (centered to zero mean).
sum_total_effects (List[number], required): Per-base sum of all sequence pattern effects (motif + initiator + trinucleotide).
bp_contribution (List[number], required): Per-base contribution score to transcription initiation at the target.
bp_contribution_per_motif (Dict[string, List[number]], required): Per-base contribution to transcription, decomposed by motif name.
bp_contribution_to_motif_activation (Dict[string, List[number]], required): Per-base contribution to motif-activation scores, decomposed by motif name.
target_signal (string, required): Target signal selected for interpretation.
reverse_strand (boolean, required): Whether the reverse-strand head was used.
motif_names (List[string], required): The 9 learned motif names (without strand suffix) exposed in the per-motif tracks. Strand-suffixed names appear as keys in each result’s motif-keyed dicts.
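
The summed tracks are described as additive: sum_total_effects is motif + initiator + trinucleotide. The sketch below shows how such a sum would be recomputed from its parts, using made-up toy values rather than real model output. Whether sum_motif_effects equals the plain sum of the entries in motif_effects is an assumption read off the field descriptions, so treat any mismatch on real output as a prompt to check the source rather than as an error.

```python
def sum_tracks(tracks):
    """Element-wise sum of equal-length per-base tracks."""
    return [sum(values) for values in zip(*tracks)]


# Toy stand-in for one interpretation result (values are invented).
result = {
    "motif_effects": {
        "TATA+": [0.5, 0.0, 0.25],
        "SP+": [0.25, 0.5, 0.0],
        "YY1-": [0.0, -0.25, 0.0],
    },
    "sum_initiator_effects": [0.125, -0.25, 0.125],       # zero mean
    "sum_trinucleotide_effects": [-0.125, 0.0, 0.125],    # zero mean
}

# Assumed composition: motif sum, then motif + initiator + trinucleotide.
motif_sum = sum_tracks(result["motif_effects"].values())
total = sum_tracks(
    [motif_sum, result["sum_initiator_effects"], result["sum_trinucleotide_effects"]]
)

print(motif_sum)  # [0.75, 0.25, 0.25]
print(total)      # [0.75, 0.0, 0.5]
```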

Applications

Use this tool to ask which motif drives a transcription start site, how a variant changes a motif activation, or how initiator and trinucleotide context shape the predicted signal. It is substantially slower than puffin-prediction because it computes per-base gradient contributions, so reach for it for mechanistic follow-up on specific sequences rather than for bulk scoring.

Usage Tips

target_signal picks which assay’s prediction is decomposed. Choose the one closest to the biological question; CAGE/RAMPAGE measure capped mRNA 5’ ends, while GRO-cap/PRO-cap measure nascent transcription.
reverse_strand selects which strand head to interpret. It defaults to the forward strand; to analyze divergent or antisense promoters, run the tool twice on the same input, once with reverse_strand=False and once with reverse_strand=True.
Motif dicts use strand-suffixed keys. Access motif_activations["TATA+"] and motif_activations["TATA-"], never the bare motif name. MOTIF_NAMES lists the 9 motif stems.
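
The strand-suffixed key convention can be sketched as follows. The motif list is copied from the Background section, but its order within MOTIF_NAMES is an assumption, and `stranded_keys` and `top_motifs` are hypothetical helpers. The example dict holds invented values, not model output.

```python
# Motif stems as listed in the Background section (order assumed).
MOTIF_NAMES = [
    "CREB", "ETS", "NFY", "NRF1", "SP", "TATA", "U1_snRNP", "YY1", "ZNF143",
]


def stranded_keys(motif_names):
    """Expand 9 motif stems into the 18 strand-suffixed dict keys."""
    return [name + strand for name in motif_names for strand in "+-"]


def top_motifs(motif_effects, position, k=3):
    """Rank motif tracks by absolute effect at one output position."""
    ranked = sorted(
        motif_effects.items(),
        key=lambda item: abs(item[1][position]),
        reverse=True,
    )
    return [(name, track[position]) for name, track in ranked[:k]]


# Toy effects over two output positions (invented values).
effects = {"TATA+": [0.1, 0.9], "SP+": [0.5, -0.2], "YY1-": [-0.7, 0.0]}
print(top_motifs(effects, position=0, k=2))  # [('YY1-', -0.7), ('SP+', 0.5)]
```

Ranking by absolute value keeps strongly repressive motifs visible alongside activating ones.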

Toolkit Notes

These apply to every Puffin tool in this toolkit (puffin-prediction, puffin-interpretation).

GPU recommended but not required. Both tools run on CPU; puffin-interpretation is materially slower than puffin-prediction on either device because it backpropagates through every output position and motif.
Sequence input only. The upstream CLI’s coordinate / region / FASTA-file modes require an hg38.fa reference and are intentionally not wrapped; callers extract DNA themselves and pass it as a string.
Both tools share one persistent worker. They dispatch against the same puffin toolkit and load the Puffin model once per worker process; switching between prediction and interpretation does not reload weights.
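
Because callers supply DNA themselves, a region of interest has to be widened by the 325 bp padding on each side before it is passed in. The following sketch assumes the chromosome sequence is already loaded into a Python string by whatever FASTA reader the caller prefers; `extract_window` is a hypothetical helper. Uppercasing is included because reference FASTA files often use lowercase for soft-masked bases, while the documented alphabet is A, C, G, T, N.

```python
PADDING = 325  # flank needed on each side of the span to be scored


def extract_window(chrom_seq, start, end):
    """Return the input sequence whose output span covers [start, end).

    chrom_seq: chromosome sequence as a string (already loaded by the caller).
    start, end: 0-based, end-exclusive coordinates of the region to score.
    """
    if start >= end:
        raise ValueError("start must be less than end")
    lo, hi = start - PADDING, end + PADDING
    if lo < 0 or hi > len(chrom_seq):
        raise ValueError("region is too close to a chromosome end")
    # Uppercase so soft-masked bases match the accepted alphabet.
    return chrom_seq[lo:hi].upper()


# Scoring a single base at position 1000 needs a 651 bp input.
toy_chrom = "acgt" * 500
window = extract_window(toy_chrom, 1000, 1001)
print(len(window))  # 651
```

With this convention, output row i of the result corresponds to genomic position start + i.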

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.

Additional Information

References

Dudnyk, K., Cai, D., Shi, C., Xu, J., Zhou, J. Sequence basis of transcription initiation in the human genome. Science 384, eadj0116 (2024). DOI: 10.1126/science.adj0116
Upstream repository: jzhoulab/puffin