
This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
Background
Pre-mRNA splicing removes introns and joins exons, with the spliceosome recognizing the donor (5’) and acceptor (3’) splice sites that bound each intron. Which sites are used, and how often, varies across tissues and underlies much of the transcriptome’s alternative splicing diversity. Variants that create or destroy splice sites are a major and frequently under-recognized cause of genetic disease, which motivates models that can read splicing regulation straight from sequence.

Pangolin extends the SpliceAI dilated-CNN approach in two directions (Zeng & Li, 2022). First, it is trained on quantitative, tissue-specific splicing measurements (including the fraction of transcripts that use a given site) and emits a per-tissue splice-site probability rather than SpliceAI’s single tissue-agnostic score. Second, it is trained across four tissues - heart, liver, brain, and testis - and across multiple species (human, rhesus, mouse, rat), which improves generalization and lets the model report tissue-specific predictions. The released model is an ensemble of checkpoints; predictions for a tissue average the relevant ensemble members. As with SpliceAI, the network consumes a wide window of flanking sequence to capture the long-range context that governs splice-site choice.

These tools expose the per-tissue splice-site probability score: the same P(splice) head that Pangolin’s reference CLI uses for variant scoring (not the separate transcript-usage head). Variant scoring reduces that score across the selected tissues into a per-position splice gain and loss.
Learning Resources
- Pangolin repository (Zeng & Li, University of Chicago) - source, pretrained ensemble weights, and the reference CLI this wrapper mirrors.
- Predicting RNA splicing from DNA sequence using Pangolin (Genome Biology, 2022) - the primary publication, with training setup, tissue/usage formulation, and variant-scoring benchmarks.
- SpliceAI (Jaganathan et al., 2019) - the dilated-CNN splice-prediction model that Pangolin builds on.
Tools
Pangolin Splice-Site Prediction (pangolin-predict)
Predicts per-position, tissue-specific splice-site probability scores along one or more DNA sequences.
API Reference
Input: PangolinPredictInput
len - 10000 positions; a single string is wrapped to a list.
Config: PangolinPredictConfig
- True is coerced to 1 and False to 0.
- BaseConfig.device because Pangolin is a GPU tool (default cuda).
- None waits indefinitely.
- BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: PangolinPredictOutput
Applications
Use this to scan a gene, transcript, or designed sequence for where splice sites are predicted and how strongly, resolved by tissue. Typical workflows include mapping the donor/acceptor splice-score landscape of a locus, comparing predicted scores across heart/liver/brain/testis to find tissue-specific sites, and generating per-position tracks for downstream visualization or differential analysis.
Usage Tips
- Each sequence needs 5,000 bp of flanking context on each side (`PANGOLIN_FLANK`). A length-`N` sequence yields predictions for the central `N - 10000` positions, so the minimum input is 10,001 bp. The `output_start` field reports the input index (always `5000`) of the first scored position.
- `tissues` selects which of `HEART`, `LIVER`, `BRAIN`, `TESTIS` are ensembled (default: all four). The score columns are emitted in the order given by `tissues`, so request only the tissues you need and read columns by that order.
- Inputs accept a single sequence string (auto-wrapped) or a list; outputs are 1:1 with inputs. Sequences are validated as DNA (A/C/G/T/N, uppercased) before scoring.
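The coordinate bookkeeping in the tips above can be sketched in a few lines. This is an illustration only, not the toolkit's code: `normalize_dna`, `scored_range`, and `input_index` are hypothetical helper names, and the constant simply restates the 5,000 bp flank described above.

```python
PANGOLIN_FLANK = 5000  # flank per side, restating the value given in the tips
DNA_ALPHABET = set("ACGTN")


def normalize_dna(sequence: str) -> str:
    """Uppercase a sequence and reject anything outside A/C/G/T/N."""
    upper = sequence.upper()
    bad = set(upper) - DNA_ALPHABET
    if bad:
        raise ValueError(f"non-DNA characters: {sorted(bad)}")
    return upper


def scored_range(sequence_length: int) -> tuple[int, int]:
    """Return (output_start, n_scored) for an input of the given length."""
    n_scored = sequence_length - 2 * PANGOLIN_FLANK
    if n_scored < 1:
        raise ValueError(
            f"need at least {2 * PANGOLIN_FLANK + 1} bp, got {sequence_length}"
        )
    return PANGOLIN_FLANK, n_scored


def input_index(output_index: int, output_start: int = PANGOLIN_FLANK) -> int:
    """Map an index into the score track back to an index into the input."""
    return output_start + output_index
```

Checking lengths this way before submitting a batch avoids spending a GPU call on a sequence that is too short to yield any scored positions.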
Pangolin Variant Splice Scoring (pangolin-score-variants)
Scores the splicing gain/loss effect of variants by comparing the predicted splice-site probability between the reference and alternate sequence.
API Reference
Input: PangolinScoreVariantsInput
Config: PangolinScoreVariantsConfig
- True is coerced to 1 and False to 0.
- BaseConfig.device because Pangolin is a GPU tool (default cuda).
- None waits indefinitely.
- BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: PangolinScoreVariantsOutput
Metrics (per results item)

| Metric | Type | Range | Availability |
|---|---|---|---|
| max_gain | float | -1.0 to 1.0 | always |
| max_loss | float | -1.0 to 1.0 | always |
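As a rough illustration of how summary metrics like these can be derived, the sketch below reduces per-tissue reference and alternate scores into a gain and a loss. It is not the toolkit's implementation. The function name `reduce_gain_loss` is hypothetical, the sign convention (loss reported as the most negative change) is an assumption, and the two score tracks are assumed to be aligned position by position, which holds for SNVs but needs extra handling for indels.

```python
def reduce_gain_loss(ref_scores, alt_scores, variant_index):
    """Reduce per-tissue score tracks to summary gain/loss metrics.

    ref_scores / alt_scores: one list of per-position scores per tissue,
    all of equal length and aligned to the same window positions.
    variant_index: index of the variant within that window.
    """
    n_positions = len(ref_scores[0])
    # Per-tissue, per-position change in splice-site probability.
    deltas = [
        [alt[i] - ref[i] for i in range(n_positions)]
        for ref, alt in zip(ref_scores, alt_scores)
    ]
    # Reduce across tissues: largest increase and largest decrease per position.
    gain = [max(d[i] for d in deltas) for i in range(n_positions)]
    loss = [min(d[i] for d in deltas) for i in range(n_positions)]
    gain_idx = max(range(n_positions), key=gain.__getitem__)
    loss_idx = min(range(n_positions), key=loss.__getitem__)
    return {
        "max_gain": gain[gain_idx],
        "increase_position": gain_idx - variant_index,  # bp relative to variant
        "max_loss": loss[loss_idx],  # assumed convention: negative = decrease
        "decrease_position": loss_idx - variant_index,
    }
```

Because each delta is a difference of two probabilities, every value falls between -1.0 and 1.0, which matches the range listed in the table.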
Applications
Use this to prioritize candidate splice-altering variants - SNVs and simple indels - by how much they are predicted to increase (gain) or decrease (loss) the splice-site probability near the variant. It suits variant-interpretation pipelines and saturation/screen analyses where each variant is supplied with its local reference window.
Usage Tips
- Variant scoring is sequence-centric: no genome FASTA is required. Provide each variant’s reference window (`sequence`), the 0-based `variant_position`, and the `reference_bases` / `alternate_bases` alleles. The reference allele must match the window at that position, and the variant needs 5,000 bp of flank on each side (`PANGOLIN_FLANK`).
- `distance` (default `50`) sets the ± reporting window around the variant. To report scores over the full window the sequence should provide `PANGOLIN_FLANK + distance` bp of flank on each side; with less context the reporting window is clipped to the available flank.
- `tissues` behaves as in prediction: gain and loss are reduced (max increase / max decrease) across the selected tissues.
- `max_gain` / `max_loss` summary metrics and the `increase_position` / `decrease_position` peaks are reported relative to the variant in bp.
- Annotation-based score masking (the upstream CLI `--mask` option) is not supported, because it requires exon annotations; raw gain/loss scores are returned.
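The input checks described in these tips can be sketched as follows. This is a simplified illustration under stated assumptions, not the wrapper's own logic: `apply_variant` and `reporting_window` are hypothetical names, and the window calculation treats the variant as a single position, ignoring the length of the reference allele.

```python
PANGOLIN_FLANK = 5000  # flank per side, restating the value given in the tips


def apply_variant(sequence, variant_position, reference_bases, alternate_bases):
    """Build the alternate sequence after checking the reference allele."""
    end = variant_position + len(reference_bases)
    observed = sequence[variant_position:end]
    if observed.upper() != reference_bases.upper():
        raise ValueError(
            f"reference allele {reference_bases!r} does not match "
            f"{observed!r} at 0-based position {variant_position}"
        )
    return sequence[:variant_position] + alternate_bases + sequence[end:]


def reporting_window(sequence_length, variant_position, distance=50):
    """Return (lo, hi) offsets around the variant that have enough flank.

    Offsets are in bp relative to the variant; the full window is
    (-distance, +distance) and shrinks when flank is short.
    """
    left_room = variant_position - PANGOLIN_FLANK
    right_room = (sequence_length - 1 - variant_position) - PANGOLIN_FLANK
    if left_room < 0 or right_room < 0:
        raise ValueError("variant needs PANGOLIN_FLANK bp of flank on each side")
    return -min(distance, left_room), min(distance, right_room)
```

With `PANGOLIN_FLANK + distance` bp on each side the function returns the full `(-distance, distance)` window; with exactly 5,000 bp of flank it collapses to the variant position alone.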
Toolkit Notes
These apply to both tools in this toolkit (pangolin-predict, pangolin-score-variants).
- GPU recommended. Pangolin runs on GPU (default `device="cuda"`) for practical throughput; CPU works but is slow, especially for long sequences or many variants.
- Model weights ship inside the pip package (~180 MB) and are installed automatically with the standalone environment - no separate weight download or gated access is required.
- Deterministic outputs. Pangolin inference is deterministic: the same sequence and tissue selection produce the same scores, so results are cacheable and reproducible.