Pangolin
License: Pangolin has a GPL-3.0 license and may require explicit attribution when utilized. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


tkzeng/Pangolin
Pangolin is a deep-learning method for predicting splice site strengths.
Predicting RNA splicing from DNA sequence using Pangolin
Tony Zeng and Yang I. Li
Genome Biology (2022)
@article{zeng_2022_pangolin,
  title={Predicting RNA splicing from DNA sequence using Pangolin},
  author={Zeng, Tony and Li, Yang I.},
  journal={Genome Biology},
  volume={23},
  number={1},
  pages={103},
  year={2022},
  doi={10.1186/s13059-022-02664-4}
}
proto-bio/proto-tools/proto_tools/tools/rna_splicing/pangolin
Function | Description
run_pangolin_predict() | Per-position tissue-specific splice-site probability prediction using Pangolin (GPU)
run_pangolin_score_variants() | Score the splicing effect (gain/loss) of variants using Pangolin (GPU)

Background

Pre-mRNA splicing removes introns and joins exons, with the spliceosome recognizing the donor (5’) and acceptor (3’) splice sites that bound each intron. Which sites are used, and how often, varies across tissues and underlies much of the transcriptome’s alternative splicing diversity. Variants that create or destroy splice sites are a major and frequently under-recognized cause of genetic disease, which motivates models that can read splicing regulation straight from sequence.

Pangolin extends the SpliceAI dilated-CNN approach in two directions (Zeng & Li, 2022). First, it is trained on quantitative, tissue-specific splicing measurements (including the fraction of transcripts that use a given site), and emits a per-tissue splice-site probability rather than SpliceAI’s single tissue-agnostic score. Second, it is trained across four tissues (heart, liver, brain, and testis) and across multiple species (human, rhesus, mouse, rat), which improves generalization and lets the model report tissue-specific predictions. The released model is an ensemble of checkpoints; predictions for a tissue average the relevant ensemble members. As with SpliceAI, the network consumes a wide window of flanking sequence to capture the long-range context that governs splice-site choice.

These tools expose the per-tissue splice-site probability score, the same P(splice) head that Pangolin’s reference CLI uses for variant scoring (not the separate transcript-usage head). Variant scoring reduces that score across the selected tissues into a per-position splice gain and loss.

Learning Resources

  • Pangolin repository (Zeng & Li, University of Chicago) - source, pretrained ensemble weights, and the reference CLI this wrapper mirrors.
  • Predicting RNA splicing from DNA sequence using Pangolin (Genome Biology, 2022) - the primary publication, with training setup, tissue/usage formulation, and variant-scoring benchmarks.
  • SpliceAI (Jaganathan et al., 2019) - the dilated-CNN splice-prediction model that Pangolin builds on.

Tools

Pangolin Splice-Site Prediction (pangolin-predict)

Predicts per-position, tissue-specific splice-site probability scores along one or more DNA sequences.

API Reference

  • sequences (List[string], required): DNA sequence(s) to score, each >= 10,001 bp (5,000 bp of flank on each side). Scores cover the central len - 10000 positions; a single string is wrapped to a list.

  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device to run the model on. Override of BaseConfig.device because Pangolin is a GPU tool (default cuda).
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
  • tissues (List[string], default: ['HEART', 'LIVER', 'BRAIN', 'TESTIS']): Tissues whose splice predictions are ensembled. Defaults to all four Pangolin tissues.

  • results (List[PangolinPrediction], required): Per-sequence predictions, 1:1 with the input sequences.

Applications

Use this to scan a gene, transcript, or designed sequence for where splice sites are predicted and how strongly, resolved by tissue. Typical workflows include mapping the donor/acceptor splice-score landscape of a locus, comparing predicted scores across heart/liver/brain/testis to find tissue-specific sites, and generating per-position tracks for downstream visualization or differential analysis.
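The tissue comparison described above can be sketched in plain Python. This is a minimal illustration, not part of the toolkit: it assumes the per-position scores have already been pulled out of a prediction as a list of rows (one row per scored position, one column per requested tissue, in the order passed to tissues). The function name tissue_specific_positions, the min_spread threshold, and the row layout are hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence, Union


def tissue_specific_positions(
    scores: Sequence[Sequence[float]],
    tissues: Sequence[str],
    min_spread: float = 0.3,
    output_start: int = 5000,
) -> List[Dict[str, Union[int, str, float]]]:
    """Flag positions whose score differs strongly between tissues.

    scores: one row per scored position, one column per tissue (assumed layout).
    tissues: tissue names in the same order as the score columns.
    min_spread: hypothetical threshold on (max - min) across tissues.
    output_start: input index of the first scored position.
    """
    hits = []
    for i, row in enumerate(scores):
        if len(row) != len(tissues):
            raise ValueError(f"row {i} has {len(row)} columns, expected {len(tissues)}")
        # Column index of the highest-scoring tissue at this position.
        top = max(range(len(row)), key=row.__getitem__)
        spread = row[top] - min(row)
        if spread >= min_spread:
            hits.append(
                {
                    "input_index": output_start + i,  # back to input coordinates
                    "top_tissue": tissues[top],
                    "spread": spread,
                }
            )
    return hits
```

The threshold is arbitrary; in practice it would be tuned against known tissue-specific sites for the locus of interest.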

Usage Tips

  • Each sequence needs 5,000 bp of flanking context on each side (PANGOLIN_FLANK). A length-N sequence yields predictions for the central N - 10000 positions, so the minimum input is 10,001 bp. The output_start field reports the input index (always 5000) of the first scored position.
  • tissues selects which of HEART, LIVER, BRAIN, TESTIS are ensembled (default: all four). The score columns are emitted in the order given by tissues, so request only the tissues you need and read columns by that order.
  • Inputs accept a single sequence string (auto-wrapped) or a list; outputs are 1:1 with inputs. Sequences are validated as DNA (A/C/G/T/N, uppercased) before scoring.
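The window arithmetic in the tips above can be checked before submitting a job. The following is a small sketch under the stated rules (5,000 bp of flank per side, A/C/G/T/N alphabet, scores starting at input index 5000); the helper names normalize_dna, scored_span, and input_index are hypothetical and are not functions of the toolkit.

```python
from typing import Tuple

PANGOLIN_FLANK = 5000  # flank required on each side, per the usage tips
VALID_BASES = set("ACGTN")


def normalize_dna(sequence: str) -> str:
    """Uppercase a sequence and check that it contains only A/C/G/T/N."""
    upper = sequence.upper()
    invalid = set(upper) - VALID_BASES
    if invalid:
        raise ValueError(f"non-DNA characters: {sorted(invalid)}")
    return upper


def scored_span(sequence: str) -> Tuple[int, int]:
    """Return (output_start, n_scored) for a length-N input.

    A length-N sequence yields N - 10000 scored positions, so N must be
    at least 10,001.
    """
    n_scored = len(sequence) - 2 * PANGOLIN_FLANK
    if n_scored < 1:
        raise ValueError(
            f"sequence is {len(sequence)} bp; need at least {2 * PANGOLIN_FLANK + 1} bp"
        )
    return PANGOLIN_FLANK, n_scored


def input_index(output_index: int, output_start: int = PANGOLIN_FLANK) -> int:
    """Map an index into the score track back to an index in the input."""
    return output_start + output_index
```

For example, scored_span("A" * 10001) returns (5000, 1), and row 0 of the score track corresponds to input index 5000.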

Pangolin Variant Splice Scoring (pangolin-score-variants)

Scores the splicing gain/loss effect of variants by comparing the predicted splice-site probability between the reference and alternate sequence.
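As a rough illustration of the reference-versus-alternate comparison, the sketch below reduces per-tissue score differences into per-position gain and loss, then into summary values. It makes several assumptions that are not confirmed by this page: the scores are rows per position with one column per tissue, the variant is a substitution so both tracks have the same length, gain is the largest (alt - ref) difference across tissues, and loss is the most negative one (so loss values are signed). The helper names gain_loss and summarize are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple, Union


def gain_loss(
    ref_scores: Sequence[Sequence[float]],
    alt_scores: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Reduce per-tissue (alt - ref) differences to per-position gain and loss."""
    if len(ref_scores) != len(alt_scores):
        raise ValueError("this sketch only handles equal-length tracks (substitutions)")
    gain, loss = [], []
    for ref_row, alt_row in zip(ref_scores, alt_scores):
        deltas = [a - r for r, a in zip(ref_row, alt_row)]
        gain.append(max(deltas))  # largest increase across tissues
        loss.append(min(deltas))  # largest decrease across tissues (signed)
    return gain, loss


def summarize(
    gain: Sequence[float], loss: Sequence[float], variant_index: int
) -> Dict[str, Union[int, float]]:
    """Pick the peak gain and loss and report their offsets from the variant."""
    inc = max(range(len(gain)), key=gain.__getitem__)
    dec = min(range(len(loss)), key=loss.__getitem__)
    return {
        "max_gain": gain[inc],
        "increase_position": inc - variant_index,  # bp relative to the variant
        "max_loss": loss[dec],
        "decrease_position": dec - variant_index,
    }
```

Indels shift positions between the two tracks and need an alignment step that is deliberately left out here.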

API Reference

  • variants (List[PangolinVariant], required): Variants to score. A single variant is auto-wrapped into a list.

  • distance (integer, default: 50): Number of bp on each side of the variant included in the reporting window. Defaults to 50 (matching the Pangolin CLI).
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device to run the model on. Override of BaseConfig.device because Pangolin is a GPU tool (default cuda).
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
  • tissues (List[string], default: ['HEART', 'LIVER', 'BRAIN', 'TESTIS']): Tissues whose splice predictions are ensembled. Defaults to all four Pangolin tissues.

  • results (List[PangolinVariantEffect], required): Per-variant splice-effect scores.
Metrics (one set per results item)
Metric | Type | Range | Availability
max_gain | float | -1.0 to 1.0 | always
max_loss | float | -1.0 to 1.0 | always

Applications

Use this to prioritize candidate splice-altering variants - SNVs and simple indels - by how much they are predicted to increase (gain) or decrease (loss) the splice-site probability near the variant. It suits variant-interpretation pipelines and saturation/screen analyses where each variant is supplied with its local reference window.

Usage Tips

  • Variant scoring is sequence-centric: no genome FASTA is required. Provide each variant’s reference window (sequence), the 0-based variant_position, and the reference_bases/alternate_bases alleles. The reference allele must match the window at that position, and the variant needs 5,000 bp of flank on each side (PANGOLIN_FLANK).
  • distance (default 50) sets the ± reporting window around the variant. To report scores over the full window the sequence should provide PANGOLIN_FLANK + distance bp of flank on each side; with less context the reporting window is clipped to the available flank.
  • tissues behaves as in prediction: gain and loss are reduced (max increase / max decrease) across the selected tissues. max_gain/max_loss summary metrics and the increase_position/decrease_position peaks are reported relative to the variant in bp.
  • Annotation-based score masking (the upstream CLI --mask option) is not supported, because it requires exon annotations; raw gain/loss scores are returned.
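The window requirements in the tips above can be checked up front with a sketch like the one below. It is an interpretation, not the toolkit's own validation: the helper names are hypothetical, the right-hand flank is assumed to be measured from the end of the reference allele, and the clipping rule (each side limited to the flank available beyond PANGOLIN_FLANK) is one reading of "clipped to the available flank".

```python
from typing import Tuple

PANGOLIN_FLANK = 5000  # flank required on each side of the variant


def check_variant_window(
    sequence: str, variant_position: int, reference_bases: str
) -> Tuple[int, int]:
    """Validate a variant against its reference window.

    Returns (left_flank, right_flank) in bp. variant_position is 0-based.
    """
    seq = sequence.upper()
    ref = reference_bases.upper()
    end = variant_position + len(ref)
    if seq[variant_position:end] != ref:
        raise ValueError(
            f"reference allele {ref!r} does not match window at {variant_position}"
        )
    left_flank = variant_position
    right_flank = len(seq) - end  # assumption: measured after the ref allele
    if left_flank < PANGOLIN_FLANK or right_flank < PANGOLIN_FLANK:
        raise ValueError(f"need {PANGOLIN_FLANK} bp of flank on each side")
    return left_flank, right_flank


def reporting_window(
    left_flank: int, right_flank: int, distance: int = 50
) -> Tuple[int, int]:
    """Return the (upstream, downstream) bp that can be reported.

    The full +/- distance window needs PANGOLIN_FLANK + distance bp per side;
    with less context each side is clipped to what is available.
    """
    return (
        min(distance, left_flank - PANGOLIN_FLANK),
        min(distance, right_flank - PANGOLIN_FLANK),
    )
```

A window with 5,020 bp upstream and 5,100 bp downstream would therefore report 20 bp upstream and the full 50 bp downstream at the default distance.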

Toolkit Notes

These apply to both tools in this toolkit (pangolin-predict, pangolin-score-variants).
  • GPU recommended. Pangolin runs on GPU (default device="cuda") for practical throughput; CPU works but is slow, especially for long sequences or many variants.
  • Model weights ship inside the pip package (~180 MB) and are installed automatically with the standalone environment - no separate weight download or gated access is required.
  • Deterministic outputs. Pangolin inference is deterministic up to small GPU floating-point noise: the same sequence and tissue selection produce the same scores, so results are cacheable and reproducible.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.