Pangolin - Proto

License: Pangolin is licensed under GPL-3.0 and may require explicit attribution when used. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.

tkzeng/Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.

Predicting RNA splicing from DNA sequence using Pangolin

Tony Zeng and Yang I. Li

Genome Biology (2022)

@article{zeng_2022_pangolin,
  title={Predicting RNA splicing from DNA sequence using Pangolin},
  author={Zeng, Tony and Li, Yang I.},
  journal={Genome Biology},
  volume={23},
  number={1},
  pages={103},
  year={2022},
  doi={10.1186/s13059-022-02664-4}
}

proto-bio/proto-tools/proto_tools/tools/rna_splicing/pangolin

Function	Description
`run_pangolin_predict()`	Per-position tissue-specific splice-site probability prediction using Pangolin (GPU)
`run_pangolin_score_variants()`	Score the splicing effect (gain/loss) of variants using Pangolin (GPU)

Background

Pre-mRNA splicing removes introns and joins exons, with the spliceosome recognizing the donor (5′) and acceptor (3′) splice sites that bound each intron. Which sites are used, and how often, varies across tissues and underlies much of the transcriptome’s alternative splicing diversity. Variants that create or destroy splice sites are a major and frequently under-recognized cause of genetic disease, which motivates models that can read splicing regulation straight from sequence.

Pangolin extends the SpliceAI dilated-CNN approach in two directions (Zeng & Li, 2022). First, it is trained on quantitative, tissue-specific splicing measurements (including the fraction of transcripts that use a given site), and emits a per-tissue splice-site probability rather than SpliceAI’s single tissue-agnostic score. Second, it is trained across four tissues (heart, liver, brain, and testis) and across multiple species (human, rhesus, mouse, rat), which improves generalization and lets the model report tissue-specific predictions. The released model is an ensemble of checkpoints; predictions for a tissue average the relevant ensemble members. As with SpliceAI, the network consumes a wide window of flanking sequence to capture the long-range context that governs splice-site choice.

These tools expose the per-tissue splice-site probability score, the same P(splice) head that Pangolin’s reference CLI uses for variant scoring (not the separate transcript-usage head). Variant scoring reduces that score across the selected tissues into a per-position splice gain and loss.
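The ensemble averaging mentioned above can be illustrated with a minimal sketch. It assumes each ensemble member yields one score per position for a given tissue; the function name `average_ensemble` and the list-of-lists layout are hypothetical and are not taken from Pangolin or the wrapper.

```python
from statistics import fmean


def average_ensemble(member_scores):
    """Average per-position scores across ensemble members for one tissue.

    member_scores: one equal-length list of scores per checkpoint
    (hypothetical layout, chosen only for illustration).
    """
    if not member_scores:
        raise ValueError("need at least one ensemble member")
    length = len(member_scores[0])
    if any(len(member) != length for member in member_scores):
        raise ValueError("all members must score the same positions")
    # zip(*...) groups the scores that belong to the same position.
    return [fmean(position) for position in zip(*member_scores)]


# Three made-up members scoring the same three positions.
heart_members = [
    [0.10, 0.80, 0.02],
    [0.20, 0.90, 0.04],
    [0.30, 0.70, 0.06],
]
heart_scores = average_ensemble(heart_members)
```

Averaging smooths out checkpoint-to-checkpoint variation, so a single per-tissue track is reported per position.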

Learning Resources

Pangolin repository (Zeng & Li, University of Chicago) - source, pretrained ensemble weights, and the reference CLI this wrapper mirrors.
Predicting RNA splicing from DNA sequence using Pangolin (Genome Biology, 2022) - the primary publication, with training setup, tissue/usage formulation, and variant-scoring benchmarks.
SpliceAI (Jaganathan et al., 2019) - the dilated-CNN splice-prediction model that Pangolin builds on.

Tools

Pangolin Splice-Site Prediction (`pangolin-predict`)

Predicts per-position, tissue-specific splice-site probability scores along one or more DNA sequences.

API Reference

Input: PangolinPredictInput

sequences (List[string], required): DNA sequence(s) to score, each >= 10,001 bp (5,000 bp of flank on each side). Scores cover the central len - 10000 positions; a single string is wrapped to a list.

Config: PangolinPredictConfig

verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device to run the model on. Override of BaseConfig.device because Pangolin is a GPU tool (default cuda).
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
tissues (List[string], default: ['HEART', 'LIVER', 'BRAIN', 'TESTIS']): Tissues whose splice predictions are ensembled. Defaults to all four Pangolin tissues.

Output: PangolinPredictOutput

results (List[PangolinPrediction], required): Per-sequence predictions, 1:1 with the input sequences.

PangolinPrediction fields:

scores (List[array], required): Per-position splice-site probability scores (the per-tissue P(splice) head) with shape [len(sequence) - 2 * PANGOLIN_FLANK][len(tissues)]. Column order matches tissues.
tissues (List[string], required): Tissue order of the score columns.
output_start (integer, required): Index in the input sequence of the first scored position (always PANGOLIN_FLANK).
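The relationship between input length, the shape of scores, and output_start is plain arithmetic, sketched below. The helper names (`scored_window`, `row_to_input_index`, `input_index_to_row`) are hypothetical and not part of the wrapper; only the 5,000 bp flank is taken from this reference.

```python
PANGOLIN_FLANK = 5000  # flank per side, as stated in this reference


def scored_window(sequence_length, flank=PANGOLIN_FLANK):
    """Return (output_start, num_scored) for an input of the given length."""
    num_scored = sequence_length - 2 * flank
    if num_scored < 1:
        raise ValueError(f"sequence must be at least {2 * flank + 1} bp")
    return flank, num_scored


def row_to_input_index(row, output_start=PANGOLIN_FLANK):
    """Map a row of the scores array back to an index in the input sequence."""
    return output_start + row


def input_index_to_row(index, sequence_length, flank=PANGOLIN_FLANK):
    """Map an input index to its scores row; flank positions are not scored."""
    output_start, num_scored = scored_window(sequence_length, flank)
    row = index - output_start
    if not 0 <= row < num_scored:
        raise IndexError("position lies in the unscored flank")
    return row


# A 12,000 bp input yields 2,000 scored rows starting at input index 5000.
start, count = scored_window(12_000)
```

Keeping this mapping explicit avoids off-by-flank errors when score rows are joined back to genomic or construct coordinates.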

Applications

Use this to scan a gene, transcript, or designed sequence for where splice sites are predicted and how strongly, resolved by tissue. Typical workflows include mapping the donor/acceptor splice-score landscape of a locus, comparing predicted scores across heart/liver/brain/testis to find tissue-specific sites, and generating per-position tracks for downstream visualization or differential analysis.
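One way to look for tissue-specific sites is to compare the score columns row by row. The sketch below assumes scores is available as a list of rows with one column per tissue; the function `tissue_specific_sites` and both thresholds are hypothetical choices for illustration, not values recommended by Pangolin.

```python
def tissue_specific_sites(scores, tissues, min_score=0.5, min_spread=0.3):
    """Flag rows where one tissue scores high and the tissues disagree.

    scores: rows are positions, columns follow the order of `tissues`.
    min_score, min_spread: illustrative thresholds (assumptions).
    Returns a list of (row, top_tissue, top_score, spread) tuples.
    """
    hits = []
    for row, per_tissue in enumerate(scores):
        top = max(per_tissue)
        spread = top - min(per_tissue)
        if top >= min_score and spread >= min_spread:
            top_tissue = tissues[per_tissue.index(top)]
            hits.append((row, top_tissue, top, spread))
    return hits


tissues = ["HEART", "LIVER", "BRAIN", "TESTIS"]
scores = [
    [0.01, 0.02, 0.01, 0.03],  # no site in any tissue
    [0.90, 0.85, 0.88, 0.92],  # strong site shared by all tissues
    [0.20, 0.10, 0.75, 0.15],  # site that is strong mainly in brain
]
hits = tissue_specific_sites(scores, tissues)
```

Here only the third row is flagged, because it is the only one that is both strong in some tissue and clearly weaker in another.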

Usage Tips

Each sequence needs 5,000 bp of flanking context on each side (PANGOLIN_FLANK). A length-N sequence yields predictions for the central N - 10000 positions, so the minimum input is 10,001 bp. The output_start field reports the input index (always 5000) of the first scored position.
tissues selects which of HEART, LIVER, BRAIN, TESTIS are ensembled (default: all four). The score columns are emitted in the order given by tissues, so request only the tissues you need and read columns by that order.
Inputs accept a single sequence string (auto-wrapped) or a list; outputs are 1:1 with inputs. Sequences are validated as DNA (A/C/G/T/N, uppercased) before scoring.
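The input rules in these tips can be checked before submitting a job. This is a minimal pre-flight sketch under the stated rules (auto-wrapping, uppercasing, A/C/G/T/N alphabet, minimum length), plus a helper for reading one tissue column by name. `normalize_sequences` and `tissue_column` are hypothetical helpers and do not reproduce the wrapper's own validation code.

```python
PANGOLIN_FLANK = 5000  # flank per side, as stated in this reference
VALID_BASES = frozenset("ACGTN")


def normalize_sequences(sequences, flank=PANGOLIN_FLANK):
    """Wrap a single string, uppercase, and check alphabet and length."""
    if isinstance(sequences, str):
        sequences = [sequences]
    normalized = []
    for i, seq in enumerate(sequences):
        seq = seq.upper()
        bad = set(seq) - VALID_BASES
        if bad:
            raise ValueError(f"sequence {i} has non-DNA characters: {sorted(bad)}")
        if len(seq) < 2 * flank + 1:
            raise ValueError(
                f"sequence {i} is {len(seq)} bp; need at least {2 * flank + 1}"
            )
        normalized.append(seq)
    return normalized


def tissue_column(scores, tissues, tissue):
    """Extract one tissue's track using the reported column order."""
    column = tissues.index(tissue)
    return [row[column] for row in scores]


ready = normalize_sequences("acgtn" * 2001)  # 10,005 bp, lowercase input
brain_first = tissue_column([[0.1, 0.9], [0.2, 0.8]], ["BRAIN", "HEART"], "HEART")
```

Looking columns up by name rather than by a fixed index keeps downstream code correct when a subset or a different order of tissues is requested.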

Pangolin Variant Splice Scoring (`pangolin-score-variants`)

Scores the splicing gain/loss effect of variants by comparing the predicted splice-site probability between the reference and alternate sequence.

API Reference

Input: PangolinScoreVariantsInput

variants (List[PangolinVariant], required): Variants to score. A single variant is auto-wrapped into a list.

PangolinVariant fields:

sequence (string, required): Reference DNA window containing the variant.
variant_position (integer, required): 0-based index of the variant in sequence.
reference_bases (string, required): Reference allele, e.g. 'A' or 'AC'.
alternate_bases (string, required): Alternate allele, e.g. 'G' or 'GTT'.
strand (enum, default: "+"): Strand to score on. Defaults to '+'. Available options: +, -
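How these fields combine into an alternate sequence can be sketched as follows. The function `apply_variant` is a hypothetical illustration of the 0-based substitution and the reference-allele check; it is not the wrapper's implementation, and it does not model strand handling.

```python
def apply_variant(sequence, variant_position, reference_bases, alternate_bases):
    """Return the alternate sequence for an SNV or simple indel.

    variant_position is the 0-based index of the first reference base.
    Raises ValueError if the reference allele does not match the window.
    """
    sequence = sequence.upper()
    ref = reference_bases.upper()
    alt = alternate_bases.upper()
    end = variant_position + len(ref)
    if variant_position < 0 or end > len(sequence):
        raise ValueError("variant lies outside the window")
    observed = sequence[variant_position:end]
    if observed != ref:
        raise ValueError(
            f"reference allele {ref!r} does not match window ({observed!r})"
        )
    # Splice the alternate allele in place of the reference allele.
    return sequence[:variant_position] + alt + sequence[end:]


snv = apply_variant("AACGT", 2, "C", "G")          # substitution
deletion = apply_variant("AACGT", 1, "AC", "A")    # one base removed
insertion = apply_variant("AACGT", 2, "C", "CTT")  # two bases added
```

Running this check locally catches coordinate mistakes, such as 1-based positions, before any scoring job is launched.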

Config: PangolinScoreVariantsConfig

distance (integer, default: 50): Number of bp on each side of the variant included in the reporting window. Defaults to 50 (matching the Pangolin CLI).
verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device to run the model on. Override of BaseConfig.device because Pangolin is a GPU tool (default cuda).
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
tissues (List[string], default: ['HEART', 'LIVER', 'BRAIN', 'TESTIS']): Tissues whose splice predictions are ensembled. Defaults to all four Pangolin tissues.

Output: PangolinScoreVariantsOutput

results (List[PangolinVariantEffect], required): Per-variant splice-effect scores.

PangolinVariantEffect fields:

loss_scores (List[number], required): Per-position splice-loss scores over the window.
gain_scores (List[number], required): Per-position splice-gain scores over the window.
increase_position (integer, required): Offset in bp from the variant of the largest increase.
increase_score (number, required): Score at increase_position.
decrease_position (integer, required): Offset in bp from the variant of the largest decrease.
decrease_score (number, required): Score at decrease_position.
metrics (PangolinVariantMetrics, required): Scalar splice-effect summary metrics.

Metrics (one set per results item)

Metric	Type	Range	Availability
`max_gain`	float	-1.0 to 1.0	always
`max_loss`	float	-1.0 to 1.0	always
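The relationship between the per-position arrays and the peak fields can be sketched for the simplest case. The code assumes an SNV (so reference and alternate rows align one to one), assumes gain and loss are the per-position maximum and minimum of alt minus ref across tissues, and assumes the first window row sits at offset `window_start_offset` from the variant. The function `reduce_variant_effect` and these conventions are illustrative assumptions, not the wrapper's confirmed implementation.

```python
def reduce_variant_effect(ref_scores, alt_scores, window_start_offset):
    """Reduce per-tissue ref/alt scores to gain/loss tracks and their peaks.

    ref_scores, alt_scores: rows are window positions, columns are tissues.
    window_start_offset: offset in bp of the first row from the variant
    (for example -distance for an unclipped window).
    """
    gain_scores, loss_scores = [], []
    for ref_row, alt_row in zip(ref_scores, alt_scores):
        deltas = [alt - ref for ref, alt in zip(ref_row, alt_row)]
        gain_scores.append(max(deltas))  # largest increase across tissues
        loss_scores.append(min(deltas))  # largest decrease across tissues
    rows = range(len(gain_scores))
    inc = max(rows, key=gain_scores.__getitem__)
    dec = min(rows, key=loss_scores.__getitem__)
    return {
        "gain_scores": gain_scores,
        "loss_scores": loss_scores,
        "increase_position": window_start_offset + inc,
        "increase_score": gain_scores[inc],
        "decrease_position": window_start_offset + dec,
        "decrease_score": loss_scores[dec],
    }


# Made-up scores for two tissues over offsets -2..+2 around a variant.
ref = [[0.10, 0.10], [0.80, 0.70], [0.00, 0.00], [0.05, 0.05], [0.20, 0.20]]
alt = [[0.10, 0.10], [0.30, 0.60], [0.00, 0.00], [0.65, 0.25], [0.20, 0.20]]
effect = reduce_variant_effect(ref, alt, window_start_offset=-2)
```

In this toy case a site one base upstream of the variant is weakened while a new site one base downstream is strengthened, which is the typical signature of a variant that shifts a splice site.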

Applications

Use this to prioritize candidate splice-altering variants - SNVs and simple indels - by how much they are predicted to increase (gain) or decrease (loss) the splice-site probability near the variant. It suits variant-interpretation pipelines and saturation/screen analyses where each variant is supplied with its local reference window.

Usage Tips

Variant scoring is sequence-centric: no genome FASTA is required. Provide each variant’s reference window (sequence), the 0-based variant_position, and the reference_bases/alternate_bases alleles. The reference allele must match the window at that position, and the variant needs 5,000 bp of flank on each side (PANGOLIN_FLANK).
distance (default 50) sets the ± reporting window around the variant. To report scores over the full window the sequence should provide PANGOLIN_FLANK + distance bp of flank on each side; with less context the reporting window is clipped to the available flank.
tissues behaves as in prediction: gain and loss are reduced (max increase / max decrease) across the selected tissues. max_gain/max_loss summary metrics and the increase_position/decrease_position peaks are reported relative to the variant in bp.
Annotation-based score masking (the upstream CLI --mask option) is not supported, because it requires exon annotations; raw gain/loss scores are returned.
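The interaction between distance and the available flank can be worked out ahead of time. The sketch below assumes a position is scorable only if it has PANGOLIN_FLANK bases on each side within the supplied window, and that the reporting window is clipped to the scorable range. `reporting_window` is a hypothetical helper and the clipping rule is an assumption based on the tips above.

```python
PANGOLIN_FLANK = 5000  # flank per side, as stated in this reference


def reporting_window(sequence_length, variant_position, distance=50,
                     flank=PANGOLIN_FLANK):
    """Return inclusive (first_offset, last_offset) relative to the variant."""
    first_scorable = flank
    last_scorable = sequence_length - flank - 1
    if not first_scorable <= variant_position <= last_scorable:
        raise ValueError("variant needs at least `flank` bp on each side")
    # Clip the +/- distance window to positions that can be scored.
    start = max(variant_position - distance, first_scorable)
    end = min(variant_position + distance, last_scorable)
    return start - variant_position, end - variant_position


# Full context: PANGOLIN_FLANK + distance bp on each side of the variant.
full = reporting_window(10_101, 5050)
# Only 10 bp beyond the flank on each side: the window shrinks to +/- 10.
clipped = reporting_window(10_021, 5010)
```

A window of 2 * (PANGOLIN_FLANK + distance) + 1 bp centered on the variant is therefore the smallest input that reports the full ± distance range for an SNV.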

Toolkit Notes

These apply to both tools in this toolkit (pangolin-predict, pangolin-score-variants).

GPU recommended. Pangolin runs on GPU (default device="cuda") for practical throughput; CPU works but is slow, especially for long sequences or many variants.
Model weights ship inside the pip package (~180 MB) and are installed automatically with the standalone environment - no separate weight download or gated access is required.
Deterministic outputs. Pangolin inference is deterministic: the same sequence and tissue selection produce the same scores, so results are cacheable and reproducible.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
