SpliceAI - Proto

License: SpliceAI's code is released under a custom license (PolyForm Strict License 1.0.0) and its model weights under CC-BY-NC-4.0. Both restrict commercial use and may require explicit attribution. Please refer to the code license and model weights license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Illumina/SpliceAI

A deep learning-based tool to identify splice variants


Predicting Splicing from Primary Sequence with Deep Learning

Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, … Kyle Kai-How Farh

Cell (2019)


@article{jaganathan2019predicting,
  title = {Predicting Splicing from Primary Sequence with Deep Learning},
  author = {Jaganathan, Kishore and Kyriazopoulou Panagiotopoulou, Sofia and McRae, Jeremy F. and Darbandi, Siavash Fazel and Knowles, David and Li, Yang I. and Kosmicki, Jack A. and Arbelaez, Juan and Cui, Wenwu and Schwartz, Grace B. and Chow, Eric D. and Kanterakis, Efstathios and Gao, Hong and Kia, Amirali and Batzoglou, Serafim and Sanders, Stephan J. and Farh, Kyle Kai-How},
  journal = {Cell},
  volume = {176},
  number = {3},
  pages = {535--548.e24},
  year = {2019},
  publisher = {Elsevier},
  doi = {10.1016/j.cell.2018.12.015}
}


proto-bio/proto-tools/proto_tools/tools/rna_splicing/spliceai


Function	Description
`run_spliceai_predict()`	Predict per-position acceptor/donor splice-site probabilities from DNA sequence with SpliceAI (GPU)
`run_spliceai_score()`	Score variants for splice-altering effects (delta scores/positions) with SpliceAI (GPU)

Background

RNA splicing removes introns from pre-mRNA and joins exons, guided by sequence motifs at the donor (5’) and acceptor (3’) splice sites. Variants that create or disrupt these motifs can cause exon skipping, intron retention, or cryptic splicing, and are a major and frequently overlooked class of disease-causing mutations. SpliceAI (Jaganathan et al., 2019) is a deep dilated residual convolutional network that reads 10,000 bp of flanking context (5,000 bp per side) and outputs, for every position, the probability of being an acceptor, a donor, or neither. For variant interpretation, SpliceAI compares predictions for the reference and alternate sequences and reports four delta scores in [0, 1] — acceptor gain (DS_AG), acceptor loss (DS_AL), donor gain (DS_DG), and donor loss (DS_DL) — together with the delta positions (DP_*) of the affected sites relative to the variant. The maximum delta score is the headline number: the paper characterizes cutoffs of 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision). The shipped model is an ensemble of five models whose per-position outputs are averaged. All variant coordinates follow the 1-based VCF convention.
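The delta-score comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration, not SpliceAI's own implementation: the function name `delta_scores`, its arguments, and the clipping of scores at zero are assumptions made here, and it only handles substitutions, where the reference and alternate predictions line up position for position.

```python
import numpy as np

def delta_scores(ref_probs, alt_probs, variant_index, max_distance=50):
    """Illustrative delta scores for a substitution (hypothetical helper).

    ref_probs, alt_probs: arrays of shape (L, 3) holding
    [neither, acceptor, donor] probabilities for the same L positions.
    variant_index: 0-based index of the variant within those arrays.
    """
    ref_probs = np.asarray(ref_probs, dtype=float)
    alt_probs = np.asarray(alt_probs, dtype=float)
    if ref_probs.shape != alt_probs.shape or ref_probs.shape[1] != 3:
        raise ValueError("expected two aligned (L, 3) arrays")

    # Only sites within max_distance of the variant are considered.
    lo = max(0, variant_index - max_distance)
    hi = min(len(ref_probs), variant_index + max_distance + 1)
    diff = alt_probs[lo:hi] - ref_probs[lo:hi]  # positive means gain

    out = {}
    # (name, channel, sign): channel 1 = acceptor, 2 = donor; sign -1 = loss.
    for name, channel, sign in [("AG", 1, 1), ("AL", 1, -1),
                                ("DG", 2, 1), ("DL", 2, -1)]:
        signed = sign * diff[:, channel]
        best = int(np.argmax(signed))
        # Assumption: clip at 0 so every score stays in [0, 1].
        out["DS_" + name] = max(0.0, float(signed[best]))
        # Delta position is reported relative to the variant.
        out["DP_" + name] = lo + best - variant_index
    return out

# Toy example: a donor 10 bp downstream is weakened, and an acceptor
# 5 bp upstream is created.
ref = np.zeros((101, 3)); ref[:, 0] = 1.0
alt = ref.copy()
ref[60, 2], alt[60, 2] = 0.9, 0.1
alt[45, 1] = 0.7
scores = delta_scores(ref, alt, variant_index=50)
max_delta_score = max(v for k, v in scores.items() if k.startswith("DS_"))
```

Insertions and deletions shift positions between the two sequences, so a real implementation has to realign the predictions before differencing; that step is deliberately left out of this sketch.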

Learning Resources

SpliceAI repository (Illumina) - the canonical CLI, the Annotator/get_delta_scores Python API, and the bundled GENCODE annotations and ensemble weights.
Jaganathan et al., 2019 (Cell) - the original paper describing the architecture, training data, and clinical validation of delta scores.

Tools

SpliceAI Variant Scoring (`spliceai-score`)

Scores genetic variants (chromosome / 1-based position / ref / alt) for splice-altering effects, returning per-gene delta scores and delta positions for acceptor and donor gain/loss. Requires a reference genome FASTA and a gene annotation (the bundled grch37/grch38, or a custom file).
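As a rough sketch of how the four variant fields might be derived from a VCF data line, the snippet below parses and validates one record. `VariantRecord` and `variant_from_vcf_line` are hypothetical stand-ins written for this example; they mirror the documented field names (chromosome, position, ref, alt) but are not the toolkit's own classes.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGTN")

@dataclass(frozen=True)
class VariantRecord:
    """Hypothetical stand-in mirroring the documented variant fields."""
    chromosome: str
    position: int  # 1-based, VCF convention
    ref: str
    alt: str

def variant_from_vcf_line(line):
    """Parse the first five tab-separated VCF columns into a VariantRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least CHROM, POS, ID, REF, ALT")
    chrom, pos, _id, ref, alt = fields[:5]
    position = int(pos)
    if position < 1:
        raise ValueError("position must be 1-based (>= 1)")
    ref, alt = ref.upper(), alt.upper()
    for allele in (ref, alt):
        # Rejects empty alleles, symbolic alleles, and comma-separated
        # multi-allelic ALT values; split those before calling this.
        if not allele or not set(allele) <= VALID_BASES:
            raise ValueError(f"unsupported allele: {allele!r}")
    return VariantRecord(chrom, position, ref, alt)

record = variant_from_vcf_line("chr1\t12345\t.\tA\tG\t.\t.\t.")
```

Validating alleles up front keeps malformed records from reaching the scorer, which is the slow step.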

API Reference


Input: SpliceAIScoreInput

variants

List[SpliceAIVariant]

required

Variants to score. A single variant is auto-wrapped into a list.

SpliceAIVariant fields

chromosome

string

required

Chromosome identifier, matching the reference FASTA and annotation (e.g. 'chr1' or '1' — be consistent across all three).

position

integer

required

Variant position, 1-based (VCF convention).

ref

string

required

Reference allele, e.g. 'A' or 'AC' (DNA bases A/C/G/T/N).

alt

string

required

Alternate allele, e.g. 'G' or 'GTT' (DNA bases A/C/G/T/N).


Config: SpliceAIScoreConfig

reference_fasta

string

Path (or AssetRef) to the reference genome FASTA. Required at call time — SpliceAI extracts the wild-type sequence around each variant from this genome. None raises a MissingAssetError so un-provisioned hosts skip cleanly.

annotation

string

default:"grch38"

Gene annotation source: 'grch37' or 'grch38' (GENCODE files bundled with SpliceAI) or a path to a custom tab-separated annotation file.

max_distance

integer

default:"50"

Maximum distance (bp) between the variant and a gained/lost splice site to report (the SpliceAI -D flag).

mask

boolean

default:"False"

Mask scores for annotated acceptor/donor gain and unannotated acceptor/donor loss (the SpliceAI -M flag).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run inference on. SpliceAI (TensorFlow) auto-falls-back to CPU when no GPU is visible.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.


Output: SpliceAIScoreOutput

results

List[SpliceAIVariantResult]

required

Per-variant scores, 1:1 with the input variants and in the same order.

SpliceAIVariantResult fields

chromosome

string

required

Variant chromosome.

position

integer

required

Variant position (1-based).

ref

string

required

Reference allele.

alt

string

required

Alternate allele.

scores

List[SpliceAIGeneScore]

required

One record per gene the variant overlaps (empty if it overlaps no annotated gene).

metrics

SpliceAIScoreMetrics

required

Per-variant scalar metric (max delta score).

Metrics (one set per results item)

Metric	Type	Range	Availability
`max_delta_score`	float	0.0 to 1.0	present for scored variants overlapping an annotated gene

Applications

Use this to triage candidate variants from a sequencing study for splicing impact, to annotate a VCF with SpliceAI predictions, or to prioritize variants of uncertain significance where a coding effect is absent but a splicing effect is plausible. The max_delta_score metric supports threshold-based filtering at the 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision) cutoffs characterized in the paper.
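A minimal sketch of threshold-based triage on max_delta_score follows. The function name `triage` and the tier labels are hypothetical; only the three cutoffs come from the paper as cited above. Because the metric is present only for scored variants that overlap an annotated gene, the sketch treats a missing value as its own case.

```python
def triage(max_delta_score):
    """Map a max delta score to an illustrative tier label (hypothetical)."""
    if max_delta_score is None:
        # No annotated gene overlapped, so no score was produced.
        return "unscored"
    if max_delta_score >= 0.8:
        return "high_precision"
    if max_delta_score >= 0.5:
        return "recommended"
    if max_delta_score >= 0.2:
        return "high_recall"
    return "below_cutoff"

tiers = [triage(s) for s in (0.93, 0.5, 0.21, 0.04, None)]
```

Which tier to act on depends on the study: a screening pass might keep everything at or above 0.2, while a short list for follow-up assays might use 0.8.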

Usage Tips

reference_fasta is required and position is 1-based. SpliceAI extracts the wild-type window around each variant from the genome you supply, so the FASTA, the annotation, and each variant’s chromosome must use consistent identifiers. Note that this differs from AlphaGenome, whose coordinates are 0-based.
annotation selects the gene model. grch37 and grch38 load the GENCODE files bundled with SpliceAI; pass a path to score against a custom tab-separated annotation. Changing it restarts the worker.
max_distance (default 50) and mask mirror the SpliceAI -D/-M flags. Widen max_distance to report splice sites farther from the variant; enable mask to suppress scores for annotated-gain and unannotated-loss positions.
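The identifier and coordinate pitfalls in the tips above can be caught before scoring with a small pre-flight check. This is a sketch under stated assumptions: both helper names are hypothetical, and the contig names are assumed to have been read from the FASTA headers (or its index) beforehand.

```python
def check_chromosome_naming(variant_chroms, fasta_contigs):
    """Return (missing, hints) for variant chromosomes absent from the FASTA.

    hints maps a missing name to the 'chr'-prefixed or unprefixed form
    when that form does exist, which usually signals a naming mismatch.
    """
    contigs = set(fasta_contigs)
    missing = sorted(set(variant_chroms) - contigs)
    hints = {}
    for name in missing:
        other = name[3:] if name.startswith("chr") else "chr" + name
        if other in contigs:
            hints[name] = other
    return missing, hints

def to_one_based(zero_based_position):
    """Convert a 0-based position to the 1-based VCF convention."""
    return zero_based_position + 1

missing, hints = check_chromosome_naming(
    ["1", "chr2", "chrX"], ["chr1", "chr2", "chrX"]
)
```

A non-empty `hints` mapping suggests renaming the variant chromosomes rather than hunting for a missing contig.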

SpliceAI Splice-Site Prediction (`spliceai-predict`)

Predicts per-position [neither, acceptor, donor] probabilities directly from one or more DNA sequences. No reference genome is needed — the model runs on the sequence as given, padding 5,000 bp of context per side internally.
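To make the internal padding concrete, here is a sketch of N-padding followed by one-hot encoding. The tool does this for you; the helper below is written only for illustration. The function name, the A/C/G/T channel order, and the choice to encode N (and any other character) as an all-zero row are assumptions of this sketch.

```python
import numpy as np

CONTEXT_PER_SIDE = 5000  # flanking context on each side, as described above

def pad_and_encode(sequence, context_per_side=CONTEXT_PER_SIDE):
    """Pad a DNA string with N on both sides and one-hot encode it.

    Returns a float32 array of shape (len(sequence) + 2 * context_per_side, 4).
    """
    flank = "N" * context_per_side
    padded = flank + sequence.upper() + flank

    # Lookup table indexed by ASCII code: A, C, G, T get a one-hot row;
    # everything else, including N, stays all zeros.
    lookup = np.zeros((256, 4), dtype=np.float32)
    for channel, base in enumerate("ACGT"):
        lookup[ord(base), channel] = 1.0

    codes = np.frombuffer(padded.encode("ascii"), dtype=np.uint8)
    return lookup[codes]

encoded = pad_and_encode("ACGTN")
```

Because the padding carries no sequence information, predictions near the ends of a short input rest on less real context than predictions in the middle of a long one.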

API Reference


Input: SpliceAIPredictInput

sequences

List[string]

required

DNA sequence(s) to predict on. A single string is auto-wrapped into a list. Sequences may be any length; SpliceAI pads 5000 bp of context on each side internally, so predictions cover every input position.


Config: SpliceAIPredictConfig

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device to run inference on. SpliceAI (TensorFlow) auto-falls-back to CPU when no GPU is visible.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer


Output: SpliceAIPredictOutput

predictions

List[array]

required

Per sequence, per position, a [neither, acceptor, donor] probability triple. Outer length and order match the input sequences; each inner length equals the corresponding input sequence’s length.

Applications

Use this to scan an engineered construct, a minigene, or a transcript for latent splice sites, to visualize the acceptor/donor probability landscape across a region of interest, or to compare splice-site usage between designed sequence variants without assembling a genome and annotation.

Usage Tips

Output channels are [neither, acceptor, donor]. Index channel 1 for acceptor and channel 2 for donor probabilities; each per-sequence array has the same length as the corresponding input sequence.
Sequences may differ in length. They are scored independently (per-item caching applies), so batching ragged sequences is fine; very short sequences still receive the full 10,000 bp N-padded context.
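The channel layout above can be used to pull out candidate splice sites from one per-sequence prediction. This is a minimal sketch: `candidate_sites` is a hypothetical helper, and the 0.5 threshold is an illustrative choice rather than a documented recommendation for raw probabilities.

```python
import numpy as np

def candidate_sites(prediction, threshold=0.5):
    """Return 0-based positions whose acceptor or donor probability
    meets the threshold, given one (L, 3) prediction array."""
    probs = np.asarray(prediction, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != 3:
        raise ValueError("expected shape (L, 3): [neither, acceptor, donor]")
    acceptor = probs[:, 1]  # channel 1
    donor = probs[:, 2]     # channel 2
    return {
        "acceptor": np.flatnonzero(acceptor >= threshold).tolist(),
        "donor": np.flatnonzero(donor >= threshold).tolist(),
    }

# Toy prediction for a 4-base sequence.
toy = [
    [0.98, 0.01, 0.01],
    [0.10, 0.85, 0.05],
    [0.97, 0.02, 0.01],
    [0.30, 0.05, 0.65],
]
sites = candidate_sites(toy)
```

Since sequences are scored independently and may differ in length, apply the helper to each element of predictions in turn rather than stacking them into one array.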

Toolkit Notes

These apply to both SpliceAI tools in this toolkit (spliceai-score, spliceai-predict).

Runs on GPU or CPU via TensorFlow. SpliceAI is the only TensorFlow tool in the catalog; the standalone env pins TensorFlow 2.15 (Keras 2) so the bundled .h5 models load, which constrains the runtime to Python 3.11. TensorFlow falls back to CPU automatically when no GPU is visible.
Weights and annotations ship with the package. The five ensemble models and the GENCODE grch37/grch38 annotations are bundled in pip install spliceai, so no weight download is needed. The reference genome FASTA for spliceai-score is user-supplied at call time.
Non-commercial license. SpliceAI’s code is PolyForm Strict and its bundled models are CC-BY-NC-4.0 — both noncommercial, so the toolkit is not hostable on Proto; commercial use requires a license from Illumina.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.

