SpliceAI
License: SpliceAI’s code is released under a custom license (PolyForm Strict License 1.0.0) and its model weights under CC-BY-NC-4.0. Both restrict commercial use and may require explicit attribution. Please refer to the code license and model weights license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Illumina/SpliceAI
A deep learning-based tool to identify splice variants
Predicting Splicing from Primary Sequence with Deep Learning
Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, … Kyle Kai-How Farh
Cell (2019)
@article{jaganathan2019predicting,
  title = {Predicting Splicing from Primary Sequence with Deep Learning},
  author = {Jaganathan, Kishore and Kyriazopoulou Panagiotopoulou, Sofia and McRae, Jeremy F. and Darbandi, Siavash Fazel and Knowles, David and Li, Yang I. and Kosmicki, Jack A. and Arbelaez, Juan and Cui, Wenwu and Schwartz, Grace B. and Chow, Eric D. and Kanterakis, Efstathios and Gao, Hong and Kia, Amirali and Batzoglou, Serafim and Sanders, Stephan J. and Farh, Kyle Kai-How},
  journal = {Cell},
  volume = {176},
  number = {3},
  pages = {535--548.e24},
  year = {2019},
  publisher = {Elsevier},
  doi = {10.1016/j.cell.2018.12.015}
}
proto-bio/proto-tools/proto_tools/tools/rna_splicing/spliceai
Functions
  • run_spliceai_predict() - Predict per-position acceptor/donor splice-site probabilities from DNA sequence with SpliceAI (GPU).
  • run_spliceai_score() - Score variants for splice-altering effects (delta scores/positions) with SpliceAI (GPU).

Background

RNA splicing removes introns from pre-mRNA and joins exons, guided by sequence motifs at the donor (5’) and acceptor (3’) splice sites. Variants that create or disrupt these motifs can cause exon skipping, intron retention, or cryptic splicing, and are a major and frequently overlooked class of disease-causing mutations.

SpliceAI (Jaganathan et al., 2019) is a deep dilated residual convolutional network that reads 10,000 bp of flanking context (5,000 bp per side) and outputs, for every position, the probability of being an acceptor, a donor, or neither. The shipped model is an ensemble of five models whose per-position outputs are averaged.

For variant interpretation, SpliceAI compares predictions for the reference and alternate sequences and reports four delta scores in [0, 1]: acceptor gain (DS_AG), acceptor loss (DS_AL), donor gain (DS_DG), and donor loss (DS_DL). Each is reported together with a delta position (DP_*), the location of the affected site relative to the variant. The maximum delta score is the headline number: the paper characterizes cutoffs of 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision). All variant coordinates follow the 1-based VCF convention.
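
The delta-score comparison described above can be sketched in a few lines of plain Python. This is a simplified illustration, not SpliceAI's implementation: it assumes the reference and alternate probability tracks have already been computed and have equal length (so it covers substitutions only), and the function name `delta_scores`, the result layout, and the clamping at zero are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Sequence[float]  # one position: [neither, acceptor, donor]


def delta_scores(
    ref: List[Triple],
    alt: List[Triple],
    variant_index: int,
    max_distance: int = 50,
) -> Dict[str, Tuple[float, int]]:
    """Return {name: (delta_score, delta_position)} for equal-length tracks.

    Hypothetical helper: delta_position is the offset of the affected site
    from the variant (negative = before it, positive = after it).
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length ref/alt tracks")
    if not 0 <= variant_index < len(ref):
        raise ValueError("variant_index is outside the tracks")

    # Only sites within max_distance of the variant are considered.
    lo = max(0, variant_index - max_distance)
    hi = min(len(ref), variant_index + max_distance + 1)
    window = range(lo, hi)

    out: Dict[str, Tuple[float, int]] = {}
    for label, channel in (("A", 1), ("D", 2)):  # acceptor, donor channels
        # Gain: probability rises in alt; loss: probability falls in alt.
        gain_at = max(window, key=lambda i: alt[i][channel] - ref[i][channel])
        loss_at = max(window, key=lambda i: ref[i][channel] - alt[i][channel])
        gain = alt[gain_at][channel] - ref[gain_at][channel]
        loss = ref[loss_at][channel] - alt[loss_at][channel]
        # Clamping at zero is an assumption that keeps scores in [0, 1].
        out[f"DS_{label}G"] = (max(0.0, gain), gain_at - variant_index)
        out[f"DS_{label}L"] = (max(0.0, loss), loss_at - variant_index)
    return out
```

The headline number is then `max(score for score, _ in result.values())`. Insertions and deletions change the track length and need extra alignment handling that this sketch deliberately leaves out.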

Learning Resources

  • SpliceAI repository (Illumina) - the canonical CLI, the Annotator/get_delta_scores Python API, and the bundled GENCODE annotations and ensemble weights.
  • Jaganathan et al., 2019 (Cell) - the original paper describing the architecture, training data, and clinical validation of delta scores.

Tools

SpliceAI Variant Scoring (spliceai-score)

Scores genetic variants (chromosome / 1-based position / ref / alt) for splice-altering effects, returning per-gene delta scores and delta positions for acceptor and donor gain/loss. Requires a reference genome FASTA and a gene annotation (the bundled grch37/grch38, or a custom file).

API Reference

Input
  • variants (List[SpliceAIVariant], required) - Variants to score. A single variant is auto-wrapped into a list.

Configuration
  • reference_fasta (string) - Path (or AssetRef) to the reference genome FASTA. Required at call time: SpliceAI extracts the wild-type sequence around each variant from this genome. Passing None raises a MissingAssetError so un-provisioned hosts skip cleanly.
  • annotation (string, default "grch38") - Gene annotation source: 'grch37' or 'grch38' (GENCODE files bundled with SpliceAI) or a path to a custom tab-separated annotation file.
  • max_distance (integer, default 50) - Maximum distance (bp) between the variant and a gained/lost splice site to report (the SpliceAI -D flag).
  • mask (boolean, default False) - Mask scores for annotated acceptor/donor gain and unannotated acceptor/donor loss (the SpliceAI -M flag).
  • verbose (integer, default 0) - Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default "cuda") - Device to run inference on. SpliceAI (TensorFlow) falls back to CPU automatically when no GPU is visible.
  • timeout (integer, default 600) - Maximum execution time in seconds. None waits indefinitely.
  • seed (integer) - Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip the cache until seeded.

Output
  • results (List[SpliceAIVariantResult], required) - Per-variant scores, 1:1 with the input variants and in the same order.

Metrics (one set per results item)
  • max_delta_score (float, range 0.0 to 1.0) - Present for scored variants overlapping an annotated gene.

Applications

Use this to triage candidate variants from a sequencing study for splicing impact, to annotate a VCF with SpliceAI predictions, or to prioritize variants of uncertain significance where a coding effect is absent but a splicing effect is plausible. The max_delta_score metric supports threshold-based filtering at the paper’s 0.2 (high recall), 0.5 (recommended), and 0.8 (high precision) cutoffs.
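
Threshold-based filtering on max_delta_score can be sketched as below. The function names, the tier labels, the treatment of a missing score as `None`, and the choice to make each cutoff inclusive are all assumptions for illustration; only the 0.2 / 0.5 / 0.8 values come from the text above.

```python
from typing import List, Optional

# Cutoffs characterized in the paper, highest first.
CUTOFFS = ((0.8, "high precision"), (0.5, "recommended"), (0.2, "high recall"))


def triage(max_delta_score: Optional[float]) -> str:
    """Map a max delta score to the strictest cutoff tier it reaches."""
    if max_delta_score is None:
        # Assumption: the metric is absent when no annotated gene overlaps.
        return "not scored"
    if not 0.0 <= max_delta_score <= 1.0:
        raise ValueError("max_delta_score must lie in [0, 1]")
    for cutoff, tier in CUTOFFS:
        if max_delta_score >= cutoff:  # inclusive comparison is an assumption
            return tier
    return "below cutoffs"


def passing_indices(scores: List[Optional[float]], cutoff: float = 0.5) -> List[int]:
    """Indices of variants whose score reaches the chosen cutoff."""
    return [i for i, s in enumerate(scores) if s is not None and s >= cutoff]
```

Because results are 1:1 with the input variants and in the same order, the indices returned by `passing_indices` can be used directly to select the corresponding inputs.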

Usage Tips

  • reference_fasta is required and position is 1-based. SpliceAI extracts the wild-type window around each variant from the genome you supply, so the FASTA, the annotation, and each variant’s chromosome must use consistent identifiers. Note that this differs from AlphaGenome, whose coordinates are 0-based.
  • annotation selects the gene model. grch37 and grch38 load the GENCODE files bundled with SpliceAI; pass a path to score against a custom tab-separated annotation. Changing it restarts the worker.
  • max_distance (default 50) and mask mirror the SpliceAI -D/-M flags. Widen max_distance to report splice sites farther from the variant; enable mask to suppress scores for annotated-gain and unannotated-loss positions.
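
A pre-flight check along the lines of the first tip might look like the following sketch. It is independent of the toolkit: variants are represented as plain `(chromosome, position, ref, alt)` tuples, which is a hypothetical layout and not the SpliceAIVariant model, and all helper names are made up for the example.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical layout: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def fasta_contigs(lines: Iterable[str]) -> Set[str]:
    """Collect contig names from FASTA headers (text after '>' up to whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def missing_contigs(variants: Iterable[Variant], contigs: Set[str]) -> List[str]:
    """Chromosome names used by variants but absent from the FASTA, e.g. '1' vs 'chr1'."""
    return sorted({chrom for chrom, _, _, _ in variants if chrom not in contigs})


def from_zero_based(pos_0based: int) -> int:
    """Convert a 0-based coordinate to the 1-based VCF convention."""
    return pos_0based + 1


def ref_allele_matches(contig_sequence: str, pos_1based: int, ref: str) -> bool:
    """Check that the reference allele agrees with the genome at a 1-based position."""
    start = pos_1based - 1  # 1-based position -> 0-based string index
    if start < 0:
        return False
    return contig_sequence[start:start + len(ref)].upper() == ref.upper()
```

A mismatch from `ref_allele_matches` is often a sign of an off-by-one coordinate or of a genome build that does not match the variant call set.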

SpliceAI Splice-Site Prediction (spliceai-predict)

Predicts per-position [neither, acceptor, donor] probabilities directly from one or more DNA sequences. No reference genome is needed — the model runs on the sequence as given, padding 5,000 bp of context per side internally.
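
The internal padding mentioned above can be pictured with this small sketch. It only illustrates the idea of flanking the input with 5,000 N bases per side and encoding N as an all-zero vector; the function names and the pure-Python encoding are assumptions for the example, and the tool performs this step for you.

```python
from typing import List

CONTEXT = 10_000  # total flanking context in bp (5,000 per side)

# One-hot rows in A, C, G, T order; any other base (including N) maps to zeros.
_ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def pad_with_context(sequence: str, context: int = CONTEXT) -> str:
    """Flank the sequence with N bases so every input position has full context."""
    flank = context // 2
    return "N" * flank + sequence.upper() + "N" * flank


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [list(_ONE_HOT.get(base, [0, 0, 0, 0])) for base in sequence.upper()]
```

The padded input is `CONTEXT` bases longer than the original, which is why the model can still emit one prediction per original position.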

API Reference

Input
  • sequences (List[string], required) - DNA sequence(s) to predict on. A single string is auto-wrapped into a list. Sequences may be any length; SpliceAI pads 5,000 bp of context on each side internally, so predictions cover every input position.

Configuration
  • verbose (integer, default 0) - Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default "cuda") - Device to run inference on. SpliceAI (TensorFlow) falls back to CPU automatically when no GPU is visible.
  • timeout (integer, default 600) - Maximum execution time in seconds. None waits indefinitely.
  • seed (integer) - Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip the cache until seeded.

Output
  • predictions (List[array], required) - Per sequence, per position, a [neither, acceptor, donor] probability triple. Outer length and order match the input sequences; each inner length equals the corresponding input sequence’s length.

Applications

Use this to scan an engineered construct, a minigene, or a transcript for latent splice sites, to visualize the acceptor/donor probability landscape across a region of interest, or to compare splice-site usage between designed sequence variants without assembling a genome and annotation.

Usage Tips

  • Output channels are [neither, acceptor, donor]. Index channel 1 for acceptor and channel 2 for donor probabilities; each per-sequence array has the same length as the corresponding input sequence.
  • Sequences may differ in length. They are scored independently (per-item caching applies), so batching ragged sequences is fine; very short sequences still receive the full 10,000 bp N-padded context.
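
Working with the output channels might look like the sketch below, which takes one per-sequence prediction (a list of [neither, acceptor, donor] triples) and lists candidate sites. The helper names and the 0.5 threshold are assumptions for illustration, not a recommendation from the tool; positions are 0-based indices into the input string.

```python
from typing import List, Sequence, Tuple

ACCEPTOR, DONOR = 1, 2  # channel indices in [neither, acceptor, donor]


def channel_track(prediction: Sequence[Sequence[float]], channel: int) -> List[float]:
    """Extract a single probability track, one value per input position."""
    return [position[channel] for position in prediction]


def call_sites(
    prediction: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff chosen for the example
) -> List[Tuple[int, str, float]]:
    """List (position, site type, probability) for positions at or above threshold."""
    sites = []
    for pos, triple in enumerate(prediction):
        if triple[ACCEPTOR] >= threshold:
            sites.append((pos, "acceptor", triple[ACCEPTOR]))
        if triple[DONOR] >= threshold:
            sites.append((pos, "donor", triple[DONOR]))
    return sites
```

Since sequences are scored independently, the same function can be applied to each entry of a ragged batch, for example `[call_sites(p) for p in predictions]`.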

Toolkit Notes

These apply to both SpliceAI tools in this toolkit (spliceai-score, spliceai-predict).
  • Runs on GPU or CPU via TensorFlow. SpliceAI is the only TensorFlow tool in the catalog; the standalone env pins TensorFlow 2.15 (Keras 2) so the bundled .h5 models load, which constrains the runtime to Python 3.11. TensorFlow falls back to CPU automatically when no GPU is visible.
  • Weights and annotations ship with the package. The five ensemble models and the GENCODE grch37/grch38 annotations are bundled in pip install spliceai, so no weight download is needed. The reference genome FASTA for spliceai-score is user-supplied at call time.
  • Non-commercial license. SpliceAI’s code is PolyForm Strict and its bundled models are CC-BY-NC-4.0. Both are non-commercial, so the toolkit is not hostable on Proto; commercial use requires a license from Illumina.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.