SpliceTransformer
License: SpliceTransformer is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


ShenLab-Genomics/SpliceTransformer
SpliceTransformer (SpTransformer) is a deep learning tool that predicts tissue-specific splice sites from pre-mRNA sequence.
SpliceTransformer predicts tissue-specific splicing linked to human diseases
Ningyuan You, Chang Liu, … Ning Shen
Nature Communications (2024)
@article{you2024splicetransformer,
  title={SpliceTransformer predicts tissue-specific splicing linked to human diseases},
  author={You, Ningyuan and Liu, Chang and Gu, Yuxin and Wang, Rong and Jia, Hanying and Zhang, Tianyun and Jiang, Song and Shi, Jinsong and Chen, Ming and Guan, Min-Xin and Sun, Siqi and Pei, Shanshan and Liu, Zhihong and Shen, Ning},
  journal={Nature Communications},
  volume={15},
  number={1},
  pages={9129},
  year={2024},
  doi={10.1038/s41467-024-53088-6}
}
proto-bio/proto-tools/proto_tools/tools/rna_splicing/splice_transformer
Function: run_splice_transformer()
Description: Tissue-specific splicing prediction using SpliceTransformer (GPU)

Background

SpliceTransformer (You et al., 2024) is a deep learning framework that predicts tissue-specific splicing from genomic sequence. The architecture combines convolutional encoders in the style of SpliceAI with a Sinkhorn transformer attention module, which lets the model capture the long-range sequence interactions that influence splice site selection. The published work reports that SpliceTransformer outperforms all previous methods on splicing prediction. Applied to roughly 1.3 million variants in the ClinVar database, it attributes 60 percent of intronic and synonymous pathogenic mutations to splicing alterations, and it connects tissue-specific splicing changes to human disease in validations spanning brain disease cohorts and a diabetic nephropathy dataset.

The model evaluates a 1,000-nucleotide target region flanked by 4,000 nucleotides of genomic context on each side, for a 9,000-nucleotide window in total, and it returns an 18-channel prediction for every position in the target region. The first three channels form a softmax over the splice-site class of the position: neither site, acceptor, or donor. The remaining fifteen channels report the predicted usage of the position as a splice site in each of fifteen human tissues, derived from Genotype-Tissue Expression (GTEx) data. The tissue channels are produced by an independent sigmoid for each tissue, so a position can be a confident splice site overall while being used in only a subset of tissues.

Tools

SpliceTransformer Splicing Prediction (splice-transformer-prediction)

Predicts splice sites at single-nucleotide resolution across a batch of target sequences and reports tissue-specific usage for each predicted site. The tool accepts one or more 1,000-nucleotide target sequences, each paired with a 4,000-nucleotide left context and a 4,000-nucleotide right context drawn from the same genomic locus, and returns a probability tensor of shape [batch, 1000, 18]. The first three channels give the probabilities that a position is not a splice site, is an acceptor, or is a donor, and the remaining fifteen channels give the predicted usage of that position as a splice site in each tissue.
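The input layout described above can be prepared from a single contiguous 9,000-nucleotide window. The following is a minimal sketch and not part of the toolkit: the helper names split_window and build_inputs are hypothetical, and only the key names target_seqs, left_contexts, and right_contexts are taken from the API reference. The sketch prepares the three lists and does not call the tool itself.

```python
from typing import Dict, List, Tuple

TARGET_LEN = 1000    # target length stated in this document
CONTEXT_LEN = 4000   # context length on each side, stated in this document
WINDOW_LEN = CONTEXT_LEN + TARGET_LEN + CONTEXT_LEN  # 9,000 nucleotides


def split_window(window: str) -> Tuple[str, str, str]:
    """Split one 9,000-nt window into (left context, target, right context).

    Hypothetical helper, not a toolkit function.
    """
    if len(window) != WINDOW_LEN:
        raise ValueError(
            f"expected a window of {WINDOW_LEN} nucleotides, got {len(window)}"
        )
    left = window[:CONTEXT_LEN]
    target = window[CONTEXT_LEN:CONTEXT_LEN + TARGET_LEN]
    right = window[CONTEXT_LEN + TARGET_LEN:]
    return left, target, right


def build_inputs(windows: List[str]) -> Dict[str, List[str]]:
    """Build three position-paired lists, one entry per window."""
    inputs: Dict[str, List[str]] = {
        "target_seqs": [],
        "left_contexts": [],
        "right_contexts": [],
    }
    for window in windows:
        left, target, right = split_window(window)
        inputs["left_contexts"].append(left)
        inputs["target_seqs"].append(target)
        inputs["right_contexts"].append(right)
    return inputs


# Synthetic placeholder windows; real input would be genomic sequence.
windows = [
    "A" * CONTEXT_LEN + "C" * TARGET_LEN + "G" * CONTEXT_LEN,
    "T" * CONTEXT_LEN + "G" * TARGET_LEN + "A" * CONTEXT_LEN,
]
inputs = build_inputs(windows)
```

Splitting one contiguous window keeps the three pieces in genomic order by construction, so the pairing by list position cannot drift between targets and contexts.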

API Reference

Inputs
target_seqs (List[string], required): RNA or DNA sequence(s) on which to make splicing predictions. These are the central sequences where splice sites will be predicted at single-nucleotide resolution. All sequences in the batch should have the same length (typically 1,000 bp).
left_contexts (List[string], required): Sequence(s) providing left (5’) context for each target sequence. Must have the same number of sequences as target_seqs. All left context sequences must have the same length (typically 4,000 bp) to provide sufficient context for accurate prediction.
right_contexts (List[string], required): Sequence(s) providing right (3’) context for each target sequence. Must have the same number of sequences as target_seqs. All right context sequences must have the same length (typically 4,000 bp), matching the left context.
Configuration
verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cuda"): Device to run the model on. Overrides BaseConfig.device because SpliceTransformer is a GPU tool.
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Outputs
prediction (List[array], required): Prediction tensor of shape [batch, target_length, 18].

Applications

This tool is appropriate for any analysis that begins with a genomic locus and asks where the splice sites are and how their usage differs between tissues. Representative applications include annotating candidate splice sites in a newly characterised gene, comparing predicted acceptor and donor usage between tissues such as brain and liver to identify tissue-specific isoforms, and screening a region for positions whose splicing behaviour is restricted to a particular tissue. The fifteen tissue channels make the tool well suited to studies of alternative splicing where the tissue of interest is known in advance.
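A brain-versus-liver comparison of the kind mentioned above could be sketched as follows. This is an illustration under stated assumptions rather than a recommended analysis: the function name tissue_biased_sites and both thresholds are hypothetical, the channel indices follow the channel order listed under Toolkit Notes (brain at channel 6, liver at channel 10), and the prediction array is synthetic.

```python
import numpy as np

ACCEPTOR_CH, DONOR_CH = 1, 2   # class channels, per the documented order
BRAIN_CH, LIVER_CH = 6, 10     # tissue channels, per the documented order


def tissue_biased_sites(pred, site_threshold=0.5, min_delta=0.3):
    """Return (position, brain - liver usage) for likely tissue-biased sites.

    pred is a [target_length, 18] array for one sequence. The thresholds are
    illustrative assumptions, not values from the SpliceTransformer paper.
    """
    pred = np.asarray(pred, dtype=float)
    site_prob = pred[:, ACCEPTOR_CH] + pred[:, DONOR_CH]
    delta = pred[:, BRAIN_CH] - pred[:, LIVER_CH]
    mask = (site_prob >= site_threshold) & (np.abs(delta) >= min_delta)
    return [(int(i), float(delta[i])) for i in np.flatnonzero(mask)]


# Synthetic stand-in for one sequence's prediction: no splice sites by default.
pred = np.zeros((1000, 18))
pred[:, 0] = 1.0

# Position 100: confident acceptor, used far more in brain than in liver.
pred[100, :3] = [0.1, 0.9, 0.0]
pred[100, BRAIN_CH] = 0.8
pred[100, LIVER_CH] = 0.1

# Position 500: confident donor, with similar usage in both tissues.
pred[500, :3] = [0.05, 0.0, 0.95]
pred[500, BRAIN_CH] = 0.7
pred[500, LIVER_CH] = 0.65

hits = tissue_biased_sites(pred)
```

With this synthetic input only position 100 is reported, because position 500 is a confident site whose usage barely differs between the two tissues.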

Usage Tips

  • Sequence lengths are fixed by the published model and are enforced on input. Every target sequence must be exactly 1,000 nucleotides and every left and right context must be exactly 4,000 nucleotides. Inputs of any other length are rejected before the model runs.
  • The target and its two contexts must come from the same genomic locus and be supplied in genomic order. The model concatenates the left context, the target, and the right context into a single window, so a context drawn from a different region or assembled in the wrong order produces predictions that do not correspond to the intended locus.
  • The three input lists must contain the same number of sequences. Each target is paired by position with one left context and one right context, and a mismatch in list length is rejected on input.
  • Acceptor and donor probabilities are most informative when read against the canonical GT-AG rule. Confident donor predictions are expected at GT dinucleotides and confident acceptor predictions at AG dinucleotides, so inspecting these positions helps confirm that a high score reflects a genuine splice site.
  • Tissue channels report usage rather than the presence of a splice site. A position can carry a high acceptor or donor probability while showing strong usage in only a few tissues, so differential analysis across the tissue channels is the basis for identifying tissue-specific splicing.
  • The model was trained on human sequence and is intended for human loci. Predictions on sequences from other species are not supported by the training data and should not be interpreted as reliable.
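The GT-AG check from the tips above can be automated with a small helper. This is a sketch under assumptions: has_canonical_motif is a hypothetical name, the sequence is assumed to be given in transcript orientation, and because this document does not state the exact offset between a predicted position and its dinucleotide, the helper searches a small window around the position instead of a single fixed offset.

```python
CANONICAL_MOTIF = {"donor": "GT", "acceptor": "AG"}


def has_canonical_motif(seq: str, position: int, site_type: str, flank: int = 2) -> bool:
    """Check for the canonical dinucleotide near a predicted splice site.

    Searches positions [position - flank, position + flank + 1] of seq.
    The flank of 2 is an assumption chosen to tolerate either labelling
    convention (site on the exon edge or on the intron dinucleotide).
    """
    if site_type not in CANONICAL_MOTIF:
        raise ValueError(f"site_type must be one of {sorted(CANONICAL_MOTIF)}")
    if not 0 <= position < len(seq):
        raise ValueError("position is outside the sequence")
    motif = CANONICAL_MOTIF[site_type]
    # Inputs may be RNA or DNA, so normalise case and map U to T.
    dna = seq.upper().replace("U", "T")
    start = max(0, position - flank)
    end = min(len(dna), position + flank + len(motif))
    return motif in dna[start:end]


# Toy RNA fragment: ten exonic bases followed by an intron starting with GU.
fragment = "C" * 10 + "GU" + "C" * 10
donor_ok = has_canonical_motif(fragment, 9, "donor")
donor_far = has_canonical_motif(fragment, 18, "donor")
acceptor_here = has_canonical_motif(fragment, 9, "acceptor")
```

A high-scoring position that fails this check is not necessarily wrong, since non-canonical splice sites exist, but it is a reasonable candidate for closer inspection.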

Toolkit Notes

These apply to every SpliceTransformer tool in this toolkit (splice-transformer-prediction).
  • The eighteen output channels follow a fixed order. Channel 0 is the probability of neither site, channel 1 the acceptor probability, and channel 2 the donor probability, and these three form a softmax that sums to one. Channels 3 through 17 carry the per-tissue usage in the order adipose tissue, blood, blood vessel, brain, colon, heart, kidney, liver, lung, muscle, nerve, small intestine, skin, spleen, and stomach. The SPLICE_TISSUE_CHANNEL_INDEX mapping exported by the toolkit resolves a tissue name to its channel.
  • The prediction is returned as a nested list and is most convenient to work with as an array. The prediction field has shape [batch, 1000, 18] and can be converted with numpy.array(...) for slicing by position or channel. Results can be exported to NumPy .npy or JSON through the standard export method.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
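The channel layout described in the notes above can be unpacked with NumPy as sketched below. The names TISSUES, TISSUE_CHANNEL, and split_channels are hypothetical and defined locally for illustration; TISSUE_CHANNEL is a stand-in built from the documented order, not the toolkit's SPLICE_TISSUE_CHANNEL_INDEX, whose exact keys are not shown in this document. The prediction is synthetic, generated with a softmax and independent sigmoids to mimic the documented output.

```python
import numpy as np

# Tissue order as listed in this document; channel 3 is the first tissue.
TISSUES = [
    "adipose tissue", "blood", "blood vessel", "brain", "colon",
    "heart", "kidney", "liver", "lung", "muscle",
    "nerve", "small intestine", "skin", "spleen", "stomach",
]
TISSUE_CHANNEL = {name: 3 + i for i, name in enumerate(TISSUES)}


def split_channels(prediction):
    """Convert a nested-list prediction into class and tissue arrays."""
    arr = np.array(prediction, dtype=float)
    if arr.ndim != 3 or arr.shape[-1] != 18:
        raise ValueError(f"expected shape [batch, length, 18], got {arr.shape}")
    class_probs = arr[..., :3]    # neither, acceptor, donor
    tissue_usage = arr[..., 3:]   # fifteen per-tissue usage values
    return class_probs, tissue_usage


# Synthetic stand-in for the tool output, returned as a nested list.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 1000, 3))
exp_logits = np.exp(logits)
class_part = exp_logits / exp_logits.sum(axis=-1, keepdims=True)  # softmax
tissue_part = 1.0 / (1.0 + np.exp(-rng.normal(size=(2, 1000, 15))))  # sigmoids
prediction = np.concatenate([class_part, tissue_part], axis=-1).tolist()

class_probs, tissue_usage = split_channels(prediction)

# Liver usage for every position of the first sequence in the batch.
liver_usage = np.array(prediction)[0, :, TISSUE_CHANNEL["liver"]]
```

Checking that the three class channels sum to one at each position is a cheap sanity test that the channels have not been reordered or truncated during export.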

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.