
This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
Background
SpliceTransformer (You et al., 2024) is a deep learning framework that predicts tissue-specific splicing from genomic sequence. The architecture combines convolutional encoders in the style of SpliceAI with a Sinkhorn transformer attention module, which lets the model capture the long-range sequence interactions that influence splice site selection. The published work reports that SpliceTransformer outperforms all previous methods on splicing prediction. Applied to roughly 1.3 million variants in the ClinVar database, it attributes 60 percent of intronic and synonymous pathogenic mutations to splicing alterations, and it connects tissue-specific splicing changes to human disease in validations spanning brain disease cohorts and a diabetic nephropathy dataset.
The model evaluates a 1,000 nucleotide target region flanked by 4,000 nucleotides of genomic context on each side, and it returns a prediction for every position in the target region. The first three output channels form a softmax over the splice-site class of the position, namely neither site, acceptor, or donor. The remaining fifteen channels report the predicted usage of the position as a splice site in each of fifteen human tissues, derived from Genotype-Tissue Expression (GTEx) data. The tissue channels are produced by an independent sigmoid for each tissue, so a position can be a confident splice site overall while being used in only a subset of tissues.
Learning Resources
- ShenLab-Genomics/SpliceTransformer (Shen Lab). Official repository, containing the reference model definition and usage examples that this toolkit follows.
Tools
SpliceTransformer Splicing Prediction (splice-transformer-prediction)
Predicts splice sites at single-nucleotide resolution across a batch of target sequences and reports tissue-specific usage for each predicted site. The tool accepts one or more 1,000 nucleotide target sequences, each paired with a 4,000 nucleotide left context and a 4,000 nucleotide right context drawn from the same genomic locus, and returns a probability tensor of shape [batch, 1000, 18]. The first three channels give the probabilities that a position is not a splice site, is an acceptor, or is a donor, and the remaining fifteen channels give the predicted usage of that position as a splice site in each tissue.
API Reference
Input: SpliceTransformerInput
- Left context: … target_seqs. All left context sequences must have the same length (typically 4000 bp) to provide sufficient context for accurate prediction.
- Right context: … target_seqs. All right context sequences must have the same length (typically 4000 bp), matching the left context.
Config: SpliceTransformerConfig
- True is coerced to 1 and False to 0.
- … BaseConfig.device because SpliceTransformer is a GPU tool (default cuda).
- None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: SpliceTransformerOutput
[batch, target_length, 18]
Applications
This tool is appropriate for any analysis that begins with a genomic locus and asks where the splice sites are and how their usage differs between tissues. Representative applications include annotating candidate splice sites in a newly characterised gene, comparing predicted acceptor and donor usage between tissues such as brain and liver to identify tissue-specific isoforms, and screening a region for positions whose splicing behaviour is restricted to a particular tissue. The fifteen tissue channels make the tool well suited to studies of alternative splicing where the tissue of interest is known in advance.
Usage Tips
- Sequence lengths are fixed by the published model and are enforced on input. Every target sequence must be exactly 1,000 nucleotides and every left and right context must be exactly 4,000 nucleotides. Inputs of any other length are rejected before the model runs.
- The target and its two contexts must come from the same genomic locus and be supplied in genomic order. The model concatenates the left context, the target, and the right context into a single window, so a context drawn from a different region or assembled in the wrong order produces predictions that do not correspond to the intended locus.
- The three input lists must contain the same number of sequences. Each target is paired by position with one left context and one right context, and a mismatch in list length is rejected on input.
- Acceptor and donor probabilities are most informative when read against the canonical GT-AG rule. Confident donor predictions are expected at GT dinucleotides and confident acceptor predictions at AG dinucleotides, so inspecting these positions helps confirm that a high score reflects a genuine splice site.
- Tissue channels report usage rather than the presence of a splice site. A position can carry a high acceptor or donor probability while showing strong usage in only a few tissues, so differential analysis across the tissue channels is the basis for identifying tissue-specific splicing.
- The model was trained on human sequence and is intended for human loci. Predictions on sequences from other species are not supported by the training data and should not be interpreted as reliable.
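The input rules and the GT-AG reading described in the tips above can be sketched in plain Python. This is a minimal pre-check under stated assumptions, not the toolkit's own validation: the names build_windows and motif_near are hypothetical helpers, and because the exact offset between a scored position and its dinucleotide is not stated here, the motif check looks in a small window around the position instead of at a fixed offset.

```python
TARGET_LEN = 1000   # target length stated in the tips above
CONTEXT_LEN = 4000  # context length on each side, as stated above


def build_windows(target_seqs, left_contexts, right_contexts):
    """Check the three lists, then return left + target + right per locus.

    Hypothetical helper: mirrors the documented rules, not the toolkit's code.
    """
    if not (len(target_seqs) == len(left_contexts) == len(right_contexts)):
        raise ValueError("the three input lists must have the same length")
    windows = []
    for i, (left, target, right) in enumerate(
        zip(left_contexts, target_seqs, right_contexts)
    ):
        if len(target) != TARGET_LEN:
            raise ValueError(f"target {i}: expected {TARGET_LEN} nt, got {len(target)}")
        if len(left) != CONTEXT_LEN or len(right) != CONTEXT_LEN:
            raise ValueError(f"contexts {i}: expected {CONTEXT_LEN} nt on each side")
        # Genomic order: upstream context, then target, then downstream context.
        windows.append(left + target + right)
    return windows


def motif_near(seq, pos, motif, flank=2):
    """Return True if `motif` starts within `flank` bases of `pos` in `seq`."""
    start = max(0, pos - flank)
    end = min(len(seq), pos + flank + len(motif))
    return motif in seq[start:end].upper()


# Toy locus: a single GT dinucleotide at target positions 499-500.
left = "A" * CONTEXT_LEN
right = "C" * CONTEXT_LEN
target = "A" * 499 + "GT" + "A" * 499

windows = build_windows([target], [left], [right])
print(len(windows[0]))                 # full window length in nucleotides
print(motif_near(target, 498, "GT"))   # position next to the GT
print(motif_near(target, 100, "GT"))   # position far from any GT
```

Running a check like this before calling the tool surfaces length and pairing mistakes early. The motif helper is only a sanity check on high-scoring positions, not a replacement for the model's prediction.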
Toolkit Notes
These apply to every SpliceTransformer tool in this toolkit (splice-transformer-prediction).
- The eighteen output channels follow a fixed order. Channel 0 is the probability of neither site, channel 1 the acceptor probability, and channel 2 the donor probability, and these three form a softmax that sums to one. Channels 3 through 17 carry the per-tissue usage in the order adipose tissue, blood, blood vessel, brain, colon, heart, kidney, liver, lung, muscle, nerve, small intestine, skin, spleen, and stomach. The SPLICE_TISSUE_CHANNEL_INDEX mapping exported by the toolkit resolves a tissue name to its channel.
- The prediction is returned as a nested list and is most convenient to work with as an array. The prediction field has shape [batch, 1000, 18] and can be converted with numpy.array(...) for slicing by position or channel. Results can be exported to NumPy .npy or JSON through the standard export method.
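The channel layout and the array conversion described in these notes can be illustrated with NumPy. The sketch below makes several assumptions: the prediction is a synthetic stand-in with the documented shape, not real model output. TISSUE_CHANNEL is a local mapping built from the tissue order listed above, used in place of the toolkit's SPLICE_TISSUE_CHANNEL_INDEX, whose import path and exact key spelling are not shown here. The 0.5 threshold is arbitrary.

```python
import numpy as np

# Tissue order as listed in the notes above; these occupy channels 3..17.
# Local stand-in for the toolkit's SPLICE_TISSUE_CHANNEL_INDEX (assumption).
TISSUES = [
    "adipose tissue", "blood", "blood vessel", "brain", "colon",
    "heart", "kidney", "liver", "lung", "muscle",
    "nerve", "small intestine", "skin", "spleen", "stomach",
]
TISSUE_CHANNEL = {name: 3 + i for i, name in enumerate(TISSUES)}

# Synthetic stand-in for the nested-list prediction, shape [batch, 1000, 18]:
# three softmax class channels followed by fifteen independent usage channels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 1000, 3))
class_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
usage = rng.uniform(size=(2, 1000, 15))
prediction = np.concatenate([class_probs, usage], axis=-1).tolist()

# Convert the nested list to an array for slicing by position or channel.
pred = np.array(prediction)        # [batch, 1000, 18]
site_probs = pred[..., :3]         # neither, acceptor, donor
acceptor = pred[..., 1]
donor = pred[..., 2]
tissue_usage = pred[..., 3:]       # fifteen tissue channels

# Positions where either splice-site class passes an arbitrary threshold.
is_site = np.maximum(acceptor, donor) > 0.5

# Brain-versus-liver usage difference, kept only at candidate sites.
brain = pred[..., TISSUE_CHANNEL["brain"]]
liver = pred[..., TISSUE_CHANNEL["liver"]]
diff = np.where(is_site, brain - liver, 0.0)

print(pred.shape, tissue_usage.shape)
print(int(is_site.sum()), "candidate sites across the batch")
```

Restricting the tissue comparison to positions that already score as acceptor or donor follows the point that tissue channels report usage, not the presence of a splice site.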