License: Borzoi uses Apache-2.0 for code and CC-BY-4.0 for model weights and may require explicit attribution when utilized. Please refer to the code license and model weights license for full terms.

Proto is not affiliated with Calico. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


calico/borzoi
RNA-seq prediction with deep convolutional neural networks.
johahi/borzoi-models
Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation
Johannes Linder, Divyanshi Srivastava, … David R Kelley
Nature Genetics (2025)
@article{linder2025borzoi,
  title={Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation},
  author={Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R},
  journal={Nature Genetics},
  volume={57},
  number={4},
  pages={949--961},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41588-024-02053-6}
}
proto-bio/proto-tools/proto_tools/tools/sequence_scoring/borzoi
Functions

  • run_borzoi_ensemble(): Regulatory activity prediction using all 4 Borzoi replicates (GPU).
  • run_borzoi(): Regulatory activity prediction using a single Borzoi replicate (GPU).

Background

Gene regulation acts across a wide range of genomic distances. Promoter-proximal elements typically operate within a few kilobases, yet enhancers can influence genes from more than 100 kb away, and topologically associating domains organize chromatin contacts over megabase scales. Sequence-to-function models that aim to relate noncoding variation to molecular phenotypes therefore need an input window broad enough to capture these long-range relationships while preserving fine spatial resolution.

Borzoi (Linder et al., 2025) learns to predict cell- and tissue-specific RNA-seq coverage from DNA sequence, serving as a unifying model of gene regulation. Statistics computed from its predicted coverage are used to isolate and score the DNA regulatory elements that modulate transcription, splicing, and polyadenylation. In benchmarks against state-of-the-art models, Borzoi compares favorably: it predicts the influence of variants on RNA expression and splicing and recapitulates the causal variants underlying molecular quantitative trait loci. Alongside RNA-seq, the model predicts CAGE, DNase-seq, ATAC-seq, and histone modification tracks, making it a broad regulatory genomics predictor.

The published model couples a convolutional sequence encoder with transformer-style attention to process the full 524,288 bp window, and separate output heads produce human and mouse track predictions. The Borzoi authors trained four model replicates from independent initializations, which this toolkit exposes both as a single-replicate tool and as a four-replicate ensemble. A separate FlashAttention-based distillation of Borzoi, named Flashzoi, reaches comparable accuracy at substantially higher speed. The human checkpoints exposed by this toolkit use the Flashzoi distillation, and the mouse checkpoints use the standard Borzoi architecture.

Learning Resources

  • calico/borzoi (Calico Life Sciences). Official repository with the reference model code, training data references, and usage documentation.
  • Borzoi PyTorch weights (Hugging Face). The PyTorch-converted Borzoi and Flashzoi checkpoints that this toolkit loads at inference time.

Tools

Borzoi Prediction (borzoi-prediction)

Predicts regulatory track activity for one or more DNA sequences using a single Borzoi replicate. Each sequence may be supplied as an exact 524,288 bp model window, or as a longer source sequence paired with a sequence-relative target range that the tool uses to extract the fixed model window. For every input, the tool returns a per-bin activity matrix together with the source-sequence coordinates of the model input window and the output-bin span, so predictions can be mapped back onto the original sequence.
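
The two input modes described above can be checked before any GPU time is spent. The following is a minimal sketch, not the toolkit's own validation code: the function name check_input, the (start, end) half-open form of the target range, and the uppercase-only alphabet check are assumptions made for illustration.

```python
CONTEXT_LENGTH = 524_288             # Borzoi model window, in bp
ALLOWED_BASES = frozenset("ACGTN")   # assumes uppercase input


def check_input(sequence, target_range=None):
    """Return "exact" or "extract" for a valid input, else raise ValueError.

    Hypothetical helper: target_range is assumed to be a (start, end) pair
    of 0-based, half-open offsets into the source sequence.
    """
    bad = set(sequence) - ALLOWED_BASES
    if bad:
        raise ValueError(f"unsupported characters: {sorted(bad)}")

    if target_range is None:
        # Exact-window mode: the sequence is the literal model input.
        if len(sequence) != CONTEXT_LENGTH:
            raise ValueError(
                f"expected {CONTEXT_LENGTH} bp, got {len(sequence)} bp"
            )
        return "exact"

    # Extraction mode: the source must hold a full window (no padding).
    start, end = target_range
    if not 0 <= start < end <= len(sequence):
        raise ValueError(f"target range {target_range} is outside the sequence")
    if len(sequence) < CONTEXT_LENGTH:
        raise ValueError("source sequence is shorter than one model window")
    return "extract"


mode = check_input("ACGTN" * 200_000, target_range=(1_000, 2_000))
```

A check like this only mirrors the documented constraints; the tool performs its own validation and remains the authority on what is accepted.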

API Reference

Inputs

  • sequences (List[SequenceWindow], required): DNA sequence(s) for Borzoi inference. Each item is a sequence with an optional target_range, and a bare string is accepted. Without a target_range the sequence must already be the model context length. With one, the source must be long enough to extract a full window (no padding).
  • output_tracks (List[integer], default: [0]): Track indices to extract from model output.
  • species (enum, default: "human"): Species model to use. Available options: human, mouse.
  • replicate (enum, default: "0"): Replicate ID to run. Available options: 0, 1, 2, 3.
  • avg_output_tracks (boolean, default: True): Whether to average selected tracks.
  • batch_size (integer, default: 1): Number of sequences to process in each GPU batch.
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device used for inference (inherited).
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs

  • results (List[BorzoiPredictionResult], required): Per-sequence prediction results.
  • output_tracks (List[integer], required): Track indices used for prediction.
  • species (string, required): Species used for prediction ("human" or "mouse").
  • replicate (string, required): Borzoi replicate used ("0" through "3").
  • avg_output_tracks (boolean, required): Whether requested tracks were averaged.

Applications

This tool is appropriate for high-throughput screening and iterative sequence design, where a single forward pass per sequence keeps the analysis fast. Representative applications include predicting RNA-seq, CAGE, and chromatin-accessibility profiles for a locus of interest, comparing reference and alternate alleles to estimate the regulatory effect of a noncoding variant, and ranking candidate regulatory sequences inside an optimization loop. The single-replicate setting is well suited to the inner iterations of a design campaign before a final ensemble assessment.
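
As an illustration of the reference-versus-alternate comparison mentioned above, the sketch below computes a log2 fold change of summed predicted coverage. This is one common summary statistic and not necessarily the one used in the Borzoi paper or this toolkit. The function name, the pseudocount, and the toy per-bin values are assumptions, and the inputs are taken to be non-negative predictions for one track over the same bins.

```python
import math


def log2_fold_change(ref_track, alt_track, pseudocount=1.0):
    """log2 ratio of summed predicted coverage, alternate over reference.

    Hypothetical helper: both tracks are per-bin predictions (assumed
    non-negative) for the same bins; the pseudocount avoids dividing by zero.
    """
    if len(ref_track) != len(alt_track):
        raise ValueError("tracks must cover the same bins")
    ref_total = sum(ref_track) + pseudocount
    alt_total = sum(alt_track) + pseudocount
    return math.log2(alt_total / ref_total)


# Toy values standing in for predictions from two runs of the tool.
ref = [1.0, 2.0, 3.0, 1.0]   # sums to 7
alt = [2.0, 5.0, 6.0, 2.0]   # sums to 15
effect = log2_fold_change(ref, alt)   # log2(16 / 8) = 1.0
```

A positive score means the alternate allele is predicted to raise the summed signal, which fits the relative use of the outputs recommended in the Toolkit Notes.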

Usage Tips

  • Exact-window inputs must be exactly 524,288 bp. When no target range is supplied, the provided sequence is treated as the literal model input and is rejected unless it matches the model context length. A longer genomic region should instead be paired with a target range so the tool can extract the fixed window.
  • A target range places a region of interest inside the output bins. Extraction is aligned to the start of the requested range rather than centered, and the window shifts left near the right edge of the source sequence so the full range remains covered. The returned context and output coordinates report where the model window landed in source coordinates.
  • The region of interest is most informative near the center of the input window. Predictions degrade toward the edges of the 524,288 bp context, so a target gene or variant is best positioned close to the midpoint of the supplied window.
  • The species setting selects the checkpoint family. A value of "human" loads the FlashAttention Flashzoi checkpoints and requires a CUDA device, while "mouse" loads the standard Borzoi checkpoints. The two heads predict different track panels, so the species must match the organism of the input sequence.
  • output_tracks selects which assays are returned. Track indices address the full Borzoi output panel (7611 human, 2608 mouse). Selecting a small set of relevant tracks is appropriate when only specific assays inform the analysis.
  • avg_output_tracks=True collapses the selected tracks into a single composite signal. This default is appropriate when a single objective is needed, for example when combining related assays into one optimization score. A value of False returns one row per requested track when per-assay resolution is required.
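
The effect of avg_output_tracks can be pictured with a small sketch. It assumes a plain arithmetic mean across the selected tracks and a tracks-by-bins layout (one row per requested track, as described for avg_output_tracks=False). The helper name and the toy values are illustrative only.

```python
def average_tracks(matrix):
    """Collapse a tracks x bins matrix into one row of per-bin means.

    Hypothetical helper: assumes an unweighted mean across tracks.
    """
    if not matrix:
        raise ValueError("at least one track is required")
    n_tracks = len(matrix)
    # zip(*matrix) walks the matrix bin by bin (column by column).
    return [sum(column) / n_tracks for column in zip(*matrix)]


# Two selected tracks over three bins, as with avg_output_tracks=False.
per_track = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
composite = average_tracks(per_track)   # [2.0, 3.0, 4.0]
```

Requesting per-track rows and averaging downstream keeps the option of weighting assays differently, at the cost of a larger result.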

Borzoi Ensemble (borzoi-ensemble)

Predicts regulatory track activity using all four Borzoi replicates and returns the per-replicate predictions stacked together for each input sequence. The four replicates are evaluated in sequence and share the input handling, coordinate reporting, and track-selection behavior of the single-replicate tool. The spread of predictions across replicates provides a measure of model confidence at each bin.

API Reference

Inputs

  • sequences (List[SequenceWindow], required): DNA sequence(s) for Borzoi inference. Each item is a sequence with an optional target_range, and a bare string is accepted. Without a target_range the sequence must already be the model context length. With one, the source must be long enough to extract a full window (no padding).
  • output_tracks (List[integer], default: [0]): Track indices to extract from model output.
  • species (enum, default: "human"): Species model to use. Available options: human, mouse.
  • avg_output_tracks (boolean, default: True): Whether to average selected tracks.
  • batch_size (integer, default: 1): Number of sequences to process in each GPU batch.
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"): Device used for inference.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs

  • results (List[BorzoiEnsemblePredictionResult], required): Per-sequence ensemble prediction results.
  • output_tracks (List[integer], required): Track indices used for prediction.
  • species (string, required): Species used for prediction ("human" or "mouse").
  • avg_output_tracks (boolean, required): Whether requested tracks were averaged.
  • num_replicates (integer): Number of replicates returned (always 4).

Applications

This tool is appropriate for final assessments and for any analysis that benefits from uncertainty quantification. Computing the dispersion across the four replicate predictions at each bin distinguishes positions where the model is confident from positions where the replicates disagree. Representative applications include reporting confidence intervals on a predicted regulatory profile, filtering candidate variants or designed sequences to those with consistent predicted effects, and producing the headline numbers for a locus after single-replicate screening has narrowed the candidates.

Usage Tips

  • The ensemble runs four full forward passes per sequence. Inference therefore takes roughly four times as long as a single replicate. The single-replicate tool is appropriate for iteration, and the ensemble is appropriate for the final reportable result.
  • Confidence is read from agreement across replicates. A low spread across the four predictions at a bin indicates a robust signal, while a high spread indicates lower model confidence at that position. The per-replicate predictions are returned in full so any dispersion statistic can be computed downstream.
  • Replicate selection is not exposed for the ensemble. All four replicates are always evaluated. The single-replicate tool is the appropriate choice when only one specific replicate is needed.
  • The species, track-selection, and averaging behavior match the single-replicate tool. The same species, output_tracks, and avg_output_tracks guidance applies, and the ensemble inherits the same input modes and coordinate reporting.
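
Since the per-replicate predictions are returned in full, a dispersion statistic can be computed downstream. The sketch below uses the per-bin mean and population standard deviation from the Python standard library. It assumes the stacked predictions for one sequence have been converted to a replicates-by-bins list of lists (for example with averaged tracks). That layout, the helper name, and the toy values are assumptions.

```python
from statistics import mean, pstdev


def replicate_summary(replicates):
    """Per-bin mean and population standard deviation across replicates.

    Hypothetical helper: `replicates` is a replicates x bins list of lists.
    """
    columns = list(zip(*replicates))   # one tuple of replicate values per bin
    means = [mean(column) for column in columns]
    spreads = [pstdev(column) for column in columns]
    return means, spreads


# Four replicates over two bins: full agreement at bin 0, disagreement at bin 1.
stacked = [
    [1.0, 4.0],
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, 2.0],
]
means, spreads = replicate_summary(stacked)
# means   == [1.0, 3.0]
# spreads == [0.0, 1.0]
```

With only four replicates, the spread is a coarse indicator of agreement and is best treated as a relative ranking signal between bins or candidates.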

Toolkit Notes

These apply to every Borzoi tool in this toolkit (borzoi-prediction, borzoi-ensemble).
  • A CUDA GPU is required. Both tools run on GPU, and human prediction uses FlashAttention kernels that are available only on CUDA hardware. The model checkpoints are downloaded from Hugging Face on first use and cached for subsequent runs.
  • Input sequences accept only the bases A, C, G, T, and N. Other characters are rejected during validation. The base N is permitted but encoded as the absence of any base, so a high N content reduces prediction quality and is best minimized.
  • Predicted values are the model’s raw track-activity outputs, returned without any additional post-processing. Higher values correspond to stronger predicted signal. The values are best used for relative comparisons, for example between alleles or across positions, rather than as absolute experimental counts.
  • Output bins map to source coordinates through the reported window. Each result reports the output-bin span in source-sequence coordinates at a resolution of 32 bp per bin, so a bin index can be converted to a genomic position using the output start and the bin resolution.
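
The bin-to-coordinate conversion in the last note is simple arithmetic, sketched below. The 32 bp resolution comes from the text above. The function names, the 0-based half-open convention, and the example output start of 1,000 are assumptions, since the exact result field names are not listed here.

```python
BIN_SIZE = 32   # bp per output bin, as stated above


def bin_to_span(output_start, bin_index, bin_size=BIN_SIZE):
    """Source-coordinate span (start, end) covered by one output bin.

    Hypothetical helper: assumes 0-based bins and half-open spans.
    """
    start = output_start + bin_index * bin_size
    return start, start + bin_size


def position_to_bin(output_start, position, bin_size=BIN_SIZE):
    """Index of the output bin that contains a source-coordinate position."""
    offset = position - output_start
    if offset < 0:
        raise ValueError("position lies before the first output bin")
    return offset // bin_size


span = bin_to_span(1_000, 10)            # (1320, 1352)
bin_index = position_to_bin(1_000, 1_330)   # 10
```

The second helper is the inverse lookup, useful for finding which bin holds a variant or a transcription start site.
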
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.