Borzoi - Proto

License: Borzoi uses Apache-2.0 for code and CC-BY-4.0 for model weights and may require explicit attribution when utilized. Please refer to the code license and model weights license for full terms.

Proto is not affiliated with Calico. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.

calico/borzoi

RNA-seq prediction with deep convolutional neural networks.

Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation

Johannes Linder, Divyanshi Srivastava, … David R Kelley

Nature Genetics (2025)

@article{linder2025borzoi,
  title={Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation},
  author={Linder, Johannes and Srivastava, Divyanshi and Yuan, Han and Agarwal, Vikram and Kelley, David R},
  journal={Nature Genetics},
  volume={57},
  number={4},
  pages={949--961},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41588-024-02053-6}
}

proto-bio/proto-tools/proto_tools/tools/sequence_scoring/borzoi

Function	Description
`run_borzoi_ensemble()`	Regulatory activity prediction using all 4 Borzoi replicates (GPU)
`run_borzoi()`	Regulatory activity prediction using a single Borzoi replicate (GPU)

Background

Gene regulation acts across a wide range of genomic distances. Most promoter-proximal elements operate within a few kilobases, yet enhancers can influence genes from more than 100 kb away, and topologically associating domains organize chromatin contacts over megabase scales. Sequence-to-function models that aim to relate noncoding variation to molecular phenotype therefore require an input window broad enough to capture these long-range relationships at fine spatial resolution.

Borzoi (Linder et al., 2025) learns to predict cell- and tissue-specific RNA-seq coverage from DNA sequence, serving as a unifying model of gene regulation. Using statistics computed from its predicted coverage, Borzoi isolates and scores the DNA regulatory elements that modulate transcription, splicing, and polyadenylation, with greater accuracy than comparable models. In benchmarks against state-of-the-art models, it accurately predicts the influence of variants on RNA expression and splicing and recapitulates the causal variants underlying molecular quantitative trait loci. Alongside RNA-seq, the model predicts CAGE, DNase-seq, ATAC-seq, and histone modification tracks, making it a broad regulatory genomics predictor.

The published model couples a convolutional sequence encoder with transformer-style attention to process the full 524,288 bp window, and separate output heads produce human and mouse track predictions. The Borzoi authors trained four model replicates from independent initializations, which this toolkit exposes both as a single-replicate tool and as a four-replicate ensemble. A separate FlashAttention-based distillation of Borzoi, named Flashzoi, reaches comparable accuracy at substantially higher speed. The human checkpoints exposed by this toolkit use the Flashzoi distillation, and the mouse checkpoints use the standard Borzoi architecture.

Learning Resources

calico/borzoi (Calico Life Sciences). Official repository with the reference model code, training data references, and usage documentation.
Borzoi PyTorch weights (Hugging Face). The PyTorch-converted Borzoi and Flashzoi checkpoints that this toolkit loads at inference time.

Tools

Borzoi Prediction (`borzoi-prediction`)

Predicts regulatory track activity for one or more DNA sequences using a single Borzoi replicate. Each sequence may be supplied as an exact 524,288 bp model window, or as a longer source sequence paired with a sequence-relative target range that the tool uses to extract the fixed model window. For every input, the tool returns a per-bin activity matrix together with the source-sequence coordinates of the model input window and the output-bin span, so predictions can be mapped back onto the original sequence.

API Reference

Input: BorzoiInput

sequences

List[SequenceWindow]

required

DNA sequence(s) for Borzoi inference. Each item is a sequence with an optional target_range, and a bare string is accepted. Without a target_range the sequence must already be the model context length. With one, the source must be long enough to extract a full window (no padding).

Fields: SequenceWindow

sequence

string

required

DNA sequence: either an exact model-context window or a longer source sequence paired with target_range.

target_range

SequenceTargetRange

Optional sequence-relative span the tool must keep inside the model output bins. Windowing is all-or-nothing across a call: set target_range on every window or on none (see `windows_target_ranges`).
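
The input rules above can be checked before a call is submitted. The following is a minimal sketch, not the toolkit's own validation: `check_windows` and the `(sequence, target_range)` tuple layout are hypothetical stand-ins, the 0-based half-open coordinate convention is an assumption, and only the 524,288 bp context length, the no-padding rule, and the all-or-nothing rule come from the descriptions above.

```python
from typing import List, Optional, Tuple

CONTEXT_LENGTH = 524_288  # model window length stated in this documentation

# Hypothetical plain-Python stand-in for a SequenceWindow:
# (sequence, target_range), where target_range is (start, end) or None.
Window = Tuple[str, Optional[Tuple[int, int]]]


def check_windows(windows: List[Window]) -> None:
    """Raise ValueError if the windows break the documented input rules."""
    with_range = [w for w in windows if w[1] is not None]
    # Windowing is all-or-nothing: every window has a target_range, or none does.
    if with_range and len(with_range) != len(windows):
        raise ValueError("set target_range on every window or on none")
    for seq, target in windows:
        if target is None:
            # Without a target_range the sequence is the literal model input.
            if len(seq) != CONTEXT_LENGTH:
                raise ValueError(f"expected {CONTEXT_LENGTH} bp, got {len(seq)}")
        else:
            start, end = target
            # With a target_range the source must hold a full window (no padding).
            if len(seq) < CONTEXT_LENGTH:
                raise ValueError("source shorter than one model window")
            # Assumption: coordinates are 0-based and half-open.
            if not 0 <= start < end <= len(seq):
                raise ValueError("target_range must lie inside the sequence")
```

Running a check like this locally avoids spending a GPU call on inputs that would be rejected anyway.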

Config: BorzoiConfig

output_tracks

List[integer]

default:"[0]"

Track indices to extract from model output.

species

enum

default:"human"

Species model to use. Available options: human, mouse

replicate

enum

default:"0"

Replicate ID to run. Available options: 0, 1, 2, 3

avg_output_tracks

boolean

default:"True"

Whether to average selected tracks.

batch_size

integer

default:"1"

Number of sequences to process in each GPU batch.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device used for inference (inherited).

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output: BorzoiOutput

results

List[BorzoiPredictionResult]

required

Per-sequence prediction results.

Fields: BorzoiPredictionResult

sequence

string

required

Input DNA sequence that was scored.

sequence_length

integer

required

Length of the input sequence.

prediction

List[array]

required

Prediction matrix with shape [num_tracks, 6144].

context_start

integer

required

Start coordinate of the Borzoi input window in the source sequence.

context_end

integer

required

End coordinate of the Borzoi input window in the source sequence.

output_start

integer

required

Source-sequence coordinate of the first Borzoi output bin.

output_end

integer

required

Source-sequence coordinate immediately after the last Borzoi output bin.

output_resolution

integer

Base pairs represented by each output bin.

target_start

integer

Target start coordinate supplied for this sequence.

target_end

integer

Target end coordinate supplied for this sequence.

output_tracks

List[integer]

required

Track indices used for prediction.

species

string

required

Species used for prediction ("human" or "mouse").

replicate

string

required

Borzoi replicate used ("0" through "3").

avg_output_tracks

boolean

required

Whether requested tracks were averaged.

Applications

This tool is appropriate for high-throughput screening and iterative sequence design, where a single forward pass per sequence keeps the analysis fast. Representative applications include predicting RNA-seq, CAGE, and chromatin-accessibility profiles for a locus of interest, comparing reference and alternate alleles to estimate the regulatory effect of a noncoding variant, and ranking candidate regulatory sequences inside an optimization loop. The single-replicate setting is well suited to the inner iterations of a design campaign before a final ensemble assessment.
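
For the allele-comparison use case, one common summary is the log fold change of summed predicted signal between the reference and alternate alleles. The sketch below is a generic illustration on plain lists, not a scoring method defined by this toolkit: `log2_fold_change`, the pseudocount, and the toy values are all assumptions, and the code assumes non-negative predicted values.

```python
import math
from typing import Sequence


def log2_fold_change(ref: Sequence[float], alt: Sequence[float],
                     bin_start: int, bin_end: int,
                     pseudocount: float = 1.0) -> float:
    """log2 ratio of alt vs ref signal summed over bins [bin_start, bin_end)."""
    if len(ref) != len(alt):
        raise ValueError("ref and alt tracks must have the same number of bins")
    ref_sum = sum(ref[bin_start:bin_end])
    alt_sum = sum(alt[bin_start:bin_end])
    # The pseudocount (an arbitrary choice here) avoids division by zero.
    return math.log2((alt_sum + pseudocount) / (ref_sum + pseudocount))


# Toy single-track predictions; real rows would hold 6144 bins each.
ref_track = [0.0, 1.0, 3.0, 3.0, 1.0, 0.0]
alt_track = [0.0, 1.0, 7.0, 7.0, 1.0, 0.0]
score = log2_fold_change(ref_track, alt_track, bin_start=2, bin_end=4)
print(round(score, 3))
```

A positive score means the alternate allele raises the predicted signal in the chosen bins, and a negative score means it lowers it.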

Usage Tips

Exact-window inputs must be exactly 524,288 bp. When no target range is supplied, the provided sequence is treated as the literal model input and is rejected unless it matches the model context length. A longer genomic region should instead be paired with a target range so the tool can extract the fixed window.
A target range places a region of interest inside the output bins. Extraction is aligned to the start of the requested range rather than centered, and the window shifts left near the right edge of the source sequence so the full range remains covered. The returned context and output coordinates report where the model window landed in source coordinates.
The region of interest is most informative near the center of the input window. Predictions degrade toward the edges of the 524,288 bp context, so a target gene or variant is best positioned close to the midpoint of the supplied window.
The species setting selects the checkpoint family. A value of "human" loads the FlashAttention Flashzoi checkpoints and requires a CUDA device, while "mouse" loads the standard Borzoi checkpoints. The two heads predict different track panels, so the species must match the organism of the input sequence.
output_tracks selects which assays are returned. Track indices address the full Borzoi output panel (7611 human, 2608 mouse). Selecting a small set of relevant tracks is appropriate when only specific assays inform the analysis.
avg_output_tracks=True collapses the selected tracks into a single composite signal. This default is appropriate when a single objective is needed, for example when combining related assays into one optimization score. A value of False returns one row per requested track when per-assay resolution is required.
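
The effect of avg_output_tracks can be pictured as a mean over the track axis. This is a minimal sketch on nested lists, assuming a plain arithmetic mean and that the averaged result keeps a single row. Neither detail is confirmed by the text above, and `average_tracks` is a hypothetical helper.

```python
from typing import List


def average_tracks(prediction: List[List[float]]) -> List[List[float]]:
    """Collapse a [num_tracks, num_bins] matrix to one row by per-bin mean."""
    if not prediction:
        raise ValueError("prediction must contain at least one track")
    num_tracks = len(prediction)
    # zip(*rows) walks the matrix bin by bin across all selected tracks.
    return [[sum(column) / num_tracks for column in zip(*prediction)]]


# Two toy tracks with three bins each (real rows hold 6144 bins).
per_track = [[1.0, 2.0, 3.0],
             [3.0, 4.0, 5.0]]
composite = average_tracks(per_track)
```

Requesting avg_output_tracks=False and averaging downstream, as sketched here, keeps the per-assay rows available if they are needed later.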

Borzoi Ensemble (`borzoi-ensemble`)

Predicts regulatory track activity using all four Borzoi replicates and returns the per-replicate predictions stacked together for each input sequence. The four replicates are evaluated one after another and share the input handling, coordinate reporting, and track-selection behavior of the single-replicate tool. The spread of predictions across replicates provides a measure of model confidence at each bin.

API Reference

Input: BorzoiInput

sequences

List[SequenceWindow]

required

Fields: SequenceWindow

sequence

string

required

DNA sequence: either an exact model-context window or a longer source sequence paired with target_range.

target_range

SequenceTargetRange

Config: BorzoiEnsembleConfig

output_tracks

List[integer]

default:"[0]"

Track indices to extract from model output.

species

enum

default:"human"

Species model to use. Available options: human, mouse

avg_output_tracks

boolean

default:"True"

Whether to average selected tracks.

batch_size

integer

default:"1"

Number of sequences to process in each GPU batch.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device used for inference.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Output: BorzoiEnsembleOutput

results

List[BorzoiEnsemblePredictionResult]

required

Per-sequence ensemble prediction results.

Fields: BorzoiEnsemblePredictionResult

sequence

string

required

Input DNA sequence that was scored.

sequence_length

integer

required

Length of the input sequence.

predictions

List[array]

required

Stacked predictions with shape [4, num_tracks, 6144] for replicates 0-3.

context_start

integer

required

Start coordinate of the Borzoi input window in the source sequence.

context_end

integer

required

End coordinate of the Borzoi input window in the source sequence.

output_start

integer

required

Source-sequence coordinate of the first Borzoi output bin.

output_end

integer

required

Source-sequence coordinate immediately after the last Borzoi output bin.

output_resolution

integer

Base pairs represented by each output bin.

target_start

integer

Target start coordinate supplied for this sequence.

target_end

integer

Target end coordinate supplied for this sequence.

output_tracks

List[integer]

required

Track indices used for prediction.

species

string

required

Species used for prediction ("human" or "mouse").

avg_output_tracks

boolean

required

Whether requested tracks were averaged.

num_replicates

integer

Number of replicates returned (always 4).

Applications

This tool is appropriate for final assessments and for any analysis that benefits from uncertainty quantification. Computing the dispersion across the four replicate predictions at each bin distinguishes positions where the model is confident from positions where the replicates disagree. Representative applications include reporting confidence intervals on a predicted regulatory profile, filtering candidate variants or designed sequences to those with consistent predicted effects, and producing the headline numbers for a locus after single-replicate screening has narrowed the candidates.

Usage Tips

The ensemble runs four full forward passes per sequence. Inference therefore takes roughly four times as long as a single replicate. The single-replicate tool is appropriate for iteration, and the ensemble is appropriate for the final reportable result.
Confidence is read from agreement across replicates. A low spread across the four predictions at a bin indicates a robust signal, while a high spread indicates lower model confidence at that position. The per-replicate predictions are returned in full so any dispersion statistic can be computed downstream.
Replicate selection is not exposed for the ensemble. All four replicates are always evaluated. The single-replicate tool is the appropriate choice when only one specific replicate is needed.
The species, track-selection, and averaging behavior match the single-replicate tool. The same species, output_tracks, and avg_output_tracks guidance applies, and the ensemble inherits the same input modes and coordinate reporting.
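
The agreement check described above can be computed directly from the stacked predictions. The sketch below uses only the standard library and a toy array in place of a real [4, num_tracks, 6144] result. The helper name `replicate_spread` and the choice of the population standard deviation are assumptions, and any other dispersion statistic could be substituted.

```python
import statistics
from typing import List, Tuple

Stack = List[List[List[float]]]  # [num_replicates][num_tracks][num_bins]
Matrix = List[List[float]]       # [num_tracks][num_bins]


def replicate_spread(predictions: Stack) -> Tuple[Matrix, Matrix]:
    """Return per-track, per-bin mean and standard deviation across replicates."""
    num_tracks = len(predictions[0])
    means, spreads = [], []
    for t in range(num_tracks):
        # Gather the same track from every replicate, then walk bin by bin.
        rows = [replicate[t] for replicate in predictions]
        columns = list(zip(*rows))
        means.append([statistics.fmean(c) for c in columns])
        spreads.append([statistics.pstdev(c) for c in columns])
    return means, spreads


# Toy stack: 4 replicates, 1 track, 2 bins. Bin 0 agrees, bin 1 does not.
toy = [[[2.0, 1.0]], [[2.0, 3.0]], [[2.0, 1.0]], [[2.0, 3.0]]]
mean, spread = replicate_spread(toy)
```

In the toy data the first bin has zero spread and the second does not, which is the pattern to look for when flagging bins where the replicates disagree.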

Toolkit Notes

These apply to every Borzoi tool in this toolkit (borzoi-prediction, borzoi-ensemble).

A CUDA GPU is required. Both tools run on GPU, and human prediction uses FlashAttention kernels that are available only on CUDA hardware. The model checkpoints are downloaded from Hugging Face on first use and cached for subsequent runs.
Input sequences accept only the bases A, C, G, T, and N. Other characters are rejected during validation. The base N is permitted but encoded as the absence of any base, so a high N content reduces prediction quality and is best minimized.
Predicted values are the model’s raw track-activity outputs, returned without any additional post-processing. Higher values correspond to stronger predicted signal. The values are best used for relative comparisons, for example between alleles or across positions, rather than as absolute experimental counts.
Output bins map to source coordinates through the reported window. Each result reports the output-bin span in source-sequence coordinates at a resolution of 32 bp per bin, so a bin index can be converted to a genomic position using the output start and the bin resolution.
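
The coordinate arithmetic in the last note can be written out as follows. This is a sketch under stated assumptions: coordinates are treated as 0-based and half-open, the helper names are hypothetical, and the example output_start is an illustrative value and not one returned by the tool.

```python
from typing import Tuple

OUTPUT_RESOLUTION = 32   # bp per output bin, as stated above
NUM_BINS = 6144          # bins per prediction row, per the output schema


def bin_to_span(bin_index: int, output_start: int,
                resolution: int = OUTPUT_RESOLUTION) -> Tuple[int, int]:
    """Source-coordinate span [start, end) covered by one output bin."""
    if not 0 <= bin_index < NUM_BINS:
        raise IndexError(f"bin_index must be in [0, {NUM_BINS})")
    start = output_start + bin_index * resolution
    return start, start + resolution


def position_to_bin(position: int, output_start: int,
                    resolution: int = OUTPUT_RESOLUTION) -> int:
    """Index of the output bin containing a source coordinate."""
    bin_index = (position - output_start) // resolution
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError("position falls outside the output-bin span")
    return bin_index


# The output bins together cover 6144 * 32 = 196,608 bp of the 524,288 bp window.
output_start = 163_840  # illustrative value only
first_span = bin_to_span(0, output_start)
last_span = bin_to_span(NUM_BINS - 1, output_start)
```

Because the bins cover only part of the input window, a position inside the window but outside the reported output span has no corresponding bin, which is why `position_to_bin` raises in that case.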

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.