Enformer - Proto

License: Enformer uses Apache-2.0 for code and CC-BY-4.0 for model weights and may require explicit attribution when utilized. Please refer to the code license and model weights license for full terms.

Proto is not affiliated with Google DeepMind. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


google-deepmind/deepmind-research

This repository contains implementations and illustrative code to accompany DeepMind publications

14.8k stars


Effective gene expression prediction from sequence by integrating long-range interactions

Žiga Avsec, Vikram Agarwal, … David R Kelley

Nature Methods (2021)


@article{avsec2021enformer,
  title={Effective gene expression prediction from sequence by integrating long-range interactions},
  author={Avsec, {\v{Z}}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R and Grabska-Barwinska, Agnieszka and Taylor, Kyle R and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R},
  journal={Nature Methods},
  volume={18},
  number={10},
  pages={1196--1203},
  year={2021},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-021-01252-x}
}


proto-bio/proto-tools/proto_tools/tools/sequence_scoring/enformer


Function	Description
`run_enformer()`	Gene expression and regulatory activity prediction using Enformer (GPU)

Background

Enformer (Avsec et al., 2021) is a neural network that predicts gene expression and chromatin state from genomic DNA sequence. Its architecture combines convolutional layers with transformer self-attention, which allows the model to integrate the influence of distal cis-regulatory elements such as enhancers located up to 100 kilobases away from a promoter. The published work reports that this long-range modeling substantially improves gene expression prediction accuracy relative to earlier convolutional models, and that it yields more accurate predictions of the effect of genetic variants on expression for both natural variants and saturation mutagenesis measured by reporter assays.

Enformer predicts activity across 896 output bins, each summarizing 128 base pairs, for a large panel of functional genomics assays. The human output head covers 5,313 tracks spanning chromatin accessibility, transcription factor binding, histone modifications, and CAGE expression measurements, and the mouse output head covers 1,643 tracks. Because the model maps sequence directly to these signals, it can be used to compare the predicted regulatory activity of alternative alleles at the same locus, which is the basis for its use in interpreting noncoding genetic variation. The published analysis additionally shows that Enformer learns enhancer-promoter relationships directly from the input sequence.

Learning Resources

Predicting gene expression with AI (Google DeepMind). The announcement blog post introducing Enformer, its long-range modeling, and its use in interpreting genetic variants.
Enformer model repository (Google DeepMind). Official Enformer code, model card, and usage guidance.
enformer-pytorch (Phil Wang). The PyTorch implementation and pretrained weight loader that this toolkit uses to run the model.

Tools

Enformer Prediction (`enformer-prediction`)

Predicts regulatory track activity for one or more DNA sequences. The tool accepts input in two forms. In exact-window form, each sequence is exactly 196,608 base pairs, the full Enformer model context, and is passed to the model directly. In target-range form, each sequence is longer than the model context and is paired with a target range, and the tool extracts the 196,608 base-pair context window that keeps the requested range inside the model’s output bins. Each result carries the predicted activity matrix of shape 896 bins by the number of selected tracks, together with the coordinates that map the output bins back onto the source sequence.
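The target-range form described above can be illustrated with a small window-selection sketch. The tool's actual placement policy is not documented here, so the centering-then-clamping rule below is an assumption, and `choose_context_window` is a hypothetical helper rather than part of the toolkit. Coordinates are 0-based with exclusive ends.

```python
CONTEXT_LEN = 196_608                      # model input length in base pairs
NUM_BINS = 896                             # output bins per prediction
BIN_SIZE = 128                             # base pairs per output bin
OUTPUT_LEN = NUM_BINS * BIN_SIZE           # 114,688 bp covered by the bins
FLANK = (CONTEXT_LEN - OUTPUT_LEN) // 2    # 40,960 bp of context on each side


def choose_context_window(source_len, target_start, target_end):
    """Return (context_start, context_end, output_start, output_end).

    Hypothetical policy: center the target in the context window, then
    shift the window so it stays inside the source (no padding).
    """
    if not 0 <= target_start < target_end <= source_len:
        raise ValueError("target range must lie inside the source sequence")
    if target_end - target_start > OUTPUT_LEN:
        raise ValueError("target range is wider than the output window")
    if source_len < CONTEXT_LEN:
        raise ValueError("source is too short for a full context window")

    target_mid = (target_start + target_end) // 2
    context_start = target_mid - CONTEXT_LEN // 2
    # Clamp so the whole window fits inside the source sequence.
    context_start = max(0, min(context_start, source_len - CONTEXT_LEN))

    output_start = context_start + FLANK
    output_end = output_start + OUTPUT_LEN
    if target_start < output_start or target_end > output_end:
        raise ValueError("target range cannot be kept inside the output bins")
    return context_start, context_start + CONTEXT_LEN, output_start, output_end


window = choose_context_window(300_000, 150_000, 150_100)
print(window)
```

A target that sits too close to either end of the source cannot be covered, because the first and last 40,960 base pairs of any window lie outside the output bins. The sketch raises an error in that case.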

API Reference


Input: EnformerInput

sequences

List[SequenceWindow]

required

DNA sequence(s) for Enformer inference. Each item is a sequence with an optional target_range, and a bare string is accepted. Without a target_range the sequence must already be the model context length. With one, the source must be long enough to extract a full window (no padding).

SequenceWindow fields

sequence

string

required

DNA sequence — an exact model-context window, or a longer source sequence paired with target_range.

target_range

SequenceTargetRange

Optional sequence-relative span the tool must keep inside the model output bins. Windowing is all-or-nothing across a call: set target_range on every window or on none (see `windows_target_ranges`).
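The input rules above (exact length without a target range, a full-length source with one, and all-or-nothing windowing) can be checked before a call. The sketch below is a hypothetical pre-flight check, not the toolkit's own validation. It represents each window as a `(sequence, target_range)` pair, where `target_range` is either `None` or a `(start, end)` tuple, which is an assumed stand-in for `SequenceTargetRange`.

```python
CONTEXT_LEN = 196_608  # model context length in base pairs


def check_windows(windows):
    """Validate (sequence, target_range) pairs and return the input form."""
    if not windows:
        raise ValueError("at least one sequence is required")

    has_range = [rng is not None for _, rng in windows]
    if any(has_range) and not all(has_range):
        raise ValueError("set target_range on every window or on none")

    for index, (seq, rng) in enumerate(windows):
        if rng is None:
            # Exact-window form: the sequence is passed to the model as is.
            if len(seq) != CONTEXT_LEN:
                raise ValueError(f"window {index}: expected {CONTEXT_LEN} bp")
        else:
            # Target-range form: the source must hold a full window.
            start, end = rng
            if len(seq) < CONTEXT_LEN:
                raise ValueError(f"window {index}: source shorter than context")
            if not 0 <= start < end <= len(seq):
                raise ValueError(f"window {index}: target range out of bounds")

    return "target-range" if all(has_range) else "exact-window"


print(check_windows([("A" * CONTEXT_LEN, None)]))
```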


Config: EnformerConfig

output_tracks

List[integer]

default:"[0]"

Track indices to extract from the Enformer output.

species

enum

default:"human"

Species track head to use. Available options: human, mouse.

batch_size

integer

default:"1"

Number of sequences to process in each GPU batch.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cuda"

Device used for inference.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
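Because the two species heads expose different numbers of tracks, it can help to check `output_tracks` against the selected `species` before running inference. The following is a minimal sketch with a hypothetical `validate_tracks` helper. It assumes track indices are 0-based, which is suggested by the default of `[0]` but not stated explicitly.

```python
# Track counts per output head, as stated in this document.
TRACKS_PER_SPECIES = {"human": 5313, "mouse": 1643}


def validate_tracks(output_tracks, species="human"):
    """Return the track list unchanged if every index fits the species head."""
    if species not in TRACKS_PER_SPECIES:
        raise ValueError(f"unknown species: {species!r}")
    num_tracks = TRACKS_PER_SPECIES[species]
    # Assumption: valid indices run from 0 to num_tracks - 1.
    bad = [t for t in output_tracks if not 0 <= t < num_tracks]
    if bad:
        raise ValueError(f"indices out of range for {species} head: {bad}")
    return list(output_tracks)


print(validate_tracks([0, 5312], species="human"))
```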


Output: EnformerOutput

results

List[EnformerPredictionResult]

required

Per-sequence prediction results.

EnformerPredictionResult fields

sequence

string

required

Input DNA sequence that was scored.

sequence_length

integer

required

Length of the input sequence.

prediction

List[array]

required

Predicted signal matrix with shape [896, num_tracks].

context_start

integer

required

Start coordinate of the Enformer input window in the source sequence.

context_end

integer

required

End coordinate of the Enformer input window in the source sequence.

output_start

integer

required

Source-sequence coordinate of the first Enformer output bin.

output_end

integer

required

Source-sequence coordinate immediately after the last Enformer output bin.

output_resolution

integer

Base pairs represented by each output bin.

target_start

integer

Target start coordinate supplied for this sequence.

target_end

integer

Target end coordinate supplied for this sequence.

output_tracks

List[integer]

required

Track indices that were extracted.

species

string

required

Species used for prediction ("human" or "mouse").
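The coordinate fields make it possible to find the rows of `prediction` that overlap a region of interest. The sketch below shows one way to do this with plain integers. `bins_overlapping` is a hypothetical helper, and the default resolution of 128 base pairs is taken from the bin size stated in the Background.

```python
def bins_overlapping(target_start, target_end, output_start, output_end,
                     output_resolution=128):
    """Return (first_bin, last_bin_exclusive) for a 0-based half-open target."""
    if target_start < output_start or target_end > output_end:
        raise ValueError("target is not fully covered by the output bins")
    first_bin = (target_start - output_start) // output_resolution
    # Ceiling division so a partially covered last bin is included.
    last_bin = -(-(target_end - output_start) // output_resolution)
    return first_bin, last_bin


# Illustrative values for an exact-window input whose bins start at 40,960.
first, last = bins_overlapping(98_300, 98_310, 40_960, 155_648)
print(first, last)
```

The rows for the region can then be taken as `prediction[first:last]`, giving one row per overlapping bin and one column per selected track.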

Applications

This tool is appropriate for analyses that relate noncoding DNA sequence to predicted regulatory activity. Representative applications include predicting the effect of a candidate variant by comparing the activity of the reference and alternate sequences, prioritizing noncoding variants for follow-up by the magnitude of their predicted regulatory change, screening designed or synthetic regulatory sequences for predicted promoter or enhancer activity, and surveying the predicted chromatin and expression landscape across a genomic locus of interest.
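The reference-versus-alternate comparison can be sketched once two prediction matrices are in hand. The example below treats each prediction as a list of bin rows, each row holding one value per selected track. Summing the alternate-minus-reference difference over bins is one simple scoring choice used here for illustration, and is not necessarily the score used by the authors or by this toolkit. `variant_effect` is a hypothetical helper.

```python
def variant_effect(ref_pred, alt_pred, first_bin=0, last_bin=None):
    """Per-track sum over bins of (alt - ref) within [first_bin, last_bin)."""
    if len(ref_pred) != len(alt_pred):
        raise ValueError("reference and alternate must have the same bins")
    if last_bin is None:
        last_bin = len(ref_pred)
    num_tracks = len(ref_pred[0])
    return [
        sum(alt_pred[b][t] - ref_pred[b][t] for b in range(first_bin, last_bin))
        for t in range(num_tracks)
    ]


# Toy matrices with 4 bins and 2 tracks (real outputs have 896 bins).
ref = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
alt = [[1.0, 2.0], [1.5, 2.0], [2.0, 1.0], [1.0, 2.0]]
print(variant_effect(ref, alt))
```

Variants can then be ranked by the largest absolute per-track score, restricted to the tracks relevant to the question at hand.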

Usage Tips

Each input sequence must provide a full 196,608 base-pair model context. A sequence supplied in exact-window form must be exactly this length. A longer sequence must be paired with a target_range, and the tool then extracts the model context window around that range. The tool does not pad missing context, so a sequence shorter than the model context without a target range is rejected.
output_tracks selects which of the model’s tracks are returned and defaults to the single track at index 0. The human head exposes 5,313 tracks and the mouse head exposes 1,643 tracks. Selecting only the tracks of interest, rather than all of them, keeps the returned activity matrix small and the analysis focused on the relevant assays.
species selects the human or mouse output head and defaults to human. The track index meanings differ between the two heads, so the output_tracks indices should be chosen to match the selected species.
batch_size controls how many sequences are processed together on the GPU and defaults to 1. Larger values increase throughput when many sequences are scored in one call, subject to the available GPU memory.
The output covers a 114,688 base-pair central window, narrower than the input context. The 896 output bins span the central portion of the input, and the per-result output_start and output_end coordinates report exactly which part of the source sequence the bins cover. A feature of interest must fall within this central window to receive a prediction.
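The window sizes in these tips follow from simple arithmetic, which the sketch below works through. It assumes the output window is centered in the input context, consistent with the "central portion" wording above. `is_predicted` is a hypothetical helper for the exact-window form.

```python
CONTEXT_LEN = 196_608                      # input context in base pairs
NUM_BINS = 896                             # number of output bins
BIN_SIZE = 128                             # base pairs per bin
OUTPUT_LEN = NUM_BINS * BIN_SIZE           # 896 * 128 = 114,688
FLANK = (CONTEXT_LEN - OUTPUT_LEN) // 2    # (196,608 - 114,688) / 2 = 40,960


def is_predicted(position, context_start=0):
    """True if a 0-based position falls inside the central output window."""
    offset = position - context_start
    return FLANK <= offset < FLANK + OUTPUT_LEN


print(OUTPUT_LEN, FLANK)
print(is_predicted(98_304), is_predicted(1_000))
```

Under this assumption, roughly the first and last 40,960 base pairs of each window inform the prediction but receive no output bins of their own.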

Toolkit Notes

These apply to every Enformer tool in this toolkit (enformer-prediction).

Enformer runs on a GPU and downloads its published weights on first use. The model parameters are retrieved from the hosted enformer-official-rough checkpoint the first time the tool runs and are reused on subsequent runs. A CUDA-capable GPU is recommended because the 196,608 base-pair context makes CPU inference slow.
Output coordinates are 0-based with exclusive ends, following the genomics interval convention rather than the 1-based residue numbering used elsewhere in proto-tools. The context_start, context_end, output_start, and output_end fields locate the model window and the output-bin span within the source sequence using this convention.
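As an illustration of the coordinate convention, the sketch below maps an output bin index to its 0-based, end-exclusive interval and converts that interval to 1-based inclusive numbering. Both helpers are hypothetical, and the defaults of 128 base pairs per bin and 896 bins are taken from the figures stated in this document.

```python
def bin_interval(bin_index, output_start, output_resolution=128, num_bins=896):
    """Return the 0-based half-open (start, end) interval of one output bin."""
    if not 0 <= bin_index < num_bins:
        raise IndexError("bin index out of range")
    start = output_start + bin_index * output_resolution
    return start, start + output_resolution


def to_one_based_inclusive(start, end):
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start + 1, end


first = bin_interval(0, output_start=40_960)
last = bin_interval(895, output_start=40_960)
print(first, last, to_one_based_inclusive(*first))
```

With half-open intervals, the end of one bin equals the start of the next, and the end of the last bin equals `output_end`.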

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
