License: Enformer uses Apache-2.0 for code and CC-BY-4.0 for model weights and may require explicit attribution when utilized. Please refer to the code license and model weights license for full terms.

Proto is not affiliated with Google DeepMind. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


google-deepmind/deepmind-research
This repository contains implementations and illustrative code to accompany DeepMind publications
Effective gene expression prediction from sequence by integrating long-range interactions
Žiga Avsec, Vikram Agarwal, … David R Kelley
Nature Methods (2021)
@article{avsec2021enformer,
  title={Effective gene expression prediction from sequence by integrating long-range interactions},
  author={Avsec, {\v{Z}}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R and Grabska-Barwinska, Agnieszka and Taylor, Kyle R and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R},
  journal={Nature Methods},
  volume={18},
  number={10},
  pages={1196--1203},
  year={2021},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-021-01252-x}
}
proto-bio/proto-tools/proto_tools/tools/sequence_scoring/enformer
Function          Description
run_enformer()    Gene expression and regulatory activity prediction using Enformer (GPU)

Background

Enformer (Avsec et al., 2021) is a neural network that predicts gene expression and chromatin state from genomic DNA sequence. Its architecture combines convolutional layers with transformer self-attention, which allows the model to integrate the influence of distal cis-regulatory elements such as enhancers located up to 100 kilobases away from a promoter. The published work reports that this long-range modeling substantially improves gene expression prediction accuracy relative to earlier convolutional models, and that it yields more accurate predictions of the effect of genetic variants on expression for both natural variants and saturation mutagenesis measured by reporter assays.

Enformer predicts activity across 896 output bins, each summarizing 128 base pairs, for a large panel of functional genomics assays. The human output head covers 5,313 tracks spanning chromatin accessibility, transcription factor binding, histone modifications, and CAGE expression measurements, and the mouse output head covers 1,643 tracks. Because the model maps sequence directly to these signals, it can be used to compare the predicted regulatory activity of alternative alleles at the same locus, which is the basis for its use in interpreting noncoding genetic variation. The published analysis additionally shows that Enformer learns enhancer-promoter relationships directly from the input sequence.

Learning Resources

  • Predicting gene expression with AI (Google DeepMind). The announcement blog post introducing Enformer, its long-range modeling, and its use in interpreting genetic variants.
  • Enformer model repository (Google DeepMind). Official Enformer code, model card, and usage guidance.
  • enformer-pytorch (Phil Wang). The PyTorch implementation and pretrained weight loader that this toolkit uses to run the model.

Tools

Enformer Prediction (enformer-prediction)

Predicts regulatory track activity for one or more DNA sequences. The tool accepts input in two forms. In exact-window form, each sequence is exactly 196,608 base pairs, the full Enformer model context, and is passed to the model directly. In target-range form, each sequence is longer than the model context and is paired with a target range, and the tool extracts the 196,608 base-pair context window that keeps the requested range inside the model’s output bins. Each result carries the predicted activity matrix of shape 896 bins by the number of selected tracks, together with the coordinates that map the output bins back onto the source sequence.
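The window arithmetic behind target-range form can be sketched as follows. The constants come from this page (a 196,608 base-pair context and 896 bins of 128 base pairs). The helper name choose_context_window and its center-then-clamp policy are assumptions made for illustration, not the tool's documented selection rule.

```python
CONTEXT_BP = 196_608                      # model input length stated above
BIN_BP = 128                              # base pairs summarized per output bin
N_BINS = 896                              # output bins per prediction
OUTPUT_BP = N_BINS * BIN_BP               # span covered by the output bins
FLANK_BP = (CONTEXT_BP - OUTPUT_BP) // 2  # unscored context on each side


def choose_context_window(seq_len, target_start, target_end):
    """Pick a context window whose output span contains [target_start, target_end).

    Coordinates are 0-based with exclusive ends. Hypothetical helper: the
    centering and clamping policy is an assumption, not the tool's own rule.
    """
    if seq_len < CONTEXT_BP:
        raise ValueError("source sequence is shorter than the model context")
    if not 0 <= target_start < target_end <= seq_len:
        raise ValueError("target range must lie inside the source sequence")
    if target_end - target_start > OUTPUT_BP:
        raise ValueError("target range is wider than the output span")

    # Center the window on the target midpoint, then clamp to the sequence ends.
    midpoint = (target_start + target_end) // 2
    context_start = midpoint - CONTEXT_BP // 2
    context_start = max(0, min(context_start, seq_len - CONTEXT_BP))

    output_start = context_start + FLANK_BP
    output_end = output_start + OUTPUT_BP
    if target_start < output_start or target_end > output_end:
        # Clamping left the target in the unscored flank; padding is not allowed.
        raise ValueError("target range cannot be placed inside the output bins")
    return {
        "context_start": context_start,
        "context_end": context_start + CONTEXT_BP,
        "output_start": output_start,
        "output_end": output_end,
    }


window = choose_context_window(seq_len=300_000, target_start=150_000, target_end=150_500)
print(window)
```

Under this policy a target that sits too close to either end of the source sequence is rejected rather than padded, which mirrors the no-padding behavior described for the tool.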

API Reference

Inputs

  • sequences (List[SequenceWindow], required). DNA sequence(s) for Enformer inference. Each item is a sequence with an optional target_range, and a bare string is accepted. Without a target_range the sequence must already be the model context length. With one, the source must be long enough to extract a full window (no padding).
  • output_tracks (List[integer], default: [0]). Track indices to extract from the Enformer output.
  • species (enum, default: "human"). Species track head to use. Available options: human, mouse.
  • batch_size (integer, default: 1). Number of sequences to process in each GPU batch.
  • verbose (integer, default: 0). Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cuda"). Device used for inference.
  • timeout (integer, default: 600). Maximum execution time in seconds. None waits indefinitely.
  • seed (integer). Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs

  • results (List[EnformerPredictionResult], required). Per-sequence prediction results.
  • output_tracks (List[integer], required). Track indices that were extracted.
  • species (string, required). Species used for prediction ("human" or "mouse").
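As a rough illustration of how the documented inputs fit together, the sketch below assembles keyword arguments using the parameter names and defaults listed above and checks track indices against the head sizes given on this page. build_enformer_kwargs is a hypothetical helper, the 0-based indexing of tracks is an assumption, and whether run_enformer() accepts exactly these keyword arguments should be confirmed against the source.

```python
# Head sizes as stated in this document.
TRACKS_PER_HEAD = {"human": 5313, "mouse": 1643}


def build_enformer_kwargs(sequences, output_tracks=(0,), species="human",
                          batch_size=1, verbose=0, device="cuda",
                          timeout=600, seed=None):
    """Assemble and sanity-check arguments using the documented names and defaults.

    Hypothetical helper: it does not call the toolkit, it only validates inputs.
    """
    if species not in TRACKS_PER_HEAD:
        raise ValueError(f"species must be one of {sorted(TRACKS_PER_HEAD)}")

    # Assumption: track indices are 0-based and run up to head size minus one.
    n_tracks = TRACKS_PER_HEAD[species]
    bad = [t for t in output_tracks if not 0 <= t < n_tracks]
    if bad:
        raise ValueError(
            f"track indices {bad} are outside the {species} head (0..{n_tracks - 1})"
        )
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    return {
        "sequences": list(sequences),
        "output_tracks": list(output_tracks),
        "species": species,
        "batch_size": batch_size,
        "verbose": int(verbose),  # True/False coerce to 1/0, as documented
        "device": device,
        "timeout": timeout,
        "seed": seed,
    }


kwargs = build_enformer_kwargs(["ACGT"], output_tracks=[0, 5], species="mouse", seed=7)
print(kwargs["species"], kwargs["output_tracks"])
```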

Applications

This tool is appropriate for analyses that relate noncoding DNA sequence to predicted regulatory activity. Representative applications include predicting the effect of a candidate variant by comparing the activity of the reference and alternate sequences, prioritizing noncoding variants for follow-up by the magnitude of their predicted regulatory change, screening designed or synthetic regulatory sequences for predicted promoter or enhancer activity, and surveying the predicted chromatin and expression landscape across a genomic locus of interest.
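The reference-versus-alternate comparison described above reduces to differencing two activity matrices. The following is a minimal sketch, assuming each prediction is available as a nested list of shape bins by tracks. The function names and the choice to sum the change over all bins are illustrative assumptions; other summaries, such as restricting to bins near the variant, may suit a given analysis better.

```python
def track_deltas(ref_matrix, alt_matrix):
    """Sum (alt - ref) over bins for each track; both matrices are bins x tracks."""
    if len(ref_matrix) != len(alt_matrix):
        raise ValueError("reference and alternate predictions differ in bin count")
    n_tracks = len(ref_matrix[0])
    deltas = [0.0] * n_tracks
    for ref_row, alt_row in zip(ref_matrix, alt_matrix):
        if len(ref_row) != n_tracks or len(alt_row) != n_tracks:
            raise ValueError("every bin must report the same number of tracks")
        for j in range(n_tracks):
            deltas[j] += alt_row[j] - ref_row[j]
    return deltas


def rank_variants(variant_deltas):
    """Order variant IDs by their largest absolute per-track change, descending."""
    return sorted(
        variant_deltas,
        key=lambda vid: max(abs(d) for d in variant_deltas[vid]),
        reverse=True,
    )


# Toy data: 3 bins x 2 tracks (real outputs have 896 bins).
ref = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
alt = [[1.0, 2.0], [3.0, 2.0], [1.0, 1.0]]
scores = {"variant_a": track_deltas(ref, alt), "variant_b": [0.5, 0.0]}
print(rank_variants(scores))
```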

Usage Tips

  • Each input sequence must provide a full 196,608 base-pair model context. A sequence supplied in exact-window form must be exactly this length. A longer sequence must be paired with a target_range, and the tool then extracts the model context window around that range. The tool does not pad missing context, so a sequence shorter than the model context without a target range is rejected.
  • output_tracks selects which of the model’s tracks are returned and defaults to the single track at index 0. The human head exposes 5,313 tracks and the mouse head exposes 1,643 tracks. Selecting only the tracks of interest, rather than all of them, keeps the returned activity matrix small and the analysis focused on the relevant assays.
  • species selects the human or mouse output head and defaults to human. The track index meanings differ between the two heads, so the output_tracks indices should be chosen to match the selected species.
  • batch_size controls how many sequences are processed together on the GPU and defaults to 1. Larger values increase throughput when many sequences are scored in one call, subject to the available GPU memory.
  • The output covers a 114,688 base-pair central window, narrower than the input context. The 896 output bins span the central portion of the input, and the per-result output_start and output_end coordinates report exactly which part of the source sequence the bins cover. A feature of interest must fall within this central window to receive a prediction.
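To check whether a feature falls inside the central window, a source position can be mapped to an output bin from the reported output_start. This sketch assumes the 896 bins tile the span from output_start to output_end contiguously, left to right, in 128 base-pair steps. The helper names are hypothetical.

```python
BIN_BP = 128   # base pairs per output bin
N_BINS = 896   # bins per prediction


def bin_for_position(position, output_start):
    """Return the bin index covering a 0-based source position, or None if outside."""
    offset = position - output_start
    if not 0 <= offset < N_BINS * BIN_BP:
        return None
    return offset // BIN_BP


def bin_interval(bin_index, output_start):
    """Return the 0-based, end-exclusive source interval covered by one bin."""
    if not 0 <= bin_index < N_BINS:
        raise IndexError("bin index out of range")
    start = output_start + bin_index * BIN_BP
    return start, start + BIN_BP


# For an exact-window input with symmetric flanks, output_start would be 40,960.
print(bin_for_position(100_000, output_start=40_960))
print(bin_interval(0, output_start=40_960))
```

A return value of None signals that the position lies in the flanking context and receives no prediction.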

Toolkit Notes

These apply to every Enformer tool in this toolkit (enformer-prediction).
  • Enformer runs on a GPU and downloads its published weights on first use. The model parameters are retrieved from the hosted enformer-official-rough checkpoint the first time the tool runs and are reused on subsequent runs. A CUDA-capable GPU is recommended because the 196,608 base-pair context makes CPU inference slow.
  • Output coordinates are 0-based with exclusive ends, following the genomics interval convention rather than the 1-based residue numbering used elsewhere in proto-tools. The context_start, context_end, output_start, and output_end fields locate the model window and the output-bin span within the source sequence using this convention.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
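Because the coordinate convention noted above differs from 1-based numbering, intervals usually need converting before they are combined with 1-based annotations. A small sketch, assuming the 1-based form uses inclusive ends; both helper names are hypothetical.

```python
def to_one_based_inclusive(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, end-inclusive."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def to_zero_based_exclusive(first, last):
    """Convert a 1-based, end-inclusive interval to 0-based, end-exclusive."""
    if not 1 <= first <= last:
        raise ValueError("expected 1 <= first <= last")
    return first - 1, last


# The first 128 bp bin of a sequence: [0, 128) becomes positions 1..128.
print(to_one_based_inclusive(0, 128))
```

Both forms describe the same number of bases: end minus start in the 0-based form equals last minus first plus one in the 1-based form.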

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.