License: This toolkit retrieves data from the Ensembl project, which is distributed under the EMBL-EBI Terms of Use. Attribution to the Ensembl project is required when the data is redistributed. The client wrapper code is Apache-2.0-licensed. See the EMBL-EBI Terms of Use for the full data terms.

Proto is not affiliated with EMBL-EBI. This toolkit is open source and builds on the implementation produced by EMBL-EBI. Product names, logos, and trademarks are the property of their respective owners.


Ensembl/ensembl-rest
Language agnostic RESTful data access to Ensembl data over HTTP
150 stars
ensembl.org
Ensembl 2025
Sarah C. Dyer, Olanrewaju Austine-Orimoloye, … Vianey Paola Barrera-Enriquez
Nucleic Acids Research (2025)
@article{dyer2025ensembl,
  title={{Ensembl} 2025},
  author={Dyer, Sarah C. and Austine-Orimoloye, Olanrewaju and Azov, Andrey G. and Barba, Matthieu and Barnes, If and Barrera-Enriquez, Vianey Paola and others},
  journal={Nucleic Acids Research},
  volume={53},
  number={D1},
  pages={D948--D957},
  year={2025},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkae1071}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/ensembl
Function | Description
run_ensembl_lookup() | Look up an Ensembl gene record by Ensembl gene ID or gene symbol
run_ensembl_overlap() | Fetch features overlapping an Ensembl region (default: gene; supports exon, regulatory, motif, va…
run_ensembl_sequence() | Fetch DNA / cDNA / CDS / protein sequence for an Ensembl ID
run_ensembl_vep() | Predict variant consequences from an HGVS notation via Ensembl’s Variant Effect Predictor REST en…
run_ensembl_xrefs() | Fetch cross-references from an Ensembl ID to external databases

Background

Ensembl (Dyer et al., 2025) is a genome annotation resource maintained by EMBL-EBI. It integrates gene and transcript models, the Ensembl Regulatory Build, cross-references to external databases, and variant consequence prediction with the Variant Effect Predictor (VEP) for human and other supported species. Coordinates returned by Ensembl are 1-indexed and inclusive, to match biological residue selection conventions.

Each tool issues a single HTTP GET to the Ensembl REST API, whose base URL is https://rest.ensembl.org for the GRCh38 assembly. Setting assembly="GRCh37" routes requests to https://grch37.rest.ensembl.org instead. The endpoints used are:

  • ensembl-lookup: /lookup/id/{id} and /lookup/symbol/{species}/{symbol}
  • ensembl-sequence: /sequence/id/{id}
  • ensembl-overlap: /overlap/id/{id}
  • ensembl-xrefs: /xrefs/id/{id}
  • ensembl-vep: /vep/{species}/hgvs/{hgvs}

Responses are parsed into typed Pydantic records, with the full upstream JSON preserved alongside in a raw_payload field. PascalCase keys such as Transcript, Exon, and Translation are kept verbatim so records round-trip cleanly. Results reflect the live Ensembl database at query time rather than a fixed release snapshot.
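
The assembly routing and endpoint paths described above can be sketched in a few lines of Python. This is only an illustration: the names BASE_URLS, ENDPOINTS, and build_url are hypothetical and are not taken from the toolkit's source, which may assemble URLs differently.

```python
# Sketch only: these names are illustrative, not the toolkit's internals.
BASE_URLS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}

# Endpoint templates as listed in the Background section.
ENDPOINTS = {
    "lookup-id": "/lookup/id/{id}",
    "lookup-symbol": "/lookup/symbol/{species}/{symbol}",
    "sequence": "/sequence/id/{id}",
    "overlap": "/overlap/id/{id}",
    "xrefs": "/xrefs/id/{id}",
    "vep": "/vep/{species}/hgvs/{hgvs}",
}


def build_url(endpoint: str, assembly: str = "GRCh38", **parts: str) -> str:
    """Join the assembly-specific base URL with a filled-in endpoint path."""
    if assembly not in BASE_URLS:
        raise ValueError(f"unsupported assembly: {assembly!r}")
    return BASE_URLS[assembly] + ENDPOINTS[endpoint].format(**parts)


# GRCh38 is the default; GRCh37 switches only the host.
print(build_url("lookup-symbol", species="homo_sapiens", symbol="BRCA1"))
print(build_url("xrefs", assembly="GRCh37", id="ENSG00000012048"))
```

Keeping the host choice in one table means the endpoint paths stay identical across assemblies, which matches how the two base URLs are described here.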

Tools

Ensembl Lookup (ensembl-lookup)

Retrieves a single gene record, either directly by Ensembl gene ID or by gene symbol scoped to a species, returning the typed EnsemblGene (identifier, symbol, biotype, genomic coordinates, canonical transcript) plus the source URL and raw payload. With expand enabled, the response includes the nested transcript, translation, and exon hierarchy.

API Reference

Source
ensembl_id
string
Ensembl gene ID (e.g. ENSG...).
symbol
string
Gene symbol (e.g. BRCA1).
Source
species
enum
default:"homo_sapiens"
Species slug used when symbol is the input. Default homo_sapiens. Available options: homo_sapiens, mus_musculus, rattus_norvegicus, danio_rerio, saccharomyces_cerevisiae
assembly
enum
default:"GRCh38"
Genome assembly. GRCh38 (default) calls rest.ensembl.org; GRCh37 calls grch37.rest.ensembl.org. Available options: GRCh38, GRCh37
expand
boolean
default:"False"
Include transcripts, translations, and exons in the response. Default False matches Ensembl REST.
mane
boolean
default:"False"
Include MANE Select annotations (/lookup/id only; requires expand=True).
phenotypes
boolean
default:"False"
Include phenotype annotations on gene records (/lookup/id only).
utr
boolean
default:"False"
Include UTR coordinates per transcript (/lookup/id only; requires expand=True).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
result
EnsemblGene
required
The looked-up gene record.
source_url
string
required
Final Ensembl REST URL that was hit.
raw_payload
Dict[string, any]
Raw API JSON.

Applications

Use this to resolve a gene of interest as the entry point of nearly any Ensembl workflow. Convert a gene symbol such as BRCA1 into its stable Ensembl gene ID, read off the canonical transcript and genomic coordinates for downstream transcriptomics, GWAS annotation, or sequence retrieval with ensembl-sequence, or expand the transcript-and-exon hierarchy for splice-isoform analysis. The returned gene ID also feeds ensembl-overlap and ensembl-xrefs.

Usage Tips

  • Provide exactly one of ensembl_id or symbol. Supplying both or neither raises a validation error. A symbol lookup is also scoped by config.species (default homo_sapiens), which disambiguates symbols shared across species.
  • Nested transcripts and exons are absent unless expand is set. The default matches Ensembl REST and returns the gene record only, so request expansion explicitly when you need the transcript or exon hierarchy.
  • mane, phenotypes, and utr apply only to ID-based lookup. They are sent only on the /lookup/id path and are ignored for a symbol lookup. mane and utr additionally require expand.
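
The rules in these tips can be summarized as a small request-building sketch. The function lookup_request is hypothetical and is not the toolkit's code. The query parameter names mirror the config fields. Rejecting mane or utr without expand is this sketch's own choice, since the page only says that they require expand.

```python
from typing import Dict, Optional, Tuple


def lookup_request(
    ensembl_id: Optional[str] = None,
    symbol: Optional[str] = None,
    species: str = "homo_sapiens",
    expand: bool = False,
    mane: bool = False,
    phenotypes: bool = False,
    utr: bool = False,
) -> Tuple[str, Dict[str, int]]:
    """Return (path, query params) for a lookup, following the usage tips."""
    # Exactly one of the two identifiers must be supplied.
    if (ensembl_id is None) == (symbol is None):
        raise ValueError("provide exactly one of ensembl_id or symbol")

    params: Dict[str, int] = {"expand": 1} if expand else {}

    if symbol is not None:
        # mane, phenotypes, and utr are ignored on the symbol path.
        return f"/lookup/symbol/{species}/{symbol}", params

    if (mane or utr) and not expand:
        # Assumption: the sketch raises here; the toolkit may behave differently.
        raise ValueError("mane and utr require expand=True")
    for name, enabled in (("mane", mane), ("phenotypes", phenotypes), ("utr", utr)):
        if enabled:
            params[name] = 1
    return f"/lookup/id/{ensembl_id}", params


print(lookup_request(symbol="BRCA1"))
print(lookup_request(ensembl_id="ENSG00000012048", expand=True, mane=True))
```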

Ensembl Sequence (ensembl-sequence)

Retrieves the sequence for an Ensembl gene, transcript, or protein ID and returns one or more EnsemblSequence records (stable ID, description, molecule type, sequence string) alongside the source URL and raw payload.

API Reference

Source
ensembl_id
string
required
Ensembl ID (ENSG..., ENST..., or ENSP...).
Source
sequence_type
enum
default:"genomic"
What to return: genomic (DNA + UTRs + introns), cdna (spliced mRNA + UTRs), cds (spliced coding only), protein (translation). Available options: genomic, cdna, cds, protein
assembly
enum
default:"GRCh38"
Genome assembly. GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
multiple_sequences
boolean
default:"False"
Return all sequences when an ID maps to multiple records (e.g. patches, alternative haplotypes).
mask
string
Mask repeats in the returned sequence. hard replaces with N; soft lowercases. Genomic sequence_type only; mutually exclusive with mask_feature.
mask_feature
boolean
default:"False"
Mask introns (when sequence_type='genomic') or UTRs (when sequence_type='cdna') so the primary feature stands out. Mutually exclusive with mask.
expand_3prime
integer
Bases to add to the 3’ end (genomic only, incompatible with end).
expand_5prime
integer
Bases to add to the 5’ end (genomic only, incompatible with start).
start
integer
1-indexed start trim coordinate (incompatible with expand_5prime).
end
integer
1-indexed end trim coordinate (incompatible with expand_3prime).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
results
List[EnsemblSequence]
Fetched sequence record(s). Length 1 unless multiple_sequences=True and the ID maps to more than one.
source_url
string
required
Final Ensembl REST URL that was hit.
raw_payload
List[Dict[string, any]]
Raw API JSON, always wrapped in a list.

Applications

Use this to retrieve a reference sequence at gene, transcript, or protein granularity for any downstream analysis. Fetch the spliced mRNA or coding sequence of a canonical transcript resolved by ensembl-lookup for primer design, codon-usage analysis, or sequence comparison, pull the genomic span with introns for promoter or splice-site studies, or obtain the protein translation for multiple-sequence alignment or structure prediction. Repeat masking and feature masking support cis-element and intron-aware analyses.

Usage Tips

  • The returned id may differ from the input ID. Requesting a protein sequence for a transcript ID resolves to the corresponding protein ID, so read the record’s id rather than assuming it echoes the input.
  • mask and mask_feature are mutually exclusive, as are the expand and trim pairs. Setting both masks, or both expand_5prime and start, or both expand_3prime and end, raises a validation error.
  • Repeat masking and span expansion apply to genomic sequence only. They have no effect on cdna, cds, or protein requests.
  • Set multiple_sequences when an ID maps to more than one record. Without it, IDs that resolve to multiple sequences (patches, alternative haplotypes) return only the first.
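
The option conflicts listed in these tips can be checked before a request is made. The sketch below is a hypothetical helper, not the toolkit's validator. It raises on the mutually exclusive pairs and returns the names of genomic-only options that would have no effect for the requested sequence type.

```python
from typing import List, Optional

# Options described above as applying to genomic sequence only.
GENOMIC_ONLY = ("mask", "expand_5prime", "expand_3prime")


def check_sequence_options(
    sequence_type: str = "genomic",
    mask: Optional[str] = None,
    mask_feature: bool = False,
    expand_5prime: Optional[int] = None,
    expand_3prime: Optional[int] = None,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> List[str]:
    """Raise on conflicting options; return names of options with no effect."""
    if mask is not None and mask not in ("hard", "soft"):
        raise ValueError("mask must be 'hard' or 'soft'")
    if mask is not None and mask_feature:
        raise ValueError("mask and mask_feature are mutually exclusive")
    if expand_5prime is not None and start is not None:
        raise ValueError("expand_5prime is incompatible with start")
    if expand_3prime is not None and end is not None:
        raise ValueError("expand_3prime is incompatible with end")

    if sequence_type == "genomic":
        return []
    supplied = {
        "mask": mask,
        "expand_5prime": expand_5prime,
        "expand_3prime": expand_3prime,
    }
    return [name for name in GENOMIC_ONLY if supplied[name] is not None]


# A cds request with repeat masking is valid but the mask has no effect.
print(check_sequence_options(sequence_type="cds", mask="hard"))
```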

Ensembl Overlap (ensembl-overlap)

Retrieves features overlapping the genomic region of a given Ensembl ID and returns a list of EnsemblOverlapFeatureRecord entries, each exposing the common typed fields (feature type, identifier, biotype, coordinates, strand, region) plus a raw dict carrying the full upstream record.

API Reference

Source
ensembl_id
string
required
Ensembl ID whose region to query for overlapping features.
Source
overlap_feature
enum
default:"gene"
Type of feature to retrieve (e.g. gene, transcript, exon, regulatory, variation). Available options: band, gene, transcript, cds, exon, repeat, simple, misc, variation, somatic_variation, structural_variation, somatic_structural_variation, constrained, regulatory, motif, mane
assembly
enum
default:"GRCh38"
Genome assembly. GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
biotype
string
Restrict to a biotype (e.g. protein_coding); most useful when overlap_feature is gene or transcript.
so_term
string
Restrict variation features by Sequence Ontology consequence (e.g. missense_variant).
variant_set
string
Restrict variation features to a named variant set (e.g. ClinVar).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
result
List[EnsemblOverlapFeatureRecord]
Features overlapping the queried region; each carries the full upstream dict in raw for feature-specific keys.
source_url
string
required
Final Ensembl REST URL that was hit.
raw_payload
List[Dict[string, any]]
Raw API JSON.

Applications

Use this to annotate a genomic locus by listing what overlaps it. Identify which gene or transcript contains a GWAS hit, ChIP-seq peak, or ATAC-seq peak, enumerate the regulatory-build features (promoters, enhancers, transcription-factor binding sites) within a region for functional-genomics analysis, or pull overlapping variants filtered to a named set such as ClinVar to ask whether the region is clinically annotated. The locus is typically obtained from ensembl-lookup.

Usage Tips

  • Records are typed only on the common fields. Feature-specific keys live in raw. Different feature classes return divergent payload shapes, so read per-feature attributes from each record’s raw dict.
  • biotype is most meaningful for gene and transcript features. It filters those classes. Pairing it with unrelated feature types is unlikely to narrow results.
  • so_term and variant_set apply to variation features. They have no effect when the feature class is not a variation type.
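
Because only the common fields are typed, downstream code usually combines typed coordinates with defensive reads from raw. The sketch below uses a hypothetical OverlapRecord dataclass as a stand-in for EnsemblOverlapFeatureRecord. Its field names and the toy values are assumptions for illustration, not real API output. The arithmetic follows the 1-indexed, inclusive coordinate convention used by Ensembl.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class OverlapRecord:
    """Hypothetical stand-in: common typed fields plus the raw upstream dict."""
    feature_type: str
    id: Optional[str]
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive
    biotype: Optional[str] = None
    raw: Dict[str, Any] = field(default_factory=dict)


def span_length(rec: OverlapRecord) -> int:
    """Length of an inclusive 1-indexed interval is end - start + 1."""
    return rec.end - rec.start + 1


def contains(rec: OverlapRecord, position: int) -> bool:
    """True if a 1-indexed position falls inside the feature (ends included)."""
    return rec.start <= position <= rec.end


def raw_value(rec: OverlapRecord, key: str, default: Any = None) -> Any:
    """Read a feature-specific key without assuming every class provides it."""
    return rec.raw.get(key, default)


# Toy record with made-up coordinates and keys.
gene = OverlapRecord("gene", "TOY_GENE_1", 100, 200, "protein_coding",
                     raw={"toy_key": "toy_value"})
print(span_length(gene), contains(gene, 200), raw_value(gene, "missing_key"))
```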

Ensembl Xrefs (ensembl-xrefs)

Resolves an Ensembl ID to its external-database cross-references and returns a list of EnsemblXref records (external database name, display and primary identifiers, description, cross-reference type) plus the source URL and raw payload.

API Reference

Source
ensembl_id
string
required
Ensembl ID for direct cross-reference lookup.
Source
assembly
enum
default:"GRCh38"
Genome assembly. GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
all_levels
boolean
default:"False"
Fan out to transcripts and translations. On a gene query this also returns xrefs from each child transcript and protein.
external_db
string
Restrict to one external database (e.g. UniProtKB/Swiss-Prot, HGNC).
object_type
string
Restrict to one feature type when the stable ID resolves ambiguously.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
result
List[EnsemblXref]
Cross-reference records to external databases (UniProt, EntrezGene, RefSeq, …).
source_url
string
required
Final Ensembl REST URL that was hit.
raw_payload
List[Dict[string, any]]
Raw API JSON.

Applications

Use this to convert identifiers between Ensembl and the other major sequence, gene, and protein resources. Map an Ensembl gene or protein to a UniProt accession before fetching its entry with uniprot-fetch, which then bridges through to PDB structures via SIFTS, recover EntrezGene or RefSeq identifiers for NCBI-side retrieval, or follow GO and InterPro cross-references for functional annotation. The Ensembl ID is commonly produced by ensembl-lookup.

Usage Tips

  • Filter on dbname in the result, not just external_db. A single query can return several UniProt-related and RefSeq-related entries, so select the row by its dbname rather than assuming one record per database.
  • all_levels changes result scope on gene queries. It fans cross-references out to child transcripts and translations, which can substantially enlarge the result.
  • Set object_type when a stable ID resolves ambiguously. It restricts results to one feature type when the ID could map to a gene, transcript, or translation.
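
The first tip amounts to grouping rows by dbname before picking one. A minimal sketch, assuming each cross-reference is available as a dict with dbname and primary_id keys (for example from raw_payload), is shown below. The rows are toy values, not real Ensembl output, and group_by_dbname is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_dbname(xrefs: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group cross-reference rows by their dbname, keeping input order."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for xref in xrefs:
        grouped[xref.get("dbname", "")].append(xref)
    return dict(grouped)


# Toy rows: one database contributes two entries, another contributes one.
rows = [
    {"dbname": "TOY_DB_A", "primary_id": "A-1"},
    {"dbname": "TOY_DB_B", "primary_id": "B-1"},
    {"dbname": "TOY_DB_A", "primary_id": "A-2"},
]
grouped = group_by_dbname(rows)
print({name: [r["primary_id"] for r in group] for name, group in grouped.items()})
```

Selecting from grouped[name] makes the "several entries per database" case explicit instead of silently taking the first match.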

Ensembl VEP (ensembl-vep)

Submits an HGVS notation to the Ensembl Variant Effect Predictor REST endpoint and returns a list of EnsemblVEPConsequence records (echoed input, most severe consequence as a Sequence Ontology term, region and coordinates, allele string, raw per-transcript consequences, co-located variants) plus a derived num_consequences count.
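
HGVS strings contain characters such as ":" and ">" that are reserved in URLs, so a client would typically percent-encode the notation before placing it in the /vep/{species}/hgvs/{hgvs} path. The sketch below shows one way to do this with the standard library. Whether the toolkit encodes the path in this way is not stated here, and vep_path is a hypothetical helper.

```python
from urllib.parse import quote


def vep_path(hgvs: str, species: str = "homo_sapiens") -> str:
    """Build the VEP HGVS path with the notation percent-encoded."""
    # safe="" also encodes "/" so the notation stays a single path segment.
    return f"/vep/{species}/hgvs/{quote(hgvs, safe='')}"


print(vep_path("9:g.22125504G>C"))
```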

API Reference

Source
hgvs
string
required
HGVS notation. Genomic (e.g. 9:g.22125504G>C), coding (ENST00000357654:c.5074G>A), or protein (ENSP00000418960:p.Tyr124Cys) forms all work.
Source
species
enum
default:"homo_sapiens"
Species slug. Default homo_sapiens. Available options: homo_sapiens, mus_musculus, rattus_norvegicus, danio_rerio, saccharomyces_cerevisiae
assembly
enum
default:"GRCh38"
Genome assembly. GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
annotations
EnsemblVEPAnnotationConfig
Collapsible group of species-agnostic annotation toggles.
sift
string
SIFT pathogenicity output — b (both prediction + score), p (prediction only), s (score only); None falls back to API default.
polyphen
string
PolyPhen output level; same value semantics as sift.
mane
boolean
default:"False"
Include MANE Select annotations (GRCh38 only).
alphamissense
boolean
default:"False"
AlphaMissense missense pathogenicity scores (human only).
revel
boolean
default:"False"
REVEL ensemble pathogenicity scores (human only).
cadd
boolean
default:"False"
CADD deleteriousness scores (human only).
appris
boolean
default:"False"
Include APPRIS principal isoform tag (human/mouse only).
tsl
boolean
default:"False"
Include transcript support level (human/mouse only).
ccds
boolean
default:"False"
Include CCDS identifier per transcript (human/mouse only).
distance
integer
Up/downstream distance (bp) used to assign consequence terms. None keeps the API default (5000).
pick
boolean
default:"False"
Return only one consequence per variant — Ensembl’s PICK heuristic (canonical, longest CDS, …).
per_gene
boolean
default:"False"
Return one consequence per gene (less aggressive than pick); incompatible with pick.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
consequences
List[EnsemblVEPConsequence]
One record per VEP input (Ensembl returns a list even for a single HGVS).
source_url
string
required
Final URL hit.
raw_payload
List[Dict[string, any]]
Raw API JSON.

Applications

Use this to predict the functional consequence of a variant and assess its likely impact. Classify a coding or genomic HGVS notation as missense, synonymous, stop-gain, splice-disrupting, or noncoding, then read per-transcript SIFT and PolyPhen predictions alongside optional human-only AlphaMissense, REVEL, and CADD pathogenicity scores for clinical or research variant interpretation. Co-located variant lookups surface population-frequency context from gnomAD and ClinVar annotations. Candidate variants are often identified from features returned by ensembl-overlap or from a designed or observed substitution in a downstream design workflow.

Usage Tips

  • transcript_consequences and colocated_variants are returned as raw dicts. Their field sets vary by consequence type and annotation toggles, so read them defensively rather than expecting a fixed shape.
  • pick and per_gene cannot be combined. Setting both raises a validation error. Choose one collapse strategy.
  • Several annotations are species- or assembly-restricted. MANE applies to GRCh38 only. AlphaMissense, REVEL, and CADD are human only. APPRIS, TSL, and CCDS are human and mouse only. Enabling a restricted annotation outside its scope simply yields no extra data.
  • A coding or genomic HGVS form is more reliable than a protein form. A protein-level notation can map to multiple transcripts ambiguously, so prefer coding or genomic notation when available.

Toolkit Notes

These apply to every Ensembl tool in this toolkit (ensembl-lookup, ensembl-sequence, ensembl-overlap, ensembl-xrefs, ensembl-vep).
  • Requires network access. Every tool calls the live Ensembl REST API. None runs offline and no local copy of the database is kept.
  • Subject to the Ensembl REST rate limit. Ensembl REST enforces a uniform per-IP limit of roughly 55,000 requests per hour, returning HTTP 429 with a Retry-After header when exceeded. There is no account or API key that raises this limit.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
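
When batching many calls, a client can honor the HTTP 429 and Retry-After behavior described above. The sketch below is not part of the toolkit. It takes a caller-supplied fetch function (hypothetical, returning status, headers, and body) so the retry logic stays independent of any particular HTTP library, and it assumes Retry-After carries a number of seconds.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Parse Retry-After as seconds; fall back to a default if missing or odd."""
    # Note: a plain dict lookup is case-sensitive; real HTTP clients usually
    # expose case-insensitive header mappings.
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return default


def get_with_retry(
    fetch: Callable[[str], Response],
    url: str,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call fetch(url), waiting and retrying while the server answers 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response[0] != 429:
            return response
        if attempt < max_attempts:
            sleep(retry_after_seconds(response[1]))
    return response  # still 429 after the final attempt
```

Passing sleep as a parameter keeps the function easy to exercise without real delays. A caller would wrap its HTTP client of choice in fetch.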

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.