Proto is not affiliated with EMBL-EBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.
| Function | Description | |
|---|---|---|
| run_ensembl_lookup() | Look up an Ensembl gene record by Ensembl gene ID or gene symbol | Docs Source |
| run_ensembl_overlap() | Fetch features overlapping an Ensembl region (default: gene; supports exon, regulatory, motif, va… | Docs Source |
| run_ensembl_sequence() | Fetch DNA / cDNA / CDS / protein sequence for an Ensembl ID | Docs Source |
| run_ensembl_vep() | Predict variant consequences from an HGVS notation via Ensembl’s Variant Effect Predictor REST en… | Docs Source |
| run_ensembl_xrefs() | Fetch cross-references from an Ensembl ID to external databases | Docs Source |
Background
Ensembl (Dyer et al., 2025) is a genome annotation resource maintained by EMBL-EBI. It integrates gene and transcript models, the Ensembl Regulatory Build, cross-references to external databases, and variant consequence prediction with the Variant Effect Predictor (VEP) for human and other supported species. Coordinates returned by Ensembl are 1-indexed and inclusive, to match biological residue selection conventions.
Each tool issues a single HTTP GET to the Ensembl REST API, whose base URL is https://rest.ensembl.org for the GRCh38 assembly. Setting assembly="GRCh37" routes requests to https://grch37.rest.ensembl.org instead. The endpoints used are /lookup/id/{id} and /lookup/symbol/{species}/{symbol} for ensembl-lookup, /sequence/id/{id} for ensembl-sequence, /overlap/id/{id} for ensembl-overlap, /xrefs/id/{id} for ensembl-xrefs, and /vep/{species}/hgvs/{hgvs} for ensembl-vep.
Responses are parsed into typed Pydantic records, with the full upstream JSON preserved alongside in a raw_payload field. PascalCase keys such as Transcript, Exon, and Translation are kept verbatim so records round-trip cleanly. Results reflect the live Ensembl database at query time rather than a fixed release snapshot.
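The routing described above can be sketched in a few lines. This is an illustration only, not the toolkit's own code: the helper name build_url, the ENDPOINTS keys, and the choice to percent-encode path components are all assumptions. Only the base URLs and path templates come from the text above.

```python
from urllib.parse import quote

# Base URLs per assembly, as stated in the Background section.
BASE_URLS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}

# Path templates per endpoint, as listed in the Background section.
# The dictionary keys are made up for this sketch.
ENDPOINTS = {
    "lookup-id": "/lookup/id/{id}",
    "lookup-symbol": "/lookup/symbol/{species}/{symbol}",
    "sequence": "/sequence/id/{id}",
    "overlap": "/overlap/id/{id}",
    "xrefs": "/xrefs/id/{id}",
    "vep": "/vep/{species}/hgvs/{hgvs}",
}


def build_url(endpoint: str, assembly: str = "GRCh38", **parts: str) -> str:
    """Hypothetical helper: join the assembly base URL with a filled-in path."""
    if assembly not in BASE_URLS:
        raise ValueError(f"unsupported assembly: {assembly!r}")
    # Assumption: percent-encode each path component, since HGVS strings
    # contain characters such as ':' and '>'.
    encoded = {key: quote(value, safe="") for key, value in parts.items()}
    return BASE_URLS[assembly] + ENDPOINTS[endpoint].format(**encoded)


print(build_url("lookup-symbol", assembly="GRCh37",
                species="homo_sapiens", symbol="BRCA1"))
# prints https://grch37.rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA1
```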
Learning Resources
- Ensembl REST API documentation (Ensembl) - the live endpoint reference with request parameters, response shapes, and an interactive console.
- Ensembl and the Ensembl REST API (EMBL-EBI Training) - guided courses on Ensembl data and programmatic access.
Tools
Ensembl Lookup (ensembl-lookup)
Retrieves a single gene record, either directly by Ensembl gene ID or by gene symbol scoped to a species, returning the typed EnsemblGene (identifier, symbol, biotype, genomic coordinates, canonical transcript) plus the source URL and raw payload. With expand enabled, the response includes the nested transcript, translation, and exon hierarchy.
API Reference
Config: EnsemblLookupConfig
- species: … symbol is the input. Default homo_sapiens. Available options: homo_sapiens, mus_musculus, rattus_norvegicus, danio_rerio, saccharomyces_cerevisiae
- assembly: GRCh38 (default) calls rest.ensembl.org; GRCh37 calls grch37.rest.ensembl.org. Available options: GRCh38, GRCh37
- expand: … False matches Ensembl REST.
- mane: … /lookup/id only; requires expand=True).
- phenotypes: … /lookup/id only).
- utr: … /lookup/id only; requires expand=True).
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to resolve a gene of interest as the entry point of nearly any Ensembl workflow. Convert a gene symbol such as BRCA1 into its stable Ensembl gene ID, read off the canonical transcript and genomic coordinates for downstream transcriptomics, GWAS annotation, or sequence retrieval with ensembl-sequence, or expand the transcript-and-exon hierarchy for splice-isoform analysis. The returned gene ID also feeds ensembl-overlap and ensembl-xrefs.
Usage Tips
- Provide exactly one of ensembl_id or symbol. Supplying both or neither raises a validation error. A symbol lookup also requires config.species to disambiguate.
- Nested transcripts and exons are absent unless expand is set. The default matches Ensembl REST and returns the gene record only, so request expansion explicitly when you need the transcript or exon hierarchy.
- mane, phenotypes, and utr apply only to ID-based lookup. They are sent only on the /lookup/id path and are ignored for a symbol lookup. mane and utr additionally require expand.
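The input rules in these tips can be made concrete with a small sketch. It is not the toolkit's validator: lookup_path and lookup_params are hypothetical helpers, the use of ValueError is an assumption about how a validation error might surface, and sending booleans as 1 or 0 is assumed from the parameter notes.

```python
from typing import Dict, Optional


def lookup_path(ensembl_id: Optional[str] = None,
                symbol: Optional[str] = None,
                species: str = "homo_sapiens") -> str:
    """Hypothetical helper: choose the lookup path from exactly one identifier."""
    # Reject both-given and neither-given in a single comparison.
    if (ensembl_id is None) == (symbol is None):
        raise ValueError("provide exactly one of ensembl_id or symbol")
    if ensembl_id is not None:
        return f"/lookup/id/{ensembl_id}"
    # A symbol lookup is scoped to a species.
    return f"/lookup/symbol/{species}/{symbol}"


def lookup_params(by_id: bool, expand: bool = False, mane: bool = False,
                  phenotypes: bool = False, utr: bool = False) -> Dict[str, int]:
    """Hypothetical helper: build query parameters for either lookup path."""
    params = {"expand": int(expand)}  # assumption: booleans travel as 1 / 0
    if by_id:
        # mane, phenotypes, and utr are only sent on the /lookup/id path.
        params.update(mane=int(mane), phenotypes=int(phenotypes), utr=int(utr))
    return params


print(lookup_path(symbol="BRCA1"))
# prints /lookup/symbol/homo_sapiens/BRCA1
```

With a symbol lookup, lookup_params(by_id=False, expand=True, mane=True) drops mane entirely, which mirrors the tip that ID-only options are ignored on the symbol path.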
Ensembl Sequence (ensembl-sequence)
Retrieves the sequence for an Ensembl gene, transcript, or protein ID and returns one or more EnsemblSequence records (stable ID, description, molecule type, sequence string) alongside the source URL and raw payload.
API Reference
Input: EnsemblSequenceInput
- … ENSG..., ENST..., or ENSP...).
Config: EnsemblSequenceConfig
- sequence_type: genomic (DNA + UTRs + introns), cdna (spliced mRNA + UTRs), cds (spliced coding only), protein (translation). Available options: genomic, cdna, cds, protein
- assembly: GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
- mask: hard replaces with N; soft lowercases. Genomic sequence_type only; mutually exclusive with mask_feature.
- mask_feature: … sequence_type='genomic') or UTRs (when sequence_type='cdna') so the primary feature stands out. Mutually exclusive with mask.
- … end).
- … start).
- … expand_5prime).
- … expand_3prime).
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: EnsemblSequenceOutput
Applications
Use this to retrieve a reference sequence at gene, transcript, or protein granularity for any downstream analysis. Fetch the spliced mRNA or coding sequence of a canonical transcript resolved by ensembl-lookup for primer design, codon-usage analysis, or sequence comparison, pull the genomic span with introns for promoter or splice-site studies, or obtain the protein translation for multiple-sequence alignment or structure prediction. Repeat masking and feature masking support cis-element and intron-aware analyses.
Usage Tips
- The returned id may differ from the input ID. Requesting a protein sequence for a transcript ID resolves to the corresponding protein ID, so read the record’s id rather than assuming it echoes the input.
- mask and mask_feature are mutually exclusive, as are the expand and trim pairs. Setting both masks, or both expand_5prime and start, or both expand_3prime and end, raises a validation error.
- Repeat masking and span expansion apply to genomic sequence only. They have no effect on cdna, cds, or protein requests.
- Set multiple_sequences when an ID maps to more than one record. Without it, IDs that resolve to multiple sequences (patches, alternative haplotypes) return only the first.
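The exclusivity rules in these tips amount to three pairwise checks. The sketch below is a stand-in, not the toolkit's Pydantic validation: check_sequence_options is a hypothetical function, the parameter types are guesses, and ValueError is used in place of whatever error the real config raises.

```python
from typing import Optional


def check_sequence_options(mask: Optional[str] = None,
                           mask_feature: bool = False,
                           expand_5prime: Optional[int] = None,
                           expand_3prime: Optional[int] = None,
                           start: Optional[int] = None,
                           end: Optional[int] = None) -> None:
    """Hypothetical validator for the mutually exclusive option pairs."""
    # Repeat masking and feature masking cannot be combined.
    if mask is not None and mask_feature:
        raise ValueError("mask and mask_feature are mutually exclusive")
    # Each expand option conflicts with its paired trim option.
    if expand_5prime is not None and start is not None:
        raise ValueError("expand_5prime and start are mutually exclusive")
    if expand_3prime is not None and end is not None:
        raise ValueError("expand_3prime and end are mutually exclusive")


# A soft-masked request with 5' expansion passes all three checks.
check_sequence_options(mask="soft", expand_5prime=100)
```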
Ensembl Overlap (ensembl-overlap)
Retrieves features overlapping the genomic region of a given Ensembl ID and returns a list of EnsemblOverlapFeatureRecord entries, each exposing the common typed fields (feature type, identifier, biotype, coordinates, strand, region) plus a raw dict carrying the full upstream record.
API Reference
Input: EnsemblOverlapInput
Config: EnsemblOverlapConfig
- overlap_feature: … band, gene, transcript, cds, exon, repeat, simple, misc, variation, somatic_variation, structural_variation, somatic_structural_variation, constrained, regulatory, motif, mane
- assembly: GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
- biotype: … protein_coding); most useful when overlap_feature is gene or transcript.
- so_term: … missense_variant).
- variant_set: … ClinVar).
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to annotate a genomic locus by listing what overlaps it. Identify which gene or transcript contains a GWAS hit, ChIP-seq peak, or ATAC-seq peak, enumerate the regulatory-build features (promoters, enhancers, transcription-factor binding sites) within a region for functional-genomics analysis, or pull overlapping variants filtered to a named set such as ClinVar to ask whether the region is clinically annotated. The locus is typically obtained from ensembl-lookup.
Usage Tips
- Records are typed only on the common fields. Feature-specific keys live in raw. Different feature classes return divergent payload shapes, so read per-feature attributes from each record’s raw dict.
- biotype is most meaningful for gene and transcript features. It filters those classes. Pairing it with unrelated feature types is unlikely to narrow results.
- so_term and variant_set apply to variation features. They have no effect when the feature class is not a variation type.
Ensembl Xrefs (ensembl-xrefs)
Resolves an Ensembl ID to its external-database cross-references and returns a list of EnsemblXref records (external database name, display and primary identifiers, description, cross-reference type) plus the source URL and raw payload.
API Reference
Input: EnsemblXrefsInput
Config: EnsemblXrefsConfig
- assembly: GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
- external_db: … UniProtKB/Swiss-Prot, HGNC).
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to convert identifiers between Ensembl and the other major sequence, gene, and protein resources. Map an Ensembl gene or protein to a UniProt accession before fetching its entry with uniprot-fetch, which then bridges through to PDB structures via SIFTS, recover EntrezGene or RefSeq identifiers for NCBI-side retrieval, or follow GO and InterPro cross-references for functional annotation. The Ensembl ID is commonly produced by ensembl-lookup.
Usage Tips
- Filter on dbname in the result, not just external_db. A single query can return several UniProt-related and RefSeq-related entries, so select the row by its dbname rather than assuming one record per database.
- all_levels changes result scope on gene queries. It fans cross-references out to child transcripts and translations, which can substantially enlarge the result.
- Set object_type when a stable ID resolves ambiguously. It restricts results to one feature type when the ID could map to a gene, transcript, or translation.
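Selecting rows by dbname can be sketched as a grouping step. The example treats each cross-reference as a plain dict with dbname and primary_id keys, which is an assumption about how the records serialize. The helper name and the sample rows are made up.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def primary_ids_by_dbname(xrefs: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Hypothetical helper: group primary identifiers under their dbname."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for xref in xrefs:
        dbname = xref.get("dbname")
        primary_id = xref.get("primary_id")
        # Skip rows missing either key instead of raising.
        if dbname and primary_id:
            grouped[dbname].append(primary_id)
    return dict(grouped)


# Made-up rows: one database can contribute several entries.
rows = [
    {"dbname": "DB_A", "primary_id": "A1"},
    {"dbname": "DB_B", "primary_id": "B1"},
    {"dbname": "DB_A", "primary_id": "A2"},
]
print(primary_ids_by_dbname(rows))
# prints {'DB_A': ['A1', 'A2'], 'DB_B': ['B1']}
```

Grouping first makes the one-to-many case explicit, so downstream code chooses among several entries deliberately instead of taking whichever row comes first.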
Ensembl VEP (ensembl-vep)
Submits an HGVS notation to the Ensembl Variant Effect Predictor REST endpoint and returns a list of EnsemblVEPConsequence records (echoed input, most severe consequence as a Sequence Ontology term, region and coordinates, allele string, raw per-transcript consequences, co-located variants) plus a derived num_consequences count.
API Reference
Input: EnsemblVEPInput
- … 9:g.22125504G>C), coding (ENST00000357654:c.5074G>A), or protein (ENSP00000418960:p.Tyr124Cys) forms all work.
Config: EnsemblVEPConfig
- species: … homo_sapiens. Available options: homo_sapiens, mus_musculus, rattus_norvegicus, danio_rerio, saccharomyces_cerevisiae
- assembly: GRCh38 (default) or GRCh37. Available options: GRCh38, GRCh37
- sift: … b (both prediction + score), p (prediction only), s (score only); None falls back to API default.
- … sift.
- … None keeps the API default (5000).
- per_gene: … pick); incompatible with pick.
- … True is coerced to 1 and False to 0.
- … None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to predict the functional consequence of a variant and assess its likely impact. Classify a coding or genomic HGVS notation as missense, synonymous, stop-gain, splice-disrupting, or noncoding, then read per-transcript SIFT and PolyPhen predictions alongside optional human-only AlphaMissense, REVEL, and CADD pathogenicity scores for clinical or research variant interpretation. Co-located variant lookups surface population-frequency context from gnomAD and ClinVar annotations. Candidate variants are often identified from features returned by ensembl-overlap or from a designed or observed substitution in a downstream design workflow.
Usage Tips
- transcript_consequences and colocated_variants are returned as raw dicts. Their field sets vary by consequence type and annotation toggles, so read them defensively rather than expecting a fixed shape.
- pick and per_gene cannot be combined. Setting both raises a validation error. Choose one collapse strategy.
- Several annotations are species- or assembly-restricted. MANE applies to GRCh38 only. AlphaMissense, REVEL, and CADD are human only. APPRIS, TSL, and CCDS are human and mouse only. Enabling a restricted annotation outside its scope simply yields no extra data.
- A coding or genomic HGVS form is more reliable than a protein form. A protein-level notation can map to multiple transcripts ambiguously, so prefer coding or genomic notation when available.
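Reading transcript_consequences defensively can look like the sketch below. The helper name is hypothetical. The keys it reads (transcript_id, consequence_terms, sift_prediction, polyphen_prediction) are the ones the VEP REST output normally uses for these values, but any of them may be absent depending on consequence type and annotation toggles, so each read goes through .get. The sample dicts are made up.

```python
from typing import Any, Dict, List


def summarize_consequence(tc: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: pull a few fields, tolerating missing keys."""
    return {
        "transcript_id": tc.get("transcript_id"),
        # Normalize a missing or null list to an empty list.
        "consequence_terms": list(tc.get("consequence_terms") or []),
        "sift_prediction": tc.get("sift_prediction"),
        "polyphen_prediction": tc.get("polyphen_prediction"),
    }


def summarize_all(consequences: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Apply the summary to every per-transcript consequence dict."""
    return [summarize_consequence(tc) for tc in consequences]


# Made-up input: the second dict lacks prediction fields entirely.
sample = [
    {"transcript_id": "T1", "consequence_terms": ["missense_variant"],
     "sift_prediction": "deleterious"},
    {"transcript_id": "T2", "consequence_terms": ["intron_variant"]},
]
for row in summarize_all(sample):
    print(row["transcript_id"], row["consequence_terms"], row["sift_prediction"])
```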
Toolkit Notes
These apply to every Ensembl tool in this toolkit (ensembl-lookup, ensembl-sequence, ensembl-overlap, ensembl-xrefs, ensembl-vep).
- Requires network access. Every tool calls the live Ensembl REST API. None runs offline and no local copy of the database is kept.
- Subject to the Ensembl REST rate limit. Ensembl REST enforces a uniform per-IP limit of roughly 55,000 requests per hour, returning HTTP 429 with a Retry-After header when exceeded. There is no account or API key that raises this limit.
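One way a caller might honor the Retry-After header is sketched below. Nothing here describes what the toolkit does internally: get_with_retry is a hypothetical wrapper, send is any caller-supplied function that performs one request and returns (status, headers, body), and the one-second fallback delay is an arbitrary choice.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]


def get_with_retry(send: Callable[[], Response],
                   max_attempts: int = 3,
                   sleep: Callable[[float], None] = time.sleep) -> Response:
    """Hypothetical wrapper: retry on HTTP 429, waiting as Retry-After asks."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # no attempts left, so do not wait again
        try:
            # Retry-After is read as a number of seconds.
            delay = float(headers.get("Retry-After", "1"))
        except ValueError:
            delay = 1.0  # assumed fallback when the header is unparsable
        sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")


# Offline demonstration with a fake sender: one 429, then success.
replies = iter([(429, {"Retry-After": "0.5"}, ""), (200, {}, "{}")])
waits = []
print(get_with_retry(lambda: next(replies), sleep=waits.append)[0])  # prints 200
print(waits)  # prints [0.5]
```

Passing sleep as a parameter keeps the wrapper testable without real delays.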
