License: This toolkit retrieves data from the InterPro classification, which is distributed under the EMBL-EBI Terms of Use. The client wrapper code is licensed under Apache-2.0. Refer to the data terms for the full conditions.

Proto is not affiliated with EMBL-EBI. This toolkit is open source and builds on the implementation produced by EMBL-EBI. Product names, logos, and trademarks are the property of their respective owners.


ebi-pf-team/interproscan
Genome-scale protein function classification
379 stars
ebi.ac.uk
InterPro: the protein sequence classification resource in 2025
Matthias Blum, Antonina Andreeva, … Alex Bateman
Nucleic Acids Research (2025)
@article{blum2025interpro,
  title={{InterPro}: the protein sequence classification resource in 2025},
  author={Blum, Matthias and Andreeva, Antonina and Florentino, Laise Cavalcanti and Chuguransky, Sara Rocio and Grego, Tiago and Hobbs, Emma and Pinto, Beatriz Lazaro and Orr, Ailsa and Paysan-Lafosse, Typhaine and Ponamareva, Irina and Salazar, Gustavo A. and Bordin, Nicola and Bork, Peer and Bridge, Alan and Colwell, Lucy and Gough, Julian and Haft, Daniel H. and Letunic, Ivica and Llinares-L{\'o}pez, Felipe and Marchler-Bauer, Aron and Meng-Papaxanthos, Laetitia and Mi, Huaiyu and Natale, Darren A. and Orengo, Christine A. and Pandurangan, Arun P. and Piovesan, Damiano and Rivoire, Catherine and Sigrist, Christian J. A. and Thanki, Narmada and Thibaud-Nissen, Fran{\c{c}}oise and Thomas, Paul D. and Tosatto, Silvio C. E. and Wu, Cathy H. and Bateman, Alex},
  journal={Nucleic Acids Research},
  volume={53},
  number={D1},
  pages={D444--D456},
  year={2025},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkae1082}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/interproscan
Function | Description
run_interproscan_fetch() | Fetch InterPro domain annotations by UniProt accession (direct REST lookup) or by raw protein seq…

Background

InterPro (Blum et al., 2025) is a freely accessible classification of protein families, domains, conserved sites, and homologous superfamilies, maintained by EMBL-EBI. A protein family is a set of evolutionarily related proteins that descend from a shared ancestor and share detectable sequence similarity, typically along with a common three-dimensional fold or biological function. A single InterPro entry groups complementary member-database signatures, such as a Pfam HMM and a CATH-Gene3D structural model, under one accession. InterProScan is the analysis pipeline that runs the member-database models against a sequence, and EBI exposes it as a public web service.

The tool reaches this data through two paths: a direct lookup by UniProt accession and a sequence submission. The direct path issues GET https://www.ebi.ac.uk/interpro/api/entry/all/protein/uniprot/{accession} and walks the opaque next cursor across paginated responses until the result set is exhausted. The submit path issues POST https://www.ebi.ac.uk/Tools/services/rest/iprscan5/run/ with a required contact email and the sequence, receives a plain-text job ID, polls /status/{job_id} every three seconds until the job reaches FINISHED, then fetches /result/{job_id}/json.

Both paths flatten matches into the same row schema. Each member-database match contributes rows carrying 1-indexed inclusive start and end coordinates (matching biological residue selection conventions), a unified type label, the parent InterPro accession when the signature is integrated, and optional Gene Ontology (GO) and pathway cross-references. Annotations and their provenance come directly from EMBL-EBI’s official InterPro REST API and iprscan5 service. Results reflect the live resource at query time rather than a fixed release snapshot.
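The cursor walk on the direct path can be sketched roughly as follows. This is a minimal illustration and not the toolkit's actual implementation: the helper names http_get and iter_direct_entries are hypothetical, and the results and next response keys are assumptions about the JSON page layout.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterator

DIRECT_URL = "https://www.ebi.ac.uk/interpro/api/entry/all/protein/uniprot/{accession}"


def http_get(url: str) -> str:
    """Hypothetical helper: GET a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def iter_direct_entries(
    accession: str, get: Callable[[str], str] = http_get
) -> Iterator[Dict[str, Any]]:
    """Yield every entry for an accession, following the cursor until it runs out."""
    url = DIRECT_URL.format(accession=accession)
    while url:
        body = get(url)
        if not body.strip():
            # An empty body is treated here as "nothing indexed" (assumption).
            return
        page = json.loads(body)
        # "results" and "next" are assumed key names for the page payload.
        yield from page.get("results", [])
        url = page.get("next")
```

Passing the fetch function as a parameter keeps the pagination logic testable without network access; a caller would normally leave get at its default.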

Learning Resources

Tools

InterProScan Fetch (interproscan-fetch)

Retrieves InterPro domain annotations for a protein, either by direct REST lookup of a UniProt accession or by submitting a raw sequence to the iprscan5 service, and returns the resolved accession, sequence length, the list of member-database hits, the source URL, the iprscan5 job ID on the sequence path, and the raw API entries.
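The sequence path (submit, poll, fetch) might look like the sketch below. It is illustrative only: run_iprscan5, http_get, and http_post_form are hypothetical names, the email and sequence form fields follow the description in Background, and the set of failure status labels is an assumption rather than a documented contract.

```python
import time
import urllib.parse
import urllib.request
from typing import Callable, Dict

IPRSCAN5 = "https://www.ebi.ac.uk/Tools/services/rest/iprscan5"
FAILED_STATES = {"ERROR", "FAILURE", "NOT_FOUND"}  # assumed failure labels


def http_get(url: str) -> str:
    """Hypothetical helper: GET a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def http_post_form(url: str, fields: Dict[str, str]) -> str:
    """Hypothetical helper: POST form-encoded fields and return the body as text."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=30) as resp:
        return resp.read().decode("utf-8")


def run_iprscan5(
    sequence: str,
    email: str,
    get: Callable[[str], str] = http_get,
    post: Callable[[str, Dict[str, str]], str] = http_post_form,
    poll_seconds: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a sequence, wait for FINISHED, and return the raw JSON result text."""
    job_id = post(f"{IPRSCAN5}/run/", {"email": email, "sequence": sequence}).strip()
    deadline = time.monotonic() + timeout
    while True:
        status = get(f"{IPRSCAN5}/status/{job_id}").strip()
        if status == "FINISHED":
            return get(f"{IPRSCAN5}/result/{job_id}/json")
        if status in FAILED_STATES:
            raise RuntimeError(f"iprscan5 job {job_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"iprscan5 job {job_id} did not finish within {timeout}s")
        sleep(poll_seconds)
```

The three-second poll interval and 600-second ceiling mirror the values stated on this page; both are exposed as parameters so that a caller can adjust them.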

API Reference

Inputs
  • uniprot_id (string): UniProt accession for direct entry lookup against interpro/api/entry/all/protein/uniprot/{acc}/.
  • sequence (string): Raw protein sequence for the iprscan5 submit-and-scan path. Requires config.email.
Config
  • email (string): Required by EBI’s iprscan5 endpoint when submitting a sequence; ignored on the direct UniProt-lookup path. Defaults to the INTERPROSCAN_EMAIL environment variable; an explicit value passed to the config overrides the env var.
  • applications (array): Submit-only. Restricts iprscan5 to a subset of member databases. None runs the EBI default set (every application enabled, matching upstream appl[] defaults).
  • include_go_terms (boolean, default: True): Include GO term cross-references in the output. Maps to iprscan5’s goterms form param on the submit path; filters parser output on the direct path.
  • include_pathways (boolean, default: True): Fetch Reactome/KEGG/MetaCyc pathway cross-references after an iprscan5 sequence submission. Has no effect on the UniProt-id path: InterPro’s UniProt-keyed endpoint does not return pathway data, so this stays empty on that path regardless of the flag.
  • sequence_type (enum, default: protein): Submit-only. nucleic tells iprscan5 to 6-frame translate the input. Available options: protein, nucleic.
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: cpu): Device to run the tool on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output
  • accession (string): Resolved UniProt accession; None when the sequence path returns a result without a UniProt cross-reference.
  • sequence_length (integer): Length of the queried protein.
  • domains (List[InterProDomain]): All hits across all member databases, in the order returned by the API.
  • num_domains (integer, required): len(domains).
  • job_id (string, required): iprscan5 job ID for the submit path; empty string for the direct-lookup path.
  • source_url (string, required): Canonical InterPro entry URL for the resolved accession (or the iprscan5 result URL on the sequence path).
  • raw_entries (List[Dict[string, any]]): Raw API JSON entries (one per InterPro entry on the direct path, one per match on the sequence path) for advanced consumers.
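Because the score field carries whichever value each member database publishes (an e-value for some, a bit-score for others), it can help to partition domains by database before comparing anything. The sketch below uses a hypothetical Hit dataclass as a stand-in for InterProDomain; its field names are assumed from the descriptions on this page and may not match the real class.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Hit:
    """Hypothetical stand-in for one row of `domains`; field names are assumed."""
    member_database: str
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive
    score: Optional[float] = None


def group_by_member_database(hits: Iterable[Hit]) -> Dict[str, List[Hit]]:
    """Partition hits so that scores are only compared within one database."""
    grouped: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        grouped[hit.member_database].append(hit)
    return dict(grouped)
```

Whether a lower or a higher score is better depends on the database, so any ranking within a group is left to the caller.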

Applications

Use this to attach domain, family, and site annotation to a protein before design or filtering: identify the residues of an active_site or conserved_site match to lock before a redesign loop, partition a sequence into typed family and domain regions, or collect GO and pathway cross-references for functional grouping. The resolved accession and the parent InterPro identifiers compose with the UniProt and AlphaFold DB tools for accession resolution and structural context.
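As a rough illustration of the residue-locking use case, the sketch below collects the positions covered by active_site and conserved_site matches and shows how 1-indexed inclusive coordinates map onto a Python slice. The function names and the (type, start, end) tuple layout are hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Match = Tuple[str, int, int]  # (type label, start, end), 1-indexed inclusive


def residues_to_lock(
    matches: Iterable[Match],
    lock_types: Sequence[str] = ("active_site", "conserved_site"),
) -> List[int]:
    """Return the sorted 1-indexed positions covered by matches of the given types."""
    locked: Set[int] = set()
    for match_type, start, end in matches:
        if match_type in lock_types:
            # The end coordinate is inclusive, hence the +1 for range().
            locked.update(range(start, end + 1))
    return sorted(locked)


def slice_region(sequence: str, start: int, end: int) -> str:
    """Extract a region given 1-indexed inclusive coordinates."""
    return sequence[start - 1:end]
```

For example, slice_region("MKTAYIAKQR", 3, 5) returns "TAY": position 3 becomes index 2, and the inclusive end of 5 is already the exclusive slice bound.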

Usage Tips

  • The sequence-submission path requires a contact email. When sequence is provided, config.email must be set. Provide it either via the email config attribute or via the INTERPROSCAN_EMAIL environment variable; an explicit config value overrides the env var. The tool raises a clear ValueError before contacting the server if neither is set. The direct accession path ignores email.
  • Provide exactly one of uniprot_id or sequence. The input validator rejects a call that supplies both or neither.
  • score units are not uniform across rows. The field carries whichever value the source member database publishes, an e-value for some databases and a bit-score for others, so filter by member_database before comparing scores.
  • The direct path returns no pathway cross-references. InterPro’s UniProt-keyed endpoint does not surface pathway data, so pathways stays empty on that path regardless of configuration. Pathways are only populated on the sequence-submission path.
  • A direct lookup raises when the accession is not indexed. Very recent or removed UniProt accessions outside InterPro’s coverage return no entries, surfacing as a ValueError rather than an empty result.
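The first two tips can be mirrored on the caller's side with a small pre-flight check. This is only a sketch of the rules as described above; validate_request and resolve_email are hypothetical names and do not refer to the toolkit's own validator.

```python
import os
from typing import Optional


def resolve_email(config_email: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit config value, then fall back to INTERPROSCAN_EMAIL."""
    return config_email or os.environ.get("INTERPROSCAN_EMAIL") or None


def validate_request(
    uniprot_id: Optional[str] = None,
    sequence: Optional[str] = None,
    config_email: Optional[str] = None,
) -> str:
    """Return "direct" or "submit", raising ValueError on an invalid combination."""
    if (uniprot_id is None) == (sequence is None):
        raise ValueError("Provide exactly one of uniprot_id or sequence.")
    if sequence is not None and not resolve_email(config_email):
        raise ValueError(
            "The sequence path needs a contact email: set config.email "
            "or the INTERPROSCAN_EMAIL environment variable."
        )
    return "direct" if uniprot_id is not None else "submit"
```

Running a check like this before any network call reproduces the documented behaviour of failing early, before the server is contacted.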

Toolkit Notes

These apply to every InterProScan tool in this toolkit (interproscan-fetch).
  • Requires network access. The tool calls the live InterPro REST API and iprscan5 service. It does not run offline and keeps no local copy of the data.
  • The sequence-submission path requires a contact email for identification. This email lets EBI contact the submitter about job issues. It does not raise any bandwidth or rate allowance.
  • Sequence submissions are subject to a fair-use concurrency cap. EBI asks that jobs be submitted in batches of no more than 30 concurrent jobs.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
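One way to stay within the 30-job fair-use cap when scanning many sequences is to bound the worker pool. The sketch below is an assumption about how a caller might do this, not part of the toolkit: run_with_concurrency_cap is a hypothetical helper, and run_one stands for whatever callable performs a single submission.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
MAX_CONCURRENT_JOBS = 30  # fair-use cap stated in the Toolkit Notes


def run_with_concurrency_cap(
    sequences: Sequence[str],
    run_one: Callable[[str], T],
    max_concurrent: int = MAX_CONCURRENT_JOBS,
) -> List[T]:
    """Run run_one over all sequences with at most max_concurrent in flight."""
    if not 1 <= max_concurrent <= MAX_CONCURRENT_JOBS:
        raise ValueError(f"max_concurrent must be between 1 and {MAX_CONCURRENT_JOBS}")
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves input order and re-raises the first worker exception.
        return list(pool.map(run_one, sequences))
```

Threads suit this workload because each job spends most of its time waiting on the remote service rather than computing locally.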

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.