Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.
| Function | Description | |
|---|---|---|
| run_ncbi_efetch() | Fetch FASTA records from NCBI sequence databases (protein/nuccore) by accession or ID | Docs Source |
| run_ncbi_esearch() | Search NCBI Entrez databases by query term to find matching IDs | Docs Source |
| run_ncbi_esummary() | Retrieve record summary metadata from NCBI Entrez by ID | Docs Source |
Background
NCBI Entrez and its underlying sequence archive are described in the GenBank report (Sayers et al., 2022), published in Nucleic Acids Research. The Entrez system and the E-utilities are operated by the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine (NLM). Entrez unifies search and retrieval across more than forty interconnected databases, including the protein, nucleotide, and gene databases used here, with records drawn from sources such as RefSeq and GenBank.
Internally, each tool issues an HTTP GET to the E-utilities base endpoint https://eutils.ncbi.nlm.nih.gov/entrez/eutils. ncbi-esearch calls esearch.fcgi and returns the JSON idlist. ncbi-esummary calls esummary.fcgi and returns the JSON result map. ncbi-efetch calls efetch.fcgi with a FASTA rettype and parses the response into records. Every request carries a fixed tool= identifier. The email= and api_key= parameters are sent only when ncbi_email and ncbi_api_key are configured. The request URL surfaced on outputs is sanitized so the API key and email are stripped before it is returned.
Records and their provenance come directly from NCBI’s live E-utilities, so results reflect the database state at query time rather than a fixed release snapshot.
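The request flow described above can be sketched roughly as follows. This is an illustration only, not the toolkit's actual implementation: `build_url`, `sanitize_url`, and the `TOOL_NAME` value are hypothetical names chosen for the example, and no network call is made.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
TOOL_NAME = "example-toolkit"  # hypothetical; the toolkit's real tool= value is not stated here
SENSITIVE_PARAMS = {"api_key", "email"}


def build_url(endpoint, params, email=None, api_key=None):
    """Build an E-utilities GET URL, adding credentials only when configured."""
    query = dict(params, tool=TOOL_NAME)
    if email:
        query["email"] = email
    if api_key:
        query["api_key"] = api_key
    return f"{EUTILS_BASE}/{endpoint}?{urlencode(query)}"


def sanitize_url(url):
    """Return the URL with api_key and email removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = build_url(
    "esearch.fcgi",
    {"db": "protein", "term": "BRCA1", "retmode": "json"},
    email="user@example.org",
    api_key="SECRET",
)
print(sanitize_url(url))
```

Filtering by parameter name rather than by value keeps the sanitizer independent of how the credentials were encoded in the query string.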
Learning Resources
- Entrez Programming Utilities Help (NCBI) - the official E-utilities reference covering each endpoint, parameters, and response formats.
- General usage guidelines and API key information (NCBI) - the official guidance on rate limits, the `tool` and `email` parameters, and obtaining an API key.
- Entrez Help (NCBI) - introduction to Entrez databases, search field tags, and query syntax.
Tools
NCBI Entrez ESearch (ncbi-esearch)
Runs a query term against a chosen Entrez database and returns the list of matching record UIDs, with optional pagination, sort key, single-field restriction, and date filtering on a modification, publication, or Entrez date axis.
API Reference
Input: NCBIEsearchInput
- Available options: protein, nuccore, nucleotide, gene, pubmed, pmc, taxonomy, structure, snp, clinvar, omim, biosample, bioproject, sra, assembly, ipg, mesh, genome, dbvar, gds, geoprofiles, medgen, proteinclusters, protfam, pccompound, pcsubstance, pcassay
- … `retmax`).
- … `retstart`).
- … YYYY/MM/DD (also YYYY/MM and YYYY); requires `datetype`.
Config: NCBIFetchConfig
- … `NCBI_API_KEY` environment variable; an explicit value passed to the config overrides the env var.
- … `tool` and `email` for traceability. Defaults to the `NCBI_EMAIL` environment variable; an explicit value passed to the config overrides the env var.
- … `True` is coerced to 1 and `False` to 0.
- … `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this as the entry point of an Entrez retrieval pipeline: resolve a gene symbol and organism to candidate protein or nucleotide UIDs, page through a large hit set with `retstart` and `max_results`, or restrict a literature query by date before downstream processing. The returned UIDs feed directly into `ncbi-esummary` for metadata screening and `ncbi-efetch` for sequence retrieval, and a resolved accession pairs naturally with the UniProt and sequence-fetch tools.
Usage Tips
- The returned IDs are Entrez UIDs, not always accessions. Depending on the database they may be numeric GI numbers (GenInfo Identifiers). Resolve them through `ncbi-esummary` or `ncbi-efetch` to obtain accession-bearing records.
- Date bounds need a date axis. Setting `mindate`, `maxdate`, or `reldate` without `datetype` is rejected, because NCBI silently ignores date filters that lack an axis.
- `sort` and `field` are database-specific. A key valid on `pubmed` may be invalid on `protein`. Consult the Entrez help for the database being queried.
NCBI Entrez ESummary (ncbi-esummary)
Retrieves the document summary for a UID or accession from a chosen Entrez database and returns the summary as a database-specific mapping alongside the sanitized request URL.
API Reference
Input: NCBIEsummaryInput
- … `NCBIDatabase` for the full set of supported databases. Available options: protein, nuccore, nucleotide, gene, pubmed, pmc, taxonomy, structure, snp, clinvar, omim, biosample, bioproject, sra, assembly, ipg, mesh, genome, dbvar, gds, geoprofiles, medgen, proteinclusters, protfam, pccompound, pcsubstance, pcassay
Config: NCBIFetchConfig
- … `NCBI_API_KEY` environment variable; an explicit value passed to the config overrides the env var.
- … `tool` and `email` for traceability. Defaults to the `NCBI_EMAIL` environment variable; an explicit value passed to the config overrides the env var.
- … `True` is coerced to 1 and `False` to 0.
- … `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to screen candidates cheaply before fetching full records: take the UIDs from `ncbi-esearch`, inspect titles, lengths, or organism fields in the summary, and select the canonical record before paying the cost of a sequence download with `ncbi-efetch`. A gene-database summary also bridges organism-level data to protein-centric records in the UniProt tool by resolving the canonical gene symbol.
Usage Tips
- The summary shape depends on the database. A `gene` summary nests fields under the UID key while `protein` and `nuccore` summaries expose record fields directly. Read the structure for the database queried rather than assuming a fixed schema.
- Multiple identifiers can be summarized in one call. Passing a comma-joined list of UIDs returns one entry per UID, which is the efficient way to screen a full `ncbi-esearch` hit set.
- A missing record raises rather than returning empty. An unresolved database-and-identifier pair raises an error, so guard identifiers that may be obsolete or suppressed.
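A small sketch of how calling code might cope with the two summary shapes and batch identifiers. Both helpers are hypothetical, and the sample dictionaries are made-up placeholders rather than real NCBI responses; the exact field names in a real summary should be read from the actual output.

```python
def join_uids(uids):
    """Comma-join UIDs so a whole hit set can be summarized in one call."""
    return ",".join(str(uid) for uid in uids)


def summary_record(summary, uid):
    """Return the record fields whether or not they are nested under the UID key."""
    nested = summary.get(str(uid))
    # Nested shape: {"<uid>": {...fields...}}; flat shape: {...fields...}.
    return nested if isinstance(nested, dict) else summary


# Placeholder shapes for illustration only (assumed, not captured from NCBI).
gene_like = {"672": {"name": "BRCA1"}}
protein_like = {"title": "example protein", "slen": 100}

print(join_uids([672, 675]))
print(summary_record(gene_like, 672))
print(summary_record(protein_like, 12345))
```

Normalizing the shape in one place keeps downstream screening code from branching on the database name.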
NCBI Entrez EFetch (ncbi-efetch)
Fetches full sequence records by UID or accession from the protein, nuccore, or nucleotide databases, returning parsed FASTA records and the sanitized request URL, with optional subsequence and strand selection.
API Reference
Input: NCBIEfetchInput
- Available options: protein, nuccore, nucleotide
- Available options: fasta, fasta_cds_na
Config: NCBIFetchConfig
- … `NCBI_API_KEY` environment variable; an explicit value passed to the config overrides the env var.
- … `tool` and `email` for traceability. Defaults to the `NCBI_EMAIL` environment variable; an explicit value passed to the config overrides the env var.
- … `True` is coerced to 1 and `False` to 0.
- … `None` waits indefinitely.
- … `BaseToolOutput.approx_equal`), and the seed participates in cache keys. When `None`, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to pull reference sequences into a design or analysis pipeline: retrieve a wild-type protein before sequence design, fetch a coding DNA sequence with `return_format="fasta_cds_na"` for codon-usage analysis, or extract a defined genomic region for regulatory-element work. It is the final stage of the canonical Entrez chain, consuming UIDs produced by `ncbi-esearch` and screened with `ncbi-esummary`, and complements the UniProt and sequence-fetch tools for cross-source retrieval.
Usage Tips
- Restricted to sequence databases. Only `protein`, `nuccore`, and `nucleotide` return sequence bodies. Metadata databases require `ncbi-esummary` instead.
- `fasta_cds_na` requires a nucleotide database. This return format extracts coding DNA and is rejected for `db="protein"`, since CDS extraction has no meaning on a protein record.
- Subsequence coordinates are 1-indexed and inclusive on both ends, to match biological residue selection conventions. Position 1 is the first residue, and `seq_stop` is included in the returned span.
- Strand `"-"` returns the reverse complement. Antisense retrieval applies to nucleotide databases. It returns the reverse complement of the requested region.
Toolkit Notes
These apply to every NCBI tool in this toolkit (ncbi-esearch, ncbi-esummary, ncbi-efetch).
- Requires network access. The tools call the live NCBI E-utilities online.
- An NCBI API key raises the rate limit. Without an API key, NCBI E-utilities permits 3 requests per second per IP. Setting an API key raises this to 10 requests per second. A key is obtained at no cost from the Settings page of a free NCBI account (https://www.ncbi.nlm.nih.gov/account/). NCBI also asks that a contact email be set; it uses the email for abuse handling and IP-block recovery. Provide credentials either via the `ncbi_api_key` / `ncbi_email` config attributes or via the `NCBI_API_KEY` / `NCBI_EMAIL` environment variables; an explicit config value overrides the env var.
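The precedence rule and the two rate limits can be expressed as a minimal sketch for callers that throttle their own loops. The helper names are hypothetical and the toolkit may resolve credentials differently internally; only the environment variable names and the 3 and 10 requests-per-second figures come from the text above.

```python
import os


def resolve_setting(explicit, env_name, environ=None):
    """Prefer an explicit config value; otherwise fall back to the environment variable."""
    environ = os.environ if environ is None else environ
    return explicit if explicit is not None else environ.get(env_name)


def min_request_interval(api_key):
    """Seconds to wait between requests: 10 per second with a key, 3 per second without."""
    requests_per_second = 10 if api_key else 3
    return 1.0 / requests_per_second


api_key = resolve_setting(None, "NCBI_API_KEY")
print(min_request_interval(api_key))
```

Sleeping for `min_request_interval(api_key)` between calls is one simple way to stay under the documented limit when looping over many identifiers.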
