NCBI Entrez - Proto

License: NCBI Entrez retrieves data from NCBI’s Entrez databases, which are in the U.S. Government public domain. The client wrapper code is licensed under Apache-2.0. Refer to the data terms for full details.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by NCBI. Product names, logos, and trademarks are the property of their respective owners.


Website: ncbi.nlm.nih.gov

Publication: GenBank. Eric W Sayers, Mark Cavanaugh, … Ilene Karsch-Mizrachi. Nucleic Acids Research (2022).

@article{sayers2022genbank,
  title={GenBank},
  author={Sayers, Eric W and Cavanaugh, Mark and Clark, Karen and Pruitt, Kim D and Schoch, Conrad L and Sherry, Stephen T and Karsch-Mizrachi, Ilene},
  journal={Nucleic Acids Research},
  volume={50},
  number={D1},
  pages={D161--D164},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkab1135}
}


Tool source: proto-bio/proto-tools/proto_tools/tools/database_retrieval/ncbi


Function	Description
`run_ncbi_efetch()`	Fetch FASTA records from NCBI sequence databases (protein/nuccore) by accession or ID
`run_ncbi_esearch()`	Search NCBI Entrez databases by query term to find matching IDs
`run_ncbi_esummary()`	Retrieve record summary metadata from NCBI Entrez by ID

Background

NCBI Entrez and its underlying sequence archive are described in the GenBank report (Sayers et al., 2022), published in Nucleic Acids Research. The Entrez system and the E-utilities are operated by the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine (NLM). Entrez unifies search and retrieval across more than forty interconnected databases, including the protein, nucleotide, and gene databases used here, with records drawn from sources such as RefSeq and GenBank.

Internally, each tool issues an HTTP GET to the E-utilities base endpoint https://eutils.ncbi.nlm.nih.gov/entrez/eutils. ncbi-esearch calls esearch.fcgi and returns the JSON idlist. ncbi-esummary calls esummary.fcgi and returns the JSON result map. ncbi-efetch calls efetch.fcgi with a FASTA rettype and parses the response into records.

Every request carries a fixed tool= identifier. The email= and api_key= parameters are sent only when ncbi_email and ncbi_api_key are configured. The request URL surfaced on outputs is sanitized so the API key and email are stripped before it is returned. Records and their provenance come directly from NCBI’s live E-utilities, so results reflect the database state at query time rather than a fixed release snapshot.
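The request construction and URL sanitization described above can be sketched with the Python standard library. This is an illustrative sketch, not the toolkit's implementation: the helper names `build_eutils_url` and `sanitize_url` and the tool value `example-tool` are assumptions made for the example, while the base endpoint and the tool, email, and api_key parameter names come from the text.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
SENSITIVE_PARAMS = {"api_key", "email"}


def build_eutils_url(endpoint, params, email=None, api_key=None, tool="example-tool"):
    """Build a GET URL for an E-utilities endpoint such as 'esearch.fcgi'."""
    query = dict(params)
    query["tool"] = tool  # fixed identifier carried on every request
    if email:
        query["email"] = email  # sent only when configured
    if api_key:
        query["api_key"] = api_key  # sent only when configured
    return f"{EUTILS_BASE}/{endpoint}?{urlencode(query)}"


def sanitize_url(url):
    """Drop credential-bearing query parameters before surfacing the URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = build_eutils_url(
    "esearch.fcgi",
    {"db": "protein", "term": "TP53[Gene Name]", "retmode": "json"},
    email="user@example.org",
    api_key="SECRET",
)
print(sanitize_url(url))
```

Filtering by parameter name rather than by value keeps the sanitizer correct even if a credential happens to appear elsewhere in the query string.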

Learning Resources

Entrez Programming Utilities Help (NCBI) - the official E-utilities reference covering each endpoint, parameters, and response formats.
General usage guidelines and API key information (NCBI) - the official guidance on rate limits, the tool and email parameters, and obtaining an API key.
Entrez Help (NCBI) - introduction to Entrez databases, search field tags, and query syntax.

Tools

NCBI Entrez ESearch (`ncbi-esearch`)

Runs a query term against a chosen Entrez database and returns the list of matching record UIDs, with optional pagination, sort key, single-field restriction, and date filtering on a modification, publication, or Entrez date axis.

API Reference

Source

Input: NCBIEsearchInput

enum

required

NCBI database to query (e.g. ‘protein’, ‘nuccore’, ‘gene’, ‘pubmed’, ‘taxonomy’, ‘structure’). Available options: protein, nuccore, nucleotide, gene, pubmed, pmc, taxonomy, structure, snp, clinvar, omim, biosample, bioproject, sra, assembly, ipg, mesh, genome, dbvar, gds, geoprofiles, medgen, proteinclusters, protfam, pccompound, pcsubstance, pcassay

search_term

string

required

NCBI search query.

max_results

integer

default:"20"

Max IDs returned (NCBI retmax).

retstart

integer

default:"0"

0-indexed offset of the first hit (NCBI retstart).

sort

string

Sort key (db-dependent — e.g. ‘relevance’ / ‘pub_date’ / ‘most_recent’ on pubmed).

field

string

Restrict the search term to a single index field (db-dependent — e.g. ‘title’ / ‘author’ on pubmed).

datetype

string

Date axis for mindate/maxdate/reldate (modification / publication / Entrez).

mindate

string

Lower date bound, YYYY/MM/DD (also YYYY/MM and YYYY); requires datetype.

maxdate

string

Upper date bound; requires datetype.

reldate

integer

Restrict to records dated within the last N days; requires datetype.

Source

Config: NCBIFetchConfig

ncbi_api_key

string

Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.

ncbi_email

string

Optional contact email. NCBI usage policy requires both tool and email for traceability. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Source

Output: NCBIEsearchOutput

ids

List[string]

List of NCBI IDs matching the search query.

Applications

Use this as the entry point of an Entrez retrieval pipeline: resolve a gene symbol and organism to candidate protein or nucleotide UIDs, page through a large hit set with retstart and max_results, or restrict a literature query by date before downstream processing. The returned UIDs feed directly into ncbi-esummary for metadata screening and ncbi-efetch for sequence retrieval, and a resolved accession pairs naturally with the UniProt and sequence-fetch tools.
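Paging with retstart and max_results can be sketched as a loop that stops when a page comes back short. The sketch below assumes a hypothetical callable `search_page(retstart, max_results)` that returns a list of IDs; it stands in for an actual ESearch call, and `collect_all_ids` is not part of the toolkit.

```python
def collect_all_ids(search_page, page_size=20, max_pages=1000):
    """Page through a hit set until a short (or empty) page signals the end."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    ids = []
    for page in range(max_pages):
        # retstart is a 0-indexed offset, so page N starts at N * page_size.
        batch = search_page(page * page_size, page_size)
        ids.extend(batch)
        if len(batch) < page_size:
            break
    return ids


# Stand-in for a real ESearch call: slices a fixed list of 45 fake UIDs.
FAKE_HITS = [str(1000 + i) for i in range(45)]


def fake_search_page(retstart, max_results):
    return FAKE_HITS[retstart:retstart + max_results]


all_ids = collect_all_ids(fake_search_page, page_size=20)
print(len(all_ids))  # 45
```

The `max_pages` cap is a safety limit so that a misbehaving page function cannot loop forever.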

Usage Tips

The returned IDs are Entrez UIDs, not always accessions. Depending on the database they may be numeric GI numbers (GenInfo Identifiers). Resolve them through ncbi-esummary or ncbi-efetch to obtain accession-bearing records.
Date bounds need a date axis. Setting mindate, maxdate, or reldate without datetype is rejected, because NCBI silently ignores date filters that lack an axis.
sort and field are database-specific. A key valid on pubmed may be invalid on protein. Consult the Entrez help for the database being queried.
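The date-axis rule above can be illustrated with a small parameter builder. This is a sketch under assumptions: `build_esearch_params` is a hypothetical helper, not the toolkit's function, and the mapping of max_results to the NCBI retmax parameter follows the field description. The example uses `pdat`, the E-utilities value for the publication date axis; the spelling the tool itself accepts may differ.

```python
def build_esearch_params(db, term, max_results=20, retstart=0, sort=None,
                         field=None, datetype=None, mindate=None,
                         maxdate=None, reldate=None):
    """Assemble ESearch query parameters, rejecting date filters with no axis."""
    has_date_filter = any(v is not None for v in (mindate, maxdate, reldate))
    if has_date_filter and datetype is None:
        # NCBI would silently ignore the filter, so fail loudly instead.
        raise ValueError("mindate, maxdate, and reldate require datetype")

    params = {
        "db": db,
        "term": term,
        "retmax": max_results,  # max_results maps to NCBI retmax
        "retstart": retstart,
        "retmode": "json",
    }
    optional = {
        "sort": sort,
        "field": field,
        "datetype": datetype,
        "mindate": mindate,
        "maxdate": maxdate,
        "reldate": reldate,
    }
    # Only send optional parameters that were actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


params = build_esearch_params(
    "pubmed", "CRISPR", datetype="pdat", mindate="2020/01/01", maxdate="2020/12/31"
)
print(params)
```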

NCBI Entrez ESummary (`ncbi-esummary`)

Retrieves the document summary for a UID or accession from a chosen Entrez database and returns the summary as a database-specific mapping alongside the sanitized request URL.

API Reference

Source

Input: NCBIEsummaryInput

enum

required

NCBI database to query (e.g. ‘protein’, ‘nuccore’, ‘gene’, ‘pubmed’, ‘taxonomy’, ‘structure’). See NCBIDatabase for the full set of supported databases. Available options: protein, nuccore, nucleotide, gene, pubmed, pmc, taxonomy, structure, snp, clinvar, omim, biosample, bioproject, sra, assembly, ipg, mesh, genome, dbvar, gds, geoprofiles, medgen, proteinclusters, protfam, pccompound, pcsubstance, pcassay

identifier

string

required

Accession or NCBI ID to summarize (e.g. ‘NP_000537.3’, ‘7157’).

Source

Config: NCBIFetchConfig

ncbi_api_key

string

Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.

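ncbi_email

string

Optional contact email. NCBI usage policy requires both tool and email for traceability. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.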

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

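seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.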

Source

Output: NCBIEsummaryOutput

summary

Dict[string, any]

Record summary data returned by esummary.

source_url

string

required

Sanitized URL used for the request.

Applications

Use this to screen candidates cheaply before fetching full records: take the UIDs from ncbi-esearch, inspect titles, lengths, or organism fields in the summary, and select the canonical record before paying the cost of a sequence download with ncbi-efetch. A gene-database summary also bridges organism-level data to protein-centric records in the UniProt tool by resolving the canonical gene symbol.

Usage Tips

The summary shape depends on the database. A gene summary nests fields under the UID key while protein and nuccore summaries expose record fields directly. Read the structure for the database queried rather than assuming a fixed schema.
Multiple identifiers can be summarized in one call. Passing a comma-joined list of UIDs returns one entry per UID, which is the efficient way to screen a full ncbi-esearch hit set.
A missing record raises rather than returning empty. An unresolved database-and-identifier pair raises an error, so guard identifiers that may be obsolete or suppressed.
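When several UIDs are summarized at once, the per-UID entries have to be separated from the surrounding envelope. The sketch below assumes the raw E-utilities JSON layout, in which a `result` object holds a `uids` list plus one entry keyed by each UID; the `summary` mapping returned by the tool may already be unwrapped differently. `summaries_by_uid` is a hypothetical helper and the payload is a trimmed, illustrative sample.

```python
def summaries_by_uid(payload):
    """Return {uid: summary} from an ESummary-style JSON payload."""
    result = payload.get("result", {})
    # 'uids' lists the keys that hold per-record summaries.
    return {uid: result[uid] for uid in result.get("uids", []) if uid in result}


# Trimmed sample in the shape of a gene-database ESummary response.
sample_payload = {
    "header": {"type": "esummary"},
    "result": {
        "uids": ["7157", "672"],
        "7157": {"uid": "7157", "name": "TP53"},
        "672": {"uid": "672", "name": "BRCA1"},
    },
}

for uid, summary in summaries_by_uid(sample_payload).items():
    print(uid, summary["name"])
```

Iterating over `uids` rather than over every key in `result` avoids treating the `uids` list itself as a record.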

NCBI Entrez EFetch (`ncbi-efetch`)

Fetches full sequence records by UID or accession from the protein, nuccore, or nucleotide databases, returning parsed FASTA records and the sanitized request URL, with optional subsequence and strand selection.

API Reference

Source

Input: NCBIEfetchInput

enum

required

Sequence database to query. Available options: protein, nuccore, nucleotide

identifier

string

required

Accession or NCBI ID to fetch (e.g. ‘NP_000537.3’).

return_format

enum

default:"fasta"

NCBI rettype. ‘fasta_cds_na’ is nuccore-only. Available options: fasta, fasta_cds_na

seq_start

integer

Subsequence start (1-indexed, inclusive).

seq_stop

integer

Subsequence stop (1-indexed, inclusive).

strand

string

Strand for nucleotide retrieval.

Source

Config: NCBIFetchConfig

ncbi_api_key

string

Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.

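ncbi_email

string

Optional contact email. NCBI usage policy requires both tool and email for traceability. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.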

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

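seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.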

Source

Output: NCBIEfetchOutput

fasta_records

List[NCBIFastaRecord]

Parsed FASTA records from efetch.

Each NCBIFastaRecord has the fields header, sequence, and accession:

header

string

required

FASTA header line (without >).

sequence

string

required

Sequence string with whitespace stripped.

accession

string

Best-effort accession extracted from header.

source_url

string

required

Sanitized URL used for the request.
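The record shape above (header without the leading >, whitespace-stripped sequence, best-effort accession) can be illustrated with a minimal FASTA parser. This is a sketch, not the toolkit's parser: `FastaRecord` and `parse_fasta` are hypothetical names, and the accession heuristic (first whitespace-delimited token of the header) is an assumption that will not suit every header style. The sequences are truncated toy data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FastaRecord:
    header: str                      # header line without the leading '>'
    sequence: str                    # residues with all whitespace removed
    accession: Optional[str] = None  # best-effort guess from the header


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    sequence = "".join("".join(chunks).split())
    tokens = header.split()
    # Assumption: the first header token is the accession.
    return FastaRecord(header, sequence, tokens[0] if tokens else None)


def parse_fasta(text: str) -> List[FastaRecord]:
    """Split FASTA text into records, one per '>' header line."""
    records: List[FastaRecord] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


fasta_text = """>NP_000537.3 cellular tumor antigen p53 isoform a [Homo sapiens]
MEEPQSDPSV
EPPLSQETFS
>EXAMPLE_2 toy record
ACDE FGHI
"""

for record in parse_fasta(fasta_text):
    print(record.accession, len(record.sequence))
```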

Applications

Use this to pull reference sequences into a design or analysis pipeline: retrieve a wild-type protein before sequence design, fetch a coding DNA sequence with return_format="fasta_cds_na" for codon-usage analysis, or extract a defined genomic region for regulatory-element work. It is the final stage of the canonical Entrez chain, consuming UIDs produced by ncbi-esearch and screened with ncbi-esummary, and complements the UniProt and sequence-fetch tools for cross-source retrieval.

Usage Tips

Restricted to sequence databases. Only protein, nuccore, and nucleotide return sequence bodies. Metadata databases require ncbi-esummary instead.
fasta_cds_na requires a nucleotide database. This return format extracts coding DNA and is rejected for db="protein", since CDS extraction has no meaning on a protein record.
Subsequence coordinates are 1-indexed and inclusive on both ends, to match biological residue selection conventions. Position 1 is the first residue, and seq_stop is included in the returned span.
Strand "-" returns the reverse complement. Antisense retrieval applies to nucleotide databases, where it returns the reverse complement of the requested region.
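The coordinate and strand conventions above can be checked against a local reimplementation. In the tool these values are passed to NCBI, which does the slicing server-side; the sketch below only mirrors the documented semantics on an in-memory string. `subsequence` is a hypothetical helper, and the "+" default for the forward strand is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def subsequence(seq, seq_start, seq_stop, strand="+"):
    """Return seq[seq_start..seq_stop], 1-indexed and inclusive on both ends."""
    if seq_start < 1 or seq_stop < seq_start or seq_stop > len(seq):
        raise ValueError("coordinates must satisfy 1 <= seq_start <= seq_stop <= len(seq)")
    # Convert the 1-indexed inclusive span to Python's 0-indexed half-open slice.
    region = seq[seq_start - 1:seq_stop]
    if strand == "-":
        # Reverse complement: complement each base, then reverse the order.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


print(subsequence("ATGCCGTA", 2, 5))       # TGCC
print(subsequence("ATGCCGTA", 2, 5, "-"))  # GGCA
```

Note that a span from seq_start to seq_stop contains seq_stop - seq_start + 1 residues, one more than a half-open slice with the same numbers would.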

Toolkit Notes

These apply to every NCBI tool in this toolkit (ncbi-esearch, ncbi-esummary, ncbi-efetch).

Requires network access. The tools call the live NCBI E-utilities online.
An NCBI API key raises the rate limit. Without an API key, NCBI E-utilities permits 3 requests per second per IP. Setting a key raises this to 10 requests per second. A key is obtained at no cost from the Settings page of a free NCBI account (https://www.ncbi.nlm.nih.gov/account/). NCBI also asks that a contact email be set; it uses the email for abuse handling and IP-block recovery. Provide credentials either via the ncbi_api_key / ncbi_email config attributes or via the NCBI_API_KEY / NCBI_EMAIL environment variables; an explicit config value overrides the env var.
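The credential precedence and rate limits above can be sketched as follows. This is a client-side illustration only: `resolve_setting`, `min_request_interval`, and `Throttle` are hypothetical helpers, and whether the toolkit throttles requests itself is not stated here. The environment variable names and the 3 and 10 requests per second figures come from the text.

```python
import os
import time


def resolve_setting(explicit, env_var, environ=None):
    """An explicit config value wins; otherwise fall back to the env var."""
    environ = os.environ if environ is None else environ
    return explicit if explicit is not None else environ.get(env_var)


def min_request_interval(api_key):
    """Seconds between requests: 10 per second with a key, 3 per second without."""
    return 1.0 / (10 if api_key else 3)


class Throttle:
    """Spaces calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self):
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot relative to whichever is later.
        self._next_allowed = max(now, self._next_allowed) + self._interval


fake_env = {"NCBI_API_KEY": "key-from-env"}
api_key = resolve_setting(None, "NCBI_API_KEY", environ=fake_env)
print(api_key, min_request_interval(api_key))
```

A caller would create one `Throttle(min_request_interval(api_key))` per process and call `wait()` before each request; the clock and sleep functions are injectable so the behavior can be tested without real delays.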

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
