License: The NCBI Entrez tools retrieve data from NCBI’s Entrez databases, which are in the public domain as U.S. Government works. The client wrapper code is licensed under Apache-2.0. Refer to the data terms for the full conditions.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


ncbi.nlm.nih.gov
GenBank
Eric W Sayers, Mark Cavanaugh, … Ilene Karsch-Mizrachi
Nucleic Acids Research (2022)
@article{sayers2022genbank,
  title={GenBank},
  author={Sayers, Eric W and Cavanaugh, Mark and Clark, Karen and Pruitt, Kim D and Schoch, Conrad L and Sherry, Stephen T and Karsch-Mizrachi, Ilene},
  journal={Nucleic Acids Research},
  volume={50},
  number={D1},
  pages={D161--D164},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkab1135}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/ncbi
Function               Description
run_ncbi_efetch()      Fetch FASTA records from NCBI sequence dbs (protein/nuccore) by accession or ID
run_ncbi_esearch()     Search NCBI Entrez databases by query term to find matching IDs
run_ncbi_esummary()    Retrieve record summary metadata from NCBI Entrez by ID

Background

NCBI Entrez and its underlying sequence archive are described in the GenBank report (Sayers et al., 2022), published in Nucleic Acids Research. The Entrez system and the E-utilities are operated by the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine (NLM). Entrez unifies search and retrieval across more than forty interconnected databases, including the protein, nucleotide, and gene databases used here, with records drawn from sources such as RefSeq and GenBank.

Internally, each tool issues an HTTP GET to the E-utilities base endpoint https://eutils.ncbi.nlm.nih.gov/entrez/eutils. ncbi-esearch calls esearch.fcgi and returns the JSON idlist. ncbi-esummary calls esummary.fcgi and returns the JSON result map. ncbi-efetch calls efetch.fcgi with a FASTA rettype and parses the response into records. Every request carries a fixed tool= identifier. The email= and api_key= parameters are sent only when ncbi_email and ncbi_api_key are configured. The request URL surfaced on outputs is sanitized so the API key and email are stripped before it is returned.

Records and their provenance come directly from NCBI’s live E-utilities, so results reflect the database state at query time rather than a fixed release snapshot.
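The request construction and URL sanitization described above can be sketched as follows. This is a minimal illustration and not the toolkit's actual code: the helper names build_eutils_url and sanitize_url, the SENSITIVE_PARAMS set, and the tool value "example-tool" are assumptions made for the example. Only the base endpoint, the endpoint file names, and the parameter names tool, email, and api_key come from the text.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Base endpoint as stated in the Background section.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Query parameters that should never appear in a URL returned to the caller.
SENSITIVE_PARAMS = {"api_key", "email"}


def build_eutils_url(endpoint, params, email=None, api_key=None, tool="example-tool"):
    """Build a GET URL for an E-utilities endpoint such as 'efetch.fcgi'.

    The tool identifier is always sent; email and api_key are added only
    when they are configured (hypothetical helper, for illustration).
    """
    query = dict(params)
    query["tool"] = tool
    if email:
        query["email"] = email
    if api_key:
        query["api_key"] = api_key
    return f"{EUTILS_BASE}/{endpoint}?{urlencode(query)}"


def sanitize_url(url):
    """Return the URL with the API key and email parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


request_url = build_eutils_url(
    "efetch.fcgi",
    {"db": "protein", "id": "NP_000537.3", "rettype": "fasta"},
    email="user@example.org",
    api_key="not-a-real-key",
)
print(sanitize_url(request_url))
```

Sanitizing after the request is built, rather than building two separate URLs, keeps the reported URL in step with what was actually sent while still leaving credentials out of logs and outputs.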

Tools

NCBI Entrez ESummary (ncbi-esummary)

Retrieves the document summary for a UID or accession from a chosen Entrez database and returns the summary as a database-specific mapping alongside the sanitized request URL.

API Reference

Source
db
enum
required
NCBI database to query (e.g. ‘protein’, ‘nuccore’, ‘gene’, ‘pubmed’, ‘taxonomy’, ‘structure’). See NCBIDatabase for the full set of supported databases. Available options: protein, nuccore, nucleotide, gene, pubmed, pmc, taxonomy, structure, snp, clinvar, omim, biosample, bioproject, sra, assembly, ipg, mesh, genome, dbvar, gds, geoprofiles, medgen, proteinclusters, protfam, pccompound, pcsubstance, pcassay
identifier
string
required
Accession or NCBI ID to summarize (e.g. ‘NP_000537.3’, ‘7157’).
Source
ncbi_api_key
string
Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.
ncbi_email
string
Optional contact email. NCBI usage policy requires both tool and email for traceability. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
summary
Dict[string, any]
Record summary data returned by esummary.
source_url
string
required
Sanitized URL used for the request.

Applications

Use this to screen candidates cheaply before fetching full records: take the UIDs from ncbi-esearch, inspect titles, lengths, or organism fields in the summary, and select the canonical record before paying the cost of a sequence download with ncbi-efetch. A gene-database summary also bridges organism-level data to protein-centric records in the UniProt tool by resolving the canonical gene symbol.

Usage Tips

  • The summary shape depends on the database. A gene summary nests fields under the UID key while protein and nuccore summaries expose record fields directly. Read the structure for the database queried rather than assuming a fixed schema.
  • Multiple identifiers can be summarized in one call. Passing a comma-joined list of UIDs returns one entry per UID, which is the efficient way to screen a full ncbi-esearch hit set.
  • A missing record raises rather than returning empty. An unresolved database-and-identifier pair raises an error, so guard identifiers that may be obsolete or suppressed.
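The first two tips can be sketched as a pair of small helpers. This is a hedged illustration only: join_uids and record_for_uid are hypothetical names that are not part of the toolkit, and the example summaries below contain placeholder fields rather than real NCBI output. The sketch assumes the summary is a plain dictionary that is either nested under the UID key or flat, as the tips describe.

```python
def join_uids(uids):
    """Join UIDs into the comma-separated form used to summarize several records."""
    return ",".join(str(uid).strip() for uid in uids)


def record_for_uid(summary, uid):
    """Return the record fields for a UID from either summary shape.

    If the summary nests a mapping under the UID key, that mapping is
    returned; otherwise the summary is assumed to expose the record
    fields directly and is returned unchanged.
    """
    nested = summary.get(str(uid))
    if isinstance(nested, dict):
        return nested
    return summary


# Placeholder data standing in for the two shapes described above.
nested_summary = {"7157": {"name": "PLACEHOLDER_SYMBOL"}}
flat_summary = {"title": "placeholder title", "slen": 100}

print(join_uids([7157, " 672 "]))
print(record_for_uid(nested_summary, 7157))
print(record_for_uid(flat_summary, "12345"))
```

Normalizing the shape in one place keeps downstream screening code free of per-database branching, although the field names inside each record still differ by database.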

NCBI Entrez EFetch (ncbi-efetch)

Fetches full sequence records by UID or accession from the protein, nuccore, or nucleotide databases, returning parsed FASTA records and the sanitized request URL, with optional subsequence and strand selection.

API Reference

Source
db
enum
required
Sequence database to query. Available options: protein, nuccore, nucleotide
identifier
string
required
Accession or NCBI ID to fetch (e.g. ‘NP_000537.3’).
return_format
enum
default:"fasta"
NCBI rettype. ‘fasta_cds_na’ is nuccore-only. Available options: fasta, fasta_cds_na
seq_start
integer
Subsequence start (1-indexed, inclusive).
seq_stop
integer
Subsequence stop (1-indexed, inclusive).
strand
string
Strand for nucleotide retrieval.
Source
ncbi_api_key
string
Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.
ncbi_email
string
Optional contact email. NCBI usage policy requires both tool and email for traceability. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
fasta_records
List[NCBIFastaRecord]
Parsed FASTA records from efetch.
source_url
string
required
Sanitized URL used for the request.
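
To make the fasta_records output concrete, the following sketch shows one way a FASTA response body can be split into records. It is an assumption-laden illustration: the FastaRecord dataclass and its header and sequence fields are hypothetical stand-ins, since the fields of NCBIFastaRecord are not documented here, and parse_fasta is not the toolkit's parser.

```python
from dataclasses import dataclass


@dataclass
class FastaRecord:
    """Hypothetical stand-in for a parsed record (not NCBIFastaRecord)."""
    header: str    # text after '>' on the definition line
    sequence: str  # sequence lines joined without line breaks


def parse_fasta(text):
    """Split FASTA text into records, joining wrapped sequence lines."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append(FastaRecord(header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(FastaRecord(header, "".join(chunks)))
    return records


example = ">seq1 example description\nACGT\nTT\n\n>seq2\nGG\n"
for record in parse_fasta(example):
    print(record.header, len(record.sequence))
```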

Applications

Use this to pull reference sequences into a design or analysis pipeline: retrieve a wild-type protein before sequence design, fetch a coding DNA sequence with return_format="fasta_cds_na" for codon-usage analysis, or extract a defined genomic region for regulatory-element work. It is the final stage of the canonical Entrez chain, consuming UIDs produced by ncbi-esearch and screened with ncbi-esummary, and complements the UniProt and sequence-fetch tools for cross-source retrieval.

Usage Tips

  • Restricted to sequence databases. Only protein, nuccore, and nucleotide return sequence bodies. Metadata databases require ncbi-esummary instead.
  • fasta_cds_na requires a nucleotide database. This return format extracts coding DNA and is rejected for db="protein", since CDS extraction has no meaning on a protein record.
  • Subsequence coordinates are 1-indexed and inclusive on both ends, to match biological residue selection conventions. Position 1 is the first residue, and seq_stop is included in the returned span.
  • Strand "-" returns the reverse complement. Antisense retrieval applies to nucleotide databases and returns the reverse complement of the requested region.
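
The coordinate and strand conventions in the tips above can be checked locally with a short sketch. This only illustrates the semantics of 1-indexed inclusive coordinates and reverse complementation; the subsequence helper is hypothetical, and the tool itself passes seq_start, seq_stop, and strand to NCBI rather than slicing locally.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def subsequence(seq, seq_start, seq_stop, strand="+"):
    """Return seq[seq_start..seq_stop] with 1-indexed, inclusive bounds.

    With strand '-', the reverse complement of that span is returned.
    """
    if seq_start < 1 or seq_stop < seq_start or seq_stop > len(seq):
        raise ValueError("coordinates must satisfy 1 <= seq_start <= seq_stop <= len(seq)")
    # Convert the 1-indexed inclusive start to Python's 0-indexed slice;
    # the inclusive stop equals the exclusive slice end.
    span = seq[seq_start - 1:seq_stop]
    if strand == "-":
        span = span.translate(_COMPLEMENT)[::-1]
    return span


print(subsequence("ATGCCGTA", 2, 5))       # TGCC
print(subsequence("ATGCCGTA", 2, 5, "-"))  # GGCA
```

Note that a span from seq_start to seq_stop has seq_stop - seq_start + 1 residues, which is a common source of off-by-one errors when converting from 0-indexed, half-open coordinates.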

Toolkit Notes

These apply to every NCBI tool in this toolkit (ncbi-esearch, ncbi-esummary, ncbi-efetch).
  • Requires network access. The tools call the live NCBI E-utilities online.
  • An NCBI API key raises the rate limit. Without an API key, NCBI E-utilities permits 3 requests per second per IP. Setting an API key raises this to 10 requests per second. A key is available at no cost from the Settings page of a free NCBI account (https://www.ncbi.nlm.nih.gov/account/). NCBI also asks that a contact email be set, which it uses for abuse handling and IP-block recovery. Provide credentials either via the ncbi_api_key / ncbi_email config attributes or via the NCBI_API_KEY / NCBI_EMAIL environment variables; an explicit config value overrides the env var.
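
The credential precedence and rate limits described above can be sketched as follows. This is a minimal illustration under stated assumptions: resolve_credential and min_request_interval are hypothetical helpers, not toolkit functions, and the pacing shown is one simple client-side way to stay under the documented limits of 3 and 10 requests per second.

```python
import os


def resolve_credential(explicit, env_var, environ=None):
    """Return the explicit config value if set, else the environment variable.

    Mirrors the documented precedence: an explicit value overrides the env var.
    The environ argument exists only to make the helper easy to test.
    """
    environ = os.environ if environ is None else environ
    if explicit:
        return explicit
    return environ.get(env_var) or None


def min_request_interval(api_key):
    """Minimum seconds between requests: 10 per second with a key, 3 without."""
    requests_per_second = 10 if api_key else 3
    return 1.0 / requests_per_second


api_key = resolve_credential(None, "NCBI_API_KEY")
email = resolve_credential(None, "NCBI_EMAIL")
print(f"wait at least {min_request_interval(api_key):.3f} s between requests")
```

A caller issuing many requests in a loop could sleep for this interval between calls with time.sleep, which keeps bulk screening within the limit without needing to react to rejected requests.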
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.