Unified Sequence Fetch

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


GenBank
Eric W Sayers, Mark Cavanaugh, … Ilene Karsch-Mizrachi
Nucleic Acids Research (2022)
@article{sayers2022genbank,
  title={GenBank},
  author={Sayers, Eric W and Cavanaugh, Mark and Clark, Karen and Pruitt, Kim D and Schoch, Conrad L and Sherry, Stephen T and Karsch-Mizrachi, Ilene},
  journal={Nucleic Acids Research},
  volume={50},
  number={D1},
  pages={D161--D164},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkab1135}
}

@article{theuniprotconsortium2025,
  title={UniProt: the Universal Protein Knowledgebase in 2025},
  author={The UniProt Consortium},
  journal={Nucleic Acids Research},
  volume={53},
  number={D1},
  pages={D609--D617},
  year={2025},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkae1010}
}

@article{burley2022rcsb,
  title={RCSB Protein Data Bank: Celebrating 50 years of the PDB with new tools for understanding and visualizing biological macromolecules in 3D},
  author={Burley, Stephen K and Bhikadiya, Charmi and Bi, Chunxiao and Bittrich, Sebastian and Duarte, Jose M and Dutta, Sayan and Feng, Zukang and Goodsell, David S and others},
  journal={Protein Science},
  volume={31},
  number={1},
  pages={187--208},
  year={2022},
  publisher={Wiley},
  doi={10.1002/pro.4213}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/sequence_fetch
| Function | Description |
| --- | --- |
| run_sequence_fetch() | Fetch DNA, RNA, protein, and structure records from NCBI, UniProt, and PDB |
License: Unified Sequence Fetch’s own code is licensed under Apache-2.0. It federates over bundled data sources and components, each under its own license terms. Review each source’s terms before commercial use or redistribution.

Background

This tool wraps three public databases rather than a single predictive model, so it has no single primary paper. The underlying sources are GenBank (Sayers et al., 2022), UniProt (The UniProt Consortium, 2025), and the RCSB Protein Data Bank (Burley et al., 2022). Internally, SequenceFetchInput wraps a list[SequenceFetchRequest], and each request is resolved independently by molecule type. Routing is deterministic and priority-based: provided identifiers are consulted before a name-and-organism search.

For a protein request, the routing priority is a supplied UniProt accession (resolved at rest.uniprot.org), then an NCBI protein accession, then a linked PDB entry’s FASTA chains, and finally a name-and-organism search. Nucleotide requests use the NCBI E-utilities at https://eutils.ncbi.nlm.nih.gov/entrez/eutils (esearch, esummary, efetch), with a gene-locus coordinate fallback that fetches the genomic interval directly from the chromosome accession. Structure requests resolve a PDB identifier and read entry metadata from https://data.rcsb.org, with chain sequences pulled from https://www.rcsb.org/fasta/entry.

Every returned record carries a source URL and a SHA256 checksum for provenance, and the NCBI API key and contact email are sanitized out of provenance URLs. Genomic coordinates are interpreted as 1-indexed, inclusive intervals to match biological residue selection conventions. Results reflect the live databases at query time rather than a fixed release snapshot.
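The provenance fields described above can be illustrated with a minimal sketch. It assumes the checksum is taken over the UTF-8 sequence text and that credentials travel in the `api_key` and `email` query parameters (the names NCBI E-utilities accepts). The helper names `sanitize_url`, `sha256_checksum`, and `provenance_record`, the output keys, and the `EXAMPLE_ID` placeholder are hypothetical and are not taken from the tool's source.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as credentials in this sketch (assumption).
SENSITIVE_PARAMS = {"api_key", "email"}


def sanitize_url(url: str) -> str:
    """Return `url` with credential-bearing query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def sha256_checksum(sequence: str) -> str:
    """Hex SHA-256 digest of the sequence text, encoded as UTF-8."""
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()


def provenance_record(url: str, sequence: str) -> dict:
    """Bundle a sanitized source URL with a checksum (hypothetical keys)."""
    return {"source_url": sanitize_url(url), "sha256": sha256_checksum(sequence)}


raw_url = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=protein&id=EXAMPLE_ID&rettype=fasta&api_key=SECRET&email=me%40example.org"
)
record = provenance_record(raw_url, "MKTAYIAKQR")
print(record["source_url"])  # the URL without api_key and email
```

Recomputing the digest on a stored sequence and comparing it with the recorded value is one way to detect that a live database entry has changed since it was fetched.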

Tools

Multi-source Sequence Fetch (sequence-fetch)

Resolves a list of SequenceFetchRequest objects across NCBI Entrez, UniProt, and RCSB PDB, returning per-request fetched sequences, fetched structures, resolved identifiers, warnings, and errors, with run-level counts of successful, warning, and failed requests.
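The priority order that Background gives for protein requests can be sketched as a simple first-match chain. `ProteinRequest` is a hypothetical stand-in for the real request object: `uniprot_id` and `pdb_id` appear in the usage tips, while `ncbi_protein_accession`, the route labels, and the rule that a name alone is enough to start a search are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProteinRequest:
    """Hypothetical stand-in for a protein SequenceFetchRequest."""

    name: Optional[str] = None
    organism: Optional[str] = None
    uniprot_id: Optional[str] = None
    ncbi_protein_accession: Optional[str] = None  # assumed field name
    pdb_id: Optional[str] = None


def choose_protein_route(req: ProteinRequest) -> str:
    """Pick the first applicable route in the documented priority order."""
    if req.uniprot_id:
        return "uniprot_accession"
    if req.ncbi_protein_accession:
        return "ncbi_protein_accession"
    if req.pdb_id:
        return "pdb_fasta_chains"
    if req.name:
        # Assumption: the organism narrows the search when it is provided.
        return "name_organism_search"
    raise ValueError("request has neither an identifier nor a name")


# An explicit accession wins even when a PDB ID and a name are also present.
request = ProteinRequest(name="TP53", organism="Homo sapiens",
                         uniprot_id="P04637", pdb_id="1TUP")
print(choose_protein_route(request))  # uniprot_accession
```

Because the chain is deterministic, the same request always takes the same route, which is what makes identifier-based requests reproducible apart from changes in the live databases.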

API Reference

Input

  • requests (List[SequenceFetchRequest], required): One or more retrieval requests.

Configuration

  • max_candidates_per_source (integer, default: 5): Maximum database candidates to evaluate per name-based search.
  • type_check_mode (enum, default: "error"): Controls how molecule-type mismatches are handled (e.g. requesting “protein” for a name that looks like an ncRNA gene). "off" skips validation entirely; "warn" records a warning but continues; "error" (default) fails the request. Available options: off, warn, error.
  • ncbi_api_key (string): Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.
  • ncbi_email (string): Optional contact email. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cpu"): Device to run the tool on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output

  • results (List[SequenceFetchResult]): Per-request retrieval outcomes.
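The three `type_check_mode` settings can be illustrated with a small sketch. It uses only the NR_ and XR_ RefSeq prefixes named in the usage tips as the mismatch signal; the real check also considers gene names, which is not modeled here. The function name `apply_type_check`, its return labels, and the placeholder accessions are hypothetical.

```python
from typing import List

# RefSeq prefixes for non-coding transcripts mentioned in the usage tips.
NONCODING_REFSEQ_PREFIXES = ("NR_", "XR_")
VALID_MODES = ("off", "warn", "error")


def apply_type_check(requested_type: str, accession: str, mode: str,
                     messages: List[str]) -> str:
    """Return "ok", "warning", or "failed" for one request (hypothetical labels).

    Any mismatch message is appended to `messages`.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown type_check_mode: {mode!r}")
    if mode == "off":
        return "ok"  # validation is skipped entirely
    looks_noncoding = accession.upper().startswith(NONCODING_REFSEQ_PREFIXES)
    if not (requested_type == "protein" and looks_noncoding):
        return "ok"
    messages.append(
        f"{accession} looks like a non-coding transcript, but 'protein' was requested"
    )
    # "warn" records the mismatch and continues; "error" fails the request.
    return "warning" if mode == "warn" else "failed"


messages: List[str] = []
print(apply_type_check("protein", "NR_000001.1", "warn", messages))   # warning
print(apply_type_check("protein", "NR_000001.1", "error", messages))  # failed
```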

Applications

Use this to pull mixed protein, nucleotide, and structure data for many targets in a single call. Resolve a batch of gene symbols plus organisms to reference protein sequences for analysis or design, retrieve coding DNA and transcript RNA alongside the protein for a codon-optimization workflow, or fetch a protein and its linked PDB structures together to seed structure-aware downstream steps. Partial batches are normal. Each request reports its own status, sequences, structures, and errors independently, so one unresolved target does not fail the job.
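Since partial batches are expected, downstream code usually separates usable results from failed ones instead of treating any failure as fatal. The sketch below works on plain dictionaries as a stand-in for SequenceFetchResult. The keys `status`, `query`, and `errors`, and the status values "success", "warning", and "failed", are assumptions chosen to mirror the run-level counts described above, not the tool's actual schema.

```python
from collections import Counter
from typing import Dict, List

STATUSES = ("success", "warning", "failed")  # assumed status labels


def summarize_batch(results: List[dict]) -> Dict[str, int]:
    """Count results per status, reporting zero for statuses that are absent."""
    counts = Counter(result["status"] for result in results)
    return {status: counts.get(status, 0) for status in STATUSES}


def usable_results(results: List[dict]) -> List[dict]:
    """Keep every result that did not fail outright."""
    return [result for result in results if result["status"] != "failed"]


# Hypothetical batch: one clean hit, one hit with a warning, one miss.
batch = [
    {"query": "TP53", "status": "success", "errors": []},
    {"query": "BRCA1", "status": "warning", "errors": []},
    {"query": "NOT_A_GENE", "status": "failed", "errors": ["not found"]},
]

print(summarize_batch(batch))  # {'success': 1, 'warning': 1, 'failed': 1}
print([result["query"] for result in usable_results(batch)])  # ['TP53', 'BRCA1']
```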

Usage Tips

  • Provide identifiers whenever possible. Accession overrides such as uniprot_id, genbank_accession, or pdb_id route directly and are far more reliable than a name-and-organism search, which selects a single top-ranked candidate that may be ambiguous.
  • The strand in genomic_coordinates changes the returned sequence. Coordinates are interpreted as 1-indexed, inclusive intervals to match biological residue selection conventions, and an explicit + or - strand controls whether the forward or reverse-complement sequence is returned. Omitting the strand can yield the wrong-orientation sequence for genes on the minus strand.
  • A protein hit does not guarantee a structure exists. Structure retrieval requires a linked PDB entry. A UniProt or NCBI protein match with no PDB cross-reference produces a not-found error for the structure type while the protein sequence still returns.
  • type_check_mode defaults to "error", the right setting for production. It rejects obvious molecule-type mismatches early, such as requesting protein for a name that looks like a non-coding RNA (ncRNA) gene or for an NR_ or XR_ RefSeq transcript. "warn" records the mismatch and continues, and "off" skips the check entirely.
  • rna_premrna is inferred, not curated. It is transcribed from the genomic DNA sequence and includes introns where present, so it is annotation-dependent and not a directly curated transcript.
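The coordinate and strand conventions in the tips above can be checked on a toy sequence. This sketch slices a local string, whereas the tool fetches the interval from the chromosome accession over the network. The function names are hypothetical, and the T-to-U step is only a simplified picture of how a pre-mRNA can be inferred from genomic DNA.

```python
# Complement table covering upper- and lower-case bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the sequence."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract_interval(chromosome: str, start: int, end: int, strand: str = "+") -> str:
    """Return the 1-indexed, inclusive interval [start, end] on the given strand."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= start <= end <= len(chromosome):
        raise ValueError("interval is outside the sequence")
    # 1-indexed inclusive [start, end] maps to the Python slice [start - 1:end].
    forward = chromosome[start - 1:end]
    return forward if strand == "+" else reverse_complement(forward)


def transcribe(dna: str) -> str:
    """Rewrite DNA as RNA by replacing T with U; introns are left in place."""
    return dna.replace("T", "U").replace("t", "u")


chromosome = "AACCGGTTAC"
print(extract_interval(chromosome, 5, 8, "+"))              # GGTT
print(extract_interval(chromosome, 5, 8, "-"))              # AACC
print(transcribe(extract_interval(chromosome, 5, 8, "+")))  # GGUU
```

The same interval yields different sequences on the two strands, which is why a gene on the minus strand needs an explicit "-" to come back in its coding orientation.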

Toolkit Notes

These apply to every Sequence Fetch tool in this toolkit (sequence-fetch).
  • Requires network access. The tool federates the live NCBI, UniProt, and RCSB PDB endpoints. It does not run offline and keeps no local copy of any database.
  • An NCBI API key raises the rate limit for NCBI-backed requests. Without an API key, requests routed to NCBI E-utilities are limited to 3 requests per second per IP; with a key, the limit rises to 10 requests per second. A key is obtained at no cost from the Settings page of a free NCBI account (https://www.ncbi.nlm.nih.gov/account/). NCBI also asks that a contact email be set, which it uses for abuse handling and IP-block recovery. Provide credentials either via the ncbi_api_key / ncbi_email config attributes or via the NCBI_API_KEY / NCBI_EMAIL environment variables. An explicit config value overrides the environment variable. The UniProt and RCSB PDB backends are keyless and have no equivalent mechanism.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
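The credential precedence and rate limits described in these notes can be sketched as follows. The function names are hypothetical and the tool's own throttling logic may differ. Only the NCBI_API_KEY and NCBI_EMAIL variable names and the 3 and 10 requests-per-second limits come from the notes above.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_ncbi_credentials(
    api_key: Optional[str] = None,
    email: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Prefer explicit values; fall back to NCBI_API_KEY / NCBI_EMAIL."""
    env = os.environ if env is None else env
    key = api_key if api_key is not None else env.get("NCBI_API_KEY")
    mail = email if email is not None else env.get("NCBI_EMAIL")
    # Treat empty strings as "not set".
    return key or None, mail or None


def min_request_interval(api_key: Optional[str]) -> float:
    """Seconds to wait between NCBI requests: 1/10 with a key, 1/3 without."""
    requests_per_second = 10 if api_key else 3
    return 1.0 / requests_per_second


# A fake environment keeps the example independent of the real shell.
fake_env = {"NCBI_API_KEY": "key-from-env", "NCBI_EMAIL": "lab@example.org"}
key, email = resolve_ncbi_credentials(api_key="explicit-key", env=fake_env)
print(key, email)                 # explicit-key lab@example.org
print(min_request_interval(key))  # 0.1
```

Passing the environment as a parameter is a design choice made for this sketch: it lets the precedence rule be exercised without modifying `os.environ`.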

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.