Unified Sequence Fetch

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


GenBank

Eric W Sayers, Mark Cavanaugh, … Ilene Karsch-Mizrachi

Nucleic Acids Research (2022)


@article{sayers2022genbank,
  title={GenBank},
  author={Sayers, Eric W and Cavanaugh, Mark and Clark, Karen and Pruitt, Kim D and Schoch, Conrad L and Sherry, Stephen T and Karsch-Mizrachi, Ilene},
  journal={Nucleic Acids Research},
  volume={50},
  number={D1},
  pages={D161--D164},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkab1135}
}

@article{theuniprotconsortium2025,
  title={UniProt: the Universal Protein Knowledgebase in 2025},
  author={The UniProt Consortium},
  journal={Nucleic Acids Research},
  volume={53},
  number={D1},
  pages={D609--D617},
  year={2025},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkae1010}
}

@article{burley2022rcsb,
  title={RCSB Protein Data Bank: Celebrating 50 years of the PDB with new tools for understanding and visualizing biological macromolecules in 3D},
  author={Burley, Stephen K and Bhikadiya, Charmi and Bi, Chunxiao and Bittrich, Sebastian and Duarte, Jose M and Dutta, Sayan and Feng, Zukang and Goodsell, David S and others},
  journal={Protein Science},
  volume={31},
  number={1},
  pages={187--208},
  year={2022},
  publisher={Wiley},
  doi={10.1002/pro.4213}
}


proto-bio/proto-tools/proto_tools/tools/database_retrieval/sequence_fetch


Function	Description
`run_sequence_fetch()`	Fetch DNA, RNA, protein, and structure records from NCBI, UniProt, and PDB

License: Unified Sequence Fetch’s own code is licensed under Apache-2.0. It federates over bundled data sources and components, each under its own license terms:

NCBI Entrez: U.S. Government public domain
UniProt: CC-BY-4.0
RCSB PDB: CC0-1.0

Review each source’s terms before commercial use or redistribution.

Background

This tool wraps three public databases rather than a single predictive model, so it has no single primary paper. The underlying sources are GenBank (Sayers et al., 2022), UniProt (The UniProt Consortium, 2025), and the RCSB Protein Data Bank (Burley et al., 2022).

Internally, SequenceFetchInput wraps a list[SequenceFetchRequest]. Each request is resolved independently by molecule type, using deterministic, priority-based routing in which provided identifiers are consulted before a name-and-organism search:

Protein requests try, in order, a supplied UniProt accession (resolved at rest.uniprot.org), an NCBI protein accession, a linked PDB entry’s FASTA chains, and finally a name-and-organism search.
Nucleotide requests use the NCBI E-utilities at https://eutils.ncbi.nlm.nih.gov/entrez/eutils (esearch, esummary, efetch), with a gene-locus coordinate fallback that fetches the genomic interval directly from the chromosome accession.
Structure requests resolve a PDB identifier and read entry metadata from https://data.rcsb.org, with chain sequences pulled from https://www.rcsb.org/fasta/entry.

Every returned record carries a source URL and a SHA256 checksum for provenance, and the NCBI API key and contact email are sanitized out of provenance URLs. Genomic coordinates are interpreted as 1-indexed, inclusive intervals to match biological residue selection conventions. Results reflect the live databases at query time rather than a fixed release snapshot.
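The protein routing priority described above can be sketched as a simple chain of checks. This is a minimal illustration, not the toolkit's implementation: ProteinRequest and protein_route are hypothetical names, and the exact URL paths appended to rest.uniprot.org and https://www.rcsb.org/fasta/entry are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProteinRequest:
    """Hypothetical stand-in carrying only the fields relevant to routing."""
    target_name: str
    organism: str
    uniprot_id: Optional[str] = None
    protein_id: Optional[str] = None  # NCBI protein accession
    pdb_id: Optional[str] = None


def protein_route(req: ProteinRequest) -> Tuple[str, str]:
    """Return (route, locator), consulting identifiers before any name search."""
    if req.uniprot_id:
        # Assumed path layout on the UniProt REST host.
        return "uniprot", f"https://rest.uniprot.org/uniprotkb/{req.uniprot_id}.fasta"
    if req.protein_id:
        return "ncbi_protein", req.protein_id
    if req.pdb_id:
        # Assumed path layout under the documented FASTA base URL.
        return "pdb_fasta", f"https://www.rcsb.org/fasta/entry/{req.pdb_id}"
    # Last resort: name plus organism, which may be ambiguous.
    return "name_search", f"{req.target_name} [{req.organism}]"


print(protein_route(ProteinRequest("HBA1", "Homo sapiens", pdb_id="4HHB")))
```

Because the checks run in a fixed order, the same request always takes the same route, which is what makes the routing deterministic.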

Learning Resources

Entrez Programming Utilities help (NCBI) - official documentation for the E-utilities API, including esearch, esummary, and efetch.
UniProt help and documentation (UniProt) - official documentation covering accessions, query syntax, and the REST API.
RCSB PDB Data API (RCSB PDB) - official documentation for the PDB entry data and FASTA endpoints.

Tools

Multi-source Sequence Fetch (`sequence-fetch`)

Resolves a list of SequenceFetchRequest objects across NCBI Entrez, UniProt, and RCSB PDB, returning per-request fetched sequences, fetched structures, resolved identifiers, warnings, and errors, with run-level counts of successful, warning, and failed requests.

API Reference

Input: SequenceFetchInput

Field	Type	Required	Description
requests	List[SequenceFetchRequest]	yes	One or more retrieval requests.

SequenceFetchRequest fields:

Field	Type	Required	Description
request_id	string	no	Optional caller-provided request identifier.
target_name	string	yes	Gene, protein, or RNA name to resolve.
organism	string	yes	Organism name used for disambiguation.
sequence_types	List[string]	yes	Requested outputs: protein, dna, rna, or structure.
uniprot_id	string	no	UniProt accession override.
genbank_accession	string	no	GenBank accession override.
refseq_accession	string	no	RefSeq accession override.
pdb_id	string	no	PDB accession override.
gene_id	string	no	NCBI Gene ID override.
protein_id	string	no	NCBI protein accession override.
transcript_id	string	no	Transcript accession override.
genomic_coordinates	string	no	Genomic interval like NC_000913.3:1-100:+.
additional_ids	Dict[string, string]	no	Extra IDs used for custom routing.
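To make the request shape concrete, the sketch below mirrors a subset of the documented fields in a plain dataclass. RequestSketch is a hypothetical stand-in, not the toolkit's SequenceFetchRequest class, and the validation rule is an assumption based only on the allowed values listed for sequence_types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ALLOWED_TYPES = {"protein", "dna", "rna", "structure"}


@dataclass
class RequestSketch:
    """Hypothetical mirror of a subset of the documented request fields."""
    target_name: str
    organism: str
    sequence_types: List[str]
    request_id: Optional[str] = None
    uniprot_id: Optional[str] = None
    pdb_id: Optional[str] = None
    genomic_coordinates: Optional[str] = None
    additional_ids: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed check: at least one type, all drawn from the documented set.
        unknown = set(self.sequence_types) - ALLOWED_TYPES
        if not self.sequence_types or unknown:
            raise ValueError(
                f"sequence_types must be drawn from {sorted(ALLOWED_TYPES)}; "
                f"got {self.sequence_types}"
            )


# One request routed by accession, one that falls back to a name search.
batch = [
    RequestSketch("HBA1", "Homo sapiens", ["protein", "structure"], uniprot_id="P69905"),
    RequestSketch("lacZ", "Escherichia coli", ["dna"]),
]
print(len(batch), "requests prepared")
```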

Config: SequenceFetchConfig

Field	Type	Default	Description
max_candidates_per_source	integer	5	Maximum database candidates to evaluate per name-based search.
type_check_mode	enum (off, warn, error)	error	Controls how molecule-type mismatches are handled (e.g. requesting “protein” for a name that looks like an ncRNA gene). "off" skips validation entirely; "warn" records a warning but continues; "error" (default) fails the request.
ncbi_api_key	string	-	Optional NCBI API key (lifts rate limit from 3 to 10 requests/second). Defaults to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.
ncbi_email	string	-	Optional contact email. Defaults to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.
verbose	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device	string	cpu	Device to run the tool on.
timeout	integer	600	Maximum execution time in seconds. None waits indefinitely.
seed	integer	-	Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
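The three type_check_mode settings can be illustrated with a small decision function. This is a sketch under stated assumptions: check_protein_request is a hypothetical helper, the only mismatch it detects is the NR_/XR_ RefSeq prefix case, and mapping "warn" to a warning status is a guess at how the result status is set.

```python
from typing import List, Tuple

# RefSeq prefixes used for non-coding transcripts.
NONCODING_REFSEQ_PREFIXES = ("NR_", "XR_")


def looks_noncoding(transcript_accession: str) -> bool:
    return transcript_accession.upper().startswith(NONCODING_REFSEQ_PREFIXES)


def check_protein_request(
    transcript_accession: str, mode: str = "error"
) -> Tuple[str, List[str], List[str]]:
    """Return (status, warnings, errors) for a protein request on a transcript."""
    if mode not in {"off", "warn", "error"}:
        raise ValueError(f"unknown type_check_mode: {mode!r}")
    if mode == "off" or not looks_noncoding(transcript_accession):
        return "success", [], []
    message = (
        f"{transcript_accession} looks like a non-coding transcript, "
        "but protein was requested"
    )
    if mode == "warn":
        # Record the mismatch and let retrieval continue.
        return "warning", [message], []
    # "error": fail the request early.
    return "failed", [], [message]


for mode in ("off", "warn", "error"):
    print(mode, check_protein_request("NR_046018.2", mode)[0])
```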

Output: SequenceFetchOutput

Field	Type	Required	Description
results	List[SequenceFetchResult]	no	Per-request retrieval outcomes.

SequenceFetchResult fields:

Field	Type	Required	Description
request_id	string	yes	Request identifier used in this result.
target_name	string	yes	Original target name.
organism	string	yes	Original organism name.
requested_types	List[string]	yes	Requested output molecule types.
status	enum	yes	One of success, warning, or failed.
fetched_sequences	List[FetchedSequence]	no	Retrieved sequence records.
fetched_structures	List[FetchedStructure]	no	Retrieved structure records.
resolved_ids	Dict[string, string]	no	IDs resolved or used during retrieval.
warnings	List[string]	no	Non-fatal warnings.
errors	List[string]	no	Fatal or partial failure messages.

Applications

Use this to pull mixed protein, nucleotide, and structure data for many targets in a single call. Resolve a batch of gene symbols plus organisms to reference protein sequences for analysis or design, retrieve coding DNA and transcript RNA alongside the protein for a codon-optimization workflow, or fetch a protein and its linked PDB structures together to seed structure-aware downstream steps. Partial batches are normal. Each request reports its own status, sequences, structures, and errors independently, so one unresolved target does not fail the job.
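Since each request reports its own status, a caller will typically tally outcomes and collect the failed request IDs for follow-up. The sketch below does this over plain dictionaries that reuse the documented result field names; the payload values are made up, and summarize is a hypothetical helper rather than part of the toolkit.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Made-up payloads that reuse the documented SequenceFetchResult field names.
results = [
    {"request_id": "r1", "status": "success", "warnings": [], "errors": []},
    {"request_id": "r2", "status": "warning",
     "warnings": ["name search matched several candidates"], "errors": []},
    {"request_id": "r3", "status": "failed",
     "warnings": [], "errors": ["no candidate matched"]},
]


def summarize(results: List[dict]) -> Tuple[Dict[str, int], List[str]]:
    """Count results per status and list the request IDs that failed."""
    counts = Counter(r["status"] for r in results)
    tally = {s: counts.get(s, 0) for s in ("success", "warning", "failed")}
    failed_ids = [r["request_id"] for r in results if r["status"] == "failed"]
    return tally, failed_ids


tally, failed_ids = summarize(results)
print(tally, failed_ids)
```

The failed IDs can then be resubmitted with explicit accessions, without rerunning the requests that already succeeded.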

Usage Tips

Provide identifiers whenever possible. Accession overrides such as uniprot_id, genbank_accession, or pdb_id route directly and are far more reliable than a name-and-organism search, which selects a single top-ranked candidate that may be ambiguous.
The strand in genomic_coordinates changes the returned sequence. Coordinates are interpreted as 1-indexed, inclusive intervals to match biological residue selection conventions, and an explicit + or - strand controls whether the forward or reverse-complement sequence is returned. Omitting the strand can yield the wrong-orientation sequence for genes on the minus strand.
A protein hit does not guarantee a structure exists. Structure retrieval requires a linked PDB entry. A UniProt or NCBI protein match with no PDB cross-reference produces a not-found error for the structure type while the protein sequence still returns.
type_check_mode defaults to "error", the right setting for production. It rejects obvious molecule-type mismatches early, such as requesting protein for a name that looks like a non-coding RNA (ncRNA) gene or for an NR_ or XR_ RefSeq transcript. "warn" records the mismatch and continues, and "off" skips the check entirely.
rna_premrna is inferred, not curated. It is transcribed from the genomic DNA sequence and includes introns where present, so it is annotation-dependent and not a directly curated transcript.
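The coordinate convention in the tips above can be checked on a toy sequence. The following sketch parses an interval string of the documented form and applies 1-indexed, inclusive slicing with an optional reverse complement. parse_interval and extract are hypothetical helpers, and defaulting a missing strand to + is an assumption rather than documented behavior.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    accession: str
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive
    strand: str


_PATTERN = re.compile(
    r"^(?P<acc>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?::(?P<strand>[+-]))?$"
)
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_interval(text: str, default_strand: str = "+") -> Interval:
    """Parse 'ACCESSION:START-END[:STRAND]' into an Interval."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized interval: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-indexed inclusive range: {start}-{end}")
    return Interval(match["acc"], start, end, match["strand"] or default_strand)


def extract(sequence: str, interval: Interval) -> str:
    """Slice a 1-indexed inclusive interval; reverse-complement on the - strand."""
    segment = sequence[interval.start - 1 : interval.end]
    if interval.strand == "-":
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment


iv = parse_interval("NC_000913.3:1-100:+")
print(iv, iv.end - iv.start + 1)  # the interval spans 100 bases
print(extract("AACCGGTT", Interval("toy", 2, 4, "-")))
```

An inclusive interval start-end covers end - start + 1 bases, so 1-100 yields 100 bases rather than 99.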

Toolkit Notes

These apply to every Sequence Fetch tool in this toolkit (sequence-fetch).

Requires network access. The tool federates the live NCBI, UniProt, and RCSB PDB endpoints. It does not run offline and keeps no local copy of any database.
An NCBI API key raises the rate limit for NCBI-backed requests. Requests routed to NCBI E-utilities are limited to 3 requests per second per IP without credentials. Setting credentials raises this to 10 requests per second. A key is obtained at no cost from the Settings page of a free NCBI account (https://www.ncbi.nlm.nih.gov/account/). NCBI also asks that a contact email be set; it uses the email for abuse handling and IP-block recovery. Provide credentials either via the ncbi_api_key / ncbi_email config attributes or via the NCBI_API_KEY / NCBI_EMAIL environment variables. An explicit config value overrides the env var. The UniProt and RCSB PDB backends are keyless and have no equivalent mechanism.
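The credential precedence and rate limits above, together with the provenance sanitization noted in Background, might look roughly like this. All helper names are hypothetical, the checksum input (the raw record text) is an assumption, and no network request is made; db, id, rettype, retmode, api_key, and email are standard E-utilities query parameters.

```python
import hashlib
import os
from typing import Mapping, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def resolve(explicit: Optional[str], env_name: str,
            env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """An explicit config value wins; otherwise fall back to the env var."""
    env = os.environ if env is None else env
    return explicit if explicit is not None else env.get(env_name)


def min_interval_seconds(api_key: Optional[str]) -> float:
    """Spacing between NCBI calls: 10 requests/second with a key, 3 without."""
    return 1.0 / (10 if api_key else 3)


def efetch_url(db: str, accession: str, api_key: Optional[str] = None,
               email: Optional[str] = None) -> str:
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def sanitize(url: str) -> str:
    """Drop credential parameters before storing a URL as provenance."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in {"api_key", "email"}]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def checksum(record_text: str) -> str:
    return hashlib.sha256(record_text.encode("utf-8")).hexdigest()


key = resolve(None, "NCBI_API_KEY", env={"NCBI_API_KEY": "from-env"})
url = efetch_url("protein", "NP_000549.1", api_key=key, email="me@example.org")
print(sanitize(url), min_interval_seconds(key))
```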

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.