
This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
- NCBI Entrez: U.S. Government public domain
- UniProt: CC-BY-4.0
- RCSB PDB: CC0-1.0
Background
This tool wraps three public databases rather than a single predictive model, so it has no single primary paper. The underlying sources are GenBank (Sayers et al., 2022), UniProt (The UniProt Consortium, 2025), and the RCSB Protein Data Bank (Berman et al., 2000).
Internally, SequenceFetchInput wraps a list[SequenceFetchRequest], and each request is resolved independently by molecule type with deterministic, priority-based routing in which provided identifiers are consulted before a name-and-organism search. For a protein request the routing priority is a supplied UniProt accession (resolved at rest.uniprot.org), then an NCBI protein accession, then a linked PDB entry’s FASTA chains, and finally a name-and-organism search. Nucleotide requests use the NCBI E-utilities at https://eutils.ncbi.nlm.nih.gov/entrez/eutils (esearch, esummary, efetch), with a gene-locus coordinate fallback that fetches the genomic interval directly from the chromosome accession. Structure requests resolve a PDB identifier and read entry metadata from https://data.rcsb.org, with chain sequences pulled from https://www.rcsb.org/fasta/entry.
Every returned record carries a source URL and a SHA-256 checksum for provenance, and the NCBI API key and contact email are sanitized out of provenance URLs. Genomic coordinates are interpreted as 1-indexed, inclusive intervals to match biological residue selection conventions. Results reflect the live databases at query time rather than a fixed release snapshot.
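The protein routing priority described above can be sketched as an ordered list of routes that are tried in turn. This is a minimal illustration, not the toolkit's implementation: ProteinRequest is a hypothetical stand-in for SequenceFetchRequest, the route labels are made up, mapping the NCBI protein accession to genbank_accession and the linked PDB entry to pdb_id is an assumption, and falling through to the next route when a fetch returns nothing is also assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ProteinRequest:
    """Hypothetical stand-in for SequenceFetchRequest (protein case only)."""
    name: Optional[str] = None
    organism: Optional[str] = None
    uniprot_id: Optional[str] = None
    genbank_accession: Optional[str] = None
    pdb_id: Optional[str] = None


def protein_routes(req: ProteinRequest) -> List[str]:
    """Return the applicable routes in priority order (labels are illustrative)."""
    routes = []
    if req.uniprot_id:
        routes.append("uniprot_accession")       # highest priority
    if req.genbank_accession:
        routes.append("ncbi_protein_accession")
    if req.pdb_id:
        routes.append("pdb_fasta_chains")
    if req.name:
        routes.append("name_organism_search")    # last resort
    return routes


Fetcher = Callable[[ProteinRequest], Optional[str]]


def resolve(req: ProteinRequest, fetchers: Dict[str, Fetcher]) -> Optional[Tuple[str, str]]:
    """Try each route in order; return (route, sequence) for the first hit.

    Assumption: a route that returns None falls through to the next one.
    """
    for route in protein_routes(req):
        sequence = fetchers[route](req)
        if sequence is not None:
            return route, sequence
    return None
```

Because the order is fixed, the same request always takes the same route, which is what makes the routing deterministic: a name is only searched when no supplied identifier resolves.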
Learning Resources
- Entrez Programming Utilities help (NCBI) - official documentation for the E-utilities API, including esearch, esummary, and efetch.
- UniProt help and documentation (UniProt) - official documentation covering accessions, query syntax, and the REST API.
- RCSB PDB Data API (RCSB PDB) - official documentation for the PDB entry data and FASTA endpoints.
Tools
Multi-source Sequence Fetch (sequence-fetch)
Resolves a list of SequenceFetchRequest objects across NCBI Entrez, UniProt, and RCSB PDB, returning per-request fetched sequences, fetched structures, resolved identifiers, warnings, and errors, with run-level counts of successful, warning, and failed requests.
API Reference
Input: SequenceFetchInput
Config: SequenceFetchConfig
- type_check_mode: "off" skips validation entirely; "warn" records a warning but continues; "error" (default) fails the request. Available options: off, warn, error.
- ncbi_api_key: falls back to the NCBI_API_KEY environment variable; an explicit value passed to the config overrides the env var.
- ncbi_email: falls back to the NCBI_EMAIL environment variable; an explicit value passed to the config overrides the env var.
- True is coerced to 1 and False to 0.
- None waits indefinitely.
- … BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Applications
Use this to pull mixed protein, nucleotide, and structure data for many targets in a single call. Resolve a batch of gene symbols plus organisms to reference protein sequences for analysis or design, retrieve coding DNA and transcript RNA alongside the protein for a codon-optimization workflow, or fetch a protein and its linked PDB structures together to seed structure-aware downstream steps. Partial batches are normal. Each request reports its own status, sequences, structures, and errors independently, so one unresolved target does not fail the job.
Usage Tips
- Provide identifiers whenever possible. Accession overrides such as uniprot_id, genbank_accession, or pdb_id route directly and are far more reliable than a name-and-organism search, which selects a single top-ranked candidate that may be ambiguous.
- The strand in genomic_coordinates changes the returned sequence. Coordinates are interpreted as 1-indexed, inclusive intervals to match biological residue selection conventions, and an explicit + or - strand controls whether the forward or reverse-complement sequence is returned. Omitting the strand can yield the wrong-orientation sequence for genes on the minus strand.
- A protein hit does not guarantee a structure exists. Structure retrieval requires a linked PDB entry. A UniProt or NCBI protein match with no PDB cross-reference produces a not-found error for the structure type while the protein sequence still returns.
- type_check_mode defaults to "error", the right setting for production. It rejects obvious molecule-type mismatches early, such as requesting protein for a name that looks like a non-coding RNA (ncRNA) gene or for an NR_ or XR_ RefSeq transcript. "warn" records the mismatch and continues, and "off" skips the check entirely.
- rna_premrna is inferred, not curated. It is transcribed from the genomic DNA sequence and includes introns where present, so it is annotation-dependent and not a directly curated transcript.
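To make the coordinate and strand tip concrete, the sketch below slices a 1-indexed, inclusive interval out of a chromosome string and reverse-complements it for the minus strand. The function names and the "+" default are assumptions for illustration only; the toolkit's own handling of an omitted strand is not shown here.

```python
# Complement table for DNA bases, preserving case; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_interval(chromosome: str, start: int, end: int, strand: str = "+") -> str:
    """Extract chromosome[start..end], 1-indexed and inclusive on both ends.

    strand "+" returns the forward sequence, "-" its reverse complement.
    """
    if start < 1 or end < start or end > len(chromosome):
        raise ValueError(f"invalid interval {start}-{end} for length {len(chromosome)}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # 1-indexed inclusive [start, end] maps to the Python slice [start - 1, end).
    forward = chromosome[start - 1:end]
    return reverse_complement(forward) if strand == "-" else forward


if __name__ == "__main__":
    chrom = "ACGTTGCA"
    print(extract_interval(chrom, 2, 4, "+"))  # forward bases 2 through 4
    print(extract_interval(chrom, 2, 4, "-"))  # same interval, minus strand
```

The interval length is always end - start + 1, and the same coordinates give different sequences on the two strands, which is why a gene on the minus strand needs an explicit strand.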
Toolkit Notes
These apply to every Sequence Fetch tool in this toolkit (sequence-fetch).
- Requires network access. The tool federates the live NCBI, UniProt, and RCSB PDB endpoints. It does not run offline and keeps no local copy of any database.
- An NCBI API key raises the rate limit for NCBI-backed requests. Requests routed to NCBI E-utilities are limited to 3 requests per second per IP without credentials. Setting credentials raises this to 10 requests per second. A key is obtained at no cost from the Settings page of a free NCBI account (https://www.ncbi.nlm.nih.gov/account/). NCBI also asks that a contact email be set; it uses the email for abuse handling and IP-block recovery. Provide credentials either via the ncbi_api_key / ncbi_email config attributes or via the NCBI_API_KEY / NCBI_EMAIL environment variables. An explicit config value overrides the env var. The UniProt and RCSB PDB backends are keyless and have no equivalent mechanism.
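The credential handling described above can be sketched as follows: an explicit config value wins over the environment variable, the presence of a key determines the request spacing, and the key and email are stripped from any URL kept for provenance. All function names here are hypothetical and the accession in the usage lines is only an illustrative input. The api_key and email query parameters and the efetch.fcgi endpoint are standard E-utilities conventions, but this is not the toolkit's own code.

```python
import os
from typing import Mapping, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
SENSITIVE_PARAMS = {"api_key", "email"}


def resolve_credential(explicit: Optional[str], env_var: str,
                       env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """An explicit config value overrides the environment variable."""
    env = os.environ if env is None else env
    return explicit if explicit is not None else env.get(env_var)


def min_request_interval(api_key: Optional[str]) -> float:
    """Seconds between NCBI requests: 10 per second with a key, 3 per second without."""
    return 1.0 / 10 if api_key else 1.0 / 3


def efetch_url(db: str, accession: str,
               api_key: Optional[str], email: Optional[str]) -> str:
    """Build an efetch FASTA URL, attaching credentials when present."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def sanitize_url(url: str) -> str:
    """Drop the API key and contact email so the URL is safe to store."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in SENSITIVE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    key = resolve_credential(None, "NCBI_API_KEY")    # falls back to the env var
    email = resolve_credential(None, "NCBI_EMAIL")
    url = efetch_url("protein", "NP_000537.3", key, email)
    print(sanitize_url(url), min_request_interval(key))
```

Passing env explicitly keeps the precedence rule easy to test without touching the real process environment.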