PDB
License: PDB retrieves data from the RCSB Protein Data Bank, distributed under CC0-1.0 (public domain; no attribution required). The client wrapper code is Apache-2.0-licensed. Please refer to the data terms for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


rcsb/py-rcsb_api
Python interface for RCSB.org API services
The Protein Data Bank
Helen M Berman, John Westbrook, … Philip E Bourne
Nucleic Acids Research (2000)
@article{berman2000pdb,
  title={The Protein Data Bank},
  author={Berman, Helen M and Westbrook, John and Feng, Zukang and Gilliland, Gary and Bhat, T N and Weissig, Helge and Shindyalov, Ilya N and Bourne, Philip E},
  journal={Nucleic Acids Research},
  volume={28},
  number={1},
  pages={235--242},
  year={2000},
  publisher={Oxford University Press},
  doi={10.1093/nar/28.1.235}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/pdb
Function | Description
run_pdb_fetch_entry() | Fetch structure metadata (title, method, resolution) from RCSB PDB
run_pdb_fetch_fasta() | Fetch chain sequences from RCSB PDB with protein/nucleotide classification

Background

The Protein Data Bank (Berman et al., 2000) is the single worldwide archive of experimentally determined macromolecular structures, served here through the RCSB PDB. The RCSB PDB is operated by the Research Collaboratory for Structural Bioinformatics (RCSB) at Rutgers University and the University of California San Diego, with funding from the National Science Foundation, the National Institutes of Health, and the Department of Energy. Entries are solved by X-ray crystallography, cryo-electron microscopy (cryo-EM), nuclear magnetic resonance (NMR) spectroscopy, and other experimental methods.

The tools call two RCSB HTTP endpoints directly. pdb-fetch-entry issues a GET request to the RCSB Data API core entry endpoint (https://data.rcsb.org/rest/v1/core/entry/{pdb_id}) and reads three fields: the structure title from struct.title, the experimental method from the first exptl record, and the resolution from rcsb_entry_info.resolution_combined. That resolution field covers both X-ray and cryo-EM entries; entries solved by NMR have no resolution value. pdb-fetch-fasta requests the FASTA endpoint (https://www.rcsb.org/fasta/entry/{pdb_id}), parses each record, extracts the author-assigned chain identifiers from the header, and classifies a sequence as protein when it contains amino-acid letters that do not also occur in nucleotide alphabets.

Both tools uppercase the supplied accession, retry transient HTTP failures with backoff, and return an empty result when the accession is not found (HTTP 404). Results reflect the live archive at query time rather than a fixed release snapshot.
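The field lookups described above can be sketched in a few lines. This is a minimal illustration, not the toolkit's implementation: the helper names entry_url and parse_entry are hypothetical, the "method" key inside each exptl record and the list shape of resolution_combined are assumptions about the payload, and the sample dictionaries are hand-written rather than captured API output.

```python
from typing import Any, Optional

ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"


def entry_url(pdb_id: str) -> str:
    """Build the core entry URL, uppercasing the accession first."""
    return ENTRY_URL.format(pdb_id=pdb_id.upper())


def parse_entry(payload: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Extract title, method, and resolution from a decoded entry payload."""
    title = (payload.get("struct") or {}).get("title")

    exptl = payload.get("exptl") or []
    # Assumption: each exptl record stores its technique under "method".
    method = exptl[0].get("method") if exptl else None

    info = payload.get("rcsb_entry_info") or {}
    combined = info.get("resolution_combined")
    # Assumption: the value may arrive as a list; use its first element.
    if isinstance(combined, list):
        resolution = combined[0] if combined else None
    else:
        resolution = combined

    return {"title": title, "method": method, "resolution": resolution}


# Hand-written payloads for illustration only.
diffraction_like = {
    "struct": {"title": "Example structure"},
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "rcsb_entry_info": {"resolution_combined": [2.5]},
}
nmr_like = {
    "struct": {"title": "Example ensemble"},
    "exptl": [{"method": "SOLUTION NMR"}],
    "rcsb_entry_info": {},
}

print(entry_url("1lbg"))
print(parse_entry(diffraction_like))
print(parse_entry(nmr_like))  # resolution is None when the field is absent
```

Keeping the parsing separate from the HTTP call makes the lookup logic easy to check offline against saved payloads.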

Tools

PDB Fetch Entry (pdb-fetch-entry)

Retrieves structure metadata for a PDB accession from the RCSB Data API core entry endpoint, returning the structure title, the experimental method, the resolution in angstroms, and the request URL.

API Reference

Inputs
  • pdb_id (string, required): PDB accession (e.g. ‘1LBG’).

Configuration
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cpu"): Device to run the tool on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs
  • title (string): Structure title.
  • method (string): Experimental method.
  • resolution (number): Resolution in angstroms.
  • source_url (string): URL used for the request.

Applications

Use this to assess whether an experimental structure is suitable as a reference before structure-based design or benchmarking: check the experimental method and resolution, then decide whether to use the entry. It pairs with UniProt, whose returned PDB cross-references can be ranked by resolution, and with PDB Fetch FASTA to pull the chain sequences once a suitable entry is selected.

Usage Tips

  • Resolution is absent for some methods. NMR and fiber-diffraction entries have no resolution value, so resolution is None; filter on it before sorting entries by quality.
  • This is metadata only. The tool returns the title, method, and resolution, not atomic coordinates or a structure file.
  • An unknown accession is not an error. A missing or obsolete accession returns an empty output rather than raising, so check the populated fields before using the result.
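The first and third tips combine into one pattern when ranking candidate entries: drop anything without a resolution before sorting, since comparing None with a float raises TypeError in Python 3. The sketch below uses a hypothetical EntryMeta record as a stand-in for the tool's output, with placeholder accessions and made-up resolution values.

```python
from typing import NamedTuple, Optional


class EntryMeta(NamedTuple):
    """Hypothetical stand-in for the fields pdb-fetch-entry returns."""
    pdb_id: str
    method: Optional[str]
    resolution: Optional[float]


def rank_by_resolution(entries: list[EntryMeta]) -> list[EntryMeta]:
    """Keep entries that report a resolution, best (smallest value) first."""
    usable = [e for e in entries if e.resolution is not None]
    return sorted(usable, key=lambda e: e.resolution)


# Placeholder accessions and values, for illustration only.
candidates = [
    EntryMeta("XXX1", "X-RAY DIFFRACTION", 2.8),
    EntryMeta("XXX2", "SOLUTION NMR", None),      # no resolution for NMR
    EntryMeta("XXX3", "ELECTRON MICROSCOPY", 1.9),
    EntryMeta("XXX4", None, None),                # empty output, unknown accession
]

ranked = rank_by_resolution(candidates)
print([e.pdb_id for e in ranked])
```

Entries dropped by the filter are not necessarily poor structures; they simply cannot be ordered on this axis and need a separate judgment.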

PDB Fetch FASTA (pdb-fetch-fasta)

Retrieves the chain sequences of a PDB entry from the RCSB FASTA endpoint, returning one record per unique sequence with the author-assigned chain identifiers that share it, the FASTA header, the sequence, and a protein/nucleic-acid classification, plus the request URL.

API Reference

Inputs
  • pdb_id (string, required): PDB accession (e.g. ‘1LBG’).

Configuration
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cpu"): Device to run the tool on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Outputs
  • chains (List[PdbChain]): Parsed chain sequences with protein/nucleotide classification.
  • source_url (string): URL used for the request.

Applications

Use this to extract reference sequences from an experimental structure for sequence design, alignment, or comparison against computational predictions. Filter chains by is_protein to separate protein subunits from nucleic-acid chains in a complex, and deduplicate identical sequences to recover the unique entities of a homo-oligomer. It follows PDB Fetch Entry once a suitable entry is chosen and consumes PDB identifiers surfaced by UniProt.
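The filtering and deduplication steps above might look like the following sketch. ChainRecord is a hypothetical stand-in for PdbChain, carrying only the fields this page mentions (chain_ids, header, sequence, is_protein); the real class may differ, and the sequences are invented.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical stand-in for PdbChain; the real class may differ."""
    chain_ids: list[str]
    header: str
    sequence: str
    is_protein: bool


def split_by_type(chains: list[ChainRecord]) -> tuple[list[ChainRecord], list[ChainRecord]]:
    """Separate protein chains from nucleic-acid chains."""
    proteins = [c for c in chains if c.is_protein]
    nucleic = [c for c in chains if not c.is_protein]
    return proteins, nucleic


def merge_identical(chains: list[ChainRecord]) -> dict[str, list[str]]:
    """Map each unique sequence to every chain identifier that carries it."""
    merged: dict[str, list[str]] = {}
    for chain in chains:
        merged.setdefault(chain.sequence, []).extend(chain.chain_ids)
    return merged


# Invented records, for illustration only.
chains = [
    ChainRecord(["A", "B"], ">example_1", "MKVLT", True),
    ChainRecord(["C"], ">example_2", "ACGTACGT", False),
    ChainRecord(["D"], ">example_3", "MKVLT", True),
]

proteins, nucleic = split_by_type(chains)
unique = merge_identical(proteins)
print(unique)
```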

Usage Tips

  • One record can cover several chains. A single PdbChain carries every author-assigned chain identifier that shares its sequence, so a homo-oligomer collapses to one record with multiple chain_ids.
  • Protein classification is heuristic. A chain is called protein only when it contains amino-acid letters absent from nucleotide alphabets; peptide nucleic acids and other hybrid molecules may be misclassified.
  • An unknown accession is not an error. A missing or obsolete accession returns an empty chains list rather than raising.
  • Exporting to fasta writes the original headers. The fasta export emits each record using its stored FASTA header verbatim; json and csv are also supported, with the csv form joining shared chain identifiers with a semicolon.
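To make the classification and chain-identifier tips concrete, here is a rough sketch of both steps. It is not the toolkit's code: the nucleotide alphabet (ACGTUN), the pipe-delimited header layout with a "Chains ..." field, and the "[auth X]" notation for author-assigned identifiers are all assumptions, and the function names are hypothetical.

```python
import re

AMINO_ACID_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWY")
# Assumption: the nucleotide alphabet used for the comparison.
NUCLEOTIDE_LETTERS = frozenset("ACGTUN")
PROTEIN_ONLY_LETTERS = AMINO_ACID_LETTERS - NUCLEOTIDE_LETTERS


def looks_like_protein(sequence: str) -> bool:
    """True when the sequence has a letter found only in the amino-acid set."""
    return any(ch in PROTEIN_ONLY_LETTERS for ch in sequence.upper())


def chain_ids_from_header(header: str) -> list[str]:
    """Pull chain identifiers out of an assumed '>ID|Chains A, B|...' header."""
    fields = header.lstrip(">").split("|")
    if len(fields) < 2:
        return []
    listing = re.sub(r"^Chains?\s+", "", fields[1].strip())
    ids = []
    for token in listing.split(","):
        token = token.strip()
        if not token:
            continue
        # Assumption: 'B[auth H]' means the author-assigned identifier is H.
        match = re.search(r"\[auth\s+([^\]]+)\]", token)
        ids.append(match.group(1) if match else token)
    return ids


# Invented header and sequences, for illustration only.
header = ">XXXX_1|Chains A, B[auth H]|EXAMPLE PROTEIN|synthetic construct"
print(chain_ids_from_header(header))
print(looks_like_protein("MKPVTLYDVA"))
print(looks_like_protein("GATTACA"))
print(looks_like_protein("GAGA"))  # a Gly/Ala-only peptide would be missed
```

The last call shows the limit of the heuristic: a peptide built only from letters shared with the nucleotide alphabet is indistinguishable from a nucleic-acid chain.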

Toolkit Notes

These apply to every PDB tool in this toolkit (pdb-fetch-entry, pdb-fetch-fasta).
  • Requires network access. The tools call the live RCSB PDB HTTP endpoints; they do not run offline and keep no local copy of the archive.
  • Subject to RCSB rate limits. RCSB throttles clients that exceed a few requests per second and returns HTTP 429 when the limit is exceeded; space out high-volume requests, since no account or API key is available to raise the limit.
  • Runs on CPU. There is no model and no GPU; latency is dominated by the network round-trip.
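For callers who query RCSB directly and want similar behavior to the rate-limit note above, the following sketch retries transient failures with exponential backoff and treats HTTP 404 as "not found". It illustrates the general technique only; the set of retryable status codes, the retry count, and the delays are assumptions rather than the toolkit's actual settings, and fetch_text is a hypothetical name.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

# Assumption: which status codes count as transient.
TRANSIENT_STATUS = frozenset({429, 500, 502, 503, 504})


def fetch_text(
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
    opener: Callable = urllib.request.urlopen,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the response body, or None when the server answers 404."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return None  # unknown accession: empty result, not an error
            if err.code not in TRANSIENT_STATUS or attempt == retries:
                raise
        except urllib.error.URLError:
            # Connection-level failure; retry unless attempts are exhausted.
            if attempt == retries:
                raise
        # Exponential backoff: base_delay, then 2x, 4x, ...
        sleep(base_delay * 2 ** attempt)
    return None
```

The opener and sleep parameters exist so the retry logic can be exercised without network access; in normal use, calling fetch_text("https://www.rcsb.org/fasta/entry/1LBG") with the defaults is enough.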
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.