PDB - Proto

License: PDB retrieves data from the RCSB Protein Data Bank, which is distributed under CC0-1.0 (public domain; no attribution required). The client wrapper code is licensed under Apache-2.0. See the data terms for the full conditions.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.

rcsb/py-rcsb_api

Python interface for RCSB.org API services

61 stars

The Protein Data Bank

Helen M Berman, John Westbrook, … Philip E Bourne

Nucleic Acids Research (2000)

@article{berman2000pdb,
  title={The Protein Data Bank},
  author={Berman, Helen M and Westbrook, John and Feng, Zukang and Gilliland, Gary and Bhat, T N and Weissig, Helge and Shindyalov, Ilya N and Bourne, Philip E},
  journal={Nucleic Acids Research},
  volume={28},
  number={1},
  pages={235--242},
  year={2000},
  publisher={Oxford University Press},
  doi={10.1093/nar/28.1.235}
}

proto-bio/proto-tools/proto_tools/tools/database_retrieval/pdb

Function	Description
`run_pdb_fetch_entry()`	Fetch structure metadata (title, method, resolution) from RCSB PDB
`run_pdb_fetch_fasta()`	Fetch chain sequences from RCSB PDB with protein/nucleotide classification

Background

The Protein Data Bank (Berman et al., 2000) is the single worldwide archive of experimentally determined macromolecular structures, served here through the RCSB PDB. It is operated by the Research Collaboratory for Structural Bioinformatics (RCSB) at Rutgers University and the University of California San Diego, with funding from the National Science Foundation, the National Institutes of Health, and the Department of Energy. Entries are solved by X-ray crystallography, cryo-electron microscopy, nuclear magnetic resonance spectroscopy, and other experimental methods.

The tools call two RCSB HTTP endpoints directly. pdb-fetch-entry issues a GET request to the RCSB Data API core entry endpoint (https://data.rcsb.org/rest/v1/core/entry/{pdb_id}) and reads the structure title from struct.title, the experimental method from the first exptl record, and the resolution from rcsb_entry_info.resolution_combined, which covers both X-ray and cryo-EM entries; entries solved by NMR have no resolution value. pdb-fetch-fasta requests the FASTA endpoint (https://www.rcsb.org/fasta/entry/{pdb_id}), parses each record, extracts the author-assigned chain identifiers from the header, and classifies a sequence as protein when it contains amino-acid letters that do not also occur in nucleotide alphabets.

Both tools uppercase the supplied accession, retry transient HTTP failures with backoff, and return an empty result when the accession is not found (HTTP 404). Results reflect the live archive at query time rather than a fixed release snapshot.
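The field lookups described for pdb-fetch-entry can be sketched against a core-entry payload. This is a minimal illustration and not the toolkit's implementation: the helper names `entry_url` and `extract_entry_metadata` are hypothetical, the sample payload is a hand-written stand-in rather than a real API response, and taking the first element of resolution_combined is an assumption about how a single value is chosen.

```python
from typing import Any, Optional

DATA_API_ENTRY = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"


def entry_url(pdb_id: str) -> str:
    """Build the core entry URL, uppercasing the accession first."""
    return DATA_API_ENTRY.format(pdb_id=pdb_id.upper())


def extract_entry_metadata(payload: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Pull title, method, and resolution out of a core-entry JSON payload."""
    title = (payload.get("struct") or {}).get("title")

    # The method is read from the first exptl record, when one exists.
    exptl = payload.get("exptl") or []
    method = exptl[0].get("method") if exptl else None

    # Assumption: resolution_combined is a list and its first value is used.
    # Entries without a resolution (for example NMR) yield None.
    info = payload.get("rcsb_entry_info") or {}
    combined = info.get("resolution_combined") or []
    resolution = combined[0] if combined else None

    return {"title": title, "method": method, "resolution": resolution}


# Hand-written stand-in payload, not a real API response.
sample = {
    "struct": {"title": "Illustrative title"},
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "rcsb_entry_info": {"resolution_combined": [2.5]},
}

print(entry_url("1lbg"))
print(extract_entry_metadata(sample))
```

Every lookup falls back to None, so an empty payload produces an empty result instead of raising, which mirrors the not-found behavior described above.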

Learning Resources

RCSB PDB Data API documentation (RCSB PDB) - reference for the REST endpoints, query syntax, and rate limits.
PDB-101 training and education (RCSB PDB) - guided material on PDB data, structure determination methods, and how to interpret entries.

Tools

PDB Fetch Entry (`pdb-fetch-entry`)

Retrieves structure metadata for a PDB accession from the RCSB Data API core entry endpoint, returning the structure title, the experimental method, the resolution in angstroms, and the request URL.

API Reference

Input: PdbFetchEntryInput

pdb_id

string

required

PDB accession (e.g. ‘1LBG’).

Config: PdbFetchConfig

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.

Output: PdbFetchEntryOutput

title

string

Structure title.

method

string

Experimental method.

resolution

number

Resolution in angstroms.

source_url

string

URL used for the request.

Applications

Use this to assess whether an experimental structure is suitable as a reference before structure-based design or benchmarking: check the experimental method and resolution, then decide whether to use the entry. It pairs with UniProt, whose returned PDB cross-references can be ranked by resolution, and with PDB Fetch FASTA to pull the chain sequences once a suitable entry is selected.

Usage Tips

Resolution is absent for some methods. NMR and fiber-diffraction entries have no resolution value, so resolution is None; filter on it before sorting entries by quality.
This is metadata only. The tool returns the title, method, and resolution, not atomic coordinates or a structure file.
An unknown accession is not an error. A missing or obsolete accession returns an empty output rather than raising, so check the populated fields before using the result.
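The tips on missing resolution values and empty outputs combine naturally into a ranking step. The sketch below is illustrative only: `EntrySummary` is a hypothetical stand-in that mirrors the PdbFetchEntryOutput fields (with a `pdb_id` added for bookkeeping), `rank_by_resolution` is not a toolkit function, and the values are invented rather than taken from the archive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntrySummary:
    """Hypothetical stand-in mirroring the PdbFetchEntryOutput fields."""

    pdb_id: str
    title: Optional[str] = None
    method: Optional[str] = None
    resolution: Optional[float] = None


def rank_by_resolution(entries: list[EntrySummary]) -> list[EntrySummary]:
    """Drop entries without a resolution, then sort lowest (best) first."""
    with_resolution = [e for e in entries if e.resolution is not None]
    return sorted(with_resolution, key=lambda e: e.resolution)


# Invented values for illustration, not real archive data.
candidates = [
    EntrySummary("AAAA", "example A", "X-RAY DIFFRACTION", 2.8),
    EntrySummary("BBBB", "example B", "SOLUTION NMR", None),
    EntrySummary("CCCC", "example C", "ELECTRON MICROSCOPY", 1.9),
    EntrySummary("DDDD"),  # empty output, e.g. accession not found
]

ranked = rank_by_resolution(candidates)
print([e.pdb_id for e in ranked])  # ['CCCC', 'AAAA']
```

Filtering first matters because Python cannot compare None with a float, so sorting the unfiltered list would raise a TypeError.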

PDB Fetch FASTA (`pdb-fetch-fasta`)

Retrieves the chain sequences of a PDB entry from the RCSB FASTA endpoint, returning one record per unique sequence with the author-assigned chain identifiers that share it, the FASTA header, the sequence, and a protein/nucleic-acid classification, plus the request URL.

API Reference

Input: PdbFetchFastaInput

pdb_id

string

required

PDB accession (e.g. ‘1LBG’).

Config: PdbFetchConfig

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

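integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.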

Output: PdbFetchFastaOutput

chains

List[PdbChain]

Parsed chain sequences with protein/nucleotide classification.

PdbChain fields

chain_ids

List[string]

Author-assigned chain identifiers sharing this sequence (typically multi-element for homo-oligomers).

header

string

required

Full FASTA header line.

sequence

string

required

Chain sequence string.

is_protein

boolean

required

True if protein, False if nucleic acid.

source_url

string

URL used for the request.

Applications

Use this to extract reference sequences from an experimental structure for sequence design, alignment, or comparison against computational predictions. Filter chains by is_protein to separate protein subunits from nucleic-acid chains in a complex, and deduplicate identical sequences to recover the unique entities of a homo-oligomer. It follows PDB Fetch Entry once a suitable entry is chosen and consumes PDB identifiers surfaced by UniProt.
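The filtering and deduplication described here might look like the following. This is a sketch under stated assumptions: `Chain` is a hypothetical stand-in mirroring the PdbChain fields, `split_by_type` and `unique_sequences` are illustrative helpers rather than toolkit functions, and the records are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Chain:
    """Hypothetical stand-in mirroring the PdbChain fields."""

    header: str
    sequence: str
    is_protein: bool
    chain_ids: list[str] = field(default_factory=list)


def split_by_type(chains: list[Chain]) -> tuple[list[Chain], list[Chain]]:
    """Separate protein chains from nucleic-acid chains."""
    proteins = [c for c in chains if c.is_protein]
    nucleic = [c for c in chains if not c.is_protein]
    return proteins, nucleic


def unique_sequences(chains: list[Chain]) -> dict[str, list[str]]:
    """Map each distinct sequence to the sorted chain ids that carry it."""
    merged: dict[str, set[str]] = {}
    for chain in chains:
        merged.setdefault(chain.sequence, set()).update(chain.chain_ids)
    return {seq: sorted(ids) for seq, ids in merged.items()}


# Invented records for illustration only.
chains = [
    Chain(">demo_1|Chains A, B|example protein", "MKTAYIAKQR", True, ["A", "B"]),
    Chain(">demo_2|Chains C|example DNA", "GATTACA", False, ["C"]),
    Chain(">demo_3|Chains D|example protein copy", "MKTAYIAKQR", True, ["D"]),
]

proteins, nucleic = split_by_type(chains)
print(unique_sequences(proteins))  # {'MKTAYIAKQR': ['A', 'B', 'D']}
print([c.chain_ids for c in nucleic])  # [['C']]
```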

Usage Tips

One record can cover several chains. A single PdbChain carries every author-assigned chain identifier that shares its sequence, so a homo-oligomer collapses to one record with multiple chain_ids.
Protein classification is heuristic. A chain is called protein only when it contains amino-acid letters absent from nucleotide alphabets; peptide nucleic acids and other hybrid molecules may be misclassified.
An unknown accession is not an error. A missing or obsolete accession returns an empty chains list rather than raising.
Exporting to fasta writes the original headers. The fasta export emits each record using its stored FASTA header verbatim; json and csv are also supported, with the csv form joining shared chain identifiers with a semicolon.
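To make the parsing and the classification heuristic concrete, here is a minimal sketch of both steps. It is not the toolkit's code. The nucleotide alphabet (`ACGTUN`), the header layout (`>ID|Chains A, B[auth C]|description`), and all function names are assumptions made for illustration, and the FASTA text is invented.

```python
import re

# Assumption: these letters make up the nucleotide alphabet.
NUCLEOTIDE_LETTERS = set("ACGTUN")
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")
PROTEIN_ONLY_LETTERS = AMINO_ACID_LETTERS - NUCLEOTIDE_LETTERS


def looks_like_protein(sequence: str) -> bool:
    """True when any letter is an amino-acid code absent from nucleotides."""
    return any(letter in PROTEIN_ONLY_LETTERS for letter in sequence.upper())


def parse_chain_ids(header: str) -> list[str]:
    """Extract author-assigned chain ids from the assumed header layout."""
    fields = header.split("|")
    if len(fields) < 2:
        return []
    listing = re.sub(r"^Chains?\s+", "", fields[1].strip())
    chain_ids = []
    for token in listing.split(","):
        token = token.strip()
        # Prefer the author-assigned id when the header gives one.
        auth = re.search(r"\[auth\s+([^\]]+)\]", token)
        chain_ids.append(auth.group(1) if auth else token)
    return [chain_id for chain_id in chain_ids if chain_id]


def make_record(header: str, sequence: str) -> dict:
    """Bundle one FASTA record with its chain ids and classification."""
    return {
        "chain_ids": parse_chain_ids(header),
        "header": header,
        "sequence": sequence,
        "is_protein": looks_like_protein(sequence),
    }


def parse_fasta(text: str) -> list[dict]:
    """Split FASTA text into records, joining wrapped sequence lines."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(make_record(header, "".join(parts)))
            header, parts = line, []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append(make_record(header, "".join(parts)))
    return records


# Invented FASTA text for illustration only.
demo = (
    ">demo_1|Chains A, B|example repressor|example organism\n"
    "MKPVTLYDVAEYAG\n"
    ">demo_2|Chains C[auth E]|example operator DNA|synthetic\n"
    "GAATTGTGAGC\n"
)
for record in parse_fasta(demo):
    print(record["chain_ids"], record["is_protein"])
```

Under this assumed alphabet a poly-glycine peptide such as `GGGG` would be classified as nucleic acid, because G also occurs in nucleotide sequences. This is the kind of edge case the heuristic tip warns about.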

Toolkit Notes

These apply to every PDB tool in this toolkit (pdb-fetch-entry, pdb-fetch-fasta).

Requires network access. The tools call the live RCSB PDB HTTP endpoints; they do not run offline and keep no local copy of the archive.
Subject to RCSB rate limits. RCSB throttles clients that exceed a few requests per second and returns HTTP 429 when the limit is exceeded; space out high-volume requests, since no account or API key is available to raise the limit.
Runs on CPU. There is no model and no GPU; latency is dominated by the network round-trip.
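When looping over many accessions, one way to respect the rate limit is to back off on HTTP 429 and other transient failures. The sketch below shows the general pattern only; it does not reproduce the toolkit's internal retry logic. `fetch_with_backoff`, the set of retryable status codes, and the delay schedule are all assumptions, and the fetch callable is injected so the example runs without network access.

```python
import time
from typing import Callable, Optional

# Assumption: statuses treated as throttled or transient.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_backoff(
    fetch: Callable[[str], tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), retrying retryable statuses with exponential backoff.

    Returns the body on HTTP 200 and None on HTTP 404.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 404:
            return None  # unknown accession: empty result, not an error
        if status not in RETRYABLE_STATUSES:
            raise RuntimeError(f"unexpected HTTP status {status} for {url}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")


# Fake transport for the demo: throttled twice, then successful.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays: list[float] = []
result = fetch_with_backoff(
    lambda url: next(responses),
    "https://example.invalid/entry",
    sleep=delays.append,  # record delays instead of actually sleeping
)
print(result, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the pacing testable; in real use the default `time.sleep` applies the delays.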

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.