License: CCD Lookup retrieves data from the wwPDB Chemical Component Dictionary, distributed under CC0-1.0 (public domain; no attribution required). The client wrapper code is Apache-2.0-licensed. See the data terms for details.

Proto is not affiliated with PDBe, EMBL-EBI, or wwPDB. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.


PDBeurope/ccdutils
A set of python tools to deal with PDB chemical components definitions for small molecules, taken from the wwPDB Chemical Component Dictionary, uses RDKit
wwpdb.org
PDBe CCDUtils: an RDKit-based toolkit for handling and analysing small molecules in the Protein Data Bank
Ibrahim Roshan Kunnakkattu, Preeti Choudhary, … Sameer Velankar
Journal of Cheminformatics (2023)
@article{kunnakkattu2023pdbeccdutils,
  title={{PDBe} {CCDUtils}: an {RDKit}-based toolkit for handling and analysing small molecules in the {Protein Data Bank}},
  author={Kunnakkattu, Ibrahim Roshan and Choudhary, Preeti and Pravda, Lukas and Nadzirin, Nurul and Smart, Oliver S. and Yuan, Qi and Anyango, Stephen and Nair, Sreenath and Varadi, Mihaly and Velankar, Sameer},
  journal={Journal of Cheminformatics},
  volume={15},
  number={1},
  pages={117},
  year={2023},
  publisher={BioMed Central},
  doi={10.1186/s13321-023-00786-w}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/ccd_lookup
Function          Description
run_ccd_lookup()  Rich enrichment for wwPDB Chemical Component Dictionary entries via pdbeccdutils: returns Fragmen…

Background

The wwPDB Chemical Component Dictionary (CCD) is the dictionary of every chemical component observed in the Protein Data Bank: small-molecule ligands, modified amino acids, ions, cofactors, nucleotides, and sugars. Each component has a 1- to 5-character identifier (for example ATP for adenosine triphosphate, HEM for heme, MG for magnesium ion, SEP for phosphoserine). Each entry stores atoms, bonds, formula, IUPAC name, descriptors (SMILES / InChI / InChIKey), release status, and, for modified residues, a parent component (for example, SER for SEP). Cross-references via UniChem link CCD entries to external chemistry databases (DrugBank, ChEMBL, PubChem, ChEBI), so the same molecule can be looked up across resources.

The tool reads a bundled copy of the CCD components.cif, loaded via pdbeccdutils.core.ccd_reader (Kunnakkattu et al., 2023). SMILES inputs are canonicalized with RDKit and matched against an index built over the bundled dictionary by canonical SMILES and InChIKey. Records and their provenance come directly from the wwPDB Chemical Component Dictionary, distributed by PDBe.
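
The index-based matching described above can be pictured with a small, standard-library-only sketch. It is not the tool's implementation: `build_index`, `resolve_smiles`, `toy_canonicalize`, and `TOY_RECORDS` are hypothetical names, the two records are toy data, and the canonicalizer is a stand-in for RDKit so the example runs without any chemistry dependency.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# (ccd_code, canonical_smiles, inchikey)
Record = Tuple[str, str, str]

# Toy records for illustration only; a real index would be built from
# the bundled components.cif rather than typed in by hand.
TOY_RECORDS = [
    ("HOH", "O", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
    ("EOH", "CCO", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
]


def build_index(records: Iterable[Record]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Build two lookup tables: canonical SMILES -> code and InChIKey -> code."""
    by_smiles: Dict[str, str] = {}
    by_inchikey: Dict[str, str] = {}
    for ccd_code, smiles, inchikey in records:
        # setdefault keeps the first code seen if two entries share a key.
        by_smiles.setdefault(smiles, ccd_code)
        by_inchikey.setdefault(inchikey, ccd_code)
    return by_smiles, by_inchikey


def resolve_smiles(
    query: str,
    by_smiles: Dict[str, str],
    canonicalize: Callable[[str], str],
) -> Optional[str]:
    """Canonicalize the query, then look it up; None means no match."""
    return by_smiles.get(canonicalize(query))


def toy_canonicalize(smiles: str) -> str:
    """Stand-in for RDKit canonicalization (assumption): it only knows
    two alternative spellings of ethanol and passes everything else through."""
    return {"OCC": "CCO", "C(O)C": "CCO"}.get(smiles, smiles)


by_smiles, by_inchikey = build_index(TOY_RECORDS)
ethanol_code = resolve_smiles("OCC", by_smiles, toy_canonicalize)
benzene_code = resolve_smiles("c1ccccc1", by_smiles, toy_canonicalize)
```

With RDKit installed, the stand-in canonicalizer would typically be replaced by something like `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`. Canonicalizing before the lookup is what lets differently written SMILES for the same molecule land on the same index key.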

Learning Resources

  • PDBe CCDUtils documentation (PDBe) - official documentation for the underlying library, covering the CCD reader, descriptors, and depiction.
  • Chemical Component Dictionary (wwPDB) - the reference description of the CCD, its identifiers, and what each entry stores.
  • UniChem (EMBL-EBI) - the cross-reference service used to map CCD entries to external chemistry databases.

Tools

CCD Lookup (ccd-lookup)

Enriches wwPDB Chemical Component Dictionary entries. Accepts CCD codes (such as "ATP") or SMILES strings, in mixed batches, and returns a CcdLookupOutput containing a Ligands collection plus parallel CcdEnrichment records: formula, descriptors, parent component, RDKit physicochemical properties, optional UniChem cross-references, and optional PDB usage.

API Reference

  • identifiers (List[string], required): CCD codes (e.g. "ATP") or SMILES strings. A single string is normalized to a list.
  • include_cross_references (boolean, default: False): Fetch UniChem cross-references (DrugBank/ChEMBL/PubChem/etc. IDs). Requires network.
  • include_pdb_usage (boolean, default: False): Fetch PDB structures containing each ligand from the RCSB search API. Requires network.
  • sanitize (boolean, default: True): Sanitize the parsed RDKit molecule. Disable only for CCD entries with unusual valences.
  • verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
  • device (string, default: "cpu"): Device to run the tool on.
  • timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
  • seed (integer, optional): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
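
The two coercions mentioned in the parameter descriptions (a single string becoming a list, and a boolean verbose flag becoming 0 or 1) can be sketched as follows. The helper names `normalize_identifiers` and `normalize_verbose` are hypothetical and only mirror the documented behavior; the tool's own code may differ.

```python
from typing import Iterable, List, Union


def normalize_identifiers(identifiers: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a lone string in a list; copy any other iterable into a list."""
    if isinstance(identifiers, str):
        # Without this check, list("ATP") would split into ["A", "T", "P"].
        return [identifiers]
    return list(identifiers)


def normalize_verbose(verbose: Union[bool, int]) -> int:
    """Map True/False to 1/0 and pass integer levels through unchanged."""
    # bool is a subclass of int in Python, so test for it first.
    if isinstance(verbose, bool):
        return 1 if verbose else 0
    return int(verbose)


single = normalize_identifiers("ATP")
batch = normalize_identifiers(("ATP", "CCO"))
level = normalize_verbose(True)
```

Checking for `str` before iterating matters because a Python string is itself iterable, so a naive `list(...)` call would turn one identifier into several one-character identifiers.
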
Returns (CcdLookupOutput)

  • ligands (Ligands): Collection of Fragment objects, one per input identifier in input order. SMILES inputs with no CCD match still produce a Fragment (with ccd_code=None); see enrichments[i] for the resolution status.
  • enrichments (List[CcdEnrichment]): Per-identifier CCD-specific metadata (formula, descriptors, release status, optional network data). Same length and order as ligands.fragments.
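
Because the enrichments are parallel to the inputs, resolved and unresolved identifiers can be separated with a plain `zip`. The sketch below uses a hypothetical `EnrichmentStub` dataclass that models only the `ccd_code` field in place of the real CcdEnrichment, so it runs without the toolkit installed; `split_resolved` is likewise an illustrative helper, not part of the API.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class EnrichmentStub:
    """Stand-in for CcdEnrichment (assumption): only ccd_code is modelled."""
    ccd_code: Optional[str]


def split_resolved(
    identifiers: Sequence[str],
    enrichments: Sequence[EnrichmentStub],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair each input with its enrichment and separate matches from misses."""
    if len(identifiers) != len(enrichments):
        raise ValueError("identifiers and enrichments must have the same length")
    resolved: List[Tuple[str, str]] = []
    unresolved: List[str] = []
    for identifier, enrichment in zip(identifiers, enrichments):
        if enrichment.ccd_code is not None:
            resolved.append((identifier, enrichment.ccd_code))
        else:
            unresolved.append(identifier)
    return resolved, unresolved


# Toy batch: two inputs that resolve and one SMILES with no CCD match.
toy_inputs = ["ATP", "CCO", "FC(F)(Cl)c1ccccc1"]
toy_enrichments = [EnrichmentStub("ATP"), EnrichmentStub("EOH"), EnrichmentStub(None)]
resolved, unresolved = split_resolved(toy_inputs, toy_enrichments)
```

The explicit length check guards against accidentally pairing inputs with enrichments from a different batch, which `zip` alone would silently truncate.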

Applications

Use this tool for user-facing enrichment workflows such as notebooks, scripts, dashboards, and ligand reports. Typical uses are resolving a CCD code or SMILES to a canonical Fragment with formula, descriptors, and physicochemical properties before structure prediction or docking; mapping a ligand to external chemistry databases via UniChem; and discovering which experimental structures contain a ligand before structure-based work. The returned Ligands collection feeds directly into tools that take ligands as input, and PDB identifiers from pdb_structures pair naturally with the PDB tool.

Usage Tips

  • For per-fragment SMILES-to-CCD lookups, use proto_tools.entities.ligands.ccd_utils.map_smiles_to_ccd_code instead. It runs the same lookup in the current Python process, without the subprocess startup this tool incurs, so when you are not using the persistent tool context it can be much faster.
  • A SMILES with no CCD match returns ccd_code=None rather than an error. Check enrichment.ccd_code is not None before treating an entry as found. result.num_unresolved gives the batch-level count.
  • parent_ccd_code is populated only when the CCD entry declares a canonical parent component. Modified residues like SEP (phosphoserine, parent SER), MSE (selenomethionine, parent MET), or PTR (phosphotyrosine, parent TYR) carry a parent code. Most small-molecule ligands have no parent and return None.
  • pdb_structures can be very large. For common cofactors and ions (HEM, ATP, NAG, MG, ZN) the list runs to many thousands of PDB IDs. When passing it to a downstream step, process it in chunks rather than handling every entry at once.
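
Two of the tips above lend themselves to small helpers: falling back to the ligand's own code when no parent is declared, and walking a long list of PDB IDs in fixed-size batches. This is a minimal sketch using only the standard library; `parent_or_self` and `chunked` are hypothetical names, and the PDB IDs are placeholders rather than real entries.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def parent_or_self(ccd_code: str, parent_ccd_code: Optional[str]) -> str:
    """Return the parent component when one is declared, else the code itself."""
    return parent_ccd_code if parent_ccd_code is not None else ccd_code


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder identifiers standing in for a long pdb_structures list.
placeholder_ids = [f"ID{i}" for i in range(5)]
batches = list(chunked(placeholder_ids, 2))

# SEP declares SER as its parent; ATP is assumed here to have none.
sep_residue = parent_or_self("SEP", "SER")
atp_residue = parent_or_self("ATP", None)
```

Because `chunked` consumes an iterator lazily, a downstream step only ever holds one batch in memory at a time, regardless of how many structures contain the ligand.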

Toolkit Notes

These notes apply to every CCD Lookup tool in this toolkit (ccd-lookup).
  • Offline by default. The tool reads a bundled copy of the wwPDB CCD and runs fully offline. Only include_cross_references (UniChem) and include_pdb_usage (RCSB) require network access, and both default to off.
  • One-time data download. First use downloads the roughly 70 MB compressed CCD bundle (components.cif.gz) to $PROTO_MODEL_CACHE/ccd_lookup/ and decompresses it to a components.cif of roughly 500 MB, which grows as the dictionary grows. Subsequent runs reuse the decompressed file.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
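
To check ahead of time whether the one-time download has already happened, the cache location described above can be inspected with the standard library. This is a sketch under stated assumptions: `ccd_cache_status` is a hypothetical helper, the file names are the ones given in the note, and what the tool does when PROTO_MODEL_CACHE is unset is not documented here, so the helper simply reports nothing found in that case.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional


def ccd_cache_status(cache_root: Optional[str] = None) -> Dict[str, object]:
    """Report which CCD files exist under <cache_root>/ccd_lookup/."""
    root = cache_root if cache_root is not None else os.environ.get("PROTO_MODEL_CACHE")
    if not root:
        # Assumption: with no cache root known, treat the cache as absent.
        return {"directory": None, "compressed": False, "decompressed": False}
    directory = Path(root) / "ccd_lookup"
    return {
        "directory": directory,
        "compressed": (directory / "components.cif.gz").is_file(),
        "decompressed": (directory / "components.cif").is_file(),
    }


# Demonstration against a throwaway directory, not the real cache.
with tempfile.TemporaryDirectory() as tmp:
    cold_status = ccd_cache_status(tmp)
    cache_dir = Path(tmp) / "ccd_lookup"
    cache_dir.mkdir()
    (cache_dir / "components.cif").write_text("data_toy\n")
    warm_status = ccd_cache_status(tmp)
```

A check like this is useful on machines without network access, where the first run would otherwise fail while trying to fetch the bundle.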

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.