License: This tool retrieves data from the PubChem database, which is in the U.S. Government public domain. The client wrapper code is licensed under Apache-2.0. Refer to the data terms for the full terms.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


pubchem.ncbi.nlm.nih.gov
PubChem 2023 update
Sunghwan Kim, Jie Chen, … Evan E. Bolton
Nucleic Acids Research (2023)
@article{kim2023pubchem,
  title={{PubChem} 2023 update},
  author={Kim, Sunghwan and Chen, Jie and Cheng, Tiejun and Gindulyte, Asta and He, Jia and He, Siqian and Li, Qingliang and Shoemaker, Benjamin A. and Thiessen, Paul A. and Yu, Bo and Zaslavsky, Leonid and Zhang, Jian and Bolton, Evan E.},
  journal={Nucleic Acids Research},
  volume={51},
  number={D1},
  pages={D1373--D1380},
  year={2023},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkac956}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/pubchem
Function: run_pubchem_fetch()
Description: Resolve small-molecule identifiers (CID, name, SMILES, InChIKey) against PubChem PUG REST and ret…

Background

PubChem (Kim et al., 2023) is a freely accessible chemistry resource hosted by NCBI. It aggregates compound records with well-defined chemical structures, depositor-supplied substance records, and bioassay results contributed by hundreds of data sources. Each unique compound is assigned a stable Compound Identifier (CID), and standardized structure representations and computed descriptors are derived from a uniform processing pipeline.

Internally, the tool calls the PUG REST endpoint at https://pubchem.ncbi.nlm.nih.gov/rest/pug. It first resolves the supplied identifier to one or more CIDs. A name, SMILES, or InChIKey is sent as a URL-encoded GET against the matching /compound/{domain}/{value}/cids/JSON endpoint, an InChI is submitted via POST, and a CID skips resolution entirely. It then fetches the configured property bundle from /compound/cid/{cid}/property/{properties}/JSON, and optionally retrieves synonyms, descriptions, and BioAssay identifiers through additional endpoints. Results reflect the live database at query time rather than a fixed release snapshot.
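The two-step request flow described above can be sketched with the Python standard library. This is an illustration of the URL grammar only, not the tool's actual implementation: the helper names resolver_request and property_url are hypothetical, and the form-encoded POST body for InChI is an assumption based on the PUG REST documentation. Nothing here sends a request; it only builds the request objects.

```python
from typing import Iterable
from urllib.parse import quote, urlencode
from urllib.request import Request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def resolver_request(domain: str, value: str) -> Request:
    """Build (but do not send) the CID-resolution request.

    Hypothetical helper: `domain` is one of "name", "smiles",
    "inchikey", or "inchi".
    """
    if domain == "inchi":
        # InChI strings contain '/' characters, so the value travels
        # in a form-encoded POST body instead of the URL path (assumed).
        body = urlencode({"inchi": value}).encode("ascii")
        return Request(f"{PUG_REST}/compound/inchi/cids/JSON", data=body, method="POST")
    # Other identifiers are percent-encoded into the URL path.
    encoded = quote(value, safe="")
    return Request(f"{PUG_REST}/compound/{domain}/{encoded}/cids/JSON")


def property_url(cid: int, properties: Iterable[str]) -> str:
    """Build the property-bundle URL for an already resolved CID."""
    return f"{PUG_REST}/compound/cid/{cid}/property/{','.join(properties)}/JSON"


if __name__ == "__main__":
    print(resolver_request("name", "aspirin").full_url)
    print(property_url(2244, ["Title", "MolecularWeight", "InChIKey"]))
```

Passing a CID directly corresponds to calling only property_url, which is why CID inputs need one request fewer.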

Learning Resources

  • PUG REST documentation (PubChem) - official reference for the request grammar, compound domains, property names, and response formats.
  • Programmatic access (PubChem) - overview of the programmatic interfaces and the published usage policies and rate limits.

Tools

PubChem Fetch (pubchem-fetch)

Resolves a single small-molecule identifier to a PubChem CID and returns the requested property bundle, the full list of matched CIDs, and optionally synonyms, textual descriptions, BioAssay identifiers, the source URL, and the raw property record.

API Reference

Inputs
cid
integer
PubChem Compound Identifier (e.g. 2244 for aspirin).
name
string
Common or systematic name (e.g. ‘aspirin’).
smiles
string
SMILES string (e.g. ‘CC(=O)Oc1ccccc1C(=O)O’).
inchi
string
Standard InChI string (e.g. ‘InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/…’).
inchikey
string
Standard InChIKey (e.g. ‘BSYNRYMUTXBXSQ-UHFFFAOYSA-N’).
Configuration
properties
List[string]
PubChem property names to request. Defaults to a 16-property bundle covering the common name (Title), structure (SMILES, InChI), mass, and basic descriptor counts (TPSA, HBA, HBD, etc.).
include_synonyms
boolean
default:"False"
If True, also fetch the compound’s synonyms (one extra HTTP call). Returns up to 50 synonyms.
include_description
boolean
default:"False"
If True, also fetch the compound’s textual descriptions (one extra HTTP call to /description/JSON).
include_aids
boolean
default:"False"
If True, also fetch the list of BioAssay IDs that tested this compound (one extra HTTP call to /aids/JSON). For common compounds this can return thousands of assay IDs.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Outputs
cid
integer
required
Resolved PubChem CID.
all_matched_cids
List[integer]
All CIDs returned by the resolver (length 1 for unambiguous queries; may be longer for ambiguous names).
title
string
Common compound name (e.g. ‘Aspirin’), distinct from the IUPAC systematic name in iupac_name.
molecular_formula
string
Hill-system molecular formula.
molecular_weight
number
Average molecular weight in g/mol.
smiles
string
PubChem canonical SMILES with stereochemistry (the API field formerly named IsomericSMILES).
connectivity_smiles
string
Connectivity-only SMILES, with stereochemistry stripped (the API field formerly named CanonicalSMILES).
inchi
string
Standard InChI string.
inchikey
string
Standard InChIKey hash.
iupac_name
string
IUPAC systematic name.
exact_mass
number
Exact (monoisotopic) mass in Da.
tpsa
number
Topological polar surface area in square angstroms (Å²).
complexity
integer
Bertz / Hendrickson / Ihlenfeldt complexity.
charge
integer
Net formal charge.
hbond_donor_count
integer
Number of hydrogen-bond donors.
hbond_acceptor_count
integer
Number of hydrogen-bond acceptors.
rotatable_bond_count
integer
Number of rotatable bonds.
heavy_atom_count
integer
Number of non-hydrogen atoms.
synonyms
List[string]
Up to 50 synonyms (empty when include_synonyms is False).
descriptions
List[string]
Textual descriptions of the compound, one per source (empty when include_description is False).
bioassay_ids
List[integer]
BioAssay IDs that have tested this compound (empty when include_aids is False). For common compounds this can return thousands of IDs.
source_url
string
required
URL of the PubChem property request.
raw_property_record
Dict[string, any]
Complete property record from PubChem for advanced programmatic access.

Applications

Use this to resolve a ligand to its canonical structure and properties before structure-based or chemical-constraint work: convert a user-supplied name or SMILES into a canonical CID and standardized SMILES/InChI/InChIKey before docking, deduplicate or join compound sets on canonical identifiers, or pull descriptor counts for rule-of-five style filtering. PubChem CIDs anchor cross-references into other chemistry resources. Pair this with NCBI E-utilities to pull linked literature or biomolecule records once a CID is resolved.
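As a rough illustration of rule-of-five style filtering, the sketch below applies the three Lipinski criteria that the listed output fields can support (molecular weight, hydrogen-bond donors, hydrogen-bond acceptors). It operates on a plain dictionary keyed by the output field names, which is an assumption about how a caller might hold the result. The function name is hypothetical, and the aspirin values are illustrative. The lipophilicity (logP) criterion is omitted because no logP field appears among the documented outputs.

```python
def passes_ro5_style_filter(record: dict) -> bool:
    """Return True if the record meets the three supported criteria.

    Hypothetical helper. Expects the keys molecular_weight,
    hbond_donor_count, and hbond_acceptor_count.
    """
    return (
        record["molecular_weight"] <= 500      # g/mol
        and record["hbond_donor_count"] <= 5
        and record["hbond_acceptor_count"] <= 10
    )


# Illustrative descriptor values for aspirin (CID 2244).
aspirin = {
    "molecular_weight": 180.16,
    "hbond_donor_count": 1,
    "hbond_acceptor_count": 4,
}

# A made-up record that violates the weight and acceptor criteria.
large_molecule = {
    "molecular_weight": 812.0,
    "hbond_donor_count": 3,
    "hbond_acceptor_count": 14,
}

if __name__ == "__main__":
    print(passes_ro5_style_filter(aspirin))
    print(passes_ro5_style_filter(large_molecule))
```

PubChem does expose a computed XLogP property, so a fuller filter could request it through the properties parameter. How the tool surfaces extra properties is not shown here, so that step is left out.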

Usage Tips

  • Ambiguous names resolve to multiple CIDs. A generic name can match many compounds. The tool deterministically selects the first CID and records the full list in all_matched_cids. Pass a CID directly when the identity must be exact.
  • Prefer CID inputs for large batches. Supplying a CID skips the resolution call and reduces the request count per query, which matters under the rate limits.
  • Synonym, description, and BioAssay retrieval each add a request. Enabling them issues an extra HTTP call, and for common compounds the BioAssay list can return thousands of identifiers.
  • Results track the live database. The same call can return updated structures or properties as PubChem ingests new depositions. It is not pinned to a release.
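The first-CID selection described in the tips above can be pictured with a small parser over the resolver's JSON response. This is a sketch, not the tool's code: select_cid is a hypothetical name, the payload shape follows PUG REST's IdentifierList format, and the CIDs in the ambiguous example are made up.

```python
from typing import List, Tuple


def select_cid(resolver_json: dict) -> Tuple[int, List[int]]:
    """Pick the first CID and keep the full match list.

    Hypothetical helper mirroring the documented behavior: the first
    CID is selected, and every match is retained so callers can
    detect ambiguity.
    """
    cids = resolver_json.get("IdentifierList", {}).get("CID", [])
    if not cids:
        raise LookupError("resolver returned no CIDs")
    return cids[0], list(cids)


if __name__ == "__main__":
    # Unambiguous query: a single match.
    print(select_cid({"IdentifierList": {"CID": [2244]}}))

    # Ambiguous query (made-up CIDs): check the list length before
    # trusting the selected CID.
    cid, matches = select_cid({"IdentifierList": {"CID": [101, 202, 303]}})
    if len(matches) > 1:
        print(f"ambiguous: selected {cid} out of {len(matches)} matches")
```

Checking the length of all_matched_cids in the same way is a cheap guard before using the selected compound downstream.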

Toolkit Notes

These apply to every PubChem tool in this toolkit (pubchem-fetch).
  • Requires network access. The tool calls the live PubChem PUG REST API. It does not run offline and keeps no local copy of the database.
  • Subject to PUG REST throttling. PubChem applies dynamic per-user throttling, with limits of no more than 5 requests per second, 400 requests per minute, and 300 seconds of running time per minute. Exceeding them returns HTTP 503.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
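For batch work, the per-second limit noted above can be respected with a simple client-side throttle. The sketch below is one possible approach and is not part of the toolkit: MinIntervalThrottle is a hypothetical class that spaces calls at least 1/5 of a second apart. The clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class MinIntervalThrottle:
    """Enforce a minimum spacing between consecutive calls (hypothetical helper)."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = MinIntervalThrottle(max_per_second=5.0)
    for cid in (2244, 3672, 2519):
        throttle.wait()  # call this before each request
        print("would request CID", cid)
```

Spacing requests 0.2 seconds apart yields at most 300 requests per minute, which also stays under the 400 per minute limit. It does not account for the running-time limit or for other clients sharing the same address, so a caller would still need to handle HTTP 503 responses, for example by pausing and retrying.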

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.