License: The AlphaFold DB toolkit retrieves data from the AlphaFold Protein Structure Database, which is distributed under CC-BY-4.0. Attribution to the AlphaFold Protein Structure Database is required when the data is redistributed. The client wrapper code is licensed under Apache-2.0. See the data terms for the full conditions.

Proto is not affiliated with Google DeepMind or EMBL-EBI. This toolkit is open source and builds on the implementations produced by these organizations. Product names, logos, and trademarks are the property of their respective owners.


google-deepmind/alphafold
Open source code for AlphaFold 2.
AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models
Mihaly Varadi, Stephen Anyango, … Sameer Velankar
Nucleic Acids Research (2022)
@article{varadi2022alphafold,
  title={AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models},
  author={Varadi, Mihaly and Anyango, Stephen and Deshpande, Mandar and Nair, Sreenath and Natassia, Cindy and Yordanova, Galabina and Yuan, David and Stroe, Oana and Wood, Gemma and Laydon, Agata and {\v{Z}}{\'\i}dek, Augustin and Green, Tim and Tunyasuvunakool, Kathryn and Petersen, Stig and Jumper, John and Clancy, Ellen and Green, Richard and Vora, Ankur and Lutfi, Mira and Figurnov, Michael and Cowie, Andrew and Hobbs, Nicole and Kohli, Pushmeet and Kleywegt, Gerard and Birney, Ewan and Hassabis, Demis and Velankar, Sameer},
  journal={Nucleic Acids Research},
  volume={50},
  number={D1},
  pages={D439--D444},
  year={2022},
  publisher={Oxford University Press},
  doi={10.1093/nar/gkab1061}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/alphafold_db
| Function | Description |
| --- | --- |
| run_alphafold_db_fetch() | Fetch predicted structure (PDB/mmCIF), per-residue pLDDT, and PAE matrix from the AlphaFold Prote… |

Background

The AlphaFold Protein Structure Database (AFDB) (Varadi et al., 2022) is a freely accessible archive of protein structures predicted by AlphaFold2 (Jumper et al., 2021), maintained by Google DeepMind and EMBL-EBI. It hosts predicted atomic coordinates for the UniProt reference proteomes. Each entry carries a per-residue confidence score (pLDDT, 0 to 100) and a pairwise predicted aligned error (PAE) matrix in angstroms. AFDB hosts AlphaFold2 single-chain predictions only; multi-chain complexes are produced by separate pipelines and are not part of this database.

Internally, the tool issues a GET request to the AFDB prediction endpoint at https://alphafold.ebi.ac.uk/api/prediction/{accession}, which returns a JSON list of prediction records. It selects the canonical record (AF-{accession}-F1) by default, or the record matching the requested isoform, then follows the URLs carried in that record: pdbUrl or cifUrl for the structure body, plddtDocUrl for the per-residue pLDDT array, paeDocUrl for the PAE matrix, and msaUrl for the input multiple-sequence alignment (an A3M file). The mean pLDDT is read from the record’s globalMetricValue field.

Records and their provenance come directly from the official AlphaFold DB REST API. Results reflect the live database, which always serves the latest version of each prediction.
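
The record-selection step described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the helper names `expected_entry_id`, `select_record`, and `artifact_urls` are hypothetical, and the sketch assumes each record exposes its identifier under an `entryId` key. The sample records are hand-written stand-ins with placeholder URLs, not real AFDB output, and no network request is made.

```python
from typing import Any, Dict, List, Optional

# Endpoint template quoted in the text above.
API_TEMPLATE = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"


def expected_entry_id(accession: str, isoform: Optional[int] = None) -> str:
    """Entry ID to look for: canonical by default, isoform-specific otherwise."""
    if isoform is None:
        return f"AF-{accession}-F1"
    return f"AF-{accession}-{isoform}-F1"


def select_record(
    records: List[Dict[str, Any]], accession: str, isoform: Optional[int] = None
) -> Dict[str, Any]:
    """Pick one record from the JSON list (assumes an `entryId` key)."""
    wanted = expected_entry_id(accession, isoform)
    for record in records:
        if record.get("entryId") == wanted:
            return record
    raise ValueError(f"no record {wanted} in the response for {accession}")


def artifact_urls(record: Dict[str, Any], structure_format: str = "pdb") -> Dict[str, Any]:
    """Collect the URLs the tool would follow for one record."""
    structure_key = "pdbUrl" if structure_format == "pdb" else "cifUrl"
    return {
        "structure": record.get(structure_key),
        "plddt": record.get("plddtDocUrl"),
        "pae": record.get("paeDocUrl"),
        "msa": record.get("msaUrl"),  # may be absent on some entries
    }


# Hand-written stand-in for the JSON list (placeholder URLs, not real data).
records = [
    {
        "entryId": "AF-P04637-F1",
        "pdbUrl": "https://example.org/AF-P04637-F1.pdb",
        "cifUrl": "https://example.org/AF-P04637-F1.cif",
        "plddtDocUrl": "https://example.org/AF-P04637-F1-plddt.json",
        "paeDocUrl": "https://example.org/AF-P04637-F1-pae.json",
    },
    {
        "entryId": "AF-P04637-2-F1",
        "pdbUrl": "https://example.org/AF-P04637-2-F1.pdb",
        "cifUrl": "https://example.org/AF-P04637-2-F1.cif",
    },
]

canonical = select_record(records, "P04637")
print(API_TEMPLATE.format(accession="P04637"))
print(canonical["entryId"], artifact_urls(canonical)["structure"])
```

Matching on the full entry ID, rather than on list position, keeps the selection stable if the order of records in the response changes.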

Learning Resources

Tools

AlphaFold DB Fetch (alphafold-db-fetch)

Retrieves a single AlphaFold DB prediction record by UniProt accession and returns the predicted sequence and its 1-indexed coordinates, gene and organism metadata, mean pLDDT, the AFDB artifact URLs, the full JSON record, and an optional parsed Structure carrying per-residue pLDDT and optional pAE on structure.metrics.

API Reference

Source
uniprot_id
string
required
UniProt accession to look up (e.g. ‘P04637’).
isoform
integer
Isoform number to select from the multi-record AFDB response. None (default) returns the canonical entry (AF-{accession}-F1); 2 selects AF-{accession}-2-F1, etc. AFDB typically exposes isoforms 2-9 for human proteins. Raises ValueError if the requested isoform doesn’t exist.
Source
structure_format
enum
default:"pdb"
Structure file format. Available options: pdb, cif
include_structure
boolean
default:"True"
If True (default), fetch the structure body and the per-residue pLDDT array, returning a parsed Structure on the output. Set to False for metadata-only probes (URLs, mean pLDDT, gene, sequence) — saves ~100-500 KB per call, meaningful for batch sweeps.
include_pae
boolean
default:"False"
If True, also fetch the PAE (predicted aligned error) matrix and attach it to output.structure.metrics["pae"]. Disabled by default — PAE files can be tens of MB for long proteins. No-op when include_structure=False.
include_msa
boolean
default:"False"
If True, fetch the A3M MSA used as input to the AlphaFold prediction. Disabled by default — A3M files can be hundreds of KB to several MB for highly conserved proteins.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
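
How the include_* flags combine can be summarized in a small sketch. The function `planned_downloads` is hypothetical and only mirrors the parameter descriptions above; it is not part of the toolkit. Treating include_msa as independent of include_structure is an assumption, since the descriptions do not state a dependency between them.

```python
from typing import List


def planned_downloads(
    include_structure: bool = True,
    include_pae: bool = False,
    include_msa: bool = False,
    structure_format: str = "pdb",
) -> List[str]:
    """Return the record keys whose URLs would be followed for these flags."""
    if structure_format not in ("pdb", "cif"):
        raise ValueError(f"unsupported structure_format: {structure_format!r}")

    keys: List[str] = []
    if include_structure:
        # Structure body plus the per-residue pLDDT array.
        keys.append("pdbUrl" if structure_format == "pdb" else "cifUrl")
        keys.append("plddtDocUrl")
        # PAE is only attached to a parsed structure, so it is a no-op
        # when include_structure is False.
        if include_pae:
            keys.append("paeDocUrl")
    if include_msa:
        # Assumption: the MSA fetch does not depend on include_structure.
        keys.append("msaUrl")
    return keys


print(planned_downloads())                          # default call
print(planned_downloads(include_structure=False))   # metadata-only probe
print(planned_downloads(include_pae=True, structure_format="cif"))
```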
Source
uniprot_accession
string
required
Primary UniProt accession that was looked up.
entry_id
string
required
AlphaFold entry identifier (e.g. ‘AF-P04637-F1’).
gene
string
Gene symbol from the AlphaFold record.
organism_scientific_name
string
Source organism scientific name.
tax_id
integer
NCBI taxonomy ID.
sequence
string
required
Amino-acid sequence covered by the prediction.
sequence_length
integer
required
Length of the predicted sequence.
sequence_start
integer
required
1-indexed start residue of the prediction (relative to the full UniProt sequence; >1 only for non-first fragments of very long proteins).
sequence_end
integer
required
1-indexed inclusive end residue of the prediction.
latest_version
integer
required
Latest version of the AlphaFold DB prediction (this is the version of the served prediction; AlphaFold DB always serves the latest).
model_created_date
string
ISO 8601 timestamp when this prediction was generated.
mean_plddt
number
Mean per-residue pLDDT for the prediction (AlphaFold DB’s globalMetricValue field). Always populated from the metadata response, regardless of include_structure; when include_structure=True it is also mirrored at structure.metrics["avg_plddt"].
pdb_url
string
required
URL to the PDB structure file on AlphaFold DB.
cif_url
string
required
URL to the mmCIF structure file on AlphaFold DB.
bcif_url
string
URL to the BinaryCIF structure file; None on legacy entries that predate the bcif export.
pae_doc_url
string
required
URL to the PAE JSON document on AlphaFold DB.
plddt_doc_url
string
required
URL to the per-residue pLDDT JSON document on AlphaFold DB.
pae_image_url
string
required
URL to the rendered PAE PNG on AlphaFold DB.
msa_url
string
URL to the MSA A3M used for prediction, when present.
am_annotations_url
string
AlphaMissense pathogenicity CSV URL (sequence coords); None for non-human or unscored entries.
am_annotations_hg19_url
string
AlphaMissense annotations on GRCh37.
am_annotations_hg38_url
string
AlphaMissense annotations on GRCh38.
sequence_checksum
string
CRC64 checksum of the predicted sequence.
structure
Structure
Parsed AlphaFold structure (PDB or mmCIF body in structure_format, b_factor_type=BFactorType.PLDDT) with an AlphaFoldDBMetrics metrics container carrying avg_plddt, plddt_per_residue, and (when include_pae=True) pae. None when include_structure=False.
msa_a3m
string
A3M-format MSA contents used as input to the AlphaFold prediction. None when include_msa is False or when the entry has no associated MSA URL.
source_url
string
required
AlphaFold DB API URL used for the metadata lookup.
raw_entry
Dict[string, any]
Complete AlphaFold DB JSON record for advanced programmatic access.
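
Because sequence_start and sequence_end are 1-indexed and inclusive, mapping a UniProt residue number onto the returned sequence string takes one subtraction. The sketch below illustrates that arithmetic; the helper name `uniprot_position_to_index` and the fragment values are hypothetical placeholders, not data from a real entry.

```python
def uniprot_position_to_index(position: int, sequence_start: int, sequence_end: int) -> int:
    """Convert a 1-indexed UniProt position to a 0-based index into `sequence`."""
    if not sequence_start <= position <= sequence_end:
        raise ValueError(
            f"position {position} is outside the predicted range "
            f"{sequence_start}-{sequence_end}"
        )
    return position - sequence_start


# Placeholder values standing in for a non-first fragment.
sequence = "ACDEFGHIKL"
sequence_start, sequence_end = 201, 210

# The inclusive range and the sequence length should agree.
assert len(sequence) == sequence_end - sequence_start + 1

index = uniprot_position_to_index(203, sequence_start, sequence_end)
print(sequence[index])  # prints "D"
```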
Metrics
| Metric | Type | Range | Availability |
| --- | --- | --- | --- |
| avg_plddt | float | 0.0 to 100.0 | always |
| plddt_per_residue | list[float] | 0.0 to 100.0 | always |
| pae | list[list[float]] | ≥ 0.0 | when include_pae=True |
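
As an illustration of working with the pae metric, the sketch below averages the predicted aligned error between two residue ranges of a nested-list matrix. The helper name `mean_pae_between` is hypothetical and the 4x4 matrix is toy data. Since a PAE matrix is generally not symmetric, the sketch averages both directions.

```python
from typing import List, Tuple


def mean_pae_between(
    pae: List[List[float]], range_a: Tuple[int, int], range_b: Tuple[int, int]
) -> float:
    """Mean PAE between two 1-indexed, inclusive residue ranges, both directions."""
    rows_a = range(range_a[0] - 1, range_a[1])
    rows_b = range(range_b[0] - 1, range_b[1])
    values = [pae[i][j] for i in rows_a for j in rows_b]
    values += [pae[i][j] for i in rows_b for j in rows_a]
    if not values:
        raise ValueError("both ranges must contain at least one residue")
    return sum(values) / len(values)


# Toy matrix: residues 1-2 and 3-4 form two blocks with low internal error
# and high error between the blocks.
toy_pae = [
    [0.0, 1.0, 20.0, 22.0],
    [1.0, 0.0, 21.0, 19.0],
    [18.0, 20.0, 0.0, 2.0],
    [23.0, 17.0, 2.0, 0.0],
]

print(mean_pae_between(toy_pae, (1, 2), (3, 4)))  # prints 20.0
```

A high value between two confident regions suggests that their relative placement is uncertain even when each region on its own is well predicted.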

Applications

Use this to pull an AlphaFold-predicted structure into a pipeline when no experimental entry is needed: fetch a target by accession before inverse folding, docking, or binder design, screen accessions for AFDB coverage with metadata-only requests, or assess per-residue and pairwise confidence before structure-based work. The returned Structure feeds directly into structure-consuming tools such as TM-align, US-align, and structure scoring. The UniProt tool supplies the UniProt accession from a gene name and organism, and the PDB tool provides the experimental counterpart when one exists.
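
The coverage-screening pattern can be sketched as a loop that records which accessions resolve and which raise ValueError. The function `screen_coverage` and the `fetch_metadata` callable are hypothetical: in practice `fetch_metadata` would be a thin wrapper around the tool called with include_structure=False, whose exact call signature is not shown on this page. The stand-in below uses a hand-written dictionary so the example runs without network access.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def screen_coverage(
    accessions: Iterable[str], fetch_metadata: Callable[[str], Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Split accessions into those with a prediction and those without."""
    covered: Dict[str, Any] = {}
    missing: List[str] = []
    for accession in accessions:
        try:
            covered[accession] = fetch_metadata(accession)
        except ValueError:
            # Raised when AFDB has no prediction for the accession.
            missing.append(accession)
    return covered, missing


# Stand-in for a metadata-only lookup (fake data, no network).
FAKE_DB = {"P04637": {"entry_id": "AF-P04637-F1"}}


def fake_fetch(accession: str) -> Dict[str, str]:
    if accession not in FAKE_DB:
        raise ValueError(f"no AlphaFold DB prediction for {accession}")
    return FAKE_DB[accession]


covered, missing = screen_coverage(["P04637", "NOT_IN_AFDB"], fake_fetch)
print(sorted(covered), missing)
```

Accessions that end up in `missing` are the candidates for a fallback such as predicting the structure from sequence.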

Usage Tips

  • Coverage is broad but not universal. When AFDB has no prediction for an accession the tool raises ValueError. Catch that error and fall back to predicting the structure from sequence.
  • A high mean_plddt can hide locally unreliable regions. Inspect the per-residue pLDDT on structure.metrics before trusting any specific residue.
  • latest_version advances when AFDB refreshes a prediction. Cache it alongside any structure you persist and refetch when it moves past the cached value.
  • Multiple records signal isoforms or fragments. The canonical record is selected by default and a warning lists the alternatives. To select a non-canonical isoform, pass the isoform input, and check entry_id, sequence_start, and sequence_end to confirm which record was returned.
  • Low-confidence regions are usually real disorder, not a prediction error. Disordered or flexibly linked regions get very low per-residue confidence (pLDDT) and high predicted aligned error (PAE) between regions because they have no single fixed shape. Find those residue ranges from the per-residue pLDDT array and trim or down-weight just those residues. Do not throw away the whole prediction, because the confident domains are still reliable.
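
Locating low-confidence residue ranges from the per-residue pLDDT array, as the last tip suggests, can be sketched like this. The helper name `low_confidence_segments` is hypothetical, the default cutoff of 50 is an assumed threshold to adjust per use case, and the scores are toy values.

```python
from typing import List, Sequence, Tuple


def low_confidence_segments(
    plddt: Sequence[float], threshold: float = 50.0, min_length: int = 1
) -> List[Tuple[int, int]]:
    """Return 1-indexed, inclusive (start, end) runs where pLDDT < threshold."""
    segments: List[Tuple[int, int]] = []
    run_start = None  # 0-based index where the current low run began
    for i, score in enumerate(plddt):
        if score < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run covers 0-based indices run_start .. i-1.
            if i - run_start >= min_length:
                segments.append((run_start + 1, i))
            run_start = None
    # Close a run that extends to the end of the chain.
    if run_start is not None and len(plddt) - run_start >= min_length:
        segments.append((run_start + 1, len(plddt)))
    return segments


toy_plddt = [35.0, 42.0, 88.0, 91.0, 93.0, 47.0, 30.0, 28.0, 90.0]
print(low_confidence_segments(toy_plddt))                 # prints [(1, 2), (6, 8)]
print(low_confidence_segments(toy_plddt, min_length=3))   # prints [(6, 8)]
```

The positions are relative to the prediction; adding sequence_start - 1 converts them to UniProt numbering.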

Toolkit Notes

These apply to every AlphaFold DB tool in this toolkit (alphafold-db-fetch).
  • Requires network access. The tool calls the live AlphaFold DB REST API. It does not run offline and keeps no local copy of the database.
  • Subject to AlphaFold DB rate limits. The EMBL-EBI API is unauthenticated and applies per-IP fair-use limits (EMBL-EBI Terms of Use). Space out high-volume requests.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
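
Spacing out high-volume requests, as the rate-limit note recommends, can be done with a small client-side throttle. This is a generic sketch: the `Throttle` class is hypothetical and not part of the toolkit, and the interval used below is an arbitrary placeholder, not a documented AlphaFold DB limit.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() before each request in a batch sweep.
throttle = Throttle(min_interval=0.05)  # placeholder interval
for accession in ["P04637", "P38398", "P00533"]:
    throttle.wait()
    print("would fetch", accession)  # replace with the actual tool call
```

The clock and sleep functions are injectable so the pacing logic can be checked without real delays.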

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.