License: AlphaMissense DB retrieves data from the AlphaFold Protein Structure Database, which is distributed under CC-BY-4.0. Attribution to the AlphaFold Protein Structure Database is required when the data is redistributed. The client wrapper code is licensed under Apache-2.0. Please refer to the data terms for the full conditions.

Proto is not affiliated with Google DeepMind. This toolkit is open source and builds on the implementation produced by Google DeepMind. Product names, logos, and trademarks are the property of their respective owners.


google-deepmind/alphamissense
629 stars
Accurate proteome-wide missense variant effect prediction with AlphaMissense
Jun Cheng, Guido Novati, … Žiga Avsec
Science (2023)
@article{cheng2023alphamissense,
  title={Accurate proteome-wide missense variant effect prediction with {AlphaMissense}},
  author={Cheng, Jun and Novati, Guido and Pan, Joshua and Bycroft, Clare and {\v{Z}}emgulyt{\.{e}}, Akvil{\.{e}} and Applebaum, Taylor and Pritzel, Alexander and Wong, Lai Hong and Zielinski, Michal and Sargeant, Tobias and Schneider, Rosalia G. and Senior, Andrew W. and Jumper, John and Hassabis, Demis and Kohli, Pushmeet and Avsec, {\v{Z}}iga},
  journal={Science},
  volume={381},
  number={6664},
  pages={eadg7492},
  year={2023},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adg7492}
}
proto-bio/proto-tools/proto_tools/tools/database_retrieval/alphamissense_db
Function: run_alphamissense_db_fetch()
Description: Fetch per-residue, per-substitution AlphaMissense pathogenicity scores for a human UniProt access…

Background

AlphaMissense (Cheng et al., 2023) is a deep-learning model that scores the pathogenicity of human missense variants. It is adapted from AlphaFold and fine-tuned on human and primate population variant frequencies, treating variants common in healthy populations as benign and variants unobserved in those populations as putatively pathogenic. For each canonical UniProt sequence it scores all 19 alternate amino acids at every position, covering every possible single missense substitution. Its classification thresholds are chosen to reach about 90% precision on ClinVar variants. The paper reports that the model classifies 89% of all 71 million possible human missense variants, labeling 32% likely pathogenic and 57% likely benign at the default thresholds.

The predictions are not computed at query time. They are precomputed by Google DeepMind and distributed as static CSV files by the AlphaFold Protein Structure Database, which is maintained by EMBL-EBI. The files are keyed by UniProt accession at https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-{suffix}.csv. Each result therefore reflects the fixed CSV snapshot published by AlphaFold DB, not a value recomputed per request.

Internally, the tool strips whitespace from the supplied accession and uppercases it, builds the AlphaFold DB CSV URL for the requested coordinate system, and issues a single HTTP GET. The uniprot coordinate system fetches the aa-substitutions CSV, which holds the full protein-coordinate grid of every possible substitution. The hg19 and hg38 coordinate systems fetch the genomic CSVs, which cover only substitutions reachable by a single-nucleotide change (a single-nucleotide variant, SNV) and additionally carry chromosome, position, reference allele, alternate allele, and GENCODE transcript identifier. Each CSV row is parsed into one prediction record, with the genomic fields populated only in genomic mode. A 404 response means the accession is not covered by AlphaMissense, and the tool reports it as an explicit error.
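The accession normalization, URL construction, and 404 handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the toolkit's actual code: the names build_csv_url and fetch_csv are hypothetical, and the hg19 and hg38 file suffixes are assumed to match the coordinate-system names (only the aa-substitutions suffix is named in the text).

```python
import urllib.error
import urllib.request

BASE_URL = "https://alphafold.ebi.ac.uk/files"

# File suffix per coordinate system. "aa-substitutions" is taken from the text;
# the "hg19" and "hg38" suffixes are an ASSUMPTION made for this sketch.
SUFFIXES = {"uniprot": "aa-substitutions", "hg19": "hg19", "hg38": "hg38"}


def build_csv_url(accession: str, coordinate_system: str = "uniprot") -> str:
    """Strip and uppercase the accession, then fill in the URL template."""
    if coordinate_system not in SUFFIXES:
        raise ValueError(f"Unknown coordinate system: {coordinate_system!r}")
    normalized = accession.strip().upper()
    return f"{BASE_URL}/AF-{normalized}-F1-{SUFFIXES[coordinate_system]}.csv"


def fetch_csv(accession: str, coordinate_system: str = "uniprot",
              timeout: float = 600.0) -> str:
    """Issue a single HTTP GET and turn a 404 into an explicit error."""
    url = build_csv_url(accession, coordinate_system)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            raise LookupError(
                f"{accession!r} is not covered by AlphaMissense ({url})"
            ) from err
        raise
```

Keeping URL construction separate from the network call makes the normalization step easy to check without touching AlphaFold DB.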

Learning Resources

  • AlphaFold DB FAQ (EMBL-EBI) - official documentation covering the AlphaMissense CSV files, coverage, and the genomic and protein coordinate variants.
  • AlphaMissense GitHub (Google DeepMind) - the official repository with usage notes for the released code and prediction tables.

Tools

AlphaMissense DB Fetch (alphamissense-db-fetch)

Retrieves the complete AlphaMissense prediction set for a single human UniProt accession and returns every per-substitution prediction, the prediction count, the mean pathogenicity score, and the source CSV URL. The coordinate_system configuration selects the protein-coordinate grid of every possible substitution or one of the genomic-coordinate tables, which are limited to substitutions reachable by a single-nucleotide change. The genomic tables additionally populate chromosome, position, reference allele, alternate allele, and transcript identifier on each prediction.
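As a rough sketch of how rows from the two kinds of table could map onto one record type, the example below parses CSV text into records and fills the genomic fields only in genomic mode. Everything here is assumed for illustration: the Prediction dataclass, the parse_predictions helper, and the column names (protein_variant, am_pathogenicity, am_class, CHROM, POS, REF, ALT, transcript_id) are not confirmed by this page. The sample rows are made-up placeholders, not real AlphaMissense values.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    """Hypothetical record type; the toolkit's real class may differ."""
    protein_variant: str
    pathogenicity: float
    classification: str
    # Genomic fields stay None for protein-coordinate fetches.
    chrom: Optional[str] = None
    pos: Optional[int] = None
    ref: Optional[str] = None
    alt: Optional[str] = None
    transcript_id: Optional[str] = None


def parse_predictions(csv_text: str, genomic: bool) -> List[Prediction]:
    """Parse one CSV row into one record (column names are assumptions)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = Prediction(
            protein_variant=row["protein_variant"],
            pathogenicity=float(row["am_pathogenicity"]),
            classification=row["am_class"],
        )
        if genomic:
            record.chrom = row["CHROM"]
            record.pos = int(row["POS"])
            record.ref = row["REF"]
            record.alt = row["ALT"]
            record.transcript_id = row["transcript_id"]
        records.append(record)
    return records


# Made-up rows for illustration only; values and labels are placeholders.
PROTEIN_CSV = (
    "protein_variant,am_pathogenicity,am_class\n"
    "M1A,0.25,placeholder_class\n"
    "M1C,0.75,placeholder_class\n"
)
GENOMIC_CSV = (
    "CHROM,POS,REF,ALT,transcript_id,protein_variant,am_pathogenicity,am_class\n"
    "chr1,100,A,G,TRANSCRIPT_PLACEHOLDER,M1V,0.5,placeholder_class\n"
)

protein_records = parse_predictions(PROTEIN_CSV, genomic=False)
genomic_records = parse_predictions(GENOMIC_CSV, genomic=True)
```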

API Reference

Input
uniprot_id (string, required): UniProt accession (must be a human protein covered by AlphaMissense; e.g. ‘P04637’).
Configuration
coordinate_system (enum, default: "uniprot"): Which AFDB CSV to fetch. "uniprot" (default) returns the full protein-coordinate saturation grid (~7,500 rows for TP53). "hg19" / "hg38" return SNV-accessible substitutions in genomic coordinates (~2,500 rows for TP53) and populate chrom / pos / ref / alt / transcript_id on each prediction; a protein variant reachable by multiple SNVs appears multiple times. Available options: uniprot, hg19, hg38.
verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cpu"): Device to run the tool on.
timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.
seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output
uniprot_accession (string, required): UniProt accession that was looked up.
predictions (List[AlphaMissensePrediction]): All per-substitution pathogenicity predictions in the source CSV (full saturation grid: sequence_length * 19 for UniProt-coordinate fetches).
num_predictions (integer, required): Number of predictions in the source CSV.
mean_pathogenicity (number): Mean pathogenicity score across all predictions; None when predictions is empty.
source_url (string, required): URL of the AlphaMissense CSV that was fetched.
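As a quick check on the row counts quoted above: the canonical TP53 sequence (P04637) is 393 residues long, so the full protein-coordinate grid holds 393 * 19 = 7,467 rows, which matches the "~7,500 rows" figure. The sketch below shows that arithmetic and how a mean score relates to a list of scores, including the empty case. The helper names are hypothetical and the scores are arbitrary illustrative numbers.

```python
from typing import List, Optional


def saturation_grid_size(sequence_length: int) -> int:
    """Every position has 19 alternate amino acids."""
    return sequence_length * 19


def mean_pathogenicity(scores: List[float]) -> Optional[float]:
    """Mean of all scores, or None when there are no predictions."""
    if not scores:
        return None
    return sum(scores) / len(scores)


print(saturation_grid_size(393))         # 7467
print(mean_pathogenicity([0.25, 0.75]))  # 0.5
print(mean_pathogenicity([]))            # None
```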

Applications

Use this to pull model-based missense pathogenicity into a pipeline: triage missense variants of uncertain significance from clinical sequencing, prioritize candidate disease-causing variants from case cohorts, avoid disruptive substitutions during sequence design or optimization, or apply a per-residue pathogenicity penalty in a generative-design loop. The accession can come from the UniProt tool, which resolves a gene symbol and organism to a canonical reviewed human accession. The same accession also drives the AlphaFold DB tool, aligning per-residue pathogenicity scores with predicted backbone coordinates.

Usage Tips

  • The tool always returns the entire prediction set. There is no server-side filtering on the static CSV. Filter the returned predictions list in your own code by position, score, or classification.
  • Cache the output once per accession. A typical protein has roughly 7,000 to 20,000 substitution rows. Fetch once and reuse the result rather than refetching inside tight loops.
  • Group by position for hotspot analysis. mean_pathogenicity over a wide region is a coarse summary. Inspect predictions grouped by residue position to surface hotspots.
  • Coverage is human canonical isoforms only. Non-canonical isoforms and non-human accessions return a 404 and surface as a clear error. Resolve the accession with the UniProt tool first if the organism is uncertain.
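The client-side filtering and per-position grouping suggested in the tips above might look like the following sketch. The Row type and its field names are hypothetical stand-ins for whatever the returned prediction objects expose, and the threshold is a caller-supplied value, not an AlphaMissense default.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Row:
    """Hypothetical minimal view of one prediction."""
    position: int  # 1-based residue position
    alt: str       # alternate amino acid
    score: float   # pathogenicity score


def filter_region(rows: Iterable[Row], start: int, end: int,
                  min_score: float = 0.0) -> List[Row]:
    """Keep rows inside [start, end] whose score is at least min_score."""
    return [r for r in rows
            if start <= r.position <= end and r.score >= min_score]


def mean_score_by_position(rows: Iterable[Row]) -> Dict[int, float]:
    """Average the scores of all substitutions at each residue position."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r.position].append(r.score)
    return {pos: mean(scores) for pos, scores in sorted(grouped.items())}


def hotspots(rows: Iterable[Row], threshold: float) -> List[int]:
    """Positions whose mean score reaches the caller-chosen threshold."""
    return [pos for pos, avg in mean_score_by_position(rows).items()
            if avg >= threshold]
```

Grouping by position before averaging keeps a few high-scoring residues from being diluted by a long, mostly tolerant region.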

Toolkit Notes

These apply to every AlphaMissense DB tool in this toolkit (alphamissense-db-fetch).
  • Requires network access. The tool downloads the AlphaMissense CSV from the AlphaFold Protein Structure Database. It does not run offline and keeps no local copy of the predictions.
  • Subject to EMBL-EBI fair use. The CSV is an anonymous static download from AlphaFold DB with no API key or account. Observe the EMBL-EBI terms of use and space out high-volume requests.
Example notebook: See the full working example for a copy-paste-ready walkthrough.
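One way to fetch each accession only once and to space out requests is sketched below. The fetch argument is a stand-in for whatever callable wraps the tool, the function name fetch_all_once is hypothetical, and the one-second delay is an arbitrary placeholder, not a figure taken from the EMBL-EBI terms of use.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all_once(accessions: Iterable[str],
                   fetch: Callable[[str], object],
                   delay_seconds: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, object]:
    """Fetch each distinct accession once, pausing between network calls."""
    results: Dict[str, object] = {}
    for accession in accessions:
        key = accession.strip().upper()
        if key in results:
            continue  # already fetched; reuse the stored result
        if results:
            sleep(delay_seconds)  # space out consecutive requests
        results[key] = fetch(key)
    return results
```

Passing sleep in as a parameter lets the pause be replaced in tests, so the loop can be exercised without waiting or touching the network.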

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.