AlphaMissense DB

License: AlphaMissense DB retrieves data from the AlphaFold Protein Structure Database, distributed under CC-BY-4.0. Attribution to the AlphaFold Protein Structure Database is required when the data is redistributed. The client wrapper code is Apache-2.0-licensed. Please refer to the data terms for full terms.

Proto is not affiliated with Google DeepMind. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


google-deepmind/alphamissense


Accurate proteome-wide missense variant effect prediction with AlphaMissense

Jun Cheng, Guido Novati, … Žiga Avsec

Science (2023)


@article{cheng2023alphamissense,
  title={Accurate proteome-wide missense variant effect prediction with {AlphaMissense}},
  author={Cheng, Jun and Novati, Guido and Pan, Joshua and Bycroft, Clare and {\v{Z}}emgulyt{\.{e}}, Akvil{\.{e}} and Applebaum, Taylor and Pritzel, Alexander and Wong, Lai Hong and Zielinski, Michal and Sargeant, Tobias and Schneider, Rosalia G. and Senior, Andrew W. and Jumper, John and Hassabis, Demis and Kohli, Pushmeet and Avsec, {\v{Z}}iga},
  journal={Science},
  volume={381},
  number={6664},
  pages={eadg7492},
  year={2023},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adg7492}
}


proto-bio/proto-tools/proto_tools/tools/database_retrieval/alphamissense_db


Function	Description
`run_alphamissense_db_fetch()`	Fetch per-residue, per-substitution AlphaMissense pathogenicity scores for a human UniProt access…

Background

AlphaMissense (Cheng et al., 2023) is a deep-learning model that scores the pathogenicity of human missense variants. It is adapted from AlphaFold and fine-tuned on human and primate population variant frequencies, treating variants common in healthy populations as benign and rare variants as putatively pathogenic. For each canonical UniProt sequence it scores all 19 alternate amino acids at every position, covering every possible single missense substitution. Its classification thresholds are chosen to reach about 90% precision on ClinVar variants. The paper reports that the model classifies 89% of all 71 million possible human missense variants, labeling 32% likely pathogenic and 57% likely benign at the default thresholds.

The predictions are not computed at query time. They are precomputed by Google DeepMind and distributed as static CSV files by the AlphaFold Protein Structure Database, maintained by EMBL-EBI, keyed by UniProt accession at https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-{suffix}.csv.

Internally, the tool strips and uppercases the supplied accession, builds the AlphaFold DB CSV URL for the requested coordinate system, and issues a single HTTP GET. The uniprot coordinate system fetches the aa-substitutions CSV, which holds the full protein-coordinate grid of every possible substitution. The hg19 and hg38 coordinate systems fetch the genomic CSVs, which cover only substitutions reachable by a single-nucleotide change (a single-nucleotide variant, SNV) and additionally carry chromosome, position, reference allele, alternate allele, and GENCODE transcript identifier. Each CSV row is parsed into one prediction record, with the genomic fields populated only in genomic mode. A 404 response means the accession is not covered and surfaces as a clear error. Predictions reflect the fixed CSV snapshot published by AlphaFold DB rather than a value recomputed per request.
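The accession normalization, URL construction, and row parsing described above can be sketched roughly as follows. This is not the tool's actual source. The function names are hypothetical, the genomic suffixes `hg19` and `hg38` are assumed to mirror the coordinate-system names, and the CSV column names (`protein_variant`, `am_pathogenicity`, `am_class`) are assumptions about the file layout. The sample row uses illustrative values, not real predictions.

```python
import csv
import io

AFDB_URL_TEMPLATE = "https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-{suffix}.csv"

# "aa-substitutions" is named in the text above; the genomic suffixes are
# assumptions that simply reuse the coordinate-system names.
SUFFIX_BY_COORDINATE_SYSTEM = {
    "uniprot": "aa-substitutions",
    "hg19": "hg19",
    "hg38": "hg38",
}


def build_csv_url(uniprot_id: str, coordinate_system: str = "uniprot") -> str:
    """Strip and uppercase the accession, then fill in the URL template."""
    accession = uniprot_id.strip().upper()
    suffix = SUFFIX_BY_COORDINATE_SYSTEM[coordinate_system]
    return AFDB_URL_TEMPLATE.format(accession=accession, suffix=suffix)


def parse_protein_rows(csv_text: str) -> list[dict]:
    """Parse each CSV row into one prediction record (assumed column names)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        variant = row["protein_variant"]  # assumed form: wild type, position, alternate
        records.append({
            "position": int(variant[1:-1]),
            "wild_type_aa": variant[0],
            "alt_aa": variant[-1],
            "pathogenicity_score": float(row["am_pathogenicity"]),
            "classification": row["am_class"],
        })
    return records


# Illustrative row only; the score and label are made up.
sample_csv = "protein_variant,am_pathogenicity,am_class\nA12C,0.25,likely_benign\n"
print(build_csv_url(" p04637 "))
print(parse_protein_rows(sample_csv))
```

Only the URL builder and parser are shown; the single HTTP GET and the 404 handling are left out so the sketch runs without network access.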

Learning Resources

AlphaFold DB FAQ (EMBL-EBI) - official documentation covering the AlphaMissense CSV files, coverage, and the genomic and protein coordinate variants.
AlphaMissense GitHub (Google DeepMind) - the official repository with usage notes for the released code and prediction tables.

Tools

AlphaMissense DB Fetch (`alphamissense-db-fetch`)

Retrieves the complete AlphaMissense prediction set for a single human UniProt accession and returns every per-substitution prediction, the prediction count, the mean pathogenicity score, and the source CSV URL. The coordinate_system configuration selects the protein-coordinate grid of every possible substitution or one of the genomic-coordinate tables, which are limited to substitutions reachable by a single-nucleotide change. The genomic tables additionally populate chromosome, position, reference allele, alternate allele, and transcript identifier on each prediction.

API Reference

Input: AlphaMissenseDBFetchInput

uniprot_id

string

required

UniProt accession (must be a human protein covered by AlphaMissense; e.g. ‘P04637’).

Config: AlphaMissenseDBFetchConfig

coordinate_system

enum

default:"uniprot"

Which AFDB CSV to fetch. "uniprot" (default) returns the full protein-coordinate saturation grid (~7,500 rows for TP53). "hg19" / "hg38" return SNV-accessible substitutions in genomic coordinates (~2,500 rows for TP53) and populate chrom / pos / ref / alt / transcript_id on each prediction; a protein variant reachable by multiple SNVs appears multiple times. Available options: uniprot, hg19, hg38

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
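Because the hg19 and hg38 settings of coordinate_system can list the same protein variant more than once, a caller who wants one row per protein variant may need to collapse duplicates. A minimal sketch, assuming each prediction has been converted to a plain dict keyed by the documented field names; the helper name and the sample rows (including their genomic positions and scores) are made up for illustration.

```python
def unique_protein_variants(predictions: list[dict]) -> list[dict]:
    """Keep the first row seen for each (position, wild_type_aa, alt_aa)."""
    seen = set()
    unique = []
    for record in predictions:
        key = (record["position"], record["wild_type_aa"], record["alt_aa"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up genomic rows: two different SNVs yield the same protein variant.
genomic_rows = [
    {"position": 10, "wild_type_aa": "L", "alt_aa": "F", "pathogenicity_score": 0.5,
     "chrom": "chr17", "pos": 1000, "ref": "G", "alt": "A"},
    {"position": 10, "wild_type_aa": "L", "alt_aa": "F", "pathogenicity_score": 0.5,
     "chrom": "chr17", "pos": 1000, "ref": "G", "alt": "T"},
    {"position": 11, "wild_type_aa": "K", "alt_aa": "R", "pathogenicity_score": 0.25,
     "chrom": "chr17", "pos": 1004, "ref": "A", "alt": "G"},
]
print(len(unique_protein_variants(genomic_rows)))
```

Keeping the first row discards the allele details of the later duplicates, so this only suits analyses that work at the protein level.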

Output: AlphaMissenseDBFetchOutput

uniprot_accession

string

required

UniProt accession that was looked up.

predictions

List[AlphaMissensePrediction]

All per-substitution pathogenicity predictions in the source CSV (full saturation grid: sequence_length * 19 for UniProt-coordinate fetches).

AlphaMissensePrediction fields

position

integer

required

1-indexed residue position in the canonical UniProt sequence.

wild_type_aa

string

required

Single-letter wild-type amino acid at this position.

alt_aa

string

required

Single-letter alternate amino acid being scored.

pathogenicity_score

number

required

AlphaMissense pathogenicity score (0.0-1.0). Higher values indicate the variant is more likely to be pathogenic.

classification

enum

required

AlphaMissense class label (‘likely_benign’, ‘ambiguous’, or ‘likely_pathogenic’).

chrom

string

Chromosome (e.g. ‘chr17’); populated only for genomic-coordinate fetches.

pos

integer

1-indexed genomic position; populated only for genomic-coordinate fetches.

ref

string

Reference allele; populated only for genomic-coordinate fetches.

alt

string

Alternate allele; populated only for genomic-coordinate fetches.

transcript_id

string

GENCODE transcript identifier; populated only for genomic-coordinate fetches.

num_predictions

integer

required

Number of predictions in the source CSV.

mean_pathogenicity

number

Mean pathogenicity score across all predictions; None when predictions is empty.

source_url

string

required

URL of the AlphaMissense CSV that was fetched.
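The summary fields above follow directly from the predictions list. The sketch below uses a local stand-in dataclass rather than the toolkit's own classes, and the helper names are hypothetical. It shows how num_predictions and mean_pathogenicity relate to the list, and the sequence_length * 19 row count expected from a UniProt-coordinate fetch.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional


@dataclass
class Prediction:
    """Local stand-in holding the required per-substitution fields."""
    position: int
    wild_type_aa: str
    alt_aa: str
    pathogenicity_score: float
    classification: str


def summarize(predictions: list[Prediction]) -> tuple[int, Optional[float]]:
    """Return (num_predictions, mean_pathogenicity); the mean is None when empty."""
    if not predictions:
        return 0, None
    return len(predictions), fmean(p.pathogenicity_score for p in predictions)


def saturation_grid_size(sequence_length: int) -> int:
    """Rows in the protein-coordinate grid: 19 alternates per residue."""
    return sequence_length * 19


# Illustrative records with made-up scores.
example = [
    Prediction(1, "M", "A", 0.25, "likely_benign"),
    Prediction(1, "M", "C", 0.75, "likely_pathogenic"),
]
print(summarize(example))
print(saturation_grid_size(393))  # 393 residues is the canonical TP53 length
```

For a 393-residue protein the grid has 7,467 rows, which matches the "~7,500 rows for TP53" figure given for the uniprot coordinate system.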

Applications

Use this to pull model-based missense pathogenicity into a pipeline: triage missense variants of uncertain significance from clinical sequencing, prioritize candidate disease-causing variants from case cohorts, avoid disruptive substitutions during sequence design or optimization, or apply a per-residue pathogenicity penalty in a generative-design loop. The accession can come from the UniProt tool, which resolves a gene symbol and organism to a canonical reviewed human accession. The same accession also drives the AlphaFold DB tool, aligning per-residue pathogenicity scores with predicted backbone coordinates.
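One way a per-residue pathogenicity penalty could be wired into a design loop is sketched below. It is a heuristic illustration, not a method from the toolkit or the paper: the function name is hypothetical, the score lookup is assumed to be a dict built from the returned predictions and keyed by (position, alt_aa), and summing single-substitution scores over several mutations is a simplification, since AlphaMissense scores each substitution in isolation.

```python
def mutation_penalty(
    wild_type: str,
    candidate: str,
    scores: dict[tuple[int, str], float],
) -> float:
    """Sum pathogenicity scores over positions where candidate differs from wild type."""
    if len(wild_type) != len(candidate):
        raise ValueError("sequences must have equal length")
    penalty = 0.0
    # Positions are 1-indexed to match the documented `position` field.
    for position, (wt_aa, alt_aa) in enumerate(zip(wild_type, candidate), start=1):
        if wt_aa != alt_aa:
            # Substitutions missing from the lookup contribute nothing.
            penalty += scores.get((position, alt_aa), 0.0)
    return penalty


# Made-up scores for a toy three-residue sequence.
toy_scores = {(2, "A"): 0.75, (3, "S"): 0.125}
print(mutation_penalty("MKT", "MAS", toy_scores))
```

A design loop could subtract a weighted version of this penalty from its objective so that candidates carrying high-scoring substitutions are ranked lower.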

Usage Tips

The tool always returns the entire prediction set. There is no server-side filtering on the static CSV. Filter the returned predictions list in your own code by position, score, or classification.
Cache the output once per accession. A typical protein has roughly 7,000 to 20,000 substitution rows. Fetch once and reuse the result rather than refetching inside tight loops.
Group by position for hotspot analysis. mean_pathogenicity over a wide region is a coarse summary. Inspect predictions grouped by residue position to surface hotspots.
Coverage is human canonical isoforms only. Non-canonical isoforms and non-human accessions return a 404 and surface as a clear error. Resolve the accession with the UniProt tool first if the organism is uncertain.
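The client-side filtering and per-position grouping suggested in these tips might look like the following. It is a sketch under assumptions: predictions are taken to be plain dicts keyed by the documented field names, the helper names are hypothetical, and the 0.5 hotspot threshold is an arbitrary illustration rather than a recommended cutoff.

```python
from collections import defaultdict
from statistics import fmean


def filter_predictions(predictions, min_score=0.0, classification=None):
    """Client-side filter by minimum score and, optionally, class label."""
    return [
        p for p in predictions
        if p["pathogenicity_score"] >= min_score
        and (classification is None or p["classification"] == classification)
    ]


def mean_score_by_position(predictions):
    """Average the scores of all substitutions at each residue position."""
    by_position = defaultdict(list)
    for p in predictions:
        by_position[p["position"]].append(p["pathogenicity_score"])
    return {pos: fmean(scores) for pos, scores in sorted(by_position.items())}


def hotspot_positions(predictions, threshold=0.5):
    """Positions whose mean score meets the (arbitrary) threshold."""
    means = mean_score_by_position(predictions)
    return [pos for pos, mean in means.items() if mean >= threshold]


# Made-up records for two positions.
records = [
    {"position": 1, "pathogenicity_score": 0.25, "classification": "likely_benign"},
    {"position": 1, "pathogenicity_score": 0.25, "classification": "likely_benign"},
    {"position": 2, "pathogenicity_score": 0.75, "classification": "likely_pathogenic"},
    {"position": 2, "pathogenicity_score": 1.0, "classification": "likely_pathogenic"},
]
print(mean_score_by_position(records))
print(hotspot_positions(records))
```

Grouping by position first and thresholding the per-position mean is one simple choice; a maximum or a count of likely_pathogenic substitutions per position would work equally well as a hotspot signal.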

Toolkit Notes

These apply to every AlphaMissense DB tool in this toolkit (alphamissense-db-fetch).

Requires network access. The tool downloads the AlphaMissense CSV from the AlphaFold Protein Structure Database. It does not run offline and keeps no local copy of the predictions.
Subject to EMBL-EBI fair use. The CSV is an anonymous static download from AlphaFold DB with no API key or account. Observe the EMBL-EBI terms of use and space out high-volume requests.
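Since the tool keeps no local copy and high-volume requests should be spaced out, a caller could wrap its fetch step in a small in-memory cache with a minimum delay between uncached requests. The class below is a hypothetical sketch: the `fetch` callable stands in for whatever performs the real lookup, and the one-second default interval is an arbitrary placeholder, not a figure from the EMBL-EBI terms.

```python
import time


class AccessionCache:
    """Cache results per accession and pause between uncached fetches."""

    def __init__(self, fetch, min_interval_s=1.0):
        self._fetch = fetch  # callable taking a normalized accession
        self._min_interval_s = min_interval_s
        self._results = {}
        self._last_request = None

    def get(self, uniprot_id):
        key = uniprot_id.strip().upper()
        if key in self._results:
            return self._results[key]  # served from memory, no request made
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            wait = self._min_interval_s - elapsed
            if wait > 0:
                time.sleep(wait)
        result = self._fetch(key)
        self._last_request = time.monotonic()
        self._results[key] = result
        return result


# Stand-in fetch function that records which accessions were requested.
requested = []


def fake_fetch(accession):
    requested.append(accession)
    return {"uniprot_accession": accession}


cache = AccessionCache(fake_fetch, min_interval_s=0.0)
cache.get("p04637")
cache.get(" P04637 ")
print(requested)
```

The cache lives only for the lifetime of the process; persisting results to disk would be a separate design decision.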

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
