License: Foldseek has a GPL-3.0 license. Please refer to the license for full terms.

Proto is not affiliated with the Steinegger Lab. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


steineggerlab/foldseek
Foldseek enables fast and sensitive comparisons of large structure sets.
search.foldseek.com
Fast and accurate protein structure search with Foldseek
Michel van Kempen, Stephanie S. Kim, … Martin Steinegger
Nature Biotechnology (2024)
@article{vanKempen2023foldseek,
  title={Fast and accurate protein structure search with {F}oldseek},
  author={van Kempen, Michel and Kim, Stephanie S. and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeongjae and Gilchrist, Cameron L. M. and S{\"o}ding, Johannes and Steinegger, Martin},
  journal={Nature Biotechnology},
  volume={42},
  pages={243--246},
  year={2024},
  publisher={Nature Publishing Group},
  doi={10.1038/s41587-023-01773-0}
}

@article{kim2025foldseekmultimer,
  title={Rapid and sensitive protein complex alignment with {F}oldseek-{M}ultimer},
  author={Kim, Woosub and Mirdita, Milot and Levy Karin, Eli and Gilchrist, Cameron L. M. and Schweke, Hugo and S{\"o}ding, Johannes and Levy, Emmanuel D. and Steinegger, Martin},
  journal={Nature Methods},
  volume={22},
  pages={469--472},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41592-025-02593-7}
}
proto-bio/proto-tools/proto_tools/tools/structure_alignment/foldseek
Function | Description
run_foldseek_cluster() | Cluster a set of protein structures by structural similarity using Foldseek easy-cluster
run_foldseek_multimer_search() | Search Foldseek multimer (complex) structural homology, remote (server) or local (CLI)
run_foldseek_multimercluster() | Cluster a set of protein complexes by multimer-level structural similarity using Foldseek easy-mu…
run_foldseek_rbh() | Find reciprocal best-hit structural alignments between a query and a target DB using Foldseek eas…
run_foldseek_search() | Search Foldseek structural homology against PDB100/AlphaFold DB (remote) or a local DB (local)

Background

Foldseek (van Kempen et al., 2024) performs structural homology search, identifying distant evolutionary relatives of a query protein by structural similarity rather than sequence similarity. Each residue of a protein structure is represented as a discrete letter over a learned structural alphabet (the 3Di alphabet) that captures the tertiary interactions between that residue and its spatial neighbours. Pairs of structures are then aligned by running MMseqs2-style sensitive sequence alignment over the 3Di strings together with the underlying amino-acid sequences. The original publication reports that this approach decreases computation times by four to five orders of magnitude relative to the established structural aligners Dali, TM-align, and CE. Foldseek can also accept amino-acid sequences directly, in which case the bundled ProstT5 language model predicts a 3Di sequence before alignment.

Foldseek-Multimer (Kim et al., 2025) extends the same machinery to multi-chain complexes. It computes pairwise chain-to-chain alignments and then clusters their superposition vectors to identify mutually compatible chain pairs. The multimer publication reports speedups of three to four orders of magnitude over the gold-standard multimer aligner while producing comparable alignments, and demonstrates that the method aligns billions of complex pairs within 11 hours of compute.

The Foldseek codebase is released as open source by the Steinegger Lab at steineggerlab/foldseek, and the same group operates a public web service at search.foldseek.com that the remote execution modes of this toolkit target.

Learning Resources

  • steineggerlab/foldseek (Steinegger Lab, Seoul National University). Official repository and command-line interface for easy-search, easy-cluster, easy-multimersearch, easy-multimercluster, and easy-rbh.
  • search.foldseek.com (Steinegger Lab). The public web service that the remote execution mode targets.

Tools

Foldseek Cluster (foldseek-cluster)

Groups a set of structures into clusters by 3Di structural similarity using foldseek easy-cluster. Inputs can be structure text (PDB or mmCIF) or amino-acid sequences (FASTA). The latter are routed through the bundled ProstT5 language model, which predicts a 3Di sequence per input before clustering proceeds.

API Reference

Source
structures
List[Structure | string] | string | Path
Items to cluster (≥2) — a list of Structure objects / file paths / PDB·mmCIF·FASTA text, or a directory path (filename stems become structure_ids).
structure_ids
array
Optional IDs for the list form (default structure_0, …); derived from filename stems for a directory.
Source
min_seq_id
number
default:"0.0"
Sequence-identity threshold (0-1). Default 0.0 because Foldseek clusters by 3Di structural similarity, not sequence identity.
cov
number
default:"0.8"
Coverage threshold (0-1) for the alignment.
cov_mode
enum
default:"0"
Foldseek coverage mode (0: bidirectional, 1: target-only, 2: query-only). Available options: 0, 1, 2
evalue
number
default:"0.01"
E-value cutoff for cluster-membership alignments (lower = stricter; default 0.01 matches the foldseek cluster workflow’s runtime default).
alignment_type
enum
default:"2"
Alignment scoring method (0=3Di, 1=TMalign, 2=3Di+AA, 3=LoL). Available options: 0, 1, 2, 3
tmscore_threshold
number
default:"0.0"
Keep cluster-membership alignments with TM-score above this (0-1). 0.0 keeps all.
lddt_threshold
number
default:"0.0"
Keep cluster-membership alignments with LDDT above this (0-1). 0.0 keeps all.
prostt5_weights_dir
string
Path to ProstT5 model weights for FASTA inputs. If None, weights are auto-provisioned under resolve_weights_dir("foldseek")/prostt5/weights on first FASTA call (honors PROTO_FOLDSEEK_WEIGHTS_DIR / PROTO_MODEL_CACHE).
num_threads
integer
default:"4"
CPU threads.
use_gpu
boolean
default:"False"
Run with --gpu 1 on a Linux x86_64 NVIDIA GPU host (driver >= 525.60.13).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
clusters
List[FoldseekCluster]
One entry per cluster, each holding a representative and its members.
num_clusters
integer
required
len(clusters).
num_structures
integer
required
Total number of input structures clustered.

Applications

This tool is appropriate for deduplicating a set of designed structures before downstream analysis, for surveying fold families across a screened library, and for partitioning a large structure collection into representative groups for further inspection. Clusters with a single member identify structurally isolated entries that share no near-neighbour in the input set.
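The singleton check described above can be sketched as follows. This is a minimal illustration, not the toolkit's own code: ClusterSketch is a hypothetical stand-in for FoldseekCluster, the field names representative and members are assumed, and the sketch assumes the representative is listed among its own members.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ClusterSketch:
    """Hypothetical stand-in for FoldseekCluster; field names are assumed."""
    representative: str
    members: List[str] = field(default_factory=list)


def split_singletons(
    clusters: List[ClusterSketch],
) -> Tuple[List[str], List[ClusterSketch]]:
    """Separate structurally isolated entries from multi-member clusters."""
    # Assumption: a cluster's members list includes its representative,
    # so a singleton cluster has exactly one member.
    singletons = [c.representative for c in clusters if len(c.members) <= 1]
    groups = [c for c in clusters if len(c.members) > 1]
    return singletons, groups


# Illustrative IDs only; these are not real results.
clusters = [
    ClusterSketch("design-a", ["design-a", "design-b", "design-c"]),
    ClusterSketch("design-d", ["design-d"]),
]
singletons, groups = split_singletons(clusters)
print(singletons)                          # ['design-d']
print([g.representative for g in groups])  # ['design-a']
```

Keeping only the representatives of the multi-member groups, plus the singletons, gives a deduplicated set for downstream analysis.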

Usage Tips

  • structures accepts either a list or a directory path. Provide an in-memory list of structure or FASTA text strings (Structure objects and file paths are also accepted per item), or a single path to a directory of supported files, in which case filename stems become the structure identifiers.
  • A single call must use one input format. Mixing FASTA inputs with PDB or mmCIF inputs is rejected by input validation. Format is auto-detected per input entry.
  • min_seq_id=0.0 is intentional and lets 3Di structural similarity dominate cluster assignment. Raising it adds a sequence-identity floor to cluster membership. Use a non-zero value only when a sequence-similarity constraint is desired alongside structural similarity.
  • There is no parameter that requests an exact cluster count. Foldseek clusters by similarity threshold, not by a target count. To approximate a target number of clusters, sweep the cov field and select the run whose cluster count is closest to the target.
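The cov sweep from the last tip can be sketched as below. The helper takes any callable that maps a cov value to a cluster count, so it does not depend on the toolkit's exact call signature. In practice that callable would presumably wrap run_foldseek_cluster and read num_clusters from the result, which is an assumption here. The counts in the example are invented for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_cov_for_target(
    count_clusters: Callable[[float], int],
    cov_values: Iterable[float],
    target: int,
) -> Tuple[float, int, Dict[float, int]]:
    """Run one clustering per cov value and pick the count nearest the target."""
    counts = {cov: count_clusters(cov) for cov in cov_values}
    # Ties on distance to the target are broken towards the smaller cov value.
    best_cov = min(counts, key=lambda cov: (abs(counts[cov] - target), cov))
    return best_cov, counts[best_cov], counts


# Stand-in for real clustering runs; these numbers are made up.
fake_counts = {0.5: 12, 0.6: 15, 0.7: 21, 0.8: 30, 0.9: 44}
best_cov, best_count, counts = sweep_cov_for_target(
    fake_counts.__getitem__, fake_counts, target=20
)
print(best_cov, best_count)  # 0.7 21
```

Each sweep point is a full clustering run, so a coarse grid followed by a finer one around the best value keeps the number of runs small.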

Foldseek Multimer Cluster (foldseek-multimercluster)

Groups a set of multi-chain assemblies into clusters using foldseek easy-multimercluster, which combines per-chain TM-score and interface lDDT into a multimer-level similarity score. Inputs are multi-chain PDB or mmCIF text.

API Reference

Source
structures
List[Structure | string] | string | Path
Multi-chain items to cluster (≥2) — a list of Structure objects / file paths / PDB·mmCIF text, or a directory path (filename stems become structure_ids).
structure_ids
array
Optional IDs for the list form (default multimer-0, …); derived from filename stems for a directory. IDs must not contain an underscore (_).
Source
multimer_tm_threshold
number
default:"0.65"
Maps to --multimer-tm-threshold. Multimer-level TM-score (0-1) above which two multimers cluster together.
chain_tm_threshold
number
default:"0.001"
Maps to --chain-tm-threshold. Per-chain TM-score (0-1) used to filter chain-pair alignments before assembling the multimer score.
interface_lddt_threshold
number
default:"0.5"
Maps to --interface-lddt-threshold. Interface lDDT (0-1) for chain-pair alignments.
alignment_type
enum
default:"2"
Alignment scoring method (0=3Di, 1=TMalign, 2=3Di+AA, 3=LoL). Available options: 0, 1, 2, 3
tmscore_threshold
number
default:"0.0"
Keep chain-pair alignments with TM-score above this (0-1). 0.0 keeps all.
lddt_threshold
number
default:"0.0"
Keep chain-pair alignments with LDDT above this (0-1). 0.0 keeps all.
num_threads
integer
default:"4"
CPU threads.
use_gpu
boolean
default:"False"
Run with --gpu 1 on a Linux x86_64 NVIDIA GPU host (driver >= 525.60.13).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
clusters
List[FoldseekCluster]
One entry per cluster, each holding a representative multimer and its members. Member IDs may include {multimer_id}_{chain} suffixes per Foldseek’s chain-aware schema.
num_clusters
integer
required
len(clusters).
num_multimers
integer
required
Total number of input multimers clustered.
rep_seq_fasta
string
required
Representative-multimer FASTA produced by Foldseek (with #multimer_id group separators between chains).
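The three threshold fields are documented as mapping to Foldseek command-line flags. The sketch below shows how such a command line could be assembled. It is an assumption about the shape of the call, not the toolkit's implementation: build_multimercluster_cmd is a hypothetical helper, and the positional order (input, output prefix, temporary directory) and the --threads flag follow the upstream Foldseek command-line conventions as understood here.

```python
from typing import List


def build_multimercluster_cmd(
    input_dir: str,
    output_prefix: str,
    tmp_dir: str,
    multimer_tm_threshold: float = 0.65,
    chain_tm_threshold: float = 0.001,
    interface_lddt_threshold: float = 0.5,
    num_threads: int = 4,
) -> List[str]:
    """Assemble an argv list for foldseek easy-multimercluster (hypothetical helper)."""
    return [
        "foldseek", "easy-multimercluster",
        input_dir, output_prefix, tmp_dir,
        "--multimer-tm-threshold", str(multimer_tm_threshold),
        "--chain-tm-threshold", str(chain_tm_threshold),
        "--interface-lddt-threshold", str(interface_lddt_threshold),
        "--threads", str(num_threads),
    ]


# Paths are placeholders; the command is only printed, not executed.
cmd = build_multimercluster_cmd("complexes/", "clu", "tmp")
print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to subprocess.run.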

Applications

This tool is appropriate for partitioning a candidate set of designed complexes by overall complex geometry, for selecting structurally diverse representatives from a larger pool of binder-target poses, and for analysing the structural diversity of an experimentally determined complex collection.

Usage Tips

  • Structure identifiers must not contain an underscore. Foldseek emits cluster member identifiers as {multimer_id}_{chain}, so an underscore in the multimer identifier would silently corrupt downstream parsing. Both user-supplied and filename-derived identifiers are validated and rejected if they contain an underscore.
  • Three thresholds control cluster membership. multimer_tm_threshold (default 0.65) sets the multimer-level TM-score required for inclusion. chain_tm_threshold (default 0.001) governs the per-chain TM-score required during chain-pair filtering. interface_lddt_threshold (default 0.5) sets the interface quality required for a chain-pair alignment to contribute to the multimer score.
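The underscore rule can be illustrated with a short sketch. Both helper names are hypothetical and the toolkit's own validation may differ. Because the {multimer_id}_{chain} suffix may be absent, an identifier such as my_complex cannot be told apart from multimer my with chain complex, which is why underscores are rejected up front.

```python
from typing import Iterable, Tuple


def validate_multimer_ids(ids: Iterable[str]) -> None:
    """Reject any identifier containing an underscore (hypothetical helper)."""
    bad = [i for i in ids if "_" in i]
    if bad:
        raise ValueError(f"Multimer IDs must not contain '_': {bad}")


def split_member_id(member_id: str) -> Tuple[str, str]:
    """Split a cluster member ID into (multimer_id, chain).

    Returns an empty chain when the ID carries no '_chain' suffix.
    This is only unambiguous if multimer IDs themselves contain no underscore.
    """
    multimer_id, sep, chain = member_id.rpartition("_")
    if not sep:
        return member_id, ""
    return multimer_id, chain


validate_multimer_ids(["multimer-0", "multimer-1"])  # passes silently
print(split_member_id("multimer-0_A"))  # ('multimer-0', 'A')
print(split_member_id("multimer-1"))    # ('multimer-1', '')
```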

Foldseek Reciprocal Best Hits (foldseek-rbh)

Performs a reciprocal-best-hits structural search between a single-chain query and a target database using foldseek easy-rbh. Only mutual best matches are returned, in contrast to the all-hit output of foldseek-search.

API Reference

Source
structure
Structure
required
Single-chain query structure. Accepts a Structure object, a file path, or raw PDB/CIF content.
Source
local_db
string
Path to the target — either a prebuilt Foldseek DB (e.g. /data/pdb100) or a directory of PDB files (Foldseek auto-builds a temporary DB). Required.
evalue
number
default:"10.0"
E-value cutoff (lower = stricter).
sensitivity
number
default:"4.0"
Prefilter sensitivity (1.0-9.5; higher = slower + more sensitive). Default 4.0 matches foldseek’s setStructureRbhDefaults (which, unlike the search workflow, does not bump sensitivity to 9.5).
max_seqs
integer
default:"1000"
Max prefilter targets per query.
alignment_type
enum
default:"2"
Alignment scoring method (0=3Di, 1=TMalign, 2=3Di+AA, 3=LoL). Note: foldseek’s RBH workflow only branches on TMalign (1) and 3Di+AA (2); 0 falls through to the same alignment branch as 2. Available options: 0, 1, 2, 3
cov
number
default:"0.0"
Minimum aligned-residue coverage for an RBH pair (0-1). 0.0 keeps all.
cov_mode
enum
default:"0"
How cov is measured: 0=bidirectional, 1=target-only, 2=query-only. Available options: 0, 1, 2
tmscore_threshold
number
default:"0.0"
Keep RBH pairs with TM-score above this (0-1). 0.0 keeps all.
lddt_threshold
number
default:"0.0"
Keep RBH pairs with LDDT above this (0-1). 0.0 keeps all.
num_threads
integer
default:"4"
CPU threads.
use_gpu
boolean
default:"False"
Run with --gpu 1 on a Linux x86_64 NVIDIA GPU host (driver >= 525.60.13).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
hits
List[FoldseekHit]
Mutual best-hit alignments. Each hit is a standard 12-column M8 row, identical schema to foldseek-search.
num_hits
integer
required
len(hits).
target_db
string
required
The target DB path that was queried.
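As a rough illustration of the 12-column M8 schema, the sketch below parses tab-separated rows into dictionaries and ranks them. The keys sequence_identity, evalue, and bit_score are the names used in the Toolkit Notes. The remaining keys are assumed labels for Foldseek's default output columns (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits) and may not match the FoldseekHit field names. The example rows are invented.

```python
from typing import Dict, List, Union

M8_COLUMNS = (
    "query", "target", "sequence_identity", "alignment_length",
    "mismatches", "gap_openings", "q_start", "q_end",
    "t_start", "t_end", "evalue", "bit_score",
)
INT_COLUMNS = ("alignment_length", "mismatches", "gap_openings",
               "q_start", "q_end", "t_start", "t_end")
FLOAT_COLUMNS = ("sequence_identity", "evalue", "bit_score")

Row = Dict[str, Union[str, int, float]]


def parse_m8_line(line: str) -> Row:
    """Parse one tab-separated 12-column M8 line into a typed dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(M8_COLUMNS):
        raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
    row: Row = dict(zip(M8_COLUMNS, fields))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    for name in FLOAT_COLUMNS:
        row[name] = float(row[name])
    return row


def rank_hits(rows: List[Row]) -> List[Row]:
    """Order hits by ascending E-value, then descending bit score."""
    return sorted(rows, key=lambda r: (r["evalue"], -r["bit_score"]))


# Made-up rows: targetA has low sequence identity but the better E-value.
lines = [
    "query1\ttargetB\t0.900\t60\t6\t0\t1\t60\t1\t60\t3.0e-03\t40",
    "query1\ttargetA\t0.250\t120\t85\t5\t1\t118\t3\t122\t1.2e-08\t95",
]
ranked = rank_hits([parse_m8_line(line) for line in lines])
print([r["target"] for r in ranked])  # ['targetA', 'targetB']
```

Ranking on evalue and bit_score rather than sequence_identity keeps the low-identity structural match at the top.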

Applications

This tool produces conservative one-to-one structural correspondences. It is appropriate for structural orthology calls between species, for mapping designed proteins to their closest natural counterpart in a curated reference set, and for any analysis in which the absence of a reciprocal best match should be interpreted as no confident correspondence.

Usage Tips

  • This tool runs only in local execution mode. No remote endpoint exists for reciprocal best hits, and a local_db value pointing at a prebuilt database or a directory of PDB files is required.
  • The output is sparse by construction. Most queries return zero or one hit, and the absence of a reciprocal best match indicates that no target in the database satisfies the reciprocity criterion.
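The reciprocity criterion can be illustrated conceptually with the sketch below. It is not how foldseek easy-rbh is implemented. It only shows the definition: a pair is kept when each side is the other's top-scoring match. The function names and scores are invented for illustration.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each left-hand ID to its highest-scoring right-hand ID."""
    best: Dict[str, Tuple[str, float]] = {}
    for (left, right), score in scores.items():
        if left not in best or score > best[left][1]:
            best[left] = (right, score)
    return {left: right for left, (right, _) in best.items()}


def reciprocal_best_hits(forward: Scores, reverse: Scores) -> List[Tuple[str, str]]:
    """Keep (query, target) pairs that are each other's best hit."""
    fwd = best_hits(forward)   # query  -> best target
    rev = best_hits(reverse)   # target -> best query
    return sorted((q, t) for q, t in fwd.items() if rev.get(t) == q)


# Made-up scores: q2's best target is t1, but t1's best query is q1.
forward = {("q1", "t1"): 300.0, ("q1", "t2"): 120.0, ("q2", "t1"): 250.0}
reverse = {("t1", "q1"): 300.0, ("t1", "q2"): 250.0, ("t2", "q1"): 120.0}
print(reciprocal_best_hits(forward, reverse))  # [('q1', 't1')]
```

Here q2 returns no hit at all, which matches the sparse output described above.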

Toolkit Notes

These apply to every Foldseek tool in this toolkit (foldseek-search, foldseek-cluster, foldseek-multimer-search, foldseek-multimercluster, foldseek-rbh).
  • Local memory consumption scales linearly with database size. The upstream documentation gives a per-residue cost of (6 + 1 + 1) bytes × num_residues for Cα coordinates, 3Di letters, and amino-acid letters, and reports that the 54 million entries in AFDB50 require approximately 151 GB of RAM under default settings.
  • Hits use a 12-column M8 tabular schema with sequence_identity normalised to the range 0 to 1. Filtering structural hits by sequence identity defeats the purpose of structural search, since distant homologues commonly share fold without sharing sequence. evalue and bit_score are the appropriate ranking criteria.
  • Accepted input formats differ by tool. foldseek-search, foldseek-multimer-search, and foldseek-rbh currently accept only raw PDB text, foldseek-cluster accepts PDB, mmCIF, or FASTA, and foldseek-multimercluster accepts PDB or mmCIF.
  • Local execution requires a user-supplied target. Either a prebuilt Foldseek database or a directory of structure files must be provided through the local_db field. No reference database is bundled with the toolkit.
  • A directory passed to structures caches by file content, not directory path. Modifying files in place between calls correctly invalidates the cache, so structure-set updates do not produce stale results.
  • Local search can use an NVIDIA GPU. Set use_gpu=True on any local-mode tool; the GPU build auto-installs on Linux x86_64 hosts with a compatible NVIDIA driver (>= 525.60.13).
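The per-residue memory figure in the first note lends itself to a quick estimate. The sketch below applies only the (6 + 1 + 1) bytes × num_residues formula. It ignores any index or runtime overhead, treats 1 GB as 10^9 bytes, and uses a made-up database size, so the result is a lower bound for planning rather than a measured value.

```python
BYTES_PER_RESIDUE = 6 + 1 + 1  # C-alpha coordinates + 3Di letter + amino-acid letter


def estimate_db_ram_bytes(num_residues: int) -> int:
    """Apply the upstream per-residue cost to a total residue count."""
    return BYTES_PER_RESIDUE * num_residues


def format_gb(num_bytes: int) -> str:
    """Render a byte count in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return f"{num_bytes / 1e9:.1f} GB"


# Hypothetical database: 1,000,000 structures averaging 300 residues each.
num_residues = 1_000_000 * 300
ram = estimate_db_ram_bytes(num_residues)
print(format_gb(ram))  # 2.4 GB
```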
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.