PyHMMER
License: PyHMMER is open source and free for academic and commercial use under an MIT license. Please refer to the license for full terms.

This toolkit is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


PyHMMER: a Python library binding to HMMER for efficient sequence analysis
Martin Larralde and Georg Zeller
Bioinformatics (2023)
@article{larralde2023pyhmmer,
  title={PyHMMER: a Python library binding to HMMER for efficient sequence analysis},
  author={Larralde, Martin and Zeller, Georg},
  journal={Bioinformatics},
  volume={39},
  number={5},
  pages={btad214},
  year={2023},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btad214}
}
proto-bio/proto-tools/proto_tools/tools/gene_annotation/pyhmmer
Function - Description
  • run_pyhmmer_hmmscan() - Search sequences against HMM database using PyHMMER
  • run_pyhmmer_hmmsearch() - Search HMM profile(s) against sequences using PyHMMER
  • run_pyhmmer_jackhmmer() - Iteratively search protein sequences against protein database using PyHMMER
  • run_pyhmmer_nhmmer() - Search nucleotide sequences against nucleotide database using PyHMMER
  • run_pyhmmer_phmmer() - Search protein sequences against protein database using PyHMMER

Background

PyHMMER (Larralde & Zeller, 2023) is a Cython binding to the HMMER C API that ships the HMMER source itself, so a single pip install provides both the Python interface and the compiled search engine. The underlying HMMER3 algorithm (Eddy, 2011) builds a profile hidden Markov model from a multiple sequence alignment: each match state stores position-specific emission probabilities, and the transitions between states model insertions and deletions. Search proceeds through a cascade of accelerated filters: a SIMD-vectorised multiple-segment Viterbi (MSV) filter, a vectorised Viterbi filter, and a Forward/Backward filter, each narrowing the candidate set before the final scored alignment.

Each hit carries a database-size-independent bit score together with an E-value derived from extreme-value-distribution theory. The E-value estimates how many false-positive hits are expected at that bit score or better for the database being searched. Profile HMMs detect homology that pairwise methods such as BLAST miss because they encode an entire family’s position-specific conservation pattern rather than the similarity of two sequences alone. HMMER3 brought profile-HMM search within roughly the runtime envelope of BLAST while keeping that sensitivity advantage.

PyHMMER preserves the algorithm exactly and adds Python-native multithreading, in-memory HMM and sequence handles, and structured result objects. Coordinates returned for HMM matches, target alignments, and envelopes are reported as 1-indexed, inclusive intervals to match biological residue numbering conventions.
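The relationship between bit score, database size, and E-value can be made concrete with a small sketch. It is an illustration only, not HMMER's actual calibration code: it assumes a Gumbel (extreme-value) null distribution with a slope of about ln 2 for scores in bits, and the location parameter `mu` is a hypothetical value chosen for the example. In HMMER each profile is calibrated individually.

```python
import math

# Assumption: for scores in bits the slope is approximately ln 2.
LAMBDA = math.log(2)


def gumbel_pvalue(bit_score: float, mu: float, lam: float = LAMBDA) -> float:
    """Probability that a random, non-homologous target scores >= bit_score."""
    # Gumbel survival function: 1 - exp(-exp(-lam * (x - mu))).
    # expm1 keeps precision when the result is very close to zero.
    return -math.expm1(-math.exp(-lam * (bit_score - mu)))


def evalue(bit_score: float, mu: float, z: int, lam: float = LAMBDA) -> float:
    """Expected number of false positives at this score among z comparisons."""
    return z * gumbel_pvalue(bit_score, mu, lam)


if __name__ == "__main__":
    mu = -8.0  # hypothetical location parameter, for illustration only
    for z in (1_000, 100_000):
        print(f"Z={z}: E-value at 25 bits = {evalue(25.0, mu, z):.3g}")
```

The bit score is the same in both searches; only the E-value changes, growing in proportion to the number of comparisons.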

Learning Resources

  • pyhmmer documentation (Martin Larralde) - the canonical API reference, with worked examples for every binding and a guide to feeding HMM and sequence files in and out of memory.
  • HMMER User’s Guide (The Eddy/Rivas Laboratory, Harvard) - reference for the HMMER 3 command-line surface, the MSV/Viterbi/Forward filter cascade, E-value statistics, and the gathering/noise/trusted cutoff system used by Pfam HMMs.
  • Pfam (via InterPro) (EMBL-EBI) - the standard curated HMM library that ships gathering, noise, and trusted cutoffs, and the typical target database for hmmscan domain annotation.

Tools

PyHMMER HMM Scan (pyhmmer-hmmscan)

Searches one or more query protein sequences against an HMM database and returns the profiles that match each query.

API Reference

Source
hmm_db
string
required
Path to an HMM database file containing multiple profile HMMs. The file should be in HMMER3 format and typically represents a comprehensive database like Pfam. All HMMs in the database will be searched against the query sequences.
sequences
List[string]
required
Query protein sequences to search. Inherited from PyHmmerInput. Can be a single sequence string or a list of sequence strings.
Source
bit_cutoffs
string
Use the HMM’s stored bit-score cutoffs in place of the E-value and score thresholds. Accepted values are gathering (the Pfam-curated inclusion cutoff), noise (the most permissive), and trusted (the strictest). None (the default) applies the E-value/score thresholds instead. PyHMMER raises MissingCutoffs if the HMM file lacks the requested cutoff line, so leave this as None for HMMs without curated thresholds.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
num_threads
integer
default:"0"
CPU threads (0 = auto). Inherited from PyHmmerConfig.
evalue_threshold
number
default:"10.0"
Sequence-level E-value cap to report. Inherited from PyHmmerConfig.
score_threshold
number
Sequence-level bit-score floor. Inherited from PyHmmerConfig.
domain_evalue_threshold
number
default:"10.0"
Per-domain E-value cap to report. Inherited from PyHmmerConfig.
domain_score_threshold
number
Per-domain bit-score floor. Inherited from PyHmmerConfig.
inclusion_evalue_threshold
number
default:"0.01"
Sequence-level E-value cap for inclusion. Inherited from PyHmmerConfig.
inclusion_domain_evalue_threshold
number
default:"0.01"
Per-domain E-value cap for inclusion. Inherited from PyHmmerConfig.
z_value
number
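Effective database size used for the E-value calculation. None = use the actual target count. Inherited from PyHmmerConfig.
domain_z_value
number
Significant hit count used for the domain E-value calculation. None = use the actual count. Inherited from PyHmmerConfig.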
skip_filters
boolean
default:"False"
Disable MSV/Vit/Fwd filters. Inherited from PyHmmerConfig.
Source
sequence_hits
List[SequenceHit]
List of sequence-level hits from the search. Each SequenceHit contains:
domain_hits
List[DomainHit]
List of domain-level hits from the search. Each DomainHit contains:

Applications

Use this when the question is “what does this protein contain?” Run a query proteome against Pfam to annotate each protein with its domain architecture, then filter on domain_included to keep only the domains that pass the inclusion thresholds.
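The post-processing step described above might look like the following sketch. The hit records here are plain dictionaries with hypothetical keys (`query`, `hmm`, `start`, `end`) standing in for the real DomainHit fields, which are not listed on this page; only `domain_included` is a name taken from the text. The second helper shows how a 1-indexed, inclusive interval maps onto a Python slice.

```python
from collections import defaultdict


def domain_architecture(domain_hits):
    """Map each query to its included domains, ordered along the sequence."""
    by_query = defaultdict(list)
    for hit in domain_hits:
        # Keep only domains that passed the inclusion thresholds.
        if hit["domain_included"]:
            by_query[hit["query"]].append(hit)
    return {
        query: [h["hmm"] for h in sorted(hits, key=lambda h: h["start"])]
        for query, hits in by_query.items()
    }


def slice_domain(sequence, start, end):
    """Extract the residues of a 1-indexed, inclusive [start, end] interval."""
    # Python slices are 0-indexed and end-exclusive, hence start - 1 and end.
    return sequence[start - 1:end]
```

Sorting by start coordinate gives the N- to C-terminal order of domains, which is the usual way to write a domain architecture.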

Usage Tips

  • Pick hmmscan versus hmmsearch by what you are querying with. hmmscan takes sequences as queries and a database of HMMs as the target; hmmsearch is the reverse. For one or a few queries against Pfam, hmmscan is the natural choice; for one HMM against a large sequence database, hmmsearch is much faster.
  • bit_cutoffs="gathering" applies here too and is the recommended Pfam annotation default. As with hmmsearch, the cutoff is read from the HMM file and ad-hoc HMMs without stored cutoffs will fail with MissingCutoffs.
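Before requesting a stored cutoff, it can help to check which profiles actually carry one. The sketch below scans HMMER3 ASCII text for the header tags GA, TC, and NC (gathering, trusted, and noise cutoffs) and lists the profiles that lack the requested tag. The function name is hypothetical and not part of PyHMMER or this toolkit, and the parser only reads the header fields it needs.

```python
def hmms_missing_cutoff(hmm_text, tag="GA"):
    """Return names of profiles in HMMER3 ASCII text that lack a cutoff line.

    tag is "GA" (gathering), "TC" (trusted) or "NC" (noise).
    """
    missing = []
    name, has_tag, in_header = None, False, True
    for line in hmm_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "//":  # end of one profile record
            if name is not None and not has_tag:
                missing.append(name)
            name, has_tag, in_header = None, False, True
        elif not in_header:
            continue  # skip the numeric model body
        elif fields[0] == "HMM":  # the header ends where the model table begins
            in_header = False
        elif fields[0] == "NAME" and len(fields) > 1:
            name = fields[1]
        elif fields[0] == tag:
            has_tag = True
    return missing
```

If the returned list is non-empty, leaving bit_cutoffs as None for that database avoids the MissingCutoffs error.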

PyHMMER Single-Sequence Protein Search (pyhmmer-phmmer)

Searches one or more protein query sequences against a target protein database by building a temporary HMM around each query.

API Reference

Source
target_sequences
List[string]
required
Target protein sequences to search against. Can be a single sequence string or a list of sequence strings. The query sequences will be compared against these targets.
sequences
List[string]
required
Query protein sequences. Inherited from PyHmmerInput. Can be a single sequence string or a list of sequence strings. These sequences will be used to build temporary HMM profiles on-the-fly.
Source
num_threads
integer
default:"0"
CPU threads; 0 = auto-detect. Default 0.
evalue_threshold
number
default:"10.0"
Sequence-level E-value cap to report. Default 10.0.
score_threshold
number
Sequence-level bit-score floor. Overrides E-value when set. Default None.
domain_evalue_threshold
number
default:"10.0"
Per-domain E-value cap to report. Default 10.0.
domain_score_threshold
number
Per-domain bit-score floor. Overrides domain E-value when set. Default None.
inclusion_evalue_threshold
number
default:"0.01"
Sequence-level inclusion E-value. Default 0.01.
inclusion_domain_evalue_threshold
number
default:"0.01"
Per-domain inclusion E-value. Default 0.01.
z_value
number
Effective database size for E-value calc. None = use the actual target count.
domain_z_value
number
Significant hit count for domain E-value calc. None = use actual.
skip_filters
boolean
default:"False"
Disable MSV/Vit/Fwd heuristic filters. Slower but maximally sensitive. Default False.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Source
sequence_hits
List[SequenceHit]
List of sequence-level hits from the search. Each SequenceHit contains:
domain_hits
List[DomainHit]
List of domain-level hits from the search. Each DomainHit contains:

Applications

Use this for HMM-grade sensitivity when no pre-built profile is available. Typical workflows include finding remote homologs of a newly characterised protein in a reference proteome and running a sequence-based homology pass when the family of interest is too narrow or too new to have a curated HMM.

Usage Tips

  • A single-query, single-target search gives E-values that are hard to interpret. phmmer builds its HMM from the query alone, and E-values scale with the number of targets searched (z_value defaults to the actual target count), so a database of one sequence yields significance estimates that are not comparable to those from a real search. Use phmmer with a real target proteome, not a synthetic pair.

PyHMMER Nucleotide Search (pyhmmer-nhmmer)

Searches nucleotide query sequences against a nucleotide target database with the same profile-HMM machinery used for proteins.

API Reference

Source
target_sequences
List[string]
required
Target nucleotide sequences to search against. Can be a single sequence string or a list of sequence strings.
sequences
List[string]
required
Query nucleotide sequences. Inherited from PyHmmerInput. Can be a single sequence string or a list of sequence strings.
Source
strand
enum
default:"both"
Strand to search. both (default) runs the forward strand and its reverse complement; watson runs only the forward strand; crick runs only the reverse complement. Available options: both, watson, crick
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
num_threads
integer
default:"0"
CPU threads (0 = auto). Inherited from PyHmmerConfig.
evalue_threshold
number
default:"10.0"
Sequence-level E-value cap to report. Inherited from PyHmmerConfig.
score_threshold
number
Sequence-level bit-score floor. Inherited from PyHmmerConfig.
domain_evalue_threshold
number
default:"10.0"
Per-domain E-value cap to report. Inherited from PyHmmerConfig.
domain_score_threshold
number
Per-domain bit-score floor. Inherited from PyHmmerConfig.
inclusion_evalue_threshold
number
default:"0.01"
Sequence-level E-value cap for inclusion. Inherited from PyHmmerConfig.
inclusion_domain_evalue_threshold
number
default:"0.01"
Per-domain E-value cap for inclusion. Inherited from PyHmmerConfig.
z_value
number
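Effective database size used for the E-value calculation. None = use the actual target count. Inherited from PyHmmerConfig.
domain_z_value
number
Significant hit count used for the domain E-value calculation. None = use the actual count. Inherited from PyHmmerConfig.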
skip_filters
boolean
default:"False"
Disable MSV/Vit/Fwd filters. Inherited from PyHmmerConfig.
Source
sequence_hits
List[SequenceHit]
List of sequence-level hits from the search. Each SequenceHit contains:
domain_hits
List[DomainHit]
List of domain-level hits from the search. Each DomainHit contains:

Applications

Use this to find homologs of transposable elements, non-coding RNAs, regulatory elements, and other nucleotide features that diverge fast enough to slip past direct sequence alignment. Pair it with Dfam - the curated profile-HMM library of transposable-element families that was co-designed with nhmmer - or with custom-built nucleotide HMMs when annotating genomes and metagenomic contigs.

Usage Tips

  • strand defaults to "both" and searches the forward and reverse-complement strands. Set "watson" to restrict to forward or "crick" to restrict to reverse-complement when the orientation of a hit is meaningful (e.g., on annotated coding strands).
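To make the strand terminology concrete, here is a minimal sketch of what searching the "crick" strand means: the target is reverse-complemented, and positions on that strand map back to the forward strand by mirroring. The helper names are illustrative and not part of the toolkit. The sketch makes no claim about how nhmmer itself reports reverse-strand coordinates.

```python
# Translation table for unambiguous DNA bases; other symbols (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_coords(start, end, length):
    """Map a 1-indexed, inclusive interval on the reverse complement
    back to the forward strand of a sequence of the given length."""
    # Position i on the reverse complement is position length - i + 1 forward.
    return length - end + 1, length - start + 1
```

For a 10-nucleotide target, positions 2 to 4 on the reverse complement correspond to positions 7 to 9 on the forward strand.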

PyHMMER Iterative Protein Search (pyhmmer-jackhmmer)

Performs iterative protein-sequence search against a target protein database, rebuilding the HMM from each round’s included hits to extend the search outward across remote homologs.

API Reference

Source
target_sequences
List[string]
required
Target protein sequences to search against. Can be a single sequence string or a list of sequence strings.
sequences
List[string]
required
Query protein sequences. Inherited from PyHmmerInput. Can be a single sequence string or a list of sequence strings.
Source
max_iterations
integer
default:"5"
Maximum jackhmmer iterations; stops early on convergence. Default 5.
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
num_threads
integer
default:"0"
CPU threads (0 = auto). Inherited from PyHmmerConfig.
evalue_threshold
number
default:"10.0"
Sequence-level E-value cap to report. Inherited from PyHmmerConfig.
score_threshold
number
Sequence-level bit-score floor. Inherited from PyHmmerConfig.
domain_evalue_threshold
number
default:"10.0"
Per-domain E-value cap to report. Inherited from PyHmmerConfig.
domain_score_threshold
number
Per-domain bit-score floor. Inherited from PyHmmerConfig.
inclusion_evalue_threshold
number
default:"0.01"
Sequence-level inclusion E-value. Inherited from PyHmmerConfig. Critical for jackhmmer, because the included set seeds the next iteration’s HMM.
inclusion_domain_evalue_threshold
number
default:"0.01"
Per-domain inclusion E-value. Inherited from PyHmmerConfig.
z_value
number
Effective database size for E-value calc. Inherited from PyHmmerConfig.
domain_z_value
number
Significant hit count for domain E-value. Inherited from PyHmmerConfig.
skip_filters
boolean
default:"False"
Disable MSV/Vit/Fwd filters. Inherited from PyHmmerConfig.
Source
sequence_hits
List[SequenceHit]
List of sequence-level hits from the search. Each SequenceHit contains:
domain_hits
List[DomainHit]
List of domain-level hits from the search. Each DomainHit contains:

Applications

Use this when you need to reach divergent family members that a single-pass phmmer would miss, for example when seeding a new family from one characterised representative or expanding a manually curated set to its full evolutionary breadth.

Usage Tips

  • inclusion_evalue_threshold is the lever that controls iterative drift. Each iteration rebuilds the HMM from hits that pass the inclusion thresholds (--incE / --incdomE, defaults 0.01). A looser inclusion threshold pulls in more sequences per round and increases the risk of pulling in unrelated families; tighten it when iterations start drifting.
  • max_iterations defaults to 5 and the search exits early on convergence. Raising it rarely helps if the search has already converged on a stable set, and a higher cap multiplies runtime on long-running jobs.
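The interplay between inclusion and convergence can be sketched as a plain loop. This is a simplified model, not jackhmmer's implementation: `search` is a hypothetical callable that returns the identifiers passing the inclusion threshold for the current member set, and the member set is assumed only to grow from round to round.

```python
def iterate_search(search, seed, max_iterations=5):
    """Repeat a search, rebuilding the member set from included hits each round.

    Returns (members, rounds_run, converged).
    """
    members = frozenset(seed)
    for iteration in range(1, max_iterations + 1):
        # Simplification: members are never dropped once included.
        included = frozenset(search(members)) | members
        if included == members:  # converged: no new sequences joined
            return members, iteration, True
        members = included
    return members, max_iterations, False


if __name__ == "__main__":
    # Toy "database": each sequence is detectable only from its neighbour.
    links = {"q": {"a"}, "a": {"b"}, "b": set()}

    def toy_search(current):
        return set().union(*(links.get(m, set()) for m in current))

    print(iterate_search(toy_search, {"q"}))
```

In this model a looser inclusion rule corresponds to a `search` that returns more identifiers per round, which is how an unrelated family can enter the member set and then recruit its own relatives in later rounds.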

Toolkit Notes

These apply to every PyHMMER tool in this toolkit (pyhmmer-hmmsearch, pyhmmer-hmmscan, pyhmmer-phmmer, pyhmmer-nhmmer, pyhmmer-jackhmmer).
  • Runs on CPU with SIMD acceleration. The HMMER3 filter cascade is SIMD-vectorised on x86 platforms. pyhmmer compiles HMMER from source at install time and inherits whatever instruction sets the build host exposes, with no GPU acceleration to enable.
  • Self-contained after install. The HMMER C library is compiled into the PyHMMER wheel, so no separate HMMER install or PATH lookup is needed; HMM databases such as Pfam-A still have to be downloaded separately.
  • num_threads parallelises within a single search. Default 0 auto-detects the available cores. Memory scales with HMM database size; Pfam-A (around 20,000 HMMs) needs roughly 2 GB of RAM held resident.
  • Reporting and inclusion thresholds are separate filters. evalue_threshold / score_threshold (and their domain_* siblings) control what appears in the output, while inclusion_evalue_threshold and inclusion_domain_evalue_threshold mark the stricter subset of significant hits via the included and domain_included flags. jackhmmer seeds the next iteration’s HMM from the included set, so the inclusion thresholds drive iterative behaviour while the reporting thresholds only affect what is returned.
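The two-tier thresholding in the last note can be illustrated with a short sketch. It uses the default values listed in the API reference (10.0 for reporting, 0.01 for inclusion) and assumes that a hit must be reported before it can be included. The function is illustrative and not part of the toolkit.

```python
def classify_hit(evalue, evalue_threshold=10.0, inclusion_evalue_threshold=0.01):
    """Label a hit as reported and/or included from its E-value."""
    reported = evalue <= evalue_threshold
    # Assumption: inclusion is only evaluated for hits that are reported.
    included = reported and evalue <= inclusion_evalue_threshold
    return {"reported": reported, "included": included}


if __name__ == "__main__":
    for e in (1e-5, 0.5, 50.0):
        print(e, classify_hit(e))
```

A hit with an E-value of 0.5 appears in the output but is not flagged as included, so it would not contribute to the next jackhmmer iteration.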
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.