Protein Domain Match
License: This constraint can use multiple tools, each under its own license. See the Tools Used tab and each tool’s page for license details.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


proto-bio/proto-language/proto_language/constraint/protein_quality/protein_domain_constraint.py
Evaluate whether sequences contain protein domains matching specified keywords.

This constraint function searches for functional protein domains using HMMER’s hmmscan tool against HMM profile databases. It identifies domains in protein sequences and matches them against user-specified keywords, enabling selection of proteins with desired functional domains.

For DNA sequences, the function first runs Prodigal to predict protein-coding regions (ORFs), then searches each predicted protein for matching domains. For protein sequences, the domain search is performed directly. The constraint is satisfied when the specified keyword criteria are met (any or all keywords, depending on configuration).
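The keyword step described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `keywords_satisfied` and the use of plain substring matching are assumptions made for illustration. The source only states that matching is case-insensitive and that either any or all keywords may be required.

```python
from typing import Iterable, List, Tuple


def keywords_satisfied(
    descriptions: Iterable[str],
    keywords: List[str],
    match_all: bool = False,
) -> Tuple[bool, List[str]]:
    """Hypothetical helper: report which keywords occur in any domain description."""
    lowered = [d.lower() for d in descriptions]
    # Case-insensitive substring search (assumed matching rule).
    found = [k for k in keywords if any(k.lower() in d for d in lowered)]
    if match_all:
        ok = len(found) == len(keywords)  # every keyword must appear somewhere
    else:
        ok = len(found) > 0  # a single keyword is enough
    return ok, found


descriptions = ["Protein kinase domain", "SH2 domain"]
print(keywords_satisfied(descriptions, ["Kinase", "helicase"]))
print(keywords_satisfied(descriptions, ["Kinase", "helicase"], match_all=True))
```

With the same inputs, the "any" mode is satisfied by the single kinase match, while the "all" mode is not, because no description mentions a helicase.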

API Reference

Config: ProteinDomainConfig
Configuration for the protein domain constraint. This class defines configuration parameters for evaluating whether protein sequences contain specific functional domains identified by keyword searches against HMM (Hidden Markov Model) profile databases. The constraint uses HMMER’s hmmscan tool to identify protein domains and matches them against user-specified keywords, enabling targeted selection of proteins with desired functional characteristics.
For DNA sequences, Prodigal is used to predict ORFs first, then each predicted protein is searched for domains. For protein sequences, the search is performed directly.
hmm_db (string, required): Path to HMM database file for hmmscan (e.g., Pfam-A.hmm). Must be pressed with hmmpress.
keywords (List[string], required): Keywords to search for in domain descriptions (case-insensitive).
evalue_threshold (number, default: 0.005): Maximum E-value for significant HMM hits; lower is more stringent (typical range 0.0001 to 0.01).
query_coverage (number): Minimum query coverage percentage for significant hits (0-100).
match_all_keywords (boolean, default: False): If True, require ALL keywords to be found. If False, require ANY keyword (default).
hmmscan_config (PyHmmerConfig): Configuration for PyHMMER hmmscan.
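As a rough illustration of how `evalue_threshold` and `query_coverage` might act together, the sketch below filters hits represented as plain dictionaries. The field names (`evalue`, `query_start`, `query_end`, `query_length`), the inclusive comparisons, the coverage formula, and the `None` default for `query_coverage` are all assumptions; the source does not specify how coverage is computed or whether the comparisons are strict.

```python
from typing import Dict, List, Optional

Hit = Dict[str, float]


def query_coverage_percent(hit: Hit) -> float:
    """Assumed definition: aligned query span as a percentage of query length."""
    span = hit["query_end"] - hit["query_start"] + 1  # 1-based, inclusive coordinates
    return 100.0 * span / hit["query_length"]


def significant_hits(
    hits: List[Hit],
    evalue_threshold: float = 0.005,
    query_coverage: Optional[float] = None,
) -> List[Hit]:
    """Hypothetical filter mirroring the two threshold parameters."""
    kept = []
    for hit in hits:
        if hit["evalue"] > evalue_threshold:
            continue  # E-value too high to count as significant
        if query_coverage is not None and query_coverage_percent(hit) < query_coverage:
            continue  # alignment covers too little of the query
        kept.append(hit)
    return kept


hits = [
    {"evalue": 1e-20, "query_start": 1, "query_end": 80, "query_length": 100},
    {"evalue": 1e-4, "query_start": 10, "query_end": 29, "query_length": 100},
    {"evalue": 0.5, "query_start": 1, "query_end": 100, "query_length": 100},
]
print(len(significant_hits(hits)))
print(len(significant_hits(hits, query_coverage=50)))
```

Lowering `evalue_threshold` or raising `query_coverage` can only shrink the set of hits that reach the keyword-matching step, which is why both are described as stringency controls.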
Returns: ConstraintOutput
One result per sequence. A score of 0.0 indicates that the domain criteria are satisfied (matching domains found); a score of 1.0 indicates that no matching domains were found or that the keyword requirements were not met. metadata carries:
For DNA sequences:
  • prodigal_proteins: List of dicts of predicted proteins from Prodigal (or None if no ORFs were predicted)
  • prodigal_protein_count: Integer count of predicted ORFs
  • domain_search_results: List of domain search results for each predicted protein
  • domain_keywords_found: List of unique keywords found across all predicted proteins
  • domain_matching_proteins: List of protein IDs that matched keywords
For protein sequences:
  • domain_search_results: List containing domain search results
  • domain_keywords_found: List of keywords found in domain descriptions
  • domain_matching_hits: DataFrame of domain hits matching keywords
  • hmmscan_all_hits: DataFrame of all significant hmmscan hits
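To make the DNA-sequence metadata concrete, the sketch below aggregates per-protein keyword matches into the two summary fields and the final score. The input shape (a mapping from protein ID to the keywords found in that protein) and the function name are hypothetical. Whether `match_all_keywords` is evaluated per protein or across all predicted proteins is not stated in the source; this sketch assumes it is evaluated across all proteins.

```python
from typing import Dict, List


def summarize_dna_result(
    found_by_protein: Dict[str, List[str]],
    keywords: List[str],
    match_all_keywords: bool = False,
) -> Dict[str, object]:
    """Hypothetical aggregation over Prodigal-predicted proteins."""
    # Unique keywords across all proteins, kept in the order the user gave them.
    seen = {k for ks in found_by_protein.values() for k in ks}
    keywords_found = [k for k in keywords if k in seen]
    # Proteins with at least one keyword match.
    matching_proteins = [pid for pid, ks in found_by_protein.items() if ks]
    if match_all_keywords:
        satisfied = len(keywords_found) == len(keywords)
    else:
        satisfied = bool(keywords_found)
    return {
        "score": 0.0 if satisfied else 1.0,  # 0.0 means the criteria are satisfied
        "domain_keywords_found": keywords_found,
        "domain_matching_proteins": matching_proteins,
    }


per_protein = {"orf_1": ["helicase"], "orf_2": [], "orf_3": ["helicase", "atpase"]}
summary = summarize_dna_result(per_protein, ["helicase", "atpase", "kinase"])
print(summary)
```

Note the inverted score convention: a lower score is better, so downstream code should test for `score == 0.0` rather than treating the score as a match count.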

Usage

Evaluating domain presence in a protein sequence with a single keyword:
python
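>>> from proto_language.core import Sequence
>>> # Import path below is assumed from the source file location; adjust if the package re-exports these names.
>>> from proto_language.constraint.protein_quality.protein_domain_constraint import (
...     ProteinDomainConfig,
...     protein_domain_constraint,
... )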
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> cfg = ProteinDomainConfig(hmm_db="Pfam-A.hmm", keywords=["kinase"], evalue_threshold=0.001)
>>> results = protein_domain_constraint([(seq,)], config=cfg)
>>> print(results[0].score)  # 0.0 if kinase domain found, 1.0 if not
>>> print(results[0].metadata["domain_keywords_found"])  # ['kinase'] if found
Evaluating a DNA sequence (with automatic ORF prediction):
python
>>> dna_seq = Sequence("ATGGTACTGAGCCCAGCG...", "dna")
>>> cfg = ProteinDomainConfig(hmm_db="Pfam-A.hmm", keywords=["helicase"])
>>> results = protein_domain_constraint([(dna_seq,)], config=cfg)
>>> print(results[0].metadata["prodigal_protein_count"])  # Number of predicted ORFs
>>> print(results[0].metadata["domain_matching_proteins"])  # IDs of proteins with helicase domain

Metadata

Property | Value
Key | protein-domain
Function | protein_domain_constraint
Category | protein_quality
Mode | discrete
Uses GPU | False
Supported Types | dna, protein