Protein Domain Match

License: This constraint can use multiple tools, each under its own license. See the Tools Used tab and each tool’s page for license details.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


proto-bio/proto-language/proto_language/constraint/protein_quality/protein_domain_constraint.py

Evaluate whether sequences contain protein domains matching specified keywords.

This constraint searches for functional protein domains by running HMMER’s hmmscan tool against an HMM profile database. It identifies the domains in each protein sequence and matches their descriptions against user-specified keywords, so that proteins with the desired functional domains can be selected.

For DNA sequences, the function first runs Prodigal to predict protein-coding regions (ORFs), then searches each predicted protein for matching domains. For protein sequences, the domain search is performed directly. The constraint is satisfied when the keyword criteria are met (any keyword or all keywords, depending on configuration).
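The matching step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the constraint's actual implementation: `DomainHit`, `keywords_found`, and `domain_score` are hypothetical names, hits at exactly the threshold are assumed to count as significant, and keyword matching is assumed to be a case-insensitive substring test on the hit description.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """Hypothetical stand-in for one hmmscan hit (not the library's own type)."""
    description: str
    evalue: float


def keywords_found(hits, keywords, evalue_threshold=0.005):
    """Return the keywords that occur in at least one significant hit description."""
    # Assumption: a hit is significant when its E-value is <= the threshold.
    significant = [h for h in hits if h.evalue <= evalue_threshold]
    found = []
    for kw in keywords:
        # Case-insensitive substring match against each domain description.
        if any(kw.lower() in h.description.lower() for h in significant):
            found.append(kw)
    return found


def domain_score(hits, keywords, evalue_threshold=0.005, match_all_keywords=False):
    """Return 0.0 when the keyword criteria are met, 1.0 otherwise."""
    found = keywords_found(hits, keywords, evalue_threshold)
    if match_all_keywords:
        satisfied = bool(keywords) and len(found) == len(keywords)
    else:
        satisfied = bool(found)
    return 0.0 if satisfied else 1.0


hits = [
    DomainHit("Protein kinase domain", 1e-30),
    DomainHit("DEAD/DEAH box helicase", 0.5),  # above the threshold, so ignored
]
print(domain_score(hits, ["Kinase", "helicase"]))                           # 0.0
print(domain_score(hits, ["Kinase", "helicase"], match_all_keywords=True))  # 1.0
```

With the default any-keyword mode the kinase hit alone satisfies the constraint; requiring all keywords fails because the helicase hit is not significant.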

API Reference

Config: ProteinDomainConfig

Configuration for the protein domain constraint. This class defines the parameters for evaluating whether protein sequences contain specific functional domains, identified by keyword searches against HMM (Hidden Markov Model) profile databases. The constraint uses HMMER’s hmmscan tool to identify protein domains and matches them against user-specified keywords, enabling targeted selection of proteins with desired functional characteristics.

For DNA sequences, Prodigal is used to predict ORFs first, then each predicted protein is searched for domains. For protein sequences, the search is performed directly.

hmm_db (string, required)

Path to the HMM database file for hmmscan (e.g., Pfam-A.hmm). Must be pressed with hmmpress.

keywords (List[string], required)

Keywords to search for in domain descriptions (case-insensitive).

evalue_threshold (number, default: 0.005)

Maximum E-value for significant HMM hits; lower is more stringent (typical range 0.0001 to 0.01).

query_coverage (number)

Minimum query coverage percentage for significant hits (0-100).

match_all_keywords (boolean, default: False)

If True, require ALL keywords to be found. If False, require ANY keyword (default).

hmmscan_config (PyHmmerConfig)

Configuration for PyHMMER hmmscan.
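Because hmm_db must be pressed, it can help to check for the index files before building a configuration. The sketch below relies on the fact that hmmpress writes four binary files next to the database (suffixes .h3f, .h3i, .h3m, .h3p); the helper name `missing_press_files` is hypothetical and is not part of proto_language.

```python
from pathlib import Path

# Suffixes of the binary index files that hmmpress writes next to the database.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_press_files(hmm_db):
    """Return the names of hmmpress index files that do not exist for hmm_db."""
    db = Path(hmm_db)
    return [
        db.name + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(str(db) + suffix).exists()
    ]


missing = missing_press_files("Pfam-A.hmm")
if missing:
    # The database has not been pressed (or the path is wrong).
    print("Missing index files:", missing, "- run: hmmpress Pfam-A.hmm")
```

An empty list means all four index files are present alongside the database.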

Returns: ConstraintOutput

One result per sequence. A score of 0.0 indicates that the domain criteria are satisfied (matching domains found); a score of 1.0 indicates that no matching domains were found or that the keyword requirements were not met. The metadata field carries the following keys.

For DNA sequences:

prodigal_proteins: List of dicts of predicted proteins from Prodigal (or None if no ORFs were predicted)
prodigal_protein_count: Integer count of predicted ORFs
domain_search_results: List of domain search results for each predicted protein
domain_keywords_found: List of unique keywords found across all predicted proteins
domain_matching_proteins: List of protein IDs that matched keywords

For protein sequences:

domain_search_results: List containing domain search results
domain_keywords_found: List of keywords found in domain descriptions
domain_matching_hits: DataFrame of domain hits matching keywords
hmmscan_all_hits: DataFrame of all significant hmmscan hits
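For DNA input, the per-protein results have to be combined into one score and one set of metadata keys. The following is a rough sketch of one way that aggregation could work, not the library's code: `aggregate_dna_metadata` and its input format are hypothetical, and it assumes that with match_all_keywords=True the keywords are pooled across all predicted proteins rather than required within a single protein.

```python
def aggregate_dna_metadata(per_protein_keywords, keywords, match_all_keywords=False):
    """Combine per-protein keyword matches into a DNA-level score and metadata.

    per_protein_keywords is a hypothetical input: a dict mapping each predicted
    protein ID to the list of keywords found in that protein's domain hits,
    spelled exactly as the user supplied them.
    """
    found = []     # unique keywords across all predicted proteins, in first-seen order
    matching = []  # IDs of proteins with at least one matching keyword
    for protein_id, protein_keywords in per_protein_keywords.items():
        if protein_keywords:
            matching.append(protein_id)
        for kw in protein_keywords:
            if kw not in found:
                found.append(kw)

    if match_all_keywords:
        # Assumption: keywords may be spread over different predicted proteins.
        satisfied = bool(keywords) and all(kw in found for kw in keywords)
    else:
        satisfied = bool(found)

    return {
        "score": 0.0 if satisfied else 1.0,
        "prodigal_protein_count": len(per_protein_keywords),
        "domain_keywords_found": found,
        "domain_matching_proteins": matching,
    }


per_protein = {"orf_1": ["helicase"], "orf_2": [], "orf_3": ["helicase", "ATPase"]}
summary = aggregate_dna_metadata(per_protein, ["helicase", "ATPase"], match_all_keywords=True)
print(summary["score"])                     # 0.0
print(summary["domain_matching_proteins"])  # ['orf_1', 'orf_3']
```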

Usage

Evaluating domain presence in protein with single keyword:

python

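>>> from proto_language.core import Sequence, SequenceType
>>> # The import path below is inferred from the source file location listed above;
>>> # adjust it if the package exposes these names elsewhere.
>>> from proto_language.constraint.protein_quality.protein_domain_constraint import (
...     ProteinDomainConfig,
...     protein_domain_constraint,
... )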
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> cfg = ProteinDomainConfig(hmm_db="Pfam-A.hmm", keywords=["kinase"], evalue_threshold=0.001)
>>> results = protein_domain_constraint([(seq,)], config=cfg)
>>> print(results[0].score)  # 0.0 if kinase domain found, 1.0 if not
>>> print(results[0].metadata["domain_keywords_found"])  # ['kinase'] if found

Evaluating DNA sequence (with automatic ORF prediction):

python

>>> dna_seq = Sequence("ATGGTACTGAGCCCAGCG...", "dna")
>>> cfg = ProteinDomainConfig(hmm_db="Pfam-A.hmm", keywords=["helicase"])
>>> results = protein_domain_constraint([(dna_seq,)], config=cfg)
>>> print(results[0].metadata["prodigal_protein_count"])  # Number of predicted ORFs
>>> print(results[0].metadata["domain_matching_proteins"])  # IDs of proteins with helicase domain

Metadata

Property	Value
Key	`protein-domain`
Function	`protein_domain_constraint`
Category	`protein_quality`
Mode	`discrete`
Uses GPU	`False`
Supported Types	`dna`, `protein`
