Gene/Protein Similarity
License: This constraint can use multiple tools, each under its own license. See the Tools Used tab and each tool’s page for license details.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


proto-bio/proto-language/proto_language/constraint/sequence_annotation/mmseqs_similarity_constraint.py
Evaluate sequence similarity using an MMseqs2 protein database search. This constraint checks whether protein sequences (or proteins predicted from DNA sequences) have percent identity to known proteins within an acceptable range. It uses MMseqs2, an ultra-fast sequence search tool, to search a reference protein database and compute similarity scores.
For DNA sequences, the function first predicts open reading frames (ORFs) using either Prodigal (prokaryotic) or ORFipy (viral), then searches the translated proteins. For protein sequences, the search is performed directly. The constraint is satisfied when all database hits have percent identity within the specified [min_similarity, max_similarity] range.

API Reference

Config: MMseqsSimilarityConfig
Configuration for MMseqs gene similarity constraint. This class defines configuration parameters for evaluating sequence similarity (percent identity) to known proteins using MMseqs2, an ultra-fast sequence search tool. For DNA sequences, the constraint first predicts open reading frames (ORFs) using either Prodigal or ORFipy, then searches the translated proteins against a reference database. For protein sequences, the search is performed directly.
For examples with tool configuration, see:
from proto_tools import Mmseqs2SearchProteinsConfig
The similarity range [min_similarity, max_similarity] defines the acceptable percent identity. Sequences with hits outside this range are penalized. For example:
  • [40, 70]: Moderate similarity, useful for inferring functional similarity while avoiding identical sequences
  • [0, 40]: Low similarity filter, for novelty/uniqueness constraints
  • [80, 100]: High similarity filter, for functional conservation requirements
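The example ranges above can be illustrated with a small standalone check. This is a minimal sketch, not the constraint's own code: the preset names and both helper functions are hypothetical, and treating the bounds as inclusive is an assumption based on the bracket notation.

```python
# Hypothetical preset names for the three example ranges listed above.
PRESETS = {
    "moderate": (40.0, 70.0),
    "novel": (0.0, 40.0),
    "conserved": (80.0, 100.0),
}


def within_range(pident: float, min_similarity: float, max_similarity: float) -> bool:
    """Return True when a percent identity lies inside the range (bounds assumed inclusive)."""
    return min_similarity <= pident <= max_similarity


def matching_presets(pident: float) -> list[str]:
    """List the preset names whose range contains the given percent identity."""
    return [name for name, (lo, hi) in PRESETS.items() if within_range(pident, lo, hi)]


print(matching_presets(55.0))  # a hit at 55% identity fits only the moderate range
print(matching_presets(75.0))  # 75% falls in the gap between the example ranges
```

A hit at 75% identity matches none of the three example ranges, which shows why the range should be chosen to cover every identity level considered acceptable.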
  • min_similarity (number, required): Minimum acceptable percent identity (0-100). Lower values are more permissive.
  • max_similarity (number, required): Maximum acceptable percent identity (0-100). Higher values allow more similar hits.
  • mmseqs_db (string, required): Path to the MMseqs2 protein database used for the similarity search.
  • mmseqs_config (Mmseqs2SearchProteinsConfig): MMseqs configuration (threads, sensitivity, etc.).
  • orf_predictor (enum, default: "prodigal"): ORF prediction tool (DNA only): 'orfipy' (viral) or 'prodigal' (prokaryotic). Options: orfipy, prodigal.
  • orfipy_config (OrfipyConfig): ORFipy configuration (DNA only, used if orf_predictor='orfipy').
  • prodigal_config (ProdigalConfig): Prodigal configuration (DNA only, used if orf_predictor='prodigal').
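To make the parameter constraints concrete, the sketch below mirrors a few of the documented fields in a standalone dataclass. `SimilaritySettings` is a hypothetical stand-in, not `MMseqsSimilarityConfig` itself, and the validation rules are assumptions drawn from the descriptions above (a 0-100 range and two predictor options); the library may validate differently or not at all.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class SimilaritySettings:
    """Hypothetical stand-in mirroring a few documented fields; not the library class."""

    min_similarity: float
    max_similarity: float
    mmseqs_db: str
    orf_predictor: Literal["orfipy", "prodigal"] = "prodigal"

    def __post_init__(self) -> None:
        # Assumed rule: both bounds are percentages and min does not exceed max.
        if not 0.0 <= self.min_similarity <= self.max_similarity <= 100.0:
            raise ValueError("expected 0 <= min_similarity <= max_similarity <= 100")
        # The documented options for the ORF predictor.
        if self.orf_predictor not in ("orfipy", "prodigal"):
            raise ValueError("orf_predictor must be 'orfipy' or 'prodigal'")
        if not self.mmseqs_db:
            raise ValueError("mmseqs_db must be a non-empty path")


# Example: a novelty filter for viral DNA, using the ORFipy predictor.
settings = SimilaritySettings(10.0, 30.0, "/data/databases/uniref90", orf_predictor="orfipy")
print(settings)
```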
Returns: ConstraintOutput
One result per sequence. Score 0.0 means all hits fall within [min_similarity, max_similarity]; higher scores indicate greater deviation. Score 1.0 (MAX_ENERGY) is returned if no ORFs are found (DNA) or no database hits are found. metadata carries:
For DNA sequences (with Prodigal):
  • prodigal_orfs: List of dictionaries containing predicted ORF information (id, start, end, strand, protein_sequence, etc.)
  • mmseqs_results: List of dictionaries with MMseqs2 hit information (target_id, pident, evalue)
  • unique_orfs_with_hits: Integer count of distinct ORFs with at least one database match
  • orfs_with_acceptable_similarity: Integer count of ORFs with hits in acceptable range
  • total_orfs_with_hits: Integer total number of ORF-hit pairs
  • similarity_compliance_rate: Float fraction of hits within acceptable range (0.0-1.0)
For DNA sequences (with ORFipy):
  • orfipy_orfs: List of dictionaries with ORFipy ORF predictions
  • Other fields same as Prodigal above
For protein sequences:
  • direct_protein: Dictionary with protein information (id, sequence, length)
  • mmseqs_results: List of MMseqs2 hit dictionaries
  • unique_orfs_with_hits: Count of distinct ORFs with at least one hit (0 or 1, since a single protein is treated as one ORF)
  • orfs_with_acceptable_similarity: Count of acceptable hits
  • total_orfs_with_hits: Total hit count
  • similarity_compliance_rate: Fraction of hits in range
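As an illustration of how some of these summary fields relate to the hit list, the sketch below recomputes three of them from a list of hit dictionaries. It is not the constraint's implementation: `summarize_hits` is a hypothetical helper, the `query_id` key is an assumption (the documented hit fields are target_id, pident, and evalue), the bounds are treated as inclusive, and returning a rate of 0.0 for an empty hit list is a choice made here for the sketch.

```python
def summarize_hits(hits, min_similarity, max_similarity):
    """Recompute a few summary fields from hit dictionaries (illustrative only)."""
    total = len(hits)
    # Count hits whose percent identity lies inside the acceptable range.
    in_range = sum(1 for hit in hits if min_similarity <= hit["pident"] <= max_similarity)
    # "query_id" is a hypothetical key used here to tell ORFs apart.
    unique_queries = {hit["query_id"] for hit in hits}
    return {
        "unique_orfs_with_hits": len(unique_queries),
        "total_orfs_with_hits": total,
        "similarity_compliance_rate": in_range / total if total else 0.0,
    }


# Made-up hits for two ORFs, checked against the range [10, 30].
example_hits = [
    {"query_id": "orf_1", "target_id": "A", "pident": 25.0, "evalue": 1e-20},
    {"query_id": "orf_1", "target_id": "B", "pident": 62.5, "evalue": 1e-8},
    {"query_id": "orf_2", "target_id": "C", "pident": 18.0, "evalue": 1e-5},
    {"query_id": "orf_2", "target_id": "D", "pident": 29.0, "evalue": 1e-3},
]
summary = summarize_hits(example_hits, 10.0, 30.0)
print(summary)
```

In this made-up data, three of the four hits fall inside the range, so the compliance rate is 0.75 and the single 62.5% hit is the one that would be penalized.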

Usage

Filtering for sequences with low similarity to existing proteins:
python
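>>> # Import path below is inferred from the source file location shown above (an assumption);
>>> # adjust it if the package re-exports these names elsewhere.
>>> from proto_language.constraint.sequence_annotation.mmseqs_similarity_constraint import (
...     MMseqsSimilarityConfig,
...     mmseqs_similarity_constraint,
... )
>>> from proto_language.core import Sequence
>>> protein_seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> config = MMseqsSimilarityConfig(
...     min_similarity=10.0, max_similarity=30.0, mmseqs_db="/data/databases/uniref90"
... )
>>> results = mmseqs_similarity_constraint([(protein_seq,)], config)
>>> print(results[0].score)  # 0.0 means every hit falls within 10-30% identity
>>> print(results[0].metadata["similarity_compliance_rate"])  # fraction of hits in range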

Metadata

Property         Value
Key              mmseqs-gene-similarity
Function         mmseqs_similarity_constraint
Category         sequence_annotation
Mode             discrete
Uses GPU         False
Supported Types  dna, protein