Gene/Protein Similarity

License: This constraint can use multiple tools, each under its own license. See the Tools Used tab and each tool’s page for license details.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


ORF Prediction · 2 tools

proto-bio/proto-language/proto_language/constraint/sequence_annotation/mmseqs_similarity_constraint.py

Evaluate sequence similarity using MMseqs2 protein database search.

This constraint function evaluates whether protein sequences (or proteins predicted from DNA sequences) have percent identity to known proteins within an acceptable range. It uses MMseqs2, an ultra-fast sequence search tool, to search against a reference protein database and calculates similarity scores.

For DNA sequences, the function first predicts open reading frames (ORFs) using either Prodigal (for prokaryotic sequences) or ORFipy (for viral sequences), then searches the translated proteins. For protein sequences, the search is performed directly.

The constraint is satisfied when all database hits have percent identity within the specified [min_similarity, max_similarity] range.
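To make the DNA path concrete, here is a minimal sketch of what "predict ORFs, then translate" means. It is a deliberately naive stand-in, not how Prodigal or ORFipy work: it scans only the forward strand, uses the standard genetic code, and treats any ATG-to-stop stretch as an ORF. The function name `naive_forward_orfs` and the `min_codons` parameter are hypothetical and not part of the library.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def naive_forward_orfs(dna, min_codons=2):
    """Return (start, end, protein) tuples for ATG..stop ORFs.

    Hypothetical illustration only: forward strand, three frames,
    no gene model. Real ORF predictors are far more sophisticated.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        residues = []
        for i in range(frame, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for ambiguous codons
            if start is None:
                if aa == "M":  # open a new ORF at a start codon
                    start = i
                    residues = ["M"]
            elif aa == "*":  # stop codon closes the ORF
                if len(residues) >= min_codons:
                    orfs.append((start, i + 3, "".join(residues)))
                start = None
            else:
                residues.append(aa)
    return orfs
```

The translated protein strings are what would then be searched against the MMseqs2 database. ORFs that never reach a stop codon are dropped in this sketch.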

API Reference

Config: MMseqsSimilarityConfig

Configuration for MMseqs gene similarity constraint. This class defines configuration parameters for evaluating sequence similarity (percent identity) to known proteins using MMseqs2, an ultra-fast sequence search tool. For DNA sequences, the constraint first predicts open reading frames (ORFs) using either Prodigal or ORFipy, then searches the translated proteins against a reference database. For protein sequences, the search is performed directly.

For examples with tool configuration, see:

from proto_tools import Mmseqs2SearchProteinsConfig

The similarity range [min_similarity, max_similarity] defines acceptable percent identity. Sequences with hits outside this range are penalized. For example:

[40, 70]: Moderate similarity, useful for inferring functional similarity while avoiding identical sequences
[0, 40]: Low similarity filter, for novelty/uniqueness constraints
[80, 100]: High similarity filter, for functional conservation requirements
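As a rough illustration of how a range such as [40, 70] partitions hits, the sketch below sorts percent-identity values into below, in-range, and above. It assumes both bounds are inclusive, as the bracket notation suggests; the helper name `classify_hits` is hypothetical and not part of the library.

```python
def classify_hits(pidents, min_similarity, max_similarity):
    """Split percent-identity values by where they fall relative to the range.

    Assumption: both bounds are inclusive, matching the [min, max] notation.
    """
    return {
        "below": [p for p in pidents if p < min_similarity],
        "in_range": [p for p in pidents if min_similarity <= p <= max_similarity],
        "above": [p for p in pidents if p > max_similarity],
    }


# Example: a moderate-similarity window of [40, 70].
buckets = classify_hits([35.0, 55.2, 70.0, 91.3], 40, 70)
print(buckets)
```

Only the hits in the "below" and "above" buckets would count against a sequence; how strongly they are penalized is decided by the constraint itself and is not reproduced here.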

min_similarity

number

required

Minimum acceptable percent identity (0-100). Lower values are more permissive.

max_similarity

number

required

Maximum acceptable percent identity (0-100). Higher values allow more similar hits.

mmseqs_db

string

required

Path to the MMseqs2 protein database used for the similarity search.

mmseqs_config

Mmseqs2SearchProteinsConfig

MMseqs configuration (threads, sensitivity, etc.).

orf_predictor

enum

default: "prodigal"

ORF prediction tool (DNA only): ‘orfipy’ (viral) or ‘prodigal’ (prokaryotic). Options: orfipy, prodigal

orfipy_config

OrfipyConfig

ORFipy configuration (DNA only, used if orf_predictor=‘orfipy’).

prodigal_config

ProdigalConfig

Prodigal configuration (DNA only, used if orf_predictor=‘prodigal’).

Returns: ConstraintOutput

One result per sequence. Score 0.0 means all hits fall within [min_similarity, max_similarity]; higher scores indicate greater deviation. Score 1.0 (MAX_ENERGY) is returned if no ORFs are found (DNA) or no database hits are found. The metadata field carries the keys listed below.

For DNA sequences (with Prodigal):

prodigal_orfs: List of dictionaries containing predicted ORF information (id, start, end, strand, protein_sequence, etc.)
mmseqs_results: List of dictionaries with MMseqs2 hit information (target_id, pident, evalue)
unique_orfs_with_hits: Integer count of distinct ORFs with at least one database match
orfs_with_acceptable_similarity: Integer count of ORFs with hits in acceptable range
total_orfs_with_hits: Integer total number of ORF-hit pairs
similarity_compliance_rate: Float fraction of hits within acceptable range (0.0-1.0)

For DNA sequences (with ORFipy):

orfipy_orfs: List of dictionaries with ORFipy ORF predictions
All other fields are the same as for Prodigal above

For protein sequences:

direct_protein: Dictionary with protein information (id, sequence, length)
mmseqs_results: List of MMseqs2 hit dictionaries
unique_orfs_with_hits: Count of distinct ORFs with at least one hit (always 0 or 1, because a directly supplied protein is treated as a single ORF)
orfs_with_acceptable_similarity: Count of acceptable hits
total_orfs_with_hits: Total hit count
similarity_compliance_rate: Fraction of hits in range
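The summary fields can be read as simple aggregates over the hit list. The sketch below recomputes three of them from a list of hit dictionaries. It is an illustration under assumptions, not the library's implementation: the `query_id` key identifying the originating ORF is hypothetical (the documented hit fields are target_id, pident, and evalue), and returning 0.0 for an empty hit list is a choice made here. orfs_with_acceptable_similarity is left out because its description admits more than one reading.

```python
def summarize_hits(hits, min_similarity, max_similarity):
    """Recompute summary counts from MMseqs2-style hit dictionaries.

    Assumption: each hit carries a hypothetical "query_id" naming its ORF.
    """
    in_range = [
        h for h in hits
        if min_similarity <= h["pident"] <= max_similarity
    ]
    return {
        # Distinct ORFs that matched anything in the database.
        "unique_orfs_with_hits": len({h["query_id"] for h in hits}),
        # Every ORF-hit pair counts once.
        "total_orfs_with_hits": len(hits),
        # Fraction of hits inside the range; 0.0 when there are no hits.
        "similarity_compliance_rate": len(in_range) / len(hits) if hits else 0.0,
    }


example_hits = [
    {"query_id": "orf_1", "target_id": "A", "pident": 55.0, "evalue": 1e-20},
    {"query_id": "orf_1", "target_id": "B", "pident": 92.0, "evalue": 1e-35},
    {"query_id": "orf_2", "target_id": "C", "pident": 41.5, "evalue": 1e-8},
]
summary = summarize_hits(example_hits, 40, 70)
print(summary)
```

With a [40, 70] range, two of the three example hits are in range, so the compliance rate is two thirds even though both ORFs have at least one hit.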

Usage

Filtering for sequences with low similarity to existing proteins:

python

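>>> from proto_language.core import Sequence, SequenceType
>>> # Import path inferred from the source file location listed above; adjust if your install exposes these names elsewhere.
>>> from proto_language.constraint.sequence_annotation.mmseqs_similarity_constraint import (
...     MMseqsSimilarityConfig,
...     mmseqs_similarity_constraint,
... )
>>> protein_seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> config = MMseqsSimilarityConfig(
...     min_similarity=10.0, max_similarity=30.0, mmseqs_db="/data/databases/uniref90"
... )
>>> results = mmseqs_similarity_constraint([(protein_seq,)], config)
>>> print(results[0].score)  # 0.0 means every hit falls within [10, 30] percent identity
>>> print(results[0].metadata["similarity_compliance_rate"])  # fraction of hits within the range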

Metadata

Property	Value
Key	`mmseqs-gene-similarity`
Function	`mmseqs_similarity_constraint`
Category	`sequence_annotation`
Mode	`discrete`
Uses GPU	`False`
Supported Types	`dna`, `protein`
