Protein Repetitiveness

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Source: proto-bio/proto-language/proto_language/constraint/protein_quality/protein_repetitiveness_constraint.py
Evaluate protein sequence repetitiveness based on k-mer frequency analysis.

This constraint function analyzes protein sequences for repetitive content by examining k-mer frequencies. It identifies sequences with excessive repetitive motifs, which may indicate low-complexity regions or non-functional proteins. The analysis scans multiple k-mer lengths to detect both short tandem repeats and larger sequence duplications.

The repetitiveness score represents the maximum fraction of the sequence covered by any repeated k-mer. For example, if “SSS” appears 8 times in a 60-amino-acid sequence, the repetitiveness for 3-mers is (8 * 3) / 60 = 0.4 (40% of sequence).
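The calculation described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names `kmer_coverage` and `repetitiveness` are hypothetical, occurrences are assumed to be counted without overlap, and the scanned range is assumed to include both endpoints (from the minimum length up to the minimum length plus 6).

```python
def kmer_coverage(seq: str, k: int) -> float:
    """Largest fraction of seq covered by one k-mer that occurs more than once."""
    if k <= 0 or len(seq) < k:
        return 0.0
    best = 0.0
    # Every distinct k-mer that appears anywhere in the sequence.
    for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
        count = seq.count(kmer)  # assumption: non-overlapping occurrences
        if count > 1:
            best = max(best, count * k / len(seq))
    return best


def repetitiveness(seq: str, min_repeat_length: int = 1, span: int = 6) -> float:
    """Maximum coverage over k = min_repeat_length .. min_repeat_length + span.

    The inclusive upper bound is an assumption.
    """
    lengths = range(min_repeat_length, min_repeat_length + span + 1)
    return max((kmer_coverage(seq, k) for k in lengths), default=0.0)


# Worked example from the text: "SSS" eight times in a 60-residue sequence.
filler = "ACDEFGHIKLMNPQRTVWYYWVTRQPNMLKIHGFED"  # 36 residues, no repeated 3-mer
seq = "SSS" * 8 + filler
print(len(seq), kmer_coverage(seq, 3))  # 60 0.4
```

Taking the maximum over all scanned lengths means a single dominant motif is enough to raise the score, regardless of which k-mer length exposes it.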

API Reference

Config: ProteinRepetitivenessConfig
Configuration for protein repetitiveness constraint. This class defines configuration parameters for evaluating repetitive content in protein sequences using k-mer frequency analysis. The constraint detects and penalizes sequences with excessive tandem repeats or repetitive motifs, which may indicate low-complexity regions or non-functional proteins. The repetitiveness score is calculated as the maximum fraction of the sequence covered by any repeated k-mer. For example, if “AAA” appears 10 times in a 100-amino-acid sequence, the repetitiveness for 3-mers is (10 * 3) / 100 = 0.3 (30% of sequence).
max_repetitiveness (number, default: 0.1)
Maximum acceptable repetitiveness fraction (fraction of sequence covered by repeated k-mers).
min_repeat_length (integer, default: 1)
Smallest k-mer length treated as a repeat; the scan continues up to this length plus 6.
Returns: ConstraintOutput
One result per sequence. A score of 0.0 indicates acceptable repetitiveness (at or below the threshold), and higher values indicate excessive repetitive content. Penalties scale linearly with excess repetitiveness: if max_repetitiveness is 0.4 and the actual repetitiveness is 0.6, the excess (0.2) is normalized by the remaining range (1.0 - 0.4 = 0.6), giving a score of about 0.33. The metadata carries:
  • repetitiveness_score: Float repetitiveness score (0.0-1.0) representing the maximum fraction of sequence covered by repeated k-mers
  • max_repetitive_fraction: Float identical to repetitiveness_score
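The linear penalty described above can be written out as a short sketch. The function name `repetitiveness_penalty` is hypothetical and the guard for a zero remaining range is an added assumption; only the formula itself (excess divided by the remaining range) comes from the description.

```python
def repetitiveness_penalty(repetitiveness: float, max_repetitiveness: float = 0.1) -> float:
    """Return 0.0 at or below the threshold, else the excess scaled to (0, 1]."""
    if repetitiveness <= max_repetitiveness:
        return 0.0
    remaining = 1.0 - max_repetitiveness
    if remaining <= 0.0:
        # Defensive assumption: only reachable for thresholds of 1.0 or more.
        return 1.0
    return (repetitiveness - max_repetitiveness) / remaining


print(round(repetitiveness_penalty(0.6, 0.4), 2))  # 0.33
print(repetitiveness_penalty(0.3, 0.4))            # 0.0
```

Normalizing by the remaining range keeps the score between 0.0 and 1.0 for any threshold below 1.0, so a fully repetitive sequence always scores 1.0.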

Usage

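Evaluating repetitiveness with a custom threshold and minimum repeat length:
python
>>> from proto_language.core import Sequence, SequenceType
>>> # Import path below is assumed from the source file location listed above
>>> from proto_language.constraint.protein_quality.protein_repetitiveness_constraint import (
...     ProteinRepetitivenessConfig,
...     protein_repetitiveness_constraint,
... )
>>> config = ProteinRepetitivenessConfig(max_repetitiveness=0.4, min_repeat_length=3)
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> results = protein_repetitiveness_constraint([(seq,)], config)
>>> print(results[0].score)  # 0.0 if repetitiveness <= 40%
>>> print(results[0].metadata["repetitiveness_score"])  # float between 0.0 and 1.0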

Metadata

Property         Value
Key              protein-repetitiveness
Function         protein_repetitiveness_constraint
Category         protein_quality
Mode             discrete
Uses GPU         False
Supported Types  protein