Protein Repetitiveness

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.

Source

proto-bio/proto-language/proto_language/constraint/protein_quality/protein_repetitiveness_constraint.py

Evaluate protein sequence repetitiveness based on k-mer frequency analysis.

This constraint function analyzes protein sequences for repetitive content by examining k-mer frequencies. It identifies sequences with excessive repetitive motifs, which may indicate low-complexity regions or non-functional proteins. The analysis scans multiple k-mer lengths to detect both short tandem repeats and larger sequence duplications.

The repetitiveness score represents the maximum fraction of the sequence covered by any repeated k-mer. For example, if “SSS” appears 8 times in a 60-amino-acid sequence, the repetitiveness for 3-mers is (8 * 3) / 60 = 0.4 (40% of sequence).
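The coverage calculation described above can be sketched in a few lines. This is a minimal illustration and not the constraint's actual implementation: the helper names `kmer_coverage` and `repetitiveness` are hypothetical, and the sketch assumes occurrences are counted with a sliding window (overlaps allowed), that a k-mer must appear at least twice to count as repeated, that coverage is capped at 1.0, and that the scan covers seven consecutive k-mer lengths.

```python
from collections import Counter


def kmer_coverage(sequence: str, k: int) -> float:
    """Fraction of the sequence covered by its most frequent repeated k-mer.

    Assumptions (not confirmed by the source): sliding-window counting,
    a k-mer seen only once is not a repeat, and the result is capped at 1.0.
    """
    if k <= 0 or len(sequence) < k:
        return 0.0
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    top = max(counts.values())
    if top < 2:
        return 0.0
    return min(1.0, top * k / len(sequence))


def repetitiveness(sequence: str, min_repeat_length: int = 1, extra: int = 6) -> float:
    """Maximum coverage over k = min_repeat_length .. min_repeat_length + extra.

    The inclusive upper bound is an assumption about how the scan range is applied.
    """
    ks = range(min_repeat_length, min_repeat_length + extra + 1)
    return max(kmer_coverage(sequence, k) for k in ks)


# A 60-residue toy sequence in which "SSS" occurs exactly 8 times.
example = ("SSS" + "AKDE") * 8 + "GHIL"
print(len(example))               # 60
print(kmer_coverage(example, 3))  # 0.4, i.e. (8 * 3) / 60
```

Because the toy sequence repeats a 7-residue unit, longer k-mers cover even more of it than the 3-mer does, which is why taking the maximum across k-mer lengths matters.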

API Reference

Config: ProteinRepetitivenessConfig

Configuration for protein repetitiveness constraint. This class defines configuration parameters for evaluating repetitive content in protein sequences using k-mer frequency analysis. The constraint detects and penalizes sequences with excessive tandem repeats or repetitive motifs, which may indicate low-complexity regions or non-functional proteins. The repetitiveness score is calculated as the maximum fraction of the sequence covered by any repeated k-mer. For example, if “AAA” appears 10 times in a 100-amino-acid sequence, the repetitiveness for 3-mers is (10 * 3) / 100 = 0.3 (30% of sequence).

max_repetitiveness

number

default:"0.1"

Maximum acceptable repetitiveness fraction (fraction of sequence covered by repeated k-mers)

min_repeat_length

integer

default:"1"

Smallest k-mer length treated as a repeat; the scan continues up to this length plus 6.

Returns: ConstraintOutput

One result per sequence. A score of 0.0 indicates acceptable repetitiveness (at or below the max_repetitiveness threshold), and higher values indicate excessive repetitive content. Penalties scale linearly with excess repetitiveness: if max_repetitiveness is 0.4 and the actual repetitiveness is 0.6, the excess (0.6 - 0.4 = 0.2) is normalized by the remaining range (1.0 - 0.4 = 0.6), giving a score of 0.2 / 0.6 ≈ 0.33. metadata carries:

repetitiveness_score: Float repetitiveness score (0.0-1.0) representing the maximum fraction of sequence covered by repeated k-mers
max_repetitive_fraction: Float identical to repetitiveness_score
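The linear penalty described above can be written out as a short function. This is a sketch of the stated formula only; the function name `repetitiveness_penalty` is hypothetical, and the handling of a threshold of 1.0 (where the remaining range is zero) is an assumption added to avoid a division by zero.

```python
def repetitiveness_penalty(actual: float, max_repetitiveness: float = 0.1) -> float:
    """Score 0.0 at or below the threshold, rising linearly to 1.0 at full coverage."""
    if actual <= max_repetitiveness:
        return 0.0
    remaining = 1.0 - max_repetitiveness
    if remaining <= 0.0:
        # Assumption: with a threshold of 1.0 nothing can exceed it.
        return 0.0
    return (actual - max_repetitiveness) / remaining


# Worked example from the text: threshold 0.4, actual 0.6.
print(round(repetitiveness_penalty(0.6, 0.4), 2))  # 0.33
```

Normalizing by the remaining range keeps the score between 0.0 and 1.0 regardless of where the threshold is set.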

Usage

Evaluating repetitiveness with a custom threshold (40%) and a minimum repeat length of 3:

python

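>>> from proto_language.core import Sequence, SequenceType
>>> # Import path inferred from the source file location listed above; adjust if these names are exported elsewhere.
>>> from proto_language.constraint.protein_quality.protein_repetitiveness_constraint import (
...     ProteinRepetitivenessConfig,
...     protein_repetitiveness_constraint,
... )
>>> config = ProteinRepetitivenessConfig(max_repetitiveness=0.4, min_repeat_length=3)
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> results = protein_repetitiveness_constraint([(seq,)], config)
>>> print(results[0].score)  # 0.0 if repetitiveness <= 40%
>>> print(results[0].metadata["repetitiveness_score"])  # e.g., 0.15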

Metadata

Property	Value
Key	`protein-repetitiveness`
Function	`protein_repetitiveness_constraint`
Category	`protein_quality`
Mode	`discrete`
Uses GPU	`False`
Supported Types	`protein`
