Protein Diversity

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Source
proto-bio/proto-language/proto_language/constraint/protein_quality/protein_diversity_constraint.py
Evaluate amino acid diversity in protein sequences. This constraint measures diversity as the fraction of the 20 standard amino acids that appear in a sequence and penalizes sequences that fall below a minimum diversity threshold. The penalty scales linearly with the deficit below that threshold.

API Reference

Config: ProteinDiversityConfig
Configuration for the protein diversity constraint. This class defines the parameters for evaluating amino acid diversity in protein sequences. The constraint counts how many different amino acid types are present in a sequence and penalizes sequences with insufficient diversity, which may indicate poor protein quality, repetitive sequences, or non-functional proteins.
A diversity score of 1.0 means all 20 standard amino acids are present. The minimum for a non-empty sequence is 0.05 (1/20), reached by a homopolymer (only one amino acid type); a score of 0.0 is unreachable since empty sequences are rejected.
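To make the score range concrete, here is a minimal sketch of the diversity calculation as described above. It is not the library's implementation: the helper name `aa_diversity` is hypothetical, and the handling of lowercase letters and non-standard characters (uppercased and ignored, respectively) is an assumption.

```python
# The 20 standard amino acids, as one-letter codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def aa_diversity(sequence: str) -> float:
    """Return the fraction of the 20 standard amino acids present in `sequence`.

    Hypothetical helper for illustration only.
    """
    if not sequence:
        # Mirrors the documented behavior that empty sequences are rejected.
        raise ValueError("empty sequences are rejected")
    # Assumption: input is treated case-insensitively and any character
    # outside the 20 standard amino acids is ignored.
    present = set(sequence.upper()) & STANDARD_AMINO_ACIDS
    return len(present) / len(STANDARD_AMINO_ACIDS)


print(aa_diversity("AAAAAAAA"))  # 0.05 (homopolymer: 1 of 20)
print(aa_diversity("ACDEFGHIKLMNPQRSTVWY"))  # 1.0 (all 20 present)
```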
min_diversity (number, default: 0.7)
Minimum acceptable amino acid diversity. Calculated as (unique amino acids) / 20.
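Because diversity is a count divided by 20, a threshold translates directly into a minimum number of distinct amino acids. The sketch below shows that arithmetic; `min_unique_required` is a hypothetical helper, not part of the library.

```python
import math


def min_unique_required(min_diversity: float, alphabet_size: int = 20) -> int:
    """Smallest number of distinct amino acids whose diversity meets the threshold.

    Hypothetical helper for illustration only.
    """
    # The small tolerance guards against floating-point noise in the product
    # pushing an exact value just past an integer boundary.
    return math.ceil(min_diversity * alphabet_size - 1e-9)


print(min_unique_required(0.7))  # 14 (the default threshold)
print(min_unique_required(0.5))  # 10
```

With the default of 0.7, a sequence therefore needs at least 14 of the 20 standard amino acids to avoid a penalty.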
Returns: ConstraintOutput
One result per sequence. A score of 0.0 means the sequence has sufficient diversity (at or above min_diversity); higher scores indicate insufficient amino acid diversity. The score scales linearly with the deficit below the threshold and is capped at 1.0. For example, if min_diversity is 0.5 and the actual diversity is 0.25, the score is 0.5, which is consistent with the deficit being expressed as a fraction of the threshold: (0.5 - 0.25) / 0.5. The metadata carries:
  • aa_diversity_score: Float diversity score (0.0-1.0) calculated as (unique amino acids) / 20
  • unique_amino_acid_count: Integer count of unique amino acid types present in the sequence (0-20)
  • unique_amino_acids: Sorted list of amino acid characters present in the sequence
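The scoring rule and metadata fields above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `sketch_constraint_output` and the plain-dict return shape are hypothetical, and normalizing the deficit by the threshold is inferred from the documented example (threshold 0.5, diversity 0.25, score 0.5).

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def sketch_constraint_output(sequence: str, min_diversity: float = 0.7) -> dict:
    """Return a dict mimicking the documented score and metadata fields.

    Hypothetical helper; the real constraint returns a ConstraintOutput.
    """
    if not sequence:
        raise ValueError("empty sequences are rejected")
    # Assumption: case-insensitive, non-standard characters ignored.
    present = sorted(set(sequence.upper()) & STANDARD_AMINO_ACIDS)
    diversity = len(present) / 20
    if diversity >= min_diversity:
        score = 0.0  # threshold met, no penalty
    else:
        # Assumption: deficit expressed as a fraction of the threshold,
        # capped at 1.0, which reproduces the documented example.
        score = min(1.0, (min_diversity - diversity) / min_diversity)
    return {
        "score": score,
        "metadata": {
            "aa_diversity_score": diversity,
            "unique_amino_acid_count": len(present),
            "unique_amino_acids": present,
        },
    }


# Five distinct amino acids give a diversity of 5 / 20 = 0.25.
result = sketch_constraint_output("AAAAACCCCCDDDDDEEEEEFFFFF", min_diversity=0.5)
print(result["score"])  # 0.5
print(result["metadata"]["unique_amino_acids"])  # ['A', 'C', 'D', 'E', 'F']
```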

Usage

Evaluating protein diversity:
python
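>>> from proto_language.core import Sequence, SequenceType
>>> # Assumed import path, inferred from the source file location listed above;
>>> # adjust it if the package exports these names elsewhere.
>>> from proto_language.constraint.protein_quality.protein_diversity_constraint import (
...     ProteinDiversityConfig,
...     protein_diversity_constraint,
... )
>>> config = ProteinDiversityConfig(min_diversity=0.5)
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> # The constraint takes a list of sequence tuples and returns one result per entry.
>>> results = protein_diversity_constraint([(seq,)], config)
>>> print(results[0].score)  # 0.0 (diversity 0.85 >= 0.5)
>>> print(results[0].metadata["aa_diversity_score"])  # 0.85 (17 / 20)
>>> print(results[0].metadata["unique_amino_acid_count"])  # 17
>>> print(results[0].metadata["unique_amino_acids"])  # ['A', 'D', 'E', 'F', ...]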

Metadata

| Property | Value |
| --- | --- |
| Key | protein-diversity |
| Function | protein_diversity_constraint |
| Category | protein_quality |
| Mode | discrete |
| Uses GPU | False |
| Supported Types | protein |