Skip to main content
Balanced Amino Acid Representation

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Source
proto-bio/proto-language/proto_language/constraint/protein_quality/balanced_aa_constraint.py
View source
Evaluate the presence of underrepresented amino acids in protein sequences. This constraint function assesses whether protein sequences have balanced representation of all amino acid types by identifying amino acids that appear below a minimum frequency threshold and penalizing sequences that have too many such underrepresented amino acids. The penalty is scaled based on both the number of excess underrepresented amino acids and the severity of their under-representation. For each input sequence, it calculates amino acid frequencies, identifies underrepresented amino acids, and computes a penalty score if the number of underrepresented amino acids exceeds the configured threshold.

API Reference

ConfigBalancedAaConfig Source
Configuration for balanced amino acid constraint.This class defines configuration parameters for evaluating whether a protein sequence has balanced representation of all amino acid types. The constraint penalizes sequences that have too many underrepresented amino acids (those appearing below a minimum frequency threshold). The penalty score increases both with the number of underrepresented amino acids beyond the threshold and with the severity of under-representation (how far below min_aa_frequency each amino acid falls).
min_aa_frequency
number
default:"0.02"
Minimum acceptable relative frequency for any amino acid type.
max_underrepresented_count
integer
default:"3"
Maximum acceptable number of underrepresented amino acid types. Sequences with more are penalized.
ReturnsConstraintOutput
One result per sequence. score ranges from 0.0 (best, acceptable number of underrepresented amino acids) to 1.0 (worst, many severely underrepresented amino acids), scaled by excess count and severity below the minimum frequency. metadata carries:
  • underrepresented_aa_score: Float score indicating overall underrepresentation severity
  • amino_acid_counts: Dictionary mapping amino acids to their counts
  • underrepresented_amino_acids: List of amino acids that are underrepresented
  • underrepresented_aa_count: Integer count of underrepresented amino acid types
  • min_aa_frequency_threshold: The minimum frequency threshold used

Usage

Evaluating amino acid balance in protein:
python
>>> from proto_language.core import Sequence, SequenceType
>>> config = BalancedAaConfig(min_aa_frequency=0.05, max_underrepresented_count=2)
>>> seq = Sequence("AAAAAACCCCCCDDDDDD", sequence_type="protein")
>>> results = balanced_aa_constraint([(seq,)], config)
>>> # This sequence has only 3 amino acid types, so 17 are underrepresented
>>> # This exceeds max_underrepresented_count=2, resulting in a penalty
>>> print(results[0].score)  # Will be > 0.0
>>> print(results[0].metadata["underrepresented_aa_count"])  # 17

Metadata

PropertyValue
Keybalanced-aa
Functionbalanced_aa_constraint
Categoryprotein_quality
Modediscrete
Uses GPUFalse
Supported Typesprotein