Overall Protein Quality
License: This constraint can use multiple tools, each under its own license. See the Tools Used tab and each tool’s page for license details.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


proto-bio/proto-language/proto_language/constraint/protein_quality/overall_protein_quality_constraint.py
Evaluate overall protein quality using multiple configurable sub-constraints. This constraint function provides a comprehensive assessment of protein quality by evaluating multiple aspects including sequence length, structural complexity, repetitiveness, amino acid diversity, and balanced amino acid representation. For DNA sequences, it first predicts protein-coding regions using Prodigal, then evaluates all predicted proteins. For protein sequences, it evaluates them directly. The function aggregates scores from enabled sub-constraints by averaging them and clipping to [0.0, 1.0]. Use the native Constraint(threshold=...) parameter for pass/fail filtering.
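The aggregation step described above (average the enabled sub-constraint scores, then clip to [0.0, 1.0]) can be sketched in plain Python. This is an illustration only, not the library's implementation: `aggregate_scores` is a hypothetical helper, and the example scores are made up.

```python
from statistics import fmean
from typing import Mapping


def aggregate_scores(sub_scores: Mapping[str, float]) -> float:
    """Average the enabled sub-constraint scores and clip to [0.0, 1.0].

    Hypothetical helper: mirrors the aggregation described in the text,
    not the actual code in overall_protein_quality_constraint.py.
    """
    if not sub_scores:
        # The documented config requires at least one enabled sub-constraint.
        raise ValueError("at least one sub-constraint score is required")
    mean = fmean(sub_scores.values())
    # Clip so an out-of-range sub-score cannot push the result outside [0, 1].
    return min(1.0, max(0.0, mean))


# Made-up scores for three enabled sub-constraints.
example = {"length": 0.0, "complexity": 0.5, "repetitiveness": 0.25}
overall = aggregate_scores(example)
```

Disabled sub-constraints are simply absent from the mapping, so they do not affect the average.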

API Reference

Config: OverallProteinQualityConfig
Configuration for the overall protein quality constraint.

This configuration class orchestrates multiple protein quality sub-constraints that can be enabled or disabled individually. It provides a flexible framework for comprehensive protein quality assessment by combining various metrics including sequence length, structural complexity, repetitiveness, amino acid diversity, and balanced amino acid representation.

The configuration uses a nested structure where all sub-constraint parameters are exposed through a single protein_quality_config attribute of type ProteinQualitySubConfig. This design allows for easy serialization in UI/API schemas while maintaining clear organization of constraint-specific parameters.

At least one sub-constraint must be enabled for the configuration to be valid. This is enforced through a model validator that runs after initialization.
The nested protein_quality_config provides access to:
  • Length constraint: Validates protein length against min/max range or target value
  • Complexity constraint: Detects low-complexity regions using segmasker
  • Repetitiveness constraint: Identifies repeated k-mer patterns
  • Diversity constraint: Ensures adequate amino acid type diversity
  • Balanced amino acids constraint: Checks for underrepresented amino acid types
Each sub-constraint can be independently enabled/disabled and configured with specific parameters. See ProteinQualitySubConfig documentation for complete parameter details.
For more details, see:
  • ProteinQualitySubConfig: Detailed documentation of all sub-constraint parameters and configuration options
  • overall_protein_quality_constraint: The constraint function that uses this configuration
  • SequenceLengthConfig: Configuration for length constraint
  • ProteinComplexityConfig: Configuration for complexity constraint
  • ProteinRepetitivenessConfig: Configuration for repetitiveness constraint
  • ProteinDiversityConfig: Configuration for diversity constraint
  • BalancedAaConfig: Configuration for balanced amino acids constraint
protein_quality_config (ProteinQualitySubConfig, required): Nested configuration for protein quality checks
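The "at least one sub-constraint must be enabled" rule can be illustrated with a standard-library stand-in. The real class enforces this through a model validator. The sketch below uses a dataclass `__post_init__` hook instead, and `QualityFlagsSketch` is a hypothetical name. The `enable_*` field names follow the usage example on this page, but the defaults shown here are arbitrary assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class QualityFlagsSketch:
    """Hypothetical stand-in for the enable flags of ProteinQualitySubConfig.

    Defaults are assumptions made for this sketch, not the library's defaults.
    """
    enable_length: bool = True
    enable_complexity: bool = True
    enable_repetitiveness: bool = True
    enable_diversity: bool = True
    enable_balanced_aas: bool = True

    def __post_init__(self) -> None:
        # Runs after initialization, like the documented model validator.
        if not any(getattr(self, f.name) for f in fields(self)):
            raise ValueError("At least one sub-constraint must be enabled")


# Valid: only the length check is enabled.
only_length = QualityFlagsSketch(
    enable_complexity=False,
    enable_repetitiveness=False,
    enable_diversity=False,
    enable_balanced_aas=False,
)
```

Disabling every flag raises `ValueError` at construction time, so an invalid configuration never reaches the constraint function.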
Returns: ConstraintOutput
One result per sequence. Scores range from 0.0 (best) to 1.0 (worst) and represent the average of all enabled sub-constraint scores, clipped to [0.0, 1.0]. For DNA sequences, the score reflects the average quality across all predicted proteins.
metadata carries:
For DNA sequences:
  • prodigal_proteins: List of dicts of predicted proteins from Prodigal, each with protein ID, sequence, length, etc. (or None if no ORFs were predicted)
  • prodigal_protein_count: Integer count of predicted ORFs
  • predicted_protein_count: Integer count of proteins (same as prodigal_protein_count)
  • avg_constraint_score: Float average quality score across all predicted proteins
  • protein_quality_details: List of dictionaries, one per predicted protein, each containing:
    • protein_id: String identifier from Prodigal
    • length: Integer protein length in amino acids
    • avg_constraint_score: Float average across enabled constraints
    • quality_scores: Dictionary mapping constraint names to scores
    • metadata: Dictionary of additional constraint-specific metadata
For protein sequences:
  • protein_quality_scores: Dictionary mapping constraint names (e.g., “length”, “complexity”, “repetitiveness”, “diversity”, “balanced_aas”) to their individual scores
  • avg_constraint_score: Float average across all enabled constraints
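To make the two metadata layouts concrete, the sketch below reads hand-written dictionaries that use the keys listed above. It assumes the metadata is available as a plain dict. The helpers `failing_constraints` and `worst_protein` are hypothetical, and all values are illustrative rather than real output.

```python
from typing import Any, Dict, List, Optional


def failing_constraints(metadata: Dict[str, Any], cutoff: float) -> List[str]:
    """Names of sub-constraints scoring above cutoff (higher is worse)."""
    scores = metadata["protein_quality_scores"]
    return sorted(name for name, score in scores.items() if score > cutoff)


def worst_protein(metadata: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Predicted protein with the highest (worst) average score, if any."""
    details = metadata.get("protein_quality_details") or []
    if not details:
        return None
    return max(details, key=lambda d: d["avg_constraint_score"])


# Illustrative metadata for a protein input (values are made up).
protein_meta = {
    "protein_quality_scores": {"length": 0.0, "complexity": 0.6, "diversity": 0.3},
    "avg_constraint_score": 0.3,
}

# Illustrative metadata for a DNA input with two predicted proteins.
dna_meta = {
    "prodigal_protein_count": 2,
    "predicted_protein_count": 2,
    "avg_constraint_score": 0.3,
    "protein_quality_details": [
        {"protein_id": "orf_1", "length": 120, "avg_constraint_score": 0.1,
         "quality_scores": {"length": 0.1}, "metadata": {}},
        {"protein_id": "orf_2", "length": 45, "avg_constraint_score": 0.5,
         "quality_scores": {"length": 0.5}, "metadata": {}},
    ],
}

flagged = failing_constraints(protein_meta, cutoff=0.5)
worst = worst_protein(dna_meta)
```

Because 0.0 is the best score and 1.0 the worst, both helpers treat larger values as lower quality.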

Usage

Using all available constraints with custom thresholds:
python
>>> quality_config = ProteinQualitySubConfig(
...     enable_length=True,
...     length_target_length=300,
...     enable_complexity=True,
...     complexity_max_low_complexity=0.25,
...     enable_repetitiveness=True,
...     repetitiveness_max_repetitiveness=0.08,
...     repetitiveness_min_repeat_length=3,
...     enable_diversity=True,
...     diversity_min_diversity=0.75,
...     enable_balanced_aas=True,
...     balanced_min_aa_frequency=0.03,
...     balanced_max_underrepresented_count=2,
... )
>>> overall_cfg = OverallProteinQualityConfig(protein_quality_config=quality_config)
>>> protein_seq = Sequence("MKYIVAVAG...", "protein")
>>> results = overall_protein_quality_constraint([(protein_seq,)], overall_cfg)

Metadata

Property | Value
Key | overall-protein-quality
Function | overall_protein_quality_constraint
Category | protein_quality
Mode | discrete
Uses GPU | False
Supported Types | dna, protein