Protein Complexity
License: Segmasker is licensed under Custom (NCBI BLAST+ public domain). Please refer to the license for full terms.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


proto-bio/proto-language/proto_language/constraint/protein_quality/protein_complexity_constraint.py
@article{wootton1993seg,
  title={Statistics of local complexity in amino acid sequences and sequence databases},
  author={Wootton, John C and Federhen, Scott},
  journal={Computers \& Chemistry},
  volume={17},
  number={2},
  pages={149--163},
  year={1993},
  publisher={Elsevier},
  doi={10.1016/0097-8485(93)85006-x}
}

@article{camacho2009blastplus,
  title={BLAST+: architecture and applications},
  author={Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
  journal={BMC Bioinformatics},
  volume={10},
  pages={421},
  year={2009},
  publisher={BioMed Central},
  doi={10.1186/1471-2105-10-421}
}
Evaluate protein sequence complexity using segmasker to detect low-complexity regions.

This constraint function uses NCBI’s segmasker tool to identify low-complexity regions in protein sequences. Low-complexity regions contain repetitive or compositionally biased amino acid sequences that may indicate poor protein quality, tandem repeats, or non-functional segments. The constraint penalizes sequences whose low-complexity fraction exceeds a specified threshold.

The function processes multiple sequences in a single call. Segmasker marks low-complexity regions by converting their residues to lowercase (soft-masking), and the constraint calculates the fraction of masked positions in each sequence.
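The masked-fraction calculation described above can be sketched as follows. This is a minimal illustration, not the constraint's actual implementation: the helper name `low_complexity_stats` is hypothetical, and it assumes the input is soft-masked FASTA text of the kind segmasker is expected to emit with a command along the lines of `segmasker -in input.fasta -outfmt fasta`.

```python
from typing import List, Tuple


def _summarize(chunks: List[str]) -> Tuple[int, float]:
    """Count lowercase (masked) residues in one record and return (count, fraction)."""
    seq = "".join(chunks)
    masked = sum(1 for residue in seq if residue.islower())
    fraction = masked / len(seq) if seq else 0.0
    return masked, fraction


def low_complexity_stats(masked_fasta: str) -> List[Tuple[int, float]]:
    """Hypothetical helper: per-record (masked_count, masked_fraction).

    Assumes soft-masked FASTA text in which low-complexity residues
    are lowercase and all other residues are uppercase.
    """
    stats: List[Tuple[int, float]] = []
    chunks = None  # sequence lines of the record currently being read
    for line in masked_fasta.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record (if any) and opens a new one.
            if chunks is not None:
                stats.append(_summarize(chunks))
            chunks = []
        elif chunks is not None:
            chunks.append(line)
    if chunks is not None:
        stats.append(_summarize(chunks))
    return stats


# Made-up example input: the lowercase run stands in for a masked region.
example = ">seq1\nMVLSPADKTNqqqqqqqqqqVKAAWGKVGA\n>seq2\nMKTAYIAKQR\n"
for count, fraction in low_complexity_stats(example):
    print(count, round(fraction, 3))
```

The two returned values correspond in spirit to the `low_complexity_count` and `low_complexity_fraction` metadata fields documented below.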

API Reference

Config: ProteinComplexityConfig
Configuration for the protein complexity constraint. This class defines configuration parameters for evaluating protein sequence complexity using NCBI’s segmasker tool. The constraint detects and penalizes low-complexity regions, which contain repetitive or biased amino acid compositions that may indicate poor protein quality or non-functional sequences.
max_low_complexity (number, default: 0.2)
Maximum acceptable fraction of low-complexity regions (repetitive/biased amino acid compositions)
Returns: ConstraintOutput
One result per sequence. A score of 0.0 indicates acceptable complexity (low-complexity fraction at or below max_low_complexity). Higher scores indicate excessive low-complexity content: the score scales linearly with the excess beyond the threshold and is capped at 1.0. Each result's metadata carries:
  • low_complexity_fraction: Float fraction of sequence identified as low-complexity (0.0-1.0)
  • low_complexity_count: Integer count of positions masked as low-complexity (from segmasker)
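The scoring rule above (zero at or below the threshold, linear in the excess, capped at 1.0) might look like the sketch below. The function name `complexity_penalty` is hypothetical, and the normalization by the remaining headroom `1 - max_low_complexity` is an assumption made for illustration; the page does not state the exact scaling factor, so the real constraint may differ.

```python
def complexity_penalty(low_complexity_fraction: float,
                       max_low_complexity: float = 0.2) -> float:
    """Hypothetical penalty: 0.0 within the threshold, linear above it, capped at 1.0.

    ASSUMPTION: the excess is divided by (1 - max_low_complexity), so a fully
    masked sequence scores exactly 1.0. The actual scaling is not documented here.
    """
    excess = low_complexity_fraction - max_low_complexity
    if excess <= 0.0:
        return 0.0  # at or below the threshold: no penalty
    headroom = 1.0 - max_low_complexity
    if headroom <= 0.0:
        return 1.0  # degenerate threshold; avoid dividing by zero
    return min(1.0, excess / headroom)


for fraction in (0.15, 0.2, 0.6, 1.0):
    print(fraction, complexity_penalty(fraction))
```

Under this assumed scaling, a sequence that is 60% low-complexity with the default threshold would score about 0.5, and only a fully masked sequence would reach the cap.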

Usage

Evaluating protein complexity:
python
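>>> # Import path below is inferred from the source file location; adjust if the package re-exports these names.
>>> from proto_language.constraint.protein_quality.protein_complexity_constraint import (
...     ProteinComplexityConfig,
...     protein_complexity_constraint,
... )
>>> from proto_language.core import Sequence, SequenceType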
>>> config = ProteinComplexityConfig(max_low_complexity=0.2)
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> results = protein_complexity_constraint([(seq,)], config)
>>> print(results[0].score)  # 0.0 if low-complexity <= 20%
>>> print(results[0].metadata["low_complexity_fraction"])  # e.g., 0.15
>>> print(results[0].metadata["low_complexity_count"])  # e.g., 5

Metadata

| Property | Value |
| --- | --- |
| Key | protein-complexity |
| Function | protein_complexity_constraint |
| Category | protein_quality |
| Mode | discrete |
| Uses GPU | False |
| Supported Types | protein |