Protein Complexity

License: Segmasker is distributed under a custom license (NCBI BLAST+, public domain). Please refer to the license for full terms.

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.

proto-bio/proto-language/proto_language/constraint/protein_quality/protein_complexity_constraint.py

@article{wootton1993seg,
  title={Statistics of local complexity in amino acid sequences and sequence databases},
  author={Wootton, John C and Federhen, Scott},
  journal={Computers \& Chemistry},
  volume={17},
  number={2},
  pages={149--163},
  year={1993},
  publisher={Elsevier},
  doi={10.1016/0097-8485(93)85006-x}
}

@article{camacho2009blastplus,
  title={BLAST+: architecture and applications},
  author={Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
  journal={BMC Bioinformatics},
  volume={10},
  pages={421},
  year={2009},
  publisher={BioMed Central},
  doi={10.1186/1471-2105-10-421}
}

Evaluate protein sequence complexity using segmasker to detect low-complexity regions.

This constraint function uses NCBI’s segmasker tool to identify low-complexity regions in protein sequences. Low-complexity regions contain repetitive or compositionally biased amino acid sequences that may indicate poor protein quality, tandem repeats, or non-functional segments. The constraint penalizes sequences in which the fraction of residues falling in low-complexity regions exceeds a specified threshold.

The function processes multiple sequences in a single call. Segmasker marks low-complexity regions by converting their residues to lowercase, and the constraint calculates the fraction of masked positions in each sequence.
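The masking step described above can be sketched in plain Python. This is a minimal illustration, not the constraint's actual implementation: the helper names `run_segmasker` and `masked_fractions` are hypothetical, and the sketch assumes segmasker is on the PATH and is invoked with `-outfmt fasta` so that masked residues come back in lowercase.

```python
import os
import subprocess
import tempfile


def run_segmasker(fasta_text: str) -> str:
    """Run segmasker on FASTA text and return the masked FASTA it prints.

    Hypothetical helper: assumes the NCBI BLAST+ `segmasker` binary is installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.fasta")
        with open(path, "w") as handle:
            handle.write(fasta_text)
        result = subprocess.run(
            ["segmasker", "-in", path, "-infmt", "fasta", "-outfmt", "fasta"],
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout


def masked_fractions(masked_fasta: str) -> list[tuple[int, float]]:
    """Return (masked_count, masked_fraction) for each record in masked FASTA."""
    records: list[list[str]] = []
    for line in masked_fasta.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([])          # a header line starts a new record
        elif line and records:
            records[-1].append(line)    # sequence lines may wrap across rows

    results = []
    for chunks in records:
        sequence = "".join(chunks)
        count = sum(1 for residue in sequence if residue.islower())
        fraction = count / len(sequence) if sequence else 0.0
        results.append((count, fraction))
    return results


# Hand-written masked output, so the parser can be tried without segmasker.
example = ">a\nMVLSPADKTNqqqqqqqqqq\n>b\nACDEFGHIKL\n"
fractions = masked_fractions(example)
```

Here `fractions` holds one `(count, fraction)` pair per record, which corresponds to the `low_complexity_count` and `low_complexity_fraction` values the constraint reports.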

API Reference

Config: ProteinComplexityConfig

Configuration for the protein complexity constraint. This class defines configuration parameters for evaluating protein sequence complexity using NCBI’s segmasker tool. The constraint detects and penalizes low-complexity regions, which contain repetitive or biased amino acid compositions that may indicate poor protein quality or non-functional sequences.

max_low_complexity (number, default: 0.2)

Maximum acceptable fraction of low-complexity regions (repetitive/biased amino acid compositions)

Returns: ConstraintOutput

One result per sequence. A score of 0.0 indicates acceptable complexity (low-complexity fraction at or below threshold) and higher values indicate excessive low-complexity content. Scores scale linearly with excess low-complexity beyond the threshold, capped at 1.0. metadata carries:

low_complexity_fraction: Float fraction of sequence identified as low-complexity (0.0-1.0)
low_complexity_count: Integer count of positions masked as low-complexity (from segmasker)
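The scoring rule above fixes the endpoints (0.0 at or below the threshold, a linear rise, a cap at 1.0) but this page does not state the slope. The sketch below assumes the excess is normalized by the remaining headroom `1 - max_low_complexity`; that normalization and the function name `complexity_score` are illustrative assumptions, not taken from the constraint's source.

```python
def complexity_score(low_complexity_fraction: float, max_low_complexity: float = 0.2) -> float:
    """Hypothetical score: 0.0 within the threshold, rising linearly to 1.0."""
    if not 0.0 <= max_low_complexity <= 1.0:
        raise ValueError("max_low_complexity must be between 0.0 and 1.0")

    excess = low_complexity_fraction - max_low_complexity
    if excess <= 0.0:
        return 0.0  # at or below the threshold: acceptable complexity

    # Assumed normalization: a fully masked sequence maps to a score of 1.0.
    headroom = 1.0 - max_low_complexity
    if headroom <= 0.0:
        return 1.0
    return min(1.0, excess / headroom)


within_threshold = complexity_score(0.15)   # below the default 0.2 threshold
fully_masked = complexity_score(1.0)        # every position masked
```

Under this assumed normalization, a sequence that is 60% low-complexity with the default threshold would score about 0.5; a different slope in the real implementation would change that value but not the 0.0 floor or the 1.0 cap.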

Usage

Evaluating protein complexity:

python

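>>> from proto_language.core import Sequence, SequenceType
>>> # Import path inferred from the source file location listed above; adjust if your installation differs.
>>> from proto_language.constraint.protein_quality.protein_complexity_constraint import (
...     ProteinComplexityConfig,
...     protein_complexity_constraint,
... )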
>>> config = ProteinComplexityConfig(max_low_complexity=0.2)
>>> seq = Sequence("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF", "protein")
>>> results = protein_complexity_constraint([(seq,)], config)
>>> print(results[0].score)  # 0.0 if low-complexity <= 20%
>>> print(results[0].metadata["low_complexity_fraction"])  # e.g., 0.15
>>> print(results[0].metadata["low_complexity_count"])  # e.g., 5

Metadata

Property	Value
Key	`protein-complexity`
Function	`protein_complexity_constraint`
Category	`protein_quality`
Mode	`discrete`
Uses GPU	`False`
Supported Types	`protein`
