Evaluate protein sequence complexity using segmasker to detect low-complexity regions
License: Segmasker is licensed under Custom (NCBI BLAST+ public domain). Please refer to the license for full terms.
This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
@article{wootton1993seg,
  title={Statistics of local complexity in amino acid sequences and sequence databases},
  author={Wootton, John C and Federhen, Scott},
  journal={Computers \& Chemistry},
  volume={17},
  number={2},
  pages={149--163},
  year={1993},
  publisher={Elsevier},
  doi={10.1016/0097-8485(93)85006-x}
}
@article{camacho2009blastplus,
  title={BLAST+: architecture and applications},
  author={Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
  journal={BMC Bioinformatics},
  volume={10},
  pages={421},
  year={2009},
  publisher={BioMed Central},
  doi={10.1186/1471-2105-10-421}
}
Evaluate protein sequence complexity using segmasker to detect low-complexity regions.
This constraint function uses NCBI’s segmasker tool to identify low-complexity
regions in protein sequences. Low-complexity regions contain repetitive or
compositionally biased amino acid sequences that may indicate poor protein
quality, tandem repeats, or non-functional segments. The constraint penalizes
sequences in which the fraction of residues in low-complexity regions exceeds
a specified threshold.
The function processes multiple sequences simultaneously. Segmasker marks
low-complexity regions by converting their residues to lowercase, and
the constraint calculates the fraction of masked positions in each sequence.
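The masked-fraction step described above can be sketched as follows. This is a minimal illustration, not the constraint's actual implementation: it assumes the soft-masked text has already been produced (for example by running segmasker with FASTA output), and the helper names `parse_fasta` and `masked_stats` and the example sequences are hypothetical.

```python
from typing import Dict, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}, preserving letter case."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record id.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def masked_stats(sequence: str) -> Tuple[int, float]:
    """Return (masked_count, masked_fraction) for a soft-masked sequence.

    Lowercase residues are treated as masked (low-complexity) positions.
    """
    count = sum(1 for residue in sequence if residue.islower())
    fraction = count / len(sequence) if sequence else 0.0
    return count, fraction


# Made-up soft-masked input for illustration only (assumed to resemble
# what segmasker emits when asked for FASTA output).
masked_fasta = """>seq1
MKTAYIAKQRqqqqqqqqqqLSERLGLIEV
>seq2
MSTNPKPQRKTKRNTNRRPQDVKFPGG
"""

for record_id, seq in parse_fasta(masked_fasta).items():
    count, fraction = masked_stats(seq)
    print(record_id, count, round(fraction, 3))
```

Keeping the counting logic separate from the tool invocation makes it easy to check without segmasker installed.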
Configuration for protein complexity constraint.
This class defines configuration parameters for evaluating protein sequence
complexity using NCBI’s segmasker tool. The constraint detects and penalizes
low-complexity regions, which contain repetitive or biased amino acid
compositions that may indicate poor protein quality or non-functional sequences.
Maximum acceptable fraction of low-complexity regions (repetitive/biased amino acid compositions)
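As an illustration of how such a threshold could be held and validated, here is a small sketch. The class name `ComplexityConfig` and the field name `max_low_complexity_fraction` are hypothetical, since the source gives only the parameter's description, and no default value is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexityConfig:
    """Hypothetical stand-in for the constraint's configuration class."""

    # Hypothetical field name: maximum acceptable fraction of
    # low-complexity positions, expressed as a value in [0.0, 1.0].
    max_low_complexity_fraction: float

    def __post_init__(self) -> None:
        # A fraction outside [0, 1] cannot be a meaningful threshold.
        if not 0.0 <= self.max_low_complexity_fraction <= 1.0:
            raise ValueError(
                "max_low_complexity_fraction must be between 0.0 and 1.0"
            )


config = ComplexityConfig(max_low_complexity_fraction=0.2)
print(config.max_low_complexity_fraction)
```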
Returns: ConstraintOutput
One result per sequence. A score of 0.0 indicates
acceptable complexity (low-complexity fraction at or below the threshold),
and higher values indicate excessive low-complexity content. Scores
scale linearly with the excess low-complexity fraction beyond the threshold,
capped at 1.0.
metadata carries:
low_complexity_fraction: Float fraction of the sequence identified as low-complexity (0.0-1.0)
low_complexity_count: Integer count of positions masked as low-complexity (from segmasker)
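The scoring rule above can be sketched as below. The source does not state the slope of the linear scaling, so this sketch assumes the excess is normalized by the remaining range (1.0 minus the threshold), which makes a fully masked sequence score 1.0. That normalization, the function names, and the plain-dict result shape are all assumptions for illustration; only the two metadata keys come from the description above.

```python
from typing import Any, Dict


def complexity_score(low_complexity_fraction: float, threshold: float) -> float:
    """Score in [0.0, 1.0]: 0.0 at or below threshold, linear above it."""
    if not 0.0 <= low_complexity_fraction <= 1.0:
        raise ValueError("low_complexity_fraction must be in [0.0, 1.0]")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.0, 1.0]")
    if low_complexity_fraction <= threshold:
        return 0.0
    # Assumed normalization: divide the excess by the range above the
    # threshold. Here threshold < fraction <= 1.0, so the divisor is > 0.
    excess = low_complexity_fraction - threshold
    return min(1.0, excess / (1.0 - threshold))


def evaluate(masked_sequence: str, threshold: float) -> Dict[str, Any]:
    """Build a hypothetical per-sequence result from a soft-masked sequence."""
    count = sum(1 for residue in masked_sequence if residue.islower())
    fraction = count / len(masked_sequence) if masked_sequence else 0.0
    return {
        "score": complexity_score(fraction, threshold),
        "metadata": {
            "low_complexity_fraction": fraction,
            "low_complexity_count": count,
        },
    }


# Ten masked positions out of twenty, with a made-up threshold of 0.0.
result = evaluate("MKTAYIAKQRqqqqqqqqqq", threshold=0.0)
print(result)
```

A different slope, such as using the raw excess without normalization, would equally satisfy "linear, capped at 1.0"; check the implementation before relying on exact score values.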