Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.
Background
Most natural proteins contain regions whose amino acid composition is strongly biased, including homopolymeric runs, short-period repeats, and segments dominated by a few residue types. These low-complexity regions are biologically real but cause difficulty in sequence alignment, because their similarity is driven by shared composition rather than by common ancestry, which inflates the apparent significance of matches between unrelated sequences. The SEG algorithm (Wootton and Federhen, 1993) quantifies local compositional complexity along a protein sequence using a sliding window and partitions the sequence into segments of low and high complexity. Masking or down-weighting the low-complexity segments before a similarity search improves the specificity of the results.

Segmasker is the SEG implementation distributed as a command-line program within the NCBI BLAST+ suite (Camacho et al., 2009), which reorganized the original BLAST applications into modular command-line tools. Within that suite, segmasker applies the SEG procedure to protein sequences and reports the low-complexity regions it identifies, which can then be excluded from similarity searches or used to flag compositionally biased designs.
Learning Resources
- NCBI BLAST+ Command Line Applications User Manual - the reference manual for the BLAST+ suite that segmasker ships with, including its masking applications.
- BLAST Help (NCBI) - NCBI’s documentation hub for BLAST concepts, including low-complexity filtering.
Tools
Segmasker Low-Complexity Detection (segmasker-score)
Applies the SEG algorithm to one or more protein sequences and returns, for each sequence, the number of residues classified as low-complexity, the fraction of the sequence those residues represent, and the sequence length. The low-complexity fraction is the primary metric for ranking sequences by compositional bias.
API Reference
Input: SegmaskerInput
Config: SegmaskerConfig
locut.
True is coerced to 1 and False to 0.
None waits indefinitely.
BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Output: SegmaskerOutput
inputs.sequences.

| Metric | Type | Range | Availability |
|---|---|---|---|
| low_complexity_fraction | float | 0.0 to 1.0 | always |
| low_complexity_count | int | ≥ 0 | always |
| sequence_length | int | ≥ 1 | always |
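The three metrics are related: the fraction is the low-complexity count divided by the sequence length. The sketch below illustrates that relationship under one assumption that is not part of this toolkit's documented interface: the input is a soft-masked sequence in which lowercase letters mark low-complexity residues. The function name `low_complexity_metrics` is hypothetical, and the toolkit may compute its values differently.

```python
def low_complexity_metrics(masked_sequence: str) -> dict:
    """Derive the three reported metrics from a soft-masked sequence.

    Assumption: lowercase letters mark residues classified as
    low-complexity; uppercase letters are unmasked residues.
    """
    length = len(masked_sequence)
    count = sum(1 for residue in masked_sequence if residue.islower())
    # An empty sequence reports a fraction of zero rather than dividing by zero.
    fraction = count / length if length else 0.0
    return {
        "low_complexity_fraction": fraction,
        "low_complexity_count": count,
        "sequence_length": length,
    }


# Illustrative input: 10 masked residues in a 22-residue sequence.
example = low_complexity_metrics("MKTAYIAKqqqqqqqqqqLVDE")
print(example)
```

Guarding the division keeps the empty-sequence case consistent with the behavior described under Usage Tips.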
Applications
- Screening designed protein sequences for compositional bias before further analysis.
- Quantifying low-complexity content to flag homopolymeric runs or short-period repeats.
- Prioritizing sequences for masking ahead of a protein similarity search to reduce spurious matches.
Usage Tips
- window sets the scale of the regions detected. A larger window targets broader low-complexity stretches, while a smaller window resolves shorter runs.
- locut and hicut set how aggressively regions are flagged. Raising the cutoffs classifies more of the sequence as low-complexity, while lowering them applies a stricter criterion that flags only the most biased regions. hicut must be greater than or equal to locut.
- Very short and empty sequences give limited information. A sequence shorter than the window cannot be assessed reliably, and an empty sequence reports a low-complexity fraction of zero.
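To make the roles of window, locut, and hicut concrete, here is a deliberately simplified sketch of a SEG-style scan. It is not segmasker's implementation: it uses only Shannon entropy per window, and it omits the refinement stage that the full SEG algorithm applies to each raw segment. The function names are hypothetical, and the default values (12, 2.2, 2.5) are the commonly cited SEG defaults, assumed here rather than taken from this toolkit.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of one window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def flag_low_complexity(sequence: str, window: int = 12,
                        locut: float = 2.2, hicut: float = 2.5) -> list:
    """Return one boolean per residue: True if flagged as low-complexity.

    Simplified two-threshold scan (an illustration, not segmasker itself):
    a window with entropy <= locut triggers a segment, which is then
    extended across neighboring windows with entropy <= hicut.
    """
    if hicut < locut:
        raise ValueError("hicut must be greater than or equal to locut")
    n = len(sequence)
    flagged = [False] * n
    if n < window:
        # Too short to evaluate even one full window.
        return flagged
    entropies = [window_entropy(sequence[i:i + window])
                 for i in range(n - window + 1)]
    i = 0
    while i < len(entropies):
        if entropies[i] <= locut:
            start = end = i
            while start > 0 and entropies[start - 1] <= hicut:
                start -= 1
            while end + 1 < len(entropies) and entropies[end + 1] <= hicut:
                end += 1
            # Windows start..end together cover residues start..end+window-1.
            for j in range(start, end + window):
                flagged[j] = True
            i = end + 1
        else:
            i += 1
    return flagged


flags = flag_low_complexity("MKTAYIAKQRQQQQQQQQQQLVDEACDEFGHIKLMNPQRSTVWY")
print(sum(flags), "of", len(flags), "residues flagged")
```

In this sketch, raising locut can only add trigger windows, so the flagged set grows or stays the same, which mirrors the tip that higher cutoffs classify more of the sequence as low-complexity.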
Toolkit Notes
- Detection runs on CPU and is deterministic. Segmasker takes only protein sequences, runs without a GPU, and returns the same values for identical inputs on repeated calls.
- Results are index-aligned with the input. Each result corresponds to the input sequence at the same position, so a batch of sequences returns metrics in the order they were supplied.
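Because results are index-aligned with the input, each sequence can be paired with its metrics by position and then ranked by low-complexity fraction. The following sketch assumes the per-sequence results are available as plain dictionaries keyed by the metric names in the table above; that representation, the helper name `rank_by_low_complexity`, and the metric values are illustrative assumptions, not actual tool output.

```python
def rank_by_low_complexity(sequences, results):
    """Pair each input with its result by position, most biased first."""
    if len(sequences) != len(results):
        raise ValueError("expected one result per input sequence")
    paired = list(zip(sequences, results))
    return sorted(paired,
                  key=lambda pair: pair[1]["low_complexity_fraction"],
                  reverse=True)


# Hypothetical inputs and made-up metric values, listed in input order.
sequences = ["MKTAYIAKQR", "QQQQQQQQQQ", "MKQQQQQLVD"]
results = [
    {"low_complexity_fraction": 0.0, "low_complexity_count": 0, "sequence_length": 10},
    {"low_complexity_fraction": 1.0, "low_complexity_count": 10, "sequence_length": 10},
    {"low_complexity_fraction": 0.5, "low_complexity_count": 5, "sequence_length": 10},
]

ranked = rank_by_low_complexity(sequences, results)
for sequence, metrics in ranked:
    print(sequence, metrics["low_complexity_fraction"])
```

Pairing before sorting preserves the link between each sequence and its metrics once the original order is lost.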
