License: Segmasker is licensed under Custom (NCBI BLAST+ public domain). Please refer to the license for full terms.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


Statistics of local complexity in amino acid sequences and sequence databases
John C Wootton and Scott Federhen
Computers & Chemistry (1993)
@article{wootton1993seg,
  title={Statistics of local complexity in amino acid sequences and sequence databases},
  author={Wootton, John C and Federhen, Scott},
  journal={Computers \& Chemistry},
  volume={17},
  number={2},
  pages={149--163},
  year={1993},
  publisher={Elsevier},
  doi={10.1016/0097-8485(93)85006-x}
}

@article{camacho2009blastplus,
  title={BLAST+: architecture and applications},
  author={Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
  journal={BMC Bioinformatics},
  volume={10},
  pages={421},
  year={2009},
  publisher={BioMed Central},
  doi={10.1186/1471-2105-10-421}
}
proto-bio/proto-tools/proto_tools/tools/sequence_scoring/segmasker
Function          Description
run_segmasker()   Detect low-complexity regions in protein sequences using NCBI segmasker

Background

Most natural proteins contain regions whose amino acid composition is strongly biased, including homopolymeric runs, short-period repeats, and segments dominated by a few residue types. These low-complexity regions are biologically real but cause difficulty in sequence alignment, because their similarity is driven by shared composition rather than by common ancestry, which inflates the apparent significance of matches between unrelated sequences. The SEG algorithm (Wootton and Federhen, 1993) quantifies local compositional complexity along a protein sequence using a sliding window and partitions the sequence into segments of low and high complexity. Masking or down-weighting the low-complexity segments before a similarity search improves the specificity of the results.

Segmasker is the SEG implementation distributed as a command-line program within the NCBI BLAST+ suite (Camacho et al., 2009), which reorganized the original BLAST applications into modular command-line tools. Within that suite, segmasker applies the SEG procedure to protein sequences and reports the low-complexity regions it identifies, which can then be excluded from similarity searches or used to flag compositionally biased designs.
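
The sliding-window idea can be illustrated with a short Python sketch. It assumes that local complexity is measured as the Shannon entropy, in bits, of the residue composition in each window; the function names are hypothetical, and the code covers only a window-level trigger step. It omits the segment extension and refinement that SEG performs, so it will not reproduce segmasker output.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    # Sum p * log2(1/p) over the residue types present in the window.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def low_entropy_windows(seq: str, window: int = 12, locut: float = 2.2) -> list[int]:
    """Start positions of windows whose entropy falls below `locut`.

    Simplified trigger step only (an assumption for illustration);
    sequences shorter than `window` yield no windows at all.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if window_entropy(seq[i : i + window]) < locut
    ]
```

A homopolymeric window has entropy zero and is always flagged, while a window of twelve distinct residues has entropy log2(12), well above the default cutoffs.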

Learning Resources

Tools

Segmasker Low-Complexity Detection (segmasker-score)

Applies the SEG algorithm to one or more protein sequences and returns, for each sequence, the number of residues classified as low-complexity, the fraction of the sequence those residues represent, and the sequence length. The low-complexity fraction is the primary metric for ranking sequences by compositional bias.

API Reference

sequences (List[string], required)
Protein sequence(s) to analyze for low-complexity regions. Can be provided as:
window (integer, default: 12)
Sliding-window size for SEG complexity analysis. Larger windows are less sensitive to short low-complexity stretches.
locut (number, default: 2.2)
Lower complexity cutoff. Regions scoring below this are classified as low-complexity.
hicut (number, default: 2.5)
Upper complexity cutoff. Defines the transition between masked and unmasked regions. Must be >= locut.
verbose (integer, default: 0)
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device (string, default: "cpu")
Device to run the tool on.
timeout (integer, default: 600)
Maximum execution time in seconds. None waits indefinitely.
seed (integer)
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
results
List[SegmaskerMetrics]
Per-sequence low-complexity metrics, index-aligned with inputs.sequences.
Metrics
Metric                    Type    Range        Availability
low_complexity_fraction   float   0.0 to 1.0   always
low_complexity_count      int     ≥ 0          always
sequence_length           int     ≥ 1          always
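
The relationship between the three metrics can be sketched in Python. This is an illustration under stated assumptions, not the toolkit's implementation: the Metrics class is a hypothetical stand-in for SegmaskerMetrics, and the masked regions are assumed to arrive as 0-based intervals with inclusive ends.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical stand-in for the toolkit's SegmaskerMetrics."""

    low_complexity_count: int
    low_complexity_fraction: float
    sequence_length: int


def metrics_from_intervals(length: int, intervals: list[tuple[int, int]]) -> Metrics:
    """Derive the three metrics from masked intervals.

    Intervals are assumed to be 0-based with inclusive ends; overlapping
    intervals are counted once per residue.
    """
    masked: set[int] = set()
    for start, end in intervals:
        masked.update(range(start, end + 1))
    count = len(masked)
    # An empty sequence reports a fraction of zero rather than dividing by zero.
    fraction = count / length if length else 0.0
    return Metrics(count, fraction, length)
```

For example, a 20-residue sequence with masked intervals (0, 4) and (3, 9) has 10 low-complexity residues and a fraction of 0.5.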

Applications

  • Screening designed protein sequences for compositional bias before further analysis.
  • Quantifying low-complexity content to flag homopolymeric runs or short-period repeats.
  • Prioritizing sequences for masking ahead of a protein similarity search to reduce spurious matches.
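
A screening step along these lines might look like the following Python sketch. It assumes the low-complexity fractions have already been obtained and are in the same order as the input sequences; the function names and the 0.3 threshold are hypothetical choices for illustration, not recommended values.

```python
def rank_by_bias(sequences: list[str], fractions: list[float]) -> list[tuple[str, float]]:
    """Pair each sequence with its fraction and sort, most biased first."""
    if len(sequences) != len(fractions):
        raise ValueError("sequences and fractions must have the same length")
    return sorted(zip(sequences, fractions), key=lambda pair: pair[1], reverse=True)


def flag_biased(ranked: list[tuple[str, float]], threshold: float = 0.3) -> list[str]:
    """Sequences whose low-complexity fraction meets or exceeds `threshold`.

    The default threshold is an arbitrary illustrative value.
    """
    return [seq for seq, frac in ranked if frac >= threshold]
```

The flagged sequences are candidates for masking or redesign; the rest can proceed to downstream analysis unchanged.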

Usage Tips

  • window sets the scale of the regions detected. A larger window targets broader low-complexity stretches, while a smaller window resolves shorter runs.
  • locut and hicut set how aggressively regions are flagged. Raising the cutoffs classifies more of the sequence as low-complexity, while lowering them applies a stricter criterion that flags only the most biased regions. hicut must be greater than or equal to locut.
  • Very short and empty sequences yield limited information. A sequence shorter than the window cannot be assessed reliably, and an empty sequence reports a low-complexity fraction of zero.
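
The constraints in these tips can be checked before a run with a small Python sketch. Both helpers are hypothetical and are not part of the toolkit; they only encode the rules stated above (hicut at least locut, and sequences at least as long as the window).

```python
def check_seg_params(window: int, locut: float, hicut: float) -> None:
    """Raise ValueError if the parameters violate the stated constraints."""
    if window < 1:
        raise ValueError("window must be a positive integer")
    if hicut < locut:
        raise ValueError("hicut must be greater than or equal to locut")


def too_short(sequences: list[str], window: int = 12) -> list[int]:
    """Indices of sequences shorter than the window, including empty ones."""
    return [i for i, seq in enumerate(sequences) if len(seq) < window]
```

Sequences reported by too_short can still be submitted, but their metrics should be interpreted with caution.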

Toolkit Notes

  • Detection runs on CPU and is deterministic. Segmasker takes only protein sequences, runs without a GPU, and returns the same values for identical inputs on repeated calls.
  • Results are index-aligned with the input. Each result corresponds to the input sequence at the same position, so a batch of sequences returns metrics in the order they were supplied.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.