Segmasker - Proto

License: Segmasker is distributed under a custom license (NCBI BLAST+ public domain). Please refer to the license for full terms.

Proto is not affiliated with NCBI. This toolkit is open source and builds on the implementation produced by NCBI. Product names, logos, and trademarks are the property of their respective owners.

Statistics of local complexity in amino acid sequences and sequence databases

John C Wootton and Scott Federhen

Computers & Chemistry (1993)

@article{wootton1993seg,
  title={Statistics of local complexity in amino acid sequences and sequence databases},
  author={Wootton, John C and Federhen, Scott},
  journal={Computers \& Chemistry},
  volume={17},
  number={2},
  pages={149--163},
  year={1993},
  publisher={Elsevier},
  doi={10.1016/0097-8485(93)85006-x}
}

@article{camacho2009blastplus,
  title={BLAST+: architecture and applications},
  author={Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
  journal={BMC Bioinformatics},
  volume={10},
  pages={421},
  year={2009},
  publisher={BioMed Central},
  doi={10.1186/1471-2105-10-421}
}

proto-bio/proto-tools/proto_tools/tools/sequence_scoring/segmasker

Function	Description
`run_segmasker()`	Detect low-complexity regions in protein sequences using NCBI segmasker

Background

Most natural proteins contain regions whose amino acid composition is strongly biased, including homopolymeric runs, short-period repeats, and segments dominated by a few residue types. These low-complexity regions are biologically real but cause difficulty in sequence alignment, because their similarity is driven by shared composition rather than by common ancestry, which inflates the apparent significance of matches between unrelated sequences. The SEG algorithm (Wootton and Federhen, 1993) quantifies local compositional complexity along a protein sequence using a sliding window and partitions the sequence into segments of low and high complexity. Masking or down-weighting the low-complexity segments before a similarity search improves the specificity of the results. Segmasker is the SEG implementation distributed as a command-line program within the NCBI BLAST+ suite (Camacho et al., 2009), which reorganized the original BLAST applications into modular command-line tools. Within that suite, segmasker applies the SEG procedure to protein sequences and reports the low-complexity regions it identifies, which can then be excluded from similarity searches or used to flag compositionally biased designs.
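The sliding-window idea can be illustrated with a minimal sketch. It is a simplification and not the SEG procedure itself: it assumes an entropy-style complexity score in bits, marks every window that scores below a single cutoff, and omits SEG's second cutoff and segment refinement. The function names are hypothetical, and the output should not be expected to match segmasker.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_entropy_mask(seq: str, window: int = 12, cutoff: float = 2.2) -> list[bool]:
    """Mark residues covered by at least one window scoring below the cutoff.

    Simplified single-threshold illustration; the real SEG algorithm does more.
    """
    mask = [False] * len(seq)
    # A sequence shorter than the window yields no windows and stays unmasked.
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < cutoff:
            for i in range(start, start + window):
                mask[i] = True
    return mask
```

A homopolymeric window has entropy 0 bits, while a window of 12 distinct residues reaches log2(12), roughly 3.58 bits, so the default cutoffs of 2.2 and 2.5 sit between those two extremes.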

Learning Resources

NCBI BLAST+ Command Line Applications User Manual - the reference manual for the BLAST+ suite that segmasker ships with, including its masking applications.
BLAST Help (NCBI) - NCBI’s documentation hub for BLAST concepts, including low-complexity filtering.

Tools

Segmasker Low-Complexity Detection (`segmasker-score`)

Applies the SEG algorithm to one or more protein sequences and returns, for each sequence, the number of residues classified as low-complexity, the fraction of the sequence those residues represent, and the sequence length. The low-complexity fraction is the primary metric for ranking sequences by compositional bias.
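The three reported values can be derived from a list of masked intervals. The sketch below assumes interval-style masker output in which each sequence has a `>` header line followed by `start - end` lines with zero-based, inclusive coordinates; that format, and the helper names `parse_intervals` and `summarize`, are assumptions for illustration and not the toolkit's internal code.

```python
def parse_intervals(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse interval-style output into {sequence_id: [(start, end), ...]}."""
    regions: dict[str, list[tuple[int, int]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            regions[current] = []
        elif current is not None:
            # Assumed line shape: "start - end"
            start, end = (int(part) for part in line.split("-"))
            regions[current].append((start, end))
    return regions


def summarize(intervals: list[tuple[int, int]], sequence_length: int) -> dict:
    """Compute count, fraction, and length from inclusive masked intervals."""
    masked: set[int] = set()
    for start, end in intervals:
        masked.update(range(start, end + 1))  # end coordinate is inclusive
    count = len(masked)
    # An empty sequence reports a fraction of zero rather than dividing by zero.
    fraction = count / sequence_length if sequence_length else 0.0
    return {
        "low_complexity_count": count,
        "low_complexity_fraction": fraction,
        "sequence_length": sequence_length,
    }
```

Collecting positions in a set keeps the count correct even if two reported intervals overlap.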

API Reference

Input: SegmaskerInput

Field	Type	Required	Description
`sequences`	List[string]	yes	Protein sequence(s) to analyze for low-complexity regions. Can be provided as:

Config: SegmaskerConfig

Field	Type	Default	Description
`window`	integer	12	Sliding-window size for SEG complexity analysis. Larger windows are less sensitive to short low-complexity stretches.
`locut`	number	2.2	Lower complexity cutoff. Regions scoring below this are classified as low-complexity.
`hicut`	number	2.5	Upper complexity cutoff. Defines the transition between masked and unmasked regions. Must be >= locut.
`verbose`	integer	0	Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
`device`	string	"cpu"	Device to run the tool on.
`timeout`	integer	600	Maximum execution time in seconds. None waits indefinitely.
`seed`	integer	-	Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
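As a rough illustration of how the three SEG parameters and the `hicut >= locut` constraint fit together, the sketch below validates them and builds a command line for the underlying program. `SegParams` is a hypothetical stand-in and not the toolkit's `SegmaskerConfig`; the flag names follow the BLAST+ segmasker command line as commonly documented and should be confirmed with `segmasker -help` for the installed version. Nothing is executed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegParams:
    """Hypothetical holder for the SEG parameters described above."""

    window: int = 12
    locut: float = 2.2
    hicut: float = 2.5

    def __post_init__(self) -> None:
        if self.window < 1:
            raise ValueError("window must be a positive integer")
        if self.hicut < self.locut:
            raise ValueError("hicut must be >= locut")

    def to_argv(self, fasta_path: str) -> list[str]:
        """Build an argument list; flag names are assumed, verify locally."""
        return [
            "segmasker",
            "-in", fasta_path,
            "-window", str(self.window),
            "-locut", str(self.locut),
            "-hicut", str(self.hicut),
            "-outfmt", "interval",
        ]
```

Validating in `__post_init__` rejects an inconsistent pair of cutoffs before any subprocess would be started.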

Output: SegmaskerOutput

Field	Type	Description
`results`	List[SegmaskerMetrics]	Per-sequence low-complexity metrics, index-aligned with inputs.sequences.
`primary_metric`	string	Name of the metric that best summarizes the result overall (e.g. "avg_plddt" for AlphaFold2). Used by downstream UI and reporting to pick a headline value.

Metrics

Metric	Type	Range	Availability
`low_complexity_fraction`	float	0.0 to 1.0	always
`low_complexity_count`	int	≥ 0	always
`sequence_length`	int	≥ 1	always

Applications

Screening designed protein sequences for compositional bias before further analysis.
Quantifying low-complexity content to flag homopolymeric runs or short-period repeats.
Prioritizing sequences for masking ahead of a protein similarity search to reduce spurious matches.
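A screening step along these lines might pair each input sequence with its metrics, rank by `low_complexity_fraction`, and flag anything above a chosen threshold. The sketch below works on plain dictionaries keyed by the documented metric names; the function name and the 0.2 threshold are hypothetical, and the real output objects may expose these values differently.

```python
def rank_by_low_complexity(
    sequences: list[str],
    results: list[dict],
    max_fraction: float = 0.2,  # hypothetical screening threshold
) -> tuple[list[dict], list[dict]]:
    """Return (rows ranked most-biased first, rows above the threshold)."""
    if len(sequences) != len(results):
        raise ValueError("results must be index-aligned with sequences")
    rows = [
        {"index": i, "sequence": seq, **metrics}
        for i, (seq, metrics) in enumerate(zip(sequences, results))
    ]
    ranked = sorted(
        rows, key=lambda row: row["low_complexity_fraction"], reverse=True
    )
    flagged = [r for r in ranked if r["low_complexity_fraction"] > max_fraction]
    return ranked, flagged
```

Keeping the original index in each row makes it possible to map a flagged entry back to its position in the input batch after sorting.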

Usage Tips

window sets the scale of the regions detected. A larger window targets broader low-complexity stretches, while a smaller window resolves shorter runs.
locut and hicut set how aggressively regions are flagged. Raising the cutoffs classifies more of the sequence as low-complexity, while lowering them applies a stricter criterion that flags only the most biased regions. hicut must be greater than or equal to locut.
Very short and empty sequences give limited results. A sequence shorter than the window cannot be assessed reliably, and an empty sequence reports a low-complexity fraction of zero.
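One way to act on the last tip is to separate out sequences shorter than the window before interpreting their metrics. The helper below is a hypothetical pre-check, not part of the toolkit, and assumes the default window of 12.

```python
def split_by_window(
    sequences: list[str], window: int = 12
) -> tuple[list[int], list[int]]:
    """Return (indices long enough to assess, indices shorter than the window)."""
    assessable: list[int] = []
    too_short: list[int] = []
    for index, seq in enumerate(sequences):
        (assessable if len(seq) >= window else too_short).append(index)
    return assessable, too_short
```

Returning indices rather than sequences keeps the split compatible with index-aligned results.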

Toolkit Notes

Detection runs on CPU and is deterministic. Segmasker takes only protein sequences, runs without a GPU, and returns the same values for identical inputs on repeated calls.
Results are index-aligned with the input. Each result corresponds to the input sequence at the same position, so a batch of sequences returns metrics in the order they were supplied.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.