Homopolymer Length

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Source
proto-bio/proto-language/proto_language/constraint/sequence_composition/max_homopolymer_constraint.py
Penalize sequences containing homopolymers longer than a specified maximum. This constraint finds the longest homopolymer (a consecutive run of identical nucleotides or amino acids) in each sequence and penalizes the sequence when that run exceeds the configured maximum length. The penalty is logarithmically scaled and capped at 1.0: sequences slightly over the limit receive moderate penalties, while sequences far over the limit receive strong penalties. This keeps penalty values bounded while still strongly discouraging very long homopolymers.
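
The graduated, capped behaviour described above can be illustrated with a small sketch. The exact formula used by the constraint is not given on this page, so the function below is only an assumption for illustration: the name `log_scaled_penalty` and the `saturation_excess` parameter are hypothetical and are not part of the library.

```python
import math


def log_scaled_penalty(longest_run: int, max_length: int, saturation_excess: int = 20) -> float:
    """Illustrative log-scaled penalty in [0.0, 1.0].

    Hypothetical formula, NOT the library's implementation:
    saturation_excess is an assumed number of residues over the limit
    at which the penalty reaches its cap of 1.0.
    """
    excess = longest_run - max_length
    if excess <= 0:
        # The longest run is within the limit, so the sequence passes.
        return 0.0
    # log1p(excess) grows quickly for small excesses and flattens for large
    # ones; dividing by log1p(saturation_excess) normalizes it, and min()
    # enforces the cap at 1.0.
    return min(1.0, math.log1p(excess) / math.log1p(saturation_excess))
```

With these assumed settings, a run one residue over the limit scores well below a run ten residues over, and anything at or beyond the saturation point scores 1.0.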

API Reference

Config: MaxHomopolymerConfig
Configuration for the maximum homopolymer constraint. This class defines the parameters for limiting homopolymer length in DNA, RNA, or protein sequences. Homopolymers are consecutive runs of the same nucleotide or amino acid (e.g., “AAAAA”, “GGGGGG”, “SSSSSS”). Penalties are logarithmically scaled: moderate for runs slightly over the limit and strong for runs far over it, without reaching extreme values.
max_length (integer, required)
Maximum allowed run of consecutive identical nucleotides or amino acids; longer runs are penalized.
Returns: ConstraintOutput
One result per sequence. A score of 0.0 indicates no homopolymers exceed the maximum length (pass). Higher scores indicate longer homopolymers with logarithmic scaling. metadata carries:
  • max_homopolymer_length: Integer length of the longest homopolymer found in the sequence. For example, “ATCGAAAAAGTC” would have value 5 (for the “AAAAA” run).
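
As a rough illustration of what `max_homopolymer_length` measures, the sketch below computes the longest run of identical characters using only the standard library. The helper name `longest_homopolymer` is hypothetical and this is not the library's implementation; in particular, the comparison here is case-sensitive, which is an assumption.

```python
from itertools import groupby


def longest_homopolymer(sequence: str) -> int:
    """Return the length of the longest run of identical characters.

    Hypothetical helper for illustration only. Comparison is
    case-sensitive, so "aA" counts as two runs of length 1.
    """
    # groupby yields one group per maximal run of equal adjacent characters;
    # default=0 covers the empty sequence.
    return max((sum(1 for _ in run) for _, run in groupby(sequence)), default=0)


print(longest_homopolymer("ATCGAAAAAGTC"))  # 5 (the "AAAAA" run)
```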

Usage

Checking a DNA sequence against a maximum run length of 4 (for example, to avoid long A/T runs before DNA synthesis):
python
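>>> from proto_language.core import Sequence
>>> # Import path assumed from the source file location listed above
>>> from proto_language.constraint.sequence_composition.max_homopolymer_constraint import (
...     MaxHomopolymerConfig,
...     max_homopolymer_constraint,
... )
>>> seq = Sequence("ATCGATCGTAGC", "dna")
>>> config = MaxHomopolymerConfig(max_length=4)  # penalize runs longer than 4
>>> results = max_homopolymer_constraint([(seq,)], config)
>>> print(results[0].score)  # 0.0 (no runs >4)
>>> print(results[0].metadata["max_homopolymer_length"])  # 1 (no adjacent repeats)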

Metadata

Property         Value
Key              max-homopolymer
Function         max_homopolymer_constraint
Category         sequence_composition
Mode             discrete
Uses GPU         False
Supported Types  dna, rna, protein