Sigma70 Promoter Strength

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Source
proto-bio/proto-language/proto_language/constraint/sequence_annotation/sigma70_promoter_constraint.py
Evaluate E. coli sigma-70 promoter similarity using PWM-based scoring.
This constraint function evaluates bacterial promoter similarity by scanning DNA sequences for sigma-70-dependent promoter elements. It identifies putative -35 and -10 boxes, scores them by similarity to the consensus sequences weighted by position-specific conservation probabilities, evaluates the spacer distance between them, and combines these scores into an overall promoter similarity prediction.
The scoring model is based on RegulonDB experimental data for E. coli sigma-70 promoters and uses three components:
  1. PWM score: Position weight matrix score based on conservation probabilities
  2. Match count: Simple count of consensus matches (out of 12 positions)
  3. Spacer length: Deviation from optimal 17 bp spacer
The function scans sequences to find the best-scoring promoter configuration within the allowed spacer range [min_spacer, max_spacer]. For short sequences (≤32 bp), it treats the entire sequence as a fixed promoter. For longer sequences, it exhaustively scans all positions.
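The scanning behaviour described above can be sketched as a function that lists the candidate box placements to score. This is a minimal illustration, not the library's code: `candidate_windows` is a hypothetical name, and the assumption is that a candidate is a (start of -35 box, spacer length) pair with two 6 bp boxes.

```python
from typing import List, Tuple, Union

BOX_LEN = 6        # each of the -35 and -10 boxes is 6 bp
SHORT_LIMIT = 32   # sequences up to this length are treated as one fixed promoter


def candidate_windows(
    seq: str, min_spacer: int = 14, max_spacer: int = 20
) -> Union[str, List[Tuple[int, int]]]:
    """Return (start_of_minus35_box, spacer_len) candidates, or a failure reason."""
    n = len(seq)
    if n < 2 * BOX_LEN:
        return "too_short"
    if n <= SHORT_LIMIT:
        # Fixed layout: first 6 bp = -35 box, last 6 bp = -10 box.
        spacer = n - 2 * BOX_LEN
        if min_spacer <= spacer <= max_spacer:
            return [(0, spacer)]
        return "invalid_spacer"
    # Longer sequence: every start position with every allowed spacer that fits.
    return [
        (pos, spacer)
        for pos in range(n - 2 * BOX_LEN - min_spacer + 1)
        for spacer in range(min_spacer, max_spacer + 1)
        if pos + 2 * BOX_LEN + spacer <= n
    ]
```

Each candidate would then be scored, and the lowest-penalty one kept as the reported promoter.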

API Reference

Config: Sigma70PromoterConfig
Configuration for sigma-70 promoter similarity constraint.
This class defines configuration parameters for evaluating bacterial promoter similarity using a position weight matrix (PWM) model of E. coli sigma-70 promoters. The model scores promoter elements based on similarity to consensus sequences for the -35 and -10 boxes, the spacer distance between them, and the total number of matches to consensus. This approach is based on RegulonDB experimental data for E. coli sigma-70-dependent promoters.
The scoring combines three components:
  1. PWM score: Similarity to consensus sequences weighted by conservation
  2. Match count: Number of exact matches to consensus (out of 12 positions)
  3. Spacer length: Distance between -35 and -10 boxes
The constraint scans sequences to find the best-scoring promoter within the allowed spacer range. For sequences ≤32 bp, it treats the entire sequence as a single promoter (first 6 bp = -35, last 6 bp = -10). For longer sequences, it scans all possible positions.
The final penalty is computed in two steps from the three component penalties:
  1. Box penalty = (1 - match_weight) * PWM_penalty + match_weight * match_penalty
  2. Total penalty = (1 - spacer_weight) * box_penalty + spacer_weight * spacer_penalty
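The two formulas above translate directly into a small helper. The function name `combine_penalties` is hypothetical; the arithmetic and the default weights are taken from this page, and each component penalty is assumed to already lie in the range 0.0-1.0.

```python
def combine_penalties(
    pwm_penalty: float,
    match_penalty: float,
    spacer_penalty: float,
    match_weight: float = 0.3,
    spacer_weight: float = 0.3,
) -> float:
    """Blend the three component penalties into one total penalty."""
    # Step 1: mix the PWM and match-count penalties into a box penalty.
    box_penalty = (1 - match_weight) * pwm_penalty + match_weight * match_penalty
    # Step 2: mix the box penalty with the spacer penalty.
    return (1 - spacer_weight) * box_penalty + spacer_weight * spacer_penalty
```

Because both steps are convex combinations, the total stays within 0.0-1.0 whenever the inputs do and the weights are between 0 and 1.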
consensus_35
string
default:"TTGACA"
-35 box consensus sequence (6 bp, typically TTGACA for E. coli sigma-70)
consensus_10
string
default:"TATAAT"
-10 box consensus sequence (6 bp Pribnow box, typically TATAAT for E. coli sigma-70)
probs_35
List[number]
default:"[0.69, 0.79, 0.61, 0.56, 0.54, 0.54]"
Position-specific conservation probabilities for -35 box (6 values). From RegulonDB.
probs_10
List[number]
default:"[0.77, 0.76, 0.6, 0.61, 0.56, 0.82]"
Position-specific conservation probabilities for -10 box (6 values). From RegulonDB.
optimal_spacer
integer
default:"17"
Optimal spacer length between -35 and -10 boxes in base pairs (typically 17±1 bp)
spacer_sigma
number
default:"1.5"
Standard deviation for spacer length penalty. Lower values = stricter spacing requirement.
spacer_weight
number
default:"0.3"
Weight (0-1) for spacer penalty in total score. Higher = spacing more important.
gamma
number
default:"0.1"
PWM score exponent for non-linearity. Lower values = more sensitive to mismatches.
k_opt
integer
default:"8"
Optimal number of matches to consensus (out of 12 total positions)
match_sigma
number
default:"2.0"
Standard deviation for match count penalty
match_weight
number
default:"0.3"
Weight (0-1) for match count penalty in total score
min_spacer
integer
default:"14"
Minimum acceptable spacer length in bp
max_spacer
integer
default:"20"
Maximum acceptable spacer length in bp
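This page does not state the functional form of the spacer penalty. Since spacer_sigma is described as a standard deviation, one plausible reading is a Gaussian-shaped penalty around optimal_spacer. The sketch below assumes that form purely to illustrate how spacer_sigma controls strictness; `spacer_penalty_sketch` is a hypothetical name and may not match the library's actual computation.

```python
import math


def spacer_penalty_sketch(
    spacer_len: int, optimal_spacer: int = 17, sigma: float = 1.5
) -> float:
    """ASSUMED Gaussian-shaped penalty: 0.0 at the optimum, approaching 1.0 far from it."""
    deviation = spacer_len - optimal_spacer
    return 1.0 - math.exp(-(deviation ** 2) / (2.0 * sigma ** 2))


# Under this assumption, a smaller sigma penalises the same deviation more heavily.
for sigma in (1.0, 1.5, 3.0):
    print(sigma, round(spacer_penalty_sketch(19, sigma=sigma), 3))
```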
Returns: ConstraintOutput
One result per sequence. Score ranges from 0.0 (perfect promoter, exact consensus with optimal spacer) to 1.0 (poor/no promoter). metadata carries a single sigma70 dict with the following fields:
For valid promoters found:
  • sigma70_score: Float overall penalty score (0.0-1.0)
  • pos: Integer start position of the -35 box in the sequence
  • box35: String sequence of the -35 box (6 bp)
  • box10: String sequence of the -10 box (6 bp)
  • spacer_len: Integer spacer length between boxes (bp)
  • total_matches: Integer total matches to consensus (out of 12)
  • pwm_penalty: Float PWM-based penalty component (0.0-1.0)
  • match_penalty: Float match count penalty component (0.0-1.0)
  • spacer_penalty: Float spacer length penalty component (0.0-1.0)
For sequences too short (<12 bp):
  • sigma70_score: Float 1.0 (maximum penalty)
  • reason: String “too_short”
For sequences with invalid spacer (12-32 bp range):
  • sigma70_score: Float 1.0 (maximum penalty)
  • reason: String “invalid_spacer”
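Because the sigma70 dict has two shapes (a scored promoter, or a failure carrying a reason), downstream code usually needs to branch on the presence of reason. The helpers below are a hypothetical sketch working on plain dicts with the field names listed above; they do not call the library. `count_consensus_matches` shows how a value like total_matches could be recomputed from box35 and box10, assuming a match means an identical base at the same position.

```python
from typing import Any, Dict


def count_consensus_matches(
    box35: str, box10: str, consensus_35: str = "TTGACA", consensus_10: str = "TATAAT"
) -> int:
    """Count positions (out of 12) where the boxes agree with the consensus."""
    observed = (box35 + box10).upper()
    expected = consensus_35 + consensus_10
    return sum(o == e for o, e in zip(observed, expected))


def describe_sigma70(sigma70: Dict[str, Any]) -> str:
    """Render either shape of the sigma70 metadata dict as one line of text."""
    if "reason" in sigma70:
        # Failure shape: only sigma70_score (1.0) and reason are present.
        return f"no promoter scored: {sigma70['reason']} (penalty {sigma70['sigma70_score']})"
    return (
        f"-35 box {sigma70['box35']} at position {sigma70['pos']}, "
        f"-10 box {sigma70['box10']}, spacer {sigma70['spacer_len']} bp, "
        f"{sigma70['total_matches']}/12 matches, penalty {sigma70['sigma70_score']}"
    )


print(describe_sigma70({"sigma70_score": 1.0, "reason": "too_short"}))
```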

Usage

Evaluating a canonical sigma-70 promoter:
python
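>>> from proto_language.core import Sequence, SequenceType
>>> # Import path below is inferred from the source file location; adjust if the package re-exports these names.
>>> from proto_language.constraint.sequence_annotation.sigma70_promoter_constraint import (
...     Sigma70PromoterConfig,
...     sigma70_promoter_constraint,
... )
>>> promoter_seq = Sequence(
...     "TTGACAATGATACTTAGATTCACTTATAATACTAGTAG",  # TTGACA and TATAAT separated by an 18 bp spacer
...     "dna",
... )
>>> config = Sigma70PromoterConfig()  # all defaults
>>> results = sigma70_promoter_constraint([(promoter_seq,)], config)
>>> print(results[0].score)  # e.g., 0.08 (strong promoter)
>>> sigma70 = results[0].metadata["sigma70"]
>>> print(f"-35: {sigma70['box35']}, -10: {sigma70['box10']}")  # TTGACA, TATAAT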

Metadata

Property | Value
Key | sigma70-promoter
Function | sigma70_promoter_constraint
Category | sequence_annotation
Mode | discrete
Uses GPU | False
Supported Types | dna