K-mer Frequency

This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.


Source
proto-bio/proto-language/proto_language/constraint/sequence_composition/kmer_frequency_constraint.py
Evaluate k-mer frequencies or usage deviations with configurable mer length and scoring modes. This constraint function analyzes k-mer (subsequences of length k) composition in DNA, RNA, or protein sequences using two possible scoring modes:
  1. Frequency mode: Evaluates raw k-mer frequencies (observed_count / total_kmers).
  2. Usage deviation mode: Evaluates observed/expected ratios using a zero-order Markov model, where expected = product of individual nucleotide/amino acid frequencies. A ratio of 1.0 indicates that the observed frequency matches the expected composition, >1.0 indicates overrepresentation, and <1.0 indicates underrepresentation.
The penalty is the maximum deviation from the [min_value, max_value] band across observed k-mers; absent k-mers are not penalized. To target a single specific k-mer (including penalizing its absence), use specific_kmer_constraint instead.
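As a rough illustration of frequency mode, the following minimal sketch computes observed_count / total_kmers for every k-mer that occurs in a sequence. The helper name `kmer_frequencies` is hypothetical and not part of the library. The sketch assumes overlapping windows with a step of 1 and does no filtering of invalid or ambiguous characters; the actual implementation may differ on both points.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Map each observed k-mer to observed_count / total_kmers.

    Hypothetical helper: assumes overlapping windows (step 1) and
    treats every character as valid.
    """
    total = len(seq) - k + 1
    if k <= 0 or total <= 0:
        # Sequence shorter than k: no k-mers to report.
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


# "ATGCGCGA" has 7 overlapping dinucleotides: AT, TG, GC, CG, GC, CG, GA.
freqs = kmer_frequencies("ATGCGCGA", 2)
print(freqs["CG"])    # 2 of the 7 dinucleotides are CG
print("AA" in freqs)  # absent k-mers do not appear at all
```

Because only observed k-mers enter the dictionary, an absent k-mer can never contribute to the penalty, which matches the behavior described above.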

API Reference

Config: KmerFrequencyConfig
Configuration for the k-mer frequency constraint.
This class defines configuration parameters for evaluating k-mer composition in DNA, RNA, or protein sequences. K-mers are subsequences of length k, and their frequencies can indicate codon bias, tandem repeats, sequence composition biases, CpG islands, etc. The constraint supports two scoring modes: frequency-based (raw k-mer frequencies) and usage deviation (observed vs. expected based on nucleotide/amino acid composition).
Frequency mode evaluates raw k-mer proportions (e.g., 10 CG dinucleotides out of 100 total dinucleotides = 0.1). Only k-mers that occur in the sequence are scored; absent k-mers are not penalized.
Usage deviation mode compares observed to expected frequencies under a zero-order Markov model. Expected frequency = product of individual nucleotide frequencies. For example, if a sequence is 40% G and 60% C, the expected CG dinucleotide frequency is 0.4 x 0.6 = 0.24. If the observed frequency is 0.12, usage_deviation = 0.12/0.24 = 0.5 (underrepresented).
The penalty is the maximum deviation across observed k-mers. To evaluate a single specific k-mer (including penalizing its absence), use specific_kmer_constraint instead.
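The observed/expected ratio under a zero-order Markov model can be sketched as follows. The helper name `usage_deviations` is hypothetical. The sketch assumes single-symbol frequencies are taken over the whole sequence and that k-mers are counted with overlapping windows; the library's implementation may handle these details differently.

```python
from collections import Counter
from math import prod


def usage_deviations(seq: str, k: int) -> dict[str, float]:
    """Map each observed k-mer to its observed/expected ratio.

    Hypothetical helper: expected frequency is the product of the
    per-symbol frequencies of the k-mer's characters.
    """
    total = len(seq) - k + 1
    if k <= 0 or total <= 0:
        return {}
    # Per-symbol frequencies over the whole sequence.
    symbol_freq = {s: n / len(seq) for s, n in Counter(seq).items()}
    kmer_counts = Counter(seq[i:i + k] for i in range(total))
    deviations = {}
    for kmer, n in kmer_counts.items():
        observed = n / total
        # Never zero: every symbol of an observed k-mer occurs in seq.
        expected = prod(symbol_freq[s] for s in kmer)
        deviations[kmer] = observed / expected
    return deviations


# "GGCC": G and C are each 50%, so every dinucleotide has expected 0.25.
# Observed dinucleotides are GG, GC, CC, each with frequency 1/3.
ratios = usage_deviations("GGCC", 2)
print(ratios["GC"])  # (1/3) / 0.25, i.e. mildly overrepresented

# The arithmetic from the example in the text: 40% G, 60% C, observed CG 0.12.
doc_ratio = 0.12 / (0.4 * 0.6)
```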
k
integer
required
Length of k-mer to analyze (e.g., 2 for dinucleotide, 3 for trinucleotide).
scoring_mode
enum
default:"frequency"
Scoring metric: ‘frequency’ uses raw k-mer frequencies; ‘usage_deviation’ uses observed/expected ratios. Options: frequency, usage_deviation
min_value
number
required
Minimum acceptable value: a k-mer frequency in frequency mode, or an observed/expected ratio in usage_deviation mode.
max_value
number
required
Maximum acceptable value: a k-mer frequency in frequency mode, or an observed/expected ratio in usage_deviation mode.
Returns: ConstraintOutput
One result per sequence. A score of 0.0 indicates every observed k-mer is within the acceptable range [min_value, max_value]. Higher scores indicate the maximum deviation across observed k-mers. The penalty scales linearly with deviation distance from the acceptable range, capped at 1.0. The metadata carries the following (over observed k-mers only):
For frequency mode:
  • {k}mer_frequencies: Dictionary mapping each observed k-mer to its frequency (0.0-1.0). For example, 2mer_frequencies for dinucleotides.
For usage_deviation mode:
  • {k}mer_usage_deviations: Dictionary mapping each observed k-mer to its observed/expected ratio
For sequences too short (<k length) or with no valid k-mers:
  • {k}mer_data: Empty dictionary
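The scoring rule described above (zero inside the band, linear outside, capped at 1.0, maximum taken over observed k-mers) could look roughly like the sketch below. The function name `band_penalty` is hypothetical, and the exact normalization of the distance is not stated in this reference; the sketch assumes the raw distance outside the band is used directly before capping.

```python
def band_penalty(values: dict[str, float], min_value: float, max_value: float) -> float:
    """Maximum out-of-band distance across observed k-mers, capped at 1.0.

    Hypothetical helper: assumes the penalty is the unscaled distance
    from the [min_value, max_value] band.
    """
    worst = 0.0
    for value in values.values():
        # Distance is 0.0 when min_value <= value <= max_value.
        distance = max(min_value - value, value - max_value, 0.0)
        worst = max(worst, distance)
    return min(worst, 1.0)


# Frequency mode: CG sits 0.2 above a [0.0, 0.1] band, AT is inside it.
score = band_penalty({"CG": 0.3, "AT": 0.05}, min_value=0.0, max_value=0.1)

# Usage deviation mode: a ratio far above the band hits the 1.0 cap.
capped = band_penalty({"AAA": 5.0}, min_value=0.5, max_value=2.0)

# No observed k-mers (e.g. sequence shorter than k): nothing to penalize.
empty = band_penalty({}, min_value=0.0, max_value=0.1)
```

Taking the maximum rather than a sum or mean means a single badly out-of-band k-mer determines the score, regardless of how many other k-mers are within range.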

Usage

Analyzing codon usage (all trinucleotides):
python
>>> coding_seq = Sequence("ATGAAACGTATTGCGTCG", "dna")
>>> config = KmerFrequencyConfig(
...     k=3,
...     scoring_mode="usage_deviation",
...     min_value=0.5,  # Allow some underrepresentation
...     max_value=2.0,  # Allow some overrepresentation
... )
>>> results = kmer_frequency_constraint([(coding_seq,)], config)
>>> deviations = results[0].metadata["3mer_usage_deviations"]
>>> for codon, ratio in sorted(deviations.items(), key=lambda x: x[1], reverse=True):
...     print(f"{codon}: {ratio:.2f}x expected")

Metadata

Property         Value
Key              kmer-frequency
Function         kmer_frequency_constraint
Category         sequence_composition
Mode             discrete
Uses GPU         False
Supported Types  dna, rna, protein