
This constraint is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
- Frequency mode: Evaluates raw k-mer frequencies (observed_count / total_kmers).
-
Usage deviation mode: Evaluates observed/expected ratios using a zero-order
Markov model where expected = product of individual nucleotide/amino acid
frequencies. A ratio of 1.0 indicates observed matches expected composition,
1.0 indicates overrepresentation, <1.0 indicates underrepresentation.
API Reference
Configuration for k-mer frequency constraint.This class defines configuration parameters for evaluating k-mer composition
in DNA, RNA, or protein sequences. K-mers are subsequences of length k, and
their frequencies can indicate codon bias, tandem repeats, sequence composition
biases, CpG islands, etc. The constraint supports two scoring modes:
frequency-based (direct k-mer counts) and usage deviation (observed vs expected
based on nucleotide/amino acid composition).
Frequency mode evaluates raw k-mer proportions (10/100 CG dinucleotides
= 0.1). Only k-mers that occur in the sequence are scored; absent k-mers
are not penalized.Usage deviation mode compares observed to expected frequencies under
a zero-order Markov model. Expected frequency = product of individual
nucleotide frequencies. For example, if a sequence is 40% G and 60% C,
the expected CG dinucleotide frequency is 0.4 x 0.6 = 0.24. If observed
is 0.12, usage_deviation = 0.12/0.24 = 0.5 (underrepresented).The penalty is the maximum deviation across observed k-mers. To evaluate a
single specific k-mer (including penalizing its absence), use
specific_kmer_constraint instead.
Length of k-mer to analyze (e.g., 2 for dinucleotide, 3 for trinucleotide).
Scoring metric: ‘frequency’ uses raw k-mer counts; ‘usage_deviation’ uses observed/expected ratios.Options:
frequency, usage_deviationMinimum acceptable frequency/deviation based on scoring_mode
Maximum acceptable frequency/deviation based on scoring_mode
ReturnsConstraintOutput
One result per sequence. A score of 0.0 indicates
every observed k-mer is within the acceptable range [min_value, max_value].
Higher scores indicate the maximum deviation across observed k-mers. The
penalty scales linearly with deviation distance from the acceptable
range, capped at 1.0. metadata carries (over observed k-mers only):For frequency mode:{k}mer_frequencies: Dictionary mapping each observed k-mer to its frequency (0.0-1.0). For example,2mer_frequenciesfor dinucleotides.
{k}mer_usage_deviations: Dictionary mapping each observed k-mer to its observed/expected ratio
{k}mer_data: Empty dictionary
Usage
Analyzing codon usage (all trinucleotides):python
Metadata
| Property | Value |
|---|---|
| Key | kmer-frequency |
| Function | kmer_frequency_constraint |
| Category | sequence_composition |
| Mode | discrete |
| Uses GPU | False |
| Supported Types | dna, rna, protein |