Random Nucleotide Sampling

License: Random Nucleotide Sampling is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

This toolkit is open source and maintained by Proto. Product names, logos, and trademarks are the property of their respective owners.

proto-bio/proto-tools/proto_tools/tools/mutagenesis/random_nucleotide

Function	Description
`run_random_nucleotide_sample()`	Sample nucleotide sequences by filling masked positions with random bases from an IUPAC substitut…

Background

Random Nucleotide Sampling performs random mutagenesis at the nucleotide level: it takes a DNA or RNA sequence, determines which positions are designable, and replaces each with a base drawn uniformly from a chosen IUPAC degenerate-base pool. It generates nucleotide diversity without any learned model, which makes it the simplest possible baseline against which model-guided generators can be compared.

Internally, designable positions are either the _ characters already present in the input or, when none are present, positions chosen by the configured masking strategy. Each masked position is filled independently by drawing one base uniformly at random from the pool that the IUPAC code expands to: N expands to A/C/G/T, R to A/G, S to G/C, and so on. Sampling is uniform within the pool, with no frequency weighting. When the input is RNA, sampled T bases are converted to U. With a fixed seed the output is deterministic. This tool is original proto-tools code maintained by Proto.
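The fill step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the proto-tools source: the helper name `fill_masked` and its parameters are hypothetical, and only the IUPAC expansions themselves are standard.

```python
import random

# Standard IUPAC nucleotide ambiguity codes, expanded to DNA bases.
IUPAC_POOLS = {
    "N": "ACGT", "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT",
    "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def fill_masked(sequence, scheme="N", is_rna=False, seed=None):
    """Hypothetical helper: fill every '_' with a base drawn uniformly from the pool."""
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    pool = IUPAC_POOLS[scheme]
    out = []
    for ch in sequence:
        if ch == "_":
            base = rng.choice(pool)  # uniform draw, no frequency weighting
            if is_rna and base == "T":
                base = "U"  # RNA output writes sampled T as U
            out.append(base)
        else:
            out.append(ch)  # unmasked positions pass through unchanged
    return "".join(out)


# Fill three masked positions with purines only (R = A/G).
filled = fill_masked("ATG__C_TA", scheme="R", seed=0)
```

Each masked position is drawn independently, so restricting the scheme narrows the pool at every position without coupling positions to one another.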

Tools

Random Nucleotide Sampling (`random-nucleotide-sample`)

Fills every masked position in each input sequence with a random base from the configured IUPAC substitution pool, returning one filled sequence per input.

API Reference

Input: RandomNucleotideSampleInput

sequences (List[string], required): DNA or RNA sequences, possibly containing _ at positions to mutate. Accepts a single string or a list.

Config: RandomNucleotideSampleConfig

masking_strategy (MaskingStrategy): Controls which positions to mask for sampling. MaskingStrategy fields:

    method (enum, default: "random"): Scoring method for position selection. "random": uniform random; "entropy": highest model uncertainty; "max-logit": lowest model confidence. Available options: random, entropy, max-logit.

    num_mutations (integer): Exact number of positions to mask per sequence.

    mask_fraction (number): Fraction of designable positions to mask (e.g. 0.15 for ~15%).

    fixed_positions (array): 1-indexed positions that must NOT be masked. Applied uniformly to all sequences.

    temperature (number, default: 1.0): Temperature for position selection. < 1.0 is greedy, 1.0 uses scores as-is, > 1.0 is more uniform.

    model_name (string): Which masked model to use for scoring. Defaults to the sampling tool’s model when unset.

    model_checkpoint (string): Model checkpoint override (uses tool default if None).

substitution_scheme (enum, default: "N"): IUPAC ambiguity code defining the nucleotide pool for substitutions. "N" = any base (ACGT); "R" = purines (AG); "Y" = pyrimidines (CT); etc. Available options: N, R, Y, S, W, K, M, B, D, H, V.

sequence_type (enum, default: "auto"): How to interpret input sequences. "auto" detects DNA vs RNA by presence of U; "dna" or "rna" forces the type. Available options: auto, dna, rna.

verbose (integer, default: 0): Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device (string, default: "cpu"): Device to run the tool on.

timeout (integer, default: 600): Maximum execution time in seconds. None waits indefinitely.

seed (integer): Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
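To make the masking fields concrete, here is a minimal sketch of how the "random" method might pick positions when the input contains no _ characters. The helper names are hypothetical, and several behaviours are assumptions not stated in this reference: that num_mutations takes precedence over mask_fraction, that the fraction is rounded to the nearest integer, and that nothing is masked when neither is set.

```python
import random


def choose_mask_positions(length, num_mutations=None, mask_fraction=None,
                          fixed_positions=(), seed=None):
    """Hypothetical helper: pick 0-indexed positions to mask, uniformly at random."""
    # fixed_positions is 1-indexed in the config, so shift to 0-indexed here.
    fixed = {p - 1 for p in fixed_positions}
    designable = [i for i in range(length) if i not in fixed]

    if num_mutations is not None:          # assumption: exact count wins
        k = num_mutations
    elif mask_fraction is not None:        # assumption: nearest-integer rounding
        k = round(mask_fraction * len(designable))
    else:                                  # assumption: nothing to mask
        k = 0
    k = min(k, len(designable))            # never request more than is available

    rng = random.Random(seed)
    return sorted(rng.sample(designable, k))


def apply_mask(sequence, positions):
    """Replace the chosen 0-indexed positions with '_'."""
    chars = list(sequence)
    for i in positions:
        chars[i] = "_"
    return "".join(chars)


seq = "ATGGCTAGCTAA"
# Protect the start codon (1-indexed positions 1-3) and mask three other bases.
positions = choose_mask_positions(len(seq), num_mutations=3,
                                  fixed_positions=[1, 2, 3], seed=7)
masked = apply_mask(seq, positions)
```

The masked string can then be filled from the configured substitution pool; the model-scored methods ("entropy", "max-logit") would replace the uniform draw with scores and are not sketched here.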

Output: RandomNucleotideSampleOutput

sequences (List[string], required): Nucleotide sequences with masked positions filled by random bases drawn from the configured substitution scheme.

Applications

Use this to build randomized nucleotide libraries: degenerate positions in promoters, ribosome binding sites, UTRs, or coding regions for directed-evolution and combinatorial-screening campaigns. It also serves as an unbiased random baseline for judging whether a model-guided generator produces better-than-chance sequences.
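When planning a library, it can help to estimate how much of the design space a given number of samples will cover. Because each masked position is drawn uniformly and independently, k masked positions with a pool of size p give p^k equally likely variants. The sketch below uses hypothetical helper names and is simple arithmetic, not part of the tool.

```python
# Pool sizes implied by the standard IUPAC ambiguity codes.
POOL_SIZES = {
    "N": 4, "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2,
    "M": 2, "B": 3, "D": 3, "H": 3, "V": 3,
}


def library_size(num_masked, scheme="N"):
    """Number of distinct sequences reachable with num_masked masked positions."""
    return POOL_SIZES[scheme] ** num_masked


def expected_distinct(num_masked, scheme, num_samples):
    """Expected count of distinct variants among num_samples uniform draws."""
    m = library_size(num_masked, scheme)
    # Each variant is missed by one draw with probability (1 - 1/m).
    return m * (1.0 - (1.0 - 1.0 / m) ** num_samples)


full_n = library_size(6, "N")                   # six fully degenerate positions
coverage = expected_distinct(6, "N", 4096) / full_n
```

Drawing as many samples as there are variants still leaves a sizeable share of the space unseen, so oversample when near-complete coverage matters.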

Usage Tips

substitution_scheme (default N) sets the substitution alphabet. N allows any base for maximum diversity; restrict it to bias the library, for example R for purines (A/G), S for strong pairs (G/C), or W for weak pairs (A/T).
_ masks override the masking strategy. If an input already contains _, exactly those positions are filled and masking_strategy is ignored; remove the _ characters to let the strategy choose positions instead.
sequence_type (default auto) controls RNA handling. auto treats the sequence as RNA only when it contains U; force it with dna or rna. In RNA mode sampled T bases are written as U. A fully masked input contains no U for auto to detect and is therefore treated as DNA, so set rna explicitly in that case.
masking_strategy.fixed_positions are 1-indexed. Positions listed there are never mutated; they are specified using 1-based indexing to match biological residue selection conventions.
Set seed for reproducibility. Sampling is otherwise nondeterministic; a fixed seed makes the filled sequences reproducible across runs.
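The sequence_type tip is easiest to see with a small sketch of the detection rule. The helper name is hypothetical, and treating lowercase u as RNA is an assumption; the reference only states that auto keys on the presence of U.

```python
def resolve_is_rna(sequence, sequence_type="auto"):
    """Hypothetical helper: decide whether output should use U instead of T."""
    if sequence_type == "rna":
        return True
    if sequence_type == "dna":
        return False
    if sequence_type == "auto":
        # Assumption: detection is case-insensitive.
        return "U" in sequence.upper()
    raise ValueError(f"unknown sequence_type: {sequence_type!r}")


partly_masked = resolve_is_rna("AUG__C")           # U present, detected as RNA
fully_masked = resolve_is_rna("______")            # no U to detect, falls back to DNA
forced_rna = resolve_is_rna("______", "rna")       # explicit setting wins
```

The same fallback applies to any RNA input whose unmasked bases happen to contain no U, which is why forcing rna is the safer choice for RNA libraries.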

Toolkit Notes

These apply to every Random Nucleotide Sampling tool in this toolkit (random-nucleotide-sample).

Runs on CPU. The sampler is pure Python with no model and no external dependencies; execution is near-instant.
Deterministic only with a seed. Without a seed the filled positions differ every run; set one when you need reproducible libraries.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.
