License: Random Nucleotide Sampling is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

This toolkit is open source. Product names, logos, and trademarks are the property of their respective owners.


proto-bio/proto-tools/proto_tools/tools/mutagenesis/random_nucleotide
Function: run_random_nucleotide_sample()
Description: Sample nucleotide sequences by filling masked positions with random bases from an IUPAC substitut…

Background

Random Nucleotide Sampling performs random mutagenesis at the nucleotide level: it takes a DNA or RNA sequence, determines which positions are designable, and replaces each designable position with a base drawn uniformly from a chosen IUPAC degenerate-base pool. It generates nucleotide diversity without any learned model, which makes it the simplest possible baseline against which model-guided generators can be compared.

Internally, designable positions are either the _ characters already present in the input or, when none are present, positions chosen by the configured masking strategy. Each masked position is filled independently by drawing one base uniformly at random from the pool that the IUPAC code expands to: N expands to A/C/G/T, R to A/G, S to G/C, and so on. Sampling is uniform within the pool, with no frequency weighting. When the input is RNA, sampled T bases are converted to U. With a fixed seed the output is deterministic. This tool is original proto-tools code maintained by Proto.
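The fill step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proto-tools implementation: the helper name fill_masked, its arguments, and the IUPAC_POOLS table are hypothetical, and only the standard IUPAC ambiguity codes and the Python standard library are relied on.

```python
import random

# Standard IUPAC nucleotide ambiguity codes expanded to DNA base pools.
IUPAC_POOLS = {
    "N": "ACGT", "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def fill_masked(sequence, scheme="N", is_rna=False, rng=None):
    """Hypothetical helper: fill each '_' with a uniform draw from the pool."""
    rng = rng or random.Random()
    pool = IUPAC_POOLS[scheme]
    if is_rna:
        # In RNA mode a sampled T is written as U.
        pool = pool.replace("T", "U")
    # Each masked position is drawn independently; other characters are kept.
    return "".join(rng.choice(pool) if ch == "_" else ch for ch in sequence)

# A seeded generator makes the draw repeatable.
filled = fill_masked("ATG___TAA", scheme="R", rng=random.Random(0))
repeat = fill_masked("ATG___TAA", scheme="R", rng=random.Random(0))
```

Here filled and repeat are identical because both calls use a generator seeded with the same value, and the three masked positions can only contain A or G because the R pool is purines.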

Tools

Random Nucleotide Sampling (random-nucleotide-sample)

Fills every masked position in each input sequence with a random base from the configured IUPAC substitution pool, returning one filled sequence per input.

API Reference

Inputs
sequences
List[string]
required
DNA or RNA sequences, possibly containing _ at positions to mutate. Accepts a single string or a list.
masking_strategy
MaskingStrategy
Controls which positions to mask for sampling.
substitution_scheme
enum
default:"N"
IUPAC ambiguity code defining the nucleotide pool for substitutions. "N" = any base (ACGT); "R" = purines (AG); "Y" = pyrimidines (CT); etc. Available options: N, R, Y, S, W, K, M, B, D, H, V
sequence_type
enum
default:"auto"
How to interpret input sequences. "auto" detects DNA vs RNA by presence of U; "dna" or "rna" forces the type. Available options: auto, dna, rna
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Outputs
sequences
List[string]
required
Nucleotide sequences with masked positions filled by random bases drawn from the configured substitution scheme.

Applications

Use this to build randomized nucleotide libraries: degenerate positions in promoters, ribosome binding sites, UTRs, or coding regions for directed-evolution and combinatorial-screening campaigns. It also serves as an unbiased random baseline for judging whether a model-guided generator produces better-than-chance sequences.
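When planning such a library, it helps to know how many distinct sequences a masked template can produce: each masked position contributes a factor equal to the size of the substitution pool. The sketch below is a hypothetical helper (library_size and POOL_SIZES are illustrative names, not part of proto-tools) that applies this arithmetic to a template with six masked positions.

```python
# Number of bases each standard IUPAC ambiguity code expands to.
POOL_SIZES = {
    "N": 4, "R": 2, "Y": 2, "S": 2, "W": 2,
    "K": 2, "M": 2, "B": 3, "D": 3, "H": 3, "V": 3,
}

def library_size(masked_sequence, scheme="N"):
    """Hypothetical helper: count distinct fillings of the '_' positions."""
    return POOL_SIZES[scheme] ** masked_sequence.count("_")

# Illustrative template with six masked positions between fixed flanks.
template = "AGGAGG" + "_" * 6 + "ATG"
size_any = library_size(template, "N")      # 4 ** 6
size_purine = library_size(template, "R")   # 2 ** 6
```

Restricting the scheme from N to R shrinks the theoretical diversity from 4096 to 64 variants, which is one way to keep a library within screening capacity.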

Usage Tips

  • substitution_scheme (default N) sets the substitution alphabet. N allows any base for maximum diversity; restrict it to bias the library, for example R for purines (A/G), S for strong pairs (G/C), or W for weak pairs (A/T).
  • _ masks override the masking strategy. If an input already contains _, exactly those positions are filled and masking_strategy is ignored; remove the _ characters to let the strategy choose positions instead.
  • sequence_type (default auto) controls RNA handling. auto treats the sequence as RNA only when it contains U; force it with dna or rna. In RNA mode sampled T bases are written as U. A fully masked input contains no U, so auto cannot recognize it as RNA; set rna explicitly in that case.
  • masking_strategy.fixed_positions are 1-indexed. Positions listed there are never mutated; they are specified using 1-based indexing to match biological residue selection conventions.
  • Set seed for reproducibility. Sampling is otherwise nondeterministic; a fixed seed makes the filled sequences reproducible across runs.
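Two of the tips above are easy to get wrong in practice: auto detection keys only on the presence of U, and fixed positions are 1-based. The following sketch illustrates both behaviors as described in this page. The function names detect_sequence_type and designable_indices are hypothetical and do not reflect the toolkit's internal API.

```python
def detect_sequence_type(sequence, sequence_type="auto"):
    """Hypothetical helper: 'auto' means RNA only when a U is present."""
    if sequence_type != "auto":
        return sequence_type  # "dna" or "rna" forces the type
    return "rna" if "U" in sequence.upper() else "dna"

def designable_indices(length, fixed_positions):
    """Hypothetical helper: 0-based indices not protected by 1-based fixed positions."""
    fixed = {position - 1 for position in fixed_positions}  # convert 1-based to 0-based
    return [index for index in range(length) if index not in fixed]

# A fully masked input has no U, so auto falls back to DNA.
auto_type = detect_sequence_type("______")
forced_type = detect_sequence_type("______", sequence_type="rna")

# Fixing positions 1 and 5 of a 5-nt sequence protects its first and last bases.
eligible = designable_indices(5, fixed_positions=[1, 5])
```

With these definitions auto_type is "dna", forced_type is "rna", and eligible contains the 0-based indices 1, 2, and 3.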

Toolkit Notes

These apply to every Random Nucleotide Sampling tool in this toolkit (random-nucleotide-sample).
  • Runs on CPU. The sampler is pure Python with no model and no external dependencies; execution is near-instant.
  • Deterministic only with a seed. Without a seed the filled positions differ every run; set one when you need reproducible libraries.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.