Random Protein Sampling

License: Random Protein Sampling is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

This toolkit is open source and builds on the implementation maintained by Proto. Product names, logos, and trademarks are the property of their respective owners.

proto-bio/proto-tools/proto_tools/tools/mutagenesis/random_protein

Function	Description
`run_random_protein_sample()`	Sample protein sequences by filling masked positions with random amino acids drawn from a codon s…

Background

Random Protein Sampling performs random mutagenesis at the protein level: it takes a protein sequence, determines which positions are designable, and replaces each with an amino acid sampled from the distribution implied by a codon scheme. Because it generates protein-sequence diversity without any learned model, it is the simplest baseline against which model-guided designers can be compared.

Internally, designable positions are either the _ characters already present in the input or, when none are present, positions chosen by the configured masking strategy. For degenerate schemes, the codon scheme is expanded to its concrete codons, and each amino acid’s sampling weight is proportional to the number of those codons that encode it; stop codons are excluded unless allow_stop_codons is set. UNIFORM instead assigns equal weight to all twenty standard amino acids. Each masked position is filled independently by a weighted random draw, and with a fixed seed the output is deterministic. This tool is original proto-tools code maintained by Proto.
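The weighting and filling steps described above can be sketched in a few lines of Python. This is a minimal illustration built on the standard genetic code, not the proto-tools source: the names `scheme_weights` and `fill_masks` are hypothetical, and only the default behavior (stop codons excluded) is covered.

```python
import random
from collections import Counter
from itertools import product
from typing import Optional

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC degenerate-base codes needed by the schemes listed in this page.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT",
         "K": "GT", "S": "CG", "D": "AGT", "B": "CGT", "R": "AG"}


def scheme_weights(scheme: str) -> dict:
    """Hypothetical helper: amino acid -> sampling weight, stops excluded."""
    if scheme == "UNIFORM":
        return {aa: 1 for aa in sorted(set(AMINO_ACIDS) - {"*"})}
    # Expand the degenerate codon (e.g. "NNK") into its concrete codons.
    codons = ("".join(c) for c in product(*(IUPAC[b] for b in scheme)))
    counts = Counter(CODON_TABLE[codon] for codon in codons)
    counts.pop("*", None)  # default behavior: never sample a stop
    return dict(counts)


def fill_masks(sequence: str, scheme: str = "UNIFORM", seed: Optional[int] = None) -> str:
    """Hypothetical helper: fill each "_" with an independent weighted draw."""
    rng = random.Random(seed)
    symbols, weights = zip(*sorted(scheme_weights(scheme).items()))
    return "".join(
        rng.choices(symbols, weights=weights)[0] if ch == "_" else ch
        for ch in sequence
    )
```

Calling `fill_masks("MK_V_A", "NNK", seed=7)` twice returns the same string, which mirrors the fixed-seed determinism described above; the unmasked residues are passed through unchanged.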

Tools

Random Protein Sampling (`random-protein-sample`)

Fills every masked position in each input sequence with a random amino acid drawn from the configured codon scheme, returning one filled sequence per input.

API Reference

Input: MaskedModelInput

sequences

List[string]

required

Protein sequence(s) to process. Can be provided as:

Config: RandomProteinSampleConfig

masking_strategy

MaskingStrategy

Controls which positions to mask for sampling.

method

enum

default:"random"

Scoring method for position selection. "random": uniform random, "entropy": highest model uncertainty, "max-logit": lowest model confidence. Available options: random, entropy, max-logit

num_mutations

integer

Exact number of positions to mask per sequence.

mask_fraction

number

Fraction of designable positions to mask (e.g. 0.15 for ~15%).

fixed_positions

array

1-indexed positions that must NOT be masked. Applied uniformly to all sequences.

temperature

number

default:"1.0"

Temperature for position selection. < 1.0 is greedy, 1.0 uses scores as-is, > 1.0 is more uniform.

model_name

string

Which masked model to use for scoring. Defaults to the sampling tool’s model when unset.

model_checkpoint

string

Model checkpoint override (uses tool default if None).

codon_scheme

enum

default:"UNIFORM"

Codon scheme controlling amino acid sampling probabilities. "UNIFORM" gives equal weight to all 20 amino acids; other schemes (NNK, NNS, NDT, etc.) weight amino acids by the number of codons encoding them. Available options: UNIFORM, NNN, NNK, NNS, NDT, DBK, NRT

allow_stop_codons

boolean

default:"False"

If True, the stop symbol "*" is included in the sampling distribution. For degenerate schemes it is weighted by its stop-codon count; for "UNIFORM" it is an equally weighted 21st symbol. Default: False (stops never sampled).

verbose

integer

default:"0"

Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.

device

string

default:"cpu"

Device to run the tool on.

timeout

integer

default:"600"

Maximum execution time in seconds. None waits indefinitely.

seed

integer

Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
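To make the allow_stop_codons description concrete, the sketch below computes the chance that a single masked position receives the stop symbol `*` under each scheme when allow_stop_codons is True. It assumes the standard genetic code (stop codons TAA, TAG, TGA) and the weighting rule stated above; `stop_probability` is a hypothetical helper, not part of the tool.

```python
from fractions import Fraction
from itertools import product

# IUPAC degenerate-base codes needed by the listed schemes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT",
         "K": "GT", "S": "CG", "D": "AGT", "B": "CGT", "R": "AG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def stop_probability(scheme: str) -> Fraction:
    """Probability of drawing "*" at one position with stops allowed."""
    if scheme == "UNIFORM":
        # "*" is an equally weighted 21st symbol next to the 20 amino acids.
        return Fraction(1, 21)
    codons = ["".join(c) for c in product(*(IUPAC[b] for b in scheme))]
    stops = sum(codon in STOP_CODONS for codon in codons)
    return Fraction(stops, len(codons))


for scheme in ("UNIFORM", "NNN", "NNK", "NNS", "NDT", "DBK", "NRT"):
    print(scheme, stop_probability(scheme))
```

Under these assumptions NNN yields 3/64, NNK and NNS yield 1/32 (only TAG survives), and NDT, DBK, and NRT contain no stop codons at all, so allow_stop_codons makes no difference for those three schemes.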

Output: RandomProteinSampleOutput

sequences

List[string]

required

Sampled protein sequences with masked positions filled by random amino acids drawn from the configured codon scheme.

Applications

Use this to build randomized protein libraries that mimic experimental degenerate-codon mutagenesis, for example NNK saturation at chosen positions for directed-evolution and combinatorial screening. It also serves as an unbiased random baseline for judging whether a model-guided designer beats chance.
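As an illustration of the saturation use case, the sketch below prepares inputs for a small library by marking the chosen residues with `_` and repeating the masked sequence once per desired variant, since the tool returns one filled sequence per input. Both helpers are hypothetical, the 1-indexed convention is chosen here to match fixed_positions, and it is an assumption that identical inputs in one call receive independent draws.

```python
def mask_positions(sequence: str, positions: list) -> str:
    """Hypothetical helper: replace residues at 1-indexed positions with "_"."""
    residues = list(sequence)
    for pos in positions:
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside 1..{len(residues)}")
        residues[pos - 1] = "_"
    return "".join(residues)


def library_inputs(sequence: str, positions: list, size: int) -> list:
    """Hypothetical helper: one masked copy of the sequence per library member."""
    return [mask_positions(sequence, positions)] * size


# Saturate positions 3 and 7 of a toy 10-residue sequence, four variants.
inputs = library_inputs("MKTAYIAKQR", [3, 7], size=4)
print(inputs[0])  # MK_AYI_KQR
```

The resulting list would be supplied as `sequences`, with codon_scheme set to "NNK" (and a seed, if the library must be reproducible) in the config.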

Usage Tips

codon_scheme (default UNIFORM) sets the amino-acid distribution. UNIFORM draws all twenty amino acids equally; degenerate schemes (NNN, NNK, NNS, NDT, DBK, NRT) weight each amino acid by how many of the scheme’s codons encode it. Under NNN, NNK, and NNS, residues such as leucine, serine, and arginine therefore appear more often than methionine or tryptophan.
NDT gives an even 12-amino-acid library. It encodes twelve amino acids with no codon redundancy, so each is equally likely; useful for small focused libraries.
Stop codons are excluded by default. Set allow_stop_codons to True to include the stop symbol * in the distribution: for degenerate schemes it is weighted by its stop-codon count, and for UNIFORM it is an equally weighted 21st symbol.
_ masks override the masking strategy. If an input already contains _, exactly those positions are filled and masking_strategy is ignored; remove the _ characters to let the strategy choose positions instead.
masking_strategy.fixed_positions are 1-indexed. Positions listed there are never mutated; they are specified using 1-based indexing to match biological residue selection conventions.
Set seed for reproducibility. Sampling is otherwise nondeterministic; a fixed seed makes the filled sequences reproducible across runs.
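The interaction between `_` masks, fixed_positions, and seed in the tips above can be sketched as follows. This is an illustration of the documented rules for the "random" method only, not the tool's code: `positions_to_fill` is a hypothetical name, it returns 0-indexed positions, and clamping num_mutations to the number of available positions is an assumption.

```python
import random
from typing import Optional, Sequence


def positions_to_fill(
    sequence: str,
    num_mutations: Optional[int] = None,
    fixed_positions: Sequence[int] = (),
    seed: Optional[int] = None,
) -> list:
    """Hypothetical helper: 0-indexed positions that would be filled."""
    # Existing "_" characters win: the masking strategy is ignored.
    explicit = [i for i, ch in enumerate(sequence) if ch == "_"]
    if explicit:
        return explicit
    # fixed_positions are 1-indexed, so shift them before excluding.
    fixed = {p - 1 for p in fixed_positions}
    candidates = [i for i in range(len(sequence)) if i not in fixed]
    # Assumption: never request more positions than are available.
    k = min(num_mutations or 0, len(candidates))
    return sorted(random.Random(seed).sample(candidates, k))


print(positions_to_fill("MK_V_"))  # [2, 4]
```

With no `_` in the input, `positions_to_fill("MKTAY", num_mutations=2, fixed_positions=[1, 2], seed=0)` picks two of the last three residues and returns the same pair on every call because the seed is fixed.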

Toolkit Notes

These apply to every Random Protein Sampling tool in this toolkit (random-protein-sample).

Runs on CPU. The sampler is pure Python with no model and no external dependencies; execution is near-instant.
Deterministic only with a seed. Without a seed the filled positions differ every run; set one when you need reproducible libraries.

Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.