License: Random Protein Sampling is open source and free for academic and commercial use under an Apache-2.0 license. Please refer to the license for full terms.

Proto is not affiliated with Proto. This toolkit is open source and builds on the implementation produced by this organization. Product names, logos, and trademarks are the property of their respective owners.


proto-bio/proto-tools/proto_tools/tools/mutagenesis/random_protein
Function: run_random_protein_sample()
Description: Sample protein sequences by filling masked positions with random amino acids drawn from a codon s…

Background

Random Protein Sampling performs random mutagenesis at the protein level: it takes a protein sequence, determines which positions are designable, and replaces each with an amino acid sampled from the distribution implied by a codon scheme. Because it generates protein-sequence diversity without any learned model, it is the simplest possible baseline against which model-guided designers can be compared.

Internally, designable positions are either the _ characters already present in the input or, when none are present, positions chosen by the configured masking strategy. For a degenerate scheme, the scheme is expanded to its concrete codons and each amino acid’s sampling weight is set proportional to the number of those codons that encode it; stop codons are excluded unless allow_stop_codons is True. UNIFORM instead assigns equal weight to all twenty standard amino acids. Each masked position is filled independently by a weighted random draw, and with a fixed seed the output is deterministic. This tool is original proto-tools code maintained by Proto.
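The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the proto-tools implementation: the helper names amino_acid_weights and sample_residues are hypothetical, and the sketch assumes the standard genetic code and the usual IUPAC meanings of N, K, S, D, B, and R.

```python
import random
from collections import Counter
from itertools import product

# IUPAC degenerate nucleotide codes needed for the schemes named on this page.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "K": "GT", "S": "CG", "D": "AGT", "B": "CGT", "R": "AG",
}

# Assumption: standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
TRANSLATION = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: TRANSLATION[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_weights(scheme, allow_stop_codons=False):
    """Hypothetical helper: map each symbol to its unnormalized sampling weight."""
    if scheme == "UNIFORM":
        symbols = "ACDEFGHIKLMNPQRSTVWY" + ("*" if allow_stop_codons else "")
        return {symbol: 1 for symbol in symbols}
    # Expand the degenerate scheme (e.g. "NNK") into its concrete codons.
    codons = ("".join(bases) for bases in product(*(IUPAC[ch] for ch in scheme)))
    # Weight = number of codons in the scheme that encode each symbol.
    counts = Counter(CODON_TABLE[codon] for codon in codons)
    if not allow_stop_codons:
        counts.pop("*", None)
    return dict(counts)


def sample_residues(weights, n, seed=None):
    """Hypothetical helper: draw n independent residues by weighted sampling."""
    rng = random.Random(seed)
    symbols = sorted(weights)
    return "".join(rng.choices(symbols, weights=[weights[s] for s in symbols], k=n))
```

Under these assumptions, NNK expands to 32 codons, 31 of which encode the twenty amino acids: leucine, serine, and arginine each receive weight 3, while methionine and tryptophan receive weight 1. NDT expands to twelve codons encoding twelve distinct amino acids, each with weight 1.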

Tools

Random Protein Sampling (random-protein-sample)

Fills every masked position in each input sequence with a random amino acid drawn from the configured codon scheme, returning one filled sequence per input.

API Reference

Inputs
sequences
List[string]
required
Protein sequence(s) to process. Can be provided as:
masking_strategy
MaskingStrategy
Controls which positions to mask for sampling.
codon_scheme
enum
default:"UNIFORM"
Codon scheme controlling amino acid sampling probabilities. "UNIFORM" gives equal weight to all 20 amino acids; other schemes (NNK, NNS, NDT, etc.) weight amino acids by the number of codons encoding them. Available options: UNIFORM, NNN, NNK, NNS, NDT, DBK, NRT
allow_stop_codons
boolean
default:"False"
If True, the stop symbol "*" is included in the sampling distribution. For degenerate schemes it is weighted by its stop-codon count; for "UNIFORM" it is an equally weighted 21st symbol. Default: False (stops never sampled).
verbose
integer
default:"0"
Verbosity level (0=quiet, 1=info, 2=debug, 3=raw subprocess stderr). True is coerced to 1 and False to 0.
device
string
default:"cpu"
Device to run the tool on.
timeout
integer
default:"600"
Maximum execution time in seconds. None waits indefinitely.
seed
integer
Random seed. When set, tools run reproducibly up to small GPU float noise (see BaseToolOutput.approx_equal), and the seed participates in cache keys. When None, cacheable seed-sensitive tools skip cache until seeded.
Outputs
sequences
List[string]
required
Sampled protein sequences with masked positions filled by random amino acids drawn from the configured codon scheme.

Applications

Use this to build randomized protein libraries that mimic experimental degenerate-codon mutagenesis, for example NNK saturation at chosen positions for directed-evolution and combinatorial screening. It also serves as an unbiased random baseline for judging whether a model-guided designer beats chance.

Usage Tips

  • codon_scheme (default UNIFORM) sets the amino-acid distribution. UNIFORM draws all twenty amino acids equally; degenerate schemes (NNN, NNK, NNS, NDT, DBK, NRT) weight each amino acid by how many of the scheme’s codons encode it. Under NNN, NNK, or NNS, for example, leucine, serine, and arginine appear more often than methionine or tryptophan.
  • NDT gives an even 12-amino-acid library. It encodes twelve amino acids with no codon redundancy, so each is equally likely; useful for small focused libraries.
  • Stop codons are excluded by default. Set allow_stop_codons to True to include the stop symbol * in the distribution: for degenerate schemes it is weighted by its stop-codon count, and for UNIFORM it is an equally weighted 21st symbol.
  • _ masks override the masking strategy. If an input already contains _, exactly those positions are filled and masking_strategy is ignored; remove the _ characters to let the strategy choose positions instead.
  • masking_strategy.fixed_positions are 1-indexed. Positions listed there are never mutated; they are specified using 1-based indexing to match biological residue selection conventions.
  • Set seed for reproducibility. Sampling is otherwise nondeterministic; a fixed seed makes the filled sequences reproducible across runs.
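The mask, fixed-position, and seed tips above can be illustrated with a small standalone sketch. It does not call proto-tools: the functions choose_positions and fill are hypothetical, sampling is uniform for brevity, and the fallback used when no _ is present (every non-fixed position is designable) is only an assumption standing in for whatever the configured masking strategy would select.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def choose_positions(sequence, fixed_positions=()):
    """Hypothetical helper: return the 0-based indices that will be filled."""
    masked = [i for i, ch in enumerate(sequence) if ch == "_"]
    if masked:
        # Explicit "_" masks win; fixed_positions and the strategy are ignored.
        return masked
    # fixed_positions are 1-based, so shift them to 0-based indices.
    fixed = {p - 1 for p in fixed_positions}
    # Assumption: without masks, every non-fixed position is designable.
    return [i for i in range(len(sequence)) if i not in fixed]


def fill(sequence, fixed_positions=(), seed=None):
    """Hypothetical helper: fill the chosen positions with uniform random residues."""
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    residues = list(sequence)
    for i in choose_positions(sequence, fixed_positions):
        residues[i] = rng.choice(AMINO_ACIDS)
    return "".join(residues)


# Only the two "_" positions change; the same seed gives the same result.
first = fill("MK__A", seed=0)
second = fill("MK__A", seed=0)
```

In this sketch, choose_positions("MKTA", fixed_positions=[1, 4]) returns the 0-based indices 1 and 2, because residues 1 and 4 in 1-based numbering are protected.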

Toolkit Notes

These apply to every Random Protein Sampling tool in this toolkit (random-protein-sample).
  • Runs on CPU. The sampler is pure Python with no model and no external dependencies; execution is near-instant.
  • Deterministic only with a seed. Without a seed the filled positions differ every run; set one when you need reproducible libraries.
Example notebook: See the full working example for a copy-paste-ready walkthrough.

Infrastructure Guides

The following guides cover how to run tools efficiently and at scale.

Tool Persistence

Keep a tool’s model warm across calls instead of reloading it every invocation.

Device Management

How GPUs are allocated to tools and how to target specific devices.

Parallel Execution

Fan a batch of inputs out across multiple GPUs.

Cloud Inference

Run tools on managed cloud infrastructure with no local setup.