RejectionSamplingOptimizer runs each
candidate through an ordered cascade of eight filters arranged cheap-to-expensive:
orf_filter: the locus must contain a long open reading frame
cas9_phmm_filter: the translated protein matches a Cas9 profile HMM (PyHMMER)
crispr_array_filter: a CRISPR repeat array is present (MinCED)
identity_filter: the protein is not too close to known Cas9s (MMseqs2)
gap_gini_filter: the alignment gap pattern is well distributed (MMseqs2)
domain_filter: the RuvC and HNH catalytic domains are present (PyHMMER)
tracr_filter: a compatible tracrRNA can be folded (CRISPR-tracrRNA)
structure_filter: the protein folds to a confident, compact structure (AlphaFold3)
Runtime: this walkthrough runs real models on a GPU and takes several minutes to complete. The first run is slower because it builds the tool environment and downloads model weights.
Building the program
build_program assembles the Evo1 generator and the eight filter constraints into a single
RejectionSamplingOptimizer. The generator is Evo1Generator running the evo-1-8k-crispr
checkpoint; the arguments passed here scale the search down. n_samples is the number of
candidate sequences drawn, temperature 0.5 controls the sharpness of the sampling distribution
(below 1 sharpens it toward the most likely tokens), and top_k_val=4 restricts sampling at each
step to the four most probable tokens. batch_size sets how many sequences the generator
produces per GPU batch, and af3_output_dir names the directory where the structure filter writes
its AlphaFold3 outputs. Each of the eight constraints is given threshold=0.5: a constraint
returns 0.0 for a candidate that passes and 1.0 for one that fails, so scores at or below the
threshold are accepted and scores above it are rejected, which is what turns the constraints into
ordered filters.
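The effect of the two sampling arguments can be illustrated in isolation. The following is a minimal standard-library sketch of temperature scaling combined with top-k truncation; it is not the Evo1Generator implementation, and the function names (top_k_distribution, sample_token) and the toy logits are assumptions made up for this example.

```python
import math
import random


def top_k_distribution(logits, temperature=0.5, top_k=4):
    """Return {token_index: probability} after top-k truncation and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Indices of the top_k largest logits; every other token gets probability 0.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Dividing by a temperature below 1 widens the gaps between logits,
    # which shifts probability toward the most likely tokens.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {i: w / total for i, w in zip(kept, weights)}


def sample_token(logits, temperature=0.5, top_k=4, rng=random):
    """Draw one token index from the truncated, temperature-scaled distribution."""
    dist = top_k_distribution(logits, temperature, top_k)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


# Toy logits over a six-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]
sharp = top_k_distribution(toy_logits, temperature=0.5, top_k=4)
flat = top_k_distribution(toy_logits, temperature=1.0, top_k=4)
token = sample_token(toy_logits, rng=random.Random(0))
```

Comparing sharp with flat shows the most likely token receiving a larger share of the probability at temperature 0.5 than at 1.0, while the two lowest-ranked tokens are never sampled because top_k=4 removes them.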
Running the cascade
program.run() draws the proposals and scores each one through the constraints. The
RejectionSamplingOptimizer generates the candidates in independent batches, keeping no state
between batches, and retains the best num_results by lowest energy. Because every constraint
carries a threshold, scoring short-circuits: a candidate that fails one filter is not evaluated by
the later, more expensive ones, so the AlphaFold3 fold runs only on candidates that already passed
the seven cheaper checks. Wrapping the call in ToolInstance.persist() keeps the tool
environments and loaded model weights alive for the whole run instead of tearing them down between
constraint calls.
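The short-circuit behavior can be sketched without any of the real tools. The example below is a toy stand-in, not the RejectionSamplingOptimizer code: the three constraints, the run_cascade helper, and the candidate strings are hypothetical, and treating energy as the sum of the constraint scores evaluated so far is an assumption made only for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical constraints as (name, score_fn, threshold), ordered cheap to expensive.
# Each score_fn returns 0.0 for a pass and 1.0 for a fail, as described in the text.
constraints = [
    ("length", lambda s: 0.0 if len(s) >= 9 else 1.0, 0.5),
    ("start", lambda s: 0.0 if s.startswith("ATG") else 1.0, 0.5),
    ("gc", lambda s: 0.0 if gc_fraction(s) >= 0.5 else 1.0, 0.5),
]


def run_cascade(candidate, constraints, calls):
    """Return (passed, energy), stopping at the first score above its threshold."""
    energy = 0.0
    for name, score_fn, threshold in constraints:
        calls[name] = calls.get(name, 0) + 1  # count how often each filter runs
        score = score_fn(candidate)
        energy += score  # assumption: energy is the sum of evaluated scores
        if score > threshold:
            return False, energy  # short-circuit: later filters are skipped
    return True, energy


candidates = ["ATGGCGGCC", "ATG", "TTTGCGGCC", "ATGAAATTT"]
num_results = 2
calls = {}
scored = []
for candidate in candidates:
    passed, energy = run_cascade(candidate, constraints, calls)
    scored.append((energy, candidate, passed))

# Retain the best num_results entries by lowest energy.
best = sorted(scored, key=lambda record: record[0])[:num_results]
```

After the loop, calls records that the last constraint ran on fewer candidates than the first, which is the saving the cheap-to-expensive ordering is meant to provide.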
Collecting survivors
collect_results reads each retained candidate back from the segment along with the metadata each
filter recorded, and survivors keeps only the entries that carry a DNA sequence, the candidates
that passed all eight filters including the AlphaFold3 structure check. The printout reports how
many of the N_SAMPLES candidates survived and, for each survivor, its pLDDT, the catalytic
domains found, and the predicted protein length. On a small batch this prints 0 passed all eight filters: most candidates are rejected by the early filters, which is the behavior the cheap-to-
expensive cascade is designed for. Surfacing Cas9-scale survivors that reach the fold is what
raising N_SAMPLES toward the thousands used by the full script does.
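The selection step can be pictured with plain dictionaries. This is a hedged sketch rather than the walkthrough's collect_results code: the record layout, the field names (dna_sequence, plddt, domains, protein_length), the helper names, and every value below are placeholders invented for illustration, not model output.

```python
def select_survivors(records):
    """Keep only the records that carry a non-empty DNA sequence."""
    return [record for record in records if record.get("dna_sequence")]


def format_report(records, n_samples):
    """Build the summary line followed by one line per survivor."""
    survivors = select_survivors(records)
    lines = [f"{len(survivors)} passed all eight filters (of {n_samples} sampled)"]
    for record in survivors:
        domains = ", ".join(record["domains"])
        lines.append(
            f"pLDDT={record['plddt']:.1f} domains=[{domains}] "
            f"protein_length={record['protein_length']}"
        )
    return lines


# Placeholder records: the second one was rejected, so it has no sequence.
records = [
    {"dna_sequence": "ATGGCGGCC", "plddt": 80.0,
     "domains": ["RuvC", "HNH"], "protein_length": 3},
    {"plddt": None, "domains": []},
]

report = format_report(records, n_samples=2)
empty_report = format_report([], n_samples=8)
for line in report:
    print(line)
```

With no surviving records, the summary line begins with "0 passed all eight filters", which mirrors the small-batch outcome described above.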
Next Steps
Protein Hunter
Structure-based protein design by cycling.
Using Optimizers
Rejection sampling and the other optimization strategies.