This program searches for novel Cas9-like CRISPR loci. An autoregressive genome model (Evo1, trained on CRISPR loci) proposes candidate DNA, and a RejectionSamplingOptimizer runs each candidate through an ordered cascade of eight filters arranged cheap-to-expensive:
  1. orf_filter: the locus must contain a long open reading frame
  2. cas9_phmm_filter: the translated protein matches a Cas9 profile HMM (PyHMMER)
  3. crispr_array_filter: a CRISPR repeat array is present (MinCED)
  4. identity_filter: the protein is not too close to known Cas9s (MMseqs2)
  5. gap_gini_filter: the alignment gap pattern is well distributed (MMseqs2)
  6. domain_filter: the RuvC and HNH catalytic domains are present (PyHMMER)
  7. tracr_filter: a compatible tracrRNA can be folded (CRISPR-tracrRNA)
  8. structure_filter: the protein folds to a confident, compact structure (AlphaFold3)
Because every constraint carries a threshold, the optimizer treats them as ordered filters: a candidate that fails one filter is never evaluated by the later, more expensive ones. The costly AlphaFold3 fold therefore runs only on candidates that already look like complete Cas9 loci. The full script draws thousands of samples to find the rare survivors that reach the structure filter. This walkthrough imports the program builder and draws a small batch, so it runs on a single GPU in a few minutes; most candidates are rejected by the early filters, which is exactly the behavior the cascade is designed for.
Runtime: this walkthrough runs real models on a GPU and takes several minutes to complete. The first run is slower because it builds the tool environment and downloads model weights.
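
The short-circuit behavior of the cascade can be illustrated with a toy sketch. Nothing below is the library's implementation: run_cascade, the two toy filters, and the call counters are hypothetical names chosen for illustration, and the only conventions borrowed from this walkthrough are that a filter scores 0.0 for a pass and 1.0 for a fail, and that scores above the threshold reject the candidate.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A filter is a (name, scoring function) pair; 0.0 means pass, 1.0 means fail.
Filter = Tuple[str, Callable[[str], float]]


def run_cascade(candidate: str, filters: List[Filter], threshold: float = 0.5) -> Optional[str]:
    """Return the name of the first filter the candidate fails, or None if it passes all."""
    for name, score in filters:
        if score(candidate) > threshold:
            return name  # stop here: the later, costlier filters never run
    return None


# Count calls so the saving from short-circuiting is visible.
calls: Dict[str, int] = {"cheap": 0, "expensive": 0}


def cheap_length_check(seq: str) -> float:
    """Toy stand-in for an inexpensive early filter."""
    calls["cheap"] += 1
    return 0.0 if len(seq) >= 12 else 1.0


def expensive_check(seq: str) -> float:
    """Toy stand-in for a costly late filter."""
    calls["expensive"] += 1
    return 0.0 if seq.startswith("ATG") else 1.0


toy_filters: List[Filter] = [
    ("cheap_length_check", cheap_length_check),
    ("expensive_check", expensive_check),
]
toy_candidates = ["ATG", "ATGAAACCCGGGTTT", "CCCAAACCCGGGTTT", "ACGT"]

verdicts = [run_cascade(c, toy_filters) for c in toy_candidates]
print(verdicts)
print(calls)
```

Two of the four toy candidates fail the cheap check, so the expensive check runs only twice. The same ordering argument is why the AlphaFold3 fold sits last in the real cascade.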

Building the program

build_program assembles the Evo1 generator and the eight filter constraints into a single RejectionSamplingOptimizer. The generator is Evo1Generator running the evo-1-8k-crispr checkpoint; the arguments passed here scale the search down. n_samples is the number of candidate sequences drawn, temperature=0.5 controls the sharpness of the sampling distribution (values below 1 sharpen it toward the most likely tokens), and top_k_val=4 restricts sampling at each step to the four most probable tokens. batch_size sets how many sequences the generator produces per GPU batch, and af3_output_dir names the directory where the structure filter writes its AlphaFold3 outputs. Each of the eight constraints is given threshold=0.5. A constraint returns 0.0 for a candidate that passes and 1.0 for one that fails, so scores at or below the threshold are accepted and scores above it are rejected; this is what turns the constraints into ordered filters.
python
import sys
from pathlib import Path

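# The eight-filter cascade and its data paths are assembled by the example script's builder.
# Put the repository root on sys.path so `examples` is importable; this assumes the
# notebook is run from a directory two levels below that root.
sys.path.insert(0, str(Path.cwd().parents[1]))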
from examples.scripts.evocas9_rejection_sampling import build_program, collect_results

N_SAMPLES = 8

program, locus = build_program(
    n_samples=N_SAMPLES,
    temperature=0.5,
    top_k_val=4,
    batch_size=N_SAMPLES,
    af3_output_dir="cas9_af3_pdbs",
)
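
To make the effect of temperature and top_k_val concrete, here is a minimal sketch of top-k sampling with a temperature-scaled softmax over a toy set of logits. It is not Evo1Generator's code: the function names, the five-token vocabulary, and the logit values are all hypothetical, and the sketch only illustrates the general technique the two parameters refer to.

```python
import math
import random
from typing import Dict


def top_k_temperature_probs(logits: Dict[str, float], temperature: float, top_k: int) -> Dict[str, float]:
    """Keep the top_k highest-logit tokens, then apply a temperature-scaled softmax."""
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    scaled = [value / temperature for _, value in kept]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {token: w / total for (token, _), w in zip(kept, weights)}


def sample_token(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical logits for a single generation step.
toy_logits = {"A": 2.0, "C": 1.0, "G": 0.5, "T": 0.0, "N": -3.0}

plain = top_k_temperature_probs(toy_logits, temperature=1.0, top_k=4)
sharp = top_k_temperature_probs(toy_logits, temperature=0.5, top_k=4)

print({t: round(p, 3) for t, p in plain.items()})
print({t: round(p, 3) for t, p in sharp.items()})
print(sample_token(sharp, random.Random(0)))
```

With top_k=4 the lowest-ranked token is never sampled, and lowering the temperature from 1.0 to 0.5 moves probability mass toward the highest-logit token.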

Running the cascade

program.run() draws the proposals and scores each one through the constraints. The RejectionSamplingOptimizer generates the candidates in independent batches, keeping no state between batches, and retains the best num_results by lowest energy. Because every constraint carries a threshold, scoring short-circuits: a candidate that fails one filter is not evaluated by the later, more expensive ones, so the AlphaFold3 fold runs only on candidates that already passed the seven cheaper checks. Wrapping the call in ToolInstance.persist() keeps the tool environments and loaded model weights alive for the whole run instead of tearing them down between constraint calls.
python
from proto_tools.utils.tool_instance import ToolInstance

with ToolInstance.persist():
    program.run()
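
The batch-wise sampling and "keep the best num_results by lowest energy" behavior described above can be sketched in a few lines. This is an illustration under assumptions, not the RejectionSamplingOptimizer source: rejection_sample, propose_toy, and gc_energy are hypothetical, and the toy energy (distance of GC content from 0.5) merely stands in for whatever energy the real constraints produce.

```python
import heapq
import random
from typing import Callable, List, Tuple


def rejection_sample(
    propose: Callable[[int], List[str]],
    energy: Callable[[str], float],
    n_samples: int,
    batch_size: int,
    num_results: int,
) -> List[Tuple[float, str]]:
    """Draw n_samples candidates in independent batches; keep the num_results lowest energies."""
    scored: List[Tuple[float, str]] = []
    drawn = 0
    while drawn < n_samples:
        size = min(batch_size, n_samples - drawn)
        # Each batch is proposed from scratch; earlier batches do not influence it.
        scored.extend((energy(c), c) for c in propose(size))
        drawn += size
    return heapq.nsmallest(num_results, scored)


rng = random.Random(0)


def propose_toy(size: int) -> List[str]:
    """Hypothetical generator: random 20-nucleotide strings."""
    return ["".join(rng.choice("ACGT") for _ in range(20)) for _ in range(size)]


def gc_energy(seq: str) -> float:
    """Hypothetical energy: distance of GC fraction from 0.5 (lower is better)."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    return abs(gc - 0.5)


best = rejection_sample(propose_toy, gc_energy, n_samples=10, batch_size=4, num_results=3)
for e, seq in best:
    print(f"{e:.2f} {seq}")
```

Because no state carries over between batches, the batch size affects only how the work is chunked on the GPU, not which candidates can be proposed.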

Collecting survivors

collect_results reads each retained candidate back from the segment along with the metadata each filter recorded. survivors keeps only the entries that carry a DNA sequence, which are the candidates that passed all eight filters, including the AlphaFold3 structure check. The printout reports how many of the N_SAMPLES candidates survived and, for each survivor, its pLDDT, the catalytic domains found, and the predicted protein length. On a small batch this prints 0 passed all eight filters: most candidates are rejected by the early filters, which is the behavior the cheap-to-expensive cascade is designed for. Raising N_SAMPLES toward the thousands used by the full script is what surfaces Cas9-scale survivors that reach the fold.
python
results = collect_results(locus, 0.5, 4)
survivors = [r for r in results if r["dna_sequence"]]

print(f"sampled {N_SAMPLES} candidates; {len(survivors)} passed all eight filters")
for r in survivors:
    print(f"  pLDDT={r['plddt']} domains={r['domains_found']} protein_len={len(r['protein_sequence'] or '')}")
sampled 8 candidates; 0 passed all eight filters
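
A rough calculation shows why a batch of 8 is expected to yield no survivors while thousands of samples can. The sketch below assumes each candidate survives the full cascade independently with the same probability; the value 0.001 is a made-up placeholder, not a measured survival rate for this program, and both helper functions are hypothetical.

```python
import math


def p_at_least_one(p_survive: float, n_samples: int) -> float:
    """Probability that at least one of n_samples independent candidates survives."""
    return 1.0 - (1.0 - p_survive) ** n_samples


def samples_needed(p_survive: float, target: float) -> int:
    """Smallest n for which p_at_least_one(p_survive, n) reaches the target probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_survive))


P_SURVIVE = 0.001  # hypothetical per-candidate survival probability, for illustration only

small = p_at_least_one(P_SURVIVE, 8)
large = p_at_least_one(P_SURVIVE, 5000)
needed = samples_needed(P_SURVIVE, 0.95)

print(f"n=8:    P(at least one survivor) = {small:.4f}")
print(f"n=5000: P(at least one survivor) = {large:.4f}")
print(f"samples for a 95% chance of one survivor: {needed}")
```

Under this assumed rate, a batch of 8 has less than a 1% chance of producing a survivor, while a run in the thousands makes one very likely. The real rate depends on the model and thresholds and has to be measured from an actual run.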

Next Steps

Protein Hunter

Structure-based protein design by cycling.

Using Optimizers

Rejection sampling and the other optimization strategies.