Cell-Type-Specific Regulatory DNA

This program designs a 200 bp regulatory DNA sequence that is active in one cell type but not in others. It uses the Malinois activity model as a differentiable constraint and the GradientOptimizer to shape the sequence: the objective maximizes predicted K562 activity while minimizing predicted activity in HepG2 and SK-N-SH (SKNSH), yielding a K562-specific enhancer. The run takes 300 gradient steps and returns 20 designs. It requires a GPU for the Malinois model.

Runtime: this walkthrough runs real models on a GPU and takes several minutes to complete. The first run is slower because it builds the tool environment and downloads model weights.

The design segment and generator

A Segment is the stretch of sequence being designed; a Construct groups the segments that make up one molecule. The segment is seeded with "A" * SEQ_LENGTH, a placeholder that fixes the length (200 bp), and sequence_type="dna" fixes the vocabulary; the optimizer overwrites the actual bases. Instead of holding discrete characters, the GradientOptimizer represents each position as a set of differentiable logits and updates them by gradient descent. PositionWeightGenerator is the only component that maps those continuous logits back to a discrete sequence; with sampling_mode="argmax" it decodes the most likely base at each position. generator.assign(segment) binds the generator to the segment it decodes.

python

from proto_language.core import Segment, Construct
from proto_language.generator import PositionWeightGenerator, PositionWeightGeneratorConfig

SEQ_LENGTH = 200

segment = Segment(sequence="A" * SEQ_LENGTH, sequence_type="dna", label="enhancer_insert")
construct = Construct([segment], label="k562_specific_design")

generator = PositionWeightGenerator(PositionWeightGeneratorConfig(sampling_mode="argmax"))
generator.assign(segment)
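
As a rough picture of what argmax decoding means, the sketch below turns a small table of per-position logits into a DNA string. It is a standalone toy in plain Python, not proto_language code: the base order "ACGT", the decode_argmax helper, and the logit values are all assumptions made for illustration.

```python
# Toy illustration only: none of these names come from proto_language.
ALPHABET = "ACGT"  # assumed base order for this sketch


def decode_argmax(logits):
    """Return the highest-scoring base at each position."""
    bases = []
    for row in logits:
        best_index = max(range(len(ALPHABET)), key=lambda i: row[i])
        bases.append(ALPHABET[best_index])
    return "".join(bases)


# Three positions, four made-up logits each (columns: A, C, G, T).
toy_logits = [
    [0.1, 2.0, -1.0, 0.3],  # C has the largest logit
    [1.5, 0.2, 0.1, -0.4],  # A has the largest logit
    [-0.3, 0.0, 0.2, 3.1],  # T has the largest logit
]

decoded = decode_argmax(toy_logits)
print(decoded)  # CAT
```

The optimizer only ever changes the numbers in the logit table; the discrete sequence is read off afterwards, which is why the placeholder bases in the seed sequence do not matter.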

The specificity objective

Constraints score how well a sequence meets the design objective; the optimizer searches for sequences that lower the combined loss. Malinois predicts MPRA activity for 200 bp DNA inserts in the K562, HepG2, and SK-N-SH (SKNSH) cell contexts, and each constraint maps the requested cell-type score to a bounded objective: direction="max" rewards higher predicted activity, direction="min" rewards lower activity. The helper builds three constraints over the same segment: maximize K562 activity, minimize HepG2 activity, and minimize SKNSH activity, each with seq_length=SEQ_LENGTH and weight=1.0. Each Constraint carries a label, the key its diagnostics appear under in the result metadata. All three back-propagate through the Malinois model, which is what lets the GradientOptimizer use them.

python

from proto_language.core import Constraint
from proto_language.constraint import MalinoisActivityConfig, malinois_activity_constraint


def malinois(cell_type, direction, label, weight):
    return Constraint(
        inputs=[segment],
        function=malinois_activity_constraint,
        function_config=MalinoisActivityConfig(cell_type=cell_type, direction=direction, seq_length=SEQ_LENGTH),
        label=label,
        weight=weight,
    )


constraints = [
    malinois("K562", "max", "k562_max", 1.0),
    malinois("HepG2", "min", "hepg2_min", 1.0),
    malinois("SKNSH", "min", "sknsh_min", 1.0),
]
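
To see why this combination favors specificity rather than raw strength, here is a minimal sketch of a weighted sum of bounded per-cell-type losses. The logistic squashing in bounded_loss is a hypothetical stand-in, since the exact mapping used by malinois_activity_constraint is not shown here, and the activity values are made-up numbers rather than model output.

```python
import math


def bounded_loss(activity, direction):
    """Map a raw activity to a loss in (0, 1); lower is better.

    Hypothetical logistic squashing, NOT the library's actual mapping.
    """
    squashed = 1.0 / (1.0 + math.exp(-activity))  # value in (0, 1)
    return 1.0 - squashed if direction == "max" else squashed


def combined_loss(activities, objectives):
    """Weighted sum of the per-constraint losses."""
    return sum(
        weight * bounded_loss(activities[cell_type], direction)
        for cell_type, direction, weight in objectives
    )


# Same directions and weights as the three constraints above.
objectives = [("K562", "max", 1.0), ("HepG2", "min", 1.0), ("SKNSH", "min", 1.0)]

# Made-up activities: one K562-specific design, one active everywhere.
specific = {"K562": 6.0, "HepG2": -2.0, "SKNSH": -2.0}
broad = {"K562": 6.0, "HepG2": 6.0, "SKNSH": 6.0}

print(combined_loss(specific, objectives) < combined_loss(broad, objectives))  # True
```

Both toy designs score the same on the K562 term, so the difference in combined loss comes entirely from the two "min" terms penalizing off-target activity.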

Run the gradient optimization

The GradientOptimizer runs continuous gradient descent on the segment’s per-position logits, backpropagating each constraint into a logit gradient, merging the per-constraint gradients, and updating the logits each step. The config sets num_results=20 parallel trajectories over num_steps=300 steps with base learning rate lr=0.5. Updates use ml_optimizer="adam", and the per-constraint gradients are combined with merger="weighted_sum" (each scaled by its weight). lr_schedule="cosine" with scale_lr_by_temperature=True anneals the learning rate on a cosine curve across the run. gumbel_logit_init=True adds Gumbel noise to the initial logits so the 20 trajectories diverge, and save_best=True returns each trajectory’s lowest-loss design rather than its final step. The optional custom_logging callback fires at tracked steps; here track records each snapshot’s sequence and its K562 raw activity (read from the k562_max constraint metadata) into trajectory. Program(..., seed=0) makes the run deterministic.
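
The core loop (relax the sequence to logits, differentiate a score, step the logits, decode) can be pictured with a toy that runs in plain Python. Everything here is an assumption made for illustration: the REWARD table is a hypothetical stand-in for a differentiable activity model, the update is plain gradient descent rather than Adam, and there is a single trajectory with a fixed learning rate.

```python
import math

ALPHABET = "ACGT"  # assumed base order for this sketch

# Hypothetical stand-in for a differentiable activity model: a reward for
# each base at each position (rows: positions, columns: A, C, G, T).
REWARD = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(logits):
    """Loss is minus the expected reward; gradient is taken w.r.t. the logits."""
    loss, grad = 0.0, []
    for row, reward in zip(logits, REWARD):
        probs = softmax(row)
        expected = sum(p * r for p, r in zip(probs, reward))
        loss -= expected
        # d(-expected)/d(logit_j) = -p_j * (r_j - expected)
        grad.append([-p * (r - expected) for p, r in zip(probs, reward)])
    return loss, grad


logits = [[0.0] * 4 for _ in REWARD]  # flat start: every base equally likely
start_loss, _ = loss_and_grad(logits)
for _ in range(200):
    _, grad = loss_and_grad(logits)
    logits = [
        [x - 0.5 * g for x, g in zip(row, g_row)]
        for row, g_row in zip(logits, grad)
    ]
final_loss, _ = loss_and_grad(logits)

decoded = "".join(ALPHABET[max(range(4), key=lambda i: row[i])] for row in logits)
print(decoded, final_loss < start_loss)  # CAT True
```

In the real run the gradient comes from back-propagating through Malinois instead of a lookup table, and three such gradients are merged before each update.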

python

from proto_language.core import Program
from proto_language.optimizer import GradientOptimizer, GradientOptimizerConfig

# Record the sequence (and its K562 activity, when available) at each tracked step.
trajectory = []


def track(step, segments):
    seq = segments[0].proposal_sequences[0]
    act = seq.metadata.get("constraints", {}).get("k562_max", {}).get("data", {}).get("malinois_raw_score")
    trajectory.append((step, str(seq.sequence), act))


optimizer = GradientOptimizer(
    target_segment=segment,
    constructs=[construct],
    generators=[generator],
    constraints=constraints,
    config=GradientOptimizerConfig(
        num_results=20,
        num_steps=300,
        lr=0.5,
        softmax_schedule="constant",
        lr_schedule="cosine",
        scale_lr_by_temperature=True,
        ml_optimizer="adam",
        merger="weighted_sum",
        gumbel_logit_init=True,
        save_best=True,
    ),
    custom_logging=track,
)

program = Program([optimizer], num_results=20, seed=0)
program.run()
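
Two of the config options above are easy to picture in isolation. The sketch below shows a textbook cosine annealing curve and Gumbel-noise initialization of logits in plain Python. It does not reproduce GradientOptimizer internals: the min_lr parameter, the exact schedule formula, the helper names, and the toy 3 x 4 logit shape are assumptions, and the effect of scale_lr_by_temperature is not modeled.

```python
import math
import random


def cosine_lr(step, num_steps, base_lr, min_lr=0.0):
    """Textbook cosine annealing from base_lr (step 0) to min_lr (last step)."""
    progress = step / max(1, num_steps - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def gumbel_noise(rng):
    """Draw one standard Gumbel sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def noisy_initial_logits(rng, num_positions=3, num_bases=4):
    """Flat (all-zero) logits plus independent Gumbel noise per entry."""
    return [[gumbel_noise(rng) for _ in range(num_bases)] for _ in range(num_positions)]


rng = random.Random(0)
init_a = noisy_initial_logits(rng)  # starting point for one trajectory
init_b = noisy_initial_logits(rng)  # a different starting point for another

schedule = [cosine_lr(step, 300, 0.5) for step in range(300)]
print(schedule[0])       # 0.5, the base learning rate
print(init_a != init_b)  # True: the two trajectories start apart
```

Starting each trajectory from different noisy logits is what lets parallel runs settle on different sequences, while the decaying step size makes late updates smaller than early ones.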

Inspect the result

segment.result_sequences holds the returned designs; with save_best=True the first entry is the lowest-loss design. The first block prints a few evenly spaced snapshots from trajectory, each showing the step number, the K562 raw activity at that step, and the start of the sequence, so you can watch the activity rise as the logits are optimized. The final lines print the top design’s full sequence and its predicted K562 activity, read back from the k562_max constraint’s malinois_raw_score metadata.

python

best = segment.result_sequences[0]
k562 = best.metadata.get("constraints", {}).get("k562_max", {}).get("data", {}).get("malinois_raw_score")

def representative(traj, n=4):
    # Pick up to n evenly spaced snapshots, always including the first and last.
    if len(traj) <= n:
        return traj
    idx = sorted({round(i * (len(traj) - 1) / (n - 1)) for i in range(n)})
    return [traj[i] for i in idx]

print("trajectory (the sequence is reshaped toward a K562-specific enhancer):")
for step, seq, act in representative(trajectory):
    act_str = f"{act:6.2f}" if act is not None else "   -  "
    print(f"  step {step:3d} | K562 {act_str} | {seq[:60]}...")

print(f"\ntop design: {best.sequence}")
print(f"K562 predicted activity: {k562}")

trajectory (the sequence is reshaped toward a K562-specific enhancer):
  step   1 | K562   0.59 | CCTGAGAAGGGTACGTCGCTATAATTCTCCTGATTGGTCGGTGCAGCCCAATTTGGGTTG...
  step 101 | K562   9.44 | CCTCGCCCCCATGAGGCGCTATAGAACACCTGACGGCTGCTTCCCGCCCAATTTGGGTTT...
  step 200 | K562  10.20 | CCTCGCCCACATGAGGCGCTATAGAACACCTGACGGCTGCTTCCCGCCCAATTTGGGTTT...
  step 300 | K562  10.31 | CCTCGCCCACATGAGGCGCTATAGAACACCTGACGGCTGCTTCCCGCCCAATTTGAGTTT...

top design: CCTCGCCCACATGAGGCGCTATAGAACACCTGACGGCTGCTTCCCGCCCAATTTGAGTTTGCGCGCGCCATTTGCATGACGGCGACTAATCCTGCCGGCTGTCAGAGTGGCGCGCGAAAGATAAGACCAAAAAACCGCTTAACCGCCCCTTATCTCTTGTACCAGATGCACAGGCCGCCGCCCACTTAGTGAAATTTGAG
K562 predicted activity: 10.392728805541992
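
The step numbers in the printed trajectory follow from the index arithmetic in representative(). The sketch below repeats that arithmetic on its own, assuming the callback recorded one snapshot per step so that trajectory has 300 entries for steps 1 to 300; representative_indices is a helper introduced here for illustration.

```python
def representative_indices(length, n=4):
    """Same index rule as representative() above, returning positions only."""
    if length <= n:
        return list(range(length))
    return sorted({round(i * (length - 1) / (n - 1)) for i in range(n)})


# Assumption: one snapshot per step, so entry i holds step i + 1.
indices = representative_indices(300)
steps = [i + 1 for i in indices]

print(indices)  # [0, 100, 199, 299]
print(steps)    # [1, 101, 200, 300]
```

Under that assumption the four selected steps match the four rows printed above, with rounding explaining why the middle snapshots fall on steps 101 and 200 rather than on exact thirds.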

Next Steps

Gradient Protein Hallucination

Using Optimizers
