This optimizer is open source. Any third-party models, product names, or trademarks referenced are the property of their respective owners, and Proto is not affiliated with them.
seq.logits: an (L, |vocab|) matrix carried on
each of num_results parallel proposal Sequences (one independent trajectory per
result). The companion PositionWeightGenerator is the only thing that maps those
continuous logits back to a discrete sequence — it never proposes; it merely decodes
(argmax or categorical) at tracked steps so snapshots, result_sequences, and the
next-stage handoff carry a real sequence. Logits are seeded once (zeros, or
initial_logits/sequence_bias, optionally plus per-trajectory Gumbel noise so
parallel trajectories diverge) and then mutated in place every step; they are never
re-proposed.
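The seeding and decoding described above can be sketched as follows. This is a minimal illustration in plain Python, assuming zero-initialised logits. The names `seed_logits`, `decode_argmax`, `noise_divisor`, and the `"ACGT"` vocabulary are hypothetical and not part of the optimizer's API; the exclusion of frozen positions from the noise is not modelled here.

```python
import math
import random


def seed_logits(length, vocab_size, num_results, noise_divisor=1.0, seed=0):
    """Build one (length x vocab_size) logit matrix per trajectory.

    Assumption: logits start at zero and receive Gumbel(0, 1) noise divided
    by `noise_divisor`, so each parallel trajectory starts somewhere different.
    """
    rng = random.Random(seed)
    trajectories = []
    for _ in range(num_results):
        matrix = []
        for _ in range(length):
            row = []
            for _ in range(vocab_size):
                # Clamp u away from 0 and 1 so both logarithms are defined.
                u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
                gumbel = -math.log(-math.log(u))
                row.append(0.0 + gumbel / noise_divisor)
            matrix.append(row)
        trajectories.append(matrix)
    return trajectories


def decode_argmax(logits, vocab):
    """Map continuous logits back to a discrete sequence, one symbol per row."""
    return "".join(vocab[max(range(len(row)), key=row.__getitem__)] for row in logits)


trajectories = seed_logits(length=5, vocab_size=4, num_results=3)
decoded = [decode_argmax(t, "ACGT") for t in trajectories]
```

Decoding only reads the logits; in this sketch, as in the description above, it never changes them.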
Each step:
(1) Interpolate soft (relax↔hard blend), hard (straight-through), softmax temperature, and learning rate from their start→end configs/schedules with progress = step / num_steps.
(2) Ask every compiled gradient provider to backpropagate its differentiable (gradient-mode or compiler-backed) constraint into a per-trajectory logit gradient and per-trajectory loss, given the current temperature/soft/hard and the step’s effective weight; non-finite gradients raise.
Then, per trajectory:
(3) Align per-constraint gradient norms (norm_alignment), scale each by its effective weight, and merge them with the configured merger (weighted_sum/pcgrad/mgda).
(4) Zero the gradient at fixed_positions and optionally normalize the merged gradient.
(5) Take one ml_optimizer step (SGD or Adam) at the effective learning rate (_effective_lr optionally scales it by (1 - soft) + soft * temp, floored at min_lr_scale).
After updating, energy_scores is set to the summed weighted constraint losses, which is the exact objective being minimized. At tracking_interval steps (and the last step) the generator decodes logits, proposals sync to results, and a snapshot is saved.
Per-constraint weights can ramp over steps via constraint_weight_schedules (a ConstraintWeightSchedule keyed by Constraint.label); unknown labels warn and are ignored.
With save_best=True (default) the lowest-loss logits per trajectory are restored and re-decoded at the end instead of returning the final step.
Constraints: single target segment only; exactly one PositionWeightGenerator; every constraint must support gradient evaluation.
Chain stages in a Program for multi-phase pipelines (logit-relaxation phase via germinal_logit_preset → softmax-annealing phase via germinal_softmax_preset).
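Steps (3) to (5) can be sketched for a single trajectory as follows. This is an illustration under stated assumptions, not the optimizer's implementation: it covers only the weighted_sum merger with unit-norm alignment and a plain SGD update, and the function names `merge_weighted_sum`, `effective_lr`, and `sgd_step` are hypothetical. Only the scale factor `(1 - soft) + soft * temp` and its floor at `min_lr_scale` are taken from the description above.

```python
import math


def l2_norm(grad):
    """L2 norm of a gradient stored as a list of rows."""
    return math.sqrt(sum(g * g for row in grad for g in row))


def merge_weighted_sum(grads, weights, unit_align=True):
    """Step 3: optionally rescale each gradient to unit norm, then sum with weights."""
    length, vocab = len(grads[0]), len(grads[0][0])
    merged = [[0.0] * vocab for _ in range(length)]
    for grad, weight in zip(grads, weights):
        norm = l2_norm(grad)
        scale = weight / norm if unit_align and norm > 0 else weight
        for i in range(length):
            for j in range(vocab):
                merged[i][j] += scale * grad[i][j]
    return merged


def effective_lr(base_lr, soft, temperature, min_lr_scale=0.0):
    """Step 5: scale the learning rate by (1 - soft) + soft * temp, with a floor."""
    return base_lr * max((1.0 - soft) + soft * temperature, min_lr_scale)


def sgd_step(logits, grad, lr, fixed_positions=()):
    """Steps 4-5: skip frozen rows (same effect as a zeroed gradient), update in place."""
    frozen = set(fixed_positions)
    for i, (row, g_row) in enumerate(zip(logits, grad)):
        if i in frozen:
            continue
        for j, g in enumerate(g_row):
            row[j] -= lr * g
    return logits


logits = [[0.0, 0.0], [1.0, 1.0]]
grads = [[[3.0, 0.0], [0.0, 4.0]], [[0.0, 2.0], [0.0, 0.0]]]
merged = merge_weighted_sum(grads, weights=[1.0, 0.5])
lr = effective_lr(base_lr=0.1, soft=1.0, temperature=0.5)
sgd_step(logits, merged, lr, fixed_positions=[1])  # row 1 stays at [1.0, 1.0]
```

With soft at 1 the scale factor reduces to the temperature itself, so annealing the temperature downward also shrinks the step size until the `min_lr_scale` floor takes over.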
How It Works
The gradient optimizer relaxes the sequence into continuous logits and takes gradient steps that lower the constraint loss, sharpening the relaxation from soft to hard before decoding.
The discrete sequence is relaxed into a continuous logit matrix of shape L×|V|, one per trajectory. Each step sharpens a softmax relaxation, backpropagates every differentiable constraint into a per-trajectory gradient, merges the per-constraint gradients, and applies an SGD or Adam update.
Gradients are merged with weighted_sum, pcgrad, or mgda; fixed_positions stay frozen (gradient set to 0). With save_best, the lowest-loss logits seen across all steps are decoded at the end through the PositionWeightGenerator (argmax by default, or categorical sampling).
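To make the "sharpening" of the softmax relaxation concrete, the sketch below applies a temperature-scaled softmax to one row of logits and shows the one-hot argmax that a fully hard forward pass would use. It is a standalone illustration: the helper names are hypothetical, how the optimizer blends soft and hard internally is not specified here, and the straight-through gradient path needs an autodiff framework, so it is not shown.

```python
import math


def softmax(row, temperature=1.0):
    """Temperature-scaled softmax over one position's logits."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(row):
    """Forward value of a fully hard pass: 1 at the largest logit, 0 elsewhere."""
    best = max(range(len(row)), key=row.__getitem__)
    return [1.0 if j == best else 0.0 for j in range(len(row))]


row = [2.0, 1.0, 0.0]
relaxed = softmax(row, temperature=1.0)    # spread across all three symbols
sharpened = softmax(row, temperature=0.1)  # almost all mass on the first symbol
hard = one_hot_argmax(row)                 # [1.0, 0.0, 0.0]
```

As the temperature falls, the relaxed distribution approaches the one-hot vector, which is why annealing narrows the gap between the continuous objective and the decoded sequence.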
API Reference
Configuration for gradient-based sequence optimization.
Each GradientOptimizer runs one mode (fixed or ramping soft, with optional temperature annealing). Chain multiple in a Program for multi-phase pipelines (e.g. logit phase → softmax phase).
Ramps use progress = step / num_steps with step starting at 1, so step 1 evaluates to start + (end - start) / num_steps (not exactly start); step num_steps evaluates exactly to end.
Candidate designs for this optimizer. Overrides program-level count.
Number of gradient descent steps.
Base learning rate for gradient updates.
Per-position logit bias for the target vocabulary; added to initial logits to seed the search.
Soft sampling weight at the first step. 0 uses hard logits; 1 uses the full softmax over logits.
Soft sampling weight at the final step. 0 uses hard logits; 1 uses the full softmax.
Straight-through blend at step 1. 0 is fully relaxed; 1 is argmax forward + relaxed gradient.
Straight-through blend at the final step. 0 is fully relaxed; 1 is argmax forward + relaxed gradient.
Softmax temperature at the first step. Lower values produce sharper distributions.
Softmax temperature at the final step. Lower values produce sharper distributions.
Curve interpolating the softmax temperature from start to end across optimization steps.
Options: constant, cosine, exponential, hinge, linear, quadratic
LR curve over the temperature endpoints; only active when scale_lr_by_temperature=True.
Options: constant, cosine, exponential, hinge, linear, quadratic
Strategy for merging gradients from multiple constraints.
Options: weighted_sum, pcgrad, mgda
Gradient update rule applied each step. Currently ‘sgd’ or ‘adam’.
Options: sgd, adam
Beta and epsilon parameters used when the update algorithm is ‘adam’.
How per-constraint gradients are rescaled before merging: as-is, unit-normalized, or match-first.
Options: none, unit, match_first
In match_first mode, zero out gradients with norm below this threshold.
Normalize the merged gradient before each update.
‘unit’ rescales the gradient to unit L2 norm; ‘sqrt_length’ scales magnitude by sqrt(length).
Options: unit, sqrt_length
Zero-based positions to freeze during optimization. Pair with sequence_bias to anchor each position.
Multiply LR by a blend of soft weight and softmax temperature; slows updates as sharpness rises.
Lower bound on the learning-rate scale factor when temperature scaling is enabled.
Return the lowest-loss result instead of the last iteration.
Per-constraint weight schedules that override the constraint’s static weight at each step.
Add Gumbel noise to default-init logits (frozen positions excluded) to diverge trajectories.
Divisor for the default-path Gumbel init noise. 1.0 = unscaled; larger shrinks it.
Base logit matrix (rows=positions, cols=vocab) that replaces default initialization.
Zero-based positions perturbed with Gumbel noise and passed through a softmax over initial logits.
Random seed for reproducible optimization, generator, and constraint tool streams.
Save history and log progress every N steps. Step 0 and final step always saved.
Save granular per-proposal results (accept/reject) in history snapshots.
Emit per-step debug information about proposals, scores, and acceptance through the logger.
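The ramp rule stated at the top of this reference (progress = step / num_steps, with step starting at 1) can be checked with a small sketch. The function name `ramp` is hypothetical, and the linear curve is an assumption made for illustration; the other curve options listed above are not modelled.

```python
def ramp(start, end, step, num_steps):
    """Linear ramp with 1-based steps: progress runs from 1/num_steps to 1."""
    progress = step / num_steps
    return start + (end - start) * progress


num_steps = 10
first = ramp(1.0, 0.0, step=1, num_steps=num_steps)          # start + (end - start) / num_steps
last = ramp(1.0, 0.0, step=num_steps, num_steps=num_steps)   # exactly end
```

Because the first step already moves one increment away from start, a ramp from 1.0 to 0.0 over 10 steps begins near 0.9 rather than at 1.0.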
Usage
python
Metadata
| Property | Value |
|---|---|
| Key | gradient |
| Class | GradientOptimizer |
| Targets Single Segment | True |
| Uses GPU | False |
| Required Constraint Mode | gradient |
| Compatible Generators | position-weight |