Pattern
| Aspect | Value |
|---|---|
| Loop shape | Two-stage — N parallel solver calls, then 1 judge call |
| Tools | None — solver returns text, judge returns an index |
| Trajectory names | "solver" (one per attempt) + "judge" (one per task) |
| Termination | All solver + judge calls return |
| Reward shape | Per-trajectory — solvers scored on their own answer, judge on the answer it picked |
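
As a rough illustration of the loop shape in the table, the sketch below runs N solver calls in parallel and then a single judge call. It is not the cookbook's `solver_judge_flow.py`: `Trajectory`, `call_solver`, `call_judge`, and `run_flow` are hypothetical names, and the two model calls are canned stubs so the example runs without a model.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Trajectory:
    name: str    # "solver" or "judge", matching the trajectory names in the table
    output: str  # solver: answer text; judge: the chosen index, as a string


async def call_solver(task: str, attempt: int) -> str:
    # Hypothetical stand-in for a solver model call; returns a canned answer.
    await asyncio.sleep(0)
    return f"answer-{attempt}"


async def call_judge(task: str, answers: list[str]) -> int:
    # Hypothetical stand-in for the judge model; this stub always picks the last answer.
    await asyncio.sleep(0)
    return len(answers) - 1


async def run_flow(task: str, n_solvers: int = 3) -> list[Trajectory]:
    # Stage 1: N solver calls run concurrently.
    answers = await asyncio.gather(*(call_solver(task, i) for i in range(n_solvers)))
    # Stage 2: one judge call sees every candidate answer and returns an index.
    picked = await call_judge(task, list(answers))
    trajectories = [Trajectory("solver", a) for a in answers]
    trajectories.append(Trajectory("judge", str(picked)))
    # The flow terminates once all solver calls and the judge call have returned.
    return trajectories


trajectories = asyncio.run(run_flow("2 + 2 = ?"))
```

With `n_solvers=3` this yields three "solver" trajectories and one "judge" trajectory for the task.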
Architecture
The evaluator scores each trajectory independently. GRPO then groups trajectories by name across rollouts: all solver trajectories for one task form one group, and all judge trajectories for that task form another.
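
The scoring and grouping described above can be sketched as follows. This is a minimal illustration, not the cookbook's `evaluator.py`: the dict layout, the exact-match reward, and the helper names `score` and `group_advantages` are assumptions, and the mean/std normalization is one common GRPO formulation rather than a confirmed detail of this trainer.

```python
from collections import defaultdict
from statistics import mean, pstdev


def score(rollout, ground_truth):
    """Assign a reward to every trajectory in one rollout (hypothetical dict layout)."""
    answers = [t["output"] for t in rollout if t["name"] == "solver"]
    for t in rollout:
        if t["name"] == "solver":
            # A solver is scored on its own answer.
            t["reward"] = 1.0 if t["output"] == ground_truth else 0.0
        else:
            # The judge is scored on the answer it picked (its output is an index).
            picked = answers[int(t["output"])]
            t["reward"] = 1.0 if picked == ground_truth else 0.0
    return rollout


def group_advantages(rollouts):
    """Group trajectories by name across rollouts of one task, then normalize rewards."""
    groups = defaultdict(list)
    for rollout in rollouts:
        for t in rollout:
            groups[t["name"]].append(t)
    for members in groups.values():
        rewards = [t["reward"] for t in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        for t in members:
            # Assumed normalization: reward relative to the group mean, scaled by std.
            t["advantage"] = (t["reward"] - mu) / (sigma + 1e-6)
    return groups


# Two rollouts of the same task, each with two solvers and one judge.
rollout_a = [{"name": "solver", "output": "4"}, {"name": "solver", "output": "5"},
             {"name": "judge", "output": "0"}]
rollout_b = [{"name": "solver", "output": "5"}, {"name": "solver", "output": "4"},
             {"name": "judge", "output": "0"}]
groups = group_advantages([score(r, "4") for r in (rollout_a, rollout_b)])
```

Here the solver group holds four trajectories and the judge group holds two, so a judge is only ever compared against other judges, never against solvers.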
Install
Dataset
Eval
Training
Key code
The flow:
Files
| File | Description |
|---|---|
| solver_judge_flow.py | Multi-agent AgentFlow (N parallel solvers + 1 judge) |
| evaluator.py | Per-trajectory reward scoring |
| train.py + train_{tinker,verl}.sh | Hydra entry points |
| pyproject.toml | Plugin entry-point declarations |
| test.py | Unit tests |
On GitHub
cookbooks/solver_judge_flow
Full source, README, and runnable launch scripts
See also
Solver-judge tutorial
Step-by-step walkthrough of the design from scratch

