

A multi-agent flow that trains a solver-judge system on the countdown task using the AgentFlow protocol. The solver generates N candidate solutions in parallel; the judge evaluates them and selects the best. The trainer scores each role separately so GRPO can compute advantages within each trajectory group. This cookbook is the canonical example of returning multiple named trajectories from a single AgentFlow. It pairs with the longer solver-judge tutorial, which walks through the design step by step.

Pattern

| Aspect | Value |
| --- | --- |
| Loop shape | Two-stage — N parallel solver calls, then 1 judge call |
| Tools | None — solver returns text, judge returns an index |
| Trajectory names | `"solver"` (one per attempt) + `"judge"` (one per task) |
| Termination | All solver and judge calls return |
| Reward shape | Per-trajectory — solvers scored on their own answer, judge on the answer it picked |

Architecture

AgentFlow.run(task, config)
  ├── Solver (N parallel threads)
  │     └── client.chat.completions.create(...)
  │         → Trajectory(name="solver", steps=[Step(action=parsed_answer)])
  ├── Judge
  │     └── client.chat.completions.create(...)
  │         → Trajectory(name="judge", steps=[Step(action=selected_answer)])
  └── Episode(trajectories=[solver_0, ..., solver_{N-1}, judge])
The evaluator scores each trajectory independently. GRPO then groups by name across rollouts: all solver trajectories for a task go into one group, and all judge trajectories into another.
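The grouping idea can be sketched in a few lines. This is a simplified illustration, not rllm's internals: episodes are plain lists of `{"name", "reward"}` dicts, and the advantage is just mean-centered within a group (real GRPO typically also normalizes by the group's standard deviation).

```python
from collections import defaultdict

def group_rewards_by_name(episodes):
    # Collect per-trajectory rewards keyed by trajectory name, mirroring
    # how GRPO forms advantage groups across rollouts of the same task.
    groups = defaultdict(list)
    for episode in episodes:
        for traj in episode:
            groups[traj["name"]].append(traj["reward"])
    return dict(groups)

def grpo_advantages(rewards):
    # Mean-centered advantage within one group (simplified: no std scaling).
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

Because solver and judge rewards land in separate groups, a strong judge is never penalized for weak solver attempts in the same episode, and vice versa.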

Install

uv pip install -e ".[tinker]"                          # rllm + tinker backend
uv pip install -e cookbooks/solver_judge_flow          # this cookbook
rllm agent list                                        # should show "solver_judge"

Dataset

rllm dataset pull countdown
The countdown task asks the model to combine numbers with arithmetic to reach a target — a clean reasoning testbed.
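For intuition, a correct answer is an arithmetic expression that uses exactly the given numbers and evaluates to the target. A minimal checker along those lines (a hypothetical helper for illustration, not the cookbook's `compute_score`):

```python
import re

def check_countdown(expr: str, numbers: list[int], target: int) -> bool:
    # Correct iff the expression uses exactly the given numbers (each once)
    # and evaluates to the target. Hypothetical illustration only.
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return False
    try:
        return eval(expr, {"__builtins__": {}}) == target
    except (SyntaxError, ZeroDivisionError):
        return False
```

For example, `(25 - 5) * 4` reaches a target of 80 with the numbers 25, 5, and 4, while `25 * 4` fails the check because it leaves the 5 unused.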

Eval

rllm eval countdown \
    --agent solver_judge \
    --evaluator solver_judge_countdown \
    --model Qwen/Qwen3-8B \
    --base-url http://localhost:8000/v1 \
    --max-examples 20

Training

# Tinker (single-machine LoRA)
bash cookbooks/solver_judge_flow/train_tinker.sh

# Verl (distributed GPU)
bash cookbooks/solver_judge_flow/train_verl.sh

Key code

The flow:
N_SOLUTIONS = 2

@rllm.rollout(name="solver_judge")
async def solver_judge_flow(task: Task, config: AgentConfig) -> Episode:
    client = AsyncOpenAI(base_url=config.base_url, api_key="EMPTY")
    problem = task.instruction

    # 1. Solver runs N solutions in parallel.
    solver_trajectories = await _generate_solutions(client, config.model, problem)

    # 2. Judge picks one.
    solutions = [t.steps[0].action for t in solver_trajectories]
    judge_trajectory = await _judge_solutions(client, config.model, problem, solutions)

    selected = judge_trajectory.steps[0].action
    return Episode(
        trajectories=[*solver_trajectories, judge_trajectory],
        artifacts={"answer": selected},
    )
The evaluator scores each trajectory independently. Solver trajectories share the per-task ground truth; the judge gets its own reward depending on whether the selected solution was correct:
@rllm.evaluator
def solver_judge_countdown_evaluator(task: dict, episode: Episode) -> EvalOutput:
    ground_truth = {"target": task["target"], "numbers": task["nums"]}

    judge_reward = 0.0
    is_correct = False
    for traj in episode.trajectories:
        answer = traj.steps[-1].action if traj.steps else ""
        score = compute_score(str(answer), ground_truth)
        traj.reward = 1.0 if score >= 1.0 else 0.0
        if traj.name == "judge":
            judge_reward = traj.reward
            is_correct = traj.reward >= 1.0

    return EvalOutput(reward=judge_reward, is_correct=is_correct, signals=[...])
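The judge needs no special scoring path: its trajectory's final action is the selected solution text, so scoring it with the same function scores the pick. A toy walkthrough with a hypothetical stand-in scorer (not the cookbook's `compute_score`):

```python
def toy_score(answer: str, target: int) -> float:
    # Hypothetical stand-in for compute_score: 1.0 iff the expression
    # evaluates to the target (number usage not checked here).
    try:
        return 1.0 if eval(answer, {"__builtins__": {}}) == target else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0

trajectories = [
    {"name": "solver", "action": "3 * 7"},  # wrong for target 24
    {"name": "solver", "action": "4 * 6"},  # correct
    {"name": "judge",  "action": "4 * 6"},  # the judge emitted its pick
]
rewards = [toy_score(t["action"], 24) for t in trajectories]
```

Here the judge earns 1.0 because it picked the correct candidate; had it copied `3 * 7`, it would score 0.0 even though a correct solution existed in the pool.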

Files

| File | Description |
| --- | --- |
| `solver_judge_flow.py` | Multi-agent AgentFlow (N parallel solvers + 1 judge) |
| `evaluator.py` | Per-trajectory reward scoring |
| `train.py` + `train_{tinker,verl}.sh` | Hydra entry points |
| `pyproject.toml` | Plugin entry-point declarations |
| `test.py` | Unit tests |

On GitHub

cookbooks/solver_judge_flow

Full source, README, and runnable launch scripts

See also

Solver-judge tutorial

Step-by-step walkthrough of the design from scratch