A multi-agent flow that trains a solver-judge system on the countdown task using the AgentFlow protocol. The solver generates N candidate solutions in parallel; the judge evaluates them and selects the best. The trainer scores each role separately so GRPO can compute advantages within each trajectory group.

This cookbook is the canonical example of returning multiple named trajectories from a single AgentFlow. It pairs with the longer solver-judge tutorial, which walks through the design step by step.
The evaluator scores each trajectory independently. GRPO then groups trajectories by name across rollouts: all solver trajectories for a task form one group, and all judge trajectories form another.
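To make the grouping concrete, here is a minimal sketch of the idea in plain Python. The function name `group_advantages` and the `(name, reward)` pair representation are hypothetical, not part of the rLLM API; the sketch only illustrates how per-name groups yield GRPO-style normalized advantages.

```python
from collections import defaultdict
from statistics import mean, pstdev

def group_advantages(rollouts):
    """Group (name, reward) pairs by trajectory name, then compute a
    GRPO-style advantage within each group: reward minus the group mean,
    divided by the group standard deviation."""
    groups = defaultdict(list)
    for name, reward in rollouts:
        groups[name].append(reward)
    advantages = {}
    for name, rewards in groups.items():
        mu = mean(rewards)
        sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
        advantages[name] = [(r - mu) / sigma for r in rewards]
    return advantages

# All solver rewards form one group, all judge rewards another,
# so each role is only ever compared against itself.
adv = group_advantages([
    ("solver", 1.0), ("solver", 0.0),
    ("judge", 1.0), ("judge", 1.0),
])
```

Because normalization happens per name, a solver trajectory's advantage never depends on how the judge scored, and vice versa.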
import rllm
from openai import AsyncOpenAI
# Task, AgentConfig, and Episode are rLLM types; the exact import path
# is omitted here, as in the original cookbook excerpt.

N_SOLUTIONS = 2

@rllm.rollout(name="solver-judge")
async def solver_judge_flow(task: Task, config: AgentConfig) -> Episode:
    client = AsyncOpenAI(base_url=config.base_url, api_key="EMPTY")
    problem = task.instruction

    # 1. Solver runs N solutions in parallel.
    solver_trajectories = await _generate_solutions(client, config.model, problem)

    # 2. Judge picks one.
    solutions = [t.steps[0].action for t in solver_trajectories]
    judge_trajectory = await _judge_solutions(client, config.model, problem, solutions)
    selected = judge_trajectory.steps[0].action

    return Episode(
        trajectories=[*solver_trajectories, judge_trajectory],
        artifacts={"answer": selected},
    )
Solver trajectories share the per-task ground truth; the judge receives its own reward based on whether the solution it selected was correct:
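A minimal sketch of that scoring split, assuming a countdown-style check where a candidate is an arithmetic expression that must evaluate to the target. The function `score_episode` and its signature are hypothetical, not the cookbook's actual evaluator:

```python
def score_episode(solver_answers, selected, target):
    """Hypothetical evaluator sketch: each solver answer is scored
    against the per-task ground truth target; the judge is scored on
    whether the answer it selected was correct."""
    def correct(expr):
        try:
            # Countdown check: the expression must evaluate to the target.
            return eval(expr) == target
        except Exception:
            return False

    solver_rewards = [1.0 if correct(a) else 0.0 for a in solver_answers]
    judge_reward = 1.0 if correct(selected) else 0.0
    return solver_rewards, judge_reward
```

Note that the judge can earn reward even when some solver candidates are wrong, as long as it selects a correct one.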