AgentFlow is the recommended way to author an agent in rLLM. An AgentFlow is a plain async function that takes a Task and an AgentConfig and returns an Episode, a single Trajectory, or None. The same function runs both for evaluation and for training — at training time and at eval time the runner routes config.base_url through a model gateway that captures token IDs and logprobs transparently, so the flow code itself doesn’t change.
For a conceptual walkthrough see AgentFlow & Evaluator; for worked examples see cookbooks/.
Eval and training share one engine
Bothrllm eval and rllm train drive rllm.engine.agentflow_engine.AgentFlowEngine. The same _run_single loop is used end-to-end: gateway session → run flow → fetch traces → enrich Episode → evaluate. The eval-specific concerns (per-task verifier resolution, sandbox lifecycle) plug in via the engine’s optional TaskHooks parameter:
rllm.hooks.SandboxTaskHooks, which detects each task’s [verifier] block, builds a sandbox if needed, and resolves a per-task evaluator. Training leaves hooks=None and uses a single engine-bound evaluator. After the refactor that introduced the unified engine, rllm eval returns Episodes whose Steps are populated from gateway traces — flows that return None work identically at eval and training time.
For training agents that need a sandbox per rollout (sandboxed code agents, harbor tasks), wire the same hook style at trainer construction time. The engine handles per-rollout setup/teardown in a try/finally so retries get fresh sandboxes automatically.
The protocol
run (sync) or arun (async). The runner prefers arun when running inside an event loop. In practice you almost always write the async form.
For single-agent flows, returning None is the simplest path — the framework builds an Episode with one Trajectory, and gateway-captured traces fill in the Steps. For multi-trajectory flows (e.g. solver / judge), return an explicit Episode with named trajectories so the trainer can group them for advantage computation.
@rllm.rollout decorator
The simplest way to satisfy the AgentFlow protocol is to decorate a plain function:
AgentFlowFn object that exposes .run() (sync, blocks until done) and .arun() (async). Both are usable directly; the trainer/runner calls them automatically.
Bare and parameterized forms
name is what shows up on Trajectory.name when the framework auto-builds a trajectory (i.e. when the function returns None or a Trajectory whose name is unset). It is also the role the trainer uses to group rollouts of the same task into a TrajectoryGroup for advantage computation, so it must be stable across rollouts.
Return-value coercion
The same coercion applies whether you use@rllm.rollout or implement the AgentFlow protocol directly on a class — both go through rllm.types._coerce_to_episode.
| Function returns | Wrapped as |
|---|---|
Episode | passed through (multi-trajectory flows must use this) |
Trajectory | Episode(trajectories=[t]). The trajectory is left untouched — the evaluator parses whatever the user put on it. |
None | Episode(trajectories=[Trajectory(name=…, steps=[])]). Gateway traces fill in the Steps during enrichment; the evaluator reads what it needs from those steps (e.g. step.model_response, step.chat_completions). |
TypeError. The canonical patterns are: return None for single-agent flows where the gateway captures everything, and return Episode(...) when you need explicit artifacts or multiple named trajectories — see cookbooks/solver_judge_flow/.
Task
The first argument to every AgentFlow.run:
Task is pure data. The instruction is rendered ahead of time (from a JSONL row, an instruction.md, or an instruction.md.tpl template). metadata carries everything the verifier or the flow needs at runtime — the source row for catalog datasets, the parsed task.toml for sandbox tasks, the gym-env config for cookbooks/frozenlake.
AgentConfig
The second argument:
AsyncOpenAI(base_url=config.base_url, api_key="EMPTY") and call .chat.completions.create(model=config.model, …) — that’s the canonical wiring. Don’t hard-code a base_url or model in the flow body.
Evaluator protocol
Episode produced by an AgentFlow. Set traj.reward on each trajectory if you need per-trajectory rewards (e.g. solver vs judge in cookbooks/solver_judge_flow); set EvalOutput.reward for the episode-level scalar that rllm eval aggregates and rllm train feeds into advantage computation.
@rllm.evaluator decorator
@rllm.rollout, supports bare and parameterized forms (@rllm.evaluator(register="my_eval")).
EvalOutput
signals is the right place for per-axis metrics that aggregate across the eval — accuracy, table-access rate, judge-correctness, etc. rllm eval reports the mean of each signal across the dataset.
Return-value coercion
The decorator acceptsEvalOutput, a plain float (treated as reward), or a (reward: float, is_correct: bool) tuple. Returning the explicit EvalOutput keeps the signal/metadata channels available.
run_agent_flow helper
For ad-hoc use outside the trainer / runner:
arun when present, falls back to run in a thread executor so sync flows don’t block the event loop.
Data types
The shapes the protocols return and consume. All live inrllm.types and are re-exported from rllm.agents for backward compatibility.
Action
Wraps an arbitrary action emitted by an agent.
The action content (string, dict, or any type).
Step
A single LLM interaction. The first group of fields is what every flow populates; the second group is filled in transparently by the gateway during training.
Auto-generated UUID.
Optional structured input (rendered prompt, tool args, …).
Optional structured output (parsed answer, return value, …).
The action taken at this step (parsed answer, tool call, …).
Per-step reward (set by the evaluator if you score per-step).
Whether the episode ended at this step.
Arbitrary per-step metadata (also accessible as
step.info).The chat history at this step in OpenAI message format.
The raw assistant content from this step’s LLM call.
Reasoning text (e.g.
<think>…</think> content extracted from the response).Prompt token IDs.
Response token IDs.
Per-token logprobs.
The full structured output from the rollout engine.
Per-token or scalar advantage, populated by the trainer.
Model-weight version at generation time (used for async-staleness tracking).
Trajectory
A sequence of Steps with a name. The name is what the trainer uses to group trajectories across rollouts when computing advantages — see cookbooks/solver_judge_flow/ for an example with two named groups (solver / judge).
Auto-generated UUID.
Trajectory role name. Used for advantage grouping. Default:
"default_traj_name".Ordered list of steps in this trajectory.
Trajectory-level reward (set by the evaluator for per-trajectory scoring).
Optional final answer / return value.
Arbitrary per-trajectory metadata (also accessible as
traj.info).is_cumulative(): returns True if every step’s chat_completions is a strict superset of the previous step’s — useful for trainers that need to know whether the trajectory shares a single growing context vs. independent turns.
Episode
The top-level return shape of an AgentFlow. Bundles all trajectories from one rollout plus any artifacts the evaluator will read.
Auto-generated UUID. The runner overrides this to
f"{task.id}:{rollout_idx}".Task data (often
task.id or the metadata dict, depending on the flow).All trajectories produced during this rollout.
Free-form output bag the evaluator reads. Convention: store the agent’s final answer at
artifacts["answer"].Whether this episode counts as a correct solve. The evaluator typically writes this.
Why the episode ended (set by the trainer / runner, not usually by the flow).
Optional per-episode metrics that the trainer logs.
Arbitrary metadata.
TrajectoryGroup
The trainer reorganizes per-rollout Episode objects into per-task TrajectoryGroups for advantage computation — all solver trajectories for one task into one group, all judge trajectories into another, and so on. Most users don’t construct these directly; the trainer does.
All trajectories in this group (typically same
name, same task).Identifier in the form
{task_id}:{role} (e.g. "task1:solver").Per-trajectory metadata aligned with
trajectories.Episode artifacts convention
The convention across all rLLM cookbooks: the flow stores its final user-facing answer inepisode.artifacts["answer"], and the evaluator reads it from there. This keeps reward computation outside the flow (so the same flow is reusable with different graders) and gives rllm.eval.reward_fns._helpers.extract_answer_text a single place to look.
See also
Cookbooks
Seven worked AgentFlow examples
AgentFlow & Evaluator
Conceptual walkthrough of the protocol
Workflows
The legacy Workflow path (uses
BaseAgent + BaseEnv)Trainer
Wire an AgentFlow + Evaluator into RL training

