AgentFlow is the recommended way to author an agent in rLLM. An AgentFlow is a plain async function that takes a Task and an AgentConfig and returns an Episode. The same function runs both for evaluation and for training — at training time the trainer routes config.base_url through a model gateway that captures token IDs and logprobs transparently, so the flow code itself doesn’t change.
For a conceptual walkthrough see AgentFlow & Evaluator; for worked examples see cookbooks/.
The protocol
An AgentFlow exposes run (sync) or arun (async). The runner prefers arun when running inside an event loop; in practice you almost always write the async form.
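Sketched as a structural protocol (a minimal illustration, not the library's literal definition; the class name AgentFlowLike is hypothetical, and the rllm.types import path follows the Data types section below):

```python
from typing import Protocol

from rllm.types import AgentConfig, Episode, Task  # import path per the Data types section


class AgentFlowLike(Protocol):
    """Anything with run/arun of this shape satisfies the protocol."""

    def run(self, task: Task, config: AgentConfig) -> Episode: ...

    async def arun(self, task: Task, config: AgentConfig) -> Episode: ...
```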
@rllm.rollout decorator
The simplest way to satisfy the AgentFlow protocol is to decorate a plain function:
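A minimal sketch (the body is illustrative; see the AgentConfig section below for the canonical client wiring):

```python
import rllm


@rllm.rollout
async def echo_flow(task, config):
    # Any plain async function of (Task, AgentConfig) works; returning a raw
    # string is fine, the decorator coerces it into an Episode (see below).
    return f"echo: {task.instruction}"
```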
The decorator returns an AgentFlowFn object that exposes .run() (sync, blocks until done) and .arun() (async). Both are usable directly; the trainer/runner calls them automatically.
Bare and parameterized forms
In the parameterized form, name is what shows up on Trajectory.name when the function returns a non-Episode (raw str, dict, or list[Trajectory]) and the decorator coerces it. When the function already returns an Episode, the trajectory names you set inside the function are preserved.
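Both forms in a sketch (the name keyword is inferred from the description above; the exact signature may differ):

```python
import rllm


@rllm.rollout                 # bare form: coerced trajectories get the default name
async def bare_flow(task, config):
    return "hello"


@rllm.rollout(name="solver")  # parameterized form: coerced trajectories are named "solver"
async def named_flow(task, config):
    return "hello"
```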
Return-value coercion
The decorator accepts several return-value shapes for convenience:

| Function returns | Decorator wraps as |
|---|---|
| Episode | passed through unchanged |
| Trajectory | wrapped in Episode(trajectories=[t]) |
| list[Trajectory] | wrapped in Episode(trajectories=[…]) |
| str / dict / anything else | wrapped in Episode(trajectories=[Trajectory(name=…, output=…)]), with the value placed on output |
Episode is the canonical form — see any flow under cookbooks/.
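For instance, calling the named_flow sketch above (per the table, a raw str lands on output; the assertions are illustrative):

```python
async def demo(task, config):
    episode = await named_flow.arun(task, config)     # coerced to an Episode
    assert episode.trajectories[0].name == "solver"   # from the decorator's name
    assert episode.trajectories[0].output == "hello"  # raw str placed on output
    return episode
```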
Task
The first argument to every AgentFlow.run:
Task is pure data. The instruction is rendered ahead of time (from a JSONL row, an instruction.md, or an instruction.md.tpl template). metadata carries everything the verifier or the flow needs at runtime — the source row for catalog datasets, the parsed task.toml for sandbox tasks, the gym-env config for cookbooks/frozenlake.
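A hand-constructed sketch (the keyword names are assumptions based on the fields described above, not a confirmed constructor signature):

```python
from rllm.types import Task

task = Task(
    id="gsm8k-0001",                                       # hypothetical task id
    instruction="What is 12 * 7? Answer with a number.",   # pre-rendered prompt text
    metadata={"expected": "84"},                           # whatever the flow/verifier needs
)
```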
AgentConfig
The second argument:
Construct AsyncOpenAI(base_url=config.base_url, api_key="EMPTY") and call .chat.completions.create(model=config.model, …); that's the canonical wiring. Don't hard-code a base_url or model in the flow body.
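Putting the pieces together (only the AsyncOpenAI wiring is prescribed by the docs; the rest of the body is illustrative, with the return value coerced per the table above):

```python
import rllm
from openai import AsyncOpenAI


@rllm.rollout
async def single_turn_flow(task, config):
    # Canonical wiring: at training time the trainer points base_url at the
    # model gateway, so never hard-code the endpoint or model name here.
    client = AsyncOpenAI(base_url=config.base_url, api_key="EMPTY")
    resp = await client.chat.completions.create(
        model=config.model,
        messages=[{"role": "user", "content": task.instruction}],
    )
    return resp.choices[0].message.content  # raw str, coerced into an Episode
```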
Evaluator protocol
An Evaluator receives the Episode produced by an AgentFlow. Set traj.reward on each trajectory if you need per-trajectory rewards (e.g. solver vs judge in cookbooks/solver_judge_flow); set EvalOutput.reward for the episode-level scalar that rllm eval aggregates and rllm train feeds into advantage computation.
@rllm.evaluator decorator
Like @rllm.rollout, it supports bare and parameterized forms (@rllm.evaluator(register="my_eval")).
EvalOutput
signals is the right place for per-axis metrics that aggregate across the eval — accuracy, table-access rate, judge-correctness, etc. rllm eval reports the mean of each signal across the dataset.
Return-value coercion
The decorator accepts EvalOutput, a plain float (treated as the reward), or a (reward: float, is_correct: bool) tuple. Returning an explicit EvalOutput keeps the signal/metadata channels available.
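A sketch returning an explicit EvalOutput (the (task, episode) signature, async form, and constructor keywords are assumptions based on the field descriptions above):

```python
import rllm
from rllm.types import EvalOutput


@rllm.evaluator
async def exact_match(task, episode):
    answer = (episode.artifacts.get("answer") or "").strip()
    correct = answer == task.metadata.get("expected")  # hypothetical metadata key
    return EvalOutput(
        reward=1.0 if correct else 0.0,
        is_correct=correct,
        signals={"accuracy": float(correct)},  # rllm eval reports the mean per signal
    )
```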
run_agent_flow helper
For ad-hoc use outside the trainer / runner:
It awaits arun when present and falls back to run in a thread executor, so sync flows don't block the event loop.
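Ad-hoc usage might look like this (the import path and argument order are assumptions):

```python
import asyncio

from rllm import run_agent_flow  # import path assumed

# Reusing single_turn_flow, task, and config from the sketches above:
async def main():
    episode = await run_agent_flow(single_turn_flow, task, config)
    print(episode.artifacts.get("answer"))


asyncio.run(main())
```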
Data types
The shapes the protocols return and consume. All live in rllm.types and are re-exported from rllm.agents for backward compatibility.
Action
Wraps an arbitrary action emitted by an agent.
- The action content (string, dict, or any type).
Step
A single LLM interaction. The first group of fields is what every flow populates; the second group is filled in transparently by the gateway during training.
- Auto-generated UUID.
- Optional structured input (rendered prompt, tool args, …).
- Optional structured output (parsed answer, return value, …).
- The action taken at this step (parsed answer, tool call, …).
- Per-step reward (set by the evaluator if you score per-step).
- Whether the episode ended at this step.
- Arbitrary per-step metadata (also accessible as step.info).
- The chat history at this step in OpenAI message format.
- The raw assistant content from this step’s LLM call.
- Reasoning text (e.g. <think>…</think> content extracted from the response).
- Prompt token IDs.
- Response token IDs.
- Per-token logprobs.
- The full structured output from the rollout engine.
- Per-token or scalar advantage, populated by the trainer.
- Model-weight version at generation time (used for async-staleness tracking).
Trajectory
A sequence of Steps with a name. The name is what the trainer uses to group trajectories across rollouts when computing advantages — see cookbooks/solver_judge_flow/ for an example with two named groups (solver / judge).
- Auto-generated UUID.
- Trajectory role name. Used for advantage grouping. Default: "default_traj_name".
- Ordered list of steps in this trajectory.
- Trajectory-level reward (set by the evaluator for per-trajectory scoring).
- Optional final answer / return value.
- Arbitrary per-trajectory metadata (also accessible as traj.info).
- is_cumulative(): returns True if every step’s chat_completions is a strict superset of the previous step’s — useful for trainers that need to know whether the trajectory shares a single growing context vs. independent turns.
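For illustration (the Step/Trajectory keyword names are assumptions; chat_completions is the field named above, and here each step's history strictly extends the previous one):

```python
from rllm.types import Step, Trajectory

base = [{"role": "user", "content": "hi"}]
traj = Trajectory(steps=[
    Step(chat_completions=base),
    Step(chat_completions=base + [{"role": "assistant", "content": "hello"}]),
])
assert traj.is_cumulative()  # each step strictly extends the previous history
```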
Episode
The top-level return shape of an AgentFlow. Bundles all trajectories from one rollout plus any artifacts the evaluator will read.
- Auto-generated UUID. The runner overrides this to f"{task.id}:{rollout_idx}".
- Task data (often task.id or the metadata dict, depending on the flow).
- All trajectories produced during this rollout.
- Free-form output bag the evaluator reads. Convention: store the agent’s final answer at artifacts["answer"].
- Whether this episode counts as a correct solve. The evaluator typically writes this.
- Why the episode ended (set by the trainer / runner, not usually by the flow).
- Optional per-episode metrics that the trainer logs.
- Arbitrary metadata.
TrajectoryGroup
The trainer reorganizes per-rollout Episode objects into per-task TrajectoryGroups for advantage computation — all solver trajectories for one task into one group, all judge trajectories into another, and so on. Most users don’t construct these directly; the trainer does.
- All trajectories in this group (typically same name, same task).
- Identifier in the form {task_id}:{role} (e.g. "task1:solver").
- Per-trajectory metadata aligned with trajectories.

Episode artifacts convention
The convention across all rLLM cookbooks: the flow stores its final user-facing answer in episode.artifacts["answer"], and the evaluator reads it from there. This keeps reward computation outside the flow (so the same flow is reusable with different graders) and gives rllm.eval.reward_fns._helpers.extract_answer_text a single place to look.
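The flow-side half of the convention, as a sketch (the Episode/Trajectory constructor keywords are assumptions; the evaluator-side read is shown in the exact_match sketch above):

```python
import rllm
from rllm.types import Episode, Trajectory


@rllm.rollout
async def flow(task, config):
    answer = "final user-facing answer"       # produced by the agent loop
    return Episode(
        trajectories=[Trajectory(name="solver")],
        artifacts={"answer": answer},         # graders read it from here
    )
```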
See also
Cookbooks
Seven worked AgentFlow examples
AgentFlow & Evaluator
Conceptual walkthrough of the protocol
Workflows
The legacy Workflow path (uses BaseAgent + BaseEnv)
Trainer
Wire an AgentFlow + Evaluator into RL training

