rLLM represents all agent interactions through three nested data structures: Episode, Trajectory, and Step. These types are the common currency across evaluation, training, and the SDK.
## The hierarchy
### Step
A Step records a single interaction turn: the input sent to the model, the output it produced, and the reward assigned by the environment or evaluator. The version in `rllm.agents.agent` adds the token-level fields needed for RL:
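A minimal sketch, not the actual rLLM definition: the token-level field names follow the training section below (prompt IDs, response IDs, logprobs), while the turn-level names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str = ""    # input sent to the model (name assumed)
    action: str = ""         # output the model produced (name assumed)
    reward: float = 0.0      # reward from the environment or evaluator
    # Token-level fields needed for RL training:
    prompt_ids: list[int] = field(default_factory=list)
    response_ids: list[int] = field(default_factory=list)
    logprobs: list[float] = field(default_factory=list)
```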
### Trajectory
A Trajectory groups the steps produced by a single agent across an episode. In a single-agent setup there is one Trajectory per Episode; in multi-agent workflows there may be several. The `name` field identifies which agent produced the trajectory. During training, trajectories are grouped by `{task_id}:{trajectory.name}` for advantage computation: GRPO compares trajectories within the same group to determine which rollouts were better than average.
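Continuing the sketch above; the scalar `reward` field is an assumption (rLLM may instead derive a trajectory's reward from its steps):

```python
@dataclass
class Trajectory:
    name: str = "agent"      # which agent produced this trajectory
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0      # assumed scalar used for grouping/advantages
```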
### Episode
An Episode is the top-level container for a complete rollout on a single task. Its ID encodes both the task and the rollout index; the `task_id` and `rollout_idx` properties parse the ID string:
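A sketch of how those properties might work, assuming a `"{task_id}:{rollout_idx}"` ID format (the exact format is not shown on this page):

```python
@dataclass
class Episode:
    id: str = ""             # assumed format: "{task_id}:{rollout_idx}"
    trajectories: list[Trajectory] = field(default_factory=list)

    @property
    def task_id(self) -> str:
        return self.id.rsplit(":", 1)[0]

    @property
    def rollout_idx(self) -> int:
        return int(self.id.rsplit(":", 1)[1])
```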
## How they connect
The engine drives each rollout through a loop (sketched after this list):

- Environment resets: The engine creates a fresh Episode and initializes the agent and environment for a new task.
- Agent-environment loop: Each turn through the loop produces a Step: the agent receives an observation, calls the LLM, and returns an action. The environment evaluates the action and assigns a reward.
- Trajectory accumulates: Steps are appended to the agent's Trajectory until the environment signals `done=True` or the step limit is reached.
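A minimal sketch of this loop using the types above; `env` and `agent` are hypothetical objects with `reset`/`step`/`act` methods, not rLLM's actual engine API:

```python
def rollout(agent, env, task_id: str, max_steps: int = 10) -> Episode:
    # Environment resets: fresh Episode, agent/env initialized for a new task.
    episode = Episode(id=f"{task_id}:0", trajectories=[])
    trajectory = Trajectory(name="agent")
    episode.trajectories.append(trajectory)
    observation = env.reset(task_id)

    for _ in range(max_steps):                 # step limit
        action = agent.act(observation)        # agent calls the LLM
        next_observation, reward, done = env.step(action)
        trajectory.steps.append(
            Step(observation=observation, action=action, reward=reward)
        )
        observation = next_observation
        if done:                               # environment signals done=True
            break
    return episode
```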
## Single-agent vs multi-agent

In a single-agent task, an Episode contains exactly one Trajectory; in a multi-agent workflow there is one Trajectory per agent name.
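An illustration using the sketch types above (task IDs, agent names, and rewards are invented):

```python
# Single-agent: exactly one trajectory per episode.
single = Episode(
    id="math-001:0",
    trajectories=[
        Trajectory(name="solver", steps=[Step(observation="2+2?", action="4", reward=1.0)]),
    ],
)

# Multi-agent: several trajectories, one per agent name.
multi = Episode(
    id="math-001:0",
    trajectories=[
        Trajectory(name="planner", steps=[Step(observation="2+2?", action="plan: add", reward=0.5)]),
        Trajectory(name="executor", steps=[Step(observation="plan: add", action="4", reward=1.0)]),
    ],
)
```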
## Role in the training pipeline

These data structures flow through the training pipeline in order:

- Generate: The execution engine runs agents and produces Episodes with Trajectories and Steps
- Transform: Trajectories are grouped by `{task_id}:{trajectory.name}` into `TrajectoryGroup`s
- Advantage: GRPO/REINFORCE computes advantages by comparing rewards within each group (see the sketch after this list)
- Update: The trainer uses the token-level data (prompt IDs, response IDs, logprobs, advantages) from each Step to compute policy gradients
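A sketch of the Transform and Advantage stages using the types above and a common group-relative formulation (normalize each reward against its group's mean and standard deviation); rLLM's actual implementation may differ:

```python
from collections import defaultdict
import statistics

def grpo_advantages(episodes: list[Episode]) -> dict[str, list[float]]:
    # Transform: group trajectory rewards by "{task_id}:{trajectory.name}".
    groups: dict[str, list[float]] = defaultdict(list)
    for ep in episodes:
        for traj in ep.trajectories:
            groups[f"{ep.task_id}:{traj.name}"].append(traj.reward)

    # Advantage: compare each reward to its group's mean (GRPO-style).
    advantages: dict[str, list[float]] = {}
    for key, rewards in groups.items():
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero std
        advantages[key] = [(r - mean) / std for r in rewards]
    return advantages
```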
The canonical type definitions live in `rllm/types.py`. The training-extended versions with token-level fields are in `rllm/agents/agent.py`.
