BaseAgent
Abstract base class for all agents in the rLLM framework.Properties
Converts agent’s internal state into a list of OpenAI-style chat completions.
Converts agent’s internal state into a Trajectory object containing all steps.
Methods
update_from_env
Updates the agent’s internal state after an environment step.The observation after stepping through environment.
The reward received after taking the action.
Whether the episode has ended due to termination.
Additional metadata from the environment.
update_from_model
Updates the agent’s internal state after the model generates a response.The response from the model.
The action to execute in the environment.
reset
Resets the agent’s internal state, typically called at the beginning of a new episode.get_current_state
Returns the agent’s current state as a Step object.The agent’s current state, or None if no steps have been taken.
Action
Represents an action taken by an agent.The action content (can be string, dict, or any type).
Step
Represents a single step in an agent’s trajectory, containing the prompt, response, and reward information.Fields
Token IDs for the prompt. May contain special blocks for multimodal inputs.
Token IDs for the model’s response.
Log probabilities for each response token.
List of chat messages in OpenAI format.
The observation received from the environment.
The reasoning or thought process (if available from model).
The action taken at this step.
The raw model response text.
Complete model output including token IDs and metadata.
Additional metadata for this step.
Reward received at this step.
Whether the episode ended at this step.
Monte Carlo return computed for this step.
Advantage value(s) for policy gradient training.
Methods
to_dict
Serialize the step to a dictionary.from_dict
Create a Step from a dictionary.from_model_output
Create a Step from a ModelOutput object.Trajectory
Represents a sequence of steps taken by an agent during an episode.Fields
Unique identifier (auto-generated UUID).
Name of the trajectory (e.g., “solver”, “verifier”).
The task associated with this trajectory.
List of steps in the trajectory.
Trajectory-level reward (computed from steps).
Additional metadata.
Methods
to_dict / from_dict
Serialize and deserialize trajectories.is_cumulative
Check if each step’s chat_completions is a superset of the previous step.Episode
Represents a complete episode containing one or more trajectories.Fields
Episode identifier in format
task_id:rollout_idx.The task data for this episode.
Reason the episode ended (ENV_DONE, MAX_TURNS_EXCEEDED, etc.).
Whether the episode resulted in a correct solution.
All trajectories generated during the episode.
Computed metrics (accuracy, etc.).
Additional metadata including error details if any.
Properties
Extracted task ID from the episode ID.
Extracted rollout index from the episode ID.
Methods
Example Usage
MathAgent
A specialized agent for solving mathematical problems step by step.rllm/agents/math_agent.py
Constructor
Whether to accumulate thinking tokens in conversation history. If False, removes
<think>...</think> blocks from previous messages (except the last one).Properties
Conversation history for model interaction. If
accumulate_thinking is False, thinking is stripped from assistant messages.Methods
update_from_env
Process environment feedback (questions or rewards).Observation from environment. Can be:
dictwith “question” key: New problem to solveNoneor{}: Reward update for current stepstr: Question text
update_from_model
Process model response and extract thought/action.<think> and </think>) and final answer.
Example
ToolAgent
An agent that can use tools to interact with environments, supporting function calling.rllm/agents/tool_agent.py
Constructor
System prompt for the agent. Defaults to
TOOL_SYSTEM_PROMPT.Name of the parser to use for tool calls. Options: “qwen”, “r1”, etc.
List of tool names to load from the registry (e.g.,
["python", "google_search"]). Mutually exclusive with tool_map.Dictionary mapping tool names to Tool classes for custom tools. Mutually exclusive with
tools.Properties
MultiTool instance managing available tools.
Parser for extracting tool calls from model responses.
Conversation history including tool calls and responses.
Methods
update_from_env
Process environment feedback and format tool outputs.Environment observation. Can be:
dictwith “question”: Initial taskdictwith “tool_outputs”: Results from tool executionstr: Text observation

