The agent module provides the foundational classes for building agents that interact with environments through actions, observations, and rewards.

BaseAgent

Abstract base class for all agents in the rLLM framework.
from rllm.agents import BaseAgent

Properties

chat_completions
list[dict[str, str]]
Converts the agent’s internal state into a list of OpenAI-style chat messages.
trajectory
Trajectory
Converts the agent’s internal state into a Trajectory object containing all steps.

Methods

update_from_env

Updates the agent’s internal state after an environment step.
def update_from_env(
    observation: Any,
    reward: float,
    done: bool,
    info: dict,
    **kwargs
) -> None
observation
Any
The observation after stepping through environment.
reward
float
The reward received after taking the action.
done
bool
Whether the episode has terminated.
info
dict
Additional metadata from the environment.

update_from_model

Updates the agent’s internal state after the model generates a response.
def update_from_model(response: str, **kwargs) -> Action
response
str
The response from the model.
Returns
Action
The action to execute in the environment.

reset

Resets the agent’s internal state, typically called at the beginning of a new episode.
def reset() -> None

get_current_state

Returns the agent’s current state as a Step object.
def get_current_state() -> Step | None
Returns
Step | None
The agent’s current state, or None if no steps have been taken.
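Taken together, these hooks suggest a rollout loop of the following shape. This is an illustrative sketch, not rLLM's actual execution engine; `env` and `model` are hypothetical stand-ins with assumed `reset`/`step` and `generate` interfaces.

```python
# Illustrative sketch of an agent-environment rollout loop built on the
# BaseAgent hooks. `env` and `model` are hypothetical stand-ins: we assume
# env.reset() -> (observation, info), env.step(action) -> (observation,
# reward, done, info), and model.generate(messages) -> str.
def rollout(agent, env, model, max_turns=10):
    agent.reset()
    observation, info = env.reset()
    agent.update_from_env(observation, reward=0.0, done=False, info=info)
    for _ in range(max_turns):
        response = model.generate(agent.chat_completions)
        action = agent.update_from_model(response)
        observation, reward, done, info = env.step(action.action)
        agent.update_from_env(observation, reward, done, info)
        if done:
            break
    return agent.trajectory
```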

Action

Represents an action taken by an agent.
from rllm.agents import Action

action = Action(action="move_forward")
action
Any
The action content (can be string, dict, or any type).

Step

Represents a single step in an agent’s trajectory, containing the prompt, response, and reward information.
from rllm.agents import Step

Fields

prompt_ids
list[int] | list[Any]
Token IDs for the prompt. May contain special blocks for multimodal inputs.
response_ids
list[int]
Token IDs for the model’s response.
logprobs
list[float]
Log probabilities for each response token.
chat_completions
list[dict[str, str]]
List of chat messages in OpenAI format.
observation
Any
The observation received from the environment.
thought
str
The reasoning or thought process (if available from model).
action
Any
The action taken at this step.
model_response
str
The raw model response text.
model_output
ModelOutput | None
Complete model output including token IDs and metadata.
info
dict
Additional metadata for this step.
reward
float
default: 0.0
Reward received at this step.
done
bool
default: False
Whether the episode ended at this step.
mc_return
float
default: 0.0
Monte Carlo return computed for this step.
advantage
list[float] | float | None
Advantage value(s) for policy gradient training.

Methods

to_dict

Serialize the step to a dictionary.
step_dict = step.to_dict()

from_dict

Create a Step from a dictionary.
step = Step.from_dict(data)

from_model_output

Create a Step from a ModelOutput object.
step = Step.from_model_output(
    model_output=output,
    messages=messages,
    action=action
)

Trajectory

Represents a sequence of steps taken by an agent during an episode.
from rllm.agents import Trajectory

trajectory = Trajectory(
    name="solver",
    task={"question": "What is 2+2?"},
    steps=[step1, step2]
)

Fields

uid
str
Unique identifier (auto-generated UUID).
name
str
default: "default_traj_name"
Name of the trajectory (e.g., “solver”, “verifier”).
task
Any
The task associated with this trajectory.
steps
list[Step]
List of steps in the trajectory.
reward
float | None
Trajectory-level reward (computed from steps).
info
dict
Additional metadata.

Methods

to_dict / from_dict

Serialize and deserialize trajectories.
traj_dict = trajectory.to_dict()
trajectory = Trajectory.from_dict(traj_dict)

is_cumulative

Check whether each step’s chat_completions is a superset of the previous step’s.
if trajectory.is_cumulative():
    print("Trajectory uses cumulative context")
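A hedged sketch of the check, assuming "superset" means each step's message list extends the previous one as a prefix; the function name and exact comparison are assumptions, not rLLM's implementation.

```python
# Hedged sketch: a trajectory is "cumulative" if each step's chat history
# extends the previous step's as a prefix (assumed interpretation).
def is_prefix_cumulative(step_messages):
    for prev, curr in zip(step_messages, step_messages[1:]):
        if curr[:len(prev)] != prev:
            return False
    return True
```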

Episode

Represents a complete episode containing one or more trajectories.
from rllm.agents import Episode

episode = Episode(
    id="task_123:0",
    task=task_data,
    trajectories=[trajectory1, trajectory2]
)

Fields

id
str
Episode identifier in format task_id:rollout_idx.
task
Any
The task data for this episode.
termination_reason
TerminationReason | None
Reason the episode ended (ENV_DONE, MAX_TURNS_EXCEEDED, etc.).
is_correct
bool
default: False
Whether the episode resulted in a correct solution.
trajectories
list[Trajectory]
All trajectories generated during the episode.
metrics
dict
Computed metrics (accuracy, etc.).
info
dict
Additional metadata including error details if any.

Properties

task_id
str
Extracted task ID from the episode ID.
rollout_idx
str
Extracted rollout index from the episode ID.
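These properties follow from the task_id:rollout_idx id format. A plain-Python sketch of the split; the actual parsing inside rLLM is an assumption here.

```python
# Hedged sketch: deriving task_id and rollout_idx from the
# "task_id:rollout_idx" episode id. rsplit on the last colon keeps task
# ids that themselves contain colons intact (assumed behavior).
episode_id = "task_123:0"
task_id, rollout_idx = episode_id.rsplit(":", 1)
# task_id == "task_123", rollout_idx == "0"
```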

Methods

to_dict / from_dict

Serialize and deserialize episodes.
# Serialize episode
episode_dict = episode.to_dict()

# Deserialize episode
episode = Episode.from_dict(episode_dict)

Example Usage

from rllm.agents import BaseAgent, Step, Trajectory, Episode, Action

class MyAgent(BaseAgent):
    def __init__(self):
        self._trajectory = Trajectory(name="my_agent")
    
    def reset(self):
        self._trajectory = Trajectory(name="my_agent")
    
    @property
    def trajectory(self) -> Trajectory:
        return self._trajectory

    @property
    def chat_completions(self) -> list[dict[str, str]]:
        return self._trajectory.steps[-1].chat_completions if self._trajectory.steps else []

    def update_from_model(self, response: str, **kwargs) -> Action:
        action = Action(action=response)
        # Extend the previous step's chat history (empty on the first step).
        prior = self._trajectory.steps[-1].chat_completions if self._trajectory.steps else []
        step = Step(
            model_response=response,
            action=action,
            chat_completions=prior + [
                {"role": "assistant", "content": response}
            ]
        )
        self._trajectory.steps.append(step)
        return action
    
    def update_from_env(self, observation, reward, done, info, **kwargs):
        if self._trajectory.steps:
            self._trajectory.steps[-1].reward = reward
            self._trajectory.steps[-1].done = done

MathAgent

A specialized agent for solving mathematical problems step by step.
from rllm.agents import MathAgent
Source: rllm/agents/math_agent.py

Constructor

def __init__(accumulate_thinking: bool = True)
accumulate_thinking
bool
default: True
Whether to accumulate thinking tokens in conversation history. If False, removes <think>...</think> blocks from previous messages (except the last one).
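To illustrate the accumulate_thinking=False behavior, here is a hedged sketch of stripping <think>...</think> blocks from every assistant message except the last; the helper name and exact rules are assumptions, not rLLM's code.

```python
import re

# Hedged sketch of accumulate_thinking=False: remove <think>...</think>
# blocks from all assistant messages except the most recent one.
def strip_thinking(messages):
    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    last = assistant_idxs[-1] if assistant_idxs else None
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != last:
            content = re.sub(r"<think>.*?</think>", "", m["content"], flags=re.DOTALL).strip()
            out.append({**m, "content": content})
        else:
            out.append(m)
    return out
```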

Properties

chat_completions
list[dict[str, str]]
Conversation history for model interaction. If accumulate_thinking is False, thinking is stripped from assistant messages.

Methods

update_from_env

Process environment feedback (questions or rewards).
agent.update_from_env(observation, reward, done, info)
observation
Any
Observation from environment. Can be:
  • dict with “question” key: New problem to solve
  • None or {}: Reward update for current step
  • str: Question text

update_from_model

Process model response and extract thought/action.
action = agent.update_from_model(response)
Parses response to extract thinking (between <think> and </think>) and final answer.
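A minimal sketch of that extraction, assuming the thought sits between <think> tags and the remainder of the response is the answer; the real parser in rllm/agents/math_agent.py may handle edge cases differently.

```python
import re

# Hedged sketch of the thought/answer split: the thought is the text
# between <think> and </think>, the answer is whatever remains.
def parse_response(response):
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    thought = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return thought, answer

parse_response("<think>2+2 equals 4</think>4")  # ("2+2 equals 4", "4")
```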

Example

agent = MathAgent(accumulate_thinking=True)

# Reset for new episode
agent.reset()

# Receive problem from environment
agent.update_from_env(
    observation={"question": "What is 2+2?"}, 
    reward=0, 
    done=False, 
    info={}
)

# Get response from model
response = "<think>2+2 equals 4</think>4"
action = agent.update_from_model(response)

# Get final reward
agent.update_from_env(
    observation=None,
    reward=1.0,
    done=True,
    info={}
)

ToolAgent

An agent that can use tools to interact with environments, supporting function calling.
from rllm.agents import ToolAgent
Source: rllm/agents/tool_agent.py

Constructor

def __init__(
    system_prompt: str = TOOL_SYSTEM_PROMPT,
    parser_name: str = "qwen",
    tools: list[str] | None = None,
    tool_map: dict[str, type[Tool]] | None = None
)
system_prompt
str
System prompt for the agent. Defaults to TOOL_SYSTEM_PROMPT.
parser_name
str
default: "qwen"
Name of the parser to use for tool calls. Options: “qwen”, “r1”, etc.
tools
list[str] | None
List of tool names to load from the registry (e.g., ["python", "google_search"]). Mutually exclusive with tool_map.
tool_map
dict[str, type[Tool]] | None
Dictionary mapping tool names to Tool classes for custom tools. Mutually exclusive with tools.

Properties

tools
MultiTool
MultiTool instance managing available tools.
tool_parser
ToolParser
Parser for extracting tool calls from model responses.
chat_completions
list[dict[str, Any]]
Conversation history including tool calls and responses.

Methods

update_from_env

Process environment feedback and format tool outputs.
agent.update_from_env(observation, reward, done, info)
observation
Any
Environment observation. Can be:
  • dict with “question”: Initial task
  • dict with “tool_outputs”: Results from tool execution
  • str: Text observation

update_from_model

Extract and format tool calls from model response.
action = agent.update_from_model(response)
Parses the response to extract tool calls, which are returned as the action.

Example

from rllm.agents import ToolAgent
from rllm.tools import PythonInterpreter

# Create agent with Python tool
agent = ToolAgent(
    parser_name="qwen",
    tools=["python"]  # Load from registry
)

# Or with custom tool map
agent = ToolAgent(
    parser_name="qwen",
    tool_map={"python": PythonInterpreter}
)

# Reset for new task
agent.reset()

# Receive task
agent.update_from_env(
    observation={"question": "What is 10 factorial?"},
    reward=0,
    done=False,
    info={}
)

# Model generates tool call
response = '''<|tool_calls_begin|>
<|tool_call_begin|>function<|tool_sep|>python
{"code": "import math; print(math.factorial(10))"}
<|tool_call_end|>
<|tool_calls_end|>'''

action = agent.update_from_model(response)  # Returns tool calls

# Receive tool outputs from environment
agent.update_from_env(
    observation={
        "tool_outputs": {
            "call_123": "3628800"
        }
    },
    reward=0,
    done=False,
    info={}
)