The agent module provides the foundational classes for building agents that interact with environments through actions, observations, and rewards.

BaseAgent

Abstract base class for all agents in the rLLM framework.
from rllm.agents import BaseAgent

Properties

chat_completions
list[dict[str, str]]
Converts the agent’s internal state into a list of OpenAI-style chat messages.
trajectory
Trajectory
Converts the agent’s internal state into a Trajectory object containing all steps.

Methods

update_from_env

Updates the agent’s internal state after an environment step.
def update_from_env(
    observation: Any,
    reward: float,
    done: bool,
    info: dict,
    **kwargs
) -> None
observation
Any
The observation after stepping through environment.
reward
float
The reward received after taking the action.
done
bool
Whether the episode has terminated.
info
dict
Additional metadata from the environment.

update_from_model

Updates the agent’s internal state after the model generates a response.
def update_from_model(response: str, **kwargs) -> Action
response
str
The response from the model.
Returns
Action
The action to execute in the environment.

reset

Resets the agent’s internal state, typically called at the beginning of a new episode.
def reset() -> None

get_current_state

Returns the agent’s current state as a Step object.
def get_current_state() -> Step | None
Returns
Step | None
The agent’s current state, or None if no steps have been taken.
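Taken together, these hooks suggest a rollout loop of the following shape. This is an illustrative sketch, not rLLM's actual execution engine; `env` and `model` are hypothetical stand-ins with assumed `reset`/`step` and `generate` interfaces.

```python
# Illustrative sketch of an agent-environment rollout loop built on the
# BaseAgent hooks. `env` and `model` are hypothetical stand-ins: we assume
# env.reset() -> (observation, info), env.step(action) -> (observation,
# reward, done, info), and model.generate(messages) -> str.
def rollout(agent, env, model, max_turns=10):
    agent.reset()
    observation, info = env.reset()
    agent.update_from_env(observation, reward=0.0, done=False, info=info)
    for _ in range(max_turns):
        response = model.generate(agent.chat_completions)
        action = agent.update_from_model(response)
        observation, reward, done, info = env.step(action.action)
        agent.update_from_env(observation, reward, done, info)
        if done:
            break
    return agent.trajectory
```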

Action

Represents an action taken by an agent.
from rllm.agents import Action

action = Action(action="move_forward")
action
Any
The action content (can be string, dict, or any type).

Step

Represents a single step in an agent’s trajectory, containing the prompt, response, and reward information.
from rllm.agents import Step

Fields

prompt_ids
list[int] | list[Any]
Token IDs for the prompt. May contain special blocks for multimodal inputs.
response_ids
list[int]
Token IDs for the model’s response.
logprobs
list[float]
Log probabilities for each response token.
chat_completions
list[dict[str, str]]
List of chat messages in OpenAI format.
observation
Any
The observation received from the environment.
thought
str
The reasoning or thought process (if available from model).
action
Any
The action taken at this step.
model_response
str
The raw model response text.
model_output
ModelOutput | None
Complete model output including token IDs and metadata.
info
dict
Additional metadata for this step.
reward
float
default: 0.0
Reward received at this step.
done
bool
default: False
Whether the episode ended at this step.
mc_return
float
default: 0.0
Monte Carlo return computed for this step.
advantage
list[float] | float | None
Advantage value(s) for policy gradient training.

Methods

to_dict

Serialize the step to a dictionary.
step_dict = step.to_dict()

from_dict

Create a Step from a dictionary.
step = Step.from_dict(data)

from_model_output

Create a Step from a ModelOutput object.
step = Step.from_model_output(
    model_output=output,
    messages=messages,
    action=action
)

Trajectory

Represents a sequence of steps taken by an agent during an episode.
from rllm.agents import Trajectory

trajectory = Trajectory(
    name="solver",
    task={"question": "What is 2+2?"},
    steps=[step1, step2]
)

Fields

uid
str
Unique identifier (auto-generated UUID).
name
str
default: "default_traj_name"
Name of the trajectory (e.g., “solver”, “verifier”).
task
Any
The task associated with this trajectory.
steps
list[Step]
List of steps in the trajectory.
reward
float | None
Trajectory-level reward (computed from steps).
info
dict
Additional metadata.

Methods

to_dict / from_dict

Serialize and deserialize trajectories.
traj_dict = trajectory.to_dict()
trajectory = Trajectory.from_dict(traj_dict)

is_cumulative

Check whether each step’s chat_completions is a superset of the previous step’s.
if trajectory.is_cumulative():
    print("Trajectory uses cumulative context")
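A hedged sketch of the check, assuming "superset" means each step's message list extends the previous one as a prefix; the function name and exact comparison are assumptions, not rLLM's implementation.

```python
# Hedged sketch: a trajectory is "cumulative" if each step's chat history
# extends the previous step's as a prefix (assumed interpretation).
def is_prefix_cumulative(step_messages):
    for prev, curr in zip(step_messages, step_messages[1:]):
        if curr[:len(prev)] != prev:
            return False
    return True
```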

Episode

Represents a complete episode containing one or more trajectories.
from rllm.agents import Episode

episode = Episode(
    id="task_123:0",
    task=task_data,
    trajectories=[trajectory1, trajectory2]
)

Fields

id
str
Episode identifier in format task_id:rollout_idx.
task
Any
The task data for this episode.
termination_reason
TerminationReason | None
Reason the episode ended (ENV_DONE, MAX_TURNS_EXCEEDED, etc.).
is_correct
bool
default: False
Whether the episode resulted in a correct solution.
trajectories
list[Trajectory]
All trajectories generated during the episode.
metrics
dict
Computed metrics (accuracy, etc.).
info
dict
Additional metadata including error details if any.

Properties

task_id
str
Extracted task ID from the episode ID.
rollout_idx
str
Extracted rollout index from the episode ID.
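These properties follow from the task_id:rollout_idx id format. A plain-Python sketch of the split; the actual parsing inside rLLM is an assumption here.

```python
# Hedged sketch: deriving task_id and rollout_idx from the
# "task_id:rollout_idx" episode id. rsplit on the last colon keeps task
# ids that themselves contain colons intact (assumed behavior).
episode_id = "task_123:0"
task_id, rollout_idx = episode_id.rsplit(":", 1)
# task_id == "task_123", rollout_idx == "0"
```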

Methods

to_dict / from_dict

Serialize and deserialize episodes.
# Serialize episode
episode_dict = episode.to_dict()

# Deserialize episode
episode = Episode.from_dict(episode_dict)

Example Usage

from rllm.agents import BaseAgent, Step, Trajectory, Episode, Action

class MyAgent(BaseAgent):
    def __init__(self):
        self._trajectory = Trajectory(name="my_agent")
    
    def reset(self):
        self._trajectory = Trajectory(name="my_agent")
    
    @property
    def trajectory(self) -> Trajectory:
        return self._trajectory

    @property
    def chat_completions(self) -> list[dict[str, str]]:
        return self._trajectory.steps[-1].chat_completions if self._trajectory.steps else []

    def update_from_model(self, response: str, **kwargs) -> Action:
        action = Action(action=response)
        # Extend the previous step's chat history (empty on the first step).
        prior = self._trajectory.steps[-1].chat_completions if self._trajectory.steps else []
        step = Step(
            model_response=response,
            action=action,
            chat_completions=prior + [
                {"role": "assistant", "content": response}
            ]
        )
        self._trajectory.steps.append(step)
        return action
    
    def update_from_env(self, observation, reward, done, info, **kwargs):
        if self._trajectory.steps:
            self._trajectory.steps[-1].reward = reward
            self._trajectory.steps[-1].done = done

MathAgent

A specialized agent for solving mathematical problems step by step.
from rllm.agents import MathAgent
Source: rllm/agents/math_agent.py

Constructor

def __init__(accumulate_thinking: bool = True)
accumulate_thinking
bool
default: True
Whether to accumulate thinking tokens in conversation history. If False, removes <think>...</think> blocks from previous messages (except the last one).
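To illustrate the accumulate_thinking=False behavior, here is a hedged sketch of stripping <think>...</think> blocks from every assistant message except the last; the helper name and exact rules are assumptions, not rLLM's code.

```python
import re

# Hedged sketch of accumulate_thinking=False: remove <think>...</think>
# blocks from all assistant messages except the most recent one.
def strip_thinking(messages):
    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    last = assistant_idxs[-1] if assistant_idxs else None
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i != last:
            content = re.sub(r"<think>.*?</think>", "", m["content"], flags=re.DOTALL).strip()
            out.append({**m, "content": content})
        else:
            out.append(m)
    return out
```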

Properties

chat_completions
list[dict[str, str]]
Conversation history for model interaction. If accumulate_thinking is False, thinking is stripped from assistant messages.

Methods

update_from_env

Process environment feedback (questions or rewards).
agent.update_from_env(observation, reward, done, info)
observation
Any
Observation from environment. Can be:
  • dict with “question” key: New problem to solve
  • None or {}: Reward update for current step
  • str: Question text

update_from_model

Process model response and extract thought/action.
action = agent.update_from_model(response)
Parses response to extract thinking (between <think> and </think>) and final answer.
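A minimal sketch of that extraction, assuming the thought sits between <think> tags and the remainder of the response is the answer; the real parser in rllm/agents/math_agent.py may handle edge cases differently.

```python
import re

# Hedged sketch of the thought/answer split: the thought is the text
# between <think> and </think>, the answer is whatever remains.
def parse_response(response):
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    thought = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return thought, answer

parse_response("<think>2+2 equals 4</think>4")  # ("2+2 equals 4", "4")
```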

Example

agent = MathAgent(accumulate_thinking=True)

# Reset for new episode
agent.reset()

# Receive problem from environment
agent.update_from_env(
    observation={"question": "What is 2+2?"}, 
    reward=0, 
    done=False, 
    info={}
)

# Get response from model
response = "<think>2+2 equals 4</think>4"
action = agent.update_from_model(response)

# Get final reward
agent.update_from_env(
    observation=None,
    reward=1.0,
    done=True,
    info={}
)

ToolAgent

An agent that can use tools to interact with environments, supporting function calling.
from rllm.agents import ToolAgent
Source: rllm/agents/tool_agent.py

Constructor

def __init__(
    system_prompt: str = TOOL_SYSTEM_PROMPT,
    parser_name: str = "qwen",
    tools: list[str] | None = None,
    tool_map: dict[str, type[Tool]] | None = None
)
system_prompt
str
System prompt for the agent. Defaults to TOOL_SYSTEM_PROMPT.
parser_name
str
default: "qwen"
Name of the parser to use for tool calls. Options: “qwen”, “r1”, etc.
tools
list[str] | None
List of tool names to load from the registry (e.g., ["python", "google_search"]). Mutually exclusive with tool_map.
tool_map
dict[str, type[Tool]] | None
Dictionary mapping tool names to Tool classes for custom tools. Mutually exclusive with tools.

Properties

tools
MultiTool
MultiTool instance managing available tools.
tool_parser
ToolParser
Parser for extracting tool calls from model responses.
chat_completions
list[dict[str, Any]]
Conversation history including tool calls and responses.

Methods

update_from_env

Process environment feedback and format tool outputs.
agent.update_from_env(observation, reward, done, info)
observation
Any
Environment observation. Can be:
  • dict with “question”: Initial task
  • dict with “tool_outputs”: Results from tool execution
  • str: Text observation

update_from_model

Extract and format tool calls from model response.
action = agent.update_from_model(response)
Parses the response to extract tool calls, which are returned as the action.

Example

from rllm.agents import ToolAgent
from rllm.tools import PythonInterpreter

# Create agent with Python tool
agent = ToolAgent(
    parser_name="qwen",
    tools=["python"]  # Load from registry
)

# Or with custom tool map
agent = ToolAgent(
    parser_name="qwen",
    tool_map={"python": PythonInterpreter}
)

# Reset for new task
agent.reset()

# Receive task
agent.update_from_env(
    observation={"question": "What is 10 factorial?"},
    reward=0,
    done=False,
    info={}
)

# Model generates tool call
response = '''<|tool_calls_begin|>
<|tool_call_begin|>function<|tool_sep|>python
{"code": "import math; print(math.factorial(10))"}
<|tool_call_end|>
<|tool_calls_end|>'''

action = agent.update_from_model(response)  # Returns tool calls

# Receive tool outputs from environment
agent.update_from_env(
    observation={
        "tool_outputs": {
            "call_123": "3628800"
        }
    },
    reward=0,
    done=False,
    info={}
)