BaseEnv
Abstract base class for all environments, following the Gym interface.
Properties
The index or identifier of the environment, often used within a batch.
Methods
reset
Resets the environment to an initial state.
Returns:
The initial observation.
Auxiliary information.
step
Executes one time step within the environment.
Parameters:
An action provided by the agent.
Returns:
The next observation.
The reward for this step.
Whether the episode has ended.
Additional information.
close
Performs any necessary cleanup.
from_dict
Static factory method to create an environment from a dictionary.
Parameters:
Dictionary containing environment initialization data.
is_multithread_safe
Check if the environment can be used safely across multiple threads.
Returns:
Whether the environment is thread-safe.
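As described above, every environment implements the same Gym-style contract. The sketch below is a self-contained illustration of that contract; the method names follow this page, but the exact signatures and class names are assumptions, not the rllm definitions.

```python
from abc import ABC, abstractmethod
from typing import Any

# Illustrative stand-in for the documented base-class contract.
class BaseEnvSketch(ABC):
    @abstractmethod
    def reset(self) -> tuple[Any, dict]:
        """Return (initial_observation, auxiliary_info)."""

    @abstractmethod
    def step(self, action: Any) -> tuple[Any, float, bool, dict]:
        """Return (next_observation, reward, done, info)."""

    def close(self) -> None:
        """Perform any necessary cleanup; default is a no-op."""

    @staticmethod
    def from_dict(env_args: dict) -> "BaseEnvSketch":
        """Factory: build an environment from an initialization dictionary."""
        raise NotImplementedError

    @staticmethod
    def is_multithread_safe() -> bool:
        """Assume not thread-safe unless a subclass opts in."""
        return False

# Toy concrete subclass: echoes the action back and terminates immediately.
class EchoEnv(BaseEnvSketch):
    def reset(self):
        return "start", {}

    def step(self, action):
        return action, 0.0, True, {}
```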
MultiTurnEnvironment
Environment for multi-turn interactions with language models.
Constructor
Parameters:
Dictionary containing task information. Should include relevant fields for your specific task.
Maximum number of turns before terminating the interaction.
Methods
reset
Reset the environment with a new task.
Parameters:
Optional task to set. If None, uses the current task.
step
Take a step in the environment.
Parameters:
Response string from the LLM or an action object.
get_reward_and_next_obs
Abstract method to compute the reward and next observation. Must be implemented by subclasses.
Parameters:
The task dictionary containing relevant information.
The action taken by the agent.
Returns:
The computed reward.
The next observation dictionary.
SingleTurnEnvironment
Simplified environment for single-turn interactions. This is a special case of MultiTurnEnvironment where max_turns=1.
Constructor
Parameters:
Dictionary containing the task information, including at least a “question” field.
Custom reward function to evaluate agent responses. If None, uses zero reward with a warning.
Methods
get_reward_and_next_obs
Compute the reward based on the task and action.
from_dict
Create an environment from a dictionary.
Example: Custom Multi-Turn Environment
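A minimal sketch of a custom multi-turn environment, written as a self-contained stand-in that mirrors the documented contract (reset returns (observation, info); step returns (observation, reward, done, info); subclasses implement get_reward_and_next_obs). In real code you would subclass rllm's MultiTurnEnvironment; the task fields and reward logic below are assumptions for illustration.

```python
# Stand-in for a custom multi-turn environment: the agent has up to
# max_turns attempts to guess the answer stored in the task dictionary.
class GuessNumberEnv:
    def __init__(self, task: dict, max_turns: int = 3):
        self.task = task
        self.max_turns = max_turns
        self.turn = 0

    def reset(self, task=None):
        # Optionally set a new task; otherwise reuse the current one.
        if task is not None:
            self.task = task
        self.turn = 0
        return {"question": self.task["question"]}, {}

    def get_reward_and_next_obs(self, task: dict, action: str):
        # Task-specific logic: reward 1.0 for an exact match, else 0.0.
        reward = 1.0 if action.strip() == task["answer"] else 0.0
        return reward, {"feedback": "correct" if reward else "try again"}

    def step(self, action: str):
        self.turn += 1
        reward, obs = self.get_reward_and_next_obs(self.task, action)
        # Terminate on success or when the turn budget is exhausted.
        done = reward == 1.0 or self.turn >= self.max_turns
        return obs, reward, done, {}
```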
Example: Using SingleTurnEnvironment
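A usage sketch for the single-turn case, again written against a self-contained stand-in rather than the real class. The reward_fn signature (task, action) and the "ground_truth" field are assumptions based on the descriptions on this page.

```python
# Hypothetical reward function: exact string match against the ground truth.
def exact_match_reward(task: dict, action: str) -> float:
    return 1.0 if action.strip() == task.get("ground_truth") else 0.0

# Stand-in mirroring the documented single-turn behavior (max_turns=1).
class SingleTurnEnvSketch:
    def __init__(self, task: dict, reward_fn=None):
        self.task = task
        # If no reward function is given, fall back to zero reward.
        self.reward_fn = reward_fn or (lambda task, action: 0.0)

    def reset(self):
        return {"question": self.task["question"]}, {}

    def step(self, action: str):
        reward = self.reward_fn(self.task, action)
        # A single-turn episode always terminates after one step.
        return None, reward, True, {}

env = SingleTurnEnvSketch(
    {"question": "What is 2 + 2?", "ground_truth": "4"},
    reward_fn=exact_match_reward,
)
obs, info = env.reset()
obs, reward, done, info = env.step("4")
```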
ToolEnvironment
Environment for agents that use tools, handling tool execution and response evaluation.
rllm/environments/tools/tool_env.py
Constructor
Parameters:
Task information dictionary. Typically includes “question” and “ground_truth” or “answer” fields.
List of tool names to load from the registry (e.g., ["python", "google_search"]). Mutually exclusive with tool_map.
Dictionary mapping tool names to Tool classes for custom tools. Mutually exclusive with tools.
Reward function for evaluating agent responses. If None, returns 0 reward with a warning.
Maximum number of steps before terminating the episode.
Properties
MultiTool instance managing available tools.
Current step number in the episode.
Methods
reset
Reset the environment to the initial state.
Returns:
The task dictionary.
Empty dictionary (for compatibility).
step
Execute tools based on the agent action and return results.
Parameters:
Agent action. Can be:
- list[dict]: Tool calls to execute
- str: Final answer (triggers termination and reward computation)
- dict: Single tool call
The episode terminates when:
- The agent provides a string response (final answer)
- The agent calls the “finish” tool
- max_steps is reached
Returns:
Dictionary with “tool_outputs” mapping call IDs to results, or None if done.
Reward from reward function (only computed on final step).
Whether the episode has terminated.
Additional information from the environment.
Example
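The flow described above can be sketched with a self-contained stand-in: the agent emits tool calls, the environment executes them and returns their results under "tool_outputs", and a plain string action ends the episode with a reward. The tool-call field names ("id", "name", "arguments") and the toy calculator tool are assumptions, not the exact rllm format.

```python
# Toy "tool": evaluate a simple arithmetic expression and return a string.
def calculator(expression: str) -> str:
    return str(eval(expression))

# Stand-in mirroring the documented ToolEnvironment step behavior.
class ToolEnvSketch:
    def __init__(self, task: dict, tool_map: dict, reward_fn, max_steps: int = 5):
        self.task = task
        self.tool_map = tool_map
        self.reward_fn = reward_fn
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.task, {}

    def step(self, action):
        self.steps += 1
        if isinstance(action, str):
            # String action = final answer: terminate and compute the reward.
            return None, self.reward_fn(self.task, action), True, {}
        # Accept a single tool call (dict) or a batch (list of dicts).
        calls = action if isinstance(action, list) else [action]
        outputs = {
            call["id"]: self.tool_map[call["name"]](**call["arguments"])
            for call in calls
        }
        done = self.steps >= self.max_steps
        return {"tool_outputs": outputs}, 0.0, done, {}

env = ToolEnvSketch(
    task={"question": "What is 6 * 7?", "ground_truth": "42"},
    tool_map={"calculator": calculator},
    reward_fn=lambda task, answer: 1.0 if answer == task["ground_truth"] else 0.0,
)
obs, info = env.reset()
obs, reward, done, info = env.step(
    [{"id": "call_1", "name": "calculator", "arguments": {"expression": "6 * 7"}}]
)
# obs["tool_outputs"]["call_1"] now holds "42"
obs, reward, done, info = env.step("42")
```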
Tool Call Format
Tool calls should be dictionaries with this structure:
finish - Signals completion and provides the final answer:


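A sketch of the two shapes described above. The field names ("id", "name", "arguments") and the "finish" payload are assumptions inferred from this page, not the exact rllm format.

```python
# Hypothetical ordinary tool call: identifies the tool and its arguments.
tool_call = {
    "id": "call_1",
    "name": "python",
    "arguments": {"code": "print(2 + 2)"},
}

# Hypothetical "finish" call: signals completion and carries the final answer.
finish_call = {
    "id": "call_2",
    "name": "finish",
    "arguments": {"answer": "4"},
}
```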