In this tutorial, you’ll build and train a single-step agent that solves math problems using the rLLM SDK. This is the simplest way to get started with RL training in rLLM.

Overview

By the end of this tutorial, you will have:
  1. Created a simple agent function that solves math problems
  2. Connected it to rLLM’s automatic tracing system
  3. Trained the agent using GRPO on the Hendrycks MATH dataset
Training an RL agent requires two components:
  1. Rollout function: Perform a sequence of actions using the LLM
  2. Reward function: Evaluate how good the outcome is
The rLLM SDK handles the plumbing—you just define what to generate and how to score it.
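Conceptually, a rollout composes these two pieces: generate a response, score it, return the reward as a float. A minimal sketch of that contract, using stand-in `generate` and `score` callables rather than rLLM APIs (the real versions are built in the sections below):

```python
from typing import Callable

def rollout_sketch(
    question: str,
    ground_truth: str,
    generate: Callable[[str], str],      # stand-in for the LLM call
    score: Callable[[str, str], float],  # stand-in for the reward function
) -> float:
    """Compose generation and scoring into a single training step."""
    response = generate(question)
    return score(response, ground_truth)

# Stub implementations, just to illustrate the data flow
reward = rollout_sketch(
    "What is 2 + 2?",
    "4",
    generate=lambda q: "4",
    score=lambda r, gt: 1.0 if r == gt else 0.0,
)
print(reward)  # 1.0
```

Everything that follows fills in these two slots with a real LLM call and a real math-answer checker.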

Setup

Step 1: Install rLLM

If you haven’t already, install rLLM:
pip install rllm
Step 2: Prepare the dataset

Download and prepare the Hendrycks MATH dataset:
cd rllm
python -m examples.simple_math.prepare_math_dataset
Step 3: Launch a vLLM server

Start a vLLM server for testing:
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --host 0.0.0.0 \
    --port 4000

1. Define the Rollout Function

The rollout function generates a response from the LLM. This is what you want to train.

1.1 Import dependencies

from rllm.sdk import get_chat_client

1.2 Create the generation logic

def generate_response(question: str) -> str:
    """Generate a response to a math question.
    
    This is the core behavior you want to improve via RL.
    """
    # Create client INSIDE the function (important for Ray serialization)
    client = get_chat_client(
        base_url="http://localhost:4000/v1",
        api_key="token-abc123"
    )
    
    # Make the LLM call - automatically traced!
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        messages=[
            {"role": "user", "content": question},
        ],
    )
    
    return response.choices[0].message.content

1.3 Test the generation

print(generate_response("What is 2 + 2?"))
Expected output (abridged; the distilled reasoning model emits its chain of thought before the final answer):
"... \boxed{4}"
Important: Always create get_chat_client() inside the function. Creating it at module level causes Ray serialization errors.
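The underlying reason: Ray ships your rollout function to workers by serializing it, and a client object holds OS-level resources (sockets, locks) that cannot be serialized. You can reproduce the same failure mode with the standard library, using a thread lock as a stand-in for a live HTTP client:

```python
import pickle
import threading

# A lock stands in for an HTTP client: both hold OS-level state.
client_like = threading.Lock()

try:
    pickle.dumps(client_like)
except TypeError as e:
    print(f"Serialization fails: {e}")  # cannot pickle '_thread.lock' object

# A function that only *creates* such objects when called serializes fine,
# because no stateful object is captured at definition time.
def make_inside():
    local = threading.Lock()
    return type(local).__name__

pickle.dumps(make_inside)  # OK
```

Creating the client inside the function means only code, not live connections, is shipped to workers.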

2. Define the Reward Function

The reward function evaluates how good the response is. This is the training signal.

2.1 What the reward function does

The reward function is simple—it does two things:
  1. Parse: Extract the answer from the model’s response (looks for \boxed{}, numbers, etc.)
  2. Compare: Check if the extracted answer matches the ground truth
Model response: "Let me solve this step by step... The answer is \boxed{4}"
    → extract_answer() extracts "4"
    → compare with ground_truth "4"
    → match → reward = 1.0
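To make the parse-and-compare idea concrete, here is a toy regex version of the two steps. This is a simplified illustration, not rLLM's actual parser (the built-in one handles many more formats):

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Toy parser: take the last \\boxed{...} if present, else the last number."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

def toy_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches exactly."""
    return 1.0 if extract_answer(response) == ground_truth.strip() else 0.0

print(toy_reward("The answer is \\boxed{4}", "4"))  # 1.0
print(toy_reward("I think it's 5", "4"))            # 0.0
```

In practice you should use the built-in reward shown next; robust answer extraction (equivalent fractions, LaTeX normalization, etc.) is harder than this sketch suggests.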

2.2 Using the built-in math reward

rLLM provides math_reward_fn which handles common math answer formats:
from rllm.rewards.reward_fn import math_reward_fn

def evaluate_response(response: str, ground_truth: str) -> float:
    """Evaluate how correct the response is.
    
    Returns:
        1.0 if correct, 0.0 if incorrect
    """
    result = math_reward_fn(
        {"ground_truth": ground_truth}, 
        response  # The model's full response
    )
    return result.reward

2.3 Test the evaluation

# Correct answer (boxed format)
reward = evaluate_response("The answer is \\boxed{4}", ground_truth="4")
print(f"Reward for correct: {reward}")  # 1.0

# Wrong answer
reward = evaluate_response("The answer is \\boxed{5}", ground_truth="4")
print(f"Reward for wrong: {reward}")  # 0.0

3. Combine into a Rollout Function

Now combine generation + reward into a single rollout function:
from rllm.sdk import get_chat_client
from rllm.rewards.reward_fn import math_reward_fn

def rollout(**kwargs):
    """Complete training function: generate + evaluate.
    
    Args:
        question: The math problem to solve
        ground_truth: The correct answer
        
    Returns:
        float: Reward (1.0 for correct, 0.0 for incorrect)
    """
    question = kwargs["question"]
    ground_truth = kwargs["ground_truth"]
    
    # Step 1: Generate response (rollout)
    client = get_chat_client(
        base_url="http://localhost:4000/v1",
        api_key="token-abc123"  # same key as in section 1.1
    )
    
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        messages=[{"role": "user", "content": question}],
    )
    response_text = response.choices[0].message.content
    
    # Step 2: Evaluate result (reward)
    reward = math_reward_fn(
        {"ground_truth": ground_truth}, 
        response_text
    ).reward
    
    return reward

3.1 Test the complete function

result = rollout(
    question="What is 2 + 2?",
    ground_truth="4"
)
print(f"Reward: {result}")
Expected output:
Reward: 1.0

4. Set Up the Trainer

Now wrap the agent function with AgentTrainer:
import hydra
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_trainer import AgentTrainer

@hydra.main(
    config_path="pkg://rllm.trainer.config", 
    config_name="agent_ppo_trainer", 
    version_base=None
)
def main(config):
    # Load datasets
    train_dataset = DatasetRegistry.load_dataset("hendrycks_math", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")
    
    # Create trainer with your agent function
    trainer = AgentTrainer(
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        agent_run_func=rollout,  # Your function from step 3
    )
    
    # Start training
    trainer.train()

if __name__ == "__main__":
    main()

5. Run Training

Launch the training:
bash examples/sdk/simple_math/train_hendrycks_math.sh

6. Monitor Training

Training logs to WandB by default. Key metrics:
Metric               Description
-------------------  ------------------------
critic/score/mean    Average reward per batch
val/pass@1           Validation accuracy
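Since each rollout returns a binary 0/1 reward, the mean score is just the fraction of problems solved in a batch, so it reads directly as training accuracy. For intuition (plain Python with made-up numbers, not trainer internals):

```python
# Rewards from one hypothetical batch of 8 rollouts (1.0 = correct)
batch_rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# This is what the mean-score metric reports for the batch
mean_score = sum(batch_rewards) / len(batch_rewards)
print(mean_score)  # 0.625
```

A steadily rising mean score indicates the policy is learning; a flat line near 0.0 or 1.0 suggests the problems are too hard or too easy for the model.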
