This example demonstrates training and running DeepScaleR, a reasoning LLM fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B on math competition problems using RL. The model achieves over 40% Pass@1 on AIME 2024, matching o1-preview-level performance despite its small size.

Overview

The DeepScaleR example demonstrates:
  • How to use rLLM’s MathAgent for mathematical reasoning
  • How to train agents with iterative context lengthening (8K → 16K → 24K)
  • How to evaluate mathematical reasoning with Pass@K metrics
  • Scaling RL to achieve state-of-the-art performance on math competitions

Prerequisites

  • rLLM framework installed
  • vLLM or SGLang for model serving
  • Pre-trained model: agentica-org/DeepScaleR-1.5B-Preview
  • GPU with sufficient memory for 8K-24K context lengths

Setup

Step 1: Prepare math datasets

Download and prepare mathematical competition datasets:
cd examples/deepscaler
python prepare_math_data.py
This will download:
  • AIME 2024 (test set)
  • Hendrycks MATH (training)
  • Math500 (validation)
Step 2: Start model server

Launch a vLLM server:
python -m vllm.entrypoints.openai.api_server \
    --model agentica-org/DeepScaleR-1.5B-Preview \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
Or use SGLang:
python -m sglang_router.launch_server \
    --model-path agentica-org/DeepScaleR-1.5B-Preview \
    --dp-size 1 \
    --dtype bfloat16
The server will be accessible at http://localhost:30000/v1

Running DeepScaleR

Execute the math reasoning agent:
cd examples/deepscaler
python run_deepscaler.py

Code Implementation

import asyncio
from transformers import AutoTokenizer
from rllm.agents.math_agent import MathAgent
from rllm.data.dataset import DatasetRegistry
from rllm.engine.agent_execution_engine import AgentExecutionEngine
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.rewards.reward_fn import math_reward_fn
from rllm.utils import compute_pass_at_k

n_parallel_agents = 64
model_name = "agentica-org/DeepScaleR-1.5B-Preview"

tokenizer = AutoTokenizer.from_pretrained(model_name)

env_args = {
    "reward_fn": math_reward_fn,
}

sampling_params = {"temperature": 0.6, "top_p": 0.95, "model": model_name}

engine = AgentExecutionEngine(
    agent_class=MathAgent,
    env_class=SingleTurnEnvironment,
    agent_args={},
    env_args=env_args,
    engine_name="openai",
    tokenizer=tokenizer,
    sampling_params=sampling_params,
    rollout_engine_args={
        "base_url": "http://localhost:30000/v1",
        "api_key": "None",
    },
    max_response_length=32768,
    max_prompt_length=2048,
    n_parallel_agents=n_parallel_agents,
)

test_dataset = DatasetRegistry.load_dataset("aime2024", "test")
tasks = test_dataset.repeat(n=16)  # repeat to evaluate pass@k

results = asyncio.run(engine.execute_tasks(tasks))
compute_pass_at_k(results)
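The `math_reward_fn` passed to the environment above is provided by rLLM and its implementation is not shown here. Conceptually, a boxed-answer reward checks the final `\boxed{...}` expression against the ground truth. The sketch below (`simple_math_reward` is a hypothetical illustration, not the library's implementation, and does not handle nested braces or symbolic equivalence):

```python
import re

def simple_math_reward(response: str, ground_truth: str) -> float:
    """Sketch of a boxed-answer reward: 1.0 if the last \\boxed{...}
    in the response matches the ground truth after whitespace
    normalization, else 0.0. Nested braces are not handled."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    pred = matches[-1].strip()
    return 1.0 if pred == ground_truth.strip() else 0.0
```

A production reward function would additionally normalize LaTeX (fractions, spacing) and check mathematical equivalence rather than string equality.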

Expected Results

DeepScaleR-1.5B-Preview on AIME 2024:
Metric     Performance
Pass@1     40.0%
Pass@16    65.0%
Pass@64    75.0%
This matches or exceeds OpenAI o1-preview on AIME 2024, even though the model has only 1.5B parameters.
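The `compute_pass_at_k` helper aggregates the repeated rollouts; its exact implementation lives in rLLM, but the standard unbiased Pass@k estimator from Chen et al. (2021) can be sketched as follows (a sketch, not the rLLM code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one
    of k samples drawn without replacement from n total samples is
    among the c correct ones."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=16` rollouts per problem, this yields Pass@1 through Pass@16; Pass@64 requires repeating each task 64 times as shown in the test-time scaling section below.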

Training DeepScaleR

Train your own DeepScaleR agent with iterative context lengthening:

Step 1: Train with 8K context

bash examples/deepscaler/train_deepscaler_8k.sh

Step 2: Train with 16K context

Modify MODEL_PATH in the script to point to your 8K checkpoint:
bash examples/deepscaler/train_deepscaler_16k.sh

Step 3: Train with 24K context

Modify MODEL_PATH to point to your 16K checkpoint:
bash examples/deepscaler/train_deepscaler_24k.sh

Training Configuration

Key hyperparameters:
  • Base Model: DeepSeek-R1-Distill-Qwen-1.5B
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Training Dataset: Hendrycks MATH + Math500
  • Evaluation Dataset: AIME 2024
  • Batch Size: 64
  • Learning Rate: 1e-6
  • Context Progression: 8K → 16K → 24K
  • Sampling: n=16 candidates per problem
  • Temperature: 0.6
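GRPO dispenses with a learned critic: each of the n=16 sampled solutions for a problem is scored relative to the mean and standard deviation of its own group's rewards. A minimal sketch of that advantage computation (`grpo_advantages` is illustrative, not the trainer's actual code):

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each candidate's reward
    by the mean and std of its group (the n candidates sampled for
    the same problem). No value network is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Candidates that beat their group's average get positive advantages and are reinforced; below-average candidates are penalized.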

Training Script Structure

import hydra
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_trainer import AgentTrainer
from rllm.agents.math_agent import MathAgent
from rllm.environments.base.single_turn_env import SingleTurnEnvironment

@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("hendrycks_math", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")

    trainer = AgentTrainer(
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        agent_class=MathAgent,
        env_class=SingleTurnEnvironment,
    )
    trainer.train()

if __name__ == "__main__":
    main()

Iterative Context Lengthening

DeepScaleR uses a curriculum learning approach:
  1. 8K Phase: Learn basic reasoning patterns
  2. 16K Phase: Handle more complex multi-step problems
  3. 24K Phase: Master extremely long reasoning chains
Each phase builds on the previous checkpoint, gradually increasing the model’s ability to handle longer reasoning chains.

Key Features

Test-Time Scaling

DeepScaleR improves with more compute at inference time:
# Sample multiple solutions and select the best
tasks = test_dataset.repeat(n=64)  # 64 attempts per problem
results = asyncio.run(engine.execute_tasks(tasks))
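One common way to "select the best" among many samples without a verifier is self-consistency: majority voting over the extracted final answers. A minimal sketch (not part of the rLLM API):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency selection: return the most common final answer
    among sampled solutions. Ties resolve to the first-seen answer."""
    return Counter(answers).most_common(1)[0][0]
```

With 64 attempts per problem, voting over the boxed answers typically lands between the Pass@1 and Pass@64 numbers reported above.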

Long-Form Reasoning

The model generates detailed step-by-step solutions:
Problem: Find all real numbers x such that...

Solution:
<think>
Let me analyze this carefully. First, I'll consider...
[2000+ tokens of reasoning]
</think>

Therefore, the answer is \boxed{42}
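Evaluation code typically separates the `<think>` trace from the final answer text that follows it. A hedged sketch of such a parser (`split_reasoning` is illustrative, not an rLLM function):

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model output into (reasoning, answer): the content of
    the <think>...</think> block and the text after it. If no think
    block is present, the whole text is treated as the answer."""
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len("<think>"):end].strip()
    answer = text[end + len("</think>"):].strip()
    return reasoning, answer
```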

Monitoring Training

Training logs to WandB. Key metrics to track:
Metric                  Description
critic/score/mean       Average reward per batch
val/pass@1              AIME 2024 Pass@1 accuracy
val/pass@16             AIME 2024 Pass@16 accuracy
train/response_length   Average reasoning length
