This example demonstrates training and running DeepScaleR, a reasoning LLM fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B on math competition problems using RL. The model achieves over 40% Pass@1 on AIME 2024, matching o1-preview-level performance despite its small size.

Overview

The DeepScaleR example demonstrates:
  • How to use rLLM’s MathAgent for mathematical reasoning
  • How to train agents with iterative context lengthening (8K → 16K → 24K)
  • How to evaluate mathematical reasoning with Pass@K metrics
  • Scaling RL to achieve state-of-the-art performance on math competitions

Prerequisites

  • rLLM framework installed
  • vLLM or SGLang for model serving
  • Pre-trained model: agentica-org/DeepScaleR-1.5B-Preview
  • GPU with sufficient memory for 8K-24K context lengths

Setup

Step 1: Prepare math datasets

Download and prepare mathematical competition datasets:
cd examples/deepscaler
python prepare_math_data.py
This will download:
  • AIME 2024 (test set)
  • Hendrycks MATH (training)
  • Math500 (validation)
Step 2: Start model server

Launch a vLLM server:
python -m vllm.entrypoints.openai.api_server \
    --model agentica-org/DeepScaleR-1.5B-Preview \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
Or use SGLang:
python -m sglang_router.launch_server \
    --model-path agentica-org/DeepScaleR-1.5B-Preview \
    --dp-size 1 \
    --dtype bfloat16
The server will be accessible at http://localhost:30000/v1

Running DeepScaleR

Execute the math reasoning agent:
cd examples/deepscaler
python run_deepscaler.py

Code Implementation

import asyncio
from transformers import AutoTokenizer
from rllm.agents.math_agent import MathAgent
from rllm.data.dataset import DatasetRegistry
from rllm.engine.agent_execution_engine import AgentExecutionEngine
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.rewards.reward_fn import math_reward_fn
from rllm.utils import compute_pass_at_k

n_parallel_agents = 64
model_name = "agentica-org/DeepScaleR-1.5B-Preview"

tokenizer = AutoTokenizer.from_pretrained(model_name)

env_args = {
    "reward_fn": math_reward_fn,
}

sampling_params = {"temperature": 0.6, "top_p": 0.95, "model": model_name}

engine = AgentExecutionEngine(
    agent_class=MathAgent,
    env_class=SingleTurnEnvironment,
    agent_args={},
    env_args=env_args,
    engine_name="openai",
    tokenizer=tokenizer,
    sampling_params=sampling_params,
    rollout_engine_args={
        "base_url": "http://localhost:30000/v1",
        "api_key": "None",
    },
    max_response_length=32768,
    max_prompt_length=2048,
    n_parallel_agents=n_parallel_agents,
)

test_dataset = DatasetRegistry.load_dataset("aime2024", "test")
tasks = test_dataset.repeat(n=16)  # repeat to evaluate pass@k

results = asyncio.run(engine.execute_tasks(tasks))
compute_pass_at_k(results)
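The `math_reward_fn` passed to the environment above is provided by rLLM and its implementation is not shown here. Conceptually, a boxed-answer reward checks the final `\boxed{...}` expression against the ground truth. The sketch below (`simple_math_reward` is a hypothetical illustration, not the library's implementation, and does not handle nested braces or symbolic equivalence):

```python
import re

def simple_math_reward(response: str, ground_truth: str) -> float:
    """Sketch of a boxed-answer reward: 1.0 if the last \\boxed{...}
    in the response matches the ground truth after whitespace
    normalization, else 0.0. Nested braces are not handled."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    pred = matches[-1].strip()
    return 1.0 if pred == ground_truth.strip() else 0.0
```

A production reward function would additionally normalize LaTeX (fractions, spacing) and check mathematical equivalence rather than string equality.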

Expected Results

DeepScaleR-1.5B-Preview on AIME 2024:
Metric     Performance
Pass@1     40.0%
Pass@16    65.0%
Pass@64    75.0%
This matches or exceeds OpenAI o1-preview on AIME 2024, even though the model has only 1.5B parameters.
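The `compute_pass_at_k` helper aggregates the repeated rollouts; its exact implementation lives in rLLM, but the standard unbiased Pass@k estimator from Chen et al. (2021) can be sketched as follows (a sketch, not the rLLM code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one
    of k samples drawn without replacement from n total samples is
    among the c correct ones."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=16` rollouts per problem, this yields Pass@1 through Pass@16; Pass@64 requires repeating each task 64 times as shown in the test-time scaling section below.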

Training DeepScaleR

Train your own DeepScaleR agent with iterative context lengthening:

Step 1: Train with 8K context

bash examples/deepscaler/train_deepscaler_8k.sh

Step 2: Train with 16K context

Modify MODEL_PATH in the script to point to your 8K checkpoint:
bash examples/deepscaler/train_deepscaler_16k.sh

Step 3: Train with 24K context

Modify MODEL_PATH to point to your 16K checkpoint:
bash examples/deepscaler/train_deepscaler_24k.sh

Training Configuration

Key hyperparameters:
  • Base Model: DeepSeek-R1-Distill-Qwen-1.5B
  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Training Dataset: Hendrycks MATH + Math500
  • Evaluation Dataset: AIME 2024
  • Batch Size: 64
  • Learning Rate: 1e-6
  • Context Progression: 8K → 16K → 24K
  • Sampling: n=16 candidates per problem
  • Temperature: 0.6
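GRPO dispenses with a learned critic: each of the n=16 sampled solutions for a problem is scored relative to the mean and standard deviation of its own group's rewards. A minimal sketch of that advantage computation (`grpo_advantages` is illustrative, not the trainer's actual code):

```python
from statistics import mean, stdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each candidate's reward
    by the mean and std of its group (the n candidates sampled for
    the same problem). No value network is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Candidates that beat their group's average get positive advantages and are reinforced; below-average candidates are penalized.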

Training Script Structure

import hydra
from rllm.data.dataset import DatasetRegistry
from rllm.trainer.agent_trainer import AgentTrainer
from rllm.agents.math_agent import MathAgent
from rllm.environments.base.single_turn_env import SingleTurnEnvironment

@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("hendrycks_math", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")

    trainer = AgentTrainer(
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        agent_class=MathAgent,
        env_class=SingleTurnEnvironment,
    )
    trainer.train()

if __name__ == "__main__":
    main()

Iterative Context Lengthening

DeepScaleR uses a curriculum learning approach:
  1. 8K Phase: Learn basic reasoning patterns
  2. 16K Phase: Handle more complex multi-step problems
  3. 24K Phase: Master extremely long reasoning chains
Each phase builds on the previous checkpoint, gradually increasing the model’s ability to handle longer reasoning chains.

Key Features

Test-Time Scaling

DeepScaleR improves with more compute at inference time:
# Sample multiple solutions and select the best
tasks = test_dataset.repeat(n=64)  # 64 attempts per problem
results = asyncio.run(engine.execute_tasks(tasks))
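One common way to "select the best" among many samples without a verifier is self-consistency: majority voting over the extracted final answers. A minimal sketch (not part of the rLLM API):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency selection: return the most common final answer
    among sampled solutions. Ties resolve to the first-seen answer."""
    return Counter(answers).most_common(1)[0][0]
```

With 64 attempts per problem, voting over the boxed answers typically lands between the Pass@1 and Pass@64 numbers reported above.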

Long-Form Reasoning

The model generates detailed step-by-step solutions:
Problem: Find all real numbers x such that...

Solution:
<think>
Let me analyze this carefully. First, I'll consider...
[2000+ tokens of reasoning]
</think>

Therefore, the answer is \boxed{42}
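Evaluation code typically separates the `<think>` trace from the final answer text that follows it. A hedged sketch of such a parser (`split_reasoning` is illustrative, not an rLLM function):

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model output into (reasoning, answer): the content of
    the <think>...</think> block and the text after it. If no think
    block is present, the whole text is treated as the answer."""
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len("<think>"):end].strip()
    answer = text[end + len("</think>"):].strip()
    return reasoning, answer
```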

Monitoring Training

Training logs to WandB. Key metrics to track:
Metric                  Description
critic/score/mean       Average reward per batch
val/pass@1              AIME 2024 Pass@1 accuracy
val/pass@16             AIME 2024 Pass@16 accuracy
train/response_length   Average reasoning length
