## Overview

The DeepScaleR example demonstrates:

- How to use rLLM's `MathAgent` for mathematical reasoning
- How to train agents with iterative context lengthening (8K → 16K → 24K)
- How to evaluate mathematical reasoning with Pass@K metrics
- How to scale RL to achieve state-of-the-art performance on math competitions
## Prerequisites

- rLLM framework installed
- vLLM or SGLang for model serving
- Pre-trained model: `agentica-org/DeepScaleR-1.5B-Preview`
- GPU with sufficient memory for 8K-24K context lengths
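One way to serve the pre-trained model locally is with vLLM's OpenAI-compatible server; the port and context-length values below are illustrative, not taken from the example's own scripts:

```shell
# Serve the preview model with vLLM; --max-model-len should cover the
# longest context phase you plan to evaluate (24K here).
vllm serve agentica-org/DeepScaleR-1.5B-Preview \
    --max-model-len 24576 \
    --port 8000
```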
## Setup

### Prepare math datasets

Download and prepare mathematical competition datasets. This will download:

- AIME 2024 (test set)
- Hendrycks MATH (training)
- Math500 (validation)
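As a rough illustration of what such preparation produces, the sketch below normalizes raw competition problems into simple question/answer records with a split label. The field names (`question`, `ground_truth`, `split`) are assumptions for illustration, not rLLM's actual dataset schema:

```python
import json

# Hypothetical raw problems in the shape competition datasets often use.
raw_problems = [
    {"problem": "Compute 2^10.", "solution": "2^10 = 1024", "answer": "1024"},
    {"problem": "What is 7 * 8?", "solution": "7 * 8 = 56", "answer": "56"},
]

def to_record(p: dict, split: str) -> dict:
    """Map a raw problem to a flat record; field names are illustrative."""
    return {
        "question": p["problem"],
        "ground_truth": p["answer"],  # used later for reward checking
        "split": split,
    }

records = [to_record(p, "train") for p in raw_problems]
print(json.dumps(records[0]))
```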
## Running DeepScaleR

Execute the math reasoning agent.

### Code Implementation
## Expected Results

DeepScaleR-1.5B-Preview on AIME 2024:

| Metric | Performance |
|---|---|
| Pass@1 | 40.0% |
| Pass@16 | 65.0% |
| Pass@64 | 75.0% |
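Pass@K numbers like these are typically computed with the unbiased estimator popularized by the HumanEval paper: given n samples of which c are correct, it gives the probability that at least one of k draws is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K: P(at least one of k draws from n samples is
    correct), given c correct samples, drawn without replacement."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 samples per problem, 26 of them correct.
print(round(pass_at_k(64, 26, 1), 3))  # Pass@1 reduces to c/n here: 0.406
```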
## Training DeepScaleR

Train your own DeepScaleR agent with iterative context lengthening.

### Step 1: Train with 8K context

### Step 2: Train with 16K context

Modify `MODEL_PATH` in the script to point to your 8K checkpoint.

### Step 3: Train with 24K context

Modify `MODEL_PATH` to point to your 16K checkpoint.
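The three steps above chain checkpoints together. A sketch of the launch sequence is below; the script names and checkpoint paths are assumptions for illustration, not rLLM's actual file layout:

```shell
# Hypothetical three-stage launch; each stage resumes from the
# previous stage's checkpoint via MODEL_PATH.
export MODEL_PATH=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
bash train_deepscaler_8k.sh     # writes checkpoints/deepscaler-8k

export MODEL_PATH=checkpoints/deepscaler-8k
bash train_deepscaler_16k.sh    # writes checkpoints/deepscaler-16k

export MODEL_PATH=checkpoints/deepscaler-16k
bash train_deepscaler_24k.sh    # final 24K-context model
```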
## Training Configuration

Key hyperparameters:

- Base Model: DeepSeek-R1-Distill-Qwen-1.5B
- Algorithm: GRPO (Group Relative Policy Optimization)
- Training Dataset: Hendrycks MATH + Math500
- Evaluation Dataset: AIME 2024
- Batch Size: 64
- Learning Rate: 1e-6
- Context Progression: 8K → 16K → 24K
- Sampling: n=16 candidates per problem
- Temperature: 0.6
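The group-relative part of GRPO can be illustrated in a few lines: the rewards for the n=16 samples of a single problem are normalized against that group's own mean and standard deviation, so no learned critic is needed. This is a simplified view of the advantage computation, not rLLM's implementation:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize one problem's sample rewards by the group's own stats."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid 0-division on uniform groups
    return [(r - mu) / sigma for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. 2 of 4 samples solved the problem
print(group_advantages(rewards))  # [1.0, -1.0, -1.0, 1.0]
```

Note that when every sample in a group gets the same reward, the advantages are all zero, so that group contributes no gradient signal.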
## Training Script Structure
## Iterative Context Lengthening

DeepScaleR uses a curriculum learning approach:

- 8K Phase: Learn basic reasoning patterns
- 16K Phase: Handle more complex multi-step problems
- 24K Phase: Master extremely long reasoning chains
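The curriculum above can be sketched as a simple loop in which each phase resumes from the previous phase's checkpoint and raises the context limit. `train_phase` below is a stand-in for a full rLLM training run, not a real API:

```python
def train_phase(checkpoint: str, max_context: int) -> str:
    """Placeholder for one GRPO training run at a given context length;
    returns the name of the checkpoint it would produce."""
    return f"{checkpoint}->ctx{max_context}"

checkpoint = "DeepSeek-R1-Distill-Qwen-1.5B"
for max_context in (8192, 16384, 24576):  # 8K -> 16K -> 24K curriculum
    checkpoint = train_phase(checkpoint, max_context)

print(checkpoint)
```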
## Key Features

### Test-Time Scaling

DeepScaleR improves with more compute at inference time: sampling more candidate solutions per problem raises Pass@K accuracy.

### Long-Form Reasoning

The model generates detailed step-by-step solutions with long chains of reasoning.

## Monitoring Training
Training logs to WandB. Key metrics to track:

| Metric | Description |
|---|---|
| `critic/score/mean` | Average reward per batch |
| `val/pass@1` | AIME 2024 Pass@1 accuracy |
| `val/pass@16` | AIME 2024 Pass@16 accuracy |
| `train/response_length` | Average reasoning length (tokens) |
## Next Steps
- Explore DeepCoder for coding competitions
- Try SDK examples for simplified workflows
- Learn about RL algorithms

