## Overview
The DeepSWE example demonstrates:

- How to use rLLM’s `SWEAgent` for software engineering tasks
- How to train agents with compact filtering for efficiency
- How to evaluate on SWE-Bench-Verified
- How to scale RL with Kubernetes and Docker environments
## Prerequisites
- rLLM framework installed
- vLLM for model serving (8 GPUs recommended)
- Pre-trained model: `agentica-org/DeepSWE-Preview`
- Kubernetes cluster (for training)
- Docker (for environment isolation)
- R2E-Gym for SWE environments
## Setup

### Prepare SWE datasets
Download and prepare the SWE-Bench datasets. This step registers SWE-Bench-Verified with rLLM’s `DatasetRegistry`.
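The shape of this step can be sketched as follows. This is a minimal, illustrative stand-in: the real `DatasetRegistry` class and the prepare script's interface in rLLM may differ, and the example instance is a placeholder.

```python
# Minimal, illustrative stand-in for rLLM's DatasetRegistry; the real
# class and the prepare script's interface may differ.
from dataclasses import dataclass, field


@dataclass
class DatasetRegistry:
    """Maps dataset names to lists of prepared examples."""
    _datasets: dict = field(default_factory=dict)

    def register(self, name: str, examples: list) -> None:
        self._datasets[name] = examples

    def load(self, name: str) -> list:
        return self._datasets[name]


def prepare_swe_bench_verified() -> list:
    # In the real example this would download SWE-Bench-Verified and
    # normalize each instance; the entry below is a placeholder.
    return [{"instance_id": "example__repo-1234", "repo": "example/repo"}]


registry = DatasetRegistry()
registry.register("swe_bench_verified", prepare_swe_bench_verified())
```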
## Running DeepSWE

Evaluate the DeepSWE agent on SWE-Bench-Verified:

### Code Implementation
## Expected Results

DeepSWE-Preview on SWE-Bench-Verified:

| Metric | Performance |
|---|---|
| Pass@1 | 42.2% |
| Pass@16 | 71.0% |
| Test-time scaled | 59.2% |
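Pass@k numbers like those above are typically computed with the standard unbiased estimator over multiple samples per instance. A minimal sketch (the evaluation script used in this example may compute it differently):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one instance.

    Given n sampled trajectories of which c succeeded, returns
    1 - C(n-c, k) / C(n, k): the probability that at least one of
    k randomly chosen samples solves the instance.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-instance (n, c) results into a benchmark-level Pass@k."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```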
### Full Evaluation with R2E-Gym

For a complete evaluation replicating the published results, the key flags are:

- `--max_workers 48`: Parallel workers (reduce if hitting timeouts)
- `--k 500`: Number of instances to evaluate (max 500 for SWE-Bench-Verified)
- `--max_steps_absolute 100`: Hard limit on trajectory steps
- `--backend "docker"`: Use Docker for environment isolation
## Training DeepSWE

### Local Testing with Kind

Use a Kind cluster for local experimentation (not full training).

### Production Training

Run full training on a proper Kubernetes cluster.

### Training Configuration
Key hyperparameters:

- Base Model: Qwen3-32B
- Algorithm: GRPO with compact filtering
- Training Dataset: R2E-Gym subset
- Evaluation Dataset: SWE-Bench-Verified
- Batch Size: 64
- Learning Rate: 1e-6
- Max Context: 65,536 tokens
- Parallel Environments: 512 Docker containers
- GPUs: 64 (8 nodes × 8 GPUs)
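The hyperparameters above can be summarized as a config fragment. The key names below are illustrative assumptions, not the example's actual configuration schema:

```yaml
# Illustrative config fragment; key names are assumptions, not rLLM's schema.
model:
  base: Qwen3-32B
algorithm:
  name: grpo
  compact_filtering: true
data:
  train: r2e_gym_subset
  eval: swe_bench_verified
trainer:
  batch_size: 64
  learning_rate: 1.0e-6
  max_context_tokens: 65536
rollout:
  parallel_environments: 512   # Docker containers
cluster:
  nodes: 8
  gpus_per_node: 8             # 64 GPUs total
```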
## Compact Filtering

DeepSWE uses compact filtering to improve training efficiency:

- Filters out failed trajectories before training
- Masks trajectories exceeding length limits
- Masks timeout trajectories
- Significantly reduces wasted compute
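The filtering rules above can be sketched roughly as follows. The trajectory fields and the default threshold are illustrative assumptions, not rLLM's actual implementation (which masks, rather than drops, over-length and timed-out trajectories at the loss level):

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Illustrative fields; rLLM's real trajectory object differs.
    reward: float
    num_tokens: int
    timed_out: bool


def compact_filter(
    trajectories: list[Trajectory],
    max_tokens: int = 65536,
) -> list[Trajectory]:
    """Keep only trajectories worth spending gradient compute on.

    Drops failed rollouts, and excludes (here simply drops, standing in
    for loss masking) trajectories that exceeded the context limit or
    timed out.
    """
    kept = []
    for t in trajectories:
        if t.reward <= 0.0:            # failed trajectory
            continue
        if t.num_tokens > max_tokens:  # exceeded length limit
            continue
        if t.timed_out:                # timed out
            continue
        kept.append(t)
    return kept
```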
## SWEEnv Integration

rLLM’s `SWEEnv` provides a clean wrapper over R2E-Gym:
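A minimal sketch of the wrapper's shape, assuming a Gym-style `reset`/`step` interface; the real `SWEEnv` API and R2E-Gym's environment interface may differ:

```python
# Illustrative Gym-style wrapper; the real SWEEnv interface may differ.
class FakeR2EGymEnv:
    """Stand-in for an R2E-Gym environment (one SWE instance)."""

    def reset(self):
        return "ISSUE: fix the failing test in utils.py"

    def step(self, action: str):
        done = action.startswith("submit")
        reward = 1.0 if done else 0.0
        return "OK", reward, done


class SWEEnvSketch:
    """Wraps a backend environment behind a uniform reset/step API."""

    def __init__(self, backend):
        self.backend = backend

    def reset(self) -> str:
        return self.backend.reset()

    def step(self, action: str):
        obs, reward, done = self.backend.step(action)
        return obs, reward, done


env = SWEEnvSketch(FakeR2EGymEnv())
first_obs = env.reset()
obs, reward, done = env.step("submit patch")
```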
## Agent Actions

The `SWEAgent` can perform:

- Search: Find relevant code locations
- View: Read file contents
- Edit: Modify code
- Create: Add new files
- Execute: Run commands and tests
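These tools can be modeled as a simple action dispatch. The tool names and signatures below are illustrative, not the agent's actual tool schema:

```python
# Illustrative tool dispatch; the agent's real tool schema may differ.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    "search": lambda query: f"matches for {query!r}",
    "view": lambda path: f"contents of {path}",
    "edit": lambda path, patch: f"applied patch to {path}",
    "create": lambda path, text: f"created {path}",
    "execute": lambda cmd: f"ran {cmd!r}",
}


def dispatch(action: str, **kwargs) -> str:
    """Route a parsed agent action to the matching tool."""
    if action not in TOOLS:
        return f"unknown action: {action}"
    return TOOLS[action](**kwargs)
```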
## Monitoring Training

Training logs to WandB. Key metrics:

| Metric | Description |
|---|---|
| `critic/score/mean` | Average success rate per batch |
| `val/pass@1` | SWE-Bench-Verified Pass@1 |
| `train/avg_steps` | Average trajectory length |
| `train/timeout_rate` | Fraction of timeouts |
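The per-batch training metrics can be derived from rollout results as in this sketch; the rollout field names are assumptions, not rLLM's actual rollout structure:

```python
def batch_metrics(rollouts: list[dict]) -> dict:
    """Aggregate rollout results into the batch-level metrics above.

    Each rollout dict is assumed to carry 'success', 'steps', and
    'timed_out' fields; real rLLM rollouts are structured differently.
    """
    n = len(rollouts)
    return {
        "critic/score/mean": sum(r["success"] for r in rollouts) / n,
        "train/avg_steps": sum(r["steps"] for r in rollouts) / n,
        "train/timeout_rate": sum(r["timed_out"] for r in rollouts) / n,
    }
```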
## Trajectory Visualization

Visualize generated trajectories using R2E-Gym’s visualization tool.

## Reproduction Guide

For detailed instructions on reproducing the published results, refer to the DeepSWE reproduction guide.

## Next Steps
- Explore DeepCoder for competitive programming
- Try DeepScaleR for mathematical reasoning
- Learn about distributed training

