The verl backend is rLLM's high-performance distributed training backend built on top of verl (v0.6.1). It provides efficient distributed reinforcement learning for language agents, with support for the vLLM and SGLang inference engines.
Overview
verl is designed for large-scale distributed training with the following architecture:
- Actor-Rollout Workers: Handle policy updates and trajectory generation
- Critic Workers: Compute value estimates for advantage calculation
- Reference Policy Workers: Maintain frozen reference policy for KL divergence
- Ray-based Orchestration: Manages distributed worker groups and resource allocation
Key Features
Distributed Training
Multi-GPU and multi-node training with Ray-based orchestration
Hybrid Engine
Combined actor-rollout engine for efficient async trajectory generation
VLM Support
Native support for vision-language models (Qwen2-VL, Qwen3-VL)
LoRA Training
Parameter-efficient fine-tuning with LoRA adapters
Installation
Install rLLM with the verl backend:
Megatron support: verl also supports Megatron for efficient large-scale training. Adding it requires a from-source install, since the install script lives in the rLLM repo:
This installs nvidia-modelopt, transformer-engine, megatron-core, megatron-bridge, and NVIDIA Apex. The CUDA version you pass here must match the --torch-backend flag in your rLLM install, e.g. cu128 for CUDA 12.8. Compilation may take a while.
Dependencies
The verl backend includes the following key dependencies (from pyproject.toml):
Python Version: Requires Python >= 3.10
Basic Usage
Agent Training
Train a math agent with the verl backend. The recommended path is to use the cookbooks/math cookbook: install it once, and the trainer wires up an AgentFlow + Evaluator for you:
train.py
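The cookbook script is not reproduced here; the following is a minimal sketch of what a train.py entry point can look like, assuming the AgentTrainer API referenced under See Also. The module paths, class name, config name, and dataset names are assumptions for illustration, so treat the cookbook as authoritative.
```python
# Minimal sketch of a train.py entry point. Import paths, class names, and
# dataset names are assumptions; follow the cookbooks/math cookbook for the
# exact wiring.
import hydra

from rllm.data.dataset import DatasetRegistry        # assumed import path
from rllm.trainer.agent_trainer import AgentTrainer  # assumed import path


@hydra.main(config_path="pkg://rllm.trainer.config",  # assumed location of the default config
            config_name="agent_ppo_trainer", version_base=None)
def main(config):
    # Load pre-registered train/val splits (assumed dataset names).
    train_dataset = DatasetRegistry.load_dataset("math", "train")
    val_dataset = DatasetRegistry.load_dataset("math", "test")

    trainer = AgentTrainer(
        config=config,
        train_dataset=train_dataset,
        val_dataset=val_dataset,
    )
    trainer.train()


if __name__ == "__main__":
    main()
```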
LoRA Training
LoRA is enabled via Hydra overrides; no code change is needed:
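For example, the snippet below sketches the kind of overrides involved. Hydra apps read overrides from the command line (sys.argv), so the same strings can simply be appended to the `python train.py ...` invocation instead. The key paths mirror verl's actor_rollout_ref.model.* namespace and are assumptions here; check agent_ppo_trainer.yaml and the LoRA Configuration section below for the authoritative names.
```python
# Sketch only: enabling LoRA purely through Hydra overrides.
import sys

sys.argv += [
    "actor_rollout_ref.model.lora_rank=32",               # assumed key; 0 disables LoRA
    "actor_rollout_ref.model.lora_alpha=64",              # assumed key; LoRA scaling parameter
    "actor_rollout_ref.model.target_modules=all-linear",  # assumed key; modules to adapt
]
```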
Configuration
The verl backend uses Hydra configuration with defaults from agent_ppo_trainer.yaml:
Key Configuration Options
- Model path (HuggingFace or local)
- Rollout mode (must be “async” for the verl backend)
- Enable hybrid actor-rollout engine
- Training batch size per update step
- Maximum prompt length in tokens
- Maximum response length in tokens
- Advantage estimator: “grpo”, “gae”, or “reinforce”
- Discount factor for rewards
- GAE lambda parameter
- Number of training epochs
- Checkpoint save frequency (steps)
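The option descriptions above are listed without their config paths; the sketch below maps them onto the key names one would expect from verl's standard trainer layout, expressed with OmegaConf for illustration. Both the key paths and the values are assumptions, so agent_ppo_trainer.yaml remains the authoritative reference.
```python
# Sketch of configuration keys corresponding to the options above.
# Key paths and values are assumptions based on verl's standard layout.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "actor_rollout_ref": {
        "model": {"path": "Qwen/Qwen2.5-7B-Instruct"},  # model path (example)
        "hybrid_engine": True,                          # hybrid actor-rollout engine
        "rollout": {"mode": "async"},                   # must be "async" for verl
    },
    "data": {
        "train_batch_size": 128,      # batch size per update step
        "max_prompt_length": 2048,    # maximum prompt length in tokens
        "max_response_length": 4096,  # maximum response length in tokens
    },
    "algorithm": {
        "adv_estimator": "grpo",  # "grpo", "gae", or "reinforce"
        "gamma": 1.0,             # discount factor for rewards
        "lam": 1.0,               # GAE lambda parameter
    },
    "trainer": {
        "total_epochs": 1,  # number of training epochs
        "save_freq": 20,    # checkpoint save frequency (steps)
    },
})
print(OmegaConf.to_yaml(cfg))
```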
LoRA Configuration
- LoRA rank (0 disables LoRA)
- LoRA scaling parameter
- Modules to apply LoRA to (default: attention and MLP layers)
Vision-Language Models (VLM)
The verl backend supports multimodal models such as Qwen2-VL and Qwen3-VL:
train_vlm.py
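The cookbook's train_vlm.py is not reproduced here; relative to the math train.py sketch above, the VLM-specific delta is mainly configuration. The overrides below are assumptions (the model id is only an example, and data.return_multi_modal_inputs is the flag mentioned under Troubleshooting).
```python
# Sketch of VLM-specific overrides layered on top of the train.py sketch above.
import sys

sys.argv += [
    "actor_rollout_ref.model.path=Qwen/Qwen2-VL-7B-Instruct",  # multimodal model (example)
    "data.return_multi_modal_inputs=true",                     # pass image inputs through to the engine
]
```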
Distributed Training
The verl backend uses Ray for distributed training across multiple GPUs and nodes:
Multi-GPU Training
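GPU and node counts are typically set through the trainer config (in verl's standard layout, trainer.n_gpus_per_node and trainer.nnodes; these key names are assumptions here). Before launching a multi-node run, it is worth confirming that the Ray cluster actually sees the GPUs you expect, as sketched below.
```python
# Sanity check that the Ray cluster exposes the expected GPUs before training.
# Start the cluster first (e.g. `ray start --head` on the head node and
# `ray start --address=<head-ip>:6379` on workers); the GPU count is an example.
import ray

ray.init(address="auto")  # connect to the already running Ray cluster
resources = ray.available_resources()
print(resources)
assert resources.get("GPU", 0) >= 16, "fewer GPUs visible to Ray than expected"
```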
Resource Pool Configuration
- Number of GPUs for actor-rollout workers
- Number of GPUs for critic workers
- Number of GPUs for reference policy workers
Advanced Features
Step-wise Advantage
For multi-step agent trajectories, enable step-wise advantage computation:
Step-wise advantage modes:
- broadcast: Propagate the final advantage to all steps (recommended for GRPO)
- per_step: Compute advantages independently per step
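As a conceptual illustration of the difference between the two modes (this is not rLLM's implementation, and the baseline here is a stand-in for whatever advantage estimator is configured):
```python
# Conceptual sketch: "broadcast" assigns one trajectory-level advantage to
# every step, while "per_step" gives each step its own advantage.
from typing import List

def stepwise_advantages(step_rewards: List[float], baseline: float, mode: str) -> List[float]:
    if mode == "broadcast":
        trajectory_adv = sum(step_rewards) - baseline  # one advantage for the whole trajectory
        return [trajectory_adv] * len(step_rewards)
    if mode == "per_step":
        return [r - baseline for r in step_rewards]    # each step scored on its own reward
    raise ValueError(f"unknown mode: {mode}")

print(stepwise_advantages([0.0, 0.0, 1.0], baseline=0.5, mode="broadcast"))  # [0.5, 0.5, 0.5]
print(stepwise_advantages([0.0, 0.0, 1.0], baseline=0.5, mode="per_step"))   # [-0.5, -0.5, 0.5]
```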
Rejection Sampling
Filter out trajectory groups in which no solution is correct or all solutions are correct:
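Conceptually (this is not rLLM's implementation), the filter keeps only prompts with mixed outcomes, since uniformly wrong or uniformly correct groups contribute no signal under group-relative advantages:
```python
# Conceptual sketch of rejection sampling over per-prompt trajectory groups.
from typing import Dict, List

def reject_uninformative_groups(groups: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """groups maps prompt id -> per-trajectory correctness scores (0.0 or 1.0)."""
    kept = {}
    for prompt_id, scores in groups.items():
        if 0.0 < sum(scores) < len(scores):  # keep mixed outcomes only
            kept[prompt_id] = scores
    return kept

print(reject_uninformative_groups({
    "p1": [0.0, 0.0, 0.0, 0.0],  # all wrong   -> dropped
    "p2": [1.0, 1.0, 1.0, 1.0],  # all correct -> dropped
    "p3": [0.0, 1.0, 1.0, 0.0],  # mixed       -> kept
}))  # {'p3': [0.0, 1.0, 1.0, 0.0]}
```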
Compact Filtering
Filter trajectories based on termination reasons:
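A conceptual sketch of the idea follows; the termination-reason labels used here ("env_done", "max_response_length", "env_error") are hypothetical and only for illustration:
```python
# Conceptual sketch of compact filtering: keep or drop trajectories by why
# they terminated. Reason strings are hypothetical.
from typing import List, Tuple

ALLOWED_TERMINATIONS = {"env_done"}  # keep only cleanly finished trajectories

def compact_filter(trajectories: List[Tuple[str, str]]) -> List[str]:
    """trajectories is a list of (trajectory id, termination reason) pairs."""
    return [tid for tid, reason in trajectories if reason in ALLOWED_TERMINATIONS]

print(compact_filter([
    ("t1", "env_done"),             # kept
    ("t2", "max_response_length"),  # dropped: truncated response
    ("t3", "env_error"),            # dropped: environment failure
]))  # ['t1']
```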
Checkpointing
The verl backend automatically saves checkpoints during training:
- Location: {trainer.default_local_dir}/checkpoints/
- Frequency: Controlled by trainer.save_freq
- Resume: Automatically resumes from the latest checkpoint if available
Manual Checkpoint Loading
Monitoring
Configure logging backends:
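For example, using the same override pattern as in the earlier sketches (trainer.logger, trainer.project_name, and trainer.experiment_name follow verl's standard trainer config and are assumptions here; the project and run names are placeholders):
```python
# Sketch: selecting logging backends via Hydra overrides.
import sys

sys.argv += [
    "trainer.logger=['console','wandb']",       # log to stdout and Weights & Biases
    "trainer.project_name=rllm-math",           # placeholder project name
    "trainer.experiment_name=qwen2.5-7b-grpo",  # placeholder run name
]
```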
Key Metrics
- actor/entropy: Policy entropy
- actor/loss: Actor policy loss
- actor/ppo_ratio_mean: PPO clipping ratio
- critic/loss: Critic value loss
- critic/full-score/mean: Average trajectory reward
- val/test_score/*: Validation accuracy by data source
- training/global_step: Current training step
Performance Tips
Use Async Rollout
Always use rollout.mode=async for better throughput
Tune Batch Size
Increase train_batch_size to maximize GPU utilization
Enable FSDP
Use FSDP for models > 7B parameters
Optimize vLLM
Tune vLLM tensor parallel size and max tokens
Example Configuration
Complete configuration for training a math agent:
config.yaml
Troubleshooting
Out of Memory Errors
- Reduce data.train_batch_size
- Enable FSDP parameter offloading: actor_rollout_ref.actor.fsdp_config.param_offload=true
- Reduce data.max_prompt_length or data.max_response_length
- Use LoRA instead of full fine-tuning
Slow Training
- Increase data.train_batch_size if GPU memory allows
- Use rollout.mode=async (required for verl)
- Tune vLLM parameters: increase tensor_parallel_size
- Check Ray resource allocation: resource_pool_config.*
Ray Connection Errors
- Ensure Ray is properly initialized
- Check firewall settings for multi-node training
- Verify GPU availability: ray.available_resources()
VLM Training Issues
- Set data.return_multi_modal_inputs=true
- Install vision dependencies: qwen-vl-utils
- Verify the image processor is loaded correctly
- Check that the dataset provides images in the correct format
See Also
Tinker Backend
Alternative backend with async-first design
Backend Comparison
Compare verl vs tinker features
verl Documentation
Official verl repository and docs
Agent Trainer
Learn about AgentTrainer API

