Overview
verl is designed for large-scale distributed training with the following architecture:
- Actor-Rollout Workers: Handle policy updates and trajectory generation
- Critic Workers: Compute value estimates for advantage calculation
- Reference Policy Workers: Maintain frozen reference policy for KL divergence
- Ray-based Orchestration: Manages distributed worker groups and resource allocation
Key Features
Distributed Training
Multi-GPU and multi-node training with Ray-based orchestration
Hybrid Engine
Combined actor-rollout engine for efficient async trajectory generation
VLM Support
Native support for vision-language models (Qwen2-VL, Qwen3-VL)
LoRA Training
Parameter-efficient fine-tuning with LoRA adapters
Installation
Install rLLM with the verl backend:
Dependencies
The verl backend includes the following key dependencies (from pyproject.toml):
Python Version: Requires Python >= 3.10
Basic Usage
Agent Training
Train a math agent with the verl backend:
train_math_agent.py
LoRA Training
Enable LoRA for parameter-efficient training:
train_with_lora.py
Configuration
The verl backend uses Hydra configuration with defaults from agent_ppo_trainer.yaml:
Key Configuration Options
- Model path (HuggingFace or local)
- Rollout mode - must be “async” for verl backend
- Enable hybrid actor-rollout engine
- Training batch size per update step
- Maximum prompt length in tokens
- Maximum response length in tokens
- Advantage estimator: “grpo”, “gae”, or “reinforce”
- Discount factor for rewards
- GAE lambda parameter
- Number of training epochs
- Checkpoint save frequency (steps)
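To make the estimator choices above concrete, here is a minimal, self-contained sketch of GAE and GRPO-style group-normalized advantages (illustrative only, not verl's implementation):

```python
def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for one trajectory.

    rewards: per-step rewards; values: per-step value estimates.
    gamma is the discount factor, lam the GAE lambda parameter.
    """
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with 0 past the end of the episode.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages

def grpo_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward
    against the mean/std of its prompt group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

With gamma=1 and lam=1, GAE reduces to the Monte Carlo return minus the value baseline.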
LoRA Configuration
- LoRA rank (0 disables LoRA)
- LoRA scaling parameter
- Modules to apply LoRA (default: attention and MLP layers)
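For intuition about what the rank and scaling parameter control, here is a pure-Python sketch of a LoRA forward pass, y = Wx + (alpha/r)·B(Ax), where the base weight W stays frozen and only the low-rank factors A and B are trained (an illustration of the technique, not verl's code):

```python
def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, r):
    """LoRA-adapted linear layer: y = W x + (alpha / r) * B (A x).

    W: frozen d_out x d_in base weight
    A: trainable r x d_in down-projection
    B: trainable d_out x r up-projection
    """
    scale = alpha / r
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # rank-r update path
    return [b + scale * l for b, l in zip(base, low_rank)]
```

Setting the rank to 0 removes the update path entirely, which is why a rank of 0 disables LoRA.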
Vision-Language Models (VLM)
The verl backend supports multimodal models such as Qwen2-VL and Qwen3-VL:
train_vlm.py
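Since the training script itself is elided above, here is a hedged sketch of the config overrides VLM training typically needs. Only data.return_multi_modal_inputs appears elsewhere on this page; the other key paths and the model id are illustrative assumptions:

```yaml
# Sketch of Hydra overrides for VLM training; key paths other than
# data.return_multi_modal_inputs are assumptions about the schema.
data:
  return_multi_modal_inputs: true   # required for multimodal batches (see Troubleshooting)
actor_rollout_ref:
  model:
    path: Qwen/Qwen2-VL-7B-Instruct  # any supported VLM checkpoint
```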
Distributed Training
The verl backend uses Ray for distributed training across multiple GPUs and nodes:
Multi-GPU Training
Resource Pool Configuration
- Number of GPUs for actor-rollout workers
- Number of GPUs for critic workers
- Number of GPUs for reference policy workers
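A sketch of what such a pool layout might look like. The resource_pool_config prefix is referenced in Troubleshooting, but the worker-group field names and GPU counts below are assumptions:

```yaml
# Hypothetical resource pool layout; field names and counts are assumptions.
resource_pool_config:
  actor_rollout: 4   # GPUs for actor-rollout workers
  critic: 2          # GPUs for critic workers
  ref: 2             # GPUs for reference policy workers
```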
Advanced Features
Step-wise Advantage
For multi-step agent trajectories, enable step-wise advantage computation. Step-wise advantage modes:
- broadcast: Propagate final advantage to all steps (recommended for GRPO)
- per_step: Compute advantages independently per step
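The two modes can be sketched as follows (a minimal illustration of the semantics described above, not verl's implementation; the per-step baseline used here is an assumed stand-in):

```python
def broadcast_advantage(trajectory_advantage, num_steps):
    """'broadcast' mode: every step in a multi-step trajectory shares
    the trajectory-level advantage (recommended for GRPO)."""
    return [trajectory_advantage] * num_steps

def per_step_advantage(step_rewards, step_values):
    """'per_step' mode: an independent estimate per step; here simply
    reward minus a value baseline, as an illustrative stand-in."""
    return [r - v for r, v in zip(step_rewards, step_values)]
```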
Rejection Sampling
Filter out prompt groups whose sampled solutions are all incorrect or all correct, since neither provides a learning signal under group-relative advantages:
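A minimal sketch of this filter, assuming each prompt group carries a list of per-rollout correctness flags (the group structure is an assumption for illustration):

```python
def rejection_sample(groups):
    """Keep only prompt groups with a mix of correct and incorrect rollouts.

    groups: list of dicts, each with a "correct" list of booleans,
    one flag per sampled rollout for that prompt.
    """
    return [
        g for g in groups
        if 0 < sum(g["correct"]) < len(g["correct"])  # mixed outcomes only
    ]
```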
Compact Filtering
Filter trajectories based on termination reasons:
Checkpointing
The verl backend automatically saves checkpoints during training:
- Location: {trainer.default_local_dir}/checkpoints/
- Frequency: Controlled by trainer.save_freq
- Resume: Automatically resumes from the latest checkpoint if available
Manual Checkpoint Loading
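Assuming checkpoints land in step-numbered subdirectories (e.g. global_step_100; the naming is an assumption, not verl's documented layout), a hypothetical helper for locating the latest one:

```python
import os
import re

def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-numbered global_step_* subdirectory,
    or None if no checkpoint exists. Directory naming is an assumption."""
    pattern = re.compile(r"global_step_(\d+)$")
    if not os.path.isdir(checkpoint_dir):
        return None
    best, best_step = None, -1
    for name in os.listdir(checkpoint_dir):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_step:
            best_step = int(m.group(1))
            best = os.path.join(checkpoint_dir, name)
    return best
```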
Monitoring
Configure logging backends:
Key Metrics
- actor/entropy: Policy entropy
- actor/loss: Actor policy loss
- actor/ppo_ratio_mean: PPO clipping ratio
- critic/loss: Critic value loss
- critic/full-score/mean: Average trajectory reward
- val/test_score/*: Validation accuracy by data source
- training/global_step: Current training step
Performance Tips
Use Async Rollout
Always use rollout.mode=async for better throughput.
Tune Batch Size
Increase train_batch_size to maximize GPU utilization.
Enable FSDP
Use FSDP for models > 7B parameters
Optimize vLLM
Tune vLLM tensor parallel size and max tokens
Example Configuration
Complete configuration for training a math agent:
config.yaml
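The config file itself is elided above, so here is an illustrative sketch. The data.* keys, rollout.mode, trainer.save_freq, and trainer.default_local_dir appear elsewhere on this page; the remaining key paths and all values are assumptions:

```yaml
# Illustrative math-agent configuration; values and algorithm.* key
# paths are assumptions, not verl's canonical schema.
data:
  train_batch_size: 128
  max_prompt_length: 2048
  max_response_length: 4096
rollout:
  mode: async            # required for the verl backend
algorithm:
  adv_estimator: grpo    # or "gae" / "reinforce"
  gamma: 1.0
  lam: 1.0
trainer:
  total_epochs: 3
  save_freq: 20
  default_local_dir: ./outputs/math_agent
```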
Troubleshooting
Out of Memory Errors
- Reduce data.train_batch_size
- Enable FSDP parameter offloading: actor_rollout_ref.actor.fsdp_config.param_offload=true
- Reduce data.max_prompt_length or data.max_response_length
- Use LoRA instead of full fine-tuning
Slow Training
- Increase data.train_batch_size if GPU memory allows
- Use rollout.mode=async (required for verl)
- Tune vLLM parameters: increase tensor_parallel_size
- Check Ray resource allocation: resource_pool_config.*
Ray Connection Errors
- Ensure Ray is properly initialized
- Check firewall settings for multi-node training
- Verify GPU availability: ray.available_resources()
VLM Training Issues
- Set data.return_multi_modal_inputs=true
- Install vision dependencies: qwen-vl-utils
- Verify the image processor is loaded correctly
- Check that the dataset provides images in the correct format

