Overview
verl is designed for large-scale distributed training with the following architecture:
- Actor-Rollout Workers: Handle policy updates and trajectory generation
- Critic Workers: Compute value estimates for advantage calculation
- Reference Policy Workers: Maintain frozen reference policy for KL divergence
- Ray-based Orchestration: Manages distributed worker groups and resource allocation
How rLLM uses verl
rLLM drives verl from its own UnifiedTrainer rather than running verl’s RayPPOTrainer.fit() loop. The verl backend reuses verl’s worker groups, checkpoint engine, and async rollout manager, but the training lifecycle (batch shaping, advantage computation, weight sync, validation cadence) is owned by rLLM. As a result, a few verl features are intentionally not wired through on this path:
- No critic. use_critic is forced to False and no critic worker is spawned. Value-based estimators (GAE, REMAX) cannot run as-is.
- No reward model. reward.reward_model.enable=True is rejected at startup. Compute rewards inside your workflow via a RewardFunction so they flow through the same path as the Tinker backend.
- No in-reward KL. algorithm.use_kl_in_reward=True is rejected at startup. KL-in-loss is supported: setting rllm.algorithm.kl_beta>0 (or actor_rollout_ref.actor.kl_loss_coef>0) automatically enables it; the loss term runs inside verl’s actor worker.
- Async rollout only. actor_rollout_ref.rollout.mode must be async.
- New EngineWorker path only. trainer.use_legacy_worker_impl is forced to disable.
- Shared rLLM/Verl knobs. rLLM keeps a small table of shared keys (e.g. algorithm.adv_estimator ↔ rllm.algorithm.adv_estimator, actor.kl_loss_coef ↔ rllm.algorithm.kl_beta, actor.clip_ratio ↔ rllm.algorithm.eps_clip) in sync at runtime. New configs should use rllm.*. Existing Verl-native shared-key CLI overrides still work, but they warn because that path is deprecated. If both sides conflict, the rllm.* value wins. See Configuration for the full list.
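The rules in this list can be pictured as a startup pass over a flat, dotted-key config. The sketch below is illustrative only and is not rLLM's actual code: the function name check_and_normalize, the flat-dict representation, and the choice to treat a missing rollout mode as async are all assumptions; the keys themselves are copied from the list above.

```python
import warnings

# Shared keys named in the text: Verl-native key -> rLLM key.
SHARED_KEYS = {
    "algorithm.adv_estimator": "rllm.algorithm.adv_estimator",
    "actor.kl_loss_coef": "rllm.algorithm.kl_beta",
    "actor.clip_ratio": "rllm.algorithm.eps_clip",
}


def check_and_normalize(cfg: dict) -> dict:
    """Hypothetical helper: apply the documented rules to a flat config dict."""
    cfg = dict(cfg)  # work on a copy so the caller's dict is untouched

    # Features that are rejected at startup.
    if cfg.get("reward.reward_model.enable", False):
        raise ValueError("reward models are not supported; use a RewardFunction")
    if cfg.get("algorithm.use_kl_in_reward", False):
        raise ValueError("in-reward KL is not supported; use KL-in-loss instead")
    # Assumption: a missing mode is treated as already being 'async'.
    if cfg.get("actor_rollout_ref.rollout.mode", "async") != "async":
        raise ValueError("actor_rollout_ref.rollout.mode must be 'async'")

    # Values that are forced regardless of what the user passed.
    cfg["use_critic"] = False
    cfg["trainer.use_legacy_worker_impl"] = "disable"

    # Keep shared keys in sync; the rllm.* side wins on conflict.
    for verl_key, rllm_key in SHARED_KEYS.items():
        if verl_key in cfg:
            warnings.warn(f"{verl_key} is deprecated; prefer {rllm_key}")
            cfg.setdefault(rllm_key, cfg[verl_key])
        if rllm_key in cfg:
            cfg[verl_key] = cfg[rllm_key]
    return cfg
```

Passing both actor.clip_ratio=0.3 and rllm.algorithm.eps_clip=0.2 leaves both keys at 0.2, which mirrors the "rllm.* value wins" rule.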
Need an estimator that requires a critic or per-token signal? See Pre-computed advantages for writing per-token advantages directly in the workflow, and the Advantage estimator page for registering a custom rLLM-native estimator (with a worked OPO port).
Key Features
Distributed Training
Multi-GPU and multi-node training with Ray-based orchestration
Hybrid Engine
Combined actor-rollout engine for efficient async trajectory generation
VLM Support
Native support for vision-language models (Qwen2-VL, Qwen3-VL)
LoRA Training
Parameter-efficient fine-tuning with LoRA adapters
Installation
Install rLLM with the verl backend:
Megatron support: verl also supports Megatron for efficient large-scale training. Adding it requires a from-source install since the script lives in the rLLM repo:
This installs nvidia-modelopt, transformer-engine, megatron-core, megatron-bridge, and NVIDIA Apex. The CUDA version you pass here must match the --torch-backend flag in your rLLM install, e.g. cu128 for CUDA 12.8. Compilation may take a while.
Dependencies
The verl backend includes the following key dependencies (from pyproject.toml):
Python Version: Requires Python >= 3.10
Basic Usage
Agent Training
Train a math agent with the verl backend. The recommended path is to use the cookbooks/math cookbook: install once, and the trainer wires up an AgentFlow + Evaluator for you:
train.py
LoRA Training
LoRA is enabled via Hydra overrides, with no code change needed:
Supported advantage estimators
Advantages on the unified Verl backend are computed through rLLM’s native estimator hook. The built-in estimators are grpo, reinforce, reinforce_plus_plus_baseline, and rloo. Verl’s other estimators (gae, reinforce_plus_plus (proper), remax, opo, grpo_passk, gpg, gdpo, optimal_token_baseline, tir_optimal_token_baseline) are not available out of the box, and the corresponding algorithm.gamma / algorithm.lam knobs are no longer wired through. To use one of those, register a custom estimator via the rLLM registry; see Advantage estimator for the contract and a worked OPO port.
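To make two of the built-in names concrete, here is a minimal sketch of the textbook GRPO and RLOO formulas for one group of rollouts that share a prompt. It is not rLLM's implementation: the function names are hypothetical, and details such as using the population standard deviation and the eps value are assumptions that real implementations may choose differently.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out baseline: r_i minus the mean reward of the other rollouts."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# One prompt, four rollouts, binary correctness rewards.
group = [1.0, 0.0, 0.0, 1.0]
grpo = grpo_advantages(group)
rloo = rloo_advantages(group)
```

Both estimators need only the scalar rewards of the group, which is why they can run without a critic.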
Configuration
The verl backend uses Hydra configuration with defaults from agent_ppo_trainer.yaml:
Key Configuration Options
Model path (HuggingFace or local)
Rollout mode - must be “async” for verl backend
Enable hybrid actor-rollout engine
Training batch size per update step
Maximum prompt length in tokens
Maximum response length in tokens
Advantage estimator: grpo, reinforce, reinforce_plus_plus_baseline, or rloo
Number of training epochs
Checkpoint save frequency (steps)
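Since these options are set through Hydra, a nested dictionary can be flattened into key=value override strings. The sketch below is a hypothetical helper, not part of rLLM or verl; the keys are ones named elsewhere on this page, and every value is a placeholder rather than a recommendation.

```python
def to_hydra_overrides(options: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict into Hydra-style 'dotted.key=value' strings."""
    overrides = []
    for key, value in options.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_hydra_overrides(value, dotted))
        else:
            if isinstance(value, bool):
                value = str(value).lower()  # render booleans as true/false
            overrides.append(f"{dotted}={value}")
    return overrides


# Placeholder values for illustration only.
options = {
    "data": {
        "train_batch_size": 64,
        "max_prompt_length": 2048,
        "max_response_length": 2048,
    },
    "actor_rollout_ref": {"rollout": {"mode": "async"}},
    "rllm": {"algorithm": {"adv_estimator": "grpo"}},
    "trainer": {"save_freq": 20},
}
overrides = to_hydra_overrides(options)
```

The resulting list can be appended to whatever launch command you already use.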
LoRA Configuration
LoRA rank (0 disables LoRA)
LoRA scaling parameter
Modules to apply LoRA (default: attention and MLP layers)
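As a rough guide to what rank and the scaling parameter control, the sketch below computes the adapter size and effective scale for a single linear layer. It uses the conventional LoRA formulation (scale = alpha / rank, two low-rank matrices per adapted layer); whether the verl backend applies exactly this convention is an assumption, and lora_summary is a hypothetical name.

```python
def lora_summary(d_in: int, d_out: int, rank: int, alpha: float) -> dict:
    """Trainable parameter count and scale for one LoRA-adapted linear layer."""
    if rank == 0:
        # Rank 0 disables LoRA, so no adapter parameters are added.
        return {"enabled": False, "trainable_params": 0, "scale": 0.0}
    return {
        "enabled": True,
        # A is (rank x d_in) and B is (d_out x rank).
        "trainable_params": rank * (d_in + d_out),
        # Conventional LoRA scaling applied to the low-rank update.
        "scale": alpha / rank,
    }


summary = lora_summary(d_in=4096, d_out=4096, rank=16, alpha=32)
```

For a 4096 x 4096 projection, rank 16 adds 131,072 trainable parameters, compared with roughly 16.8 million in the frozen weight.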
Vision-Language Models (VLM)
The verl backend supports multimodal models like Qwen2-VL and Qwen3-VL:
train_vlm.py
Distributed Training
The verl backend uses Ray for distributed training across multiple GPUs and nodes:
Multi-GPU Training
Resource Pool Configuration
Number of GPUs for actor-rollout workers
Number of GPUs for reference policy workers
Advanced Features
Step-wise Advantage
For multi-step agent trajectories, enable step-wise advantage computation:
Step-wise advantage mode:
- broadcast: Propagate final advantage to all steps (recommended for GRPO)
- per_step: Compute advantages independently per step
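The difference between the two modes can be sketched on toy data. This is one plausible reading rather than rLLM's implementation: the trajectory layout (a final reward plus a list of step_rewards), the function name, and in particular the interpretation of per_step as running the estimator over the pooled step rewards are assumptions.

```python
from statistics import mean


def center(rewards: list[float]) -> list[float]:
    """Toy estimator: subtract the mean reward as a baseline."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def stepwise_advantages(trajectories: list[dict], mode: str) -> list[list[float]]:
    """Return one advantage per step for each trajectory in a group."""
    if mode == "broadcast":
        # One advantage per trajectory from its final reward, copied to every step.
        traj_adv = center([t["reward"] for t in trajectories])
        return [[a] * len(t["step_rewards"]) for a, t in zip(traj_adv, trajectories)]
    if mode == "per_step":
        # Assumed reading: each step is scored on its own reward.
        flat = [r for t in trajectories for r in t["step_rewards"]]
        flat_adv = center(flat)
        out, start = [], 0
        for t in trajectories:
            n = len(t["step_rewards"])
            out.append(flat_adv[start:start + n])
            start += n
        return out
    raise ValueError(f"unknown mode: {mode}")


group = [
    {"reward": 1.0, "step_rewards": [0.0, 1.0]},
    {"reward": 0.0, "step_rewards": [0.0, 0.0, 0.0]},
]
broadcast = stepwise_advantages(group, "broadcast")
per_step = stepwise_advantages(group, "per_step")
```

In broadcast mode every step of a trajectory shares the same value, which matches trajectory-level outcome rewards.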
Rejection Sampling
Filter out trajectories with no correct or all correct solutions:
Compact Filtering
Filter trajectories based on termination reasons:
Checkpointing
The verl backend automatically saves checkpoints during training:
- Location: {trainer.default_local_dir}/checkpoints/
- Frequency: Controlled by trainer.save_freq
- Resume: Automatically resumes from latest checkpoint if available
Manual Checkpoint Loading
Monitoring
Configure logging backends:
Key Metrics
- actor/entropy: Policy entropy
- actor/loss: Actor policy loss
- actor/ppo_ratio_mean: PPO clipping ratio
- critic/full-score/mean: Average trajectory reward
- val/test_score/*: Validation accuracy by data source
- training/global_step: Current training step
Performance Tips
Use Async Rollout
Always use rollout.mode=async for better throughput
Tune Batch Size
Increase train_batch_size to maximize GPU utilization
Enable FSDP
Use FSDP for models > 7B parameters
Optimize vLLM
Tune vLLM tensor parallel size and max tokens
Example Configuration
Complete configuration for training a math agent:
config.yaml
Troubleshooting
Out of Memory Errors
- Reduce data.train_batch_size
- Enable FSDP parameter offloading: actor_rollout_ref.actor.fsdp_config.param_offload=true
- Reduce data.max_prompt_length or data.max_response_length
- Use LoRA instead of full fine-tuning
Slow Training
- Increase data.train_batch_size if GPU memory allows
- Use rollout.mode=async (required for verl)
- Tune vLLM parameters: increase tensor_parallel_size
- Check Ray resource allocation: resource_pool_config.*
Ray Connection Errors
- Ensure Ray is properly initialized
- Check firewall settings for multi-node training
- Verify GPU availability: ray.available_resources()
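The GPU check in the last item can be scripted. The sketch below assumes a Ray cluster is already running and reachable with address="auto"; the helper names gpu_shortfall and check_cluster are hypothetical, while ray.init and ray.available_resources are Ray's own calls.

```python
def gpu_shortfall(resources: dict, required_gpus: int) -> int:
    """Return how many GPUs are missing (0 means the request fits)."""
    available = int(resources.get("GPU", 0))  # Ray omits the key when no GPU is free
    return max(0, required_gpus - available)


def check_cluster(required_gpus: int) -> None:
    """Attach to a running Ray cluster and verify that enough GPUs are free."""
    import ray  # imported lazily so gpu_shortfall works without Ray installed

    ray.init(address="auto")  # connect to an existing cluster instead of starting one
    resources = ray.available_resources()
    missing = gpu_shortfall(resources, required_gpus)
    if missing:
        raise RuntimeError(f"Ray reports {missing} fewer free GPUs than requested")
    print("GPU request fits:", resources)
```

Calling check_cluster(8) on the head node before launching training surfaces a resource mismatch early instead of leaving worker groups waiting for placement.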
VLM Training Issues
- Set data.return_multi_modal_inputs=true
- Install vision dependencies: qwen-vl-utils
- Verify image processor is loaded correctly
- Check dataset provides images in correct format
See Also
Tinker Backend
Alternative backend with async-first design
Backend Comparison
Compare verl vs tinker features
verl Documentation
Official verl repository and docs
Agent Trainer
Learn about AgentTrainer API

