The verl backend is rLLM’s high-performance distributed training backend built on top of verl (v0.6.1). It provides efficient distributed reinforcement learning for language agents with support for vLLM and SGLang inference engines.

Overview

verl is designed for large-scale distributed training with the following architecture:
  • Actor-Rollout Workers: Handle policy updates and trajectory generation
  • Critic Workers: Compute value estimates for advantage calculation
  • Reference Policy Workers: Maintain frozen reference policy for KL divergence
  • Ray-based Orchestration: Manages distributed worker groups and resource allocation

Key Features

Distributed Training

Multi-GPU and multi-node training with Ray-based orchestration

Hybrid Engine

Combined actor-rollout engine for efficient async trajectory generation

VLM Support

Native support for vision-language models (Qwen2-VL, Qwen3-VL)

LoRA Training

Parameter-efficient fine-tuning with LoRA adapters

Installation

Install rLLM with the verl backend:
uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git"
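To confirm the install succeeded, check that the key packages resolve to the expected versions. A minimal sanity check (the package names match the pyproject.toml extra listed under Dependencies below):
check_install.py
from importlib.metadata import version

import torch

# Report versions of the key verl-backend packages and confirm CUDA is visible.
for pkg in ("verl", "vllm", "flash-attn", "qwen-vl-utils"):
    print(pkg, version(pkg))
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())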

Dependencies

The verl backend includes the following key dependencies (from pyproject.toml):
verl = [
    "verl==0.6.1",
    "torch>=2.8.0",
    "torchvision>=0.23.0",
    "vllm>=0.10.2,<=0.11.0",
    "flash-attn>=2.8.1",
    "qwen-vl-utils",
]
Python Version: Requires Python >= 3.10

Basic Usage

Agent Training

Train a math agent with verl backend:
train_math_agent.py
import hydra
from rllm.agents.math_agent import MathAgent
from rllm.data.dataset import DatasetRegistry
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.rewards.reward_fn import math_reward_fn
from rllm.trainer.agent_trainer import AgentTrainer

@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("hendrycks_math", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")

    trainer = AgentTrainer(
        agent_class=MathAgent,
        agent_args={},
        env_args={"reward_fn": math_reward_fn},
        env_class=SingleTurnEnvironment,
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()
Run with:
python train_math_agent.py \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-Math-7B-Instruct \
  data.train_batch_size=16 \
  trainer.total_epochs=3
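The env_args above pass math_reward_fn as the environment's reward function. To experiment with your own scoring, the sketch below shows the general shape of a custom reward function; the argument names and return type are assumptions for illustration, so check rllm.rewards.reward_fn for the actual interface expected by SingleTurnEnvironment:
custom_reward.py
# Hypothetical sketch of a custom reward function. The (task_info, action)
# signature and float return value are assumptions; consult
# rllm.rewards.reward_fn for the real interface.
def exact_match_reward(task_info: dict, action: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth."""
    ground_truth = str(task_info.get("ground_truth", "")).strip()
    prediction = action.strip().split()[-1] if action.strip() else ""
    return 1.0 if prediction == ground_truth else 0.0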

LoRA Training

Enable LoRA for parameter-efficient training:
train_with_lora.py
import hydra
from rllm.agents.math_agent import MathAgent
from rllm.data.dataset import DatasetRegistry
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.rewards.reward_fn import math_reward_fn
from rllm.trainer.agent_trainer import AgentTrainer

@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("gsm8k", "train")
    test_dataset = DatasetRegistry.load_dataset("gsm8k", "test")

    trainer = AgentTrainer(
        agent_class=MathAgent,
        agent_args={},
        env_args={"reward_fn": math_reward_fn},
        env_class=SingleTurnEnvironment,
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()
Run with LoRA configuration:
python train_with_lora.py \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-Math-7B-Instruct \
  actor_rollout_ref.model.lora.rank=64 \
  actor_rollout_ref.model.lora.alpha=128 \
  data.train_batch_size=32
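To get a feel for what rank=64 and alpha=128 imply, the back-of-the-envelope calculation below estimates the trainable parameters added to a single linear layer and the effective scaling factor (alpha / rank in the standard LoRA formulation). The layer dimensions are illustrative placeholders, not values read from the model:
lora_math.py
# Rough estimate of trainable parameters added by a LoRA adapter to one
# linear layer, using the standard decomposition W + (alpha / rank) * B @ A.
rank, alpha = 64, 128
d_in, d_out = 3584, 3584          # hypothetical hidden sizes of one projection

trainable_per_layer = rank * (d_in + d_out)   # params in A (rank x d_in) + B (d_out x rank)
scaling = alpha / rank                        # effective LoRA scaling factor

print(f"trainable params per adapted layer: {trainable_per_layer:,}")
print(f"LoRA scaling (alpha / rank): {scaling}")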

Configuration

The verl backend uses Hydra configuration with defaults from agent_ppo_trainer.yaml:

Key Configuration Options

  • actor_rollout_ref.model.path (string, required): Model path (HuggingFace or local)
  • actor_rollout_ref.rollout.mode (string, default: "async"): Rollout mode; must be "async" for the verl backend
  • actor_rollout_ref.hybrid_engine (boolean, default: true): Enable the hybrid actor-rollout engine
  • data.train_batch_size (integer, default: 64): Training batch size per update step
  • data.max_prompt_length (integer, default: 2048): Maximum prompt length in tokens
  • data.max_response_length (integer, default: 2048): Maximum response length in tokens
  • algorithm.adv_estimator (string, default: "grpo"): Advantage estimator: "grpo", "gae", or "reinforce"
  • algorithm.gamma (float, default: 1.0): Discount factor for rewards
  • algorithm.lam (float, default: 0.95): GAE lambda parameter
  • trainer.total_epochs (integer, default: 3): Number of training epochs
  • trainer.save_freq (integer, default: 100): Checkpoint save frequency (steps)
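Because training is driven by Hydra, you can inspect the resolved defaults without launching a run. The sketch below uses Hydra's compose API against the packaged config module (the same config_path the training scripts reference); adjust the overrides to match your experiment:
inspect_config.py
from hydra import compose, initialize_config_module
from omegaconf import OmegaConf

# Load the packaged agent_ppo_trainer defaults, apply a few overrides, and
# print the resolved algorithm section to see what the trainer will use.
with initialize_config_module(config_module="rllm.trainer.config", version_base=None):
    config = compose(
        config_name="agent_ppo_trainer",
        overrides=[
            "data.train_batch_size=32",
            "algorithm.adv_estimator=grpo",
        ],
    )
print(OmegaConf.to_yaml(config.algorithm))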

LoRA Configuration

  • actor_rollout_ref.model.lora.rank (integer, default: 0): LoRA rank (0 disables LoRA)
  • actor_rollout_ref.model.lora.alpha (integer, default: 16): LoRA scaling parameter
  • actor_rollout_ref.model.lora.target_modules (list): Modules to apply LoRA to (default: attention and MLP layers)
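If you set target_modules explicitly instead of relying on the default, the value is a list of module names as they appear in the model architecture. The list below shows typical projection names for Qwen2-style decoder layers; treat it as an illustration and verify against your model:
# Hypothetical value for actor_rollout_ref.model.lora.target_modules.
# These are the standard Hugging Face module names in Qwen2-style layers;
# confirm them against the actual model before use.
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
    "gate_proj", "up_proj", "down_proj",      # MLP projections
]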

Vision-Language Models (VLM)

The verl backend supports multimodal models such as Qwen2-VL and Qwen3-VL:
train_vlm.py
import hydra
from examples.geo3k.geo3k_workflow import Geo3KWorkflow
from rllm.data.dataset import DatasetRegistry
from rllm.rewards.reward_fn import f1_reward_fn
from rllm.trainer.agent_trainer import AgentTrainer

@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("latex_ocr", "train")
    test_dataset = DatasetRegistry.load_dataset("latex_ocr", "test")

    trainer = AgentTrainer(
        workflow_class=Geo3KWorkflow,
        workflow_args={"reward_function": f1_reward_fn},
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()
Run with VLM:
python train_vlm.py \
  actor_rollout_ref.model.path=Qwen/Qwen2-VL-7B-Instruct \
  data.return_multi_modal_inputs=true
When training VLMs, ensure data.return_multi_modal_inputs=true is set and the dataset provides image inputs.
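Each VLM example's prompt is a multimodal chat message that carries image content. The sketch below uses qwen_vl_utils.process_vision_info (installed via the qwen-vl-utils dependency) to extract image inputs from such a message; the message layout shown is the generic Qwen-VL chat format and may differ from how rLLM datasets store images:
vlm_message_check.py
# Minimal sketch: build a Qwen-VL style multimodal message and extract its
# image inputs with qwen-vl-utils. The image path is a placeholder.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/diagram.png"},
            {"type": "text", "text": "Write the LaTeX for this expression."},
        ],
    }
]

image_inputs, video_inputs = process_vision_info(messages)
print(f"{len(image_inputs)} image(s) extracted, videos: {video_inputs}")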

Distributed Training

The verl backend uses Ray for distributed training across multiple GPUs and nodes:

Multi-GPU Training

python train_agent.py \
  actor_rollout_ref.actor.fsdp_config.param_offload=false \
  actor_rollout_ref.actor.fsdp_config.grad_offload=false \
  resource_pool_config.actor_rollout_gpu=4  # Use 4 GPUs

Resource Pool Configuration

  • resource_pool_config.actor_rollout_gpu (integer): Number of GPUs for actor-rollout workers
  • resource_pool_config.critic_gpu (integer): Number of GPUs for critic workers
  • resource_pool_config.ref_policy_gpu (integer): Number of GPUs for reference policy workers
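Before sizing the resource pools, confirm what Ray can actually see on the cluster. A minimal check that starts (or attaches to) Ray and reports schedulable resources:
check_ray.py
import ray

# Start a local Ray instance (or attach to an existing cluster via RAY_ADDRESS)
# and report the resources Ray can schedule on.
ray.init(ignore_reinit_error=True)

resources = ray.available_resources()
print("GPUs:", resources.get("GPU", 0))
print("CPUs:", resources.get("CPU", 0))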

Advanced Features

Step-wise Advantage

For multi-step agent trajectories, enable step-wise advantage computation:
python train_agent.py \
  rllm.stepwise_advantage.enable=true \
  rllm.stepwise_advantage.mode=broadcast \
  rllm.agent.max_steps=20
Step-wise advantage modes:
  • broadcast: Propagate final advantage to all steps (recommended for GRPO)
  • per_step: Compute advantages independently per step
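Conceptually, broadcast mode computes one advantage per trajectory and assigns it to every step, while per_step treats each step as its own credit-assignment unit. A simplified sketch of the broadcast idea (illustrative only, not verl's actual implementation):
# Simplified illustration of the "broadcast" step-wise advantage mode:
# one trajectory-level advantage is copied to every step of the trajectory.
def broadcast_advantage(trajectory_advantage: float, num_steps: int) -> list[float]:
    return [trajectory_advantage] * num_steps

print(broadcast_advantage(trajectory_advantage=0.37, num_steps=5))
# [0.37, 0.37, 0.37, 0.37, 0.37]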

Rejection Sampling

Filter out prompts whose sampled trajectories are either all incorrect or all correct:
python train_agent.py \
  rllm.rejection_sample.enable=true \
  rllm.rejection_sample.multiplier=2  # Generate 2x trajectories per prompt
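The intuition is that prompts whose sampled trajectories all succeed, or all fail, provide no learning signal for a group-relative estimator such as GRPO, so they are dropped. A conceptual sketch of the filter, assuming binary rewards (illustrative, not verl's implementation):
# Keep only prompt groups whose sampled trajectories are neither all correct
# nor all incorrect. Assumes binary 0/1 rewards.
def keep_prompt_group(rewards: list[float]) -> bool:
    return 0.0 < sum(rewards) < len(rewards)

print(keep_prompt_group([1, 1, 1, 1]))  # False: all correct, no signal
print(keep_prompt_group([0, 0, 0, 0]))  # False: all incorrect, no signal
print(keep_prompt_group([1, 0, 1, 0]))  # True: mixed outcomes, useful gradient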

Compact Filtering

Filter trajectories based on termination reasons:
rllm:
  compact_filtering:
    enable: true
    mask_timeout: true
    mask_error: true
    mask_max_turns_exceeded: true
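Each mask_* flag excludes trajectories that ended for the corresponding reason from the loss. A conceptual sketch of deriving such a mask from termination reasons (the reason strings are hypothetical labels, not rLLM's actual field values):
# Build a keep/drop mask from trajectory termination reasons. The reason
# strings here are illustrative placeholders.
MASKED_REASONS = {"timeout", "error", "max_turns_exceeded"}

def keep_trajectory(termination_reason: str) -> bool:
    return termination_reason not in MASKED_REASONS

reasons = ["completed", "timeout", "completed", "error"]
print([keep_trajectory(r) for r in reasons])  # [True, False, True, False]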

Checkpointing

The verl backend automatically saves checkpoints during training:
  • Location: {trainer.default_local_dir}/checkpoints/
  • Frequency: Controlled by trainer.save_freq
  • Resume: Automatically resumes from latest checkpoint if available

Manual Checkpoint Loading

python train_agent.py \
  trainer.default_local_dir=/path/to/checkpoint/dir
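To see which checkpoints exist before resuming, list the checkpoint directory directly. The sketch below assumes only the location described above ({trainer.default_local_dir}/checkpoints/); the naming of individual checkpoint folders depends on the trainer:
list_checkpoints.py
from pathlib import Path

# List checkpoint folders under the trainer's local dir, newest first.
ckpt_root = Path("/path/to/checkpoint/dir") / "checkpoints"
for ckpt in sorted(ckpt_root.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True):
    print(ckpt.name)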

Monitoring

Configure logging backends:
trainer:
  logger: ["console", "wandb", "tensorboard"]
  project_name: "my-project"
  experiment_name: "math-agent-v1"

Key Metrics

  • actor/entropy: Policy entropy
  • actor/loss: Actor policy loss
  • actor/ppo_ratio_mean: PPO clipping ratio
  • critic/loss: Critic value loss
  • critic/full-score/mean: Average trajectory reward
  • val/test_score/*: Validation accuracy by data source
  • training/global_step: Current training step

Performance Tips

Use Async Rollout

Always use rollout.mode=async for better throughput

Tune Batch Size

Increase train_batch_size to maximize GPU utilization

Enable FSDP

Use FSDP for models > 7B parameters

Optimize vLLM

Tune vLLM tensor parallel size and max tokens

Example Configuration

Complete configuration for training a math agent:
config.yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-Math-7B-Instruct
    lora:
      rank: 64
      alpha: 128
  rollout:
    mode: async
    n: 16  # Generate 16 trajectories per prompt
    val_kwargs:
      n: 4  # Generate 4 trajectories for validation

data:
  train_batch_size: 32
  max_prompt_length: 2048
  max_response_length: 2048

algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 0.95

trainer:
  total_epochs: 3
  save_freq: 100
  test_freq: 50
  logger: ["wandb"]
  project_name: "math-rl"
  experiment_name: "qwen-math-7b"

rllm:
  stepwise_advantage:
    enable: false
  rejection_sample:
    enable: true
    multiplier: 2

Troubleshooting

Out of Memory

  • Reduce data.train_batch_size
  • Enable FSDP parameter offloading: actor_rollout_ref.actor.fsdp_config.param_offload=true
  • Reduce data.max_prompt_length or data.max_response_length
  • Use LoRA instead of full fine-tuning

Low Throughput

  • Increase data.train_batch_size if GPU memory allows
  • Use rollout.mode=async (required for verl)
  • Tune vLLM parameters: increase tensor_parallel_size
  • Check Ray resource allocation: resource_pool_config.*

Ray / Multi-Node Issues

  • Ensure Ray is properly initialized
  • Check firewall settings for multi-node training
  • Verify GPU availability: ray.available_resources()

VLM Issues

  • Set data.return_multi_modal_inputs=true
  • Install vision dependencies: qwen-vl-utils
  • Verify the image processor is loaded correctly
  • Check that the dataset provides images in the correct format

See Also