verl Backend

The verl backend is rLLM’s high-performance distributed training backend built on top of verl (v0.6.1). It provides efficient distributed reinforcement learning for language agents with support for vLLM and SGLang inference engines.

Overview

verl is designed for large-scale distributed training with the following architecture:

Actor-Rollout Workers: Handle policy updates and trajectory generation
Critic Workers: Compute value estimates for advantage calculation
Reference Policy Workers: Maintain frozen reference policy for KL divergence
Ray-based Orchestration: Manages distributed worker groups and resource allocation

Key Features

Distributed Training

Multi-GPU and multi-node training with Ray-based orchestration

Hybrid Engine

Combined actor-rollout engine for efficient async trajectory generation

VLM Support

Native support for vision-language models (Qwen2-VL, Qwen3-VL)

LoRA Training

Parameter-efficient fine-tuning with LoRA adapters

Installation

Install rLLM with the verl backend:

uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git"

Megatron support — verl also supports Megatron for efficient large-scale training. Adding it requires a from-source install since the script lives in the rLLM repo:

bash scripts/install_megatron.sh <cu128|cu129|cu130>

This installs nvidia-modelopt, transformer-engine, megatron-core, megatron-bridge, and NVIDIA Apex. The CUDA version you pass here must match the --torch-backend flag in your rLLM install: e.g. cu128 for CUDA 12.8. Compilation may take a while.
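To illustrate the tag convention, the helper below converts a CUDA version string into the cuXYZ form the script expects. The function name cuda_tag and the SUPPORTED_TAGS set are hypothetical, introduced only for this sketch; the three tags come from the usage line above.

```python
# Tags accepted by scripts/install_megatron.sh according to the usage line above.
SUPPORTED_TAGS = {"cu128", "cu129", "cu130"}


def cuda_tag(cuda_version: str) -> str:
    """Convert a CUDA version such as '12.8' into a tag such as 'cu128'."""
    major, minor = cuda_version.strip().split(".")[:2]
    tag = f"cu{int(major)}{int(minor)}"
    if tag not in SUPPORTED_TAGS:
        raise ValueError(f"{tag} is not one of {sorted(SUPPORTED_TAGS)}")
    return tag
```

With PyTorch installed, torch.version.cuda reports the CUDA version the wheel was built against, so cuda_tag(torch.version.cuda) gives a quick way to pick the matching argument.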

Dependencies

The verl backend includes the following key dependencies (from pyproject.toml):

verl = [
    "verl==0.6.1",
    "torch>=2.8.0",
    "torchvision>=0.23.0",
    "vllm>=0.10.2,<=0.11.0",
    "flash-attn>=2.8.1",
    "qwen-vl-utils",
]

Python Version: Requires Python >= 3.10
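A quick way to confirm that an environment meets the lower bounds listed above is to compare installed versions against them. The sketch below uses only the standard library; the helper names are hypothetical, and it checks minimums only (the exact verl pin and the vllm upper bound are left out).

```python
import re
from importlib import metadata

# Lower bounds copied from the dependency list above.
MINIMUMS = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "vllm": "0.10.2",
    "flash-attn": "2.8.1",
}


def release_tuple(version: str) -> tuple[int, ...]:
    """Keep the leading numeric release part: '2.8.0+cu128' -> (2, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad to three components so that '2.8' compares equal to '2.8.0'.
    return parts + (0,) * (3 - len(parts))


def below_minimum(installed: dict[str, str]) -> list[str]:
    """Return the packages whose installed version is older than its minimum."""
    return [
        name
        for name, floor in MINIMUMS.items()
        if name in installed and release_tuple(installed[name]) < release_tuple(floor)
    ]


def installed_versions() -> dict[str, str]:
    """Look up the installed version of each listed package, skipping missing ones."""
    found = {}
    for name in MINIMUMS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

Calling below_minimum(installed_versions()) in the training environment returns an empty list when every listed package that is installed satisfies its lower bound.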

Basic Usage

Agent Training

Train a math agent with the verl backend. The recommended path is the cookbooks/math cookbook: install it once, and the trainer wires up an AgentFlow and Evaluator for you:

train.py

import hydra
from omegaconf import DictConfig

from math_flow import math_flow                # from cookbooks/math/
from math_eval import math_evaluator           # from cookbooks/math/

from rllm.data.dataset import DatasetRegistry
from rllm.experimental.unified_trainer import AgentTrainer

@hydra.main(config_path="pkg://rllm.experimental.config", config_name="unified", version_base=None)
def main(config: DictConfig):
    train_dataset = DatasetRegistry.load_dataset("hendrycks_math", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")

    trainer = AgentTrainer(
        backend="verl",
        agent_flow=math_flow,
        evaluator=math_evaluator,
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()

Run via the cookbook’s pre-baked verl launch script:

bash cookbooks/math/train_verl.sh \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-Math-7B-Instruct \
  data.train_batch_size=16 \
  trainer.total_epochs=3

LoRA Training

LoRA is enabled via Hydra overrides — no code change needed:

bash cookbooks/math/train_verl.sh \
  +actor_rollout_ref.model.lora.rank=32 \
  +actor_rollout_ref.model.lora.alpha=32 \
  +actor_rollout_ref.model.lora.merge=true \
  data.train_batch_size=16

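With a custom training script (train_with_lora.py is a placeholder name here), pass the model path and LoRA settings as overrides: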

python train_with_lora.py \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-Math-7B-Instruct \
  actor_rollout_ref.model.lora.rank=64 \
  actor_rollout_ref.model.lora.alpha=128 \
  data.train_batch_size=32

Configuration

The verl backend uses Hydra configuration with defaults from agent_ppo_trainer.yaml:

Key Configuration Options

actor_rollout_ref.model.path (string, required): Model path (HuggingFace or local)
actor_rollout_ref.rollout.mode (string, default: "async"): Rollout mode; must be "async" for the verl backend
actor_rollout_ref.hybrid_engine (boolean, default: true): Enable hybrid actor-rollout engine
data.train_batch_size (integer, default: 64): Training batch size per update step
data.max_prompt_length (integer, default: 2048): Maximum prompt length in tokens
data.max_response_length (integer, default: 2048): Maximum response length in tokens
algorithm.adv_estimator (string, default: "grpo"): Advantage estimator: "grpo", "gae", or "reinforce"
algorithm.gamma (float, default: 1.0): Discount factor for rewards
algorithm.lam (float, default: 0.95): GAE lambda parameter
trainer.total_epochs (integer, default: 3): Number of training epochs
trainer.save_freq (integer, default: 100): Checkpoint save frequency (steps)
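The algorithm.gamma and algorithm.lam options matter most when algorithm.adv_estimator is set to "gae". The sketch below is the textbook Generalized Advantage Estimation recurrence in plain Python, not verl's implementation; the function name and the list-based inputs are assumptions made for illustration. The value estimates are what the critic workers would supply.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Textbook GAE over one trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    bootstrap value after the final step (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted advantage of the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With gamma=1.0 and a reward only on the last step, lam controls how quickly credit decays toward earlier steps: lam=1.0 gives every step the full return minus its value estimate, while lam=0.0 reduces to the one-step TD residual.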

LoRA Configuration

actor_rollout_ref.model.lora.rank (integer, default: 0): LoRA rank (0 disables LoRA)
actor_rollout_ref.model.lora.alpha (integer, default: 16): LoRA scaling parameter
actor_rollout_ref.model.lora.target_modules (list): Modules to apply LoRA to (default: attention and MLP layers)
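To get a feel for how rank and alpha interact, the sketch below assumes the standard LoRA formulation, in which the low-rank update is scaled by alpha / rank and each adapted weight matrix of shape d_out x d_in gains rank * (d_in + d_out) trainable parameters. The helper names are hypothetical, and the adapter used by the backend may scale differently (for example rank-stabilized variants).

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / rank."""
    if rank <= 0:
        raise ValueError("rank must be positive; rank 0 means LoRA is disabled")
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)
```

Under these assumptions, rank=64 with alpha=128 gives a scaling of 2.0, and a rank-32 adapter on a 4096 x 4096 projection trains 262,144 parameters instead of roughly 16.8 million.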

Vision-Language Models (VLM)

The verl backend supports multimodal models like Qwen2-VL and Qwen3-VL:

train_vlm.py

import hydra
from examples.geo3k.geo3k_workflow import Geo3KWorkflow
from rllm.data.dataset import DatasetRegistry
from rllm.rewards.reward_fn import f1_reward_fn
from rllm.trainer.agent_trainer import AgentTrainer

@hydra.main(
    config_path="pkg://rllm.trainer.config",
    config_name="agent_ppo_trainer",
    version_base=None
)
def main(config):
    train_dataset = DatasetRegistry.load_dataset("latex_ocr", "train")
    test_dataset = DatasetRegistry.load_dataset("latex_ocr", "test")

    trainer = AgentTrainer(
        workflow_class=Geo3KWorkflow,
        workflow_args={"reward_function": f1_reward_fn},
        config=config,
        train_dataset=train_dataset,
        val_dataset=test_dataset,
    )
    trainer.train()

if __name__ == "__main__":
    main()

Run with VLM:

python train_vlm.py \
  actor_rollout_ref.model.path=Qwen/Qwen2-VL-7B-Instruct \
  data.return_multi_modal_inputs=true

When training VLMs, ensure data.return_multi_modal_inputs=true is set and the dataset provides image inputs.

Distributed Training

The verl backend uses Ray for distributed training across multiple GPUs and nodes:

Multi-GPU Training

python train_agent.py \
  actor_rollout_ref.actor.fsdp_config.param_offload=false \
  actor_rollout_ref.actor.fsdp_config.grad_offload=false \
  resource_pool_config.actor_rollout_gpu=4  # Use 4 GPUs

Resource Pool Configuration

resource_pool_config.actor_rollout_gpu (integer): Number of GPUs for actor-rollout workers
resource_pool_config.critic_gpu (integer): Number of GPUs for critic workers
resource_pool_config.ref_policy_gpu (integer): Number of GPUs for reference policy workers

Advanced Features

Step-wise Advantage

For multi-step agent trajectories, enable step-wise advantage computation:

python train_agent.py \
  rllm.stepwise_advantage.enable=true \
  rllm.stepwise_advantage.mode=broadcast \
  rllm.agent.max_steps=20

Step-wise advantage modes:

broadcast: Propagate final advantage to all steps (recommended for GRPO)
per_step: Compute advantages independently per step
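The broadcast mode can be pictured as follows: compute one advantage per trajectory, then copy it onto every step of that trajectory. The sketch below pairs this with a group-relative (GRPO-style) advantage using the population standard deviation; both function names are hypothetical, and rLLM's actual normalization may differ. The per_step mode is not sketched here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(trajectory_advantages, steps_per_trajectory):
    """Give every step of a trajectory that trajectory's final advantage."""
    if len(trajectory_advantages) != len(steps_per_trajectory):
        raise ValueError("need one step count per trajectory")
    return [[adv] * n for adv, n in zip(trajectory_advantages, steps_per_trajectory)]
```

For four trajectories sampled from the same prompt with final rewards [1, 0, 1, 0] and step counts [2, 1, 3, 1], the first trajectory's two steps both receive an advantage close to +1, and the second trajectory's single step receives one close to -1.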

Rejection Sampling

Drop prompt groups in which none or all of the sampled trajectories are correct:

python train_agent.py \
  rllm.rejection_sample.enable=true \
  rllm.rejection_sample.multiplier=2  # Generate 2x trajectories per prompt
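The reason for this filter is that, under a group-relative estimator such as grpo, a group whose trajectories all receive the same reward yields zero advantage for every member and so contributes no gradient signal. A minimal sketch of the filtering rule, with hypothetical function names and a plain dict of correctness flags standing in for real trajectory data:

```python
def keep_group(correct_flags: list[bool]) -> bool:
    """A group is informative only if it mixes correct and incorrect trajectories."""
    return any(correct_flags) and not all(correct_flags)


def filter_groups(groups: dict[str, list[bool]]) -> dict[str, list[bool]]:
    """Keep only the prompts whose sampled trajectories disagree on correctness."""
    return {prompt_id: flags for prompt_id, flags in groups.items() if keep_group(flags)}
```

Because some groups are discarded, generating extra trajectories up front (the multiplier setting) helps keep the batch full after filtering.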

Compact Filtering

Filter trajectories based on termination reasons:

rllm:
  compact_filtering:
    enable: true
    mask_timeout: true
    mask_error: true
    mask_max_turns_exceeded: true
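One way to read these flags is as a per-trajectory mask keyed on the termination reason. The sketch below is an assumption about the behavior, not rLLM's code: the Termination enum and the loss_mask function are hypothetical, and whether the backend zeroes the loss or drops the trajectory outright is not stated here.

```python
from enum import Enum


class Termination(Enum):
    # Hypothetical labels mirroring the mask_* flags in the config above.
    DONE = "done"
    TIMEOUT = "timeout"
    ERROR = "error"
    MAX_TURNS_EXCEEDED = "max_turns_exceeded"


def loss_mask(reasons, mask_timeout=True, mask_error=True, mask_max_turns_exceeded=True):
    """Return 1 for trajectories that should contribute to the loss, 0 otherwise."""
    masked = set()
    if mask_timeout:
        masked.add(Termination.TIMEOUT)
    if mask_error:
        masked.add(Termination.ERROR)
    if mask_max_turns_exceeded:
        masked.add(Termination.MAX_TURNS_EXCEEDED)
    return [0 if reason in masked else 1 for reason in reasons]
```

Turning a flag off lets trajectories with that termination reason train normally, which can be useful when, say, errors are part of the behavior the agent should learn to avoid.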

Checkpointing

The verl backend automatically saves checkpoints during training:

Location: {trainer.default_local_dir}/checkpoints/
Frequency: Controlled by trainer.save_freq
Resume: Automatically resumes from latest checkpoint if available
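To check which checkpoint a resumed run would likely pick up, a small script can look for the highest step number under the checkpoint directory. The sketch below assumes step directories are named global_step_<N>; that naming and the helper name are assumptions, so compare against what your run actually writes.

```python
import re
from pathlib import Path

# Assumed directory naming; adjust if your checkpoints are laid out differently.
STEP_DIR = re.compile(r"global_step_(\d+)")


def latest_checkpoint(checkpoint_root):
    """Return the step directory with the highest step number, or None if there is none."""
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None
    best, best_step = None, -1
    for child in root.iterdir():
        match = STEP_DIR.fullmatch(child.name)
        # Compare numerically so that step 100 ranks above step 20.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best, best_step = child, int(match.group(1))
    return best
```

Pass it the path that {trainer.default_local_dir}/checkpoints/ resolves to in your configuration.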

Manual Checkpoint Loading

python train_agent.py \
  trainer.default_local_dir=/path/to/checkpoint/dir

Monitoring

Configure logging backends:

trainer:
  logger: ["console", "wandb", "tensorboard"]
  project_name: "my-project"
  experiment_name: "math-agent-v1"

Key Metrics

actor/entropy: Policy entropy
actor/loss: Actor policy loss
actor/ppo_ratio_mean: PPO clipping ratio
critic/loss: Critic value loss
critic/full-score/mean: Average trajectory reward
val/test_score/*: Validation accuracy by data source
training/global_step: Current training step

Performance Tips

Use Async Rollout

Use actor_rollout_ref.rollout.mode=async (the default, and required by the verl backend) for better throughput

Tune Batch Size

Increase data.train_batch_size to maximize GPU utilization

Enable FSDP

Use FSDP for models > 7B parameters

Optimize vLLM

Tune vLLM tensor parallel size and max tokens

Example Configuration

Complete configuration for training a math agent:

config.yaml

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-Math-7B-Instruct
    lora:
      rank: 64
      alpha: 128
  rollout:
    mode: async
    n: 16  # Generate 16 trajectories per prompt
    val_kwargs:
      n: 4  # Generate 4 trajectories for validation

data:
  train_batch_size: 32
  max_prompt_length: 2048
  max_response_length: 2048

algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 0.95

trainer:
  total_epochs: 3
  save_freq: 100
  test_freq: 50
  logger: ["wandb"]
  project_name: "math-rl"
  experiment_name: "qwen-math-7b"

rllm:
  stepwise_advantage:
    enable: false
  rejection_sample:
    enable: true
    multiplier: 2

Troubleshooting

Out of Memory Errors

Reduce data.train_batch_size
Enable FSDP parameter offloading: actor_rollout_ref.actor.fsdp_config.param_offload=true
Reduce data.max_prompt_length or data.max_response_length
Use LoRA instead of full fine-tuning
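A rough upper bound on the tokens handled per training step shows why the first and third items help. The sketch assumes every prompt is expanded into rollout.n single-turn trajectories, each filling the full prompt and response budget; multi-step agents and micro-batching change the picture, so treat the number as a relative guide rather than a memory estimate. The function name is hypothetical.

```python
def max_tokens_per_step(train_batch_size, rollouts_per_prompt, max_prompt_length, max_response_length):
    """Upper bound on tokens in one step if every sequence uses its full length budget."""
    sequences = train_batch_size * rollouts_per_prompt
    return sequences * (max_prompt_length + max_response_length)
```

With the example configuration (batch size 32, n=16, 2048 + 2048 tokens) the bound is 2,097,152 tokens per step; halving data.max_response_length lowers it to 1,572,864.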

Slow Training

Increase data.train_batch_size if GPU memory allows
Use actor_rollout_ref.rollout.mode=async (required for verl)
Tune vLLM parameters: increase tensor_parallel_size
Check Ray resource allocation: resource_pool_config.*

Ray Connection Errors

Ensure Ray is properly initialized
Check firewall settings for multi-node training
Verify GPU availability: ray.available_resources()
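The GPU check in the last item can be scripted. In the sketch below, gpu_shortfall works on a plain dict of the shape returned by ray.available_resources(), and check_cluster attaches to an already running cluster before calling it; both helper names and the required_gpus parameter are hypothetical.

```python
def gpu_shortfall(available: dict, required_gpus: int) -> float:
    """Number of GPUs still missing (0.0 if the cluster has enough)."""
    return max(0.0, required_gpus - float(available.get("GPU", 0.0)))


def check_cluster(required_gpus: int) -> float:
    """Attach to a running Ray cluster and report the GPU shortfall."""
    import ray  # imported lazily so gpu_shortfall can be used without Ray installed

    if not ray.is_initialized():
        ray.init(address="auto")  # connect to an existing cluster instead of starting one
    return gpu_shortfall(ray.available_resources(), required_gpus)
```

A non-zero result from check_cluster(4) before launching a 4-GPU job points to missing nodes or GPUs already claimed by another job.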

VLM Training Issues

Set data.return_multi_modal_inputs=true
Install vision dependencies: qwen-vl-utils
Verify image processor is loaded correctly
Check dataset provides images in correct format

See Also

Tinker Backend

Alternative backend with async-first design

Backend Comparison

Compare verl vs tinker features

verl Documentation

Official verl repository and docs

Agent Trainer

Learn about AgentTrainer API
