rLLM’s training loop uses the same AgentFlow and Evaluator abstractions you use for evaluation. You pass your AgentFlow, Evaluator, and datasets into AgentTrainer — it handles episode generation, reward assignment, advantage computation, and policy updates. During eval, the pipeline is one-directional:
Dataset → AgentFlow.run(task) → Episode → Evaluator.evaluate(task, episode) → EvalOutput
During training, the same pipeline runs in a loop, with rewards flowing back into the model:
Dataset → AgentFlow.run(task) → Episode → Evaluator.evaluate() → reward
    → advantage computation → policy update → (repeat with updated model)
Your AgentFlow and Evaluator code stays the same. The AgentTrainer handles the additional machinery — routing LLM calls through a gateway that captures token-level data (prompt IDs, response IDs, logprobs) needed for policy gradients.
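The eval and train pipelines above can be sketched side by side. This is a minimal runnable illustration with stub components standing in for rLLM's real AgentFlow, Evaluator, and trainer classes (the function names `run_eval` and `run_train_step` and the mean-baseline advantage are assumptions for illustration, not rLLM's internals):

```python
# Sketch of the two pipelines: eval is one-directional, training adds a
# reward -> advantage -> policy-update loop on top of the same components.

def run_eval(agent_flow, evaluator, dataset, config):
    # Dataset -> run(task) -> Episode -> evaluate(task, episode) -> EvalOutput
    return [evaluator.evaluate(task, agent_flow.run(task, config)) for task in dataset]

def run_train_step(agent_flow, evaluator, batch, config, update_policy):
    # Same pipeline, with rewards flowing back into the model.
    episodes = [agent_flow.run(task, config) for task in batch]
    rewards = [evaluator.evaluate(t, e)["reward"] for t, e in zip(batch, episodes)]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]  # simplest possible baseline
    update_policy(episodes, advantages)           # backend computes gradients here
    return rewards
```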

Basic usage

Pass an agent_flow and evaluator to AgentTrainer, then call train():
from rllm.experimental.unified_trainer import AgentTrainer
from rllm.experimental.eval.agent_loader import load_agent
from rllm.experimental.eval.evaluator_loader import load_evaluator
from rllm.experimental.cli.train import build_train_config
from rllm.data import DatasetRegistry

# Load your agent flow and evaluator
agent = load_agent("concierge")
evaluator = load_evaluator("relevance")

# Load datasets
train_dataset = DatasetRegistry.load_dataset("concierge", "train")
val_dataset = DatasetRegistry.load_dataset("concierge", "test")

# Build config
config = build_train_config(
    model_name="Qwen/Qwen3-8B",
    group_size=8,
    batch_size=32,
    lr=2e-5,
    lora_rank=32,
    total_epochs=1,
    project="concierge-train",
    experiment="concierge-rl",
)

# Train
trainer = AgentTrainer(
    backend="tinker",
    agent_flow=agent,
    evaluator=evaluator,
    config=config,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
)
trainer.train()
You can also use custom AgentFlow and Evaluator classes directly:
trainer = AgentTrainer(
    backend="verl",
    agent_flow=MyAgentFlow(),
    evaluator=MyEvaluator(),
    config=config,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
)
trainer.train()
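A custom pair only needs to honor the interfaces described above: `run(task, config)` returns an Episode, and `evaluate(task, episode)` returns an output with a reward and correctness flag. The sketch below is hypothetical and simplified; rLLM's actual base classes, Episode/EvalOutput types, and import paths may differ, and the dict shapes here are placeholders:

```python
class MyAgentFlow:
    """Produces an Episode for a task."""

    def run(self, task, config):
        # A real flow would call the model via an OpenAI-compatible client
        # pointed at the gateway's base_url from the config.
        answer = f"echo: {task['question']}"
        return {"trajectories": [{"name": "main", "response": answer}]}

class MyEvaluator:
    """Scores an Episode against its task."""

    def evaluate(self, task, episode):
        response = episode["trajectories"][0]["response"]
        correct = task["question"] in response
        return {"reward": 1.0 if correct else 0.0, "is_correct": correct}
```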

The training loop

Each training iteration runs through these stages:
1. Generate episodes. For each task in the batch, the trainer calls agent_flow.run(task, config) to produce an Episode — just like during eval. The AgentConfig.base_url points to a gateway that transparently captures token-level traces (prompt IDs, response IDs, logprobs) from every LLM call.
2. Evaluate and assign rewards. The trainer calls evaluator.evaluate(task, episode) for each Episode, producing an EvalOutput with a reward and correctness flag. The reward is written back onto each Trajectory in the Episode.
3. Enrich with token data. The gateway's captured traces are matched to Trajectories and converted into training-ready Steps with full token information. This is what makes the same AgentFlow work for both eval and training — your agent code doesn't need to know about tokens or logprobs.
4. Compute advantages. Trajectories are grouped by {task_id}:{trajectory.name}. The RL algorithm (GRPO, REINFORCE, etc.) compares rewards within each group to compute advantages — determining which rollouts were better than average.
5. Update policy. The training backend uses the token-level data and advantages from each Step to compute policy gradients and update the model weights.
6. Iterate. The updated model generates new Episodes on the next batch, and the cycle repeats.
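The grouped advantage computation in step 4 can be sketched concretely. This is an illustrative GRPO-style normalization, not rLLM's implementation; the dict fields and function name are assumptions:

```python
from collections import defaultdict
import statistics

def compute_group_advantages(trajectories):
    """Group trajectories by {task_id}:{name} and normalize rewards within each group."""
    groups = defaultdict(list)
    for traj in trajectories:
        groups[f"{traj['task_id']}:{traj['name']}"].append(traj)

    for group in groups.values():
        rewards = [t["reward"] for t in group]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards)
        for t in group:
            # Rollouts better than the group average get a positive advantage.
            t["advantage"] = (t["reward"] - mean) / (std + 1e-6)
    return trajectories
```

With group_size=8, each task contributes eight rollouts to one group, so "better than average" is always measured against rollouts of the same task.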

How the gateway works

Your AgentFlow makes LLM calls like normal — using an OpenAI-compatible client pointed at the base_url from AgentConfig. Behind the scenes, this URL routes through a gateway that:
  1. Forwards requests to the actual model server
  2. Records every request and response with token IDs and logprobs
  3. Associates traces with the correct Episode via the session_uid
After the AgentFlow completes, the trainer retrieves these traces and enriches the Episode’s Steps with the token-level data needed for training. Your agent code never needs to handle tokenization or logprob collection.
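Conceptually, the gateway is a record-and-forward proxy keyed by session. The toy class below illustrates that behavior only — it is not rLLM's gateway, and the field names (`prompt_token_ids`, etc.) are assumptions about the model server's response shape:

```python
class RecordingGateway:
    """Forwards requests to a model server and records token-level traces per session."""

    def __init__(self, forward):
        self.forward = forward  # callable that hits the actual model server
        self.traces = {}        # session_uid -> list of captured traces

    def chat(self, session_uid, request):
        response = self.forward(request)
        # Keep the token-level data needed later for policy gradients,
        # keyed by session so it can be matched back to the Episode.
        self.traces.setdefault(session_uid, []).append({
            "prompt_ids": response["prompt_token_ids"],
            "response_ids": response["response_token_ids"],
            "logprobs": response["logprobs"],
        })
        return response
```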

Training backends

AgentTrainer supports two backends: verl and tinker. The verl backend runs distributed RL training via Ray with vLLM/SGLang inference, and is best suited to large-scale multi-GPU training.
trainer = AgentTrainer(
    backend="verl",
    agent_flow=agent,
    evaluator=evaluator,
    config=config,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
)
See verl and tinker for backend-specific configuration.

Configuration

Training configs are OmegaConf/Hydra-based. The build_train_config helper covers common options:
config = build_train_config(
    model_name="Qwen/Qwen3-8B",    # Base model
    group_size=8,                    # Rollouts per task (for GRPO)
    batch_size=32,                   # Tasks per training batch
    lr=2e-5,                         # Learning rate
    lora_rank=32,                    # LoRA rank (None for full fine-tuning)
    total_epochs=1,                  # Training epochs
    total_steps=None,                # Or stop after N steps
    val_freq=5,                      # Validate every N steps
    save_freq=20,                    # Checkpoint every N steps
    project="my-project",           # W&B project name
    experiment="my-experiment",     # Experiment name
    output_dir=None,                # Checkpoint directory
    config_file=None,               # Optional YAML overrides
)
For full control, you can also provide a YAML config file or override individual values:
data:
  train_batch_size: 32
  max_prompt_length: 2048
  max_response_length: 1024

training:
  total_epochs: 1
  save_freq: 20
  val_freq: 5
  lr: 2e-5

model:
  name: "Qwen/Qwen3-8B"

rllm:
  algorithm:
    name: "grpo"
    group_size: 8
  workflow:
    n_parallel_tasks: 64
    retry_limit: 3

Next steps