AgentTrainer handles episode generation, reward assignment, advantage computation, and policy updates.
During eval, the pipeline is one-directional. During training, AgentTrainer adds the extra machinery: routing LLM calls through a gateway that captures the token-level data (prompt IDs, response IDs, logprobs) needed for policy gradients.
## Basic usage
Pass an `agent_flow` and an `evaluator` to AgentTrainer, then call `train()`:
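A minimal sketch of the call pattern. The import path, constructor signature, and the stand-in classes below are assumptions for illustration, not the library's actual API:

```python
# Stand-ins so this sketch is self-contained. In real use, AgentTrainer
# comes from the library, and the signatures below are assumptions.
class MyAgentFlow:
    def run(self, task, config):
        """Produce an Episode for one task (stubbed out here)."""

class MyEvaluator:
    def evaluate(self, task, episode):
        """Return an EvalOutput with a reward and correctness flag (stub)."""

class AgentTrainer:  # stand-in with an assumed constructor signature
    def __init__(self, agent_flow, evaluator):
        self.agent_flow = agent_flow
        self.evaluator = evaluator

    def train(self):
        """The real train() runs the loop described in the next section."""

trainer = AgentTrainer(agent_flow=MyAgentFlow(), evaluator=MyEvaluator())
trainer.train()
```

The key point is that the same `agent_flow` and `evaluator` you use for eval are passed to the trainer unchanged.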
## The training loop
Each training iteration runs through these stages:

1. **Generate episodes.** For each task in the batch, the trainer calls `agent_flow.run(task, config)` to produce an Episode, just like during eval. The `AgentConfig.base_url` points to a gateway that transparently captures token-level traces (prompt IDs, response IDs, logprobs) from every LLM call.
2. **Evaluate and assign rewards.** The trainer calls `evaluator.evaluate(task, episode)` for each Episode, producing an EvalOutput with a reward and a correctness flag. The reward is written back onto each Trajectory in the Episode.
3. **Enrich with token data.** The gateway's captured traces are matched to Trajectories and converted into training-ready Steps with full token information. This is what makes the same AgentFlow work for both eval and training: your agent code doesn't need to know about tokens or logprobs.
4. **Compute advantages.** Trajectories are grouped by `{task_id}:{trajectory.name}`. The RL algorithm (GRPO, REINFORCE, etc.) compares rewards within each group to compute advantages, determining which rollouts were better than average.
5. **Update policy.** The training backend uses the token-level data and advantages from each Step to compute policy gradients and update the model weights.
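The grouping-and-advantage step can be sketched as follows. The data shapes here are illustrative, not the library's actual data model, and the baseline shown is the simple group-mean variant (GRPO additionally normalizes by the group's reward standard deviation):

```python
from collections import defaultdict

# Toy rollouts: (task_id, trajectory_name, reward).
# Field names are illustrative, not the library's actual data model.
trajectories = [
    ("task-1", "main", 1.0),
    ("task-1", "main", 0.0),
    ("task-1", "main", 0.5),
    ("task-2", "main", 1.0),
    ("task-2", "main", 1.0),
]

# Group rewards by "{task_id}:{trajectory_name}", as the trainer does.
groups = defaultdict(list)
for task_id, name, reward in trajectories:
    groups[f"{task_id}:{name}"].append(reward)

# Advantage = reward minus the group's mean reward, so rollouts that
# beat their siblings on the same task get positive advantages.
advantages = []
for task_id, name, reward in trajectories:
    rewards = groups[f"{task_id}:{name}"]
    advantages.append(reward - sum(rewards) / len(rewards))

print(advantages)  # task-1 group mean is 0.5; task-2 group mean is 1.0
```

Note that all of task-2's rollouts get zero advantage: when every rollout in a group earns the same reward, the group provides no learning signal.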
## How the gateway works
Your AgentFlow makes LLM calls as normal, using an OpenAI-compatible client pointed at the `base_url` from AgentConfig. Behind the scenes, this URL routes through a gateway that:
- Forwards requests to the actual model server
- Records every request and response with token IDs and logprobs
- Associates traces with the correct Episode via the `session_uid`
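A minimal sketch of the association step, assuming a hypothetical in-memory store and record shape (the gateway's real internals are not shown in this doc). Each captured trace is keyed by the `session_uid` carried with the request, so it can later be matched back to the Episode that issued it:

```python
from collections import defaultdict

# Hypothetical in-memory trace store, keyed by session_uid.
traces = defaultdict(list)

def record_llm_call(session_uid, prompt_ids, response_ids, logprobs):
    """Record one proxied LLM call; field names are illustrative."""
    traces[session_uid].append({
        "prompt_ids": prompt_ids,
        "response_ids": response_ids,
        "logprobs": logprobs,
    })

# Two calls from the same episode share a session_uid...
record_llm_call("ep-42", [1, 2, 3], [4, 5], [-0.1, -0.7])
record_llm_call("ep-42", [1, 2, 3, 4, 5], [6], [-0.2])
# ...so the enrichment stage can fetch every trace for that Episode at once.
print(len(traces["ep-42"]))  # 2
```

Because the association happens in the gateway, the agent code itself never touches token IDs or logprobs.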
## Training backends
AgentTrainer supports two backends:

- **verl**: distributed RL training via Ray with vLLM/SGLang inference. Best for large-scale multi-GPU training.
- **tinker**
## Configuration
Training configs are OmegaConf/Hydra-based. The `build_train_config` helper covers common options:
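As an illustration only, an OmegaConf-style config for this kind of trainer might look like the fragment below. Every key here is a hypothetical example, not `build_train_config`'s real schema; consult the helper itself for the actual options:

```yaml
# Hypothetical example values; not the real schema.
trainer:
  backend: verl          # or tinker
  total_iterations: 100
algorithm:
  name: grpo             # advantage estimator: grpo, reinforce, ...
  group_size: 8          # rollouts per task for group-relative baselines
data:
  batch_size: 64
model:
  path: /path/to/base/model
```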
### Key configuration sections

