Module: rllm.trainer.algorithms.advantage

- The estimator interface and its data contract
- Built-in estimators and what rLLM does (and doesn’t) support today
- Role-level estimator overrides
- Registering a custom estimator
- A worked example: porting OPO from Verl
Core concept
In the unified trainer, advantages are computed on TrajectoryGroups.
Groups are partitioned by group_role (for example, solver and judge); each role’s estimator is invoked once with all the groups belonging to that role.
The orchestrator passes three things to the estimator:
- rewards: list[np.ndarray]. The outer list is aligned with the role’s TrajectoryGroups; each inner array is indexed by trajectory.
- algorithm_config: AlgorithmConfig. The resolved algorithm config, so the estimator can pull whatever it needs (for example, norm_adv_by_std_in_grpo).
- traj_groups: list[TrajectoryGroup]. Same outer shape as rewards, exposed for estimators that need per-trajectory metadata (response lengths, step counts, etc.).
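The dispatch described above can be sketched roughly as follows. This is not the orchestrator's actual code: `GroupStub`, its `rewards` field, `partition_by_role`, and `run_estimators` are hypothetical names introduced for illustration, and passing `traj_groups` as a keyword argument is an assumption based on the `**kwargs` description later on this page.

```python
from collections import defaultdict
from dataclasses import dataclass

import numpy as np


@dataclass
class GroupStub:
    """Hypothetical stand-in for TrajectoryGroup; only the fields used here."""
    group_role: str
    rewards: list[float]  # one scalar reward per trajectory (assumed field)


def partition_by_role(groups):
    """Bucket groups by group_role, preserving their original order."""
    by_role = defaultdict(list)
    for g in groups:
        by_role[g.group_role].append(g)
    return dict(by_role)


def run_estimators(groups, estimator_for_role, algorithm_config=None):
    """Call each role's estimator exactly once with all of that role's groups."""
    results = {}
    for role, role_groups in partition_by_role(groups).items():
        # Outer list: one entry per group. Inner array: one reward per trajectory.
        rewards = [np.asarray(g.rewards, dtype=np.float64) for g in role_groups]
        estimator = estimator_for_role[role]
        results[role] = estimator(rewards, algorithm_config, traj_groups=role_groups)
    return results
```

Each estimator sees only its own role's groups, so any statistics it computes are scoped to that role.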
What rewards looks like
Concretely, if a training step produced 4 solver groups with 8 trajectories each, the solver call receives:
- rewards: a length-4 list of 1-D numpy arrays of shape (8,). rewards[i][j] is the scalar reward for trajectory j in group i.
The returned pair (advantages_by_group, returns_by_group) must align with rewards:
- advantages_by_group[i].shape == rewards[i].shape
- one scalar advantage per trajectory; the unified trainer broadcasts it across the trajectory’s response tokens later.
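A small sketch of the shapes involved, using the 4 groups of 8 trajectories from the example. `check_contract` and `broadcast_to_tokens` are hypothetical helpers written for this page, not rLLM functions; requiring `returns_by_group` to have the same shape as `rewards` is an assumption, and the broadcast only illustrates what "one scalar per trajectory, repeated per token" means.

```python
import numpy as np

# 4 solver groups with 8 trajectories each; random 0/1 rewards for illustration.
rng = np.random.default_rng(0)
rewards = [rng.integers(0, 2, size=8).astype(np.float64) for _ in range(4)]


def check_contract(rewards, advantages_by_group, returns_by_group):
    """Raise ValueError if the estimator output is not aligned with rewards."""
    if not (len(advantages_by_group) == len(returns_by_group) == len(rewards)):
        raise ValueError("outer lists must have one entry per group")
    for i, r in enumerate(rewards):
        if advantages_by_group[i].shape != r.shape:
            raise ValueError(f"group {i}: advantage shape {advantages_by_group[i].shape} != {r.shape}")
        # Assumption: returns follow the same per-trajectory shape as rewards.
        if returns_by_group[i].shape != r.shape:
            raise ValueError(f"group {i}: return shape {returns_by_group[i].shape} != {r.shape}")


def broadcast_to_tokens(advantage, num_response_tokens):
    """Repeat one trajectory-level scalar across that trajectory's response tokens."""
    return np.full(num_response_tokens, advantage, dtype=np.float64)
```

Running a check like this on a custom estimator's output, before wiring it into a training job, catches misaligned groups early.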
The estimator interface is trajectory-level scalar in, trajectory-level scalar out. Per-token signals (returns over time, GAE, K3 detector, etc.) cannot be expressed through this hook today. For per-token signals, see Pre-computing advantage in workflow.
Tinker loss mapping
For the Tinker backend, rLLM also maps each estimator to a default loss function in rllm/trainer/tinker/tinker_policy_trainer.py:
| Advantage estimator | Default Tinker loss fn |
|---|---|
| REINFORCE | importance_sampling |
| REINFORCE_PLUS_PLUS_BASELINE | importance_sampling |
| GRPO | ppo |
| RLOO | importance_sampling |
| OTHER / unknown | importance_sampling (safe fallback) |
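The lookup-with-fallback behavior in the table can be sketched as a plain dictionary. This is an illustrative copy of the table only: `DEFAULT_TINKER_LOSS` and `default_loss_for` are hypothetical names, and keying by the lowercase estimator value is an assumption about how the real mapping in tinker_policy_trainer.py is organized.

```python
# Illustrative copy of the table above; not the actual rLLM data structure.
DEFAULT_TINKER_LOSS = {
    "reinforce": "importance_sampling",
    "reinforce_plus_plus_baseline": "importance_sampling",
    "grpo": "ppo",
    "rloo": "importance_sampling",
}


def default_loss_for(estimator_name: str) -> str:
    """Return the default loss for an estimator, falling back for unknown names."""
    return DEFAULT_TINKER_LOSS.get(estimator_name.lower(), "importance_sampling")
```

With this shape, a custom estimator that has no entry of its own lands on importance_sampling, matching the last row of the table.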
Built-in estimators
| Estimator enum | Value | Behavior |
|---|---|---|
| GRPO | grpo | Per-group reward centering; optional std normalization (norm_adv_by_std_in_grpo) |
| REINFORCE | reinforce | No baseline (adv = reward) |
| REINFORCE_PLUS_PLUS_BASELINE | reinforce_plus_plus_baseline | Per-group centering, then role-batch std normalization |
| RLOO | rloo | Leave-one-out baseline per group |
For the REINFORCE++ baseline, normalization uses batch-level statistics across all centered rewards in the role.
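The four behaviors in the table can be written out in NumPy as a rough sketch. These functions are not rLLM's implementations: the names, the `EPS` value, the use of the population standard deviation, and the zero advantage for single-trajectory groups in RLOO are all assumptions made to keep the example runnable. The `norm_by_std` argument plays the role of norm_adv_by_std_in_grpo.

```python
import numpy as np

EPS = 1e-6  # assumed stabilizer; the value rLLM uses is not stated on this page


def grpo(rewards, norm_by_std=True):
    """Center each group on its own mean; optionally divide by the group's std."""
    out = []
    for r in rewards:
        adv = r - r.mean()
        if norm_by_std:
            adv = adv / (r.std() + EPS)
        out.append(adv)
    return out


def reinforce(rewards):
    """No baseline: the advantage is the reward itself."""
    return [r.copy() for r in rewards]


def reinforce_pp_baseline(rewards):
    """Center per group, then scale by the std of all centered rewards in the role."""
    centered = [r - r.mean() for r in rewards]
    batch_std = np.concatenate(centered).std()
    return [c / (batch_std + EPS) for c in centered]


def rloo(rewards):
    """Subtract, for each trajectory, the mean reward of the others in its group."""
    out = []
    for r in rewards:
        n = r.size
        if n < 2:
            out.append(np.zeros_like(r))  # assumed handling for singleton groups
            continue
        leave_one_out_mean = (r.sum() - r) / (n - 1)
        out.append(r - leave_one_out_mean)
    return out
```

The difference between GRPO and the REINFORCE++ baseline is only in where the scale comes from: each group's own std in the first case, one std over the whole role batch in the second.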
What rLLM supports today
The four estimators above are intentionally a subset of Verl’s full catalog. We expose what fits the rLLM hook contract (scalar reward per trajectory, scalar advantage per trajectory) so that the same code path serves both the Tinker and Verl backends. Verl’s broader catalog includes GAE, REINFORCE++ (proper, per-token discounted reward-to-go), REMAX, OPO, GPG, GRPO-passk, GDPO, and OTB / TIR-OTB. Of these:
- Scalar-per-trajectory estimators (OPO, GPG, GRPO-passk, GDPO) can be expressed through the current hook. The OPO worked example below walks through one such port. We plan to add more on request.
- Per-token or critic-dependent estimators (GAE, REINFORCE++ proper, OTB, TIR-OTB) need an interface extension and are a larger follow-up. If you need one of these today, write a per-token signal in your workflow and use use_precomputed_advantage to bypass the estimator hook entirely.
Setting role-level estimators
The most powerful feature of the rLLM path is assigning different estimators to different trajectory roles in one training job. This is especially useful for multi-agent workflows where roles have different reward distributions. In a solver-judge workflow, for example, there are typically multiple solver trajectories per rollout (2 solver trajectories per rollout × N rollouts = 2N solver trajectories), so GRPO is a good fit for the solver. The judge depends on the solver’s outputs, so cross-rollout grouping is less meaningful for it; a vanilla REINFORCE is often the better choice.
Configure this via traj_group_adv_estimator_map in the trainer constructor:
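As a sketch of what such a map might contain for the solver-judge case, here is a plain dictionary from role name to estimator name. The string values follow the Value column of the built-in estimators table; whether the trainer expects strings, enum members, or both is an assumption, as is the fallback in the hypothetical `resolve_estimator_name` helper (how unmapped roles are handled is not stated here). The trainer constructor call itself is omitted because its signature is not shown on this page.

```python
# Role name -> estimator name (assumed to match the "Value" column above).
traj_group_adv_estimator_map = {
    "solver": "grpo",       # several solver trajectories per rollout
    "judge": "reinforce",   # cross-rollout grouping is less meaningful
}


def resolve_estimator_name(role, role_map, default="grpo"):
    """Hypothetical helper: pick the per-role override, else a default."""
    return role_map.get(role, default)
```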
Custom estimators
Use the registry helpers to register and retrieve custom advantage estimators:
- register_rllm_adv_estimator(name)
- get_rllm_adv_estimator(name)
**kwargs carries traj_groups: list[TrajectoryGroup] aligned with rewards. Pull it when you need per-trajectory metadata; ignore it otherwise.
A toy custom estimator that subtracts the role-batch mean:
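A minimal sketch of such an estimator is given here as a plain function, so it runs without rLLM installed. The function name is hypothetical, and returning the raw rewards as `returns_by_group` is an assumption. In an rLLM project the function would additionally be registered under a name through register_rllm_adv_estimator; the exact registration syntax is not shown on this page, so it is left out.

```python
import numpy as np


def role_mean_baseline(rewards, algorithm_config, **kwargs):
    """Subtract the mean reward over every trajectory in the role."""
    # kwargs may carry traj_groups; this estimator does not need it.
    if not rewards:
        return [], []
    # One mean over all groups of the role, not one mean per group.
    role_mean = np.concatenate(rewards).mean()
    advantages_by_group = [r - role_mean for r in rewards]
    # Assumption: with scalar outcome rewards, the returns are the rewards.
    returns_by_group = [r.copy() for r in rewards]
    return advantages_by_group, returns_by_group
```

Unlike GRPO's per-group centering, a group whose trajectories all received the same reward can still get a nonzero advantage here, because the baseline is shared across the role.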
Once registered, the estimator can be referenced by its registered name in traj_group_adv_estimator_map.
Worked example: porting OPO from Verl
OPO (https://arxiv.org/abs/2505.23585) computes a length-weighted baseline per group. Verl’s reference implementation is compute_opo_outcome_advantage in verl/trainer/ppo/core_algos.py, which reads response lengths from response_mask. In rLLM, response lengths come from traj.steps[*].response_ids, which are reachable through traj_groups in **kwargs.
The port follows the same contract as any other estimator: take rewards, algorithm_config, and traj_groups, and return a list of advantage arrays aligned with rewards.
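A sketch of what the port could look like, under several stated assumptions. For each group, the baseline is the length-weighted mean reward, sum(len_i * r_i) / sum(len_i), and each advantage is r_i minus that baseline. The stub dataclasses stand in for rLLM's real types, and the `trajectories` attribute name is hypothetical. Summing response_ids lengths over all steps of a trajectory, using a zero baseline for single-trajectory groups, and returning the (advantages_by_group, returns_by_group) pair from the data contract are further assumptions. Registration through register_rllm_adv_estimator is again left out.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class StepStub:
    response_ids: list[int]


@dataclass
class TrajectoryStub:
    steps: list[StepStub]


@dataclass
class TrajectoryGroupStub:
    trajectories: list[TrajectoryStub]  # attribute name is an assumption


def response_length(traj):
    """Total response tokens in a trajectory, summed over its steps (assumed)."""
    return sum(len(step.response_ids) for step in traj.steps)


def opo_advantages(rewards, algorithm_config, **kwargs):
    """Length-weighted baseline per group: b = sum(len_i * r_i) / sum(len_i)."""
    traj_groups = kwargs["traj_groups"]
    advantages_by_group = []
    for r, group in zip(rewards, traj_groups):
        lengths = np.array(
            [response_length(t) for t in group.trajectories], dtype=np.float64
        )
        total = lengths.sum()
        if r.size < 2 or total == 0:
            baseline = 0.0  # assumed handling for singleton or empty groups
        else:
            baseline = float((lengths * r).sum() / total)
        advantages_by_group.append(r - baseline)
    # Assumption: returns are the raw scalar rewards.
    returns_by_group = [r.copy() for r in rewards]
    return advantages_by_group, returns_by_group
```

Because longer responses carry more weight in the baseline, a long incorrect response pulls the baseline toward zero reward more strongly than a short one does.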
Related references
- Pre-computing advantage in workflow: for per-token signals (SFT-like, OPD)
- Unified trainer: AlgorithmConfig reference
- Configuration: full config field list

