Module: rllm.experimental.common.advantage
Backend behavior differs: the phrase “rLLM advantage estimator” does not mean the same runtime path for all backends:
  • Tinker backend: advantage computation is entirely rLLM-based. The following sections naturally apply.
  • Verl backend: by default (rllm.algorithm.use_rllm=false), it uses Verl-native advantage computation; rLLM-based computation is used only when rllm.algorithm.use_rllm=true.
The rLLM estimator set is intentionally smaller than Verl’s full native set. The tradeoff is a unified interface with easier role-level customization.
This page walks through:
  1. How the rLLM estimator interface works
  2. How behavior differs between Tinker and Verl
  3. Why role-level estimator overrides are powerful
  4. How to register and use custom estimators

Core concept

In the unified trainer, reward comparison happens on TrajectoryGroups. Groups are partitioned by group_role (for example, solver and judge), and each role’s estimator receives:
  • a batch (list) of reward arrays (list[np.ndarray]), built from a batch of TrajectoryGroups sharing that group_role;
  • each reward array holds the rewards of all trajectories in a single TrajectoryGroup.
Expected estimator signature:
import numpy as np

def my_adv_estimator(
    rewards: list[np.ndarray],
    **kwargs,
) -> tuple[list[np.ndarray], list[np.ndarray]]:
    ...
Where outputs are aligned to input groups:
  • advantages_by_group[i] corresponds to rewards[i]
  • returns_by_group[i] has the same shape/alignment as advantages_by_group[i]
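To make the alignment concrete, here is a minimal mean-centering estimator that satisfies the signature. It is an illustrative sketch (the name `mean_center_adv` is ours, not a built-in), showing that the output lists mirror the input groups index-for-index and shape-for-shape:

```python
import numpy as np

def mean_center_adv(rewards: list[np.ndarray], **kwargs):
    # One advantage array per input group, same shape as that group's rewards.
    advantages_by_group = [r - r.mean() for r in rewards]
    # With no discounting, returns are commonly taken equal to advantages.
    returns_by_group = [a.copy() for a in advantages_by_group]
    return advantages_by_group, returns_by_group

# Two groups of different sizes stay aligned index-for-index.
advs, rets = mean_center_adv([np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])])
```

Note that groups may have different sizes; only the per-group shapes must match between inputs and outputs.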

Backend-specific behavior

Tinker backend

Tinker uses the rLLM-based advantage path in the unified training flow. Your configured rllm.algorithm.adv_estimator and any role overrides apply directly.

Tinker loss mapping

For Tinker, rLLM also maps estimators to default loss functions in rllm/trainer/tinker/tinker_policy_trainer.py:
| Advantage estimator | Default Tinker loss fn |
| --- | --- |
| REINFORCE | importance_sampling |
| REINFORCE_PLUS_PLUS_BASELINE | importance_sampling |
| GRPO | ppo |
| RLOO | importance_sampling |
| other / unknown | importance_sampling (safe fallback) |
Override anytime via:
rllm:
  algorithm:
    loss_fn: ppo  # or importance_sampling / cispo / dro / cross_entropy
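The mapping logic amounts to a dictionary lookup with a fallback. This is an illustrative sketch of that behavior, not the actual code in tinker_policy_trainer.py; the names `DEFAULT_LOSS_BY_ESTIMATOR` and `default_loss_fn` are ours:

```python
# Illustrative sketch of the estimator-to-loss-fn mapping described above.
DEFAULT_LOSS_BY_ESTIMATOR = {
    "grpo": "ppo",
    "reinforce": "importance_sampling",
    "reinforce_plus_plus_baseline": "importance_sampling",
    "rloo": "importance_sampling",
}

def default_loss_fn(adv_estimator: str) -> str:
    # Unknown estimators fall back to importance_sampling (the safe default).
    return DEFAULT_LOSS_BY_ESTIMATOR.get(adv_estimator, "importance_sampling")
```

An explicit rllm.algorithm.loss_fn in the config always takes precedence over this default.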

Verl backend

Verl has two possible paths:
  • rllm.algorithm.use_rllm=false (default): use Verl-native advantage computation.
  • rllm.algorithm.use_rllm=true: use rLLM-based advantage computation and then inject it back into the Verl batch.
Role-level estimator overrides (traj_group_adv_estimator_map) are an rLLM-path feature and require use_rllm=true.

Built-in rLLM estimators

| Estimator enum | Value | Behavior |
| --- | --- | --- |
| GRPO | grpo | Per-group reward centering; optional std normalization (norm_adv_by_std_in_grpo) |
| REINFORCE | reinforce | No baseline (adv = reward) |
| REINFORCE_PLUS_PLUS_BASELINE | reinforce_plus_plus_baseline | Per-group centering, then role-batch std normalization |
| RLOO | rloo | Leave-one-out baseline per group |
For the REINFORCE++ baseline, normalization divides by a single standard deviation computed over all centered rewards in the role batch, rather than per group.
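The arithmetic behind these estimators can be sketched on toy reward groups. This is illustrative math only, not the library's exact code (epsilon handling and std conventions may differ in rLLM):

```python
import numpy as np

groups = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 2.0])]

# GRPO-style: per-group centering, optionally divided by the group std.
grpo = [(g - g.mean()) / (g.std() + 1e-6) for g in groups]

# RLOO-style: leave-one-out baseline -- each trajectory is compared against
# the mean of the *other* trajectories in its group.
rloo = []
for g in groups:
    baseline = (g.sum() - g) / (len(g) - 1)
    rloo.append(g - baseline)

# REINFORCE++-baseline-style: per-group centering, then one std computed
# over ALL centered rewards in the role batch.
centered = [g - g.mean() for g in groups]
batch_std = np.concatenate(centered).std()
rpp = [c / (batch_std + 1e-6) for c in centered]
```

In the RLOO case, for the group [0.0, 2.0] each trajectory's baseline is simply the other trajectory's reward, yielding advantages [-2.0, 2.0].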

Setting role-level advantage estimators

The most powerful feature of the rLLM path is assigning different estimators to different trajectory roles within one training job. This is especially useful for multi-agent workflows where roles have different reward or statistical properties.

For instance, in a solver-judge workflow, solver trajectories are abundant (2 solver trajectories per rollout * N rollouts = 2N solver trajectories), so GRPO is a good choice for optimizing the solver's performance. The judge, however, conditions on each rollout's own solver answers, so comparing relative advantage by grouping across rollouts is less meaningful; a plain REINFORCE estimator may be a better fit. This setup is achieved by configuring traj_group_adv_estimator_map in the trainer constructor:
from rllm.experimental.common.config import rLLMAdvantageEstimator
from rllm.experimental.unified_trainer import AgentTrainer

traj_group_adv_estimator_map = {
    "solver": rLLMAdvantageEstimator.GRPO,
    "judge": rLLMAdvantageEstimator.REINFORCE,
}

trainer = AgentTrainer(
    ...,
    backend="tinker",
    traj_group_adv_estimator_map=traj_group_adv_estimator_map,
)
If you pass traj_group_adv_estimator_map, set rllm.algorithm.use_rllm=true. UnifiedTrainer validates this.
Global default remains:
rllm:
  algorithm:
    use_rllm: true
    adv_estimator: grpo
  stepwise_advantage:
    mode: broadcast
    norm_adv_by_std_in_grpo: true

Custom estimators

Use the registry helpers to register and retrieve custom advantage estimators:
  • register_rllm_adv_estimator(name)
  • get_rllm_adv_estimator(name)
import numpy as np
from rllm.experimental.common.advantage import (
    register_rllm_adv_estimator,
    get_rllm_adv_estimator,
)


@register_rllm_adv_estimator("my_custom_adv")
def my_custom_adv(rewards: list[np.ndarray], **kwargs):
    advantages_by_group = [group_rewards - np.mean(group_rewards) for group_rewards in rewards]
    returns_by_group = advantages_by_group
    return advantages_by_group, returns_by_group


fn = get_rllm_adv_estimator("my_custom_adv")
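Before wiring a custom estimator into training, it is worth sanity-checking it on toy reward arrays. The snippet below redefines the estimator body standalone so it runs without rLLM installed (the registration decorator is omitted here purely for that reason):

```python
import numpy as np

# Standalone copy of the custom estimator body for a quick sanity check.
def my_custom_adv(rewards: list[np.ndarray], **kwargs):
    advantages_by_group = [g - np.mean(g) for g in rewards]
    returns_by_group = advantages_by_group
    return advantages_by_group, returns_by_group

advs, rets = my_custom_adv([np.array([1.0, 0.0, 1.0]), np.array([0.0, 2.0])])
# Each group's advantages should sum to ~0 after mean-centering.
assert all(abs(a.sum()) < 1e-9 for a in advs)
```

Checks worth doing: outputs are aligned to input groups, per-group shapes match the inputs, and the statistics (here, zero mean per group) behave as intended.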
You can use the custom estimator as a global default:
rllm:
  algorithm:
    use_rllm: true
    adv_estimator: my_custom_adv
Or as a role-specific override in traj_group_adv_estimator_map:
traj_group_adv_estimator_map = {
    "solver": "my_custom_adv",
    "judge": "reinforce",
}