Related modules: `rllm.agents.agent.Step`, `rllm.experimental.common.advantage`, `rllm.experimental.common.config.AlgorithmConfig`

rLLM lets a workflow precompute `step.advantage` during workflow rollout and have the
unified trainer consume it directly, instead of computing advantages later from
trajectory rewards.
## Why this exists
With standard RL estimators (for example GRPO), advantages are computed after grouping trajectories into `TrajectoryGroup`s. This requires rollout results from multiple samples, so the calculation naturally happens in the trainer pipeline.
For other post-training setups, however, “advantage” is better treated as a generic
per-token training signal that can be produced directly inside workflow logic.
Common examples:
- SFT-like supervision: token-level signals can be derived from demonstration targets.
- On-policy distillation (OPD): token-level reverse-KL-style signals can be computed from teacher/student log-probabilities.
## How it works
The advantage collection logic is in `rllm.experimental.common.advantage.collect_reward_and_advantage_from_trajectory_groups`.
For each `TrajectoryGroup`:

- if any step has `step.advantage != None` and `use_precomputed_advantage=true`, rLLM consumes the precomputed values from the steps in that group
- otherwise, rLLM falls back to RL estimator computation from trajectory rewards

This makes mixed configurations possible within a single run:

- some roles/groups use precomputed step-level signals
- other roles/groups still use RL estimators (GRPO/REINFORCE/RLOO/custom)
## Data contract for `step.advantage`
When precompute mode is active for a group, `step.advantage` can be:

- a `float`: broadcast to all tokens in `step.response_ids`
- a `list[float]`: must match `len(step.response_ids)`
- unsupported types raise an error
- length mismatches are replaced by zeros with a warning

If a step carries a precomputed advantage but `use_precomputed_advantage=false`, rLLM logs a warning and overwrites it with the configured RL estimator.
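The contract above can be made concrete with a small sketch. `to_token_advantages` is a hypothetical helper, not part of the rLLM API; it only mirrors the rules listed above.

```python
# Hedged sketch of the step.advantage data contract; to_token_advantages
# is a hypothetical helper, not rLLM's implementation.
import warnings


def to_token_advantages(advantage, response_ids):
    """Normalize a step's advantage into one value per response token."""
    n = len(response_ids)
    if isinstance(advantage, (int, float)) and not isinstance(advantage, bool):
        # float: broadcast to every token in step.response_ids
        return [float(advantage)] * n
    if isinstance(advantage, list):
        if len(advantage) != n:
            # length mismatch: replaced by zeros with a warning
            warnings.warn("advantage length mismatch; replacing with zeros")
            return [0.0] * n
        return [float(a) for a in advantage]
    # unsupported types raise an error
    raise TypeError(f"unsupported advantage type: {type(advantage).__name__}")
```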
## Solver-judge example (mixed mode)
The solver-judge workflow is a good example of why this is useful. Suppose each episode produces two solver trajectories and one judge trajectory (a simplification of `examples.solver_judge.solver_judge_flow`).
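Such an episode can be sketched as below. The names here (`Trajectory`, `run_episode`, the `solve`/`judge` callables) are illustrative assumptions, not the real `examples.solver_judge.solver_judge_flow` API.

```python
# Hedged sketch of a solver-judge episode; names are illustrative,
# not the real solver_judge_flow API.
from dataclasses import dataclass


@dataclass
class Trajectory:
    role: str
    text: str


def run_episode(prompt, solve, judge):
    """Two solver rollouts, then one judge trajectory comparing them."""
    solver_trajs = [Trajectory("solver", solve(prompt)) for _ in range(2)]
    verdict = judge(prompt, [t.text for t in solver_trajs])
    # Per episode: two solver trajectories and one judge trajectory.
    return solver_trajs + [Trajectory("judge", verdict)]
```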
### Grouping intuition
If you sample `N` rollouts per prompt:

- solver group size is approximately `2N` trajectories
- judge group size is approximately `N` trajectories
A natural split is to:

- keep `solver` on GRPO (group-based RL)
- precompute `judge` step advantages using OPD-style teacher signals
### Configure the advantage estimator for solver (no precomputed advantage)
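The configuration surface lives in `rllm.experimental.common.config.AlgorithmConfig`. The field names below are guesses based on the options described on this page, so verify them against the actual dataclass before use:

```yaml
# Hypothetical sketch -- verify field names against AlgorithmConfig.
algorithm:
  adv_estimator: grpo              # applied to groups without precomputed advantages
  use_precomputed_advantage: true  # judge groups will supply their own signal
```

Because solver steps never set `step.advantage`, their groups fall back to the configured estimator even with `use_precomputed_advantage` enabled.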
### Precompute judge advantage in workflow
Here we assume that you have access to a generic “teacher” client that can evaluate the log probabilities of any token sequence. In reality this depends on the backend you use; in Tinker, for instance, this can be a `tinker.SamplingClient` instance.
We can then assign a reverse-KL-style advantage to the judge step in the workflow:
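A minimal sketch of one such assignment follows. It assumes you already have per-token teacher and student log-probabilities for `step.response_ids`; the signal `teacher_logp - student_logp` is one common reverse-KL-style choice, not necessarily the exact formula in rLLM's OPD example, and the usage comment uses hypothetical attribute names.

```python
# Hedged sketch: per-token reverse-KL-style advantage for the judge step.
# teacher_logps / student_logps: one log-probability per response token.
def opd_step_advantage(teacher_logps, student_logps):
    """Positive where the teacher assigns more probability than the student,
    so training pushes the student toward the teacher distribution."""
    assert len(teacher_logps) == len(student_logps)
    return [t - s for t, s in zip(teacher_logps, student_logps)]


# Pseudocode-ish usage in the workflow (attribute names are hypothetical):
# judge_step.advantage = opd_step_advantage(
#     teacher_logps=teacher_client_logprobs(judge_step.response_ids),
#     student_logps=judge_step.logprobs,
# )
```

The result is a `list[float]` matching `len(step.response_ids)`, which satisfies the data contract above.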
The `judge` group will then bypass the usual RL advantage computation during the training loop and directly use the precomputed advantages, while the `solver` group still receives the usual group-based RL advantages from the trainer.
### Configure trainer for mixed mode
Following the example above, we can take this one step further by considering a scenario with an extra role in the workflow, a `validator` role, that we want to equip with a different RL advantage estimator (e.g. REINFORCE, or a custom advantage estimator you registered). We can combine the precomputed advantage with role-level advantage estimators to achieve this:
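A hypothetical sketch of such a configuration; the keys below, in particular the per-role estimator mapping, are guesses at the shape of `AlgorithmConfig`, so check the real field names before use:

```yaml
# Hypothetical sketch -- verify field names against AlgorithmConfig.
algorithm:
  use_precomputed_advantage: true  # judge steps carry their own signal
  adv_estimator: grpo              # default estimator (solver groups)
  role_adv_estimator:
    validator: reinforce           # per-role estimator override
```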
With this setup:

- `judge` uses your precomputed `step.advantage`
- `solver` and `validator` use estimator-based RL advantages
## Responsibility and sanity checks
Precompute mode increases flexibility, but correctness becomes your responsibility. rLLM performs basic validation (type/length checks, fallback warnings), but it cannot verify whether your signal is mathematically correct for your intended algorithm. Recommended practice:

- Start with small runs and inspect `advantage/*` metrics.
- Log representative `step.advantage` samples from the workflow.
- Validate shapes and value ranges before scaling experiments.

