pip install rllm) and want to control training behavior from their own project, without forking the codebase.
Customization layers
rLLM offers three tiers of customization, from least to most effort:

| Tier | Mechanism | What you can change |
|---|---|---|
| Config | YAML file or CLI flags | Algorithm selection, hyperparameters, filtering, rejection sampling |
| Python kwargs | `AgentTrainer(...)` or `UnifiedTrainer(...)` arguments | Trajectory grouping logic, per-group advantage estimator selection |
| Registry | `@register_rllm_adv_estimator` decorator | Custom advantage computation functions |
Tier 1: Config-level customization
The simplest path: pass a YAML file via `--config` on the CLI, or set keys programmatically with OmegaConf before passing the config to `AgentTrainer`.
Selecting an advantage estimator
Set `algorithm.adv_estimator` in your config to one of the built-in estimators:
my_config.yaml
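The original file contents aren't shown here; a minimal sketch of such a config, using the estimator names from the table below (nesting assumed from the `algorithm.adv_estimator` key):

```yaml
# my_config.yaml — sketch; only the algorithm block is shown
algorithm:
  adv_estimator: grpo   # or: reinforce, reinforce_plus_plus_baseline, rloo
```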
| Estimator | Behavior |
|---|---|
| `grpo` | `(reward - group_mean) / group_std` — default, works well for most tasks |
| `reinforce` | `advantage = reward` — no baseline subtraction |
| `reinforce_plus_plus_baseline` | Per-group mean baseline, then whiten by batch-level std |
| `rloo` | Leave-one-out: the baseline for trajectory *i* is the mean of all other trajectories in its group |
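To make the table concrete, here is a sketch of two of the per-group formulas in plain NumPy. This is not rLLM's actual implementation — it only mirrors the math in the table, with a small epsilon added for numerical safety:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO: (reward - group_mean) / group_std for one group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rloo_advantages(rewards):
    """RLOO: baseline for trajectory i is the mean of the other rewards."""
    r = np.asarray(rewards, dtype=float)
    baseline = (r.sum() - r) / (len(r) - 1)  # leave-one-out mean
    return r - baseline
```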
Advantage normalization
GRPO normalizes advantages by group standard deviation by default. If your reward distribution is already well-scaled, disable this by setting `algorithm.norm_adv_by_std_in_grpo: false`.
Using precomputed advantages
If your workflow computes per-step advantages internally (e.g., for distillation or supervised fine-tuning), skip the advantage estimator entirely: the trainer reads `step.advantage` from each step in your trajectories and uses those values directly. Steps with a missing advantage default to 0.
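A sketch of attaching precomputed advantages before training. The `Step`/`Trajectory` classes below are minimal stand-ins, not rLLM's actual types — only the `step.advantage` field comes from this page:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    advantage: Optional[float] = None  # read directly by the trainer

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

traj = Trajectory(steps=[Step(), Step(), Step()])
teacher_advantages = [0.7, -0.2, None]  # e.g., from a distillation pass
for step, adv in zip(traj.steps, teacher_advantages):
    step.advantage = adv
# Per the docs, steps left with a missing advantage default to 0.
```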
Loss function (tinker backend)
The tinker backend supports a fixed set of loss functions (`importance_sampling`, `ppo`, `cispo`, `dro`, `cross_entropy`); select one via config.
Rejection sampling
Control whether low-quality batches are discarded or accumulated; the available modes are `none`, `episode`, and `group`.
Compact filtering
Mask out trajectories that hit error conditions, so they don't contribute to the gradient.
Programmatic config
From Python, build the config the same way the CLI does: import `build_train_config()` from the CLI module and merge your overrides on top.
Tier 2: Python API hooks
These require using `AgentTrainer` or `UnifiedTrainer` directly from Python, not the CLI.
Custom trajectory grouping
By default, trajectories are grouped by `{task_id}:{trajectory_name}` — all rollouts for the same task and trajectory role end up in one group, and advantages are computed within that group.
Override this with `traj_grouping_hook` to control how trajectories are grouped:
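The hook's exact signature is not pinned down on this page, so the following is a sketch assuming it maps a trajectory object (with `task_id` and `name` fields — assumed names) to a group-ID string:

```python
from types import SimpleNamespace

def group_by_task_only(trajectory) -> str:
    """Pool every role for a task into one group, instead of the
    default "{task_id}:{trajectory_name}" grouping."""
    return str(trajectory.task_id)

# Hypothetical wiring:
# trainer = AgentTrainer(..., traj_grouping_hook=group_by_task_only)

demo = SimpleNamespace(task_id="task-7", name="solver")
```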
Per-group advantage estimator map
In multi-agent workflows (e.g., solver-judge), different trajectory groups play different roles, and you may want a different advantage estimator for each role. The map's keys are matched against the `group_id` of each `TrajectoryGroup`. The default grouping produces IDs like `{task_id}:{trajectory_name}`, and the role is the `trajectory_name` portion.
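A sketch of such a map for a solver-judge setup. Whether keys must match the role exactly or the full `group_id` depends on your rLLM version, so treat the matching semantics as an assumption:

```python
# Keys here are trajectory_name roles; values are registered estimator names.
traj_group_adv_estimator_map = {
    "solver": "grpo",       # group-normalized advantages for the solver
    "judge": "reinforce",   # raw rewards, no baseline, for the judge
}

# Hypothetical wiring:
# trainer = UnifiedTrainer(..., traj_group_adv_estimator_map=traj_group_adv_estimator_map)
```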
When using `traj_group_adv_estimator_map`, you must set `algorithm.use_rllm: true` in your config; the trainer raises an error otherwise.
Workflow arguments
Pass arbitrary arguments to your `Workflow.__init__()` via `workflow_args`:
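A sketch of the mechanism — the dict is forwarded as keyword arguments to your workflow's constructor. The parameter names (`max_turns`, `tool_timeout_s`) and the `MyWorkflow` class are made-up examples, not rLLM API:

```python
workflow_args = {"max_turns": 8, "tool_timeout_s": 30}

# Hypothetical wiring:
# trainer = AgentTrainer(..., workflow=MyWorkflow, workflow_args=workflow_args)

class MyWorkflow:  # stand-in for your Workflow subclass
    def __init__(self, max_turns: int, tool_timeout_s: int):
        self.max_turns = max_turns
        self.tool_timeout_s = tool_timeout_s

wf = MyWorkflow(**workflow_args)  # what the trainer effectively does
```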
Tier 3: Registering custom advantage estimators
The advantage estimator registry (`RLLM_ADV_ESTIMATOR_REGISTRY`) uses a decorator pattern. You can register your own estimator at import time, and then reference it by name in your config.
Defining a custom estimator
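A sketch of a registered estimator. The decorator and registry names come from this page, but the call signature is paraphrased from the "Estimator function signature" section (per-group 1-D reward arrays in; advantages and returns out) — verify it against your version. The local registry definitions are stand-ins so the sketch runs outside rLLM; in real code you would import `register_rllm_adv_estimator` from rLLM itself:

```python
import numpy as np

# Stand-in registry, mimicking rLLM's decorator pattern.
RLLM_ADV_ESTIMATOR_REGISTRY = {}

def register_rllm_adv_estimator(name):
    def deco(fn):
        RLLM_ADV_ESTIMATOR_REGISTRY[name] = fn
        return fn
    return deco

@register_rllm_adv_estimator("clipped_grpo")
def clipped_grpo(rewards, **kwargs):
    """GRPO-style advantages clipped to [-2, 2]; returns (advantages, returns)."""
    r = np.asarray(rewards, dtype=float)
    adv = np.clip((r - r.mean()) / (r.std() + 1e-6), -2.0, 2.0)
    return adv, r  # returns pass through unchanged in this sketch
```

You would then reference `clipped_grpo` by name in config, as described under "Using it".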
Using it
After importing the module that contains your registration, set the estimator's registered name as `algorithm.adv_estimator` in your config.
Estimator function signature
Your function must accept and return these types: `rewards` is a 1-D array of trajectory rewards for one group; return advantages and returns in the same list-of-arrays structure.
The `**kwargs` currently receives `norm_adv_by_std_in_grpo` from the algorithm config; future versions may pass additional parameters.
What you cannot customize today
The following aspects of the training loop are fixed and require source modifications to change:

| Area | Limitation |
|---|---|
| Training pipeline stages | The 8-stage loop (generate → transform → reject → backend batch → process → advantages → update → log) is hardcoded. You cannot add, remove, or reorder stages. |
| Rejection sampling logic | Only three modes exist: none, episode, group. You cannot provide a custom rejection function. |
| Transform pipeline internals | Name imputation happens before your grouping hook; reward validation happens after it. You cannot change this order. |
| Backend protocol | Implementing a new training backend requires subclassing BackendProtocol (8 abstract methods + 8 optional hooks). There is no lightweight plugin mechanism. |
| Loss functions | The tinker backend supports a fixed set (importance_sampling, ppo, cispo, dro, cross_entropy). Custom loss functions require backend modifications. |
| CLI flags | The rllm train CLI does not expose traj_grouping_hook or traj_group_adv_estimator_map. These are Python API-only. The CLI also doesn’t expose adv_estimator as a flag — it must be set via --config. |
| Advantage estimator kwargs | The _prepare_adv_estimator_input function passes a fixed set of kwargs to your estimator. You cannot inject custom config values without modifying this function. |
| Per-step advantage computation | The per_step mode is deprecated. Only broadcast mode (trajectory-level rewards applied to all steps) is supported in the unified trainer. |

