This guide documents every customization surface available to users who install rLLM as a package (`pip install rllm`) and want to control training behavior from their own project, without forking the codebase.

Customization layers

rLLM offers three tiers of customization, from least to most effort:
| Tier | Mechanism | What you can change |
| --- | --- | --- |
| Config | YAML file or CLI flags | Algorithm selection, hyperparameters, filtering, rejection sampling |
| Python kwargs | `AgentTrainer(...)` or `UnifiedTrainer(...)` arguments | Trajectory grouping logic, per-group advantage estimator selection |
| Registry | `@register_rllm_adv_estimator` decorator | Custom advantage computation functions |

Tier 1: Config-level customization

The simplest path. Pass a YAML file via `--config` on the CLI, or set keys programmatically with OmegaConf before passing to `AgentTrainer`.

Selecting an advantage estimator

Set `algorithm.adv_estimator` in your config to one of the built-in estimators:

```yaml
# my_config.yaml
rllm:
  algorithm:
    adv_estimator: rloo  # grpo | reinforce | reinforce_plus_plus_baseline | rloo
    use_rllm: true
```
| Estimator | Behavior |
| --- | --- |
| `grpo` | `(reward - group_mean) / group_std` — default, works well for most tasks |
| `reinforce` | `advantage = reward` — no baseline subtraction |
| `reinforce_plus_plus_baseline` | Per-group mean baseline, then whiten by batch-level std |
| `rloo` | Leave-one-out: baseline for trajectory `i` is the mean of all other trajectories in its group |
From the CLI:

```bash
rllm train gsm8k --model Qwen/Qwen3-8B --config my_config.yaml
```
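For intuition, the arithmetic behind three of these estimators can be sketched in plain NumPy. This is illustrative only; rLLM's actual implementations may differ in details such as epsilon handling and batching:

```python
import numpy as np

def grpo_adv(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # (reward - group_mean) / group_std
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def reinforce_adv(rewards: np.ndarray) -> np.ndarray:
    # advantage = reward, no baseline subtraction
    return rewards

def rloo_adv(rewards: np.ndarray) -> np.ndarray:
    # leave-one-out: baseline for trajectory i is the mean of the others
    n = len(rewards)
    baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - baseline

group = np.array([1.0, 0.0, 1.0, 1.0])  # rewards for one trajectory group
print(grpo_adv(group))
print(rloo_adv(group))
```

Note that RLOO's leave-one-out baseline requires at least two trajectories per group, which is one reason the `min_trajs_per_group` rejection-sampling threshold exists.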

Advantage normalization

GRPO normalizes advantages by group standard deviation by default. Disable this if your reward distribution is already well-scaled:
```yaml
rllm:
  algorithm:
    norm_adv_by_std_in_grpo: false
```
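The effect of the flag is easy to see on a toy group (illustrative NumPy, not rLLM's code):

```python
import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 1.0])
centered = rewards - rewards.mean()

# norm_adv_by_std_in_grpo: true (default) — whiten by the group std
normalized = centered / (rewards.std() + 1e-6)

# norm_adv_by_std_in_grpo: false — keep the raw centered rewards
unnormalized = centered

print(normalized)    # magnitudes rescaled to ~±1
print(unnormalized)  # magnitudes stay at ±0.5
```

With binary rewards the group std is at most 0.5, so normalization roughly doubles advantage magnitudes; if your rewards are already on a sensible scale, disabling it avoids that amplification.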

Using precomputed advantages

If your workflow computes per-step advantages internally (e.g., for distillation or supervised fine-tuning), skip the advantage estimator entirely:
```yaml
rllm:
  algorithm:
    use_precomputed_advantage: true
```
When enabled, the trainer reads `step.advantage` from each step in your trajectories and uses those values directly. Steps with missing advantages default to 0.

Loss function (tinker backend)

The tinker backend supports multiple loss functions. Set via config:
```yaml
rllm:
  algorithm:
    loss_fn: ppo  # importance_sampling | ppo | cispo | dro | cross_entropy
    lr_schedule: cosine  # constant | linear | cosine
    warmup_steps_ratio: 0.1
```

Rejection sampling

Control whether low-quality batches are discarded or accumulated:
```yaml
rllm:
  rejection_sample:
    enable: true
    min_partial_solve_tasks: 1  # min tasks with non-zero reward before proceeding
    min_trajs_per_group: 2      # min trajectories per group
```
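As a mental model, the two thresholds act as a batch-level gate, roughly like this (illustrative pseudologic with hypothetical inputs, not rLLM's actual implementation):

```python
def batch_passes(task_rewards: dict[str, list[float]],
                 group_sizes: dict[str, int],
                 min_partial_solve_tasks: int = 1,
                 min_trajs_per_group: int = 2) -> bool:
    """Illustrative gate: keep a batch only if enough tasks show reward signal
    and every group retains enough trajectories for a baseline."""
    solved = sum(1 for rewards in task_rewards.values() if any(r != 0 for r in rewards))
    if solved < min_partial_solve_tasks:
        return False  # no learning signal in this batch
    return all(n >= min_trajs_per_group for n in group_sizes.values())

print(batch_passes({"t1": [0.0, 1.0]}, {"t1": 2}))  # True
print(batch_passes({"t1": [0.0, 0.0]}, {"t1": 2}))  # False: no task has non-zero reward
```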

Compact filtering

Mask out trajectories that hit error conditions, so they don’t contribute to the gradient:
```yaml
rllm:
  compact_filtering:
    enable: true
    mask_timeout: true
    mask_error: true
    mask_max_response_length_exceeded: true
    mask_max_turns_exceeded: true
```

Programmatic config

From Python, build the config the same way the CLI does:
```python
from omegaconf import OmegaConf

base = OmegaConf.load("path/to/base.yaml")
overrides = OmegaConf.create({
    "rllm": {
        "algorithm": {
            "adv_estimator": "rloo",
            "use_rllm": True,
        },
        "rejection_sample": {"enable": True},
    }
})
config = OmegaConf.merge(base, overrides)
```
Or use `build_train_config()` from the CLI module and merge your overrides on top:

```python
from rllm.experimental.cli.train import build_train_config
from omegaconf import OmegaConf

config = build_train_config(
    model_name="Qwen/Qwen3-8B",
    group_size=8, batch_size=32, lr=2e-5,
    lora_rank=32, total_epochs=1, total_steps=None,
    val_freq=5, save_freq=20,
    project="my-project", experiment="experiment-1",
    output_dir=None, config_file=None,
)

# Merge custom algorithm settings
config = OmegaConf.merge(config, OmegaConf.create({
    "rllm": {"algorithm": {"adv_estimator": "rloo", "use_rllm": True}}
}))
```

Tier 2: Python API hooks

These hooks require using `AgentTrainer` or `UnifiedTrainer` directly from Python, not the CLI.

Custom trajectory grouping

By default, trajectories are grouped by `{task_id}:{trajectory_name}` — all rollouts for the same task and trajectory role end up in one group, and advantages are computed within that group. Override this with `traj_grouping_hook` to control how trajectories are grouped:
```python
from collections import defaultdict

from rllm.agents.agent import Episode, TrajectoryGroup
from rllm.experimental.common.config import TransformConfig, CompactFilteringConfig

def my_grouping_hook(
    episodes: list[Episode],
    transform_config: TransformConfig,
    compact_filtering_config: CompactFilteringConfig | None = None,
) -> list[TrajectoryGroup]:
    """Group trajectories by task_id only, ignoring trajectory name."""
    groups = defaultdict(list)
    metadata = defaultdict(list)

    for episode in episodes:
        for traj in episode.trajectories:
            if not traj.steps:
                continue
            key = episode.task_id  # group all trajectories for the same task
            groups[key].append(traj)
            metadata[key].append({
                "task_id": episode.task_id,
                "rollout_idx": episode.rollout_idx,
            })

    return [
        TrajectoryGroup(
            trajectories=trajs,
            group_id=key,
            metadata=metadata[key],
        )
        for key, trajs in groups.items()
    ]
```
Pass it to the trainer:
```python
from rllm.experimental.unified_trainer import AgentTrainer

trainer = AgentTrainer(
    backend="tinker",
    workflow_class=MyWorkflow,
    config=config,
    train_dataset=train_dataset,
    traj_grouping_hook=my_grouping_hook,
)
trainer.train()
```
Your hook must return `TrajectoryGroup` objects with valid `group_id` and `trajectories` fields. Each trajectory must have a non-None reward when using broadcast mode (the default). The trainer validates rewards after your hook runs.

Per-group advantage estimator map

In multi-agent workflows (e.g., solver-judge), different trajectory groups play different roles. You may want a different advantage estimator for each role:
```python
from rllm.experimental.common.config import rLLMAdvantageEstimator

trainer = AgentTrainer(
    backend="tinker",
    workflow_class=SolverJudgeWorkflow,
    config=config,
    train_dataset=train_dataset,
    traj_group_adv_estimator_map={
        "solver": rLLMAdvantageEstimator.GRPO,
        "judge": rLLMAdvantageEstimator.REINFORCE,
    },
)
trainer.train()
```
The map keys are group roles, derived from the `group_id` of each `TrajectoryGroup`. The default grouping produces IDs like `{task_id}:{trajectory_name}`, and the role is the `trajectory_name` portion.
When using `traj_group_adv_estimator_map`, you must set `algorithm.use_rllm: true` in your config. The trainer raises an error otherwise.
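Assuming the default `{task_id}:{trajectory_name}` IDs, the role lookup can be pictured as a simple split. This is a sketch of the idea; the trainer's actual matching logic may differ:

```python
def role_from_group_id(group_id: str) -> str:
    # "{task_id}:{trajectory_name}" -> "trajectory_name"
    return group_id.rsplit(":", 1)[-1]

# Hypothetical map mirroring the solver-judge example above
adv_map = {"solver": "grpo", "judge": "reinforce"}

print(adv_map[role_from_group_id("math-001:solver")])  # grpo
print(adv_map[role_from_group_id("math-001:judge")])   # reinforce
```

This also explains why a custom grouping hook that drops the `:{trajectory_name}` suffix (like the task-only example above) will not match role keys in the map.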

Workflow arguments

Pass arbitrary arguments to your `Workflow.__init__()` via `workflow_args`:

```python
trainer = AgentTrainer(
    backend="tinker",
    workflow_class=DistillationWorkflow,
    workflow_args={
        "timeout": 600,
        "teacher_engine": teacher_engine,
        "clip_min": -5.0,
        "clip_max": 5.0,
    },
    config=config,
    train_dataset=train_dataset,
)
```

Tier 3: Registering custom advantage estimators

The advantage estimator registry (`RLLM_ADV_ESTIMATOR_REGISTRY`) uses a decorator pattern. You can register your own estimator at import time, then reference it by name in your config.

Defining a custom estimator

```python
import numpy as np
from rllm.experimental.common.advantage import register_rllm_adv_estimator

@register_rllm_adv_estimator("median_baseline")
def calculate_median_baseline_advantages(
    rewards: list[np.ndarray],
    **kwargs,
) -> tuple[list[np.ndarray], list[np.ndarray]]:
    """Use median reward as the baseline instead of mean."""
    advantages = []
    returns = []
    for group_rewards in rewards:
        baseline = np.median(group_rewards)
        adv = group_rewards - baseline
        advantages.append(adv)
        returns.append(group_rewards)
    return advantages, returns
```

Using it

After importing the module that contains your registration, set the estimator name in config:
```python
# Must import before creating the trainer so the decorator runs
import my_project.custom_advantages  # registers "median_baseline"

from omegaconf import OmegaConf

config = OmegaConf.merge(config, OmegaConf.create({
    "rllm": {
        "algorithm": {
            "adv_estimator": "median_baseline",
            "use_rllm": True,
        }
    }
}))

trainer = AgentTrainer(
    backend="tinker",
    workflow_class=MyWorkflow,
    config=config,
    train_dataset=train_dataset,
)
trainer.train()
```

Estimator function signature

Your function must accept and return these types:
```python
def my_estimator(
    rewards: list[np.ndarray],  # one array per trajectory group
    **kwargs,                    # receives norm_adv_by_std_in_grpo and other config values
) -> tuple[
    list[np.ndarray],           # advantages (same shape as rewards)
    list[np.ndarray],           # returns (same shape as rewards)
]:
    ...
```
Each element in `rewards` is a 1-D array of trajectory rewards for one group. Return advantages and returns in the same list-of-arrays structure.
The `**kwargs` currently receives `norm_adv_by_std_in_grpo` from the algorithm config. Future versions may pass additional parameters.
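A custom estimator that honors the normalization flag can read it from `**kwargs`. The `mean_baseline_estimator` name and its fallback default are illustrative, a sketch rather than rLLM's implementation:

```python
import numpy as np

def mean_baseline_estimator(rewards, **kwargs):
    """Sketch: per-group mean baseline, optionally whitened like GRPO."""
    norm_by_std = kwargs.get("norm_adv_by_std_in_grpo", True)
    advantages, returns = [], []
    for group in rewards:
        adv = group - group.mean()
        if norm_by_std:
            adv = adv / (group.std() + 1e-6)  # epsilon guards zero-variance groups
        advantages.append(adv)
        returns.append(group)
    return advantages, returns

advs, rets = mean_baseline_estimator(
    [np.array([0.0, 1.0])], norm_adv_by_std_in_grpo=False,
)
print(advs[0])  # [-0.5  0.5]
```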

What you cannot customize today

The following aspects of the training loop are fixed and require source modifications to change:
| Area | Limitation |
| --- | --- |
| Training pipeline stages | The 8-stage loop (generate → transform → reject → backend batch → process → advantages → update → log) is hardcoded. You cannot add, remove, or reorder stages. |
| Rejection sampling logic | Only three modes exist: `none`, `episode`, `group`. You cannot provide a custom rejection function. |
| Transform pipeline internals | Name imputation happens before your grouping hook; reward validation happens after it. You cannot change this order. |
| Backend protocol | Implementing a new training backend requires subclassing `BackendProtocol` (8 abstract methods + 8 optional hooks). There is no lightweight plugin mechanism. |
| Loss functions | The tinker backend supports a fixed set (`importance_sampling`, `ppo`, `cispo`, `dro`, `cross_entropy`). Custom loss functions require backend modifications. |
| CLI flags | The `rllm train` CLI does not expose `traj_grouping_hook` or `traj_group_adv_estimator_map`; these are Python API-only. The CLI also doesn't expose `adv_estimator` as a flag — it must be set via `--config`. |
| Advantage estimator kwargs | The `_prepare_adv_estimator_input` function passes a fixed set of kwargs to your estimator. You cannot inject custom config values without modifying this function. |
| Per-step advantage computation | The `per_step` mode is deprecated. Only `broadcast` mode (trajectory-level rewards applied to all steps) is supported in the unified trainer. |

Summary: customization decision tree

```text
Want to change the advantage algorithm?
├── Use a built-in one → set `algorithm.adv_estimator` in config (Tier 1)
├── Use different estimators for different agent roles → pass `traj_group_adv_estimator_map` (Tier 2)
├── Write your own math → register with `@register_rllm_adv_estimator` (Tier 3)
└── Bring precomputed advantages from your workflow → set `use_precomputed_advantage: true` (Tier 1)

Want to change how trajectories are grouped?
├── Default grouping is fine → do nothing
└── Custom logic needed → pass `traj_grouping_hook` (Tier 2)

Want to change training hyperparameters?
└── YAML config or CLI flags (Tier 1)

Want to change the training loop structure?
└── Not possible without source modifications
```

Next steps