Training in rLLM runs through one UnifiedTrainer (see rllm.trainer.unified_trainer and rllm.cli.train), but “unified” does not mean “uniform.” A feature can be available on one backend and a silent no-op on the other; reachable from a Python script but not the CLI; active for a sandboxed agent but not a plain workflow. This page is the map: it walks the four dimensions along which capabilities vary, so you can predict what a given combination will and won’t do before you launch a run.
This page is a cross-cutting view that assumes you already know rLLM’s runtime vocabulary. For the per-backend deep dives see Backend comparison; for the loop mechanics and the gateway see Unified trainer and Training concepts; for AgentFlow and hooks see the AgentFlow API; for the full field list see Configuration.
The four dimensions
Backend
Launch method
rllm train CLI vs Hydra/Python script: the CLI is a strict subset of the script surface.
Execution flow
Dataset type
env_key/snapshots are meaningful and how rows become Tasks.
Dimension 1: Backend
Both backends share the same advantage-estimation layer and training loop. They diverge on infrastructure (only verl is distributed) and, more subtly, on which algorithm.* knobs are honored.
- tinker
- verl
GRPO → ppo, everything else → importance_sampling), an rLLM-side LR schedule with warmup, and a fused forward-backward-optim step (async overlap of the forward-backward and optimizer requests, per Tinker’s best practice).
Algorithm knobs that are not portable
These live in the backend-agnostic AlgorithmConfig / base.yaml, so they look portable. On verl they take effect; on tinker several of them parse cleanly and then do nothing. There is no warning, because tinker has no validation guard for them.
| algorithm.* knob | verl | tinker | If you set it on tinker… |
|---|---|---|---|
| adv_estimator (GRPO / REINFORCE / RLOO / REINFORCE++) | ✅ | ✅ | Works (shared registry) |
| kl_beta (KL in the loss) | ✅ | ❌ | Silent no-op. tinker logs KL as a diagnostic metric only; it is never added to the loss |
| eps_clip / eps_clip_high (PPO clip) | ✅ | ❌ | Silent no-op. tinker applies its own fixed PPO clip inside the service; rLLM cannot read or set it |
| loss_agg_mode | ✅ | ❌ | Silent no-op (tinker aggregation is fixed per-Datum) |
| rollout_correction: truncated importance sampling, TIS (tis_mode, tis_cap, bypass_mode) | ✅ | ❌ | Silent no-op (see the callout below) |
| router_replay: R2 / R3 router-replay modes (MoE) | ✅ (Megatron only) | ❌ | Hard error: tinker rejects any non-disabled value at startup |
| mask_truncated_samples | ✅ | ❌ | Silent no-op (only verl reads it) |
| loss_fn | ✅ (verl loss names) | ✅ (tinker loss names) | Works, but the valid value space differs per backend |
| lr_schedule + warmup | ✅ | ✅ | Works on both (different implementations) |
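Because these knobs fail silently, it can help to audit overrides before launching. The following is a minimal sketch, not part of rLLM: the function name, the knob set, and the assumption that a disabled router_replay is spelled "disabled" are all hypothetical and simply mirror the table above.

```python
# Hypothetical pre-flight audit. The knob names mirror the table above;
# nothing here calls into rLLM, verl, or tinker.
TINKER_SILENT_NOOPS = {
    "kl_beta",
    "eps_clip",
    "eps_clip_high",
    "loss_agg_mode",
    "rollout_correction",
    "mask_truncated_samples",
}


def audit_algorithm_overrides(backend: str, overrides: dict) -> list:
    """Return the algorithm.* knobs in `overrides` that `backend` would ignore."""
    if backend != "tinker":
        # Per the table, verl honors every knob listed there.
        return []
    # Assumption: the disabled value is the string "disabled".
    if overrides.get("router_replay", "disabled") != "disabled":
        raise ValueError("tinker rejects any non-disabled router_replay value")
    return sorted(name for name in overrides if name in TINKER_SILENT_NOOPS)


ignored = audit_algorithm_overrides(
    "tinker", {"adv_estimator": "grpo", "kl_beta": 0.01, "eps_clip": 0.2}
)
print(ignored)
```

Running a check like this in your own launch script turns a silent no-op into something visible in the logs before any compute is spent.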
per_step stepwise advantage, and distillation are genuine verl capabilities, but they live on the legacy AgentWorkflowPPOTrainer (rllm.trainer.agent_trainer.AgentTrainer → train_agent_ppo.py), not on the UnifiedTrainer VerlBackend (which is the subject of this page; see the two-AgentTrainer-classes warning below). On the unified path use_critic is hardcoded False, KL-in-reward is rejected, and per_step is coerced to broadcast with only a DeprecationWarning (it is never honored, a footgun in its own right).
Dimension 2: Launch method
workflow_class. An AgentFlow is the newer rollout abstraction (an async function decorated with @rllm.rollout); a workflow_class is the older class-based Workflow API. You pass one or the other to the trainer, and they select different engines (see Dimension 3). The CLI always uses an AgentFlow.
- rllm train (CLI)
- AgentTrainer (script / Hydra)
backend="tinker", always builds the AgentFlow + evaluator/hooks execution flow, and writes only a fixed subset of config overrides (model, batch size, group size, epochs/steps, a few trainer.* keys, sampling). It is not a Hydra entry point: there is no key=value dotlist passthrough.
The one escape hatch to other config keys is --config <your.yaml>, merged on top of the defaults. Note that it cannot switch the backend (backend is a constructor argument, not read from config).
Reachability at a glance
| You want to… | CLI | Script | How (script) |
|---|---|---|---|
| Train with verl | ❌ | ✅ | AgentTrainer(backend="verl") |
| Use a workflow_class (no gateway) | ❌ | ✅ | AgentTrainer(workflow_class=...) |
| Use a remote runtime (harbor / agentcore) | ❌ | ✅ | rllm.remote_runtime.enabled=true |
| Set adv_estimator / loss_fn / kl_beta / clip | ⚠️ --config only | ✅ | rllm.algorithm.*= |
| Rejection sampling / compact filtering / stepwise | ⚠️ --config only | ✅ | rllm.{rejection_sample,compact_filtering,stepwise_advantage}.*= |
| Fully-async training | ⚠️ --config only | ✅ | rllm.async_training.enable=true |
| Per-role advantage estimators / custom hooks | ❌ | ✅ | Python kwargs (below) |
| LoRA per-module flags (train_attn/mlp/unembed) | ❌ | ✅ | model.train_*= (CLI exposes only --lora-rank) |
| Resume mode resume_path / disable | ⚠️ --config only | ✅ | training.resume_mode= (CLI does auto by default) |
(“⚠️ --config only” means the CLI can reach the setting only through a --config YAML.)
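Since the CLI has no key=value dotlist passthrough, a dotlist habit has to be translated into a YAML file for --config. The sketch below shows one way to do that translation with the standard library only. The helper names are hypothetical, the values are illustrative, and whether your --config file should be rooted at rllm: exactly like this is an assumption to check against the Configuration page.

```python
# Hypothetical helpers: expand Hydra-style dotted overrides into nested
# YAML text. Values are kept as the raw strings given on the right of "=".
def dotlist_to_nested(overrides: list) -> dict:
    nested: dict = {}
    for item in overrides:
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            # Walk (or create) one dict level per dotted segment.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def to_yaml(node: dict, indent: int = 0) -> str:
    """Emit a minimal block-style YAML mapping (dicts and scalars only)."""
    pad = "  " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


yaml_text = to_yaml(
    dotlist_to_nested(
        ["rllm.algorithm.kl_beta=0.01", "rllm.async_training.enable=true"]
    )
)
print(yaml_text)
```

Write the resulting text to a file and pass it with --config. Remember that the backend itself cannot be changed this way.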
Reaching a script-only feature from a CLI habit
Start from the CLI mental model
rllm train <dataset> --agent <agent> gives you tinker + GRPO + the AgentFlow path. Good for a vanilla run.
Hit a wall (verl, a script-only knob, a remote runtime)
--config either can’t express it (backend) or you’d rather not hand-edit YAML.
Dimension 3: Execution flow
The trainer creates a gateway (a local proxy that captures each LLM call as a training Step and pushes sampling params / weight versions) for the agent-based flows, and auto-wires hooks (the SandboxTaskHooks lifecycle that provisions a sandbox per task) for the sandboxed flow. Which engine you get is decided from what you pass to the trainer:
Regular — plain AgentFlow (AgentFlowEngine, with gateway)
agent_flow + an evaluator, no sandbox. Runs through AgentFlowEngine with a gateway, so trace→Step capture, session sampling params, and weight-version push all apply.
Regular — workflow_class (UnifiedWorkflowEngine, no gateway)
workflow_class and no agent_flow. Runs through UnifiedWorkflowEngine without a gateway. Gateway-mediated features do not apply here, and there are no hooks, so no warm pool and no snapshot.
Sandboxed — SandboxedAgentFlow + SandboxTaskHooks
SandboxedAgentFlow (or a harbor-task dataset), for which AgentTrainer auto-wires SandboxTaskHooks. The created/warm-pooled sandbox is the agent’s execution environment. This is the only flow where the warm pool and snapshot boot engage.
Remote runtime — harbor / agentcore (RemoteAgentFlowEngine)
rllm.remote_runtime.enabled=true. The in-container agent manages its own environment, so rLLM’s sandbox features are not applicable. Correctness comes from the runtime’s reward (a fixed reward >= 1.0 threshold), not a pluggable host-side evaluator. See AgentCore runtime.
workflow_class does not. When you read “the regular path supports X,” check which regular path: they have opposite gateway behavior.
What attaches to each flow
| Capability | regular (AgentFlow) | sandboxed | remote runtime | regular (workflow_class) |
|---|---|---|---|---|
| Gateway (trace capture, session params) | ✅ | ✅ | ✅ | ❌ |
| SandboxTaskHooks lifecycle | — | ✅ | — | — |
| Warm-pool prefetch | ❌ | ✅ | ❌ | ❌ |
| Snapshot cold-start boot | ❌ | ✅ | ❌ | ❌ |
| Pluggable host-side Evaluator | ✅ | ✅ | ❌ (reward threshold only) | ❌ (the workflow computes its own reward) |
| Per-rollout retry | ✅ | ✅ | ❌ (single attempt) | ✅ |
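The flow selection and the capability table can be read together as a lookup. The sketch below encodes both in plain Python as a reading aid. The function name, its parameters, and the precedence order (remote runtime checked first) are assumptions; the capability values are copied from the table above.

```python
# Hypothetical lookup that mirrors the capability table. It does not
# import rLLM; it only restates what the table says.
CAPABILITIES = {
    "regular (AgentFlow)": {
        "gateway": True, "sandbox_hooks": False, "warm_pool": False,
        "snapshot_boot": False, "host_evaluator": True, "per_rollout_retry": True,
    },
    "sandboxed": {
        "gateway": True, "sandbox_hooks": True, "warm_pool": True,
        "snapshot_boot": True, "host_evaluator": True, "per_rollout_retry": True,
    },
    "remote runtime": {
        "gateway": True, "sandbox_hooks": False, "warm_pool": False,
        "snapshot_boot": False, "host_evaluator": False, "per_rollout_retry": False,
    },
    "regular (workflow_class)": {
        "gateway": False, "sandbox_hooks": False, "warm_pool": False,
        "snapshot_boot": False, "host_evaluator": False, "per_rollout_retry": True,
    },
}


def select_flow(*, agent_flow_given: bool, workflow_class_given: bool,
                sandboxed: bool = False, remote_runtime_enabled: bool = False) -> str:
    """Guess the execution flow from what is passed to the trainer."""
    if remote_runtime_enabled:  # assumed to take precedence
        return "remote runtime"
    if workflow_class_given and not agent_flow_given:
        return "regular (workflow_class)"
    if agent_flow_given and sandboxed:
        return "sandboxed"
    if agent_flow_given:
        return "regular (AgentFlow)"
    raise ValueError("pass either an agent_flow or a workflow_class")


flow = select_flow(agent_flow_given=False, workflow_class_given=True)
print(flow, CAPABILITIES[flow]["gateway"])
```

A lookup like this makes the two "regular" paths hard to confuse: the same question about the gateway gives opposite answers depending on which one you selected.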
Dimension 4: Dataset type
An env_key is the identity rLLM uses to decide which tasks can share a pre-built sandbox image: tasks with the same buildable environment get the same key and reuse one snapshot. rLLM-native parquet rows carry no per-task Dockerfile, so they all collapse to one constant env_key (a single default image) and a snapshot has nothing to differentiate. Harbor task dirs, by contrast, ship a Dockerfile, giving each environment a meaningful key.
| Capability | rLLM-native rows | harbor task dirs |
|---|---|---|
| StatefulTaskDataLoader (deterministic, resumable) | ✅ | ✅ |
| Meaningful env_key / snapshot acceleration | ⚠️ degenerate (one constant key) | ✅ |
| Remote-runtime (harbor) training | ❌ raises | ✅ |
| CLI val-set Task-wrapping | ✅ | ⚠️ only if a harbor val_entry is resolved |
SandboxedAgentFlow, the warm pool still runs but the snapshot gives no real speedup (the single env_key). Plain rLLM-native rows on a non-sandboxed AgentFlow engage no sandbox or warm pool at all. The harbor remote-runtime training path additionally requires a task_path; plain parquet rows without one raise an error. On the harbor CLI path the train dataset is always Task-wrapped, but the val dataset is wrapped only when a harbor val_entry is resolved — otherwise a harbor val set stays as raw dicts and loses its verifier/environment resolution.
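To see why a single constant env_key makes snapshots degenerate, consider a toy version of the idea. This sketch is an illustration only: hashing the Dockerfile text, the "default" constant, and the dockerfile field name are all assumptions, and rLLM's real key derivation may differ.

```python
import hashlib
from collections import defaultdict
from typing import Optional

DEFAULT_ENV_KEY = "default"  # hypothetical constant for rows with no Dockerfile


def env_key(dockerfile_text: Optional[str]) -> str:
    """Toy env_key: identical Dockerfiles map to the same key."""
    if dockerfile_text is None:
        return DEFAULT_ENV_KEY
    return hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()[:12]


def group_by_env_key(tasks: list) -> dict:
    """Group task ids by key; each group could share one snapshot."""
    groups = defaultdict(list)
    for task in tasks:
        groups[env_key(task.get("dockerfile"))].append(task["id"])
    return dict(groups)


native_rows = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
harbor_tasks = [
    {"id": "x", "dockerfile": "FROM python:3.11"},
    {"id": "y", "dockerfile": "FROM python:3.11"},
    {"id": "z", "dockerfile": "FROM ubuntu:22.04"},
]
print(group_by_env_key(native_rows))
print(len(group_by_env_key(harbor_tasks)))
```

In the toy model the native rows fall into one group, so a snapshot has only one image to cache, while the harbor-style tasks split into one group per distinct environment.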
Gotchas worth memorizing
Silent no-ops on tinker
kl_beta, eps_clip/eps_clip_high, loss_agg_mode, rollout_correction/TIS, and mask_truncated_samples are honored on verl and silently ignored on tinker. They live in backend-agnostic config, so they look settable everywhere.
per_step advantage is silently downgraded
stepwise_advantage.mode=per_step on the unified path does not error. On both backends it is coerced to broadcast, and the only signal is a DeprecationWarning. True per_step advantage exists only on the legacy verl AgentWorkflowPPOTrainer.
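A DeprecationWarning is easy to miss because Python's default warning filters hide it unless it is triggered directly in __main__. The sketch below imitates the coercion with a hypothetical function (it is not rLLM's code) and shows how a warnings filter can surface it.

```python
import warnings


def resolve_stepwise_mode(mode: str) -> str:
    """Hypothetical stand-in for the coercion described above."""
    if mode == "per_step":
        warnings.warn(
            "per_step is not honored on the unified path; using broadcast",
            DeprecationWarning,
            stacklevel=2,
        )
        return "broadcast"
    return mode


# Record warnings instead of letting the default filters drop them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    effective = resolve_stepwise_mode("per_step")

print(effective, [w.category.__name__ for w in caught])
```

In a launch script, calling warnings.simplefilter("error", DeprecationWarning) before building the trainer would turn a downgrade like this into an exception instead of a log line.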
filter_token_mismatch does nothing
filter_token_mismatch ships True in base.yaml but no rLLM-side Python reads it (it is a leftover verl-native knob, still present in agent_ppo_trainer.yaml). It is a dead knob on both unified backends.
val_before_train default disagrees with itself
True if the key is absent, but the shipped base.yaml sets false. On normal CLI/Hydra runs the yaml wins; a hand-built config that omits the key will validate before training, the opposite of the shipped default. Set it explicitly in programmatic configs.
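The mismatch is the ordinary behavior of a lookup with a fallback. The sketch below reproduces it with plain dicts; the flat key layout and the function name are assumptions made for illustration, not rLLM's actual config structure.

```python
def should_validate_first(cfg: dict) -> bool:
    # Code-side fallback: True when the key is missing entirely.
    return cfg.get("val_before_train", True)


shipped_like = {"val_before_train": False}  # stands in for the shipped base.yaml
hand_built = {}                             # programmatic config that omits the key
explicit = {"val_before_train": False}      # the recommended fix: set it yourself

print(should_validate_first(shipped_like))
print(should_validate_first(hand_built))
print(should_validate_first(explicit))
```

The middle case is the surprise: nothing was set, yet validation runs before training.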
CLI --sandbox-backend doesn't reach a harbor agent's environment
harbor:<scaffold> agent, rllm eval maps --sandbox-backend onto the harbor environment type, but rllm train does not: train has no harbor-runtime handling and routes the agent through the local AgentFlow path instead. Configure harbor training as a remote runtime from a script.
R3 router replay is incompatible with gateway rollout on verl
router_replay="R3" + verl + a gateway-based engine (sandboxed or remote-runtime flow) raises at construction. R3 on verl works only through the direct workflow_class path.
Putting it together
A few worked combinations and what to expect:
- Quick GRPO on a math dataset
- Multi-GPU verl run with KL-in-loss and TIS
- Sandboxed SWE agent with warm-pool acceleration
- Per-role estimators for a solver-judge workflow
rllm train → tinker + AgentFlowEngine + GRPO. Everything you need is a CLI flag. No clip/KL tuning is available without --config, but GRPO doesn’t need them.
See also
Training concepts
AgentFlow API
Backend comparison
Unified trainer
Configuration
sync_config table
