The rLLM framework provides a unified configuration system that separates backend-agnostic settings from backend-specific configurations. This design allows you to switch between different RL backends (Tinker, Verl) while maintaining consistent core training logic.
Configuration structure
The configuration system is organized into three main components:
- rLLM backend-agnostic configs: Core training settings shared across all backends
- Backend-specific configs: Settings specific to Tinker or Verl backends
- Forwarding mechanism: Allows backend-specific configs to override rLLM configs for backward compatibility
All configuration files are located in rllm/trainer/config/:
- rllm/trainer/config/rllm/base.yaml: Backend-agnostic rLLM configurations
- rllm/trainer/config/rllm/backend/tinker.yaml: Tinker-specific configurations
- rllm/trainer/config/rllm/backend/verl.yaml: Verl-specific configurations
- rllm/trainer/config/unified.yaml: Main entry point that combines all configs
rLLM backend-agnostic configurations
These configurations are defined in rllm/base.yaml and are used across different backends.
Agent configuration
Settings for the agent that interacts with the environment.
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | math_agent | Name of the agent |
| max_steps | int | 20 | Maximum number of steps per trajectory |
| trajectory_timeout | int \| null | null | Timeout for trajectory execution (seconds) |
| overlong_filter | bool | False | Whether to filter out overlong trajectories |
| agent_args | dict | {} | Additional agent-specific arguments |
| engine_args | dict | {} | Additional engine-specific arguments |
Environment configuration
Settings for the environment where the agent operates.
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | custom | Name of the environment |
| env_args | dict | {} | Additional environment-specific arguments |
Workflow configuration
Settings for workflow-based training (alternative to agent-based training).
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_workflow | bool | False | Whether to use workflow mode instead of agent mode |
| name | str | single_turn_workflow | Name of the workflow |
| workflow_args.agent_cls | str \| null | null | Agent class to use in workflow |
| workflow_args.agent_args | dict | {} | Agent arguments in workflow |
| workflow_args.env_cls | str \| null | null | Environment class to use in workflow |
| workflow_args.env_args | dict | {} | Environment arguments in workflow |
| workflow_args.timeout | float | 1e6 | Workflow execution timeout |
| workflow_args.gamma | float | 0.0 | Discount factor (0.0 = no discounting) |
| workflow_args.reward_bonus_coeff | float | 0.0 | Reward shaping coefficient |
| n_parallel_tasks | int | 256 | Number of parallel tasks to run |
| retry_limit | int | 3 | Maximum number of retries on failure |
| raise_on_error | bool | True | Whether to raise exceptions on errors |
Rollout configuration
Settings for trajectory rollouts during training and validation.
These settings are primarily for logging purposes. The actual rollout behavior is determined by backend-specific configurations.
| Parameter | Type | Default | Description |
|---|---|---|---|
| n | int | 8 | Number of rollouts per prompt during training |
| n_val | int | 1 | Number of rollouts per prompt during validation |
Trainer configuration
Core training loop settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| total_epochs | int | 10 | Total number of training epochs |
| total_batches | int | -1 | Total number of training batches (-1 = use epochs) |
| logger | list[str] | ['console'] | Logging backends (options: console, wandb, tensorboard) |
| project_name | str | rllm-training | Project name for logging |
| experiment_name | str | default | Experiment name for logging |
| test_freq | int | 5 | Frequency of validation (in epochs) |
| save_freq | int | 20 | Frequency of checkpoint saving (in epochs) |
| val_before_train | bool | True | Whether to run validation before training starts |
| val_only | bool | False | Whether to only run validation (no training) |
Algorithm configuration
RL algorithm and advantage estimation settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| adv_estimator | str | grpo | Advantage estimator (options: grpo, reinforce, reinforce_plus_plus_baseline, rloo) |
| norm_adv_by_std_in_grpo | bool | True | Whether to normalize advantages by standard deviation in GRPO |
| loss_fn | str \| null | null | Loss function for Tinker backend (options: importance_sampling, ppo, cispo, dro, cross_entropy) |
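
To make the effect of norm_adv_by_std_in_grpo concrete, here is a minimal sketch of group-relative advantage estimation for the rollouts of a single prompt. The function name, the use of the population standard deviation, and the eps value are assumptions made for illustration; rLLM's actual estimator may differ in those details.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, norm_adv_by_std=True, eps=1e-6):
    """Group-relative advantages for the rollouts of one prompt (illustrative)."""
    baseline = mean(rewards)                     # group mean acts as the baseline
    centered = [r - baseline for r in rewards]
    if not norm_adv_by_std:
        return centered                          # mean-centering only
    scale = pstdev(rewards) + eps                # assumed: population std plus a small eps
    return [c / scale for c in centered]


rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards, norm_adv_by_std=False))
print(grpo_advantages(rewards, norm_adv_by_std=True))
```

With normalization off, the advantages keep the scale of the raw rewards; with it on, each group is rescaled to roughly unit spread.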
Stepwise advantage configuration
Settings for computing advantages at each step in multi-step trajectories.
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | bool | False | Whether to enable stepwise advantage computation |
| mode | str | broadcast | Advantage computation mode (options: broadcast, per_step) |
| normalize_by_steps | bool | False | Whether to normalize advantages by number of steps |
Trajectory processing flags
Top-level flags for trajectory processing and filtering.
| Parameter | Type | Default | Description |
|---|---|---|---|
| disable_thinking | bool | False | Whether to disable thinking tokens in responses |
| accumulate_reasoning | bool | False | Whether to accumulate reasoning across steps |
| mask_truncated_samples | bool | False | Whether to mask trajectories that were truncated |
| filter_token_mismatch | bool | True | Whether to filter out trajectories with token mismatches |
Compact filtering configuration
Fine-grained filtering of trajectories based on various termination conditions.
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | bool | False | Whether to enable compact filtering |
| mask_max_prompt_length_exceeded | bool | True | Mask trajectories that exceed max prompt length |
| mask_max_response_length_exceeded | bool | True | Mask trajectories that exceed max response length |
| mask_env_done | bool | False | Mask trajectories where environment signaled done |
| mask_max_turns_exceeded | bool | True | Mask trajectories that exceed max turns |
| mask_timeout | bool | True | Mask trajectories that timed out |
| mask_unknown | bool | False | Mask trajectories with unknown termination reasons |
| mask_error | bool | True | Mask trajectories that encountered errors |
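
The flags above can be read as a lookup from a trajectory's termination reason to a mask decision. The sketch below illustrates that reading. The dataclass, the should_mask helper, and the reason strings (the flag names with the mask_ prefix removed) are hypothetical; only the flag names and defaults come from the table.

```python
from dataclasses import dataclass


@dataclass
class CompactFilteringConfig:
    # Field names and defaults mirror the table above.
    enable: bool = False
    mask_max_prompt_length_exceeded: bool = True
    mask_max_response_length_exceeded: bool = True
    mask_env_done: bool = False
    mask_max_turns_exceeded: bool = True
    mask_timeout: bool = True
    mask_unknown: bool = False
    mask_error: bool = True


def should_mask(reason: str, cfg: CompactFilteringConfig) -> bool:
    """Return True if a trajectory with this termination reason is masked."""
    if not cfg.enable:
        return False                      # filtering disabled: nothing is masked
    flag = f"mask_{reason}"
    if not hasattr(cfg, flag):
        flag = "mask_unknown"             # assumed fallback for unrecognized reasons
    return getattr(cfg, flag)


cfg = CompactFilteringConfig(enable=True)
print(should_mask("timeout", cfg))        # masked under the defaults
print(should_mask("env_done", cfg))       # kept under the defaults
```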
Rejection sampling configuration
Settings for rejection sampling to improve training data quality.
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable | bool | False | Whether to enable rejection sampling |
| multiplier | int | 1 | Multiplier for number of rollouts to generate |
| min_partial_solve_tasks | int | 1 | Minimum number of tasks that must be partially solved |
| min_trajs_per_group | int | 2 | Minimum number of trajectories per group to keep |
SDK configuration
Settings for the rLLM SDK, including trace storage and proxy server.
| Parameter | Type | Default | Description |
|---|---|---|---|
| store.path | str | ~/.rllm/traces.db | Path to trace database |
| processing.groupby_key | str \| null | null | Key to group trajectories by |
| processing.traj_name_key | str \| null | null | Key to use as trajectory name |
| proxy.host | str | 127.0.0.1 | Proxy server host |
| proxy.port | int | 4000 | Proxy server port |
| proxy.mode | str | subprocess | Proxy mode (options: subprocess, external) |
| proxy.admin_token | str | my-shared-secret | Admin token for proxy authentication |
Episode logging configuration
Settings for logging full episode trajectories to disk.
| Parameter | Type | Default | Description |
|---|---|---|---|
| log_episodes | bool | false | Whether to log full episodes to disk |
| episode_log_dir | str | logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name} | Directory for episode logs |
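
The episode_log_dir default uses ${...} interpolation, which Hydra (through OmegaConf) resolves against other keys in the merged config. The sketch below is a standard-library stand-in for that resolution, not the library's implementation; the resolve helper is hypothetical and handles only plain dotted key references.

```python
import re


def resolve(template, cfg):
    """Replace each ${a.b.c} with the value found at cfg['a']['b']['c']."""
    def lookup(match):
        node = cfg
        for part in match.group(1).split("."):
            node = node[part]             # walk the nested dict one key at a time
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, template)


cfg = {"rllm": {"trainer": {"project_name": "rllm-training",
                            "experiment_name": "default"}}}
template = "logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name}"
print(resolve(template, cfg))             # logs/rllm-training/default
```

With the trainer defaults listed earlier, episodes would therefore land under logs/rllm-training/default unless project_name or experiment_name is overridden.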
Backend-specific configurations
Tinker backend
Tinker-specific settings live in rllm/trainer/config/rllm/backend/tinker.yaml.
This file contains:
- Tinker service and execution settings
- Model/LoRA training settings
- Sampling and rollout-engine settings
- Tinker-native training/data blocks
- Forwarding into rllm.* common config keys
Top-level Tinker-specific keys
| Parameter | Type | Default | Description |
|---|---|---|---|
| tinker_base_url | str \| null | null | Tinker service URL (null for local/default) |
| fuse_forward_backward_and_optim_step | bool | false | Whether to fuse train-step internals in backend |
Model block
| Parameter | Type | Default | Description |
|---|---|---|---|
| model.name | str | Qwen/Qwen3-8B | Base model name |
| model.lora_rank | int | 32 | LoRA rank |
| model.train_unembed | bool | true | Train LoRA on output embedding layer |
| model.train_attn | bool | true | Train LoRA on attention layers |
| model.train_mlp | bool | true | Train LoRA on MLP layers |
Training block (Tinker-native)
| Parameter | Type | Default | Description |
|---|---|---|---|
| training.group_size | int | ??? | Number of rollouts per prompt |
| training.learning_rate | float | 2e-5 | Learning rate |
| training.lr_schedule | str | constant | LR schedule (constant, linear, cosine) |
| training.warmup_steps_ratio | float | 0.0 | Warmup ratio in [0, 1] |
| training.beta1 | float | 0.9 | Adam beta1 |
| training.beta2 | float | 0.95 | Adam beta2 |
| training.eps | float | 1e-8 | Adam epsilon |
| training.max_length | int | 32768 | Max model context length |
| training.num_minibatches | int | 1 | Number of minibatches |
| training.default_local_dir | str | /tmp/rllm-tinker-checkpoints | Local checkpoint directory |
| training.resume_from_tinker_id | str \| null | null | Optional checkpoint/model ID to resume |
Validation, sampling, rollout, and data blocks
| Parameter | Type | Default | Description |
|---|---|---|---|
| validation.group_size | int | ??? | Rollouts per prompt for validation |
| sampling.train.temperature | float | 1.0 | Train sampling temperature |
| sampling.train.top_p | float | 1.0 | Train nucleus sampling threshold |
| sampling.train.top_k | int | -1 | Train top-k |
| sampling.val.temperature | float | 1.0 | Val sampling temperature |
| sampling.val.top_p | float | 1.0 | Val nucleus sampling threshold |
| sampling.val.top_k | int | -1 | Val top-k |
| rollout_engine.reasoning_effort | str | medium | Reasoning effort mode |
| rollout_engine.accumulate_reasoning | bool | false | Whether to accumulate reasoning across steps |
| rollout_engine.disable_thinking | bool | false | Whether to disable thinking tokens |
| rollout_engine.renderer_name | str \| null | null | Optional renderer name |
| data.max_prompt_length | int | 2048 | Max prompt length |
| data.max_response_length | int | 2048 | Max response length |
| data.train_batch_size | int | 64 | Train batch size |
| data.val_batch_size | int | 32 | Validation batch size |
Forwarding to common rllm.*
Tinker backend forwards group-size settings into backend-agnostic rollout config:
rllm.rollout.n <- training.group_size
rllm.rollout.n_val <- validation.group_size
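
As a rough illustration of this one-way forwarding, the sketch below copies the Tinker-native group sizes into the backend-agnostic rollout keys of a plain nested dict. The forward_group_sizes helper is hypothetical, and how tinker.yaml actually wires the forwarding (for example through YAML interpolation) is not shown here.

```python
def forward_group_sizes(cfg):
    """Copy Tinker-native group sizes into rllm.rollout.{n,n_val} (illustrative)."""
    rollout = cfg.setdefault("rllm", {}).setdefault("rollout", {})
    rollout["n"] = cfg["training"]["group_size"]        # rllm.rollout.n <- training.group_size
    rollout["n_val"] = cfg["validation"]["group_size"]  # rllm.rollout.n_val <- validation.group_size
    return cfg


cfg = {
    "training": {"group_size": 16},
    "validation": {"group_size": 4},
    "rllm": {"rollout": {"n": 8, "n_val": 1}},  # backend-agnostic defaults
}
print(forward_group_sizes(cfg)["rllm"]["rollout"])
```

The direction matters: the Tinker-native values overwrite the rllm.rollout values, never the reverse.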
Verl backend
Verl-specific settings live in rllm/trainer/config/rllm/backend/verl.yaml.
This file is intentionally thin and composes Verl’s native PPO config via:
defaults:
- /ppo_trainer
- _self_
For detailed semantics of Verl-native fields, see the Verl configuration docs.
In rLLM, verl.yaml does only two things:
- Sets a small number of required overrides for unified-trainer compatibility (e.g. actor_rollout_ref.rollout.mode=async, actor.use_rollout_log_probs=True).
- Pins one rllm-namespaced default that diverges from verl’s (rllm.algorithm.rollout_correction.bypass_mode=False).
Everything else — propagating values between the verl-native namespace and the rllm.* namespace — happens at runtime via sync_config in rllm/trainer/verl/utils.py.
Key fields in verl.yaml
| Parameter | Type | Default | Description |
|---|---|---|---|
| actor_rollout_ref.rollout.mode | str | async | Required mode for unified Verl backend |
| actor_rollout_ref.rollout.agent.num_workers | int | 0 | Agent worker count |
| actor_rollout_ref.rollout.calculate_log_probs | bool | True | Compute log-probs during rollout (needed for rollout-correction) |
| actor_rollout_ref.rollout.val_kwargs.do_sample | bool | True | Use sampling during validation |
| actor_rollout_ref.actor.use_rollout_log_probs | bool | True | Reuse rollout log-probs in the actor (bypass-mode default) |
| data.gen_batch_size | int | ${mul:...} | Generated batch size |
| data.return_multi_modal_inputs | bool | False | Include multimodal inputs in data path |
| rllm.backend | str | verl | Backend selector |
| rllm.algorithm.rollout_correction.bypass_mode | bool | False | Verl-side default differs from rLLM’s null; pinned here |
All other shared knobs (algorithm.adv_estimator, actor.kl_loss_coef, trainer.total_epochs, …) live in their natural locations — verl-native keys come from ppo_trainer.yaml, rllm-namespace keys from base.yaml — and are reconciled at startup by sync_config.
Bidirectional config sync
For a fixed table of “shared keys” (the same value, different paths in the two namespaces), sync_config mirrors the value between the verl-native side and the rllm.* side at trainer startup. New configs should use the rllm.* path. Existing Verl-style CLI overrides still work for backward compatibility, but they log a deprecation warning.
Per-key precedence, highest first:
1. rllm.* value explicitly set on the Hydra CLI
2. Verl-native value explicitly set on the Hydra CLI
3. rllm.* value from base.yaml (when non-null)
4. Verl-native value from ppo_trainer.yaml
Verl-native shared-key CLI overrides are deprecated. If you set a shared key on the Verl-native side, rLLM will still sync it for now and log a warning. If both sides set conflicting values, the rllm.* value wins and rLLM logs a conflict warning. Passing an extra yaml/config group to override shared keys is not a supported migration path; pass shared values as individual Hydra overrides instead.
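
The precedence and warning rules described above can be sketched as a single resolution function for one shared key. This is an illustration of the documented behavior, not the code of sync_config; the function name, its arguments, and the use of None to mean "not set" are assumptions.

```python
import logging

logger = logging.getLogger("sync_config_sketch")  # hypothetical logger name


def resolve_shared_key(rllm_cli=None, verl_cli=None, rllm_base=None, verl_default=None):
    """Pick the effective value for one shared key; None means 'not set'."""
    if verl_cli is not None:
        logger.warning("Verl-native shared-key CLI override is deprecated")
    if rllm_cli is not None:
        if verl_cli is not None and verl_cli != rllm_cli:
            logger.warning("conflicting values; the rllm.* value wins")
        return rllm_cli                   # 1. rllm.* set on the CLI
    if verl_cli is not None:
        return verl_cli                   # 2. Verl-native set on the CLI
    if rllm_base is not None:
        return rllm_base                  # 3. non-null rllm.* value from base.yaml
    return verl_default                   # 4. Verl-native value from ppo_trainer.yaml


# Example values only: both sides set on the CLI and in conflict.
print(resolve_shared_key(rllm_cli="rloo", verl_cli="grpo",
                         rllm_base="grpo", verl_default="gae"))
```

In the example the rllm.* CLI value is returned, and both a deprecation warning and a conflict warning are logged.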
The shared-keys table:
| Verl-native path | rllm.* path |
|---|---|
| algorithm.adv_estimator | rllm.algorithm.adv_estimator |
| algorithm.norm_adv_by_std_in_grpo | rllm.algorithm.norm_adv_by_std_in_grpo |
| algorithm.rollout_correction.bypass_mode | rllm.algorithm.rollout_correction.bypass_mode |
| algorithm.rollout_correction.rollout_is | rllm.algorithm.rollout_correction.tis_mode |
| algorithm.rollout_correction.rollout_is_threshold | rllm.algorithm.rollout_correction.tis_cap |
| actor_rollout_ref.actor.kl_loss_coef | rllm.algorithm.kl_beta |
| actor_rollout_ref.actor.policy_loss.loss_mode | rllm.algorithm.loss_fn |
| actor_rollout_ref.actor.loss_agg_mode | rllm.algorithm.loss_agg_mode |
| actor_rollout_ref.actor.clip_ratio_high | rllm.algorithm.eps_clip_high |
| actor_rollout_ref.rollout.n | rllm.rollout.n |
| actor_rollout_ref.rollout.val_kwargs.n | rllm.rollout.n_val |
| trainer.total_epochs | rllm.trainer.total_epochs |
| trainer.total_training_steps | rllm.trainer.total_batches |
| trainer.logger | rllm.trainer.logger |
| trainer.project_name | rllm.trainer.project_name |
| trainer.experiment_name | rllm.trainer.experiment_name |
| trainer.test_freq | rllm.trainer.test_freq |
| trainer.save_freq | rllm.trainer.save_freq |
| trainer.val_before_train | rllm.trainer.val_before_train |
| trainer.val_only | rllm.trainer.val_only |
Two extra rules sit alongside this table:
- actor.use_kl_loss is derived from kl_beta: if you do not explicitly set actor_rollout_ref.actor.use_kl_loss on the CLI, sync_config sets it to (kl_beta > 0). Setting rllm.algorithm.kl_beta also mirrors into actor_rollout_ref.actor.kl_loss_coef. Setting the Verl-native coefficient still backfills rllm.algorithm.kl_beta for now, with a deprecation warning.
- clip_ratio family: rllm.algorithm.eps_clip mirrors to actor_rollout_ref.actor.clip_ratio and actor_rollout_ref.actor.clip_ratio_low. If rllm.algorithm.eps_clip_high is set, it mirrors to actor_rollout_ref.actor.clip_ratio_high; otherwise the upper bound mirrors eps_clip. Verl-native clip_ratio, clip_ratio_low, and clip_ratio_high still backfill the rLLM values for now when the rLLM side is not set, with deprecation warnings.
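
A minimal sketch of the rllm-to-Verl direction of these two rules follows. It assumes kl_beta and eps_clip are numeric, uses None to mean "not set", and leaves out the deprecated backfill from the Verl-native side; the derive_actor_settings helper is hypothetical and is not part of sync_config.

```python
def derive_actor_settings(kl_beta, eps_clip, eps_clip_high=None, use_kl_loss_cli=None):
    """Map rllm.algorithm.* values onto actor_rollout_ref.actor.* keys (illustrative)."""
    if use_kl_loss_cli is not None:
        use_kl_loss = use_kl_loss_cli     # an explicit CLI value is respected
    else:
        use_kl_loss = kl_beta > 0         # otherwise derived from kl_beta
    return {
        "use_kl_loss": use_kl_loss,
        "kl_loss_coef": kl_beta,          # mirrored from rllm.algorithm.kl_beta
        "clip_ratio": eps_clip,
        "clip_ratio_low": eps_clip,
        # The upper bound falls back to eps_clip when eps_clip_high is not set.
        "clip_ratio_high": eps_clip if eps_clip_high is None else eps_clip_high,
    }


print(derive_actor_settings(kl_beta=0.01, eps_clip=0.2))
print(derive_actor_settings(kl_beta=0.0, eps_clip=0.2, eps_clip_high=0.28))
```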
The verl.yaml file also extends the Hydra search path so /ppo_trainer resolves:
rllm:
hydra.searchpath:
- pkg://verl.trainer.config
Config forwarding mechanism
The Verl backend uses bidirectional sync (described above): for any shared key, the preferred path is rllm.*, while legacy Verl-native shared-key CLI overrides still work with deprecation warnings. The Tinker backend does its own one-way forwarding from native group-size settings into rllm.rollout.{n,n_val} (see “Tinker backend” above).
Example: set the preferred shared knob
Set adv_estimator on the rllm.* side:
python train.py rllm/backend=verl rllm.algorithm.adv_estimator=rloo
A legacy Verl-native override still works for now, but logs a deprecation warning:
python train.py rllm/backend=verl algorithm.adv_estimator=rloo
If both are set and conflict, the rllm.* value wins.
Example: KL-in-loss
Setting rllm.algorithm.kl_beta=0.01 is enough — actor.kl_loss_coef is mirrored and actor.use_kl_loss is auto-set to True:
python train.py rllm/backend=verl rllm.algorithm.kl_beta=0.01
The legacy Verl-native equivalent still works for now, but logs a deprecation warning:
python train.py rllm/backend=verl actor_rollout_ref.actor.kl_loss_coef=0.01
Benefits
- Backward compatibility. Existing scripts that override on the verl-native side continue to work while logging deprecation warnings; the rllm-side value is mirrored automatically.
- Backend portability. New scripts can target the backend-agnostic rllm.* namespace and run on either Tinker or Verl with the same flags.
- Single source of truth at runtime. No oc.select interpolation in the yaml; the merged config that goes to verl’s workers and rLLM’s trainer holds the same values on both sides.
Configuration best practices
- Use rLLM configs for new projects: If starting from scratch, use the rLLM backend-agnostic configs for better portability across backends.
- Use rLLM paths for shared knobs: Existing Verl-native shared-key overrides still work for compatibility, but new configs should use rllm.*.
- Check the unified config: The unified.yaml file shows how all configs are combined and is useful for debugging configuration issues.
- Understand defaults hierarchy: Backend-specific configs override rLLM defaults, which in turn override Hydra’s base defaults.
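
The defaults hierarchy in the last point amounts to a layered merge in which later layers override earlier ones key by key. The sketch below shows that idea on plain dicts. Hydra performs the real composition; the deep_merge helper and the example values are illustrative assumptions.

```python
def deep_merge(base, override):
    """Return a new dict where values in `override` take precedence over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value                           # leaf value: override wins
    return merged


rllm_defaults = {"rllm": {"rollout": {"n": 8, "n_val": 1}}}
backend_overrides = {"rllm": {"rollout": {"n": 16}}}      # example backend-specific value
print(deep_merge(rllm_defaults, backend_overrides))
```

Keys that the override layer does not mention (n_val here) keep the value from the lower layer.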