Configuration

The rLLM framework provides a unified configuration system that separates backend-agnostic settings from backend-specific configurations. This design allows you to switch between different RL backends (Tinker, Verl) while maintaining consistent core training logic.

Configuration structure

The configuration system is organized into three main components:

rLLM backend-agnostic configs: Core training settings shared across all backends
Backend-specific configs: Settings specific to Tinker or Verl backends
Forwarding mechanism: Propagates values between backend-specific configs and rLLM configs so that existing backend-style overrides keep working

All configuration files are located in rllm/trainer/config/:

rllm/trainer/config/rllm/base.yaml: Backend-agnostic rLLM configurations
rllm/trainer/config/rllm/backend/tinker.yaml: Tinker-specific configurations
rllm/trainer/config/rllm/backend/verl.yaml: Verl-specific configurations
rllm/trainer/config/unified.yaml: Main entry point that combines all configs

rLLM backend-agnostic configurations

These configurations are defined in rllm/trainer/config/rllm/base.yaml and are shared by all backends.

Agent configuration

Settings for the agent that interacts with the environment.

Parameter	Type	Default	Description
`name`	`str`	`math_agent`	Name of the agent
`max_steps`	`int`	`20`	Maximum number of steps per trajectory
`trajectory_timeout`	`int | null`	`null`	Timeout for trajectory execution (seconds)
`overlong_filter`	`bool`	`False`	Whether to filter out overlong trajectories
`agent_args`	`dict`	`{}`	Additional agent-specific arguments
`engine_args`	`dict`	`{}`	Additional engine-specific arguments

Environment configuration

Settings for the environment where the agent operates.

Parameter	Type	Default	Description
`name`	`str`	`custom`	Name of the environment
`env_args`	`dict`	`{}`	Additional environment-specific arguments

Workflow configuration

Settings for workflow-based training (alternative to agent-based training).

Parameter	Type	Default	Description
`use_workflow`	`bool`	`False`	Whether to use workflow mode instead of agent mode
`name`	`str`	`single_turn_workflow`	Name of the workflow
`workflow_args.agent_cls`	`str | null`	`null`	Agent class to use in workflow
`workflow_args.agent_args`	`dict`	`{}`	Agent arguments in workflow
`workflow_args.env_cls`	`str | null`	`null`	Environment class to use in workflow
`workflow_args.env_args`	`dict`	`{}`	Environment arguments in workflow
`workflow_args.timeout`	`float`	`1e6`	Workflow execution timeout
`workflow_args.gamma`	`float`	`0.0`	Discount factor (0.0 = no discounting)
`workflow_args.reward_bonus_coeff`	`float`	`0.0`	Reward shaping coefficient
`n_parallel_tasks`	`int`	`256`	Number of parallel tasks to run
`retry_limit`	`int`	`3`	Maximum number of retries on failure
`raise_on_error`	`bool`	`True`	Whether to raise exceptions on errors
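
The interaction between `retry_limit` and `raise_on_error` can be pictured with the sketch below. It is one plausible reading of the two settings, not rLLM's actual control flow: the helper name `run_with_retries` is hypothetical, and it assumes one initial attempt plus up to `retry_limit` retries.

```python
def run_with_retries(task, retry_limit=3, raise_on_error=True):
    """Run task() once, then retry up to retry_limit times on failure (assumed semantics)."""
    last_error = None
    for _ in range(1 + retry_limit):
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
    if raise_on_error and last_error is not None:
        raise last_error
    return None  # failures are swallowed when raise_on_error is False


calls = {"n": 0}

def flaky():
    # Fails on the first two calls, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # ok
```

With `raise_on_error=False`, a task that never succeeds would yield `None` in this sketch instead of propagating the exception.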

Rollout configuration

Settings for trajectory rollouts during training and validation.

These settings are primarily for logging purposes. The actual rollout behavior is determined by backend-specific configurations.

Parameter	Type	Default	Description
`n`	`int`	`8`	Number of rollouts per prompt during training
`n_val`	`int`	`1`	Number of rollouts per prompt during validation

Trainer configuration

Core training loop settings.

Parameter	Type	Default	Description
`total_epochs`	`int`	`10`	Total number of training epochs
`total_batches`	`int`	`-1`	Total number of training batches (-1 = use epochs)
`logger`	`list[str]`	`['console']`	Logging backends (options: `console`, `wandb`, `tensorboard`)
`project_name`	`str`	`rllm-training`	Project name for logging
`experiment_name`	`str`	`default`	Experiment name for logging
`test_freq`	`int`	`5`	Frequency of validation (in epochs)
`save_freq`	`int`	`20`	Frequency of checkpoint saving (in epochs)
`val_before_train`	`bool`	`True`	Whether to run validation before training starts
`val_only`	`bool`	`False`	Whether to only run validation (no training)

Algorithm configuration

RL algorithm and advantage estimation settings.

Parameter	Type	Default	Description
`adv_estimator`	`str`	`grpo`	Advantage estimator (options: `grpo`, `reinforce`, `reinforce_plus_plus_baseline`, `rloo`)
`norm_adv_by_std_in_grpo`	`bool`	`True`	Whether to normalize advantages by standard deviation in GRPO
`loss_fn`	`str | null`	`null`	Loss function for Tinker backend (options: `importance_sampling`, `ppo`, `cispo`, `dro`, `cross_entropy`)
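
To illustrate what `norm_adv_by_std_in_grpo` toggles, the sketch below computes group-relative advantages for the rollouts of a single prompt. It follows the commonly described GRPO formulation (reward minus group mean, optionally divided by the group standard deviation). The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not rLLM's implementation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, norm_by_std=True, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts (illustrative only)."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not norm_by_std:
        return centered
    # eps keeps the division finite when every rollout got the same reward.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards, norm_by_std=False))
print(grpo_advantages(rewards, norm_by_std=True))
```

Without normalization the advantages keep the scale of the raw rewards; with it, groups with very different reward spreads contribute advantages of comparable magnitude.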

Stepwise advantage configuration

Settings for computing advantages at each step in multi-step trajectories.

Parameter	Type	Default	Description
`enable`	`bool`	`False`	Whether to enable stepwise advantage computation
`mode`	`str`	`broadcast`	Advantage computation mode (options: `broadcast`, `per_step`)
`normalize_by_steps`	`bool`	`False`	Whether to normalize advantages by number of steps

Trajectory processing flags

Top-level flags for trajectory processing and filtering.

Parameter	Type	Default	Description
`disable_thinking`	`bool`	`False`	Whether to disable thinking tokens in responses
`accumulate_reasoning`	`bool`	`False`	Whether to accumulate reasoning across steps
`mask_truncated_samples`	`bool`	`False`	Whether to mask trajectories that were truncated
`filter_token_mismatch`	`bool`	`True`	Whether to filter out trajectories with token mismatches

Compact filtering configuration

Fine-grained filtering of trajectories based on various termination conditions.

Parameter	Type	Default	Description
`enable`	`bool`	`False`	Whether to enable compact filtering
`mask_max_prompt_length_exceeded`	`bool`	`True`	Mask trajectories that exceed max prompt length
`mask_max_response_length_exceeded`	`bool`	`True`	Mask trajectories that exceed max response length
`mask_env_done`	`bool`	`False`	Mask trajectories where environment signaled done
`mask_max_turns_exceeded`	`bool`	`True`	Mask trajectories that exceed max turns
`mask_timeout`	`bool`	`True`	Mask trajectories that timed out
`mask_unknown`	`bool`	`False`	Mask trajectories with unknown termination reasons
`mask_error`	`bool`	`True`	Mask trajectories that encountered errors
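
The flags above can be read as a lookup from termination reason to a mask decision. The sketch below mirrors the table's field names and defaults. The `should_mask` helper and the convention that a reason string equals the flag name without its `mask_` prefix are hypothetical, so treat this as a mental model, not rLLM's code.

```python
from dataclasses import dataclass


@dataclass
class CompactFiltering:
    # Field names and defaults copied from the table above.
    enable: bool = False
    mask_max_prompt_length_exceeded: bool = True
    mask_max_response_length_exceeded: bool = True
    mask_env_done: bool = False
    mask_max_turns_exceeded: bool = True
    mask_timeout: bool = True
    mask_unknown: bool = False
    mask_error: bool = True


def should_mask(cfg: CompactFiltering, termination_reason: str) -> bool:
    """Return True if a trajectory ending for this reason would be masked.

    Assumes (hypothetically) that "timeout" maps to mask_timeout, and so on.
    """
    if not cfg.enable:
        return False
    return bool(getattr(cfg, f"mask_{termination_reason}", False))


cfg = CompactFiltering(enable=True)
print(should_mask(cfg, "timeout"))   # True
print(should_mask(cfg, "env_done"))  # False
```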

Rejection sampling configuration

Settings for rejection sampling to improve training data quality.

Parameter	Type	Default	Description
`enable`	`bool`	`False`	Whether to enable rejection sampling
`multiplier`	`int`	`1`	Multiplier for number of rollouts to generate
`min_partial_solve_tasks`	`int`	`1`	Minimum number of tasks that must be partially solved
`min_trajs_per_group`	`int`	`2`	Minimum number of trajectories per group to keep

SDK configuration

Settings for the rLLM SDK, including trace storage and proxy server.

Parameter	Type	Default	Description
`store.path`	`str`	`~/.rllm/traces.db`	Path to trace database
`processing.groupby_key`	`str | null`	`null`	Key to group trajectories by
`processing.traj_name_key`	`str | null`	`null`	Key to use as trajectory name
`proxy.host`	`str`	`127.0.0.1`	Proxy server host
`proxy.port`	`int`	`4000`	Proxy server port
`proxy.mode`	`str`	`subprocess`	Proxy mode (options: `subprocess`, `external`)
`proxy.admin_token`	`str`	`my-shared-secret`	Admin token for proxy authentication

Episode logging configuration

Settings for logging full episode trajectories to disk.

Parameter	Type	Default	Description
`log_episodes`	`bool`	`false`	Whether to log full episodes to disk
`episode_log_dir`	`str`	`logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name}`	Directory for episode logs
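
The default `episode_log_dir` uses `${...}` interpolation, so it follows the project and experiment names. In a real run OmegaConf resolves these references. The standard-library sketch below only mimics that lookup to show what the default expands to, and the `resolve` helper is hypothetical.

```python
import re


def resolve(template: str, config: dict) -> str:
    """Expand ${dotted.path} references against a nested dict (simplified)."""
    def lookup(match):
        node = config
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


# Defaults taken from the trainer configuration table.
config = {"rllm": {"trainer": {"project_name": "rllm-training", "experiment_name": "default"}}}
template = "logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name}"
print(resolve(template, config))  # logs/rllm-training/default
```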

Backend-specific configurations

Tinker backend

Tinker-specific settings live in rllm/trainer/config/rllm/backend/tinker.yaml. This file contains:

Tinker service and execution settings
Model/LoRA training settings
Sampling and rollout-engine settings
Tinker-native training/data blocks
Forwarding into rllm.* common config keys

Top-level Tinker-specific keys

Parameter	Type	Default	Description
`tinker_base_url`	`str | null`	`null`	Tinker service URL (`null` for local/default)
`fuse_forward_backward_and_optim_step`	`bool`	`false`	Whether the backend fuses the forward-backward pass with the optimizer step

Model block

Parameter	Type	Default	Description
`model.name`	`str`	`Qwen/Qwen3-8B`	Base model name
`model.lora_rank`	`int`	`32`	LoRA rank
`model.train_unembed`	`bool`	`true`	Train LoRA on output embedding layer
`model.train_attn`	`bool`	`true`	Train LoRA on attention layers
`model.train_mlp`	`bool`	`true`	Train LoRA on MLP layers

Training block (Tinker-native)

Parameter	Type	Default	Description
`training.group_size`	`int`	`???`	Number of rollouts per prompt (`???` marks a required value with no default)
`training.learning_rate`	`float`	`2e-5`	Learning rate
`training.lr_schedule`	`str`	`constant`	LR schedule (`constant`, `linear`, `cosine`)
`training.warmup_steps_ratio`	`float`	`0.0`	Warmup ratio in `[0, 1]`
`training.beta1`	`float`	`0.9`	Adam beta1
`training.beta2`	`float`	`0.95`	Adam beta2
`training.eps`	`float`	`1e-8`	Adam epsilon
`training.max_length`	`int`	`32768`	Max model context length
`training.num_minibatches`	`int`	`1`	Number of minibatches
`training.default_local_dir`	`str`	`/tmp/rllm-tinker-checkpoints`	Local checkpoint directory
`training.resume_from_tinker_id`	`str | null`	`null`	Optional checkpoint/model ID to resume

Validation, sampling, rollout, and data blocks

Parameter	Type	Default	Description
`validation.group_size`	`int`	`???`	Rollouts per prompt for validation (`???` marks a required value with no default)
`sampling.train.temperature`	`float`	`1.0`	Train sampling temperature
`sampling.train.top_p`	`float`	`1.0`	Train nucleus sampling threshold
`sampling.train.top_k`	`int`	`-1`	Train top-k
`sampling.val.temperature`	`float`	`1.0`	Val sampling temperature
`sampling.val.top_p`	`float`	`1.0`	Val nucleus sampling threshold
`sampling.val.top_k`	`int`	`-1`	Val top-k
`rollout_engine.reasoning_effort`	`str`	`medium`	Reasoning effort mode
`rollout_engine.accumulate_reasoning`	`bool`	`false`	Whether to accumulate reasoning across steps
`rollout_engine.disable_thinking`	`bool`	`false`	Whether to disable thinking tokens
`rollout_engine.renderer_name`	`str | null`	`null`	Optional renderer name
`data.max_prompt_length`	`int`	`2048`	Max prompt length
`data.max_response_length`	`int`	`2048`	Max response length
`data.train_batch_size`	`int`	`64`	Train batch size
`data.val_batch_size`	`int`	`32`	Validation batch size

Forwarding to common `rllm.*`

The Tinker backend forwards its group-size settings into the backend-agnostic rollout config:

rllm.rollout.n <- training.group_size
rllm.rollout.n_val <- validation.group_size
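
In effect, the Tinker-native values overwrite whatever `rllm.rollout` held before. The sketch below models that one-way copy on plain dictionaries. The `forward_group_sizes` helper and the sample values are hypothetical, and the real forwarding is done inside the config files, not by a function like this.

```python
def forward_group_sizes(cfg: dict) -> dict:
    """Copy Tinker group sizes into the backend-agnostic rollout block (illustrative)."""
    rollout = cfg.setdefault("rllm", {}).setdefault("rollout", {})
    rollout["n"] = cfg["training"]["group_size"]
    rollout["n_val"] = cfg["validation"]["group_size"]
    return cfg


cfg = {
    "training": {"group_size": 16},
    "validation": {"group_size": 4},
    "rllm": {"rollout": {"n": 8, "n_val": 1}},  # defaults from base.yaml
}
print(forward_group_sizes(cfg)["rllm"]["rollout"])  # {'n': 16, 'n_val': 4}
```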

Verl backend

Verl-specific settings live in rllm/trainer/config/rllm/backend/verl.yaml. This file is intentionally thin and composes Verl’s native PPO config via:

defaults:
  - /ppo_trainer
  - _self_

For detailed semantics of Verl-native fields, see the Verl configuration docs. Beyond composing Verl's PPO config, verl.yaml sets only two kinds of values:

A small number of required overrides for unified-trainer compatibility (e.g. actor_rollout_ref.rollout.mode=async, actor_rollout_ref.actor.use_rollout_log_probs=True).
One rllm-namespaced default that diverges from Verl's (rllm.algorithm.rollout_correction.bypass_mode=False).

Everything else, namely propagating values between the Verl-native namespace and the rllm.* namespace, happens at runtime via sync_config in rllm/trainer/verl/utils.py.

Key fields in `verl.yaml`

Parameter	Type	Default	Description
`actor_rollout_ref.rollout.mode`	`str`	`async`	Required mode for unified Verl backend
`actor_rollout_ref.rollout.agent.num_workers`	`int`	`0`	Agent worker count
`actor_rollout_ref.rollout.calculate_log_probs`	`bool`	`True`	Compute log-probs during rollout (needed for rollout-correction)
`actor_rollout_ref.rollout.val_kwargs.do_sample`	`bool`	`True`	Use sampling during validation
`actor_rollout_ref.actor.use_rollout_log_probs`	`bool`	`True`	Reuse rollout log-probs in the actor (bypass-mode default)
`data.gen_batch_size`	`int`	`${mul:...}`	Generated batch size
`data.return_multi_modal_inputs`	`bool`	`False`	Include multimodal inputs in data path
`rllm.backend`	`str`	`verl`	Backend selector
`rllm.algorithm.rollout_correction.bypass_mode`	`bool`	`False`	Verl-side default differs from rLLM’s `null`; pinned here

All other shared knobs (algorithm.adv_estimator, actor_rollout_ref.actor.kl_loss_coef, trainer.total_epochs, …) live in their natural locations: Verl-native keys come from ppo_trainer.yaml and rllm-namespace keys from base.yaml. sync_config reconciles the two at startup.

Bidirectional config sync

For a fixed table of “shared keys” (the same value at different paths in the two namespaces), sync_config mirrors the value between the Verl-native side and the rllm.* side at trainer startup. New configs should use the rllm.* path. Existing Verl-style CLI overrides still work for backward compatibility, but they log a deprecation warning. Per-key precedence, from highest to lowest:

rllm.* value explicitly set on the Hydra CLI
Verl-native value explicitly set on the Hydra CLI
rllm.* value from base.yaml (when non-null)
Verl-native value from ppo_trainer.yaml

Verl-native shared-key CLI overrides are deprecated. If you set a shared key on the Verl-native side, rLLM will still sync it for now and log a warning. If both sides set conflicting values, the rllm.* value wins and rLLM logs a conflict warning. Passing an extra yaml/config group to override shared keys is not a supported migration path; pass shared values as individual Hydra overrides instead.
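
The four-level precedence and the two warnings can be summarized in a few lines. The sketch below is a simplified model of the rules as described here, not the actual sync_config code: the function name, the warning strings, and the use of `None` for "not set" are assumptions.

```python
def resolve_shared_key(rllm_cli=None, verl_cli=None, rllm_base=None, verl_base=None):
    """Pick one shared key's value by precedence; None means "not set" at that level.

    Returns (value, warnings).
    """
    warnings = []
    if verl_cli is not None:
        warnings.append("deprecated: Verl-native shared-key CLI override")
    if rllm_cli is not None:
        if verl_cli is not None and verl_cli != rllm_cli:
            warnings.append("conflict: rllm.* value wins")
        return rllm_cli, warnings
    if verl_cli is not None:
        return verl_cli, warnings
    if rllm_base is not None:  # base.yaml only counts when non-null
        return rllm_base, warnings
    return verl_base, warnings


value, notes = resolve_shared_key(rllm_cli="rloo", verl_cli="grpo", rllm_base="grpo")
print(value)  # rloo
print(notes)
```

In this model a legacy Verl-native override alone still takes effect, but always carries the deprecation note.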

The shared-keys table:

Verl-native path	`rllm.*` path
`algorithm.adv_estimator`	`rllm.algorithm.adv_estimator`
`algorithm.norm_adv_by_std_in_grpo`	`rllm.algorithm.norm_adv_by_std_in_grpo`
`algorithm.rollout_correction.bypass_mode`	`rllm.algorithm.rollout_correction.bypass_mode`
`algorithm.rollout_correction.rollout_is`	`rllm.algorithm.rollout_correction.tis_mode`
`algorithm.rollout_correction.rollout_is_threshold`	`rllm.algorithm.rollout_correction.tis_cap`
`actor_rollout_ref.actor.kl_loss_coef`	`rllm.algorithm.kl_beta`
`actor_rollout_ref.actor.policy_loss.loss_mode`	`rllm.algorithm.loss_fn`
`actor_rollout_ref.actor.loss_agg_mode`	`rllm.algorithm.loss_agg_mode`
`actor_rollout_ref.actor.clip_ratio_high`	`rllm.algorithm.eps_clip_high`
`actor_rollout_ref.rollout.n`	`rllm.rollout.n`
`actor_rollout_ref.rollout.val_kwargs.n`	`rllm.rollout.n_val`
`trainer.total_epochs`	`rllm.trainer.total_epochs`
`trainer.total_training_steps`	`rllm.trainer.total_batches`
`trainer.logger`	`rllm.trainer.logger`
`trainer.project_name`	`rllm.trainer.project_name`
`trainer.experiment_name`	`rllm.trainer.experiment_name`
`trainer.test_freq`	`rllm.trainer.test_freq`
`trainer.save_freq`	`rllm.trainer.save_freq`
`trainer.val_before_train`	`rllm.trainer.val_before_train`
`trainer.val_only`	`rllm.trainer.val_only`
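
Because the two sides often use different names for the same setting (for example `trainer.total_training_steps` and `rllm.trainer.total_batches`), the table is essentially a path-to-path mapping. The sketch below mirrors an excerpt of it in one direction on nested dictionaries. The helper names are hypothetical, and the real sync_config works in both directions and applies the precedence rules described earlier.

```python
# Verl-native path -> rllm.* path (excerpt of the table above)
SHARED_KEYS = {
    "algorithm.adv_estimator": "rllm.algorithm.adv_estimator",
    "actor_rollout_ref.rollout.n": "rllm.rollout.n",
    "trainer.total_training_steps": "rllm.trainer.total_batches",
}


def get_path(cfg, path):
    """Read a value at a dotted path from a nested dict."""
    node = cfg
    for part in path.split("."):
        node = node[part]
    return node


def set_path(cfg, path, value):
    """Write a value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    node = cfg
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def mirror_rllm_to_verl(cfg):
    """Copy each rllm.* value onto its Verl-native counterpart (one direction only)."""
    for verl_path, rllm_path in SHARED_KEYS.items():
        set_path(cfg, verl_path, get_path(cfg, rllm_path))
    return cfg


cfg = {"rllm": {"algorithm": {"adv_estimator": "rloo"},
                "rollout": {"n": 8},
                "trainer": {"total_batches": -1}}}
mirror_rllm_to_verl(cfg)
print(cfg["algorithm"]["adv_estimator"], cfg["actor_rollout_ref"]["rollout"]["n"])  # rloo 8
```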

Two extra rules sit alongside this table:

actor.use_kl_loss is derived from kl_beta: if you do not explicitly set actor_rollout_ref.actor.use_kl_loss on the CLI, sync_config sets it to (kl_beta > 0). Setting rllm.algorithm.kl_beta also mirrors into actor_rollout_ref.actor.kl_loss_coef. Setting the Verl-native coefficient still backfills rllm.algorithm.kl_beta for now, with a deprecation warning.
clip_ratio family. rllm.algorithm.eps_clip mirrors to actor_rollout_ref.actor.clip_ratio and actor_rollout_ref.actor.clip_ratio_low. If rllm.algorithm.eps_clip_high is set, it mirrors to actor_rollout_ref.actor.clip_ratio_high; otherwise the upper bound mirrors eps_clip. Verl-native clip_ratio, clip_ratio_low, and clip_ratio_high still backfill the rLLM values for now when the rLLM side is not set, with deprecation warnings.
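
The two rules above amount to a small derivation from the rLLM-side values. The sketch below covers only the rllm.* to Verl-native direction and omits the legacy backfill. The function and argument names are hypothetical, and the returned keys stand for the fields under actor_rollout_ref.actor.

```python
def derive_actor_settings(kl_beta, eps_clip, eps_clip_high=None, use_kl_loss_cli=None):
    """Derive Verl actor fields from rllm.algorithm values, following the two rules above."""
    return {
        "kl_loss_coef": kl_beta,
        # An explicit CLI value for use_kl_loss is left untouched.
        "use_kl_loss": use_kl_loss_cli if use_kl_loss_cli is not None else kl_beta > 0,
        "clip_ratio": eps_clip,
        "clip_ratio_low": eps_clip,
        # Without eps_clip_high, the upper bound falls back to eps_clip.
        "clip_ratio_high": eps_clip_high if eps_clip_high is not None else eps_clip,
    }


settings = derive_actor_settings(kl_beta=0.01, eps_clip=0.2)
print(settings["use_kl_loss"], settings["clip_ratio_high"])  # True 0.2
```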

In addition, verl.yaml extends the Hydra search path so that /ppo_trainer resolves:

rllm:
  hydra.searchpath:
    - pkg://verl.trainer.config

Config forwarding mechanism

The Verl backend uses bidirectional sync (described above): for any shared key, the preferred path is rllm.*, while legacy Verl-native shared-key CLI overrides still work with deprecation warnings. The Tinker backend does its own one-way forwarding from native group-size settings into rllm.rollout.{n,n_val} (see “Tinker backend” above).

Example: set the preferred shared knob

Set adv_estimator on the rllm.* side:

python train.py rllm/backend=verl rllm.algorithm.adv_estimator=rloo

A legacy Verl-native override still works for now, but logs a deprecation warning:

python train.py rllm/backend=verl algorithm.adv_estimator=rloo

If both are set and conflict, the rllm.* value wins.
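
To see why the two commands above touch different parts of the config, it helps to picture how dotted overrides map onto a nested structure. The sketch below is a deliberately simplified model: Hydra's real override grammar also handles typing, lists, and prefixes, and the `parse_overrides` helper is hypothetical.

```python
def parse_overrides(args):
    """Turn `a.b.c=value` strings into a nested dict; values stay strings."""
    groups, cfg = {}, {}
    for arg in args:
        key, _, value = arg.partition("=")
        if "/" in key:
            groups[key] = value  # config-group selection such as rllm/backend=verl
            continue
        *parents, leaf = key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return groups, cfg


groups, cfg = parse_overrides(["rllm/backend=verl", "rllm.algorithm.adv_estimator=rloo"])
print(groups)  # {'rllm/backend': 'verl'}
print(cfg)     # {'rllm': {'algorithm': {'adv_estimator': 'rloo'}}}
```

The legacy form `algorithm.adv_estimator=rloo` would land under a top-level `algorithm` key instead, which is the Verl-native side that the sync step then reconciles.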

Example: KL-in-loss

Setting rllm.algorithm.kl_beta=0.01 is sufficient. sync_config mirrors the value into actor_rollout_ref.actor.kl_loss_coef and, because kl_beta > 0, sets actor_rollout_ref.actor.use_kl_loss to True:

python train.py rllm/backend=verl rllm.algorithm.kl_beta=0.01

The legacy Verl-native equivalent still works for now, but logs a deprecation warning:

python train.py rllm/backend=verl actor_rollout_ref.actor.kl_loss_coef=0.01

Benefits

Backward compatibility. Existing scripts that override on the verl-native side continue to work while logging deprecation warnings; the rllm-side value is mirrored automatically.
Backend portability. New scripts can target the backend-agnostic rllm.* namespace and run on either Tinker or Verl with the same flags.
Single source of truth at runtime. The yaml files contain no oc.select interpolation; after sync, the merged config passed to Verl's workers and to rLLM's trainer holds the same values on both sides.

Configuration best practices

Use rLLM configs for new projects: If starting from scratch, use the rLLM backend-agnostic configs for better portability across backends.
Use rLLM paths for shared knobs: Existing Verl-native shared-key overrides still work for compatibility, but new configs should use rllm.*.
Check the unified config: The unified.yaml file shows how all configs are combined and is useful for debugging configuration issues.
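Understand the defaults hierarchy: Backend-specific configs override rLLM defaults, which in turn override Hydra’s base defaults. Shared keys on the Verl backend follow the per-key precedence listed under “Bidirectional config sync”.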
