Configuration

The rLLM framework provides a unified configuration system that separates backend-agnostic settings from backend-specific configurations. This design allows you to switch between different RL backends (Tinker, Verl) while maintaining consistent core training logic.

Configuration structure

The configuration system is organized into three main components:

rLLM backend-agnostic configs: Core training settings shared across all backends
Backend-specific configs: Settings specific to Tinker or Verl backends
Forwarding mechanism: Allows backend-specific configs to override rLLM configs for backward compatibility

All configuration files are located in rllm/experimental/config/:

rllm/experimental/config/rllm/base.yaml: Backend-agnostic rLLM configurations
rllm/experimental/config/rllm/backend/tinker.yaml: Tinker-specific configurations
rllm/experimental/config/rllm/backend/verl.yaml: Verl-specific configurations
rllm/experimental/config/unified.yaml: Main entry point that combines all configs

rLLM backend-agnostic configurations

These configurations are defined in rllm/base.yaml and are used across different backends.

Agent configuration

Settings for the agent that interacts with the environment.

Parameter	Type	Default	Description
`name`	`str`	`math_agent`	Name of the agent
`max_steps`	`int`	`20`	Maximum number of steps per trajectory
`trajectory_timeout`	`int \| null`	`null`	Timeout for trajectory execution (seconds)
`overlong_filter`	`bool`	`False`	Whether to filter out overlong trajectories
`agent_args`	`dict`	`{}`	Additional agent-specific arguments
`engine_args`	`dict`	`{}`	Additional engine-specific arguments

Environment configuration

Settings for the environment where the agent operates.

Parameter	Type	Default	Description
`name`	`str`	`custom`	Name of the environment
`env_args`	`dict`	`{}`	Additional environment-specific arguments

Workflow configuration

Settings for workflow-based training (alternative to agent-based training).

Parameter	Type	Default	Description
`use_workflow`	`bool`	`False`	Whether to use workflow mode instead of agent mode
`name`	`str`	`single_turn_workflow`	Name of the workflow
`workflow_args.agent_cls`	`str \| null`	`null`	Agent class to use in workflow
`workflow_args.agent_args`	`dict`	`{}`	Agent arguments in workflow
`workflow_args.env_cls`	`str \| null`	`null`	Environment class to use in workflow
`workflow_args.env_args`	`dict`	`{}`	Environment arguments in workflow
`workflow_args.timeout`	`float`	`1e6`	Workflow execution timeout
`workflow_args.gamma`	`float`	`0.0`	Discount factor (0.0 = no discounting)
`workflow_args.reward_bonus_coeff`	`float`	`0.0`	Reward shaping coefficient
`n_parallel_tasks`	`int`	`256`	Number of parallel tasks to run
`retry_limit`	`int`	`3`	Maximum number of retries on failure
`raise_on_error`	`bool`	`True`	Whether to raise exceptions on errors

Rollout configuration

Settings for trajectory rollouts during training and validation.

These settings are primarily for logging purposes. The actual rollout behavior is determined by backend-specific configurations.

Parameter	Type	Default	Description
`n`	`int`	`8`	Number of rollouts per prompt during training
`n_val`	`int`	`1`	Number of rollouts per prompt during validation

Trainer configuration

Core training loop settings.

Parameter	Type	Default	Description
`total_epochs`	`int`	`10`	Total number of training epochs
`total_batches`	`int`	`-1`	Total number of training batches (-1 = use epochs)
`logger`	`list[str]`	`['console']`	Logging backends (options: `console`, `wandb`, `tensorboard`)
`project_name`	`str`	`rllm-training`	Project name for logging
`experiment_name`	`str`	`default`	Experiment name for logging
`test_freq`	`int`	`5`	Frequency of validation (in epochs)
`save_freq`	`int`	`20`	Frequency of checkpoint saving (in epochs)
`val_before_train`	`bool`	`True`	Whether to run validation before training starts
`val_only`	`bool`	`False`	Whether to only run validation (no training)

Algorithm configuration

RL algorithm and advantage estimation settings.

Parameter	Type	Default	Description
`adv_estimator`	`str`	`grpo`	Advantage estimator (options: `grpo`, `reinforce`, `reinforce_plus_plus_baseline`, `rloo`, `gae`)
`gamma`	`float`	`1.0`	Discount factor for future rewards
`lam`	`float`	`0.95`	Lambda for GAE (Generalized Advantage Estimation)
`norm_adv_by_std_in_grpo`	`bool`	`True`	Whether to normalize advantages by standard deviation in GRPO
`use_rllm`	`bool`	`False`	Whether to use rLLM-specific features
`loss_fn`	`str \| null`	`null`	Loss function for Tinker backend (options: `importance_sampling`, `ppo`, `cispo`, `dro`, `cross_entropy`)

Stepwise advantage configuration

Settings for computing advantages at each step in multi-step trajectories.

Parameter	Type	Default	Description
`enable`	`bool`	`False`	Whether to enable stepwise advantage computation
`mode`	`str`	`broadcast`	Advantage computation mode (options: `broadcast`, `per_step`)
`normalize_by_steps`	`bool`	`False`	Whether to normalize advantages by number of steps

Trajectory processing flags

Top-level flags for trajectory processing and filtering.

Parameter	Type	Default	Description
`disable_thinking`	`bool`	`False`	Whether to disable thinking tokens in responses
`accumulate_reasoning`	`bool`	`False`	Whether to accumulate reasoning across steps
`mask_truncated_samples`	`bool`	`False`	Whether to mask trajectories that were truncated
`filter_token_mismatch`	`bool`	`True`	Whether to filter out trajectories with token mismatches

Compact filtering configuration

Fine-grained filtering of trajectories based on various termination conditions.

Parameter	Type	Default	Description
`enable`	`bool`	`False`	Whether to enable compact filtering
`mask_max_prompt_length_exceeded`	`bool`	`True`	Mask trajectories that exceed max prompt length
`mask_max_response_length_exceeded`	`bool`	`True`	Mask trajectories that exceed max response length
`mask_env_done`	`bool`	`False`	Mask trajectories where environment signaled done
`mask_max_turns_exceeded`	`bool`	`True`	Mask trajectories that exceed max turns
`mask_timeout`	`bool`	`True`	Mask trajectories that timed out
`mask_unknown`	`bool`	`False`	Mask trajectories with unknown termination reasons
`mask_error`	`bool`	`True`	Mask trajectories that encountered errors

Rejection sampling configuration

Settings for rejection sampling to improve training data quality.

Parameter	Type	Default	Description
`enable`	`bool`	`False`	Whether to enable rejection sampling
`multiplier`	`int`	`1`	Multiplier for number of rollouts to generate
`min_partial_solve_tasks`	`int`	`1`	Minimum number of tasks that must be partially solved
`min_trajs_per_group`	`int`	`2`	Minimum number of trajectories per group to keep

SDK configuration

Settings for the rLLM SDK, including trace storage and proxy server.

Parameter	Type	Default	Description
`store.path`	`str`	`~/.rllm/traces.db`	Path to trace database
`processing.groupby_key`	`str \| null`	`null`	Key to group trajectories by
`processing.traj_name_key`	`str \| null`	`null`	Key to use as trajectory name
`proxy.host`	`str`	`127.0.0.1`	Proxy server host
`proxy.port`	`int`	`4000`	Proxy server port
`proxy.mode`	`str`	`subprocess`	Proxy mode (options: `subprocess`, `external`)
`proxy.admin_token`	`str`	`my-shared-secret`	Admin token for proxy authentication

Episode logging configuration

Settings for logging full episode trajectories to disk.

Parameter	Type	Default	Description
`log_episodes`	`bool`	`false`	Whether to log full episodes to disk
`episode_log_dir`	`str`	`logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name}`	Directory for episode logs

Backend-specific configurations

Tinker backend

Tinker-specific settings live in rllm/experimental/config/rllm/backend/tinker.yaml. This file contains:

Tinker service and execution settings
Model/LoRA training settings
Sampling and rollout-engine settings
Tinker-native training/data blocks
Forwarding into rllm.* common config keys

Top-level Tinker-specific keys

Parameter	Type	Default	Description
`tinker_base_url`	`str \| null`	`null`	Tinker service URL (`null` for local/default)
`fuse_forward_backward_and_optim_step`	`bool`	`false`	Whether to fuse train-step internals in backend

Model block

Parameter	Type	Default	Description
`model.name`	`str`	`Qwen/Qwen3-8B`	Base model name
`model.lora_rank`	`int`	`32`	LoRA rank
`model.train_unembed`	`bool`	`true`	Train LoRA on output embedding layer
`model.train_attn`	`bool`	`true`	Train LoRA on attention layers
`model.train_mlp`	`bool`	`true`	Train LoRA on MLP layers

Training block (Tinker-native)

Parameter	Type	Default	Description
`training.group_size`	`int`	`???`	Number of rollouts per prompt
`training.learning_rate`	`float`	`2e-5`	Learning rate
`training.lr_schedule`	`str`	`constant`	LR schedule (`constant`, `linear`, `cosine`)
`training.warmup_steps_ratio`	`float`	`0.0`	Warmup ratio in `[0, 1]`
`training.beta1`	`float`	`0.9`	Adam beta1
`training.beta2`	`float`	`0.95`	Adam beta2
`training.eps`	`float`	`1e-8`	Adam epsilon
`training.max_length`	`int`	`32768`	Max model context length
`training.num_minibatches`	`int`	`1`	Number of minibatches
`training.default_local_dir`	`str`	`/tmp/rllm-tinker-checkpoints`	Local checkpoint directory
`training.resume_from_tinker_id`	`str \| null`	`null`	Optional checkpoint/model ID to resume

Validation, sampling, rollout, and data blocks

Parameter	Type	Default	Description
`validation.group_size`	`int`	`???`	Rollouts per prompt for validation
`sampling.train.temperature`	`float`	`1.0`	Train sampling temperature
`sampling.train.top_p`	`float`	`1.0`	Train nucleus sampling threshold
`sampling.train.top_k`	`int`	`-1`	Train top-k
`sampling.val.temperature`	`float`	`1.0`	Val sampling temperature
`sampling.val.top_p`	`float`	`1.0`	Val nucleus sampling threshold
`sampling.val.top_k`	`int`	`-1`	Val top-k
`rollout_engine.reasoning_effort`	`str`	`medium`	Reasoning effort mode
`rollout_engine.accumulate_reasoning`	`bool`	`false`	Whether to accumulate reasoning across steps
`rollout_engine.disable_thinking`	`bool`	`false`	Whether to disable thinking tokens
`rollout_engine.renderer_name`	`str \| null`	`null`	Optional renderer name
`data.max_prompt_length`	`int`	`2048`	Max prompt length
`data.max_response_length`	`int`	`2048`	Max response length
`data.train_batch_size`	`int`	`64`	Train batch size
`data.val_batch_size`	`int`	`32`	Validation batch size

OPSD block

Parameter	Type	Default	Description
`opsd.kl_penalty_coef`	`float`	`1.0`	KL penalty coefficient
`opsd.kl_discount_factor`	`float`	`0.0`	KL discount factor
`opsd.teacher_messages_key`	`str`	`teacher_messages`	Key for teacher messages
`opsd.teacher_policy_update_freq`	`int`	`-1`	Teacher refresh frequency (`-1` = initial teacher only)

Forwarding to common `rllm.*`

Tinker backend forwards group-size settings into backend-agnostic rollout config:

rllm.rollout.n <- training.group_size
rllm.rollout.n_val <- validation.group_size

Verl backend

Verl-specific settings live in rllm/experimental/config/rllm/backend/verl.yaml. This file is intentionally thin and composes Verl’s native PPO config via:

defaults:
  - /ppo_trainer
  - _self_

For detailed semantics of Verl-native fields, see the Verl configuration docs. In rLLM, the verl.yaml mainly does three things:

Sets a few required overrides for unified-trainer compatibility
Marks selected native fields as required (???) so they can be provided by user/config composition
Forwards native Verl fields into rllm.* common config for backward compatibility

Key fields in `verl.yaml`

Parameter	Type	Default	Description
`actor_rollout_ref.rollout.mode`	`str`	`async`	Required mode for unified Verl backend
`actor_rollout_ref.rollout.agent.num_workers`	`int`	`0`	Agent worker count
`actor_rollout_ref.rollout.val_kwargs.n`	`int`	`???`	Validation rollout count
`data.gen_batch_size`	`int`	`${mul:...}`	Generated batch size
`data.return_multi_modal_inputs`	`bool`	`False`	Include multimodal inputs in data path
`algorithm.adv_estimator`	`str`	`???`	Native algorithm estimator
`algorithm.gamma`	`float`	`???`	Native discount factor
`algorithm.lam`	`float`	`???`	Native lambda
`algorithm.norm_adv_by_std_in_grpo`	`bool`	`???`	Native GRPO normalization
`trainer.total_epochs`	`int`	`???`	Native epoch count
`trainer.total_training_steps`	`int`	`???`	Native total-step budget
`trainer.logger`	`list[str]`	`???`	Native logger list
`trainer.project_name`	`str`	`???`	Project name
`trainer.experiment_name`	`str`	`???`	Experiment name
`trainer.test_freq`	`int`	`???`	Validation cadence
`trainer.save_freq`	`int`	`???`	Save cadence
`trainer.val_before_train`	`bool`	`???`	Validate before training
`trainer.val_only`	`bool`	`???`	Validation-only mode

Forwarding to common `rllm.*`

Verl backend forwards the following:

rllm.algorithm.{adv_estimator,gamma,lam,norm_adv_by_std_in_grpo}
rllm.rollout.{n,n_val}
rllm.trainer.{total_epochs,total_batches,logger,project_name,experiment_name,test_freq,save_freq,val_before_train,val_only}

It also extends Hydra search path with Verl configs:

rllm:
  hydra.searchpath:
    - pkg://verl.trainer.config

Config forwarding mechanism

The rLLM configuration system supports a forwarding mechanism that allows users familiar with a specific backend (Tinker or Verl) to specify configurations in their native format. These backend-specific configs are then automatically forwarded to the corresponding rLLM configs for backward compatibility.

How it works

Backend-specific config files can override rLLM settings using Hydra’s oc.select resolver. This mechanism:

First checks if a backend-specific config value is provided
If provided, uses that value to populate the rLLM config
If not provided, falls back to the rLLM default value

Example: Verl backend forwarding

In rllm/experimental/config/rllm/backend/verl.yaml, Verl’s native trainer configuration is forwarded to rLLM:

# In Verl's native config format
trainer:
  total_epochs: 15
  project_name: 'my-verl-project'
  experiment_name: 'verl-experiment-1'

# These are automatically forwarded to rLLM configs
rllm:
  trainer:
    total_epochs: ${oc.select:trainer.total_epochs, 10}  # Uses 15 from above
    project_name: ${oc.select:trainer.project_name, 'rllm-training'}  # Uses 'my-verl-project'
    experiment_name: ${oc.select:trainer.experiment_name, 'default'}  # Uses 'verl-experiment-1'

In this example:

Users can specify trainer.total_epochs in Verl’s native format
The value is automatically forwarded to rllm.trainer.total_epochs
If the Verl config is not specified, the rLLM default (10) is used

Example: Algorithm configuration forwarding

Similarly, algorithm configurations can be forwarded:

# Backend-specific algorithm config
algorithm:
  adv_estimator: gae
  gamma: 0.99
  lam: 0.95

# Forwarded to rLLM
rllm:
  algorithm:
    adv_estimator: ${oc.select:algorithm.adv_estimator, grpo}  # Uses 'gae'
    gamma: ${oc.select:algorithm.gamma, 1.0}  # Uses 0.99
    lam: ${oc.select:algorithm.lam, 0.95}  # Uses 0.95

Benefits

This forwarding mechanism provides several benefits:

Backward compatibility: Users can continue using their familiar backend-specific config formats
Gradual migration: Projects can migrate to rLLM configs incrementally
Flexibility: Supports both backend-specific and rLLM-native configuration styles
Consistency: Ensures backend configs and rLLM configs stay synchronized

Configuration best practices

Use rLLM configs for new projects: If starting from scratch, use the rLLM backend-agnostic configs for better portability across backends.
Leverage forwarding for migration: If migrating from a specific backend, use the forwarding mechanism to maintain existing configs while gradually adopting rLLM conventions.
Check the unified config: The unified.yaml file shows how all configs are combined and is useful for debugging configuration issues.
Understand defaults hierarchy: Backend-specific configs override rLLM defaults, which in turn override Hydra’s base defaults.

Get started

Tutorials

rLLM CLI & UI

Core concepts

Agent runtimes

Training backends

Guides

Unified workflow trainer

Advanced algorithms

​Configuration structure

​rLLM backend-agnostic configurations

​Agent configuration

​Environment configuration

​Workflow configuration

​Rollout configuration

​Trainer configuration

​Algorithm configuration

​Stepwise advantage configuration

​Trajectory processing flags

​Compact filtering configuration

​Rejection sampling configuration

​SDK configuration

​Episode logging configuration

​Backend-specific configurations

​Tinker backend

​Top-level Tinker-specific keys

​Model block

​Training block (Tinker-native)

​Validation, sampling, rollout, and data blocks

​OPSD block

​Forwarding to common rllm.*

​Verl backend

​Key fields in verl.yaml

​Forwarding to common rllm.*

​Config forwarding mechanism

​How it works

​Example: Verl backend forwarding

​Example: Algorithm configuration forwarding

​Benefits

​Configuration best practices

Configuration structure

rLLM backend-agnostic configurations

Agent configuration

Environment configuration

Workflow configuration

Rollout configuration

Trainer configuration

Algorithm configuration

Stepwise advantage configuration

Trajectory processing flags

Compact filtering configuration

Rejection sampling configuration

SDK configuration

Episode logging configuration

Backend-specific configurations

Tinker backend

Top-level Tinker-specific keys

Model block

Training block (Tinker-native)

Validation, sampling, rollout, and data blocks

OPSD block

Forwarding to common `rllm.*`

Verl backend

Key fields in `verl.yaml`

Forwarding to common `rllm.*`

Config forwarding mechanism

How it works

Example: Verl backend forwarding

Example: Algorithm configuration forwarding

Benefits

Configuration best practices