The rLLM framework provides a unified configuration system that separates backend-agnostic settings from backend-specific configurations. This design allows you to switch between different RL backends (Tinker, Verl) while maintaining consistent core training logic.

Configuration structure

The configuration system is organized into three main components:
  1. rLLM backend-agnostic configs: Core training settings shared across all backends
  2. Backend-specific configs: Settings specific to Tinker or Verl backends
  3. Forwarding mechanism: Allows backend-specific configs to override rLLM configs for backward compatibility
All configuration files are located in rllm/experimental/config/:
  • rllm/experimental/config/rllm/base.yaml: Backend-agnostic rLLM configurations
  • rllm/experimental/config/rllm/backend/tinker.yaml: Tinker-specific configurations
  • rllm/experimental/config/rllm/backend/verl.yaml: Verl-specific configurations
  • rllm/experimental/config/unified.yaml: Main entry point that combines all configs
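As an illustration of how these files compose (the exact defaults list lives in unified.yaml and may differ), a Hydra entry point along these lines selects the backend:

```yaml
# Illustrative sketch of a Hydra defaults list; see unified.yaml for the real composition
defaults:
  - rllm/base             # backend-agnostic settings
  - rllm/backend/tinker   # swap for rllm/backend/verl to change backends
  - _self_
```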

rLLM backend-agnostic configurations

These configurations are defined in rllm/experimental/config/rllm/base.yaml and are shared across all backends.

Agent configuration

Settings for the agent that interacts with the environment.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | math_agent | Name of the agent |
| max_steps | int | 20 | Maximum number of steps per trajectory |
| trajectory_timeout | int \| null | null | Timeout for trajectory execution (seconds) |
| overlong_filter | bool | False | Whether to filter out overlong trajectories |
| agent_args | dict | {} | Additional agent-specific arguments |
| engine_args | dict | {} | Additional engine-specific arguments |
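For example, an override raising the step budget might look like this (the rllm.agent nesting is assumed here; check base.yaml for the exact key names):

```yaml
rllm:
  agent:                # key name illustrative; see base.yaml
    name: math_agent
    max_steps: 40       # allow longer trajectories
    agent_args:
      verbose: true     # hypothetical agent-specific argument
```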

Environment configuration

Settings for the environment where the agent operates.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | custom | Name of the environment |
| env_args | dict | {} | Additional environment-specific arguments |

Workflow configuration

Settings for workflow-based training (alternative to agent-based training).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_workflow | bool | False | Whether to use workflow mode instead of agent mode |
| name | str | single_turn_workflow | Name of the workflow |
| workflow_args.agent_cls | str \| null | null | Agent class to use in workflow |
| workflow_args.agent_args | dict | {} | Agent arguments in workflow |
| workflow_args.env_cls | str \| null | null | Environment class to use in workflow |
| workflow_args.env_args | dict | {} | Environment arguments in workflow |
| workflow_args.timeout | float | 1e6 | Workflow execution timeout (seconds) |
| workflow_args.gamma | float | 0.0 | Discount factor (0.0 = no discounting) |
| workflow_args.reward_bonus_coeff | float | 0.0 | Reward shaping coefficient |
| n_parallel_tasks | int | 256 | Number of parallel tasks to run |
| retry_limit | int | 3 | Maximum number of retries on failure |
| raise_on_error | bool | True | Whether to raise exceptions on errors |
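A workflow-mode sketch (the rllm.workflow nesting is assumed here; values are illustrative):

```yaml
rllm:
  workflow:                   # key name illustrative; see base.yaml
    use_workflow: true
    name: single_turn_workflow
    workflow_args:
      timeout: 600.0          # fail a workflow after 10 minutes
      gamma: 0.99             # discount rewards across steps
    retry_limit: 3
    raise_on_error: false     # log failures instead of raising
```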

Rollout configuration

Settings for trajectory rollouts during training and validation.
These settings are primarily for logging purposes. The actual rollout behavior is determined by backend-specific configurations.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n | int | 8 | Number of rollouts per prompt during training |
| n_val | int | 1 | Number of rollouts per prompt during validation |

Trainer configuration

Core training loop settings.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| total_epochs | int | 10 | Total number of training epochs |
| total_batches | int | -1 | Total number of training batches (-1 = use epochs) |
| logger | list[str] | ['console'] | Logging backends (options: console, wandb, tensorboard) |
| project_name | str | rllm-training | Project name for logging |
| experiment_name | str | default | Experiment name for logging |
| test_freq | int | 5 | Frequency of validation (in epochs) |
| save_freq | int | 20 | Frequency of checkpoint saving (in epochs) |
| val_before_train | bool | True | Whether to run validation before training starts |
| val_only | bool | False | Whether to only run validation (no training) |
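For example, a typical trainer block (the rllm.trainer nesting matches the forwarding examples later on this page; values are illustrative):

```yaml
rllm:
  trainer:
    total_epochs: 20
    logger: [console, wandb]
    project_name: rllm-training
    experiment_name: grpo-baseline   # illustrative name
    test_freq: 5
    save_freq: 20
```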

Algorithm configuration

RL algorithm and advantage estimation settings.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| adv_estimator | str | grpo | Advantage estimator (options: grpo, reinforce, reinforce_plus_plus_baseline, rloo, gae) |
| gamma | float | 1.0 | Discount factor for future rewards |
| lam | float | 0.95 | Lambda for GAE (Generalized Advantage Estimation) |
| norm_adv_by_std_in_grpo | bool | True | Whether to normalize advantages by standard deviation in GRPO |
| use_rllm | bool | False | Whether to use rLLM-specific features |
| loss_fn | str \| null | null | Loss function for the Tinker backend (options: importance_sampling, ppo, cispo, dro, cross_entropy) |
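For example, switching to GAE-based advantage estimation (values illustrative):

```yaml
rllm:
  algorithm:
    adv_estimator: gae
    gamma: 0.99   # discount future rewards
    lam: 0.95     # GAE lambda
```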

Stepwise advantage configuration

Settings for computing advantages at each step in multi-step trajectories.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Whether to enable stepwise advantage computation |
| mode | str | broadcast | Advantage computation mode (options: broadcast, per_step) |
| normalize_by_steps | bool | False | Whether to normalize advantages by number of steps |

Trajectory processing flags

Top-level flags for trajectory processing and filtering.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| disable_thinking | bool | False | Whether to disable thinking tokens in responses |
| accumulate_reasoning | bool | False | Whether to accumulate reasoning across steps |
| mask_truncated_samples | bool | False | Whether to mask trajectories that were truncated |
| filter_token_mismatch | bool | True | Whether to filter out trajectories with token mismatches |

Compact filtering configuration

Fine-grained filtering of trajectories based on various termination conditions.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Whether to enable compact filtering |
| mask_max_prompt_length_exceeded | bool | True | Mask trajectories that exceed max prompt length |
| mask_max_response_length_exceeded | bool | True | Mask trajectories that exceed max response length |
| mask_env_done | bool | False | Mask trajectories where the environment signaled done |
| mask_max_turns_exceeded | bool | True | Mask trajectories that exceed max turns |
| mask_timeout | bool | True | Mask trajectories that timed out |
| mask_unknown | bool | False | Mask trajectories with unknown termination reasons |
| mask_error | bool | True | Mask trajectories that encountered errors |
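As a sketch (the enclosing key name is assumed here; check base.yaml for the exact nesting), one might enable filtering while keeping environment-terminated trajectories:

```yaml
rllm:
  compact_filtering:          # key name illustrative; see base.yaml
    enable: true
    mask_env_done: false      # keep trajectories the environment ended normally
    mask_timeout: true        # drop trajectories that timed out
    mask_error: true          # drop trajectories that hit errors
```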

Rejection sampling configuration

Settings for rejection sampling to improve training data quality.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Whether to enable rejection sampling |
| multiplier | int | 1 | Multiplier for the number of rollouts to generate |
| min_partial_solve_tasks | int | 1 | Minimum number of tasks that must be partially solved |
| min_trajs_per_group | int | 2 | Minimum number of trajectories per group to keep |
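For instance, with rollout.n = 8 and multiplier = 2, 16 rollouts are generated per prompt before filtering. A sketch (the enclosing key name is assumed; check base.yaml):

```yaml
rllm:
  rejection_sample:           # key name illustrative; see base.yaml
    enable: true
    multiplier: 2             # generate rollout.n * 2 candidates per prompt
    min_trajs_per_group: 2    # drop groups with fewer surviving trajectories
```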

SDK configuration

Settings for the rLLM SDK, including trace storage and proxy server.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| store.path | str | ~/.rllm/traces.db | Path to trace database |
| processing.groupby_key | str \| null | null | Key to group trajectories by |
| processing.traj_name_key | str \| null | null | Key to use as trajectory name |
| proxy.host | str | 127.0.0.1 | Proxy server host |
| proxy.port | int | 4000 | Proxy server port |
| proxy.mode | str | subprocess | Proxy mode (options: subprocess, external) |
| proxy.admin_token | str | my-shared-secret | Admin token for proxy authentication |
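For example, pointing the trace store at a project-local database and managing the proxy process externally (the rllm.sdk nesting is assumed here):

```yaml
rllm:
  sdk:                        # key name illustrative; see base.yaml
    store:
      path: ./traces.db       # project-local instead of ~/.rllm/traces.db
    proxy:
      host: 127.0.0.1
      port: 4000
      mode: external          # manage the proxy process yourself
      admin_token: ${oc.env:RLLM_PROXY_TOKEN}   # avoid the default shared secret
```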

Episode logging configuration

Settings for logging full episode trajectories to disk.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| log_episodes | bool | False | Whether to log full episodes to disk |
| episode_log_dir | str | logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name} | Directory for episode logs |

Backend-specific configurations

Tinker backend

Tinker-specific settings live in rllm/experimental/config/rllm/backend/tinker.yaml. This file contains:
  1. Tinker service and execution settings
  2. Model/LoRA training settings
  3. Sampling and rollout-engine settings
  4. Tinker-native training/data blocks
  5. Forwarding into rllm.* common config keys

Top-level Tinker-specific keys

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tinker_base_url | str \| null | null | Tinker service URL (null for local/default) |
| fuse_forward_backward_and_optim_step | bool | false | Whether to fuse train-step internals in the backend |

Model block

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model.name | str | Qwen/Qwen3-8B | Base model name |
| model.lora_rank | int | 32 | LoRA rank |
| model.train_unembed | bool | true | Train LoRA on the output embedding layer |
| model.train_attn | bool | true | Train LoRA on attention layers |
| model.train_mlp | bool | true | Train LoRA on MLP layers |

Training block (Tinker-native)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| training.group_size | int | ??? | Number of rollouts per prompt (required) |
| training.learning_rate | float | 2e-5 | Learning rate |
| training.lr_schedule | str | constant | LR schedule (options: constant, linear, cosine) |
| training.warmup_steps_ratio | float | 0.0 | Warmup ratio in [0, 1] |
| training.beta1 | float | 0.9 | Adam beta1 |
| training.beta2 | float | 0.95 | Adam beta2 |
| training.eps | float | 1e-8 | Adam epsilon |
| training.max_length | int | 32768 | Max model context length |
| training.num_minibatches | int | 1 | Number of minibatches |
| training.default_local_dir | str | /tmp/rllm-tinker-checkpoints | Local checkpoint directory |
| training.resume_from_tinker_id | str \| null | null | Optional checkpoint/model ID to resume from |
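Putting the Tinker-native blocks together, a minimal override might look like this (keys as documented above; values are illustrative):

```yaml
# Tinker backend overrides (values illustrative)
model:
  name: Qwen/Qwen3-8B
  lora_rank: 32
training:
  group_size: 8               # required: tinker.yaml ships it as ???
  learning_rate: 2e-5
  max_length: 32768
  num_minibatches: 1
```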

Validation, sampling, rollout, and data blocks

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| validation.group_size | int | ??? | Rollouts per prompt for validation (required) |
| sampling.train.temperature | float | 1.0 | Training sampling temperature |
| sampling.train.top_p | float | 1.0 | Training nucleus-sampling threshold |
| sampling.train.top_k | int | -1 | Training top-k |
| sampling.val.temperature | float | 1.0 | Validation sampling temperature |
| sampling.val.top_p | float | 1.0 | Validation nucleus-sampling threshold |
| sampling.val.top_k | int | -1 | Validation top-k |
| rollout_engine.reasoning_effort | str | medium | Reasoning effort mode |
| rollout_engine.accumulate_reasoning | bool | false | Whether to accumulate reasoning across steps |
| rollout_engine.disable_thinking | bool | false | Whether to disable thinking tokens |
| rollout_engine.renderer_name | str \| null | null | Optional renderer name |
| data.max_prompt_length | int | 2048 | Max prompt length |
| data.max_response_length | int | 2048 | Max response length |
| data.train_batch_size | int | 64 | Training batch size |
| data.val_batch_size | int | 32 | Validation batch size |

OPSD block

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| opsd.kl_penalty_coef | float | 1.0 | KL penalty coefficient |
| opsd.kl_discount_factor | float | 0.0 | KL discount factor |
| opsd.teacher_messages_key | str | teacher_messages | Key for teacher messages |
| opsd.teacher_policy_update_freq | int | -1 | Teacher refresh frequency (-1 = initial teacher only) |

Forwarding to common rllm.*

The Tinker backend forwards its group-size settings into the backend-agnostic rollout config:
  • rllm.rollout.n <- training.group_size
  • rllm.rollout.n_val <- validation.group_size
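In tinker.yaml this presumably takes the same oc.select form as the Verl forwarding described later on this page, e.g.:

```yaml
# Sketch of the Tinker group-size forwarding (see tinker.yaml for the exact lines)
rllm:
  rollout:
    n: ${oc.select:training.group_size, 8}
    n_val: ${oc.select:validation.group_size, 1}
```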

Verl backend

Verl-specific settings live in rllm/experimental/config/rllm/backend/verl.yaml. This file is intentionally thin and composes Verl’s native PPO config via:

```yaml
defaults:
  - /ppo_trainer
  - _self_
```

For detailed semantics of Verl-native fields, see the Verl configuration docs. In rLLM, verl.yaml mainly does three things:
  1. Sets a few required overrides for unified-trainer compatibility
  2. Marks selected native fields as required (???) so they can be provided by user/config composition
  3. Forwards native Verl fields into rllm.* common config for backward compatibility

Key fields in verl.yaml

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| actor_rollout_ref.rollout.mode | str | async | Required mode for the unified Verl backend |
| actor_rollout_ref.rollout.agent.num_workers | int | 0 | Agent worker count |
| actor_rollout_ref.rollout.val_kwargs.n | int | ??? | Validation rollout count |
| data.gen_batch_size | int | ${mul:...} | Generated batch size |
| data.return_multi_modal_inputs | bool | False | Include multimodal inputs in the data path |
| algorithm.adv_estimator | str | ??? | Native advantage estimator |
| algorithm.gamma | float | ??? | Native discount factor |
| algorithm.lam | float | ??? | Native lambda |
| algorithm.norm_adv_by_std_in_grpo | bool | ??? | Native GRPO normalization |
| trainer.total_epochs | int | ??? | Native epoch count |
| trainer.total_training_steps | int | ??? | Native total-step budget |
| trainer.logger | list[str] | ??? | Native logger list |
| trainer.project_name | str | ??? | Project name |
| trainer.experiment_name | str | ??? | Experiment name |
| trainer.test_freq | int | ??? | Validation cadence |
| trainer.save_freq | int | ??? | Save cadence |
| trainer.val_before_train | bool | ??? | Validate before training |
| trainer.val_only | bool | ??? | Validation-only mode |
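Fields marked ??? must be supplied by the user's config or command-line overrides, for example (values illustrative):

```yaml
# User-side overrides supplying required (???) Verl-native fields
trainer:
  total_epochs: 15
  logger: [console, wandb]
  project_name: my-verl-project
  experiment_name: verl-experiment-1
  test_freq: 5
  save_freq: 20
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 0.95
```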

Forwarding to common rllm.*

The Verl backend forwards the following:
  • rllm.algorithm.{adv_estimator,gamma,lam,norm_adv_by_std_in_grpo}
  • rllm.rollout.{n,n_val}
  • rllm.trainer.{total_epochs,total_batches,logger,project_name,experiment_name,test_freq,save_freq,val_before_train,val_only}
It also extends Hydra’s search path with Verl’s packaged configs:

```yaml
rllm:
  hydra.searchpath:
    - pkg://verl.trainer.config
```

Config forwarding mechanism

The rLLM configuration system supports a forwarding mechanism that allows users familiar with a specific backend (Tinker or Verl) to specify configurations in their native format. These backend-specific configs are then automatically forwarded to the corresponding rLLM configs for backward compatibility.

How it works

Backend-specific config files can override rLLM settings using OmegaConf’s oc.select resolver (available in Hydra configs). This mechanism:
  1. First checks if a backend-specific config value is provided
  2. If provided, uses that value to populate the rLLM config
  3. If not provided, falls back to the rLLM default value
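The select-with-fallback behavior can be sketched in plain Python (a simplification for intuition, not OmegaConf's actual implementation):

```python
def oc_select(cfg: dict, dotted_key: str, default):
    """Walk a nested dict by dotted path, returning default on any missing key.

    Mirrors the fallback semantics of ${oc.select:key, default}.
    """
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# Backend-native Verl config with one override present:
verl_cfg = {"trainer": {"total_epochs": 15}}

print(oc_select(verl_cfg, "trainer.total_epochs", 10))  # -> 15 (forwarded)
print(oc_select(verl_cfg, "trainer.test_freq", 5))      # -> 5 (rLLM default)
```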

Example: Verl backend forwarding

In rllm/experimental/config/rllm/backend/verl.yaml, Verl’s native trainer configuration is forwarded to rLLM:

```yaml
# In Verl's native config format
trainer:
  total_epochs: 15
  project_name: 'my-verl-project'
  experiment_name: 'verl-experiment-1'

# These are automatically forwarded to rLLM configs
rllm:
  trainer:
    total_epochs: ${oc.select:trainer.total_epochs, 10}  # Uses 15 from above
    project_name: ${oc.select:trainer.project_name, 'rllm-training'}  # Uses 'my-verl-project'
    experiment_name: ${oc.select:trainer.experiment_name, 'default'}  # Uses 'verl-experiment-1'
```
In this example:
  • Users can specify trainer.total_epochs in Verl’s native format
  • The value is automatically forwarded to rllm.trainer.total_epochs
  • If the Verl config is not specified, the rLLM default (10) is used

Example: Algorithm configuration forwarding

Similarly, algorithm configurations can be forwarded:

```yaml
# Backend-specific algorithm config
algorithm:
  adv_estimator: gae
  gamma: 0.99
  lam: 0.95

# Forwarded to rLLM
rllm:
  algorithm:
    adv_estimator: ${oc.select:algorithm.adv_estimator, grpo}  # Uses 'gae'
    gamma: ${oc.select:algorithm.gamma, 1.0}  # Uses 0.99
    lam: ${oc.select:algorithm.lam, 0.95}  # Uses 0.95
```

Benefits

This forwarding mechanism provides several benefits:
  • Backward compatibility: Users can continue using their familiar backend-specific config formats
  • Gradual migration: Projects can migrate to rLLM configs incrementally
  • Flexibility: Supports both backend-specific and rLLM-native configuration styles
  • Consistency: Ensures backend configs and rLLM configs stay synchronized

Configuration best practices

  1. Use rLLM configs for new projects: If starting from scratch, use the rLLM backend-agnostic configs for better portability across backends.
  2. Leverage forwarding for migration: If migrating from a specific backend, use the forwarding mechanism to maintain existing configs while gradually adopting rLLM conventions.
  3. Check the unified config: The unified.yaml file shows how all configs are combined and is useful for debugging configuration issues.
  4. Understand defaults hierarchy: Backend-specific configs override rLLM defaults, which in turn override Hydra’s base defaults.