The rLLM framework provides a unified configuration system that separates backend-agnostic settings from backend-specific configurations. This design allows you to switch between different RL backends (Tinker, Verl) while maintaining consistent core training logic.

Configuration structure

The configuration system is organized into three main components:
  1. rLLM backend-agnostic configs: Core training settings shared across all backends
  2. Backend-specific configs: Settings specific to Tinker or Verl backends
  3. Forwarding mechanism: Keeps backend-specific keys and their rllm.* counterparts in sync, so that existing backend-native overrides keep working for backward compatibility
All configuration files are located in rllm/trainer/config/:
  • rllm/trainer/config/rllm/base.yaml: Backend-agnostic rLLM configurations
  • rllm/trainer/config/rllm/backend/tinker.yaml: Tinker-specific configurations
  • rllm/trainer/config/rllm/backend/verl.yaml: Verl-specific configurations
  • rllm/trainer/config/unified.yaml: Main entry point that combines all configs

rLLM backend-agnostic configurations

These configurations are defined in rllm/base.yaml and are used across different backends.

Agent configuration

Settings for the agent that interacts with the environment.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | math_agent | Name of the agent |
| max_steps | int | 20 | Maximum number of steps per trajectory |
| trajectory_timeout | int \| null | null | Timeout for trajectory execution (seconds) |
| overlong_filter | bool | False | Whether to filter out overlong trajectories |
| agent_args | dict | {} | Additional agent-specific arguments |
| engine_args | dict | {} | Additional engine-specific arguments |

Environment configuration

Settings for the environment where the agent operates.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | custom | Name of the environment |
| env_args | dict | {} | Additional environment-specific arguments |

Workflow configuration

Settings for workflow-based training (alternative to agent-based training).
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_workflow | bool | False | Whether to use workflow mode instead of agent mode |
| name | str | single_turn_workflow | Name of the workflow |
| workflow_args.agent_cls | str \| null | null | Agent class to use in workflow |
| workflow_args.agent_args | dict | {} | Agent arguments in workflow |
| workflow_args.env_cls | str \| null | null | Environment class to use in workflow |
| workflow_args.env_args | dict | {} | Environment arguments in workflow |
| workflow_args.timeout | float | 1e6 | Workflow execution timeout |
| workflow_args.gamma | float | 0.0 | Discount factor (0.0 = no discounting) |
| workflow_args.reward_bonus_coeff | float | 0.0 | Reward shaping coefficient |
| n_parallel_tasks | int | 256 | Number of parallel tasks to run |
| retry_limit | int | 3 | Maximum number of retries on failure |
| raise_on_error | bool | True | Whether to raise exceptions on errors |

Rollout configuration

Settings for trajectory rollouts during training and validation.
These settings are primarily for logging purposes. The actual rollout behavior is determined by backend-specific configurations.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n | int | 8 | Number of rollouts per prompt during training |
| n_val | int | 1 | Number of rollouts per prompt during validation |

Trainer configuration

Core training loop settings.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| total_epochs | int | 10 | Total number of training epochs |
| total_batches | int | -1 | Total number of training batches (-1 = use epochs) |
| logger | list[str] | ['console'] | Logging backends (options: console, wandb, tensorboard) |
| project_name | str | rllm-training | Project name for logging |
| experiment_name | str | default | Experiment name for logging |
| test_freq | int | 5 | Frequency of validation (in epochs) |
| save_freq | int | 20 | Frequency of checkpoint saving (in epochs) |
| val_before_train | bool | True | Whether to run validation before training starts |
| val_only | bool | False | Whether to only run validation (no training) |
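One possible reading of the `total_batches` rule ("-1 = use epochs") is sketched below. The function name and the `batches_per_epoch` argument are hypothetical and chosen for illustration; this is not rLLM's actual trainer code.

```python
def resolve_total_batches(total_epochs: int, total_batches: int, batches_per_epoch: int) -> int:
    """Return how many training batches a run would process.

    Assumption: -1 means "derive the count from total_epochs"; any other
    value is taken as an explicit batch budget.
    """
    if total_batches == -1:
        return total_epochs * batches_per_epoch
    return total_batches


# Defaults from the table above (total_epochs=10, total_batches=-1),
# with a hypothetical dataset that yields 50 batches per epoch.
print(resolve_total_batches(total_epochs=10, total_batches=-1, batches_per_epoch=50))
print(resolve_total_batches(total_epochs=10, total_batches=200, batches_per_epoch=50))
```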

Algorithm configuration

RL algorithm and advantage estimation settings.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| adv_estimator | str | grpo | Advantage estimator (options: grpo, reinforce, reinforce_plus_plus_baseline, rloo) |
| norm_adv_by_std_in_grpo | bool | True | Whether to normalize advantages by standard deviation in GRPO |
| loss_fn | str \| null | null | Loss function for Tinker backend (options: importance_sampling, ppo, cispo, dro, cross_entropy) |
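To make the effect of `norm_adv_by_std_in_grpo` concrete, here is a minimal sketch of group-relative advantage estimation over the rewards of one prompt's rollouts. It assumes a population standard deviation and a small `eps` for numerical safety; both are illustrative choices, and the backends' own estimators may differ in these details.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, norm_adv_by_std_in_grpo=True, eps=1e-6):
    """Group-relative advantages for the rollouts of a single prompt."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not norm_adv_by_std_in_grpo:
        # Only subtract the group mean.
        return centered
    # Additionally rescale by the group's reward spread (eps avoids division by zero).
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards, norm_adv_by_std_in_grpo=False))
print(grpo_advantages(rewards, norm_adv_by_std_in_grpo=True))
```

Note that a group whose rewards are all identical yields zero advantages either way, which is one reason the filtering and rejection-sampling options later in this page exist.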

Stepwise advantage configuration

Settings for computing advantages at each step in multi-step trajectories.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Whether to enable stepwise advantage computation |
| mode | str | broadcast | Advantage computation mode (options: broadcast, per_step) |
| normalize_by_steps | bool | False | Whether to normalize advantages by number of steps |
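As an illustration of the `broadcast` mode, the sketch below copies one trajectory-level advantage to every step and optionally divides it by the step count. This is an assumed interpretation of the two options, with a hypothetical function name; the `per_step` mode is not sketched because its semantics are not described here.

```python
def broadcast_advantage(traj_advantage: float, num_steps: int, normalize_by_steps: bool = False):
    """Assign a trajectory-level advantage to each of its steps."""
    if num_steps <= 0:
        raise ValueError("a trajectory needs at least one step")
    # Assumption: normalizing by steps means dividing by the number of steps,
    # so long trajectories do not contribute proportionally more signal.
    per_step = traj_advantage / num_steps if normalize_by_steps else traj_advantage
    return [per_step] * num_steps


print(broadcast_advantage(1.5, num_steps=3))
print(broadcast_advantage(1.5, num_steps=3, normalize_by_steps=True))
```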

Trajectory processing flags

Top-level flags for trajectory processing and filtering.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| disable_thinking | bool | False | Whether to disable thinking tokens in responses |
| accumulate_reasoning | bool | False | Whether to accumulate reasoning across steps |
| mask_truncated_samples | bool | False | Whether to mask trajectories that were truncated |
| filter_token_mismatch | bool | True | Whether to filter out trajectories with token mismatches |

Compact filtering configuration

Fine-grained filtering of trajectories based on various termination conditions.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Whether to enable compact filtering |
| mask_max_prompt_length_exceeded | bool | True | Mask trajectories that exceed max prompt length |
| mask_max_response_length_exceeded | bool | True | Mask trajectories that exceed max response length |
| mask_env_done | bool | False | Mask trajectories where environment signaled done |
| mask_max_turns_exceeded | bool | True | Mask trajectories that exceed max turns |
| mask_timeout | bool | True | Mask trajectories that timed out |
| mask_unknown | bool | False | Mask trajectories with unknown termination reasons |
| mask_error | bool | True | Mask trajectories that encountered errors |
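The flags above can be read as a lookup from a trajectory's termination reason to a mask decision. The sketch below shows that reading with the table's default values; the termination-reason strings (for example `"timeout"`) and the function name are hypothetical and only mirror the flag names.

```python
# Default flag values copied from the table above.
DEFAULT_COMPACT_FILTERING = {
    "mask_max_prompt_length_exceeded": True,
    "mask_max_response_length_exceeded": True,
    "mask_env_done": False,
    "mask_max_turns_exceeded": True,
    "mask_timeout": True,
    "mask_unknown": False,
    "mask_error": True,
}


def should_mask(termination_reason: str, flags=DEFAULT_COMPACT_FILTERING, enable: bool = True) -> bool:
    """Decide whether a trajectory is masked out of the training loss."""
    if not enable:
        # With compact filtering disabled, no trajectory is masked by these flags.
        return False
    # Assumption: each reason maps to the flag named "mask_<reason>".
    return flags.get(f"mask_{termination_reason}", False)


for reason in ("env_done", "timeout", "error"):
    print(reason, should_mask(reason))
```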

Rejection sampling configuration

Settings for rejection sampling to improve training data quality.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable | bool | False | Whether to enable rejection sampling |
| multiplier | int | 1 | Multiplier for number of rollouts to generate |
| min_partial_solve_tasks | int | 1 | Minimum number of tasks that must be partially solved |
| min_trajs_per_group | int | 2 | Minimum number of trajectories per group to keep |
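A rough sketch of how the two minimum thresholds might interact is shown below. It assumes that a task counts as "partially solved" when its group contains at least one positive and at least one non-positive reward; that definition, the data layout, and all function names are assumptions for illustration, not rLLM's implementation.

```python
def keep_groups(groups, min_trajs_per_group=2):
    """Drop task groups that have fewer trajectories than the minimum."""
    return {task: rewards for task, rewards in groups.items() if len(rewards) >= min_trajs_per_group}


def count_partially_solved(groups):
    """Count tasks with a mix of successful and unsuccessful trajectories (assumed definition)."""
    count = 0
    for rewards in groups.values():
        solved = sum(1 for r in rewards if r > 0)
        if 0 < solved < len(rewards):
            count += 1
    return count


def batch_is_usable(groups, min_partial_solve_tasks=1, min_trajs_per_group=2):
    """Accept a batch only if enough of its kept groups carry a learning signal."""
    kept = keep_groups(groups, min_trajs_per_group)
    return count_partially_solved(kept) >= min_partial_solve_tasks


# Hypothetical per-task rewards: task -> list of trajectory rewards.
groups = {"task_a": [1.0, 0.0], "task_b": [0.0, 0.0], "task_c": [1.0]}
print(sorted(keep_groups(groups)))
print(batch_is_usable(groups))
```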

SDK configuration

Settings for the rLLM SDK, including trace storage and proxy server.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| store.path | str | ~/.rllm/traces.db | Path to trace database |
| processing.groupby_key | str \| null | null | Key to group trajectories by |
| processing.traj_name_key | str \| null | null | Key to use as trajectory name |
| proxy.host | str | 127.0.0.1 | Proxy server host |
| proxy.port | int | 4000 | Proxy server port |
| proxy.mode | str | subprocess | Proxy mode (options: subprocess, external) |
| proxy.admin_token | str | my-shared-secret | Admin token for proxy authentication |

Episode logging configuration

Settings for logging full episode trajectories to disk.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| log_episodes | bool | false | Whether to log full episodes to disk |
| episode_log_dir | str | logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name} | Directory for episode logs |
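The default `episode_log_dir` uses `${...}` interpolation against other config keys. In practice the config library resolves this; the stdlib-only sketch below merely imitates the lookup to show what the default expands to with the trainer defaults listed earlier. The `resolve` helper is hypothetical.

```python
import re


def resolve(template: str, cfg: dict) -> str:
    """Replace each ${a.b.c} reference with the value found at that dotted path."""
    def lookup(match):
        node = cfg
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


cfg = {"rllm": {"trainer": {"project_name": "rllm-training", "experiment_name": "default"}}}
episode_log_dir = resolve("logs/${rllm.trainer.project_name}/${rllm.trainer.experiment_name}", cfg)
print(episode_log_dir)  # logs/rllm-training/default
```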

Backend-specific configurations

Tinker backend

Tinker-specific settings live in rllm/trainer/config/rllm/backend/tinker.yaml. This file contains:
  1. Tinker service and execution settings
  2. Model/LoRA training settings
  3. Sampling and rollout-engine settings
  4. Tinker-native training/data blocks
  5. Forwarding into rllm.* common config keys

Top-level Tinker-specific keys

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tinker_base_url | str \| null | null | Tinker service URL (null for local/default) |
| fuse_forward_backward_and_optim_step | bool | false | Whether to fuse train-step internals in backend |

Model block

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model.name | str | Qwen/Qwen3-8B | Base model name |
| model.lora_rank | int | 32 | LoRA rank |
| model.train_unembed | bool | true | Train LoRA on output embedding layer |
| model.train_attn | bool | true | Train LoRA on attention layers |
| model.train_mlp | bool | true | Train LoRA on MLP layers |

Training block (Tinker-native)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| training.group_size | int | ??? | Number of rollouts per prompt |
| training.learning_rate | float | 2e-5 | Learning rate |
| training.lr_schedule | str | constant | LR schedule (constant, linear, cosine) |
| training.warmup_steps_ratio | float | 0.0 | Warmup ratio in [0, 1] |
| training.beta1 | float | 0.9 | Adam beta1 |
| training.beta2 | float | 0.95 | Adam beta2 |
| training.eps | float | 1e-8 | Adam epsilon |
| training.max_length | int | 32768 | Max model context length |
| training.num_minibatches | int | 1 | Number of minibatches |
| training.default_local_dir | str | /tmp/rllm-tinker-checkpoints | Local checkpoint directory |
| training.resume_from_tinker_id | str \| null | null | Optional checkpoint/model ID to resume |

Validation, sampling, rollout, and data blocks

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| validation.group_size | int | ??? | Rollouts per prompt for validation |
| sampling.train.temperature | float | 1.0 | Train sampling temperature |
| sampling.train.top_p | float | 1.0 | Train nucleus sampling threshold |
| sampling.train.top_k | int | -1 | Train top-k |
| sampling.val.temperature | float | 1.0 | Val sampling temperature |
| sampling.val.top_p | float | 1.0 | Val nucleus sampling threshold |
| sampling.val.top_k | int | -1 | Val top-k |
| rollout_engine.reasoning_effort | str | medium | Reasoning effort mode |
| rollout_engine.accumulate_reasoning | bool | false | Whether to accumulate reasoning across steps |
| rollout_engine.disable_thinking | bool | false | Whether to disable thinking tokens |
| rollout_engine.renderer_name | str \| null | null | Optional renderer name |
| data.max_prompt_length | int | 2048 | Max prompt length |
| data.max_response_length | int | 2048 | Max response length |
| data.train_batch_size | int | 64 | Train batch size |
| data.val_batch_size | int | 32 | Validation batch size |

Forwarding to common rllm.*

The Tinker backend forwards its group-size settings into the backend-agnostic rollout config:
  • rllm.rollout.n <- training.group_size
  • rllm.rollout.n_val <- validation.group_size
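The one-way forwarding listed above can be sketched with plain dictionaries. The function name is hypothetical; the `???` check reflects that both group sizes default to the mandatory-value marker and therefore have to be set before forwarding makes sense.

```python
MISSING = "???"  # marker for a mandatory value that has not been set


def forward_group_sizes(tinker_cfg: dict) -> dict:
    """Copy Tinker group sizes into the backend-agnostic rollout keys."""
    train_n = tinker_cfg["training"]["group_size"]
    val_n = tinker_cfg["validation"]["group_size"]
    if MISSING in (train_n, val_n):
        raise ValueError("training.group_size and validation.group_size must be set")
    # rllm.rollout.n <- training.group_size, rllm.rollout.n_val <- validation.group_size
    return {"rllm": {"rollout": {"n": train_n, "n_val": val_n}}}


forwarded = forward_group_sizes({"training": {"group_size": 8}, "validation": {"group_size": 1}})
print(forwarded)
```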

Verl backend

Verl-specific settings live in rllm/trainer/config/rllm/backend/verl.yaml. This file is intentionally thin and composes Verl’s native PPO config via:
defaults:
  - /ppo_trainer
  - _self_
For detailed semantics of Verl-native fields, see the Verl configuration docs. In rLLM, verl.yaml does only two things:
  1. Sets a small number of required overrides for unified-trainer compatibility (e.g. actor_rollout_ref.rollout.mode=async, actor.use_rollout_log_probs=True).
  2. Pins one rllm-namespaced default that diverges from verl’s (rllm.algorithm.rollout_correction.bypass_mode=False).
Everything else — propagating values between the verl-native namespace and the rllm.* namespace — happens at runtime via sync_config in rllm/trainer/verl/utils.py.

Key fields in verl.yaml

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| actor_rollout_ref.rollout.mode | str | async | Required mode for unified Verl backend |
| actor_rollout_ref.rollout.agent.num_workers | int | 0 | Agent worker count |
| actor_rollout_ref.rollout.calculate_log_probs | bool | True | Compute log-probs during rollout (needed for rollout-correction) |
| actor_rollout_ref.rollout.val_kwargs.do_sample | bool | True | Use sampling during validation |
| actor_rollout_ref.actor.use_rollout_log_probs | bool | True | Reuse rollout log-probs in the actor (bypass-mode default) |
| data.gen_batch_size | int | ${mul:...} | Generated batch size |
| data.return_multi_modal_inputs | bool | False | Include multimodal inputs in data path |
| rllm.backend | str | verl | Backend selector |
| rllm.algorithm.rollout_correction.bypass_mode | bool | False | Verl-side default differs from rLLM’s null; pinned here |
All other shared knobs (algorithm.adv_estimator, actor.kl_loss_coef, trainer.total_epochs, …) live in their natural locations — verl-native keys come from ppo_trainer.yaml, rllm-namespace keys from base.yaml — and are reconciled at startup by sync_config.

Bidirectional config sync

For a fixed table of “shared keys” (the same value, different paths in the two namespaces), sync_config mirrors the value between the verl-native side and the rllm.* side at trainer startup. New configs should use the rllm.* path. Existing Verl-style CLI overrides still work for backward compatibility, but they log a deprecation warning. Per-key precedence:
  1. rllm.* value explicitly set on the Hydra CLI
  2. Verl-native value explicitly set on the Hydra CLI
  3. rllm.* value from base.yaml (when non-null)
  4. Verl-native value from ppo_trainer.yaml
Verl-native shared-key CLI overrides are deprecated. If you set a shared key on the Verl-native side, rLLM will still sync it for now and log a warning. If both sides set conflicting values, the rllm.* value wins and rLLM logs a conflict warning. Passing an extra yaml/config group to override shared keys is not a supported migration path; pass shared values as individual Hydra overrides instead.
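The per-key precedence and the two warnings described above can be sketched as a small resolver. It uses `None` to mean "not set", which is a simplification (an explicit `null` on the CLI cannot be represented this way), and the function name and warning strings are hypothetical rather than taken from `sync_config`.

```python
def resolve_shared_key(rllm_cli=None, verl_cli=None, rllm_base=None, verl_default=None):
    """Pick the effective value for one shared key and collect warnings."""
    warnings = []
    if verl_cli is not None:
        warnings.append("deprecated: Verl-native shared-key CLI override")
    if rllm_cli is not None and verl_cli is not None and rllm_cli != verl_cli:
        warnings.append("conflict: rllm.* value wins")
    # Precedence: rllm.* CLI, Verl-native CLI, rllm.* base.yaml (non-null), ppo_trainer.yaml.
    for candidate in (rllm_cli, verl_cli, rllm_base):
        if candidate is not None:
            return candidate, warnings
    return verl_default, warnings


# Example values only; they are not meant to reflect the real defaults.
print(resolve_shared_key(rllm_cli="rloo", verl_cli="reinforce", rllm_base="grpo", verl_default="gae"))
print(resolve_shared_key(verl_cli="rloo", rllm_base="grpo", verl_default="gae"))
print(resolve_shared_key(rllm_base=None, verl_default="gae"))
```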
The shared-keys table:
| Verl-native path | rllm.* path |
| --- | --- |
| algorithm.adv_estimator | rllm.algorithm.adv_estimator |
| algorithm.norm_adv_by_std_in_grpo | rllm.algorithm.norm_adv_by_std_in_grpo |
| algorithm.rollout_correction.bypass_mode | rllm.algorithm.rollout_correction.bypass_mode |
| algorithm.rollout_correction.rollout_is | rllm.algorithm.rollout_correction.tis_mode |
| algorithm.rollout_correction.rollout_is_threshold | rllm.algorithm.rollout_correction.tis_cap |
| actor_rollout_ref.actor.kl_loss_coef | rllm.algorithm.kl_beta |
| actor_rollout_ref.actor.policy_loss.loss_mode | rllm.algorithm.loss_fn |
| actor_rollout_ref.actor.loss_agg_mode | rllm.algorithm.loss_agg_mode |
| actor_rollout_ref.actor.clip_ratio_high | rllm.algorithm.eps_clip_high |
| actor_rollout_ref.rollout.n | rllm.rollout.n |
| actor_rollout_ref.rollout.val_kwargs.n | rllm.rollout.n_val |
| trainer.total_epochs | rllm.trainer.total_epochs |
| trainer.total_training_steps | rllm.trainer.total_batches |
| trainer.logger | rllm.trainer.logger |
| trainer.project_name | rllm.trainer.project_name |
| trainer.experiment_name | rllm.trainer.experiment_name |
| trainer.test_freq | rllm.trainer.test_freq |
| trainer.save_freq | rllm.trainer.save_freq |
| trainer.val_before_train | rllm.trainer.val_before_train |
| trainer.val_only | rllm.trainer.val_only |
Two extra rules sit alongside this table:
  • actor.use_kl_loss is derived from kl_beta: if you do not explicitly set actor_rollout_ref.actor.use_kl_loss on the CLI, sync_config sets it to (kl_beta > 0). Setting rllm.algorithm.kl_beta also mirrors into actor_rollout_ref.actor.kl_loss_coef. Setting the Verl-native coefficient still backfills rllm.algorithm.kl_beta for now, with a deprecation warning.
  • clip_ratio family. rllm.algorithm.eps_clip mirrors to actor_rollout_ref.actor.clip_ratio and actor_rollout_ref.actor.clip_ratio_low. If rllm.algorithm.eps_clip_high is set, it mirrors to actor_rollout_ref.actor.clip_ratio_high; otherwise the upper bound mirrors eps_clip. Verl-native clip_ratio, clip_ratio_low, and clip_ratio_high still backfill the rLLM values for now when the rLLM side is not set, with deprecation warnings.
The Verl backend config also extends the Hydra search path so that /ppo_trainer resolves:
rllm:
  hydra.searchpath:
    - pkg://verl.trainer.config
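The two extra sync rules (KL derivation and the clip_ratio family) can be sketched in the rllm-to-Verl direction as follows. This is a minimal illustration with a hypothetical function name; it omits the legacy backfill from the Verl-native side and is not the actual `sync_config` code.

```python
def mirror_kl_and_clip(kl_beta: float, eps_clip: float, eps_clip_high=None, use_kl_loss_cli=None) -> dict:
    """Derive the Verl-native actor keys from the rllm.algorithm.* values."""
    prefix = "actor_rollout_ref.actor."
    return {
        # kl_beta mirrors into the Verl-native coefficient.
        prefix + "kl_loss_coef": kl_beta,
        # use_kl_loss is derived only when it was not set explicitly on the CLI.
        prefix + "use_kl_loss": use_kl_loss_cli if use_kl_loss_cli is not None else kl_beta > 0,
        # eps_clip mirrors to both clip_ratio and clip_ratio_low.
        prefix + "clip_ratio": eps_clip,
        prefix + "clip_ratio_low": eps_clip,
        # The upper bound follows eps_clip unless eps_clip_high is set.
        prefix + "clip_ratio_high": eps_clip_high if eps_clip_high is not None else eps_clip,
    }


# Example values chosen for illustration.
for key, value in mirror_kl_and_clip(kl_beta=0.01, eps_clip=0.2, eps_clip_high=0.28).items():
    print(key, value)
```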

Config forwarding mechanism

The Verl backend uses bidirectional sync (described above): for any shared key, the preferred path is rllm.*, while legacy Verl-native shared-key CLI overrides still work with deprecation warnings. The Tinker backend does its own one-way forwarding from native group-size settings into rllm.rollout.{n,n_val} (see “Tinker backend” above).

Example: set the preferred shared knob

Set adv_estimator on the rllm.* side:
python train.py rllm/backend=verl rllm.algorithm.adv_estimator=rloo
A legacy Verl-native override still works for now, but logs a deprecation warning:
python train.py rllm/backend=verl algorithm.adv_estimator=rloo
If both are set and conflict, the rllm.* value wins.

Example: KL-in-loss

Setting rllm.algorithm.kl_beta=0.01 is enough — actor.kl_loss_coef is mirrored and actor.use_kl_loss is auto-set to True:
python train.py rllm/backend=verl rllm.algorithm.kl_beta=0.01
The legacy Verl-native equivalent still works for now, but logs a deprecation warning:
python train.py rllm/backend=verl actor_rollout_ref.actor.kl_loss_coef=0.01

Benefits

  • Backward compatibility. Existing scripts that override on the verl-native side continue to work while logging deprecation warnings; the rllm-side value is mirrored automatically.
  • Backend portability. New scripts can target the backend-agnostic rllm.* namespace and run on either Tinker or Verl with the same flags.
  • Single source of truth at runtime. No oc.select interpolation in the yaml; the merged config that goes to verl’s workers and rLLM’s trainer holds the same values on both sides.

Configuration best practices

  1. Use rLLM configs for new projects: If starting from scratch, use the rLLM backend-agnostic configs for better portability across backends.
  2. Use rLLM paths for shared knobs: Existing Verl-native shared-key overrides still work for compatibility, but new configs should use rllm.*.
  3. Check the unified config: The unified.yaml file shows how all configs are combined and is useful for debugging configuration issues.
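  4. Understand the defaults hierarchy: Backend-specific configs generally override rLLM defaults, which in turn override Hydra’s base defaults. For Verl shared keys, the per-key precedence listed under “Bidirectional config sync” applies instead.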