rLLM supports multiple reinforcement learning algorithms optimized for training language agents. This guide explains the algorithms, their advantages, and how to configure them.

Overview

RL algorithms in rLLM differ primarily in how they compute advantages - the signals that guide policy optimization. The advantage function $A(s, a)$ estimates how much better an action $a$ is compared to the average action in state $s$.

Supported Algorithms

  • GRPO (Group Relative Policy Optimization): Efficient algorithm using group-based baselines
  • PPO (Proximal Policy Optimization): Industry-standard policy gradient method
  • ReMax: Reward maximization with KL regularization
  • Custom: Define your own advantage computation

GRPO (Group Relative Policy Optimization)

GRPO is rLLM’s recommended algorithm for most use cases. It’s efficient, stable, and doesn’t require a value network.

How GRPO Works

GRPO groups multiple rollouts for the same task and computes advantages relative to the group:
```python
# For each task, generate N rollouts
trajectories = [
    rollout(task, policy) for _ in range(N)
]

# Compute baseline from the group
baseline = mean([traj.reward for traj in trajectories])
# Or: baseline = max([traj.reward for traj in trajectories])

# Compute advantages
for traj in trajectories:
    advantage = traj.reward - baseline
    # Apply to each step in trajectory
```

Mathematical Formulation

For a task $x$, we generate $N$ trajectories $\{y_1, \ldots, y_N\}$ and compute:

$$A_i = R(x, y_i) - b(\{R(x, y_j)\}_{j=1}^N)$$

Where:
  • $R(x, y_i)$ is the reward for trajectory $y_i$
  • $b(\cdot)$ is the baseline function (mean or max)
  • $A_i$ is the advantage for trajectory $y_i$
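The formula above can be checked with a few lines of plain Python. This is an illustrative sketch of the group-relative advantage computation, not rLLM's internal implementation; the rewards are made up.

```python
def grpo_advantages(rewards, baseline="mean"):
    """Compute A_i = R_i - b({R_j}) for one group of N rollouts."""
    if baseline == "mean":
        b = sum(rewards) / len(rewards)
    else:  # "max"
        b = max(rewards)
    return [r - b for r in rewards]

# 4 rollouts for one task with binary (pass/fail) rewards
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards, "mean"))  # baseline 0.5 -> [0.5, -0.5, 0.5, -0.5]
print(grpo_advantages(rewards, "max"))   # baseline 1.0 -> [0.0, -1.0, 0.0, -1.0]
```

Note that with a mean baseline, successful rollouts in a group get positive advantages and failed ones get negative advantages, which is what makes sparse binary rewards usable.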

Configuration

```yaml
algorithm:
  advantage:
    kl_ctrl:
      type: "grpo"              # Use GRPO
      coeff: 0.05               # KL coefficient

  # GRPO-specific settings
  grpo:
    baseline: "mean"            # or "max"
```

Advantages

No Value Network: GRPO doesn’t require training a value function, reducing complexity and compute.
Stable: Group-based baselines provide stable advantage estimates even with sparse rewards.
Efficient: Lower memory footprint than PPO (no value network state).

When to Use GRPO

GRPO works well when:
  • You can generate multiple rollouts per task (typical: 4-8)
  • Rewards are sparse or noisy
  • You want simpler training (no value network)
  • You’re working with limited compute resources

Example Configuration

```python
@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config):
    # Override to use GRPO
    config.algorithm.advantage.kl_ctrl.type = "grpo"
    config.algorithm.grpo.baseline = "mean"
    config.actor_rollout_ref.rollout.n = 8  # 8 rollouts per task

    trainer = AgentTrainer(
        agent_class=MyAgent,
        env_class=MyEnv,
        config=config,
        ...
    )
    trainer.train()
```

PPO (Proximal Policy Optimization)

PPO is the industry-standard policy gradient algorithm. It uses a learned value function to estimate advantages.

How PPO Works

PPO trains two networks:
  1. Policy network $\pi_\theta(a|s)$: The agent's policy
  2. Value network $V_\phi(s)$: Estimates expected return from state $s$

Advantages are computed using Generalized Advantage Estimation (GAE):

$$A_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

Where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
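For a finite episode, the GAE sum collapses into a backward recursion over TD errors. The sketch below is an illustrative implementation assuming the episode terminates (so $V(s_T) = 0$), not rLLM's internal code.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE via backward recursion: A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # V(s_T) = 0 at episode termination
    next_adv = 0.0     # A_T = 0
    for t in reversed(range(len(rewards))):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Sanity check: with gamma = lam = 1 and zero values, advantages
# reduce to rewards-to-go.
print(gae_advantages([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0))  # [2.0, 1.0]
```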

Mathematical Formulation

The PPO objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right) \right]$$

Where:
  • $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio
  • $\epsilon$ is the clip ratio (typically 0.2)
  • $A_t$ is the advantage estimate
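The clipped surrogate for a single token can be sketched as below. This is an illustrative sketch, not rLLM's loss code; the log-probability inputs and the sign convention (returning a loss to minimize) are assumptions for the example.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Per-token clipped surrogate loss: -min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)           # r_t(theta)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(r, 1-eps, 1+eps)
    return -min(ratio * advantage, clipped * advantage)

# Unchanged policy: ratio = 1, loss is just -A
print(ppo_clip_loss(0.0, 0.0, 1.0))  # -1.0
```

With a positive advantage, once the ratio exceeds $1+\epsilon$ the clipped term wins the `min`, so the gradient incentive to push the ratio further vanishes; that is what "clipping prevents large policy updates" means concretely.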

Configuration

```yaml
algorithm:
  advantage:
    kl_ctrl:
      type: "ppo"               # Use PPO
      coeff: 0.05               # KL coefficient

  # PPO-specific settings
  clip_ratio: 0.2               # Clip parameter ε
  ppo_epochs: 4                 # Number of PPO epochs per update
  ppo_mini_batch_size: 256      # Mini-batch size
  gamma: 0.99                   # Discount factor
  gae_lambda: 0.95              # GAE lambda parameter
  value_loss_coeff: 0.5         # Value loss coefficient
  entropy_coeff: 0.01           # Entropy bonus coefficient
```

Advantages

Sample Efficient: Value network provides better advantage estimates, improving sample efficiency.
Dense Feedback: Works well with dense, intermediate rewards.
Stable: Clipping prevents large policy updates.

When to Use PPO

PPO works well when:
  • You have dense, intermediate rewards
  • You can’t generate many rollouts per task
  • Sample efficiency is critical
  • You have compute for training a value network

Example Configuration

```python
@hydra.main(config_path="pkg://rllm.trainer.config", config_name="ppo_trainer", version_base=None)
def main(config):
    # PPO is the default in ppo_trainer.yaml
    config.algorithm.advantage.kl_ctrl.type = "ppo"
    config.algorithm.clip_ratio = 0.2
    config.algorithm.ppo_epochs = 4
    config.algorithm.gamma = 0.99
    config.algorithm.gae_lambda = 0.95

    trainer = AgentTrainer(
        agent_class=MyAgent,
        env_class=MyEnv,
        config=config,
        ...
    )
    trainer.train()
```

ReMax (Reward Maximization)

ReMax is a simpler algorithm that directly maximizes rewards with KL regularization.

How ReMax Works

ReMax uses the trajectory reward directly as the advantage:

$$A_i = R(\tau_i)$$

with KL regularization to prevent the policy from diverging too far from the reference policy:

$$L(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) - \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right]$$
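For intuition, the regularized objective can be estimated over a batch of sampled trajectories. This is an illustrative sketch under the assumption that per-trajectory rewards and KL divergences are already available as scalars; it is not rLLM's implementation.

```python
def remax_objective(rewards, kl_divs, beta=0.1):
    """Monte Carlo estimate of E[R(tau) - beta * KL] over sampled trajectories."""
    terms = [r - beta * kl for r, kl in zip(rewards, kl_divs)]
    return sum(terms) / len(terms)

# Two trajectories: one solved (reward 1), one failed (reward 0),
# each with KL 0.5 from the reference policy
print(remax_objective([1.0, 0.0], [0.5, 0.5], beta=0.1))
```

Raising `beta` penalizes divergence from the reference policy more heavily, which trades reward maximization for stability.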

Configuration

```yaml
algorithm:
  advantage:
    kl_ctrl:
      type: "remax"             # Use ReMax
      coeff: 0.1                # KL coefficient β
```

When to Use ReMax

ReMax works well when:
  • You have very sparse rewards (only final outcome matters)
  • Task is simple (limited need for step-level credit assignment)
  • You want the simplest possible algorithm

Stepwise vs. Trajectory-Level Advantages

rLLM supports two modes for applying advantages:

Trajectory-Level (Default)

```yaml
rllm:
  stepwise_advantage:
    enable: false               # Trajectory-level
```

Advantages are computed at the trajectory level and applied uniformly:

```python
# Compute advantage for entire trajectory
advantage = trajectory.reward - baseline

# Apply to all tokens in the trajectory
for token in trajectory.response_tokens:
    token.advantage = advantage
```
Use when: Final outcome is what matters (e.g., math problem solved correctly)

Step-Level

```yaml
rllm:
  stepwise_advantage:
    enable: true                # Step-level
```

Advantages are computed separately for each step:

```python
# Compute advantage for each step
for step in trajectory.steps:
    step.advantage = step.reward - baseline
```
Use when: Intermediate rewards are meaningful (e.g., progressive task completion)
Step-level advantages require that max_prompt_length and max_response_length are enforced per-step, not per-trajectory. Set enforce_max_prompt_length: true in the execution engine.
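The difference between the two modes can be made concrete with a minimal runnable sketch. The list-based representation of step rewards here is a simplified stand-in for rLLM's trajectory types, purely for illustration.

```python
def trajectory_level(step_rewards, baseline):
    """One advantage from the total reward, broadcast to every step."""
    advantage = sum(step_rewards) - baseline
    return [advantage] * len(step_rewards)

def step_level(step_rewards, baseline):
    """A separate advantage per step, from that step's own reward."""
    return [r - baseline for r in step_rewards]

rewards = [1.0, 0.5]  # two steps with intermediate rewards
print(trajectory_level(rewards, baseline=1.0))  # [0.5, 0.5]
print(step_level(rewards, baseline=0.5))        # [0.5, 0.0]
```

Trajectory-level assignment credits every step equally for the final outcome; step-level assignment lets a productive step carry a positive advantage even inside an overall mediocre trajectory.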

Compact Filtering

rLLM can filter out invalid trajectories based on termination reason:
```yaml
rllm:
  compact_filtering:
    enable: true
    mask_max_prompt_length_exceeded: true
    mask_max_response_length_exceeded: true
    mask_env_done: false
    mask_max_turns_exceeded: true
    mask_timeout: true
    mask_error: true
    mask_unknown: true
```
Filtered trajectories have their response_mask set to 0, excluding them from training.
Compact filtering is useful for removing low-quality trajectories that hit limits or errors, improving training efficiency.
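The masking step can be pictured as follows. This is a hypothetical sketch: the `termination_reason` and `response_mask` dictionary fields are illustrative names, not rLLM's actual data model.

```python
# Termination reasons whose trajectories get masked out, mirroring the
# mask_* flags in the config above (mask_env_done is false, so "env_done"
# trajectories are kept).
MASKED_REASONS = {
    "max_prompt_length_exceeded",
    "max_response_length_exceeded",
    "max_turns_exceeded",
    "timeout",
    "error",
    "unknown",
}

def apply_compact_filtering(trajectories):
    """Zero the response mask of trajectories that hit limits or errors."""
    for traj in trajectories:
        if traj["termination_reason"] in MASKED_REASONS:
            traj["response_mask"] = [0] * len(traj["response_mask"])
    return trajectories
```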

Algorithm Comparison

| Feature | GRPO | PPO | ReMax |
|---|---|---|---|
| Value Network | No | Yes | No |
| Sample Efficiency | Medium | High | Low |
| Compute Cost | Low | High | Low |
| Stability | High | Medium | Medium |
| Best For | Sparse rewards, multi-rollout | Dense rewards, sample efficiency | Very sparse rewards, simplicity |
| Rollouts per Task | 4-8 | 1-2 | 1-4 |
| Hyperparameter Sensitivity | Low | Medium | Low |

Advanced: Custom Advantage Computation

You can implement custom advantage computation:
```python
from collections import defaultdict

import numpy as np

from rllm.trainer.verl.agent_ppo_trainer import AgentPPOTrainer

class CustomAdvantageTrainer(AgentPPOTrainer):
    def compute_advantages(self, rollout_data):
        """Custom advantage computation."""

        # Group trajectories by task
        task_groups = defaultdict(list)
        for traj in rollout_data:
            task_id = traj.task_id
            task_groups[task_id].append(traj)

        # Compute custom advantages
        for task_id, trajectories in task_groups.items():
            # Your custom logic here
            rewards = [traj.reward for traj in trajectories]

            # Example: Use median as baseline
            baseline = np.median(rewards)

            for traj in trajectories:
                advantage = traj.reward - baseline

                # Apply advantage to all steps
                for step in traj.steps:
                    step.advantage = advantage

        return rollout_data
```

Hyperparameter Tuning Guide

GRPO Tuning

1. Rollouts per task: Start with 4-8 rollouts. More rollouts give better baseline estimates but higher compute cost.
2. Baseline function: Try both mean and max. Mean is more stable; max can help with very sparse rewards.
3. KL coefficient: Start with 0.05. Increase if the policy changes too fast; decrease if learning is too slow.

PPO Tuning

1. Clip ratio: The default 0.2 works well. Increase for more aggressive updates; decrease for more conservative ones.
2. GAE lambda: Start with 0.95. Closer to 1.0 means lower bias but higher variance.
3. PPO epochs: Start with 4. Too many epochs can overfit to the current batch; too few can underfit.
4. Mini-batch size: Balance compute efficiency against gradient variance. Typical: 64-256.

General Tips

Learning Rate: Start with 1e-6 for language models. rLLM uses small LRs to avoid catastrophic forgetting.
Batch Size: Larger batches = more stable gradients but slower iteration. Typical: 128-512.
KL Coefficient: Controls how much the policy can deviate from reference. Typical: 0.01-0.1.
RL is sensitive to hyperparameters. Always start with defaults and change one thing at a time.

Monitoring Algorithm Performance

Key metrics to watch during training:

Reward Metrics

  • mean_reward: Average trajectory reward (should increase)
  • max_reward: Best trajectory reward (indicates ceiling)
  • reward_std: Reward variance (high variance = unstable)

Policy Metrics

  • kl_divergence: KL from reference policy (should stay small)
  • policy_loss: Policy gradient loss (should decrease)
  • entropy: Policy entropy (too low = policy collapse)

Training Metrics

  • explained_variance (PPO only): How well value network predicts returns (should be >0.5)
  • value_loss (PPO only): Value network loss (should decrease)
  • gradient_norm: Gradient magnitude (too high = unstable)
If kl_divergence grows too large, increase kl_ctrl.coeff. If entropy drops to near 0, add entropy_coeff or reduce temperature.

Next Steps