tinker is rLLM’s async-first training backend. It provides a unified architecture for both agent and workflow training, and is designed for flexibility and ease of use, with built-in LoRA support and seamless integration with the tinker service.

Overview

Key features of the tinker backend:
  • Async-First Design: Native async/await support throughout the training pipeline
  • Unified Architecture: Single codebase for agent and workflow training
  • Service-Based: Uses tinker service for model serving and training
  • Simplified API: Cleaner configuration and easier setup
Python Version: Requires Python >= 3.11 for the tinker backend

Installation

Install rLLM with the tinker backend:
uv pip install "rllm[tinker] @ git+https://github.com/rllm-org/rllm.git"

Dependencies

The tinker extra installs the following dependencies (from pyproject.toml):
tinker = [
    "tinker ; python_version >= '3.11'",
    "tinker-cookbook @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git#egg=tinker-cookbook ; python_version >= '3.11'",
]
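To confirm the extras installed, a quick import check (assuming the distributions expose the tinker and tinker_cookbook module names):
python -c "import tinker, tinker_cookbook"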

Basic Usage

Agent Training

Train a math agent with the tinker backend:
train_math_tinker.py
import hydra
from omegaconf import DictConfig

from examples.math_tinker.math_agent_with_fewshot import MathAgentWithFewshot
from examples.math_tinker.math_reward import math_reward_fn
from rllm.data.dataset import DatasetRegistry
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.trainer import AgentTrainer

@hydra.main(
    version_base=None,
    config_path="../../rllm/trainer/config",
    config_name="tinker_rl_trainer"
)
def main(config: DictConfig):
    # Load datasets
    train_dataset = DatasetRegistry.load_dataset("gsm8k", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")

    # Create trainer with tinker backend
    trainer = AgentTrainer(
        config=config,
        agent_class=MathAgentWithFewshot,
        env_class=SingleTurnEnvironment,
        agent_args={"use_fewshot": True},
        env_args={"reward_fn": math_reward_fn},
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        backend="tinker",  # Specify tinker backend
    )

    # Train
    trainer.train()

if __name__ == "__main__":
    main()
Run with:
python train_math_tinker.py \
  model.name=Qwen/Qwen2.5-Math-7B-Instruct \
  data.train_batch_size=16 \
  training.group_size=16

Workflow Training

The tinker backend also supports workflow-based training:
train_workflow_tinker.py
import hydra
from omegaconf import DictConfig

from examples.solver_judge_tinker.solver_judge_flow import SolverJudgeFlow
from rllm.data.dataset import DatasetRegistry
from rllm.trainer import WorkflowTrainer

@hydra.main(
    version_base=None,
    config_path="../../rllm/trainer/config",
    config_name="tinker_rl_trainer"
)
def main(config: DictConfig):
    train_dataset = DatasetRegistry.load_dataset("countdown", "train")
    test_dataset = DatasetRegistry.load_dataset("countdown", "test")

    trainer = WorkflowTrainer(
        config=config,
        workflow_class=SolverJudgeFlow,
        workflow_args={},
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        backend="tinker",
    )

    trainer.train()

if __name__ == "__main__":
    main()
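Run it the same way as the agent example, overriding any documented option inline (the values below are illustrative):
python train_workflow_tinker.py \
  data.train_batch_size=64 \
  workflow.n_parallel_tasks=256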

Configuration

The tinker backend is configured via tinker_rl_trainer.yaml:

Model Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| model.name | string | "Qwen/Qwen3-8B" | Model path (HuggingFace or local) |
| model.lora_rank | integer | 32 | LoRA rank (parameter-efficient fine-tuning) |
| model.train_unembed | boolean | true | Train LoRA on the output embedding layer |
| model.train_attn | boolean | true | Train LoRA on attention layers |
| model.train_mlp | boolean | true | Train LoRA on MLP layers |

Training Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| training.group_size | integer | 16 | Number of rollouts per prompt (for GRPO) |
| training.val_group_size | integer | 1 | Number of rollouts per validation prompt |
| training.learning_rate | float | 2e-5 | Learning rate for the optimizer |
| training.max_length | integer | 32768 | Maximum sequence length (prompt + response) |
| training.num_minibatches | integer | 1 | Minibatches per update (currently only 1 is fully tested) |

Algorithm Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| algorithm.adv_estimator | string | "grpo" | Advantage estimator: "grpo", "reinforce", or "distill" |
| algorithm.gamma | float | 1.0 | Discount factor for rewards |
| algorithm.grouping_level | string | "trajectory" | Grouping level: "trajectory" or "step" |
| algorithm.norm_adv_by_std_in_grpo | boolean | false | Normalize advantages by standard deviation in GRPO |
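For example, to switch from trajectory-level GRPO to step-level REINFORCE (both values documented above):
python train_agent.py \
  algorithm.adv_estimator=reinforce \
  algorithm.grouping_level=step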

Data Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| data.train_batch_size | integer | 64 | Training batch size |
| data.val_batch_size | integer | 32 | Validation batch size |
| data.max_prompt_length | integer | 2048 | Maximum prompt length in tokens |
| data.max_response_length | integer | 2048 | Maximum response length in tokens |

Trainer Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| trainer.total_epochs | integer | 10 | Number of training epochs |
| trainer.test_freq | integer | 5 | Validation frequency (in steps) |
| trainer.save_freq | integer | 20 | Checkpoint save frequency (in steps) |
| trainer.default_local_dir | string | "/tmp/rllm-tinker-checkpoints" | Checkpoint directory |

LoRA Training

The tinker backend has native LoRA support:
# LoRA is enabled by default with rank=32
trainer = AgentTrainer(
    config=config,
    agent_class=MathAgent,
    env_class=SingleTurnEnvironment,
    backend="tinker",
    # ... other args
)
Configure LoRA parameters:
python train_agent.py \
  model.lora_rank=64 \
  model.train_attn=true \
  model.train_mlp=true \
  model.train_unembed=true
Set model.train_unembed=false for Fireworks AI compatibility when deploying LoRA adapters.
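For intuition on what model.lora_rank controls: a LoRA adapter on a weight matrix of shape (d_out, d_in) trains rank * (d_in + d_out) parameters instead of d_out * d_in. A quick sketch of this generic LoRA arithmetic (illustrative only, not rLLM internals):
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    # LoRA factorizes the weight update as B @ A,
    # with B of shape (d_out, rank) and A of shape (rank, d_in).
    return rank * (d_in + d_out)

# A hypothetical 4096x4096 projection at the default rank of 32:
print(lora_param_count(4096, 4096, 32))  # 262144 trainable parameters
print(4096 * 4096)                       # 16777216 for full fine-tuning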

Tinker Service

Local Service

By default, the tinker backend uses a local service:
tinker_base_url: null  # null means local

Remote Service

Connect to a remote tinker service:
python train_agent.py \
  tinker_base_url=http://remote-server:8080

Sampling Configuration

Configure sampling parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| sampling.temperature | float | 1.0 | Sampling temperature |
| sampling.top_p | float | 1.0 | Top-p (nucleus) sampling parameter |
Important: Setting temperature or top_p away from 1.0 is not recommended by tinker and can cause mysterious issues with logprobs. See tinker-cookbook#86 for discussion.

Rollout Engine Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| rollout_engine.reasoning_effort | string | "medium" | Reasoning effort level: "low", "medium", or "high" |
| rollout_engine.accumulate_reasoning | boolean | false | Accumulate reasoning tokens across steps |
| rollout_engine.disable_thinking | boolean | false | Disable thinking tokens in responses |
| rollout_engine.bypass_render_with_parser | boolean | false | Bypass the renderer and use the parser directly |
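These accept the same Hydra override syntax as other options, e.g.:
python train_agent.py \
  rollout_engine.reasoning_effort=high \
  rollout_engine.disable_thinking=true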

Checkpointing

The tinker backend provides flexible checkpointing:

Automatic Checkpointing

trainer:
  save_freq: 20  # Save every 20 steps
  default_local_dir: /tmp/rllm-tinker-checkpoints

Resume from Checkpoint

Resume from a tinker checkpoint:
python train_agent.py \
  trainer.resume_from_tinker_id=tinker://uuid/weights/000060

Manual Checkpoint Loading

python train_agent.py \
  trainer.default_local_dir=/path/to/checkpoint/dir

Distillation Support

The tinker backend supports knowledge distillation from teacher models:
algorithm:
  adv_estimator: distill
  shared_tokenizer: false
  teacher_rollout_args:
    backend: tinker  # or openai
    model: "Qwen/Qwen3-32B"
    base_url: "http://localhost:8000/v1"
    api_key: "EMPTY"
    max_prompt_length: 32768
Run distillation training:
python train_agent.py \
  algorithm.adv_estimator=distill \
  algorithm.teacher_rollout_args.model=Qwen/Qwen3-32B
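The teacher can also be served by any OpenAI-compatible endpoint via backend: openai. A sketch, where the model name, URL, and key are placeholders rather than defaults:
algorithm:
  adv_estimator: distill
  teacher_rollout_args:
    backend: openai
    model: "your-teacher-model"           # placeholder
    base_url: "https://your-endpoint/v1"  # placeholder
    api_key: "YOUR_API_KEY"               # placeholder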

Advanced Features

Fused Forward-Backward and Optimizer Step

For better performance, tinker can fuse forward-backward pass with optimizer step:
fuse_forward_backward_and_optim_step: true
This optimization reduces overhead by combining gradient computation and parameter updates into a single operation.

Multi-Step Agents

For multi-turn agent interactions:
agent:
  max_steps: 20  # Allow up to 20 turns

Workflow Parallel Tasks

Control parallelism in workflow execution:
workflow:
  n_parallel_tasks: 256  # Run up to 256 tasks in parallel
  retry_limit: 3  # Retry failed tasks up to 3 times

Monitoring

Configure logging backends:
trainer:
  logger: ['console', 'wandb', 'tensorboard']
  project_name: 'rllm-tinker'
  experiment_name: 'math-agent-v1'

Example Configuration

Complete configuration for MATH dataset training:
config.yaml
# Model
model:
  name: "Qwen/Qwen3-8B"
  lora_rank: 32
  train_unembed: true
  train_attn: true
  train_mlp: true

# Training
training:
  group_size: 16
  val_group_size: 1
  learning_rate: 2e-5
  max_length: 32768

# Sampling
sampling:
  temperature: 1.0
  top_p: 1.0

# Algorithm
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 0.95
  norm_adv_by_std_in_grpo: false
  grouping_level: 'trajectory'

# Data
data:
  train_batch_size: 64
  val_batch_size: 32
  max_prompt_length: 2048
  max_response_length: 2048

# Trainer
trainer:
  total_epochs: 10
  test_freq: 5
  save_freq: 20
  logger: ['console', 'wandb']
  project_name: 'math-rl'
  experiment_name: 'qwen3-8b-gsm8k'
  default_local_dir: '/tmp/rllm-tinker-checkpoints'

# Agent
agent:
  max_steps: 1  # Single-turn
  agent_args: {}

# Environment
env:
  env_args: {}

# Rollout Engine
rollout_engine:
  reasoning_effort: "medium"
  accumulate_reasoning: false
  disable_thinking: false

Performance Optimization

  • Increase Batch Size: Tune data.train_batch_size and training.group_size for better GPU utilization.
  • Use LoRA: Keep LoRA enabled (the default) for faster training and lower memory usage.
  • Fuse Operations: Set fuse_forward_backward_and_optim_step=true for reduced overhead.
  • Parallel Workflows: Increase workflow.n_parallel_tasks for workflow-based training.
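Putting these together, an illustrative agent-training run using only options documented above (the batch size value is arbitrary):
python train_agent.py \
  data.train_batch_size=128 \
  training.group_size=16 \
  model.lora_rank=32 \
  fuse_forward_backward_and_optim_step=true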

Troubleshooting

Python Version Errors

tinker requires Python >= 3.11. Upgrade your Python version:
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .[tinker]
Sampling Warnings

If you see warnings about temperature or top_p, keep both at 1.0:
sampling:
  temperature: 1.0  # Keep at 1.0
  top_p: 1.0        # Keep at 1.0
Setting these away from 1.0 can cause logprob issues.
Minibatch Limitation

Currently only num_minibatches=1 is fully tested:
training:
  num_minibatches: 1  # Don't change this
Checkpoint Directory Errors

Ensure the checkpoint directory exists before training:
mkdir -p /tmp/rllm-tinker-checkpoints
python train_agent.py trainer.default_local_dir=/tmp/rllm-tinker-checkpoints
Remote Service Connection Issues

If you're using a remote service, verify the URL is reachable:
curl http://remote-server:8080/health
python train_agent.py tinker_base_url=http://remote-server:8080

Comparison with verl

Key differences from the verl backend:

| Feature | tinker | verl |
|---|---|---|
| Python Version | >= 3.11 | >= 3.10 |
| Architecture | Async-first | Ray-based |
| LoRA Support | Native | Via config |
| VLM Support | Limited | Full (Qwen2-VL, Qwen3-VL) |
| Distributed Training | Limited | Multi-node Ray |
| Configuration | Simpler | More complex |
| Service Model | tinker service | vLLM/SGLang |
See Backend Comparison for a detailed feature comparison.

See Also