The tinker backend is rLLM's async-first training backend, providing a unified architecture for both agent and workflow training. It is designed for flexibility and ease of use, with built-in LoRA support and seamless integration with the tinker service.
## Overview
The tinker backend features:

- **Async-First Design**: Native async/await support throughout the training pipeline
- **Unified Architecture**: Single codebase for agent and workflow training
- **Service-Based**: Uses the tinker service for model serving and training
- **Simplified API**: Cleaner configuration and easier setup

> **Python Version:** The tinker backend requires Python >= 3.11.
## Installation
Install rLLM with the tinker backend:
### Direct Installation
```bash
uv pip install "rllm[tinker] @ git+https://github.com/rllm-org/rllm.git"
```
### Dependencies
The `tinker` extra includes the following dependencies (from `pyproject.toml`):
```toml
tinker = [
    "tinker ; python_version >= '3.11'",
    "tinker-cookbook @ git+https://github.com/thinking-machines-lab/tinker-cookbook.git#egg=tinker-cookbook ; python_version >= '3.11'",
]
```
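Because the extra is gated on `python_version >= '3.11'`, a quick interpreter check before installing can save a confusing dependency-resolution failure:

```bash
python -c 'import sys; assert sys.version_info >= (3, 11), sys.version'
```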
## Basic Usage
### Agent Training
Train a math agent with the tinker backend:
```python
import hydra
from omegaconf import DictConfig

from examples.math_tinker.math_agent_with_fewshot import MathAgentWithFewshot
from examples.math_tinker.math_reward import math_reward_fn
from rllm.data.dataset import DatasetRegistry
from rllm.environments.base.single_turn_env import SingleTurnEnvironment
from rllm.trainer import AgentTrainer


@hydra.main(
    version_base=None,
    config_path="../../rllm/trainer/config",
    config_name="tinker_rl_trainer",
)
def main(config: DictConfig):
    # Load datasets
    train_dataset = DatasetRegistry.load_dataset("gsm8k", "train")
    test_dataset = DatasetRegistry.load_dataset("math500", "test")

    # Create trainer with tinker backend
    trainer = AgentTrainer(
        config=config,
        agent_class=MathAgentWithFewshot,
        env_class=SingleTurnEnvironment,
        agent_args={"use_fewshot": True},
        env_args={"reward_fn": math_reward_fn},
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        backend="tinker",  # Specify tinker backend
    )

    # Train
    trainer.train()


if __name__ == "__main__":
    main()
```
Run with:
```bash
python train_math_tinker.py \
    model.name=Qwen/Qwen2.5-Math-7B-Instruct \
    data.train_batch_size=16 \
    training.group_size=16
```
### Workflow Training
The tinker backend also supports workflow-based training:
```python
import hydra
from omegaconf import DictConfig

from examples.solver_judge_tinker.solver_judge_flow import SolverJudgeFlow
from rllm.data.dataset import DatasetRegistry
from rllm.trainer import WorkflowTrainer


@hydra.main(
    version_base=None,
    config_path="../../rllm/trainer/config",
    config_name="tinker_rl_trainer",
)
def main(config: DictConfig):
    train_dataset = DatasetRegistry.load_dataset("countdown", "train")
    test_dataset = DatasetRegistry.load_dataset("countdown", "test")

    trainer = WorkflowTrainer(
        config=config,
        workflow_class=SolverJudgeFlow,
        workflow_args={},
        train_dataset=train_dataset,
        val_dataset=test_dataset,
        backend="tinker",
    )
    trainer.train()


if __name__ == "__main__":
    main()
```
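Launch it the same way as the agent example; the script filename below is only an assumed placeholder:

```bash
# Assuming the script above is saved as train_workflow_tinker.py
python train_workflow_tinker.py \
    model.name=Qwen/Qwen3-8B \
    workflow.n_parallel_tasks=64
```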
## Configuration
The tinker backend uses the `tinker_rl_trainer.yaml` configuration:
### Model Configuration

| Parameter | Description |
| --- | --- |
| `model.name` | Model path (HuggingFace or local). String, default `"Qwen/Qwen3-8B"`. |
| `model.lora_rank` | LoRA rank (parameter-efficient fine-tuning). |
| `model.train_unembed` | Train LoRA on the output embedding layer. |
| `model.train_attn` | Train LoRA on the attention layers. |
### Training Configuration

| Parameter | Description |
| --- | --- |
| `training.group_size` | Number of rollouts per prompt (for GRPO). |
| `training.val_group_size` | Number of rollouts per validation prompt. |
| `training.learning_rate` | Learning rate for the optimizer. |
| `training.max_length` | Maximum sequence length (prompt + response). |
| `training.num_minibatches` | Number of minibatches per update (currently only 1 is fully tested). |
### Algorithm Configuration

| Parameter | Description |
| --- | --- |
| `algorithm.adv_estimator` | Advantage estimator: `"grpo"`, `"reinforce"`, or `"distill"`. |
| `algorithm.gamma` | Discount factor for rewards. |
| `algorithm.grouping_level` | Grouping level: `"trajectory"` or `"step"`. String, default `"trajectory"`. |
| `algorithm.norm_adv_by_std_in_grpo` | Normalize advantages by standard deviation in GRPO. |
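To make `norm_adv_by_std_in_grpo` concrete, here is a minimal illustrative sketch of group-relative advantage computation in the GRPO style. With `grouping_level: trajectory`, the group is the set of `training.group_size` rollouts for one prompt. This is the standard formula, not rLLM's internal implementation:

```python
import statistics

def grpo_advantages(rewards: list[float], norm_by_std: bool = False) -> list[float]:
    """Group-relative advantages for one group of rollouts.

    Each rollout's advantage is its reward minus the group mean;
    with norm_by_std=True it is also divided by the group std,
    mirroring algorithm.norm_adv_by_std_in_grpo.
    """
    mean = statistics.mean(rewards)
    advantages = [r - mean for r in rewards]
    if norm_by_std:
        std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
        if std > 1e-8:  # guard against a zero-variance group
            advantages = [a / std for a in advantages]
    return advantages

# One prompt sampled with training.group_size = 4:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```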
### Data Configuration

| Parameter | Description |
| --- | --- |
| `data.max_prompt_length` | Maximum prompt length in tokens. |
| `data.max_response_length` | Maximum response length in tokens. |
### Trainer Configuration

| Parameter | Description |
| --- | --- |
| `trainer.total_epochs` | Number of training epochs. |
| `trainer.test_freq` | Validation frequency (in steps). |
| `trainer.save_freq` | Checkpoint save frequency (in steps). |
| `trainer.default_local_dir` | Checkpoint directory. String, default `"/tmp/rllm-tinker-checkpoints"`. |
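All of the keys above can be overridden from the command line via Hydra, using the same dotted syntax shown throughout this page:

```bash
python train_agent.py \
    model.lora_rank=64 \
    training.group_size=16 \
    algorithm.adv_estimator=grpo \
    data.max_prompt_length=2048 \
    trainer.save_freq=20
```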
## LoRA Training
The tinker backend has native LoRA support built in:
```python
# LoRA is enabled by default with rank=32
trainer = AgentTrainer(
    config=config,
    agent_class=MathAgent,
    env_class=SingleTurnEnvironment,
    backend="tinker",
    # ... other args
)
```
Configure LoRA parameters:
```bash
python train_agent.py \
    model.lora_rank=64 \
    model.train_attn=true \
    model.train_mlp=true \
    model.train_unembed=true
```
> **Note:** Set `model.train_unembed=false` for Fireworks AI compatibility when deploying LoRA adapters.
## Tinker Service
### Local Service
By default, the tinker backend uses a local service:
```yaml
tinker_base_url: null  # null means local
```
### Remote Service
Connect to a remote tinker service:
```bash
python train_agent.py \
    tinker_base_url=http://remote-server:8080
```
## Sampling Configuration
Configure sampling parameters:

| Parameter | Description |
| --- | --- |
| `sampling.temperature` | Sampling temperature. |
| `sampling.top_p` | Top-p (nucleus) sampling parameter. |
> **Important:** Setting `temperature` or `top_p` away from 1.0 is not recommended by tinker and can cause mysterious issues with logprobs. See tinker-cookbook#86 for discussion.
## Rollout Engine Configuration
| Parameter | Description |
| --- | --- |
| `rollout_engine.reasoning_effort` | Reasoning effort level: `"low"`, `"medium"`, or `"high"`. |
| `rollout_engine.accumulate_reasoning` | Accumulate reasoning tokens across steps. |
| `rollout_engine.disable_thinking` | Disable thinking tokens in responses. |
| `rollout_engine.bypass_render_with_parser` | Bypass the renderer and use the parser directly. |
## Checkpointing
The tinker backend provides flexible checkpointing:
### Automatic Checkpointing
```yaml
trainer:
  save_freq: 20  # Save every 20 steps
  default_local_dir: /tmp/rllm-tinker-checkpoints
```
### Resume from Checkpoint
Resume from a tinker checkpoint:
```bash
python train_agent.py \
    trainer.resume_from_tinker_id=tinker://uuid/weights/000060
```
### Manual Checkpoint Loading
```bash
python train_agent.py \
    trainer.default_local_dir=/path/to/checkpoint/dir
```
## Distillation Support
The tinker backend supports knowledge distillation from teacher models:
```yaml
algorithm:
  adv_estimator: distill
  shared_tokenizer: false
  teacher_rollout_args:
    backend: tinker  # or openai
    model: "Qwen/Qwen3-32B"
    base_url: "http://localhost:8000/v1"
    api_key: "EMPTY"
    max_prompt_length: 32768
```
Run distillation training:
```bash
python train_agent.py \
    algorithm.adv_estimator=distill \
    algorithm.teacher_rollout_args.model=Qwen/Qwen3-32B
```
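With `adv_estimator: distill`, the training signal comes from a teacher model rather than an environment reward. A common formulation (shown here as an illustrative sketch, not rLLM's exact implementation) uses the per-token gap between teacher and student log-probabilities as the advantage:

```python
def distill_advantages(teacher_logprobs: list[float],
                       student_logprobs: list[float]) -> list[float]:
    """Per-token advantages for on-policy distillation.

    Tokens the teacher considers more likely than the student
    get a positive advantage, pushing the student toward the
    teacher's distribution. Illustrative only.
    """
    assert len(teacher_logprobs) == len(student_logprobs)
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

# Example: the student is under-weighting the second token.
print(distill_advantages([-0.1, -0.5], [-0.2, -2.0]))  # [0.1, 1.5]
```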
## Advanced Features
### Fused Forward-Backward and Optimizer Step
For better performance, tinker can fuse the forward-backward pass with the optimizer step:
```yaml
fuse_forward_backward_and_optim_step: true
```
This optimization reduces overhead by combining gradient computation and parameter updates into a single operation.
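In terms of round trips to the tinker service, the difference looks roughly like the sketch below. The client and method names are hypothetical placeholders, not the actual tinker API:

```python
# Hypothetical client methods, for illustration only.

async def unfused_step(client, batch):
    # Two service round trips: one to compute gradients,
    # one to apply the optimizer update.
    await client.forward_backward(batch)
    await client.optim_step()

async def fused_step(client, batch):
    # One round trip: gradients and the parameter update happen
    # in a single call, as enabled by
    # fuse_forward_backward_and_optim_step=true.
    await client.forward_backward_and_optim_step(batch)
```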
### Multi-Step Agents
For multi-turn agent interactions:
```yaml
agent:
  max_steps: 20  # Allow up to 20 turns
```
### Workflow Parallel Tasks
Control parallelism in workflow execution:
```yaml
workflow:
  n_parallel_tasks: 256  # Run up to 256 tasks in parallel
  retry_limit: 3         # Retry failed tasks up to 3 times
```
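A minimal sketch of what bounded parallelism with retries looks like in async Python; this mirrors the semantics of `n_parallel_tasks` and `retry_limit` but is not the trainer's actual scheduler:

```python
import asyncio

async def run_all(tasks, n_parallel_tasks=256, retry_limit=3):
    """Run async task callables with bounded concurrency and retries."""
    semaphore = asyncio.Semaphore(n_parallel_tasks)

    async def run_one(task):
        async with semaphore:  # at most n_parallel_tasks in flight
            for attempt in range(retry_limit):
                try:
                    return await task()
                except Exception:
                    if attempt == retry_limit - 1:
                        raise  # out of retries; surface the failure

    return await asyncio.gather(*(run_one(t) for t in tasks))
```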
## Monitoring
Configure logging backends:
```yaml
trainer:
  logger: ['console', 'wandb', 'tensorboard']
  project_name: 'rllm-tinker'
  experiment_name: 'math-agent-v1'
```
## Example Configuration
Complete configuration for MATH dataset training:
```yaml
# Model
model:
  name: "Qwen/Qwen3-8B"
  lora_rank: 32
  train_unembed: true
  train_attn: true
  train_mlp: true

# Training
training:
  group_size: 16
  val_group_size: 1
  learning_rate: 2e-5
  max_length: 32768

# Sampling
sampling:
  temperature: 1.0
  top_p: 1.0

# Algorithm
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 0.95
  norm_adv_by_std_in_grpo: false
  grouping_level: 'trajectory'

# Data
data:
  train_batch_size: 64
  val_batch_size: 32
  max_prompt_length: 2048
  max_response_length: 2048

# Trainer
trainer:
  total_epochs: 10
  test_freq: 5
  save_freq: 20
  logger: ['console', 'wandb']
  project_name: 'math-rl'
  experiment_name: 'qwen3-8b-gsm8k'
  default_local_dir: '/tmp/rllm-tinker-checkpoints'

# Agent
agent:
  max_steps: 1  # Single-turn
  agent_args: {}

# Environment
env:
  env_args: {}

# Rollout Engine
rollout_engine:
  reasoning_effort: "medium"
  accumulate_reasoning: false
  disable_thinking: false
```
## Performance Tips

- **Increase batch size**: Tune `data.train_batch_size` and `training.group_size` for better GPU utilization.
- **Use LoRA**: Enable LoRA for faster training and lower memory usage.
- **Fuse operations**: Set `fuse_forward_backward_and_optim_step=true` for reduced overhead.
- **Parallel workflows**: Increase `workflow.n_parallel_tasks` for workflow-based training.
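These tips translate directly into Hydra overrides, e.g.:

```bash
python train_agent.py \
    data.train_batch_size=128 \
    training.group_size=16 \
    fuse_forward_backward_and_optim_step=true
```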
## Troubleshooting
### Python Version Error

tinker requires Python >= 3.11. Upgrade your Python version:

```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[tinker]"
```
### Sampling Parameter Warning
If you see warnings about `temperature` or `top_p`, keep both at their defaults:

```yaml
sampling:
  temperature: 1.0  # Keep at 1.0
  top_p: 1.0        # Keep at 1.0
```

Setting these away from 1.0 can cause logprob issues.
### Minibatch Configuration

Currently only `num_minibatches=1` is fully tested:

```yaml
training:
  num_minibatches: 1  # Don't change this
```
### Checkpoint Directory Issues

Ensure the checkpoint directory exists:

```bash
mkdir -p /tmp/rllm-tinker-checkpoints
python train_agent.py trainer.default_local_dir=/tmp/rllm-tinker-checkpoints
```
### Tinker Service Connection Failed
If using a remote service, verify the URL:

```bash
curl http://remote-server:8080/health
python train_agent.py tinker_base_url=http://remote-server:8080
```
## Comparison with verl
Key differences from verl backend:
| Feature | tinker | verl |
| --- | --- | --- |
| Python Version | >= 3.11 | >= 3.10 |
| Architecture | Async-first | Ray-based |
| LoRA Support | Native | Via config |
| VLM Support | Limited | Full (Qwen2-VL, Qwen3-VL) |
| Distributed Training | Limited | Multi-node Ray |
| Configuration | Simpler | More complex |
| Service Model | tinker service | vLLM/SGLang |
See Backend Comparison for detailed feature comparison.
## See Also