Overview
The VLM training example demonstrates:
- How to implement multimodal workflows that process both images and text
- How to integrate VLMs with rLLM’s training pipeline
- How to evaluate multimodal reasoning performance on mathematical tasks
- Training agents on visual geometry problem solving
Prerequisites
- rLLM framework installed
- SGLang or vLLM for vision-language model serving
- Base model: `Qwen/Qwen3-VL-2B-Instruct` (or a similar VLM)
- GPU with sufficient memory for multimodal processing
Setup
Prepare Geo3K dataset
Download and preprocess the Geometry3K dataset with the example's preprocessing script. This will:
- Download the `hiyouga/geometry3k` dataset from HuggingFace
- Process geometry problems with images and text
- Register the dataset with rLLM’s DatasetRegistry
- Save processed data for training and evaluation
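The per-record conversion in the preprocessing step can be sketched as follows. The field names (`images`, `problem`, `answer`) are assumptions about the raw HuggingFace schema, and the `DatasetRegistry` call shown in the comment is likewise an assumption about rLLM's API:

```python
# Sketch of the per-record conversion used when preprocessing geometry3k.
# Field names ("images", "problem", "answer") are assumptions about the raw
# HuggingFace schema; the DatasetRegistry call below is also an assumption.


def to_rllm_example(record: dict) -> dict:
    """Convert one raw `hiyouga/geometry3k` record into a training example."""
    return {
        "images": record["images"],        # geometry diagram(s)
        "question": record["problem"],     # natural-language question
        "ground_truth": record["answer"],  # numeric or symbolic answer
    }


# Usage (requires the `datasets` package and network access):
#   from datasets import load_dataset
#   raw = load_dataset("hiyouga/geometry3k")
#   train = [to_rllm_example(r) for r in raw["train"]]
#   # then register with rLLM's DatasetRegistry, e.g.
#   # DatasetRegistry.register_dataset("geo3k", train, split="train")
```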
Running the VLM Agent
Execute the VLM agent on geometry problems using the example run script.

Code Implementation
Geo3K Workflow Implementation
The workflow handles multimodal inputs: each geometry diagram is passed to the model alongside the question text, and the model's answer is scored against the ground truth.

Expected Results
Qwen3-VL-2B-Instruct on Geometry3K:

| Metric | Performance |
|---|---|
| Pass@1 | 35.2% |
| Pass@4 | 52.8% |
Training the VLM Agent
Train your own VLM agent using reinforcement learning with the example training script.

Training Configuration
Key hyperparameters:
- Base Model: Qwen/Qwen3-VL-2B-Instruct
- Algorithm: GRPO (Group Relative Policy Optimization)
- Training Dataset: Geometry3K train split
- Evaluation Dataset: Geometry3K test split
- Training Batch Size: 32
- Validation Batch Size: 128
- Response Length: Up to 2048 tokens
- Prompt Length: Up to 1024 tokens
- Number of GPUs: 8 (configurable)
- Training Epochs: 3
- Learning Rate: 1e-6
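GRPO, the algorithm listed above, estimates advantages without a value network by normalizing each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of that core computation (the full algorithm also includes the clipped policy-gradient loss and a KL penalty):

```python
# Sketch of GRPO's group-relative advantage computation: for each prompt,
# sample a group of responses, then normalize each response's reward by the
# group's mean and standard deviation. Core idea only; the full algorithm
# also includes the clipped policy-gradient loss and KL regularization.
from statistics import mean, stdev


def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own group."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```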
Training Script
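The training script is not reproduced here; as a sketch, the hyperparameters listed above would be passed to the launch script as overrides. The flag names below are hypothetical (the real script's options may differ); only the values come from this page:

```python
# Sketch of assembling a training launch command from the hyperparameters
# listed above. Flag names are hypothetical; only the values are from the docs.
HPARAMS = {
    "model.name": "Qwen/Qwen3-VL-2B-Instruct",
    "algorithm": "grpo",
    "data.train_batch_size": 32,
    "data.val_batch_size": 128,
    "data.max_prompt_length": 1024,
    "data.max_response_length": 2048,
    "trainer.n_gpus": 8,
    "trainer.total_epochs": 3,
    "actor.lr": 1e-6,
}


def build_command(script: str, hparams: dict) -> list[str]:
    """Render hyperparameters as key=value overrides for the launch script."""
    return ["python", script] + [f"{k}={v}" for k, v in hparams.items()]
```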
Multimodal Input Formats
Base64 Encoding (Default)
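A sketch of the base64 format, using the OpenAI-style chat message convention that vLLM- and SGLang-compatible servers generally accept (the message schema here is that convention, assumed rather than taken from rLLM's code):

```python
# Encode an image as a base64 data URL inside an OpenAI-style chat message.
# This data-URL convention is widely accepted by OpenAI-compatible servers.
import base64


def image_message(image_bytes: bytes, question: str, mime: str = "image/png") -> dict:
    """Build one user message carrying an inline (base64) image plus text."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }
```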
Image URL (Alternative)
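A sketch of the URL variant, again following the OpenAI-style message convention (an assumption about the serving layer); the server must be able to fetch the URL itself:

```python
# Alternative: reference the image by URL instead of embedding it inline.
# The serving backend fetches the image, so the URL must be reachable.
def image_url_message(url: str, question: str) -> dict:
    """Build one user message that points at a remotely hosted image."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": question},
        ],
    }
```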
Geometry3K Dataset
The dataset contains:
- Diagrams: Geometry figures (triangles, circles, etc.)
- Questions: Mathematical questions about the figures
- Answers: Numerical or symbolic answers
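A single processed example might look like the following (the field names and sample values are purely illustrative, not the dataset's actual schema):

```python
# Illustrative Geo3K-style example. Field names and values are assumptions
# for illustration only, not the dataset's actual schema.
example = {
    "image": "geometry3k/train/0001/diagram.png",      # geometry figure
    "question": "In the triangle, find the value of x.",
    "answer": "42",                                    # numeric or symbolic
}
```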
Monitoring Training
Key metrics to track:

| Metric | Description |
|---|---|
| `val/pass@1` | Test set accuracy (single attempt) |
| `val/pass@4` | Test set accuracy (best of 4) |
| `critic/score/mean` | Average reward per batch |
| `train/response_length` | Average solution length |
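The pass@k metrics are computed from repeated attempts per problem; a minimal sketch of the empirical best-of-k reading (not the unbiased estimator sometimes used when more than k attempts are sampled):

```python
# Empirical pass@k: a problem counts as solved if any of its first k
# attempts is correct. (An unbiased estimator exists when more attempts
# than k are sampled; this simple version matches the best-of-k reading.)
def pass_at_k(attempts: list[list[bool]], k: int) -> float:
    """attempts[i] holds per-try correctness flags for problem i."""
    solved = sum(1 for tries in attempts if any(tries[:k]))
    return solved / len(attempts)
```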
Supported VLM Models
rLLM supports various vision-language models:
- Qwen3-VL series (2B, 7B)
- LLaVA series
- CogVLM series
- Any model compatible with vLLM/SGLang
Next Steps
- Explore VLM training guide
- Try DeepScaleR for text-only math reasoning
- Learn about multimodal workflows

