rLLM ships with a built-in catalog of 50+ benchmark datasets spanning math, code, question answering, instruction following, search, vision-language, translation, and agentic tasks. All datasets are auto-pulled from HuggingFace on first use.
```bash
rllm dataset list --all   # See all available datasets
rllm eval gsm8k           # Auto-pulls and evaluates
```
## Math

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| gsm8k | Grade school math word problems | 8.5K train, 1.3K test | openai/gsm8k | math_reward_fn |
| math500 | MATH-500 competition math benchmark | 500 test | HuggingFaceH4/MATH-500 | math_reward_fn |
| countdown | Countdown arithmetic puzzle | 1K train, 500 test | predibase/countdown | countdown_reward_fn |
| hmmt | HMMT Feb 2025: Harvard-MIT Mathematics Tournament | train | MathArena/hmmt_feb_2025 | math_reward_fn |
| hmmt_nov | HMMT Nov 2025: Harvard-MIT Mathematics Tournament | 30 problems | MathArena/hmmt_nov_2025 | math_reward_fn |
| aime_2025 | AIME 2025: American Invitational Mathematics Examination | 30 problems | MathArena/aime_2025 | math_reward_fn |
| aime_2026 | AIME 2026: American Invitational Mathematics Examination | 30 problems | MathArena/aime_2026 | math_reward_fn |
| polymath | PolyMath: Multilingual math reasoning across 18 languages | 4 difficulty splits | Qwen/PolyMath | math_reward_fn |
## Code

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| humaneval | HumanEval: Function-level code generation | 164 problems | openai/openai_humaneval | code_reward_fn |
| mbpp | MBPP: Python programming benchmark | 974 problems | google-research-datasets/mbpp | code_reward_fn |
| livecodebench | LiveCodeBench: Contamination-free competitive programming | test | livecodebench/code_generation | code_reward_fn |
| swebench_verified | SWE-bench Verified: Real-world GitHub issues for SWE agents | 500 test | princeton-nlp/SWE-bench_Verified | swebench_reward_fn |
## Multiple choice (MCQ)

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| mmlu_pro | MMLU-Pro: Expert-level MCQ with 10 options | 12K test | TIGER-Lab/MMLU-Pro | mcq_reward_fn |
| mmlu_redux | MMLU-Redux: Curated MMLU subset with error fixes | 3K test | edinburgh-dawg/mmlu-redux | mcq_reward_fn |
| gpqa_diamond | GPQA: Expert-level graduate science QA | 448 questions | ankner/gpqa | mcq_reward_fn |
| supergpqa | SuperGPQA: Graduate-level QA across 285 disciplines | 26.5K | m-a-p/SuperGPQA | mcq_reward_fn |
| ceval | C-Eval: Chinese evaluation across 52 disciplines | 13.9K | ceval/ceval-exam | mcq_reward_fn |
| mmmlu | MMMLU: Multilingual MMLU across 14 languages | 15.9K/lang | openai/MMMLU | mcq_reward_fn |
| mmlu_prox | MMLU-ProX: Multilingual MMLU-Pro across 29 languages | 11.8K/lang | li-lab/MMLU-ProX | mcq_reward_fn |
| include | INCLUDE: Multilingual knowledge from local exams, 44 languages | test | CohereLabs/include-base-44 | mcq_reward_fn |
| global_piqa | Global PIQA: Physical commonsense reasoning, 100+ languages | test | mrlbenchmarks/global-piqa-nonparallel | mcq_reward_fn |
| longbench_v2 | LongBench v2: Long-context understanding MCQ | test | THUDM/LongBench-v2 | mcq_reward_fn |
## Question answering

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| hotpotqa | HotpotQA: Multi-hop question answering | 7.4K validation | hotpotqa/hotpot_qa | f1_reward_fn |
| aa_lcr | AA-LCR: Long-context reasoning over ~100K-token documents | 100 questions | ArtificialAnalysis/AA-LCR | llm_equality_reward_fn |
| hle | HLE: Humanity's Last Exam — expert-level questions | 2,500 test | cais/hle | llm_equality_reward_fn |

`hle` and `hle_search` are gated datasets on HuggingFace. Run `huggingface-cli login` before pulling them.
## Instruction following

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| ifeval | IFEval: Instruction following with verifiable constraints | 541 | google/IFEval | ifeval_reward_fn |
| ifbench | IFBench: Out-of-distribution instruction following | test | allenai/IFBench_test | ifeval_reward_fn |
## Search

Datasets in this category use the search agent, which requires a search backend. Set one with `--search-backend serper` or `--search-backend brave`.

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| browsecomp | BrowseComp: Web browsing comprehension | 200 test | Tevatron/browsecomp-plus | llm_equality_reward_fn |
| seal0 | Seal-0: Search-augmented QA with freshness metadata | test | vtllms/sealqa | llm_equality_reward_fn |
| widesearch | WideSearch: Broad web search with structured table output | 200 | ByteDance-Seed/WideSearch | widesearch_reward_fn |
| hle_search | HLE + Search: Humanity's Last Exam with web search tools | test | cais/hle | llm_equality_reward_fn |
## Agentic

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| bfcl | BFCL: Berkeley Function Calling Leaderboard (exec_simple) | test | gorilla-llm/Berkeley-Function-Calling-Leaderboard | bfcl_reward_fn |
| multichallenge | MultiChallenge: Multi-turn conversation evaluation | test | nmayorga7/multichallenge | llm_judge_reward_fn |
| frozenlake | FrozenLake: Grid navigation (procedurally generated) | train, test | Generated | frozenlake_reward_fn |
## Translation

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| wmt24pp | WMT24++: Machine translation across 55 languages (ChrF) | train | google/wmt24pp | translation_reward_fn |
## Vision-language (VLM)

These datasets contain images and require a vision-capable model.

| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| mmmu | MMMU: Multi-discipline multimodal understanding | 900 validation | MMMU/MMMU | mcq_reward_fn |
| mmmu_pro | MMMU-Pro: Harder multimodal understanding, 10 options | 1,730 test | MMMU/MMMU_Pro | mcq_reward_fn |
| mathvision | MathVision: Visual math reasoning | 304 testmini | MathLLMs/MathVision | math_reward_fn |
| mathvista | MathVista: Visual math across diverse tasks | 1,000 testmini | AI4Math/MathVista | math_reward_fn |
| dynamath | DynaMath: Dynamic visual math with 10 variants | 5,010 | DynaMath/DynaMath_Sample | math_reward_fn |
| zerobench | ZeroBench: Zero-shot visual reasoning | 100 questions | jonathan-roberts1/zerobench | llm_equality_reward_fn |
| zerobench_sub | ZeroBench Subquestions: Decomposed visual reasoning | 334 subquestions | jonathan-roberts1/zerobench | llm_equality_reward_fn |
| vlmsareblind | VLMs Are Blind: Visual perception benchmark | 8,020 valid | XAI/vlmsareblind | f1_reward_fn |
| babyvision | BabyVision: Early visual understanding MCQ | 388 questions | UnipatAI/BabyVision | llm_equality_reward_fn |
| ai2d | AI2D: Science diagram understanding MCQ | 3,088 test | lmms-lab/ai2d | mcq_reward_fn |
| ocrbench | OCRBench: OCR and text recognition | 1,000 test | echo840/OCRBench | f1_reward_fn |
| charxiv | CharXiv: Chart understanding reasoning | 1,000 validation | princeton-nlp/CharXiv | llm_equality_reward_fn |
| cc_ocr | CC-OCR: Multi-scene OCR with 4 sub-tasks | 7,058 test | wulipc/CC-OCR | f1_reward_fn |
| countbenchqa | CountBenchQA: Visual object counting QA | 491 test | vikhyatk/CountBenchQA | f1_reward_fn |
| erqa | ERQA: Entity recognition QA with multi-image support | 400 test | FlagEval/ERQA | mcq_reward_fn |
| geo3k | Geometry3K: Geometry problems with diagrams | 2.4K train, 601 test | hiyouga/geometry3k | math_reward_fn |
| omnidocbench | OmniDocBench: Comprehensive document understanding | test | rwood-97/english_OmniDocBench_with_eval | f1_reward_fn |
| docvqa | DocVQA: Single-page document visual QA | 5,188 validation | lmms-lab/DocVQA | f1_reward_fn |
| refcoco | RefCOCO: Referring expression comprehension (bounding box) | test | lmms-lab/RefCOCO | iou_reward_fn |
| refspatial | RefSpatial-Bench: Spatial reasoning with point prediction | test | BAAI/RefSpatial-Bench | point_in_mask_reward_fn |
| lingoqa | LingoQA: Language-grounded QA for autonomous driving | test | runoob1/lingoqa | f1_reward_fn |
| sunrgbd | SUN RGB-D: Depth estimation and scene understanding | test | wyrx/SUNRGBD_seg | depth_reward_fn |
## Using custom datasets

You can register your own datasets for use with `rllm eval` and `rllm train`:

```bash
rllm dataset register my-dataset --file data.jsonl --category math
rllm eval my-dataset --agent react --evaluator math_reward_fn
```

Your data file should contain `question` and `ground_truth` fields. Supported formats: JSON, JSONL, CSV, and Parquet.
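As a minimal sketch, a JSONL data file with the required `question` and `ground_truth` fields could be generated like this (the two example rows are illustrative, not drawn from any real dataset):

```python
import json

# Illustrative rows containing the required `question` and `ground_truth` fields.
rows = [
    {"question": "What is 12 * 7?", "ground_truth": "84"},
    {"question": "Compute 3^4 - 2^5.", "ground_truth": "49"},
]

# JSONL stores one JSON object per line.
with open("data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

The resulting `data.jsonl` can then be passed to `rllm dataset register` via `--file`.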
For more details, see `rllm dataset`.