Supported datasets

rLLM ships with a built-in catalog of 60+ benchmark datasets spanning math, code, multiple choice, question answering, instruction following, search, vision-language, translation, and agentic tasks. Catalog datasets are pulled from HuggingFace automatically on first use.

rllm dataset list --all    # See all available datasets
rllm eval gsm8k            # Auto-pulls and evaluates

To add your own dataset (catalog entry, ad-hoc directory, or Harbor package), see Bring your own dataset.

Math

Dataset	Description	Size	Source	Evaluator
`gsm8k`	Grade school math word problems	8.5K train, 1.3K test	openai/gsm8k	`math_reward_fn`
`math500`	MATH-500 competition math benchmark	500 test	HuggingFaceH4/MATH-500	`math_reward_fn`
`hendrycks_math`	MATH: Competition mathematics across 7 subjects	7.5K train, 5K test	EleutherAI/hendrycks_math	`math_reward_fn`
`deepscaler_math`	DeepScaleR-Preview: ~40K problems (AIME/AMC/Omni-MATH/STILL)	~40K train	agentica-org/DeepScaleR-Preview-Dataset	`math_reward_fn`
`countdown`	Countdown arithmetic puzzle	1K train, 500 test	predibase/countdown	`countdown_reward_fn`
`hmmt`	HMMT Feb 2025: Harvard-MIT Mathematics Tournament	train	MathArena/hmmt_feb_2025	`math_reward_fn`
`hmmt_nov`	HMMT Nov 2025: Harvard-MIT Mathematics Tournament	30 problems	MathArena/hmmt_nov_2025	`math_reward_fn`
`aime_2025`	AIME 2025: American Invitational Mathematics Exam	30 problems	MathArena/aime_2025	`math_reward_fn`
`aime_2026`	AIME 2026: American Invitational Mathematics Exam	30 problems	MathArena/aime_2026	`math_reward_fn`
`polymath`	PolyMATH: Multilingual math reasoning across 18 languages	4 difficulty splits	Qwen/PolyMath	`math_reward_fn`

Code

Dataset	Description	Size	Source	Evaluator
`humaneval`	HumanEval: Function-level code generation	164 problems	openai/openai_humaneval	`code_reward_fn`
`mbpp`	MBPP: Python programming benchmark	974 problems	google-research-datasets/mbpp	`code_reward_fn`
`livecodebench`	LiveCodeBench: Contamination-free competitive programming	test	livecodebench/code_generation	`code_reward_fn`
`swebench_verified`	SWE-bench Verified: Real-world GitHub issues for SWE agents	500 test	princeton-nlp/SWE-bench_Verified	`swebench_reward_fn`

Multiple choice (MCQ)

Dataset	Description	Size	Source	Evaluator
`mmlu_pro`	MMLU-Pro: Expert-level MCQ with 10 options	12K test	TIGER-Lab/MMLU-Pro	`mcq_reward_fn`
`mmlu_redux`	MMLU-Redux: Curated MMLU subset with error fixes	3K test	edinburgh-dawg/mmlu-redux	`mcq_reward_fn`
`gpqa_diamond`	GPQA: Expert-level graduate science QA	448 questions	ankner/gpqa	`mcq_reward_fn`
`supergpqa`	SuperGPQA: Graduate-level QA across 285 disciplines	26.5K	m-a-p/SuperGPQA	`mcq_reward_fn`
`ceval`	C-Eval: Chinese evaluation across 52 disciplines	13.9K	ceval/ceval-exam	`mcq_reward_fn`
`mmmlu`	MMMLU: Multilingual MMLU across 14 languages	15.9K/lang	openai/MMMLU	`mcq_reward_fn`
`mmlu_prox`	MMLU-ProX: Multilingual MMLU-Pro across 29 languages	11.8K/lang	li-lab/MMLU-ProX	`mcq_reward_fn`
`include`	INCLUDE: Multilingual knowledge from local exams, 44 languages	test	CohereLabs/include-base-44	`mcq_reward_fn`
`global_piqa`	Global PIQA: Physical commonsense reasoning, 100+ languages	test	mrlbenchmarks/global-piqa-nonparallel	`mcq_reward_fn`
`longbench_v2`	LongBench v2: Long-context understanding MCQ	test	THUDM/LongBench-v2	`mcq_reward_fn`

Question answering

Dataset	Description	Size	Source	Evaluator
`hotpotqa`	HotpotQA: Multi-hop question answering	7.4K validation	hotpotqa/hotpot_qa	`f1_reward_fn`
`aa_lcr`	AA-LCR: Long-context reasoning over ~100K-token documents	100 questions	ArtificialAnalysis/AA-LCR	`llm_equality_reward_fn`
`hle`	HLE: Humanity’s Last Exam — expert-level questions	2,500 test	cais/hle	`llm_equality_reward_fn`

`hle` and `hle_search` (listed under Search) are gated datasets on HuggingFace. Run `huggingface-cli login` before pulling them.

Instruction following

Dataset	Description	Size	Source	Evaluator
`ifeval`	IFEval: Instruction following with verifiable constraints	541	google/IFEval	`ifeval_reward_fn`
`ifbench`	IFBench: Out-of-distribution instruction following	test	allenai/IFBench_test	`ifeval_reward_fn`

Search

Datasets in this category use the search agent, which requires a search backend. Set one with `--search-backend serper` or `--search-backend brave`.

Dataset	Description	Size	Source	Evaluator
`browsecomp`	BrowseComp: Web browsing comprehension	200 test	Tevatron/browsecomp-plus	`llm_equality_reward_fn`
`seal0`	Seal-0: Search-augmented QA with freshness metadata	test	vtllms/sealqa	`llm_equality_reward_fn`
`widesearch`	WideSearch: Broad web search with structured table output	200	ByteDance-Seed/WideSearch	`widesearch_reward_fn`
`hle_search`	HLE + Search: Humanity’s Last Exam with web search tools	test	cais/hle	`llm_equality_reward_fn`
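To sweep the whole category with one backend, a short shell loop is enough. The sketch below uses only the `rllm eval` command and the `--search-backend` flag described above. The `run` helper, the `DRY_RUN` switch, and the `search_eval_all` function are illustrative names that are not part of rLLM, and any credentials the chosen backend needs are assumed to be configured separately. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of rLLM): print the command when
# DRY_RUN is 1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Evaluate every dataset from the Search table with a single backend.
search_eval_all() {
  backend="$1"   # "serper" or "brave"
  for ds in browsecomp seal0 widesearch hle_search; do
    run rllm eval "$ds" --search-backend "$backend"
  done
}

search_eval_all serper
```

Set `DRY_RUN=0` to launch the evaluations for real. `hle_search` is gated, so it also needs the HuggingFace login mentioned under Question answering.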

Agentic

Most agentic datasets are sandboxed: they ship per-task Dockerfiles and shell verifiers. See Sandboxes for how to pick `--sandbox-backend` and Tasks (Harbor-compatible) for the on-disk format.

Dataset	Description	Size	Source	Default agent
`bfcl`	BFCL: Berkeley Function Calling Leaderboard (exec_simple)	test	gorilla-llm/Berkeley-Function-Calling-Leaderboard	`react`
`multichallenge`	MultiChallenge: Multi-turn conversation evaluation	test	nmayorga7/multichallenge	`react`
`claw_eval`	Claw-Eval: 161 personal-assistant agent tasks in sandbox workspaces (LLM-judge graded)	161 general	claw-eval/Claw-Eval	`zeroclaw`
`rllm-swesmith`	SWE-smith filtered: solvable bug-fixing tasks across 105 Python repos (in-sandbox pytest grading)	~4.7K train	kylemontgomery/swesmith-filtered	`mini-swe-agent`
`skillsbench`	SkillsBench: 91 expert-curated agentic tasks measuring skill use (per-task `tests/test.sh` verifier)	91 train	benchflow/skillsbench	`claude-code`
`skillsbench-no-skills`	SkillsBench baseline without the per-task `skills/` tree, for measuring skills-augmentation gain	91 train	benchflow/skillsbench	`claude-code`

Harbor packages (no registry entry)

Terminal-Bench 2.0 and other Harbor task packages are not in the catalog — they’re resolved from Harbor at run time:

rllm eval harbor:laude-institute/t-bench-2 --agent terminus2 --sandbox-backend modal

See the Terminal-Bench cookbook for the full walkthrough.

Translation

Dataset	Description	Size	Source	Evaluator
`wmt24pp`	WMT24++: Machine translation across 55 languages (ChrF)	train	google/wmt24pp	`translation_reward_fn`

Vision-language (VLM)

These datasets contain images and require a vision-capable model.

Dataset	Description	Size	Source	Evaluator
`mmmu`	MMMU: Multi-discipline multimodal understanding	900 validation	MMMU/MMMU	`mcq_reward_fn`
`mmmu_pro`	MMMU-Pro: Harder multimodal understanding, 10 options	1,730 test	MMMU/MMMU_Pro	`mcq_reward_fn`
`mathvision`	MathVision: Visual math reasoning	304 testmini	MathLLMs/MathVision	`math_reward_fn`
`mathvista`	MathVista: Visual math across diverse tasks	1,000 testmini	AI4Math/MathVista	`math_reward_fn`
`dynamath`	DynaMath: Dynamic visual math with 10 variants	5,010	DynaMath/DynaMath_Sample	`math_reward_fn`
`zerobench`	ZEROBench: Zero-shot visual reasoning	100 questions	jonathan-roberts1/zerobench	`llm_equality_reward_fn`
`zerobench_sub`	ZEROBench Subquestions: Decomposed visual reasoning	334 subquestions	jonathan-roberts1/zerobench	`llm_equality_reward_fn`
`vlmsareblind`	VLMs Are Blind: Visual perception benchmark	8,020 valid	XAI/vlmsareblind	`f1_reward_fn`
`babyvision`	BabyVision: Early visual understanding MCQ	388 questions	UnipatAI/BabyVision	`llm_equality_reward_fn`
`ai2d`	AI2D: Science diagram understanding MCQ	3,088 test	lmms-lab/ai2d	`mcq_reward_fn`
`ocrbench`	OCRBench: OCR and text recognition	1,000 test	echo840/OCRBench	`f1_reward_fn`
`charxiv`	CharXiv: Chart understanding reasoning	1,000 validation	princeton-nlp/CharXiv	`llm_equality_reward_fn`
`cc_ocr`	CC-OCR: Multi-scene OCR with 4 sub-tasks	7,058 test	wulipc/CC-OCR	`f1_reward_fn`
`countbenchqa`	CountBenchQA: Visual object counting QA	491 test	vikhyatk/CountBenchQA	`f1_reward_fn`
`erqa`	ERQA: Embodied reasoning QA with multi-image support	400 test	FlagEval/ERQA	`mcq_reward_fn`
`geo3k`	Geometry3K: Geometry problems with diagrams	2.4K train, 601 test	hiyouga/geometry3k	`math_reward_fn`
`omnidocbench`	OmniDocBench: Comprehensive document understanding	test	rwood-97/english_OmniDocBench_with_eval	`f1_reward_fn`
`docvqa`	DocVQA: Single-page document visual QA	5,188 validation	lmms-lab/DocVQA	`f1_reward_fn`
`refcoco`	RefCOCO: Referring expression comprehension (bounding box)	test	lmms-lab/RefCOCO	`iou_reward_fn`
`refspatial`	RefSpatial-Bench: Spatial reasoning with point prediction	test	BAAI/RefSpatial-Bench	`point_in_mask_reward_fn`
`lingoqa`	LingoQA: Language-grounded QA for autonomous driving	test	runoob1/lingoqa	`f1_reward_fn`
`sunrgbd`	SUN RGB-D: Depth estimation and scene understanding	test	wyrx/SUNRGBD_seg	`depth_reward_fn`

Using custom datasets

Three paths for adding your own dataset:

Ad-hoc directory: point `rllm eval` at a folder on disk; the loader autodetects the shape.
Catalog entry + builder: register a builder in `rllm/registry/datasets.json` so collaborators can run `rllm dataset pull <name>`.
Harbor package: drop in any Harbor task package with the `harbor:` prefix.
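
From the command line, the three paths differ mainly in what you pass to `rllm`. The sketch below prints one invocation per path rather than running it. `./my_tasks` and `my_dataset` are placeholder names, passing the folder as a positional argument is an assumption based on the description above, and the `run` helper with its `DRY_RUN` switch is illustrative and not part of rLLM. The Harbor line repeats the Terminal-Bench command from the Harbor packages section.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of rLLM): print the command when
# DRY_RUN is 1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Ad-hoc directory: point rllm eval at a folder (placeholder path,
#    positional form assumed).
run rllm eval ./my_tasks

# 2. Catalog entry: pull a registered dataset by name (placeholder name).
run rllm dataset pull my_dataset

# 3. Harbor package: use the harbor: prefix.
run rllm eval harbor:laude-institute/t-bench-2 --agent terminus2 --sandbox-backend modal
```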

See Bring your own dataset for the full walkthrough.
