rLLM ships with a built-in catalog of 60+ benchmark datasets spanning math, code, question answering, instruction following, search, vision-language, translation, and agentic tasks. All datasets are auto-pulled from HuggingFace on first use.
rllm dataset list --all # See all available datasets
rllm eval gsm8k # Auto-pulls and evaluates
To add your own dataset (catalog entry, ad-hoc directory, or Harbor package), see Bring your own dataset.
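To evaluate several catalog entries in one go, the `rllm eval <name>` command above can be driven from a small script. The following is a minimal sketch, not part of rLLM: `build_eval_command` and `run_suite` are hypothetical helper names, and it assumes the `rllm` executable is on your PATH and that your model or endpoint is already configured.

```python
import shlex
import subprocess

# Dataset names taken from the Math table in this catalog.
MATH_SUITE = ["gsm8k", "math500", "aime_2025"]


def build_eval_command(dataset: str) -> list[str]:
    """Return the argv for `rllm eval <dataset>`, the form shown above."""
    return ["rllm", "eval", dataset]


def run_suite(datasets: list[str], dry_run: bool = True) -> list[int]:
    """Run each evaluation in order and collect exit codes.

    With dry_run=True the commands are only printed, never executed.
    """
    exit_codes = []
    for name in datasets:
        cmd = build_eval_command(name)
        print(shlex.join(cmd))
        if dry_run:
            exit_codes.append(0)
        else:
            # Assumption: `rllm` is installed and on PATH.
            exit_codes.append(subprocess.run(cmd, check=False).returncode)
    return exit_codes


if __name__ == "__main__":
    run_suite(MATH_SUITE, dry_run=True)
```

Starting with `dry_run=True` lets you check the command list before any dataset is pulled.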
Math
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| gsm8k | Grade school math word problems | 8.5K train, 1.3K test | openai/gsm8k | math_reward_fn |
| math500 | MATH-500 competition math benchmark | 500 test | HuggingFaceH4/MATH-500 | math_reward_fn |
| hendrycks_math | MATH: Competition mathematics across 7 subjects | 7.5K train, 5K test | EleutherAI/hendrycks_math | math_reward_fn |
| deepscaler_math | DeepScaleR-Preview: ~40K problems (AIME/AMC/Omni-MATH/STILL) | ~40K train | agentica-org/DeepScaleR-Preview-Dataset | math_reward_fn |
| countdown | Countdown arithmetic puzzle | 1K train, 500 test | predibase/countdown | countdown_reward_fn |
| hmmt | HMMT Feb 2025: Harvard-MIT Mathematics Tournament | train | MathArena/hmmt_feb_2025 | math_reward_fn |
| hmmt_nov | HMMT Nov 2025: Harvard-MIT Mathematics Tournament | 30 problems | MathArena/hmmt_nov_2025 | math_reward_fn |
| aime_2025 | AIME 2025: American Invitational Mathematics Exam | 30 problems | MathArena/aime_2025 | math_reward_fn |
| aime_2026 | AIME 2026: American Invitational Mathematics Exam | 30 problems | MathArena/aime_2026 | math_reward_fn |
| polymath | PolyMATH: Multilingual math reasoning across 18 languages | 4 difficulty splits | Qwen/PolyMath | math_reward_fn |
Code
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| humaneval | HumanEval: Function-level code generation | 164 problems | openai/openai_humaneval | code_reward_fn |
| mbpp | MBPP: Python programming benchmark | 974 problems | google-research-datasets/mbpp | code_reward_fn |
| livecodebench | LiveCodeBench: Contamination-free competitive programming | test | livecodebench/code_generation | code_reward_fn |
| swebench_verified | SWE-bench Verified: Real-world GitHub issues for SWE agents | 500 test | princeton-nlp/SWE-bench_Verified | swebench_reward_fn |
Multiple choice (MCQ)
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| mmlu_pro | MMLU-Pro: Expert-level MCQ with 10 options | 12K test | TIGER-Lab/MMLU-Pro | mcq_reward_fn |
| mmlu_redux | MMLU-Redux: Curated MMLU subset with error fixes | 3K test | edinburgh-dawg/mmlu-redux | mcq_reward_fn |
| gpqa_diamond | GPQA: Expert-level graduate science QA | 448 questions | ankner/gpqa | mcq_reward_fn |
| supergpqa | SuperGPQA: Graduate-level QA across 285 disciplines | 26.5K | m-a-p/SuperGPQA | mcq_reward_fn |
| ceval | C-Eval: Chinese evaluation across 52 disciplines | 13.9K | ceval/ceval-exam | mcq_reward_fn |
| mmmlu | MMMLU: Multilingual MMLU across 14 languages | 15.9K/lang | openai/MMMLU | mcq_reward_fn |
| mmlu_prox | MMLU-ProX: Multilingual MMLU-Pro across 29 languages | 11.8K/lang | li-lab/MMLU-ProX | mcq_reward_fn |
| include | INCLUDE: Multilingual knowledge from local exams, 44 languages | test | CohereLabs/include-base-44 | mcq_reward_fn |
| global_piqa | Global PIQA: Physical commonsense reasoning, 100+ languages | test | mrlbenchmarks/global-piqa-nonparallel | mcq_reward_fn |
| longbench_v2 | LongBench v2: Long-context understanding MCQ | test | THUDM/LongBench-v2 | mcq_reward_fn |
Question answering
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| hotpotqa | HotpotQA: Multi-hop question answering | 7.4K validation | hotpotqa/hotpot_qa | f1_reward_fn |
| aa_lcr | AA-LCR: Long-context reasoning over ~100K-token documents | 100 questions | ArtificialAnalysis/AA-LCR | llm_equality_reward_fn |
| hle | HLE: Humanity’s Last Exam, expert-level questions | 2,500 test | cais/hle | llm_equality_reward_fn |
hle and hle_search are gated datasets on HuggingFace. Run huggingface-cli login before pulling them.
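The table above lists f1_reward_fn as the evaluator for hotpotqa without describing it. As rough orientation only, the sketch below shows the token-overlap F1 commonly used for extractive QA (SQuAD-style). It is an assumption that rLLM's f1_reward_fn behaves similarly; the function names and the normalization rules here are illustrative and not taken from the rLLM source.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation and English articles, then tokenize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a verbose but correct answer earns partial credit: "the city of Paris" against the reference "Paris" has precision 1/3 and recall 1, which gives an F1 of 0.5.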
Instruction following
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| ifeval | IFEval: Instruction following with verifiable constraints | 541 | google/IFEval | ifeval_reward_fn |
| ifbench | IFBench: Out-of-distribution instruction following | test | allenai/IFBench_test | ifeval_reward_fn |
Search
Datasets in this category use the search agent, which requires a search backend. Set one with --search-backend serper or --search-backend brave.
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| browsecomp | BrowseComp: Web browsing comprehension | 200 test | Tevatron/browsecomp-plus | llm_equality_reward_fn |
| seal0 | Seal-0: Search-augmented QA with freshness metadata | test | vtllms/sealqa | llm_equality_reward_fn |
| widesearch | WideSearch: Broad web search with structured table output | 200 | ByteDance-Seed/WideSearch | widesearch_reward_fn |
| hle_search | HLE + Search: Humanity’s Last Exam with web search tools | test | cais/hle | llm_equality_reward_fn |
Agentic
Most agentic datasets are sandboxed — they ship per-task Dockerfiles and shell verifiers. See Sandboxes for how to pick --sandbox-backend and Tasks (Harbor-compatible) for the on-disk format.
| Dataset | Description | Size | Source | Default agent |
|---|---|---|---|---|
| bfcl | BFCL: Berkeley Function Calling Leaderboard (exec_simple) | test | gorilla-llm/Berkeley-Function-Calling-Leaderboard | react |
| multichallenge | MultiChallenge: Multi-turn conversation evaluation | test | nmayorga7/multichallenge | react |
| claw_eval | Claw-Eval: 161 personal-assistant agent tasks in sandbox workspaces (LLM-judge graded) | 161 general | claw-eval/Claw-Eval | zeroclaw |
| rllm-swesmith | SWE-smith filtered: solvable bug-fixing tasks across 105 Python repos (in-sandbox pytest grading) | ~4.7K train | kylemontgomery/swesmith-filtered | mini-swe-agent |
| skillsbench | SkillsBench: 91 expert-curated agentic tasks measuring skill use (per-task tests/test.sh verifier) | 91 train | benchflow/skillsbench | claude-code |
| skillsbench-no-skills | SkillsBench baseline without the per-task skills/ tree, for measuring skills-augmentation gain | 91 train | benchflow/skillsbench | claude-code |
Harbor packages (no registry entry)
Terminal-Bench 2.0 and other Harbor task packages are not in the catalog — they’re resolved from Harbor at run time:
rllm eval harbor:laude-institute/t-bench-2 --agent terminus2 --sandbox-backend modal
See the Terminal-Bench cookbook for the full walkthrough.
Translation
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| wmt24pp | WMT24++: Machine translation across 55 languages (ChrF) | train | google/wmt24pp | translation_reward_fn |
Vision-language (VLM)
These datasets contain images and require a vision-capable model.
| Dataset | Description | Size | Source | Evaluator |
|---|---|---|---|---|
| mmmu | MMMU: Multi-discipline multimodal understanding | 900 validation | MMMU/MMMU | mcq_reward_fn |
| mmmu_pro | MMMU-Pro: Harder multimodal understanding, 10 options | 1,730 test | MMMU/MMMU_Pro | mcq_reward_fn |
| mathvision | MathVision: Visual math reasoning | 304 testmini | MathLLMs/MathVision | math_reward_fn |
| mathvista | MathVista: Visual math across diverse tasks | 1,000 testmini | AI4Math/MathVista | math_reward_fn |
| dynamath | DynaMath: Dynamic visual math with 10 variants | 5,010 | DynaMath/DynaMath_Sample | math_reward_fn |
| zerobench | ZEROBench: Zero-shot visual reasoning | 100 questions | jonathan-roberts1/zerobench | llm_equality_reward_fn |
| zerobench_sub | ZEROBench Subquestions: Decomposed visual reasoning | 334 subquestions | jonathan-roberts1/zerobench | llm_equality_reward_fn |
| vlmsareblind | VLMs Are Blind: Visual perception benchmark | 8,020 valid | XAI/vlmsareblind | f1_reward_fn |
| babyvision | BabyVision: Early visual understanding MCQ | 388 questions | UnipatAI/BabyVision | llm_equality_reward_fn |
| ai2d | AI2D: Science diagram understanding MCQ | 3,088 test | lmms-lab/ai2d | mcq_reward_fn |
| ocrbench | OCRBench: OCR and text recognition | 1,000 test | echo840/OCRBench | f1_reward_fn |
| charxiv | CharXiv: Chart understanding reasoning | 1,000 validation | princeton-nlp/CharXiv | llm_equality_reward_fn |
| cc_ocr | CC-OCR: Multi-scene OCR with 4 sub-tasks | 7,058 test | wulipc/CC-OCR | f1_reward_fn |
| countbenchqa | CountBenchQA: Visual object counting QA | 491 test | vikhyatk/CountBenchQA | f1_reward_fn |
| erqa | ERQA: Entity recognition QA with multi-image support | 400 test | FlagEval/ERQA | mcq_reward_fn |
| geo3k | Geometry3K: Geometry problems with diagrams | 2.4K train, 601 test | hiyouga/geometry3k | math_reward_fn |
| omnidocbench | OmniDocBench: Comprehensive document understanding | test | rwood-97/english_OmniDocBench_with_eval | f1_reward_fn |
| docvqa | DocVQA: Single-page document visual QA | 5,188 validation | lmms-lab/DocVQA | f1_reward_fn |
| refcoco | RefCOCO: Referring expression comprehension (bounding box) | test | lmms-lab/RefCOCO | iou_reward_fn |
| refspatial | RefSpatial-Bench: Spatial reasoning with point prediction | test | BAAI/RefSpatial-Bench | point_in_mask_reward_fn |
| lingoqa | LingoQA: Language-grounded QA for autonomous driving | test | runoob1/lingoqa | f1_reward_fn |
| sunrgbd | SUN RGB-D: Depth estimation and scene understanding | test | wyrx/SUNRGBD_seg | depth_reward_fn |
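For refcoco, the table names iou_reward_fn as the evaluator for bounding-box predictions. The sketch below shows the standard intersection-over-union computation for two axis-aligned boxes. It is illustrative only: the `(x_min, y_min, x_max, y_max)` box layout and the `box_iou` name are assumptions, and rLLM's actual reward function may parse, scale, or threshold boxes differently.

```python
from typing import Tuple

# Assumed layout: (x_min, y_min, x_max, y_max) in the same coordinate system.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area; 0.0 for disjoint or empty boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes yield no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.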
Using custom datasets
Three paths for adding your own dataset:
- Ad-hoc directory: point rllm eval at a folder on disk; the loader autodetects the shape.
- Catalog entry + builder: register a builder in rllm/registry/datasets.json so collaborators can rllm dataset pull <name>.
- Harbor package: drop in any Harbor task package with the harbor: prefix.
See Bring your own dataset for the full walkthrough.
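One way to picture how a single dataset argument can select among the three paths is sketched below. This is a hypothetical illustration, not rLLM's resolver: `classify_dataset_arg` is an invented name, and the precedence shown (the harbor: prefix first, then an existing directory, then a catalog name) is an assumption.

```python
from pathlib import Path

HARBOR_PREFIX = "harbor:"


def classify_dataset_arg(arg: str) -> tuple[str, str]:
    """Return (kind, value), where kind is 'harbor', 'directory', or 'catalog'."""
    if arg.startswith(HARBOR_PREFIX):
        # e.g. "harbor:laude-institute/t-bench-2" -> package "laude-institute/t-bench-2"
        return ("harbor", arg[len(HARBOR_PREFIX):])
    if Path(arg).is_dir():
        # An existing folder on disk is treated as an ad-hoc dataset directory.
        return ("directory", str(Path(arg).resolve()))
    # Anything else is assumed to be a catalog entry name such as "gsm8k".
    return ("catalog", arg)
```

With this ordering, a local folder that happens to share a name with a catalog entry would shadow it, so a real resolver may well choose a different precedence or require an explicit path such as ./my_dataset.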