rLLM ships with a built-in catalog of 60+ benchmark datasets spanning math, code, question answering, instruction following, search, vision-language, translation, and agentic tasks. All datasets are auto-pulled from HuggingFace on first use.
rllm dataset list --all    # See all available datasets
rllm eval gsm8k            # Auto-pulls and evaluates
To add your own dataset (catalog entry, ad-hoc directory, or Harbor package), see Bring your own dataset.

Math

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| gsm8k | Grade school math word problems | 8.5K train, 1.3K test | openai/gsm8k | math_reward_fn |
| math500 | MATH-500 competition math benchmark | 500 test | HuggingFaceH4/MATH-500 | math_reward_fn |
| hendrycks_math | MATH: Competition mathematics across 7 subjects | 7.5K train, 5K test | EleutherAI/hendrycks_math | math_reward_fn |
| deepscaler_math | DeepScaleR-Preview: ~40K problems (AIME/AMC/Omni-MATH/STILL) | ~40K train | agentica-org/DeepScaleR-Preview-Dataset | math_reward_fn |
| countdown | Countdown arithmetic puzzle | 1K train, 500 test | predibase/countdown | countdown_reward_fn |
| hmmt | HMMT Feb 2025: Harvard-MIT Mathematics Tournament | train | MathArena/hmmt_feb_2025 | math_reward_fn |
| hmmt_nov | HMMT Nov 2025: Harvard-MIT Mathematics Tournament | 30 problems | MathArena/hmmt_nov_2025 | math_reward_fn |
| aime_2025 | AIME 2025: American Invitational Mathematics Exam | 30 problems | MathArena/aime_2025 | math_reward_fn |
| aime_2026 | AIME 2026: American Invitational Mathematics Exam | 30 problems | MathArena/aime_2026 | math_reward_fn |
| polymath | PolyMATH: Multilingual math reasoning across 18 languages | 4 difficulty splits | Qwen/PolyMath | math_reward_fn |

Code

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| humaneval | HumanEval: Function-level code generation | 164 problems | openai/openai_humaneval | code_reward_fn |
| mbpp | MBPP: Python programming benchmark | 974 problems | google-research-datasets/mbpp | code_reward_fn |
| livecodebench | LiveCodeBench: Contamination-free competitive programming | test | livecodebench/code_generation | code_reward_fn |
| swebench_verified | SWE-bench Verified: Real-world GitHub issues for SWE agents | 500 test | princeton-nlp/SWE-bench_Verified | swebench_reward_fn |

Multiple choice (MCQ)

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| mmlu_pro | MMLU-Pro: Expert-level MCQ with 10 options | 12K test | TIGER-Lab/MMLU-Pro | mcq_reward_fn |
| mmlu_redux | MMLU-Redux: Curated MMLU subset with error fixes | 3K test | edinburgh-dawg/mmlu-redux | mcq_reward_fn |
| gpqa_diamond | GPQA: Expert-level graduate science QA | 448 questions | ankner/gpqa | mcq_reward_fn |
| supergpqa | SuperGPQA: Graduate-level QA across 285 disciplines | 26.5K | m-a-p/SuperGPQA | mcq_reward_fn |
| ceval | C-Eval: Chinese evaluation across 52 disciplines | 13.9K | ceval/ceval-exam | mcq_reward_fn |
| mmmlu | MMMLU: Multilingual MMLU across 14 languages | 15.9K/lang | openai/MMMLU | mcq_reward_fn |
| mmlu_prox | MMLU-ProX: Multilingual MMLU-Pro across 29 languages | 11.8K/lang | li-lab/MMLU-ProX | mcq_reward_fn |
| include | INCLUDE: Multilingual knowledge from local exams, 44 languages | test | CohereLabs/include-base-44 | mcq_reward_fn |
| global_piqa | Global PIQA: Physical commonsense reasoning, 100+ languages | test | mrlbenchmarks/global-piqa-nonparallel | mcq_reward_fn |
| longbench_v2 | LongBench v2: Long-context understanding MCQ | test | THUDM/LongBench-v2 | mcq_reward_fn |

Question answering

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| hotpotqa | HotpotQA: Multi-hop question answering | 7.4K validation | hotpotqa/hotpot_qa | f1_reward_fn |
| aa_lcr | AA-LCR: Long-context reasoning over ~100K-token documents | 100 questions | ArtificialAnalysis/AA-LCR | llm_equality_reward_fn |
| hle | HLE: Humanity’s Last Exam, expert-level questions | 2,500 test | cais/hle | llm_equality_reward_fn |

hle and hle_search are gated datasets on HuggingFace. Run huggingface-cli login before pulling them.
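The hotpotqa entry above is scored with f1_reward_fn. As a rough illustration of what a token-overlap F1 score measures, here is a minimal sketch in the style of common extractive-QA scoring. The normalization rules and the helper names `normalize` and `token_f1` are assumptions made for this example; rLLM's actual f1_reward_fn may normalize or aggregate differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization rules are an assumption for illustration only.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (hypothetical helper)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty answers count as a match; one empty side scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference plus extra tokens keeps full recall but loses precision, so it scores below 1.0 rather than being marked simply right or wrong.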

Instruction following

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| ifeval | IFEval: Instruction following with verifiable constraints | 541 | google/IFEval | ifeval_reward_fn |
| ifbench | IFBench: Out-of-distribution instruction following | test | allenai/IFBench_test | ifeval_reward_fn |

Search

Datasets in this category use the search agent, which requires a search backend. Set one with --search-backend serper or --search-backend brave.

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| browsecomp | BrowseComp: Web browsing comprehension | 200 test | Tevatron/browsecomp-plus | llm_equality_reward_fn |
| seal0 | Seal-0: Search-augmented QA with freshness metadata | test | vtllms/sealqa | llm_equality_reward_fn |
| widesearch | WideSearch: Broad web search with structured table output | 200 | ByteDance-Seed/WideSearch | widesearch_reward_fn |
| hle_search | HLE + Search: Humanity’s Last Exam with web search tools | test | cais/hle | llm_equality_reward_fn |
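When scripting runs over the search datasets, it can help to validate the backend name before launching anything. The sketch below only builds and prints command lines; it does not invoke rLLM. The helper `build_search_eval_command` is hypothetical, and combining `rllm eval <name>` with `--search-backend` in this order is an assumption based on the two forms shown on this page.

```python
import shlex

# Backends named in the search note; anything else is rejected early.
SEARCH_BACKENDS = ("serper", "brave")


def build_search_eval_command(dataset: str, backend: str) -> list[str]:
    """Return an argv list for `rllm eval <dataset> --search-backend <backend>`.

    Hypothetical helper: the flag placement is assumed, not taken from rLLM.
    """
    if backend not in SEARCH_BACKENDS:
        raise ValueError(
            f"unknown search backend {backend!r}; expected one of {SEARCH_BACKENDS}"
        )
    return ["rllm", "eval", dataset, "--search-backend", backend]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for name in ("browsecomp", "seal0", "widesearch", "hle_search"):
        print(shlex.join(build_search_eval_command(name, "serper")))
```

Returning an argv list rather than a single string keeps the result safe to hand to `subprocess.run` without shell quoting concerns.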

Agentic

Most agentic datasets are sandboxed: they ship per-task Dockerfiles and shell verifiers. See Sandboxes for how to pick --sandbox-backend and Tasks (Harbor-compatible) for the on-disk format.

| Dataset | Description | Size | Source | Default agent |
| --- | --- | --- | --- | --- |
| bfcl | BFCL: Berkeley Function Calling Leaderboard (exec_simple) | test | gorilla-llm/Berkeley-Function-Calling-Leaderboard | react |
| multichallenge | MultiChallenge: Multi-turn conversation evaluation | test | nmayorga7/multichallenge | react |
| claw_eval | Claw-Eval: 161 personal-assistant agent tasks in sandbox workspaces (LLM-judge graded) | 161 general | claw-eval/Claw-Eval | zeroclaw |
| rllm-swesmith | SWE-smith filtered: solvable bug-fixing tasks across 105 Python repos (in-sandbox pytest grading) | ~4.7K train | kylemontgomery/swesmith-filtered | mini-swe-agent |
| skillsbench | SkillsBench: 91 expert-curated agentic tasks measuring skill use (per-task tests/test.sh verifier) | 91 train | benchflow/skillsbench | claude-code |
| skillsbench-no-skills | SkillsBench baseline without the per-task skills/ tree, for measuring skills-augmentation gain | 91 train | benchflow/skillsbench | claude-code |

Harbor packages (no registry entry)

Terminal-Bench 2.0 and other Harbor task packages are not in the catalog — they’re resolved from Harbor at run time:
rllm eval harbor:laude-institute/t-bench-2 --agent terminus2 --sandbox-backend modal
See the Terminal-Bench cookbook for the full walkthrough.

Translation

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| wmt24pp | WMT24++: Machine translation across 55 languages (ChrF) | train | google/wmt24pp | translation_reward_fn |

Vision-language (VLM)

These datasets contain images and require a vision-capable model.

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| mmmu | MMMU: Multi-discipline multimodal understanding | 900 validation | MMMU/MMMU | mcq_reward_fn |
| mmmu_pro | MMMU-Pro: Harder multimodal understanding, 10 options | 1,730 test | MMMU/MMMU_Pro | mcq_reward_fn |
| mathvision | MathVision: Visual math reasoning | 304 testmini | MathLLMs/MathVision | math_reward_fn |
| mathvista | MathVista: Visual math across diverse tasks | 1,000 testmini | AI4Math/MathVista | math_reward_fn |
| dynamath | DynaMath: Dynamic visual math with 10 variants | 5,010 | DynaMath/DynaMath_Sample | math_reward_fn |
| zerobench | ZEROBench: Zero-shot visual reasoning | 100 questions | jonathan-roberts1/zerobench | llm_equality_reward_fn |
| zerobench_sub | ZEROBench Subquestions: Decomposed visual reasoning | 334 subquestions | jonathan-roberts1/zerobench | llm_equality_reward_fn |
| vlmsareblind | VLMs Are Blind: Visual perception benchmark | 8,020 valid | XAI/vlmsareblind | f1_reward_fn |
| babyvision | BabyVision: Early visual understanding MCQ | 388 questions | UnipatAI/BabyVision | llm_equality_reward_fn |
| ai2d | AI2D: Science diagram understanding MCQ | 3,088 test | lmms-lab/ai2d | mcq_reward_fn |
| ocrbench | OCRBench: OCR and text recognition | 1,000 test | echo840/OCRBench | f1_reward_fn |
| charxiv | CharXiv: Chart understanding reasoning | 1,000 validation | princeton-nlp/CharXiv | llm_equality_reward_fn |
| cc_ocr | CC-OCR: Multi-scene OCR with 4 sub-tasks | 7,058 test | wulipc/CC-OCR | f1_reward_fn |
| countbenchqa | CountBenchQA: Visual object counting QA | 491 test | vikhyatk/CountBenchQA | f1_reward_fn |
| erqa | ERQA: Entity recognition QA with multi-image support | 400 test | FlagEval/ERQA | mcq_reward_fn |
| geo3k | Geometry3K: Geometry problems with diagrams | 2.4K train, 601 test | hiyouga/geometry3k | math_reward_fn |
| omnidocbench | OmniDocBench: Comprehensive document understanding | test | rwood-97/english_OmniDocBench_with_eval | f1_reward_fn |
| docvqa | DocVQA: Single-page document visual QA | 5,188 validation | lmms-lab/DocVQA | f1_reward_fn |
| refcoco | RefCOCO: Referring expression comprehension (bounding box) | test | lmms-lab/RefCOCO | iou_reward_fn |
| refspatial | RefSpatial-Bench: Spatial reasoning with point prediction | test | BAAI/RefSpatial-Bench | point_in_mask_reward_fn |
| lingoqa | LingoQA: Language-grounded QA for autonomous driving | test | runoob1/lingoqa | f1_reward_fn |
| sunrgbd | SUN RGB-D: Depth estimation and scene understanding | test | wyrx/SUNRGBD_seg | depth_reward_fn |
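The refcoco entry is graded on bounding boxes with iou_reward_fn. For readers unfamiliar with the metric, the following is a minimal sketch of intersection-over-union for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format and the helper name `box_iou` are assumptions for this example; rLLM's evaluator may expect a different format or apply a threshold on top.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    # Width and height of the overlap clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs score zero instead of dividing by zero.
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction share a 1x1 overlap, so the score is 1 / (4 + 4 - 1).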

Using custom datasets

Three paths for adding your own dataset:
  • Ad-hoc directory — point rllm eval at a folder on disk; the loader autodetects the shape.
  • Catalog entry + builder — register a builder in rllm/registry/datasets.json so collaborators can rllm dataset pull <name>.
  • Harbor package — drop in any Harbor task package with the harbor: prefix.
See Bring your own dataset for the full walkthrough.
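To make the three paths concrete, here is a rough sketch of how a wrapper script might tell them apart from the argument alone. The helper `classify_dataset_spec` is hypothetical and the order of the checks is an assumption; it is not rLLM's resolution logic, which is described in Bring your own dataset.

```python
from pathlib import Path

HARBOR_PREFIX = "harbor:"


def classify_dataset_spec(spec: str) -> tuple[str, str]:
    """Return (kind, value), where kind is 'harbor', 'directory', or 'catalog'.

    Hypothetical helper for illustration; the check order is an assumption.
    """
    if spec.startswith(HARBOR_PREFIX):
        # Harbor packages are identified by the harbor: prefix.
        return "harbor", spec[len(HARBOR_PREFIX):]
    if Path(spec).is_dir():
        # An existing folder on disk is treated as an ad-hoc directory.
        return "directory", str(Path(spec).resolve())
    # Anything else is assumed to be a catalog name such as gsm8k.
    return "catalog", spec


if __name__ == "__main__":
    for arg in ("gsm8k", "harbor:laude-institute/t-bench-2", "."):
        print(arg, "->", classify_dataset_spec(arg))
```

Checking the prefix first means a local folder whose name happens to begin with harbor: is never mistaken for a directory, at the cost of making such folder names unusable with this helper.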