rLLM ships with a built-in catalog of 50+ benchmark datasets spanning math, code, question answering, instruction following, search, vision-language, translation, and agentic tasks. All datasets are auto-pulled from HuggingFace on first use.
```bash
rllm dataset list --all    # See all available datasets
rllm eval gsm8k            # Auto-pulls and evaluates
```

Math

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `gsm8k` | Grade school math word problems | 8.5K train, 1.3K test | `openai/gsm8k` | `math_reward_fn` |
| `math500` | MATH-500 competition math benchmark | 500 test | `HuggingFaceH4/MATH-500` | `math_reward_fn` |
| `countdown` | Countdown arithmetic puzzle | 1K train, 500 test | `predibase/countdown` | `countdown_reward_fn` |
| `hmmt` | HMMT Feb 2025: Harvard-MIT Mathematics Tournament | train | `MathArena/hmmt_feb_2025` | `math_reward_fn` |
| `hmmt_nov` | HMMT Nov 2025: Harvard-MIT Mathematics Tournament | 30 problems | `MathArena/hmmt_nov_2025` | `math_reward_fn` |
| `aime_2025` | AIME 2025: American Invitational Mathematics Exam | 30 problems | `MathArena/aime_2025` | `math_reward_fn` |
| `aime_2026` | AIME 2026: American Invitational Mathematics Exam | 30 problems | `MathArena/aime_2026` | `math_reward_fn` |
| `polymath` | PolyMATH: Multilingual math reasoning across 18 languages | 4 difficulty splits | `Qwen/PolyMath` | `math_reward_fn` |

Code

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `humaneval` | HumanEval: Function-level code generation | 164 problems | `openai/openai_humaneval` | `code_reward_fn` |
| `mbpp` | MBPP: Python programming benchmark | 974 problems | `google-research-datasets/mbpp` | `code_reward_fn` |
| `livecodebench` | LiveCodeBench: Contamination-free competitive programming | test | `livecodebench/code_generation` | `code_reward_fn` |
| `swebench_verified` | SWE-bench Verified: Real-world GitHub issues for SWE agents | 500 test | `princeton-nlp/SWE-bench_Verified` | `swebench_reward_fn` |

Multiple choice (MCQ)

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `mmlu_pro` | MMLU-Pro: Expert-level MCQ with 10 options | 12K test | `TIGER-Lab/MMLU-Pro` | `mcq_reward_fn` |
| `mmlu_redux` | MMLU-Redux: Curated MMLU subset with error fixes | 3K test | `edinburgh-dawg/mmlu-redux` | `mcq_reward_fn` |
| `gpqa_diamond` | GPQA: Expert-level graduate science QA | 448 questions | `ankner/gpqa` | `mcq_reward_fn` |
| `supergpqa` | SuperGPQA: Graduate-level QA across 285 disciplines | 26.5K | `m-a-p/SuperGPQA` | `mcq_reward_fn` |
| `ceval` | C-Eval: Chinese evaluation across 52 disciplines | 13.9K | `ceval/ceval-exam` | `mcq_reward_fn` |
| `mmmlu` | MMMLU: Multilingual MMLU across 14 languages | 15.9K/lang | `openai/MMMLU` | `mcq_reward_fn` |
| `mmlu_prox` | MMLU-ProX: Multilingual MMLU-Pro across 29 languages | 11.8K/lang | `li-lab/MMLU-ProX` | `mcq_reward_fn` |
| `include` | INCLUDE: Multilingual knowledge from local exams, 44 languages | test | `CohereLabs/include-base-44` | `mcq_reward_fn` |
| `global_piqa` | Global PIQA: Physical commonsense reasoning, 100+ languages | test | `mrlbenchmarks/global-piqa-nonparallel` | `mcq_reward_fn` |
| `longbench_v2` | LongBench v2: Long-context understanding MCQ | test | `THUDM/LongBench-v2` | `mcq_reward_fn` |

Question answering

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `hotpotqa` | HotpotQA: Multi-hop question answering | 7.4K validation | `hotpotqa/hotpot_qa` | `f1_reward_fn` |
| `aa_lcr` | AA-LCR: Long-context reasoning over ~100K-token documents | 100 questions | `ArtificialAnalysis/AA-LCR` | `llm_equality_reward_fn` |
| `hle` | HLE: Humanity's Last Exam, expert-level questions | 2,500 test | `cais/hle` | `llm_equality_reward_fn` |

`hle` and `hle_search` are gated datasets on HuggingFace. Run `huggingface-cli login` before pulling them.
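For example, after authenticating with an account that has access to the gated dataset:

```bash
# Authenticate with HuggingFace, then pull and evaluate the gated dataset
huggingface-cli login
rllm eval hle
```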

Instruction following

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `ifeval` | IFEval: Instruction following with verifiable constraints | 541 | `google/IFEval` | `ifeval_reward_fn` |
| `ifbench` | IFBench: Out-of-distribution instruction following | test | `allenai/IFBench_test` | `ifeval_reward_fn` |

Search

Datasets in this category use the search agent, which requires a search backend. Set one with `--search-backend serper` or `--search-backend brave`; see the example below the table.

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `browsecomp` | BrowseComp: Web browsing comprehension | 200 test | `Tevatron/browsecomp-plus` | `llm_equality_reward_fn` |
| `seal0` | Seal-0: Search-augmented QA with freshness metadata | test | `vtllms/sealqa` | `llm_equality_reward_fn` |
| `widesearch` | WideSearch: Broad web search with structured table output | 200 | `ByteDance-Seed/WideSearch` | `widesearch_reward_fn` |
| `hle_search` | HLE + Search: Humanity's Last Exam with web search tools | test | `cais/hle` | `llm_equality_reward_fn` |
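
For example, to evaluate a search dataset with Serper as the backend (this assumes your Serper API credentials are already configured):

```bash
# Evaluate a web-search dataset using Serper as the search backend
rllm eval browsecomp --search-backend serper
```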

Agentic

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `bfcl` | BFCL: Berkeley Function Calling Leaderboard (exec_simple) | test | `gorilla-llm/Berkeley-Function-Calling-Leaderboard` | `bfcl_reward_fn` |
| `multichallenge` | MultiChallenge: Multi-turn conversation evaluation | test | `nmayorga7/multichallenge` | `llm_judge_reward_fn` |
| `frozenlake` | FrozenLake: Grid navigation (procedurally generated) | train, test | Generated | `frozenlake_reward_fn` |

Translation

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `wmt24pp` | WMT24++: Machine translation across 55 languages (ChrF) | train | `google/wmt24pp` | `translation_reward_fn` |

Vision-language (VLM)

These datasets contain images and require a vision-capable model.

| Dataset | Description | Size | Source | Evaluator |
| --- | --- | --- | --- | --- |
| `mmmu` | MMMU: Multi-discipline multimodal understanding | 900 validation | `MMMU/MMMU` | `mcq_reward_fn` |
| `mmmu_pro` | MMMU-Pro: Harder multimodal understanding, 10 options | 1,730 test | `MMMU/MMMU_Pro` | `mcq_reward_fn` |
| `mathvision` | MathVision: Visual math reasoning | 304 testmini | `MathLLMs/MathVision` | `math_reward_fn` |
| `mathvista` | MathVista: Visual math across diverse tasks | 1,000 testmini | `AI4Math/MathVista` | `math_reward_fn` |
| `dynamath` | DynaMath: Dynamic visual math with 10 variants | 5,010 | `DynaMath/DynaMath_Sample` | `math_reward_fn` |
| `zerobench` | ZEROBench: Zero-shot visual reasoning | 100 questions | `jonathan-roberts1/zerobench` | `llm_equality_reward_fn` |
| `zerobench_sub` | ZEROBench Subquestions: Decomposed visual reasoning | 334 subquestions | `jonathan-roberts1/zerobench` | `llm_equality_reward_fn` |
| `vlmsareblind` | VLMs Are Blind: Visual perception benchmark | 8,020 valid | `XAI/vlmsareblind` | `f1_reward_fn` |
| `babyvision` | BabyVision: Early visual understanding MCQ | 388 questions | `UnipatAI/BabyVision` | `llm_equality_reward_fn` |
| `ai2d` | AI2D: Science diagram understanding MCQ | 3,088 test | `lmms-lab/ai2d` | `mcq_reward_fn` |
| `ocrbench` | OCRBench: OCR and text recognition | 1,000 test | `echo840/OCRBench` | `f1_reward_fn` |
| `charxiv` | CharXiv: Chart understanding reasoning | 1,000 validation | `princeton-nlp/CharXiv` | `llm_equality_reward_fn` |
| `cc_ocr` | CC-OCR: Multi-scene OCR with 4 sub-tasks | 7,058 test | `wulipc/CC-OCR` | `f1_reward_fn` |
| `countbenchqa` | CountBenchQA: Visual object counting QA | 491 test | `vikhyatk/CountBenchQA` | `f1_reward_fn` |
| `erqa` | ERQA: Entity recognition QA with multi-image support | 400 test | `FlagEval/ERQA` | `mcq_reward_fn` |
| `geo3k` | Geometry3K: Geometry problems with diagrams | 2.4K train, 601 test | `hiyouga/geometry3k` | `math_reward_fn` |
| `omnidocbench` | OmniDocBench: Comprehensive document understanding | test | `rwood-97/english_OmniDocBench_with_eval` | `f1_reward_fn` |
| `docvqa` | DocVQA: Single-page document visual QA | 5,188 validation | `lmms-lab/DocVQA` | `f1_reward_fn` |
| `refcoco` | RefCOCO: Referring expression comprehension (bounding box) | test | `lmms-lab/RefCOCO` | `iou_reward_fn` |
| `refspatial` | RefSpatial-Bench: Spatial reasoning with point prediction | test | `BAAI/RefSpatial-Bench` | `point_in_mask_reward_fn` |
| `lingoqa` | LingoQA: Language-grounded QA for autonomous driving | test | `runoob1/lingoqa` | `f1_reward_fn` |
| `sunrgbd` | SUN RGB-D: Depth estimation and scene understanding | test | `wyrx/SUNRGBD_seg` | `depth_reward_fn` |

Using custom datasets

You can register your own datasets for use with `rllm eval` and `rllm train`:

```bash
rllm dataset register my-dataset --file data.jsonl --category math
rllm eval my-dataset --agent react --evaluator math_reward_fn
```

Your data file should contain `question` and `ground_truth` fields. Supported formats: JSON, JSONL, CSV, and Parquet. For more details, see `rllm dataset`.
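
For example, a minimal custom math dataset could be a JSONL file with one record per line; the file contents below are purely illustrative, while the commands match the ones above:

```bash
# data.jsonl: one JSON object per line with question and ground_truth fields
cat > data.jsonl << 'EOF'
{"question": "What is 12 * 7?", "ground_truth": "84"}
{"question": "What is 15% of 200?", "ground_truth": "30"}
EOF

# Register under the math category, then evaluate with the math evaluator
rllm dataset register my-dataset --file data.jsonl --category math
rllm eval my-dataset --agent react --evaluator math_reward_fn
```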