A reward function is a function with the signature evaluate(task: Task, episode: Episode) -> EvalOutput. Reward functions live under rllm.eval.reward_fns (built-ins) or in a benchmark’s own directory (tests/evaluate.py). The same function powers both eval scoring and RL reward; there is no separate “reward” vs “evaluator” code path.
The convention
Every reward function has the same shape:
evaluate(task, episode) → EvalOutput is the contract. The Runner calls it once per Task and writes reward back onto each trajectory.
SYSTEM_PROMPT (optional, module-level) describes the output format the grader expects. Harnesses inject it into their system prompt so the model produces parseable output. Without this, a math harness might silently grade “the answer is 42” as wrong because the grader expects \boxed{42}.
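A minimal sketch of that shape, written as a letter-choice grader. The Task, Episode, and EvalOutput classes are hypothetical stand-ins so the example runs on its own; the real rLLM types and their field names are not shown in this section and may differ.

```python
import re
from dataclasses import dataclass, field


# Stand-in types (hypothetical field names, not the rLLM definitions).
@dataclass
class Task:
    question: str
    answer: str  # gold letter, e.g. "B"


@dataclass
class Episode:
    response: str  # final model output


@dataclass
class EvalOutput:
    reward: float
    is_correct: bool
    metadata: dict = field(default_factory=dict)


# Optional module-level hint describing the format this grader can parse.
SYSTEM_PROMPT = "Answer with a single letter (A, B, C, or D) on the last line."


def evaluate(task: Task, episode: Episode) -> EvalOutput:
    """Grade a multiple-choice response by its last standalone capital letter."""
    letters = re.findall(r"\b([A-D])\b", episode.response)
    predicted = letters[-1] if letters else None
    correct = predicted == task.answer.strip().upper()
    return EvalOutput(
        reward=1.0 if correct else 0.0,
        is_correct=correct,
        metadata={"predicted": predicted},
    )
```

The grader and its SYSTEM_PROMPT sit in the same module on purpose: the format the model is asked for and the format the parser accepts can then only change together.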
Built-in reward functions
rllm.eval.reward_fns ships graders for common benchmark shapes:
| Module | Grades | Used by |
|---|---|---|
| math | \boxed{} extraction + symbolic equivalence | gsm8k, MATH-500, hendrycks_math, deepscaler |
| mcq | Single-letter choice match | MMLU, HellaSwag, ARC |
| code | Test execution | HumanEval, MBPP |
| f1 | Token-level F1 | TriviaQA, NaturalQuestions |
| countdown | Arithmetic puzzle solver check | Countdown |
| bfcl | Function-call signature matching | BFCL |
| ifeval | Instruction-following constraint checks | IFEval |
| llm_judge | LLM-as-judge | Open-ended generation |
| claw_eval | LLM-judge over the full agent trajectory (task completion) | Claw-Eval |
| llm_equality | LLM-graded equivalence | Free-form QA |
| translation | BLEU / chrF | MT benchmarks |
| widesearch | Source-grounded checks | Wide-search agents |
| iou, point_in_mask, depth | Vision metrics | VLM benchmarks |
The built-in modules are thin wrappers around the rllm.rewards infrastructure. The wrappers stay tiny on purpose: they exist to satisfy the evaluate(task, episode) contract, declare a SYSTEM_PROMPT, and pass through to the canonical grading logic.
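The wrapper pattern might look like the sketch below. Here token_f1 plays the role of the canonical grading logic; it is written inline for illustration and is not the rllm.rewards implementation. The Task, Episode, and EvalOutput classes are again hypothetical stand-ins.

```python
from collections import Counter
from dataclasses import dataclass


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (illustrative logic)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Stand-in types (hypothetical, not the rLLM definitions).
@dataclass
class Task:
    answer: str


@dataclass
class Episode:
    response: str


@dataclass
class EvalOutput:
    reward: float


SYSTEM_PROMPT = "Reply with a short answer only, no explanation."


def evaluate(task: Task, episode: Episode) -> EvalOutput:
    """Thin wrapper: adapt the contract, then delegate to the grading logic."""
    return EvalOutput(reward=token_f1(episode.response, task.answer))
```

All scoring decisions stay in token_f1, so the wrapper has nothing of its own to get wrong beyond mapping fields in and a reward out.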
Wiring a verifier into a benchmark
A benchmark declares its verifier in dataset.toml (data tasks) or task.toml (sandbox tasks):
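Built-in reward function
Point the [verifier] table's import_path at the built-in module, for example import_path = "rllm.eval.reward_fns.math:evaluate".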
Custom reward function
Drop a tests/evaluate.py next to your dataset:
The file must define an evaluate function. You can reuse helpers from rllm.eval.reward_fns; for example, import its boxed-answer extraction so you don’t reinvent it.
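As a rough sketch, a custom tests/evaluate.py could look like this. The extract_boxed function is a local stand-in for the boxed-answer helper you would normally import from rllm.eval.reward_fns (its actual name is not given here), and the Task, Episode, and EvalOutput classes are hypothetical. Unlike the built-in math grader, this sketch compares strings exactly and does no symbolic equivalence.

```python
from dataclasses import dataclass
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


# Stand-in types (hypothetical, not the rLLM definitions).
@dataclass
class Task:
    answer: str


@dataclass
class Episode:
    response: str


@dataclass
class EvalOutput:
    reward: float
    is_correct: bool


SYSTEM_PROMPT = "Put your final answer in \\boxed{}"


def evaluate(task: Task, episode: Episode) -> EvalOutput:
    """Exact-match grading on the boxed answer."""
    predicted = extract_boxed(episode.response)
    correct = predicted is not None and predicted == task.answer.strip()
    return EvalOutput(reward=1.0 if correct else 0.0, is_correct=correct)
```

A response such as "the answer is 42" scores 0.0 here because nothing is boxed, which is the failure mode the SYSTEM_PROMPT hint is meant to prevent.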
Shell-script verifier (sandbox tasks)
For Harbor-style sandbox tasks, the verifier is typically a shell script that runs inside the sandbox:
How the system-prompt hint flows
The SYSTEM_PROMPT constant on a reward function is how a verifier tells the harness what output format it expects:
Reward function declares the format
rllm.eval.reward_fns.math exports SYSTEM_PROMPT = "Put your final answer in \\boxed{}".
Dataset names the verifier
dataset.toml has [verifier] import_path = "rllm.eval.reward_fns.math:evaluate".
Harness reads the hint
ReActHarness calls get_verifier_system_prompt(task), which loads the verifier module and returns its SYSTEM_PROMPT.
Switching a benchmark from \boxed{}-style to letter-choice-style is a one-line import_path change in dataset.toml, with no harness modifications.
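The three steps can be approximated in plain Python. This is an illustrative sketch, not rLLM's implementation: verifier_system_prompt and build_system_prompt are hypothetical names, the "module:attribute" format is taken from the import_path shown above, and a throwaway module is registered in sys.modules to stand in for rllm.eval.reward_fns.math so the example runs without rLLM installed.

```python
import importlib
import sys
import types
from typing import Optional

# Step 1: a stand-in verifier module (hypothetical name) declares the format.
demo = types.ModuleType("demo_math_verifier")
demo.SYSTEM_PROMPT = "Put your final answer in \\boxed{}"
demo.evaluate = lambda task, episode: None  # grading logic omitted
sys.modules["demo_math_verifier"] = demo

# Step 2: mirrors the [verifier] table of a dataset.toml.
verifier_config = {"import_path": "demo_math_verifier:evaluate"}


def verifier_system_prompt(import_path: str) -> Optional[str]:
    """Step 3: load the verifier's module and return its SYSTEM_PROMPT, if any."""
    module_name, _, _attr = import_path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, "SYSTEM_PROMPT", None)


def build_system_prompt(base: str, import_path: str) -> str:
    """Append the verifier's format hint to a harness's base system prompt."""
    hint = verifier_system_prompt(import_path)
    return f"{base}\n\n{hint}" if hint else base


prompt = build_system_prompt("You are a careful solver.", verifier_config["import_path"])
```

Because the hint is looked up from whatever module import_path names, pointing the config at a different verifier changes the injected format without touching the harness code.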
Loading a reward function programmatically
Outside the CLI (e.g. when wiring a reward function into a custom training loop), use load_evaluator:
load_evaluator returns an object that satisfies the Evaluator protocol. Whether the underlying source is a function or a class, the call site is the same.
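One way to picture why the call site stays the same is the adapter sketch below. Everything in it is an assumption made for illustration: the Evaluator protocol is modelled as a single evaluate method, and as_evaluator and FunctionEvaluator are hypothetical names. load_evaluator itself is not called because its signature is not shown in this section.

```python
from typing import Any, Callable, Protocol, runtime_checkable


@runtime_checkable
class Evaluator(Protocol):
    """Assumed protocol shape: anything with evaluate(task, episode)."""

    def evaluate(self, task: Any, episode: Any) -> Any: ...


class FunctionEvaluator:
    """Adapts a plain evaluate(task, episode) function to the protocol."""

    def __init__(self, fn: Callable[[Any, Any], Any]) -> None:
        self._fn = fn

    def evaluate(self, task: Any, episode: Any) -> Any:
        return self._fn(task, episode)


def as_evaluator(source: Any) -> Evaluator:
    """Normalize a function, a class, or an instance into an Evaluator."""
    if isinstance(source, type):
        source = source()  # class source: assumes a no-argument constructor
    if isinstance(source, Evaluator):
        return source
    if callable(source):
        return FunctionEvaluator(source)
    raise TypeError(f"cannot build an evaluator from {source!r}")


# Two hypothetical sources, one function and one class.
def evaluate_fn(task: Any, episode: Any) -> float:
    return 1.0


class EvaluateCls:
    def evaluate(self, task: Any, episode: Any) -> float:
        return 1.0


# Identical call site regardless of what the source was.
rewards = [as_evaluator(src).evaluate(None, None) for src in (evaluate_fn, EvaluateCls)]
```

Normalizing at load time keeps the branching on source type in one place, so a training loop only ever sees an object with an evaluate method.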
