rLLM’s eval framework is built around two protocols: AgentFlow runs an agent on a task and produces an Episode, and Evaluator scores that Episode. Together they form a clean separation between agent logic and evaluation logic.
AgentFlow
An AgentFlow is any callable that takes a task and returns an Episode. The same flow runs at eval time and at training time: during training, config.base_url points at a model gateway that captures token IDs and logprobs transparently, so your flow code doesn’t change between the two modes.
A flow may implement run (sync) or arun (async); the runner prefers arun when available.
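In code, the contract is small. Here is a minimal sketch of the shape; the class and parameter names below are illustrative, not rLLM's actual definitions:

```python
from typing import Any, Protocol


class AgentFlowLike(Protocol):
    """Illustrative shape of an AgentFlow.

    A flow maps (task, config) to an Episode. It may expose a sync `run`,
    an async `arun`, or both; the runner prefers `arun` when available.
    """

    def run(self, task: Any, config: Any) -> Any: ...          # returns an Episode
    async def arun(self, task: Any, config: Any) -> Any: ...   # returns an Episode
```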
Task and AgentConfig
The two arguments passed to every AgentFlow call are Task and AgentConfig. Task is pure data: the instruction is rendered ahead of time (from a JSONL row, an instruction.md file, or an instruction.md.tpl template), and metadata carries everything the verifier needs; for catalog datasets that is the source row, for sandbox tasks it is the parsed task.toml. See Tasks and the Runner for the loader and execution pipeline.
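As a rough mental model, the two types look something like the sketch below. The field names and defaults are assumptions drawn from this page, not rLLM's exact definitions:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    """Pure data: a pre-rendered instruction plus whatever the verifier needs."""
    instruction: str                                          # rendered from JSONL / instruction.md / .md.tpl
    metadata: dict[str, Any] = field(default_factory=dict)    # source row or parsed task.toml


@dataclass
class AgentConfig:
    """Runtime knobs for the flow; at minimum, where the model lives."""
    base_url: str                 # OpenAI-compatible endpoint (eval) or model gateway (training)
    model: str = "my-model"       # hypothetical field, for illustration only
```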
Example AgentFlow
A minimal math agent that solves a problem in one turn (see cookbooks/math/math_flow.py for the runnable, async version that ships in the repo).
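The following sketch captures the idea using a synchronous OpenAI-compatible client and the illustrative Task/AgentConfig shapes above; the Episode stand-in and prompt are placeholders, and cookbooks/math/math_flow.py remains the authoritative version:

```python
from dataclasses import dataclass, field

from openai import OpenAI


@dataclass
class Episode:
    """Stand-in for rLLM's Episode type, just enough for this sketch."""
    trajectories: list = field(default_factory=list)


def math_flow(task, config):
    """One-turn math agent: send the problem, wrap the reply in an Episode.

    task.instruction, config.base_url, config.model, and the Episode shape
    used here are illustrative; the shipped flow is async and uses rLLM's
    real types.
    """
    client = OpenAI(base_url=config.base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model=config.model,
        messages=[
            {"role": "system", "content": "Solve the problem. Put the final answer in \\boxed{}."},
            {"role": "user", "content": task.instruction},
        ],
    )
    answer = response.choices[0].message.content
    return Episode(
        trajectories=[{
            "messages": [
                {"role": "user", "content": task.instruction},
                {"role": "assistant", "content": answer},
            ]
        }]
    )
```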
Evaluator
An Evaluator scores an Episode produced by an AgentFlow. It examines the task and the episode’s trajectories, then returns an EvalOutput with a reward and correctness flag.
EvalOutput
The result of evaluating a single episode. The signals field lets an evaluator report multiple dimensions of quality; for example, an IFEval evaluator might return separate signals for instruction-following accuracy and format compliance.
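A rough sketch of the shape, with field names inferred from this page rather than copied from rLLM's source:

```python
from dataclasses import dataclass, field


@dataclass
class EvalOutput:
    """Result of scoring one Episode (illustrative field names)."""
    reward: float                                               # scalar reward the trainer can consume
    is_correct: bool                                            # pass/fail flag for accuracy-style metrics
    signals: dict[str, float] = field(default_factory=dict)     # extra quality dimensions
```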
Example Evaluator
A simple exact-match evaluator for math is sketched below. Reward functions follow the same contract: evaluate(task, episode) -> EvalOutput functions exported from a module under rllm.eval.reward_fns. See Reward functions for the convention and how a benchmark wires one up.
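This sketch reuses the illustrative EvalOutput dataclass above and assumes the final assistant message sits at the end of the last trajectory and that the ground truth lives in task.metadata["answer"]; both are assumptions for illustration:

```python
import re


def evaluate(task, episode):
    """Exact-match grading: compare the \\boxed{...} answer against task metadata.

    How the final message and ground-truth answer are extracted here is an
    assumption; a real rllm.eval.reward_fns module follows the same
    evaluate(task, episode) -> EvalOutput convention described above.
    """
    final_message = episode.trajectories[-1]["messages"][-1]["content"]
    match = re.search(r"\\boxed\{([^}]*)\}", final_message)
    predicted = match.group(1).strip() if match else final_message.strip()
    expected = str(task.metadata.get("answer", "")).strip()

    correct = predicted == expected
    return EvalOutput(
        reward=1.0 if correct else 0.0,
        is_correct=correct,
        signals={"exact_match": 1.0 if correct else 0.0},
    )
```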
Built-in reward functions
A non-exhaustive selection from rllm.eval.reward_fns:
| Module | Method | Use case |
|---|---|---|
| rllm.eval.reward_fns.math | \boxed{} extraction + symbolic grading | Math, GSM8K, MATH-500 |
| rllm.eval.reward_fns.mcq | Letter-choice match | Multiple-choice QA, MMLU |
| rllm.eval.reward_fns.code | Test execution | Code generation |
| rllm.eval.reward_fns.llm_judge | LLM-as-judge | Open-ended generation |
| rllm.eval.reward_fns.bfcl | Function-call matching | Tool-use benchmarks |
| rllm.eval.reward_fns.ifeval | Instruction-following checks | IFEval |
How they work together
The eval runner connects AgentFlow and Evaluator in a simple pipeline: it runs the flow on each task to produce an Episode, passes that Episode to evaluator.evaluate(task, episode) to get an EvalOutput, and aggregates the rewards and signals into score results.
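In pseudocode, the loop looks roughly like this; function and variable names are illustrative, and the real runner adds concurrency and result logging:

```python
def run_eval(tasks, flow, evaluator, config):
    """Minimal version of the eval pipeline: run the flow, then score each episode."""
    results = []
    for task in tasks:
        episode = flow(task, config)                  # AgentFlow: task -> Episode
        output = evaluator.evaluate(task, episode)    # Evaluator: Episode -> EvalOutput
        results.append((task, episode, output))

    n = max(len(results), 1)
    accuracy = sum(o.is_correct for _, _, o in results) / n
    mean_reward = sum(o.reward for _, _, o in results) / n
    return results, {"accuracy": accuracy, "mean_reward": mean_reward}
```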
Connection to training
AgentFlow and Evaluator are designed to be reusable across eval and training. The same flow function that powers rllm eval also powers rllm train; the only thing that changes is what config.base_url points at (a minimal sketch follows the list below):
- During eval, config.base_url points at a regular OpenAI-compatible endpoint
- During training, the trainer routes config.base_url through a model gateway that captures token IDs and logprobs alongside the chat-completion response, so policy-gradient training has the data it needs without any change to your flow code
- The Evaluator runs unchanged; its EvalOutput.reward is the per-task reward signal the trainer feeds into advantage computation
- The Episode structure is identical in both paths, so eval results are directly comparable to training metrics
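Concretely, the flow itself never knows which mode it is in; only the config differs. In the sketch below, Task, AgentConfig, and math_flow are the illustrative stand-ins from earlier on this page, and the URLs are placeholders:

```python
# Same flow, two configs; only base_url differs between the two modes.
task = Task(instruction="What is 2 + 3?", metadata={"answer": "5"})

eval_config = AgentConfig(base_url="http://localhost:8000/v1")        # plain OpenAI-compatible server
train_config = AgentConfig(base_url="http://localhost:9000/gateway")  # trainer-managed model gateway

episode = math_flow(task, eval_config)    # what `rllm eval` drives
episode = math_flow(task, train_config)   # what `rllm train` drives; the gateway records token IDs / logprobs
```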
See cookbooks/ for seven worked examples that train end-to-end with this pattern.
