rllm eval runs a model against a benchmark and reports how well it did. A single command pulls the dataset, spins up the agent, routes model calls, scores each result, and prints accuracy:
rllm eval gsm8k
This guide walks through the whole workflow — from picking what to evaluate to controlling how the run executes — without assuming any knowledge of rLLM’s internals.
New to the CLI? Start with the Quick start to configure a model provider (rllm model setup). Everything below assumes you’ve done that once.

The shape of a command

rllm eval <benchmark> [--agent <harness>] [--model <name>] [options]
Only the benchmark is required. When you omit the rest, rLLM fills in sensible defaults: the dataset’s default agent, your configured model, and the benchmark’s built-in scorer.

1. Choose a dataset: what you want to measure (math, coding, QA, agentic tasks).
2. Choose an agent harness: how the model attempts each task (a single call, a tool loop, or a full coding CLI in a sandbox).
3. Point at a model: your configured provider, or any OpenAI-compatible endpoint.
4. Tune the run: limit examples, set concurrency, save outputs.
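
If you drive evaluations from a script, the command shape above can be assembled programmatically. The sketch below is a hypothetical helper (build_eval_command is not part of rLLM). It emits only flags documented on this page and leaves out anything you do not set, so rLLM’s defaults still apply.

```python
import shlex
from typing import Optional


def build_eval_command(
    benchmark: str,
    agent: Optional[str] = None,
    model: Optional[str] = None,
    max_examples: Optional[int] = None,
) -> list[str]:
    """Assemble an `rllm eval` argv list (hypothetical helper, not an rLLM API)."""
    argv = ["rllm", "eval", benchmark]  # only the benchmark is required
    if agent is not None:
        argv += ["--agent", agent]
    if model is not None:
        argv += ["--model", model]
    if max_examples is not None:
        argv += ["--max-examples", str(max_examples)]
    return argv


if __name__ == "__main__":
    # Print the commands instead of running them; pass the list to
    # subprocess.run(...) once rllm is installed and configured.
    print(shlex.join(build_eval_command("gsm8k")))
    print(shlex.join(build_eval_command("gsm8k", model="gpt-4o", max_examples=20)))
```

Building an argument list rather than a single string avoids shell-quoting problems when a benchmark or model name contains unusual characters.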

Step 1 — Choose a dataset

rLLM ships a catalog of 50+ benchmarks across math, code, QA, instruction-following, search, vision-language, and agentic tasks. Browse and preview them before committing to a run.
rllm dataset list --all          # the full catalog
rllm dataset list --category code # filter by category
rllm dataset info swebench-verified  # splits, size, default agent
rllm dataset inspect gsm8k -n 3      # peek at a few real rows
The Supported datasets page lists every benchmark with its size, source, and default evaluator. Use it to find the exact name to pass to rllm eval.
You can evaluate three kinds of dataset sources:
A name from the built-in catalog. Auto-pulled from HuggingFace on first use.
rllm eval gsm8k
rllm eval humaneval

Step 2 — Choose an agent harness

The agent harness decides how the model actually attempts each task. Pass it with --agent; if you omit it, the dataset’s default is used (react for most data benchmarks).
rllm agent list   # see every built-in harness
| Harness | What it does | Good for |
| --- | --- | --- |
| react | One-shot LLM call | Math, multiple-choice, QA (the default for data tasks) |
| bash | Multi-turn ReAct bash loop in a sandbox | General agentic / shell tasks |
| mini-swe-agent | Lightweight SWE agent in a sandbox | SWE-bench-style code repair |
| claude-code, codex, aider, opencode, qwen-code, kimi-cli, zeroclaw | Run a real coding-agent CLI inside the sandbox | Benchmarking coding agents on repo tasks |
| oracle | Runs the task’s reference solution (no LLM) | Smoke-testing that an environment + verifier work before spending tokens |
# Evaluate a coding agent on a sandboxed benchmark
rllm eval harbor:swebench-verified --agent mini-swe-agent --sandbox-backend modal

# Sanity-check the environment with no model calls
rllm eval harbor:swebench-verified --agent oracle --sandbox-backend modal
--agent oracle is the cheapest way to confirm a sandboxed benchmark is wired correctly: it runs the gold solution and should score ~100%. If it doesn’t, the problem is the environment, not your model.
You can also point --agent at your own scaffold — a registry name (rllm init my-agent) or a module:object import path.

Step 3 — Point at a model

By default, eval uses the model from rllm model setup and starts a local LiteLLM proxy to route requests. Override either piece:
rllm eval gsm8k --model gpt-4o
Uses your saved provider + API key; the proxy is started for you.

Step 4 — Tune the run

These flags shape how the evaluation executes and what it leaves behind.
--max-examples (int): Evaluate only the first N examples. Ideal for a quick smoke test before a full run.
--task-indices (string): Run specific rows only: '0', '3,7,12', or a range '0-9'. Great for re-running a handful of failures.
--split (string): Which dataset split to use (defaults to the benchmark’s eval split, usually test).
--concurrency (int, default: 64): How many tasks run in parallel. Lower it if you hit provider rate limits; raise it for throughput.
--evaluator (string): Override the scorer for every task (a registry name or module:class). By default each benchmark brings its own.
--output (string): Where to write the results JSON. Defaults to a timestamped run directory under ~/.rllm/eval_results/.
--save-episodes / --no-save-episodes (default: enabled): Save each task’s full trajectory as JSON for later inspection with rllm view.
# A fast, reproducible dev loop
rllm eval gsm8k --max-examples 20 --concurrency 16

# Re-run just three rows and save results to a known path
rllm eval gsm8k --task-indices 4,11,27 --output runs/retry.json
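
To make the --task-indices formats concrete, here is a minimal sketch of how such a specification could be expanded into row numbers. It is not rLLM’s parser: parse_task_indices is a hypothetical name, and two behaviors are assumptions rather than documented facts, namely that a range such as '0-9' includes both endpoints and that single values and ranges can be mixed in one comma-separated string.

```python
def parse_task_indices(spec: str) -> list[int]:
    """Expand '0', '3,7,12', or '0-9' into a sorted list of unique row indices.

    Hypothetical helper for illustration; rLLM's own parsing may differ.
    """
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as '1,,2'
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            # ASSUMPTION: ranges are inclusive at both ends.
            indices.update(range(start, end + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


if __name__ == "__main__":
    print(parse_task_indices("4,11,27"))
    print(parse_task_indices("0-9"))
```

A helper like this is handy for turning a list of failed rows from one run into the string you pass to the next.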
When the run finishes, rLLM prints accuracy, an error count, and per-signal metrics, and tells you how to browse the saved episodes:
rllm view <run-id>

Sandboxed (agentic) evaluations

Coding agents and Harbor tasks run inside an isolated sandbox — a container with the task’s repo, dependencies, and test suite. Pick where those sandboxes run with --sandbox-backend:
| Backend | Where it runs | Notes |
| --- | --- | --- |
| docker | Your local Docker daemon | Default; requires Docker installed |
| local | The host machine directly | No isolation; fastest |
| modal, daytona, e2b, runloop, gke | Cloud | No local Docker needed; scales out |
rllm eval harbor:swebench-verified \
  --agent mini-swe-agent \
  --sandbox-backend modal \
  --sandbox-concurrency 8
--sandbox-concurrency caps how many sandboxes exist at once, independent of --concurrency. Use it to stay within a cloud backend’s limits.
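
One way to picture two independent caps is as two separate semaphores: a task first takes a task slot, then waits for a sandbox slot before it can do any work. The sketch below models that with asyncio and dummy tasks. It is an illustration of the idea only, not rLLM’s scheduler, and every name in it is hypothetical.

```python
import asyncio


async def run_all(num_tasks: int, concurrency: int, sandbox_concurrency: int) -> tuple[int, int]:
    """Run dummy tasks under two independent caps; return the peak counts observed."""
    task_slots = asyncio.Semaphore(concurrency)             # stands in for --concurrency
    sandbox_slots = asyncio.Semaphore(sandbox_concurrency)  # stands in for --sandbox-concurrency
    running = {"tasks": 0, "sandboxes": 0}
    peak = {"tasks": 0, "sandboxes": 0}

    async def one_task() -> None:
        async with task_slots:
            running["tasks"] += 1
            peak["tasks"] = max(peak["tasks"], running["tasks"])
            async with sandbox_slots:  # the task waits here until a sandbox slot frees up
                running["sandboxes"] += 1
                peak["sandboxes"] = max(peak["sandboxes"], running["sandboxes"])
                await asyncio.sleep(0.01)  # placeholder for the agent's work in the sandbox
                running["sandboxes"] -= 1
            running["tasks"] -= 1

    await asyncio.gather(*(one_task() for _ in range(num_tasks)))
    return peak["tasks"], peak["sandboxes"]


if __name__ == "__main__":
    peak_tasks, peak_sandboxes = asyncio.run(
        run_all(num_tasks=32, concurrency=16, sandbox_concurrency=8)
    )
    print(f"peak tasks in flight: {peak_tasks}, peak sandboxes: {peak_sandboxes}")
```

In this model the number of live sandboxes never exceeds the sandbox cap, however high the task cap is set, which is the property you rely on when a cloud backend limits how many sandboxes you may hold at once.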

Accelerating cold start with snapshots

Every sandboxed task pays a cold start before any real work happens: the backend pulls an image (often multi-gigabyte) and replays the environment’s setup steps. For large or repeated evaluations this can dominate wall-clock time. A snapshot bakes that one-time work into a reusable backend artifact, so tasks boot from a warmed environment in roughly a second. Snapshots are managed like datasets. You create them once, they persist, and rllm eval uses them automatically:
1. Create snapshots for a dataset (or a slice)

# Build snapshots for the whole benchmark, or just the slice you'll evaluate
rllm snapshot create harbor:swebench-verified --sandbox-backend modal
rllm snapshot create harbor:swebench-verified --sandbox-backend modal --max-examples 5

2. Evaluate (snapshots are used transparently)

# Boots from a snapshot whenever one exists; falls back to cold otherwise
rllm eval harbor:swebench-verified --sandbox-backend modal

3. Inspect or remove them

rllm snapshot list                                            # what you have, by dataset
rllm snapshot destroy harbor:swebench-verified --sandbox-backend modal

What a snapshot is

A snapshot is keyed by the environment — its base image plus Dockerfile setup steps — not by an individual task. So every task that shares an environment uses the same snapshot, and one set of snapshots is reused by eval and training alike (including all the group copies a training run makes of a task). The agent harness is not part of the key, so the same snapshots serve any harness.
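
The keying rule can be sketched as a hash over the environment definition alone. The scheme below is hypothetical (rLLM’s actual key format is not documented here), and the image name and setup steps are made-up examples. The point is only that the task identity and the agent harness are not inputs, so tasks with the same environment map to the same key.

```python
import hashlib
import json


def environment_key(base_image: str, setup_steps: list[str]) -> str:
    """Hash the environment definition (hypothetical scheme, not rLLM's actual key)."""
    # Serialize deterministically so equal environments always hash equally.
    payload = json.dumps({"image": base_image, "steps": setup_steps}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Two hypothetical tasks sharing one environment, and a third with an extra setup step.
task_a = environment_key("python:3.11", ["pip install -e ."])
task_b = environment_key("python:3.11", ["pip install -e ."])
task_c = environment_key("python:3.11", ["pip install -e .", "pip install pytest"])

assert task_a == task_b  # same environment, so one shared snapshot
assert task_a != task_c  # different setup steps, so a separate snapshot
```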
A sandbox is ephemeral — created for one rollout and torn down right after. A snapshot persists until you remove it, like a dataset. Evaluation never creates or destroys snapshots; it only reads them. Building and removing them is always an explicit rllm snapshot create / destroy.
Snapshots are stored as image diffs, so they cost almost nothing while idle (no sandbox is running). Each carries a local time-to-live, set with --ttl-hours (default: 7 days). Once it expires, rLLM stops trusting the local entry and falls back to cold rather than risk a stale boot.
| | Cold (default fallback) | Snapshot |
| --- | --- | --- |
| Setup re-paid per task | Yes | No; boots from the warmed environment |
| Idle cost | None | Negligible (storage only) |
| Reused across runs | | Yes, by eval and training |
| Created by | | rllm snapshot create (never by a run) |
Snapshots are available only on cloud backends with a snapshot mechanism (modal, daytona). The docker and local backends always take the cold path, because their local image cache already amortizes the setup cost. To force the cold path on a backend that has snapshots (for A/B timing, say), pass --no-snapshot:
rllm eval harbor:swebench-verified --sandbox-backend modal --no-snapshot
If a snapshot was removed on the backend out of band, the eval notices on boot and transparently falls back to cold for that task — a run never fails because a snapshot went missing.
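
Taken together, the rules in this section amount to a short decision procedure. The sketch below restates them as code under stated assumptions: choose_boot_path and LocalSnapshotEntry are hypothetical names, the local entry is modeled as just a creation time plus a TTL, and the backend check is collapsed into a boolean, whereas the text says the real check happens when the sandbox boots.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Backends the text names as having a snapshot mechanism.
SNAPSHOT_BACKENDS = {"modal", "daytona"}


@dataclass
class LocalSnapshotEntry:
    """Hypothetical local record of a snapshot; not rLLM's actual data model."""
    created_at: datetime
    ttl_hours: float = 7 * 24  # the 7-day default from the text, expressed in hours


def choose_boot_path(
    backend: str,
    entry: Optional[LocalSnapshotEntry],
    exists_on_backend: bool,
    no_snapshot: bool = False,
    now: Optional[datetime] = None,
) -> str:
    """Return 'snapshot' or 'cold' following the rules described above."""
    now = now or datetime.now(timezone.utc)
    if no_snapshot or backend not in SNAPSHOT_BACKENDS:
        return "cold"  # forced cold, or a backend without snapshots
    if entry is None:
        return "cold"  # no snapshot was ever created for this environment
    if now - entry.created_at > timedelta(hours=entry.ttl_hours):
        return "cold"  # local entry is past its time-to-live
    if not exists_on_backend:
        return "cold"  # removed out of band; fall back instead of failing
    return "snapshot"
```

Every branch that cannot use a snapshot resolves to a cold boot rather than an error, which mirrors the guarantee that a run never fails because a snapshot went missing.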

What’s next

Supported datasets: every benchmark with its size, source, and evaluator.
CLI reference: all commands, flags, and the web UI.
Agent harnesses: how harnesses drive the model through a task.
Evaluators: how results are scored.