`rllm eval` runs a model against a benchmark and reports how well it did. A single command pulls the dataset, spins up the agent, routes model calls, scores each result, and prints accuracy.
New to the CLI? Start with the Quick start to configure a model provider (`rllm model setup`). Everything below assumes you’ve done that once.

The shape of a command
Choose an agent harness
How the model attempts each task — a single call, a tool loop, or a full coding CLI in a sandbox.
Step 1 — Choose a dataset
rLLM ships a catalog of 50+ benchmarks across math, code, QA, instruction-following, search, vision-language, and agentic tasks. Browse and preview them before committing to a run. A dataset can come from one of three sources:

- Catalog benchmark: a name from the built-in catalog, auto-pulled from Hugging Face on first use.
- Harbor (agentic) dataset
- Local directory
Step 2 — Choose an agent harness
The agent harness decides how the model actually attempts each task. Pass it with `--agent`; if you omit it, the dataset’s default is used (`react` for most data benchmarks).
| Harness | What it does | Good for |
|---|---|---|
| `react` | One-shot LLM call | Math, multiple-choice, QA (the default for data tasks) |
| `bash` | Multi-turn ReAct bash loop in a sandbox | General agentic / shell tasks |
| `mini-swe-agent` | Lightweight SWE agent in a sandbox | SWE-bench-style code repair |
| `claude-code`, `codex`, `aider`, `opencode`, `qwen-code`, `kimi-cli`, `zeroclaw` | Run a real coding-agent CLI inside the sandbox | Benchmarking coding agents on repo tasks |
| `oracle` | Runs the task’s reference solution (no LLM) | Smoke-testing that an environment + verifier work before spending tokens |

You can also point `--agent` at your own scaffold: a registry name (`rllm init my-agent`) or a `module:object` import path.
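
To make the `module:object` form concrete, here is a minimal sketch of how a path in that shape is commonly resolved in Python. The helper name `resolve_import_path` is hypothetical and this is not rLLM's actual loader; it only illustrates the convention of a module name, a colon, and then an attribute inside that module.

```python
import importlib
from typing import Any


def resolve_import_path(spec: str) -> Any:
    """Resolve a 'module:object' string to the object it names.

    Hypothetical helper for illustration; rLLM's own loader may differ.
    """
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:object', got {spec!r}")

    # Import the module part, e.g. 'my_pkg.agents'.
    obj: Any = importlib.import_module(module_name)

    # Walk dotted attributes so 'pkg.mod:Outer.inner' also resolves.
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    return obj


if __name__ == "__main__":
    # Standard-library stand-in for a user-defined agent object.
    sqrt = resolve_import_path("math:sqrt")
    print(sqrt(9.0))  # prints 3.0
```

Under this convention, a scaffold defined as `my_agent` in a file `my_pkg/agents.py` (both names made up here) would be referenced as `my_pkg.agents:my_agent`.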
Step 3 — Point at a model
By default, `eval` uses the model from `rllm model setup` and starts a local LiteLLM proxy to route requests. Override either piece:
- Configured provider
- Your own endpoint
Step 4 — Tune the run
These flags shape how the evaluation executes and what it leaves behind.

Evaluate only the first N examples. Ideal for a quick smoke test before a full run.

Run specific rows only: `'0'`, `'3,7,12'`, or a range `'0-9'`. Great for re-running a handful of failures.

Which dataset split to use (defaults to the benchmark’s eval split, usually `test`).

How many tasks run in parallel. Lower it if you hit provider rate limits; raise it for throughput.
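
The row-selection syntax described above (`'0'`, `'3,7,12'`, `'0-9'`) can be read as a small grammar: comma-separated items, each either a single index or a range. The sketch below shows one way such a spec could be expanded. The function name `parse_row_spec` is hypothetical, and two behaviors are assumptions rather than documented facts: ranges are treated as inclusive, and single indices and ranges may be mixed in one spec.

```python
def parse_row_spec(spec: str) -> list[int]:
    """Expand a row spec such as '0', '3,7,12', or '0-9' into sorted indices.

    Hypothetical illustration; assumes inclusive ranges and allows mixing
    single indices with ranges (e.g. '0-2,7').
    """
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"range end before start: {part!r}")
            indices.update(range(start, end + 1))  # inclusive upper bound
        else:
            indices.add(int(part))
    return sorted(indices)


if __name__ == "__main__":
    print(parse_row_spec("3,7,12"))  # prints [3, 7, 12]
    print(parse_row_spec("0-2,7"))   # prints [0, 1, 2, 7]
```

Returning a sorted, de-duplicated list keeps the selection stable even when a failure index is listed twice.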

Override the scorer for every task (a registry name or `module:class`). By default each benchmark brings its own.

Where to write the results JSON. Defaults to a timestamped run directory under `~/.rllm/eval_results/`.

Save each task’s full trajectory as JSON for later inspection with `rllm view`.

Sandboxed (agentic) evaluations
Coding agents and Harbor tasks run inside an isolated sandbox: a container with the task’s repo, dependencies, and test suite. Pick where those sandboxes run with `--sandbox-backend`:
| Backend | Where it runs | Notes |
|---|---|---|
| `docker` | Your local Docker daemon | Default; requires Docker installed |
| `local` | The host machine directly | No isolation; fastest |
| `modal`, `daytona`, `e2b`, `runloop`, `gke` | Cloud | No local Docker needed; scales out |

`--sandbox-concurrency` caps how many sandboxes exist at once, independent of `--concurrency`. Use it to stay within a cloud backend’s limits.

Accelerating cold start with snapshots
Every sandboxed task pays a cold start before any real work happens: the backend pulls an (often multi-gigabyte) image and replays the environment’s setup steps. For large or repeated evaluations this can dominate wall-clock time. A snapshot bakes that one-time work into a reusable backend artifact, so tasks boot from a warmed environment in roughly a second. Snapshots are managed like datasets: you create them once, they persist, and `rllm eval` uses them automatically.
What a snapshot is
A per-environment artifact, shared widely
Persistent and user-managed, unlike a sandbox
A sandbox is ephemeral: created for one rollout and torn down right after. A snapshot persists until you remove it, like a dataset. Evaluation never creates or destroys snapshots; it only reads them. Building and removing them is always an explicit `rllm snapshot create` / `destroy`.

Storage, not running compute
Snapshots are stored as image diffs, so they cost almost nothing while idle (no sandbox is running). Each carries a local time-to-live (`--ttl-hours`, default 7 days): past it, rLLM stops trusting the local entry and falls back to cold rather than risk a stale boot.

| | Cold (default fallback) | Snapshot |
|---|---|---|
| Setup re-paid per task | Yes | No — boots from the warmed environment |
| Idle cost | None | Negligible (storage only) |
| Reused across runs | — | Yes, by eval and training |
| Created by | — | rllm snapshot create (never by a run) |

Snapshots are used on backends that support them (`modal`, `daytona`); `docker` / `local` always take the cold path (their local image cache already amortizes it). To force the cold path on a backend that has snapshots (for A/B timing, say), pass `--no-snapshot`.
If a snapshot was removed on the backend out of band, the eval notices on boot and transparently falls back to cold for that task — a run never fails because a snapshot went missing.
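
The fallback rules in this section can be summarized as a single decision: boot from a snapshot only when nothing rules it out, otherwise go cold. The sketch below is an illustration of that logic under stated assumptions, not rLLM's implementation. `SnapshotEntry`, `choose_boot_path`, and `SNAPSHOT_BACKENDS` are hypothetical names, the backend set contains only the two backends named above, and the TTL is assumed to be measured from the snapshot's creation time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: only the backends this page names as snapshot-capable.
SNAPSHOT_BACKENDS = {"modal", "daytona"}


@dataclass
class SnapshotEntry:
    """Hypothetical local record of a snapshot."""
    created_at: datetime
    ttl_hours: float = 7 * 24          # documented default: 7 days
    exists_on_backend: bool = True     # False if removed out of band


def choose_boot_path(
    backend: str,
    entry: Optional[SnapshotEntry],
    now: datetime,
    no_snapshot: bool = False,
) -> str:
    """Return 'snapshot' or 'cold' for one task; never raises for a missing snapshot."""
    if no_snapshot:                      # explicit opt-out (--no-snapshot)
        return "cold"
    if backend not in SNAPSHOT_BACKENDS:  # docker / local always go cold
        return "cold"
    if entry is None:                    # nothing was ever created
        return "cold"
    if now - entry.created_at > timedelta(hours=entry.ttl_hours):
        return "cold"                    # local entry is past its TTL
    if not entry.exists_on_backend:      # removed out of band
        return "cold"
    return "snapshot"


if __name__ == "__main__":
    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    fresh = SnapshotEntry(created_at=now - timedelta(days=1))
    print(choose_boot_path("modal", fresh, now))   # prints snapshot
    print(choose_boot_path("docker", fresh, now))  # prints cold
```

Every failing check returns `"cold"` rather than raising, which mirrors the guarantee that a run never fails because a snapshot went missing.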
What’s next
- Supported datasets: every benchmark with its size, source, and evaluator
- CLI reference: all commands, flags, and the web UI
- Agent harnesses: how harnesses drive the model through a task
- Evaluators: how results are scored

