Task is the unit of work in rLLM: one problem instance — a single math row, a single sandboxed coding task — described as pure data. The on-disk format is Harbor-compatible (existing Harbor task packages run unmodified) with a small [rllm] extension. The same Task type and the same Runner drive both data benchmarks (gsm8k-style) and sandbox benchmarks (SWE-bench, Terminal-Bench, rllm-swesmith, …) — one code path.
The Task data model
Task is intentionally minimal. It says what the problem is and where its files live. It does not carry an evaluator reference, a sandbox handle, or a rendered prompt template — those are produced lazily by the Runner from on-disk config.
Instruction
instruction is what the agent sees in the user message. Three sources, in priority order:
- instruction.md.tpl: a template in the dataset directory, rendered with the row, supporting {{field}} placeholders. List-valued fields (e.g. MCQ choices) become a lettered block: (A) ...\n(B) ....
- instruction_field: a column name declared in dataset.toml.
- instruction.md: a literal file (one per task directory; sandbox shape).
For multimodal tasks (category = "vlm" in dataset.toml), the loader produces a list of OpenAI-style content blocks instead of a string, with images encoded as inline data URIs.
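The template source described above can be sketched in a few lines. This is a minimal illustration, not the loader's actual code: render_template and render_value are hypothetical names, and leaving unknown placeholders untouched is an assumption.

```python
import re
from string import ascii_uppercase


def render_value(value):
    """Render one row field; list values become a lettered block."""
    if isinstance(value, (list, tuple)):
        return "\n".join(
            f"({ascii_uppercase[i]}) {item}" for i, item in enumerate(value)
        )
    return str(value)


def render_template(template: str, row: dict) -> str:
    """Replace each {{field}} placeholder with the matching row value."""

    def substitute(match):
        field = match.group(1)
        if field not in row:
            # Assumption: unknown placeholders are left as written.
            return match.group(0)
        return render_value(row[field])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "{{question}}\n\n{{choices}}"
row = {"question": "Which planet is largest?", "choices": ["Mars", "Jupiter", "Venus"]}
print(render_template(template, row))
```

With the sample row, the choices field is rendered as three lines labelled (A), (B), and (C).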
Metadata
metadata is the side channel between the dataset and the verifier:
- Data tasks: the source row (or the subset of fields named in metadata_fields). This is where ground truth, MCQ choices, expected answer, etc. live.
- Sandbox tasks: the parsed task.toml, plus convenience keys lifted from common sections (workdir, agent_user, verifier_user, verifier_timeout, setup_commands, …).
A data verifier typically reads task.metadata["ground_truth"]. A sandbox verifier uses task.task_dir to find its files.
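For a data task, the relationship between a source row, instruction_field, and metadata_fields might look like the sketch below. split_row and exact_match are hypothetical helpers written for illustration; only the field names instruction_field, metadata_fields, and ground_truth come from this page.

```python
def split_row(row, instruction_field, metadata_fields=None):
    """Return (instruction, metadata) for one source row."""
    instruction = row[instruction_field]
    if metadata_fields is None:
        metadata = dict(row)  # keep the whole source row
    else:
        # Keep only the named subset of fields.
        metadata = {name: row[name] for name in metadata_fields if name in row}
    return instruction, metadata


def exact_match(metadata, answer):
    """Toy verifier: compare an answer with the stored ground truth."""
    return str(answer).strip() == str(metadata["ground_truth"]).strip()


row = {"question": "What is 6 * 7?", "ground_truth": "42", "source": "demo"}
instruction, metadata = split_row(row, "question", ["ground_truth"])
```

The verifier never sees the dataset file itself; everything it needs travels in metadata.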
Two on-disk shapes
A “benchmark” on disk is one of two shapes, both of which produce list[Task]:
Shape A — rows-with-shared-verifier (gsm8k-style)
A single data/<split>.jsonl provides per-task data; all rows share one verifier. Use this for math, MCQ, code, or any benchmark with thousands of small instances.
Each row of data/<split>.jsonl becomes one Task. dataset_dir is the benchmark directory; sub_dir is None. The verifier is shared across all rows.
Shape B — task-per-directory (Harbor sandbox style)
Each task is its own subdirectory. Use this for SWE-bench, Terminal-Bench, or any benchmark where each instance has its own seed files, Dockerfile, and tests.
sub_dir is set, so task.task_dir resolves to the per-task directory and the verifier finds its files there. The Runner detects whether a sandbox is needed and provisions one; see Sandboxes.
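The way dataset_dir and sub_dir combine into task_dir for the two shapes can be illustrated with a small dataclass. TaskSketch is a hypothetical stand-in rather than the real Task class, and returning dataset_dir when sub_dir is None is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional


@dataclass
class TaskSketch:
    """Illustrative stand-in for a pure-data task record."""

    instruction: Any  # a string, or a list of content blocks
    metadata: dict = field(default_factory=dict)
    dataset_dir: Path = Path(".")
    sub_dir: Optional[str] = None

    @property
    def task_dir(self) -> Path:
        # Assumption: without sub_dir, the task directory is the dataset directory.
        if self.sub_dir:
            return self.dataset_dir / self.sub_dir
        return self.dataset_dir


# Shape A: one row of a shared JSONL file, no per-task directory.
row_task = TaskSketch("What is 2 + 2?", {"ground_truth": "4"}, Path("/data/gsm8k"))

# Shape B: one task per subdirectory.
sandbox_task = TaskSketch("Fix the failing test.", {}, Path("/data/swe"), "task-001")
```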
dataset.toml
Dataset-wide defaults shared by every task in the directory:
dataset.toml also declares the instruction template and the shared verifier:
task.toml (Shape B only)
Per-task config. The minimum is an empty file (everything is autodetected):
Automatic lifts
The loader fills in declarations the task makes implicitly:
- environment/Dockerfile’s FROM line populates [environment].docker_image when unset.
- environment/Dockerfile’s WORKDIR populates [environment].workdir when unset.
- The --sandbox-backend flag overrides dataset.toml’s default_sandbox, which overrides the harness class default.
Explicit values in task.toml always win.
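The lifting rules and the backend precedence above can be sketched as two small functions. Both are hypothetical helpers, not the loader's code; in particular, using the first FROM and WORKDIR found in the Dockerfile is an assumption of this sketch.

```python
def lift_from_dockerfile(dockerfile_text: str, environment: dict) -> dict:
    """Fill docker_image / workdir from a Dockerfile when they are unset."""
    env = dict(environment)  # explicit values are copied first and never overwritten
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2:
            continue
        keyword = parts[0].upper()
        if keyword == "FROM":
            # Skip option flags such as --platform=... before the image name.
            image = next((p for p in parts[1:] if not p.startswith("--")), None)
            if image:
                env.setdefault("docker_image", image)
        elif keyword == "WORKDIR":
            env.setdefault("workdir", parts[1])
    return env


def resolve_backend(cli_flag, dataset_default, harness_default):
    """Pick the sandbox backend: CLI flag, then dataset.toml, then harness default."""
    for candidate in (cli_flag, dataset_default, harness_default):
        if candidate:
            return candidate
    return None


dockerfile = "FROM python:3.11-slim\nWORKDIR /app\n"
lifted = lift_from_dockerfile(dockerfile, {})
explicit = lift_from_dockerfile(dockerfile, {"workdir": "/srv"})
```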
Verifier resolution
The Runner reads dataset.toml (or task.toml) to find each task’s verifier. Four ways to declare it:
When no verifier is declared, it is autodetected from the files present: tests/test.sh → A, tests/evaluate.py → B.
Python verifier signature
The (metadata, trajectory) signature is convenient for ad-hoc verifiers. Return values are coerced: float, bool, dict, tuple[float, bool], and EvalOutput are all accepted.
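One way to picture the coercion of return values is the sketch below. EvalOutputSketch and its field names (reward, is_correct, details) are hypothetical, as is the rule that a bare number counts as correct only when it reaches 1.0; the real EvalOutput may differ.

```python
from dataclasses import dataclass, field


@dataclass
class EvalOutputSketch:
    """Hypothetical result record; field names are assumptions."""

    reward: float
    is_correct: bool
    details: dict = field(default_factory=dict)


def coerce(result):
    """Normalise the accepted return types into one result record."""
    if isinstance(result, EvalOutputSketch):
        return result
    if isinstance(result, bool):  # checked before numbers: bool is an int subclass
        return EvalOutputSketch(1.0 if result else 0.0, result)
    if isinstance(result, (int, float)):
        # Assumption: a bare reward counts as correct only at 1.0 or above.
        return EvalOutputSketch(float(result), result >= 1.0)
    if isinstance(result, tuple) and len(result) == 2:
        reward, is_correct = result
        return EvalOutputSketch(float(reward), bool(is_correct))
    if isinstance(result, dict):
        reward = float(result.get("reward", 0.0))
        is_correct = bool(result.get("is_correct", reward >= 1.0))
        return EvalOutputSketch(reward, is_correct, result)
    raise TypeError(f"unsupported verifier return type: {type(result).__name__}")
```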
Shell verifier contract
The script writes a reward file in the sandbox; the framework reads it. Search order (first existing wins):
- /tmp/rllm/reward.json
- /logs/verifier/reward.json (Harbor convention)
- /logs/verifier/reward.txt (Harbor convention; single float)
tests/test.sh:
The framework creates /tmp/rllm/ and /logs/verifier/ before invoking the verifier.
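The "first existing wins" lookup can be sketched as follows. read_reward is a hypothetical helper, and the JSON layout (a bare number or an object with a reward key) is an assumption, since the schema of reward.json is not spelled out here. Paths are taken relative to a root so the sketch runs outside a sandbox.

```python
import json
import tempfile
from pathlib import Path

# Search order from the text, relative to the sandbox root.
REWARD_FILES = (
    "tmp/rllm/reward.json",
    "logs/verifier/reward.json",
    "logs/verifier/reward.txt",
)


def read_reward(root: Path):
    """Return the reward from the first existing file, or None if none exists."""
    for relative in REWARD_FILES:
        path = root / relative
        if not path.is_file():
            continue
        text = path.read_text().strip()
        if path.suffix == ".txt":
            return float(text)  # single float
        payload = json.loads(text)
        # Assumption: the JSON file is a bare number or {"reward": <number>}.
        return float(payload["reward"]) if isinstance(payload, dict) else float(payload)
    return None


# Demo against a temporary directory standing in for the sandbox root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "logs/verifier").mkdir(parents=True)
    (root / "logs/verifier/reward.txt").write_text("0.5\n")
    only_txt = read_reward(root)

    (root / "tmp/rllm").mkdir(parents=True)
    (root / "tmp/rllm/reward.json").write_text(json.dumps({"reward": 1.0}))
    json_first = read_reward(root)
```

In the demo, the text file is used while it is the only candidate, and the JSON file under tmp/rllm takes over once it exists.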
Oracle solver (optional)
solve.sh at the task root is a “what the right answer looks like” script — used by rllm eval --agent oracle to confirm a task is solvable end-to-end. It executes in the declared workdir:
The oracle agent runs solve.sh and then invokes the verifier. If the verifier returns reward == 1.0, the task is solvable; if not, the task is broken (mislabeled verifier, missing fixture, deleted source file). This is how dataset builders smoke-test thousands of instances before publishing.
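The smoke-test loop described above reduces to a simple filter. This is only a sketch: find_broken_tasks is a hypothetical helper, and run_oracle and run_verifier stand in for whatever actually executes solve.sh and the verifier.

```python
def find_broken_tasks(tasks, run_oracle, run_verifier):
    """Return the tasks whose oracle solution does not earn reward 1.0."""
    broken = []
    for task in tasks:
        run_oracle(task)  # e.g. execute solve.sh in the task's workdir
        reward = run_verifier(task)  # e.g. run the verifier and read the reward
        if reward != 1.0:
            broken.append(task)
    return broken


# Stand-in callables for demonstration only.
fake_rewards = {"task-ok": 1.0, "task-missing-fixture": 0.0}
executed = []
broken = find_broken_tasks(list(fake_rewards), executed.append, fake_rewards.get)
```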
User isolation (optional)
User isolation stops an adversarial agent from writing the reward file directly. When it is enabled, the framework creates a user named agent, locks /logs/verifier, /tmp/rllm, and /tests to root-only, and runs agent commands as agent while the verifier runs as root. The kernel enforces the boundary: echo 1 > /logs/verifier/reward.txt returns Permission denied.
Loading a benchmark
BenchmarkLoader autodetects the shape:
- dataset.toml + data/<split>.jsonl → Shape A (data dataset).
- dataset.toml + subdirectories with task.toml (or listed under [[tasks]]) → Shape B (sandbox dataset).
- A single task.toml at the root, or autodiscovered subdirectories with task.toml → still produces Tasks.
Catalog datasets are first downloaded to ~/.rllm/datasets/<name>/, then load through the same path. See Bring your own dataset for the catalog vs. ad-hoc paths.
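The detection rules above can be approximated with a few filesystem checks. detect_shape is a hypothetical function, not BenchmarkLoader itself: the default split name, the order of the checks, and labelling a lone task.toml as "sandbox" are assumptions, and tasks listed under [[tasks]] are not handled.

```python
import tempfile
from pathlib import Path


def detect_shape(root: Path, split: str = "test") -> str:
    """Classify a benchmark directory as 'data' (Shape A) or 'sandbox' (Shape B)."""
    has_dataset_toml = (root / "dataset.toml").is_file()
    if has_dataset_toml and (root / "data" / f"{split}.jsonl").is_file():
        return "data"
    if (root / "task.toml").is_file():
        return "sandbox"  # a single task at the root (assumed label)
    subdirs = [child for child in root.iterdir() if child.is_dir()]
    if any((child / "task.toml").is_file() for child in subdirs):
        return "sandbox"
    raise ValueError(f"no recognisable benchmark layout under {root}")


# Build one directory of each shape in a temporary location and classify them.
with tempfile.TemporaryDirectory() as tmp:
    shape_a = Path(tmp) / "math"
    (shape_a / "data").mkdir(parents=True)
    (shape_a / "dataset.toml").write_text("")
    (shape_a / "data" / "test.jsonl").write_text("{}\n")

    shape_b = Path(tmp) / "swe"
    (shape_b / "task-001").mkdir(parents=True)
    (shape_b / "dataset.toml").write_text("")
    (shape_b / "task-001" / "task.toml").write_text("")

    detected = (detect_shape(shape_a), detect_shape(shape_b))
```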
The Runner
rllm.runner.Runner runs a single Task end-to-end:
- Read verifier config: resolve the task’s verifier from task.toml (per-task) or dataset.toml (shared).
- Provision sandbox if needed: if the task or AgentFlow requires a sandbox, build one from task.task_dir/environment/ (Dockerfile, image, setup commands). Sandbox backends: docker, local, modal, daytona.
- Run the AgentFlow: call agent_flow.run(task, config) (or arun if defined) to produce an Episode with one or more Trajectories.
- Resolve and run the Evaluator: build an evaluator from the verifier config (a Python evaluate function for data tasks, or a shell-script runner for sandbox tasks), then pass it (task, episode) to get an EvalOutput.
Running many tasks
run_dataset is the parallel front-end on top of Runner — it’s what rllm eval uses internally:
Effective concurrency is min(concurrency, agent_flow.max_concurrent). Sandboxed flows get a fresh per-task agent_flow copy so sandbox state doesn’t leak.
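The concurrency cap and the per-task copy can be sketched with asyncio. This is not run_dataset's real implementation: run_all and run_one are hypothetical names, and copy.deepcopy stands in for however the framework actually duplicates a sandboxed flow.

```python
import asyncio
import copy


async def run_all(tasks, agent_flow, run_one, concurrency, sandboxed=False):
    """Run every task, never exceeding the effective concurrency limit."""
    limit = min(concurrency, agent_flow.max_concurrent)
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task):
        async with semaphore:
            # Sandboxed flows get their own copy so state does not leak between tasks.
            flow = copy.deepcopy(agent_flow) if sandboxed else agent_flow
            return await run_one(task, flow)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(task) for task in tasks))


class DemoFlow:
    """Stand-in flow object exposing only the attribute the sketch needs."""

    max_concurrent = 2


state = {"active": 0, "peak": 0}


async def demo_run_one(task, flow):
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulate work
    state["active"] -= 1
    return task * 2


results = asyncio.run(run_all([1, 2, 3, 4, 5], DemoFlow(), demo_run_one, concurrency=8))
```

Even though concurrency=8 is requested, the flow's max_concurrent of 2 is the binding limit in this demo.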
Why this split
Before this refactor, eval and training each had their own task abstraction, and sandbox tasks went through a different pipeline than data tasks. That split made it hard to share verifiers, mix shapes in one benchmark suite, or move a benchmark from “in-memory rows” to “on-disk task directories” without rewriting glue. The new model (a pure-data Task + a single Runner that reads verifier config off disk) means:
- Data tasks and sandbox tasks share the same Episode-producing pipeline.
- Verifiers travel with the dataset (in dataset.toml / task.toml), not in agent code.
- The CLI, training loop, and SDK all consume the same Task and Runner, so behaviour stays consistent across entry points.
- Harbor task packages drop in unchanged: the format is the same one rLLM uses natively.

