There are three ways to point `rllm eval` and `rllm train` at a dataset that isn't already in the Supported datasets catalog. Pick the one that matches your situation.
| Path | Use when | What you write |
|---|---|---|
| Ad-hoc directory | it's a one-off experiment that doesn't need to be redistributed | a folder on disk |
| Builder + catalog entry | you want `rllm dataset pull <name>` to work for collaborators | a builder function + a registry row |
| Harbor package | you’re publishing a task suite for broad consumption | a Harbor-format git repo |
All three paths produce the same Task shape; the difference is how the directory gets on disk.
Ad-hoc directory
The fastest path. Build a directory in either supported shape (see Tasks (Harbor-compatible) for the format) and point the CLI at it. No `rllm dataset pull` registration is needed: `rllm eval ./path` works on any well-formed directory.
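As a rough sketch of the ad-hoc path, the Python below scaffolds a one-task directory and prints the matching `rllm eval` command. The layout is an assumption for illustration: `instruction.md` and the `scaffold_task` helper are hypothetical, and only `tests/test.sh` is named on this page. Check Tasks (Harbor-compatible) for the real format before relying on it.

```python
import shlex
import tempfile
from pathlib import Path


def scaffold_task(root: Path, task_id: str, instruction: str) -> Path:
    """Create one task directory under root (layout is an assumption)."""
    task_dir = root / task_id
    (task_dir / "tests").mkdir(parents=True, exist_ok=True)
    # Hypothetical file name for the task prompt.
    (task_dir / "instruction.md").write_text(instruction + "\n")
    # tests/test.sh is the verifier script; its body is left as a placeholder.
    test_sh = task_dir / "tests" / "test.sh"
    test_sh.write_text(
        "#!/bin/sh\n# TODO: write the reward file per the shell verifier contract\n"
    )
    test_sh.chmod(0o755)
    return task_dir


root = Path(tempfile.mkdtemp()) / "my-bench"
scaffold_task(root, "task-001", "Fix the failing unit test.")

# The CLI is pointed straight at the directory; nothing is registered.
eval_cmd = ["rllm", "eval", str(root)]
print(shlex.join(eval_cmd))
```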
Builder + catalog entry
When you want `rllm dataset pull my-bench` to materialize the dataset on a collaborator's machine, register a builder in `rllm/registry/datasets.json`.
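The registry schema is not shown on this page, so the following is only a guess at what a row might look like, built as a Python dict and printed as JSON. The `builder` key and its value are hypothetical; `default_sandbox` is the only key taken from this page.

```python
import json

# Hypothetical registry row; consult rllm/registry/datasets.json for the real schema.
registry_row = {
    "my-bench": {
        "builder": "rllm.data.my_bench:build",  # assumed key and import path
        "default_sandbox": "modal",  # key named in this guide
    }
}

print(json.dumps(registry_row, indent=2))
```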
The builder pulls the source, screens out unsolvable rows, writes `dataset.toml` plus per-task directories into `~/.rllm/datasets/my-bench/`, and registers `task_path` rows so `rllm eval my-bench` resolves to the materialized directory.
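The builder steps above can be sketched roughly as follows. This is not rLLM's builder interface: the function name, the `solvable` and `prompt` fields, the `instruction.md` file, and the one-line `dataset.toml` are all assumptions chosen to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def build_my_bench(rows, out_dir: Path):
    """Hypothetical builder: filter rows, write task dirs, return task_path rows."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Screen out rows flagged as unsolvable ("solvable" is a made-up field).
    kept = [r for r in rows if r.get("solvable", True)]
    # Minimal stand-in for dataset.toml; the real schema is not shown here.
    (out_dir / "dataset.toml").write_text('name = "my-bench"\n')
    registered = []
    for row in kept:
        task_dir = out_dir / row["id"]
        task_dir.mkdir(exist_ok=True)
        (task_dir / "instruction.md").write_text(row["prompt"] + "\n")
        # One task_path row per materialized task directory.
        registered.append({"id": row["id"], "task_path": str(task_dir)})
    return registered


rows = [
    {"id": "t1", "prompt": "Fix bug A.", "solvable": True},
    {"id": "t2", "prompt": "Fix bug B.", "solvable": False},
]
out = Path(tempfile.mkdtemp()) / "my-bench"
task_rows = build_my_bench(rows, out)
```

A real builder would write into `~/.rllm/datasets/my-bench/`; a temporary directory is used here so the sketch has no side effects.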
The `rllm-swesmith` and `skillsbench` builders (in `rllm/data/`) are real-world references. Builders also accept `task_ids` / `limit` kwargs for partial builds, which is useful for smoke tests.
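One plausible reading of the `task_ids` / `limit` kwargs is a filter followed by a truncation. The helper below is a hypothetical illustration of that reading, not code from the reference builders, and the `id` field is assumed.

```python
from typing import Iterable, Optional


def select_rows(
    rows: list,
    task_ids: Optional[Iterable[str]] = None,
    limit: Optional[int] = None,
) -> list:
    """Narrow rows to the requested ids, then keep at most `limit` of them."""
    if task_ids is not None:
        wanted = set(task_ids)
        rows = [r for r in rows if r["id"] in wanted]
    if limit is not None:
        rows = rows[:limit]
    return rows


all_rows = [{"id": f"t{i}"} for i in range(5)]
smoke = select_rows(all_rows, limit=2)  # first two tasks only
picked = select_rows(all_rows, task_ids=["t3", "t4"])  # explicit subset
```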
What goes in default_sandbox
For sandbox datasets, declaring `default_sandbox` in the registry row (or in the materialized `dataset.toml`) means users don't have to remember `--sandbox-backend modal`. Both the eval and train CLIs honor it when `--sandbox-backend` is absent, and fall back to the harness class default when neither is set.
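The precedence described here can be written as a small function. This is a sketch of the ordering only; the function name and the `"docker"` harness default are placeholders, not rLLM internals.

```python
from typing import Optional


def resolve_sandbox_backend(
    cli_flag: Optional[str],
    default_sandbox: Optional[str],
    harness_default: str,
) -> str:
    """Pick a backend: explicit flag, then dataset default, then harness default."""
    if cli_flag is not None:
        return cli_flag
    if default_sandbox is not None:
        return default_sandbox
    return harness_default


# "docker" stands in for whatever the harness class default happens to be.
from_flag = resolve_sandbox_backend("daytona", "modal", "docker")
from_dataset = resolve_sandbox_backend(None, "modal", "docker")
from_harness = resolve_sandbox_backend(None, None, "docker")
```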
When the builder should fill in resource limits
Remote backends (Modal, Daytona) OOM on heavy environments with the default ~1 GiB allocation. If your tasks need more, patch in defaults at build time. The `rllm-swesmith` builder does this; without it, the `uv` install step OOMs on cold sandbox builds.
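Patching defaults at build time might look like the sketch below, which fills in a memory limit only when a task has not set one. The `environment` table, the `memory_mb` key, and the 4096 value are assumptions for illustration; the real task config keys may differ.

```python
def with_resource_defaults(task_config: dict, memory_mb: int = 4096) -> dict:
    """Return a copy of task_config with a memory default filled in if missing."""
    patched = dict(task_config)
    env = dict(patched.get("environment", {}))
    # setdefault keeps any limit the task author already chose.
    env.setdefault("memory_mb", memory_mb)
    patched["environment"] = env
    return patched


bare = {"environment": {}}
tuned = {"environment": {"memory_mb": 8192}}
patched_bare = with_resource_defaults(bare)
patched_tuned = with_resource_defaults(tuned)
```

Returning a copy rather than mutating the input keeps the builder's source rows unchanged, which makes partial rebuilds easier to reason about.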
Harbor packages
If you already have a Harbor task package, rLLM consumes it unchanged. The `harbor:` prefix resolves through the Harbor registry. See Tasks (Harbor-compatible) for the format; it's the same one rLLM uses natively.
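To picture how the three paths differ at the command line, the hypothetical helper below sorts a dataset reference into one of them. How rLLM actually disambiguates references is not described here; only the `harbor:` prefix comes from this page, and the path heuristics are assumptions.

```python
def classify_dataset_ref(ref: str) -> str:
    """Guess which path a dataset reference belongs to (illustrative only)."""
    if ref.startswith("harbor:"):
        return "harbor"  # resolved through the Harbor registry
    if ref.startswith(("./", "../", "/", "~")):
        return "directory"  # ad-hoc directory on disk
    return "catalog"  # name looked up in the dataset catalog


kinds = [
    classify_dataset_ref(r) for r in ("harbor:some-suite", "./path", "my-bench")
]
```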
Mixing splits for training
`rllm train` accepts independent `--train-dataset` and `--val-dataset` paths, so train/val can be different directories or different shapes.
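A minimal sketch of assembling such an invocation from Python follows. The two flags come from this page; the directory paths and the `train_command` helper are made up, and any other arguments a real run needs are omitted.

```python
import shlex


def train_command(train_dataset: str, val_dataset: str) -> list[str]:
    """Build an argv list for rllm train with separate train and val datasets."""
    return [
        "rllm", "train",
        "--train-dataset", train_dataset,
        "--val-dataset", val_dataset,
    ]


# Hypothetical directories; they may use different task shapes.
cmd = train_command("./data/my-bench-train", "./data/my-bench-val")
print(shlex.join(cmd))
```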
The `train_split` / `eval_split` keys do the same for catalog datasets.
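As an illustration of how those two keys could drive split selection, the snippet below reads them from a dict standing in for a catalog entry. The split names `"train"` and `"test"` and the `split_for` helper are hypothetical.

```python
# Stand-in for a catalog entry; only the two key names come from this guide.
catalog_entry = {"train_split": "train", "eval_split": "test"}


def split_for(entry: dict, mode: str) -> str:
    """Return the split name to use for 'train' or 'eval' mode."""
    key = "train_split" if mode == "train" else "eval_split"
    return entry[key]


train_name = split_for(catalog_entry, "train")
eval_name = split_for(catalog_entry, "eval")
```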
What you don’t need to write
- The reward function for sandbox tasks: `tests/test.sh` writes a reward file; the framework reads it. See the shell verifier contract.
- The reward function for catalog-style data tasks if your task matches an existing pattern: point `[verifier].name` at `math_reward_fn`, `mcq_reward_fn`, `f1_reward_fn`, etc.
- A custom harness: if your task fits the bash-ReAct or one-shot LLM shape, the built-in harnesses already cover it. See Harnesses.

