--agent flag picks one by name; the Runner drives it like any other AgentFlow.
You don’t need a harness for everything — for one-off agents and cookbooks, hand-write an AgentFlow class. Reach for a harness when the same agent shape applies to many tasks (every math benchmark needs a one-shot LLM call; every SWE-bench-style task needs an interactive bash loop).
Built-in harnesses
| Name | Class | Use case |
|---|---|---|
react | rllm.harnesses.react.ReActHarness | One-shot LLM call. Default for data tasks (math, MCQ, QA). |
bash | rllm.harnesses.bash.BashHarness | Multi-turn ReAct bash loop inside a sandbox. Default for sandbox tasks. |
claude-code | rllm.harnesses.claude_code.ClaudeCodeHarness | Run the Claude Code CLI inside the sandbox. |
opencode | rllm.harnesses.opencode.OpenCodeHarness | Run the opencode-ai CLI inside the sandbox. |
mini-swe-agent | rllm.harnesses.mini_swe_agent.MiniSweAgentHarness | Run mini-swe-agent inside the sandbox. |
codex | rllm.harnesses.codex.CodexHarness | Run the OpenAI Codex CLI inside the sandbox. |
qwen-code | rllm.harnesses.qwen_code.QwenCodeHarness | Run the Qwen Code CLI inside the sandbox. |
aider | rllm.harnesses.aider.AiderHarness | Run Paul Gauthier’s aider CLI inside the sandbox. |
kimi-cli | rllm.harnesses.kimi_cli.KimiCliHarness | Run Moonshot’s kimi-cli inside the sandbox. |
zeroclaw | rllm.harnesses.zeroclaw.ZeroClawHarness | Run the ZeroClaw autonomous-assistant CLI inside the sandbox. |
rllm/registry/agents.json. load_agent("react") resolves to ReActHarness().
ReAct harness — one-shot LLM
ReActHarness is the default for catalog datasets (gsm8k, MATH-500, MMLU, …). It makes one chat completion and packs the response into a single-step Episode:
"Put your final answer in \boxed{}" injected, while an MCQ benchmark gets "Respond with a single letter A, B, C, or D". See Reward functions for how that hint is declared.
trajectory.output is set to the LLM response so reward functions can extract the answer with a uniform contract.
Bash harness — sandboxed ReAct loop
BashHarness is the default for sandbox tasks. It runs a multi-turn ReAct loop where the agent’s tool is bash executed inside the task’s sandbox. Each turn:
- The model proposes a bash command (or signals it’s done).
- The harness runs it in the sandbox and returns stdout/stderr.
- The conversation continues until the model declares completion or a turn cap is hit.
Claude Code harness
ClaudeCodeHarness runs the Claude Code CLI inside the sandbox as a black-box agent. The harness is a thin wrapper that streams Claude Code’s output and packages the resulting trajectory. Use it when the benchmark expects a fully agentic coding workflow and you want Claude Code’s tool-use behaviour out of the box.
Resolving a harness
The CLI resolves harnesses by name throughload_agent:
load_agent accepts:
- A registered name from
rllm/registry/agents.json(react,bash,claude-code). - A
module.path:attributereference for any AgentFlow you’ve installed.
dataset.toml can declare a default_agent that points at a harness name; BenchmarkLoader surfaces it on BenchmarkResult.harness_name, and the CLI uses it when --agent is not supplied.
Writing a custom harness
A harness is just an AgentFlow with a class-levelname and (optionally) max_concurrent:
rllm/registry/agents.json (for built-ins) or by referencing it as my_pkg.my_module:MyHarness from the CLI:
rllm.sandbox.SandboxedAgentFlow instead — the Runner will provision a fresh sandbox per task.
