Skip to main content
A harness is a built-in AgentFlow: a generic agent program parameterised by the LLM and the task, with no per-benchmark code. rLLM ships three of them. The CLI’s --agent flag picks one by name; the Runner drives it like any other AgentFlow. You don’t need a harness for everything — for one-off agents and cookbooks, hand-write an AgentFlow class. Reach for a harness when the same agent shape applies to many tasks (every math benchmark needs a one-shot LLM call; every SWE-bench-style task needs an interactive bash loop).

Built-in harnesses

NameClassUse case
reactrllm.harnesses.react.ReActHarnessOne-shot LLM call. Default for data tasks (math, MCQ, QA).
bashrllm.harnesses.bash.BashHarnessMulti-turn ReAct bash loop inside a sandbox. Default for sandbox tasks.
claude-coderllm.harnesses.claude_code.ClaudeCodeHarnessRun the Claude Code CLI inside the sandbox.
opencoderllm.harnesses.opencode.OpenCodeHarnessRun the opencode-ai CLI inside the sandbox.
mini-swe-agentrllm.harnesses.mini_swe_agent.MiniSweAgentHarnessRun mini-swe-agent inside the sandbox.
codexrllm.harnesses.codex.CodexHarnessRun the OpenAI Codex CLI inside the sandbox.
qwen-coderllm.harnesses.qwen_code.QwenCodeHarnessRun the Qwen Code CLI inside the sandbox.
aiderrllm.harnesses.aider.AiderHarnessRun Paul Gauthier’s aider CLI inside the sandbox.
kimi-clirllm.harnesses.kimi_cli.KimiCliHarnessRun Moonshot’s kimi-cli inside the sandbox.
zeroclawrllm.harnesses.zeroclaw.ZeroClawHarnessRun the ZeroClaw autonomous-assistant CLI inside the sandbox.
The registry lives at rllm/registry/agents.json. load_agent("react") resolves to ReActHarness().

ReAct harness — one-shot LLM

ReActHarness is the default for catalog datasets (gsm8k, MATH-500, MMLU, …). It makes one chat completion and packs the response into a single-step Episode:
from rllm.harnesses.react import ReActHarness

agent = ReActHarness(system_prompt="You are a careful math tutor.")
episode = agent.run(task, config)
The harness automatically appends the verifier’s expected output format to the system prompt — so a math benchmark gets "Put your final answer in \boxed{}" injected, while an MCQ benchmark gets "Respond with a single letter A, B, C, or D". See Reward functions for how that hint is declared. trajectory.output is set to the LLM response so reward functions can extract the answer with a uniform contract.

Bash harness — sandboxed ReAct loop

BashHarness is the default for sandbox tasks. It runs a multi-turn ReAct loop where the agent’s tool is bash executed inside the task’s sandbox. Each turn:
  1. The model proposes a bash command (or signals it’s done).
  2. The harness runs it in the sandbox and returns stdout/stderr.
  3. The conversation continues until the model declares completion or a turn cap is hit.
Use this for benchmarks where success is measured by the state of files in a sandboxed environment — code repairs, terminal tasks, environment configuration.

Claude Code harness

ClaudeCodeHarness runs the Claude Code CLI inside the sandbox as a black-box agent. The harness is a thin wrapper that streams Claude Code’s output and packages the resulting trajectory. Use it when the benchmark expects a fully agentic coding workflow and you want Claude Code’s tool-use behaviour out of the box.

Resolving a harness

The CLI resolves harnesses by name through load_agent:
from rllm.eval import load_agent

agent = load_agent("react")               # Built-in harness
agent = load_agent("my_pkg:my_agent")     # Module path:attribute
load_agent accepts:
  • A registered name from rllm/registry/agents.json (react, bash, claude-code).
  • A module.path:attribute reference for any AgentFlow you’ve installed.
dataset.toml can declare a default_agent that points at a harness name; BenchmarkLoader surfaces it on BenchmarkResult.harness_name, and the CLI uses it when --agent is not supplied.

Writing a custom harness

A harness is just an AgentFlow with a class-level name and (optionally) max_concurrent:
from rllm.types import Episode, Step, Trajectory, Task


class MyHarness:
    name = "my-harness"
    max_concurrent = 32

    def __init__(self, system_prompt: str | None = None):
        self.system_prompt = system_prompt or "You are helpful."

    async def arun(self, task: Task, config) -> Episode:
        # ... call the model, build steps, etc.
        step = Step(input=task.instruction, output="...")
        traj = Trajectory(steps=[step], output=step.output)
        return Episode(trajectories=[traj])
Register it by adding an entry to rllm/registry/agents.json (for built-ins) or by referencing it as my_pkg.my_module:MyHarness from the CLI:
rllm eval gsm8k --agent my_pkg.my_module:MyHarness
If your harness needs a sandbox, subclass rllm.sandbox.SandboxedAgentFlow instead — the Runner will provision a fresh sandbox per task.