

A single-turn math agent that solves competition / textbook problems via the AgentFlow protocol. The model emits reasoning followed by \boxed{ANSWER}; the evaluator pulls the boxed value and grades it via mathd + sympy equivalence. This is the no-tool counterpart to math_tool_agent, which uses a calculator tool. Use math for chain-of-thought-only training.
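
A minimal sketch of the extraction step, assuming only that the answer sits inside the last \boxed{...} of the response (the real parser lives in rllm's math reward function; extract_last_boxed below is a hypothetical helper):

def extract_last_boxed(text: str) -> str | None:
    # Hypothetical helper: return the contents of the last \boxed{...},
    # matching braces so nested LaTeX like \boxed{\frac{1}{2}} survives.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: give up rather than guess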

Pattern

| Aspect | Value |
| --- | --- |
| Loop shape | Single-turn (one LLM call per task) |
| Tools | None; the answer is parsed out of the response text |
| Termination | Single LLM call returns; evaluator grades the boxed answer |
| Reward shape | 1.0 if \boxed{…} matches ground truth (mathd + sympy), else 0.0 |

Architecture

AgentFlow.run(task, config)
  ├── one LLM call via OpenAI(base_url=config.base_url)
  │     model outputs reasoning + \boxed{ANSWER}
  └── store full response in episode.artifacts["answer"]

Evaluator.evaluate(task, episode)
  └── extract last \boxed{...}, grade against task.metadata["ground_truth"]
      via mathd + sympy
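
The grading half is handled by the existing reward function; the sketch below only illustrates the idea of a string match with a sympy fallback. Function names are illustrative, not the actual rllm API, and the real mathd normalization is more thorough:

import sympy

def answers_match(prediction: str, ground_truth: str) -> bool:
    # Illustrative check only; the real grader (mathd + sympy in
    # rllm.eval.reward_fns.math) normalizes LaTeX far more aggressively.
    pred, gt = prediction.strip(), ground_truth.strip()
    if pred == gt:
        return True
    try:
        # Symbolic fallback: equivalent if the difference simplifies to zero.
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gt)) == 0
    except Exception:
        return False

def reward(prediction: str, ground_truth: str) -> float:
    return 1.0 if answers_match(prediction, ground_truth) else 0.0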

Install

uv pip install -e ".[tinker]"                    # rllm + tinker backend
uv pip install --no-deps -e cookbooks/math       # this cookbook
rllm agent list                                  # should show "math"

Datasets

rllm dataset pull hendrycks_math    # train (Hendrycks MATH)
rllm dataset pull math500           # 500-problem test
rllm dataset pull gsm8k             # alternative train
rllm dataset pull deepscaler_math   # ~40K AIME/AMC/Omni-MATH/STILL
rllm dataset pull aime2024          # AIME 2024 (eval)
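
All of these reduce to the same per-example shape as far as the flow is concerned: a problem statement under question or problem (or the task instruction) and a ground_truth string for grading. A hypothetical record, purely to show the expected fields:

# Hypothetical record; field names follow what math_flow and the evaluator
# read, not an exact dump of any dataset above.
example = {
    "question": "What is the remainder when 2^10 is divided by 7?",
    "ground_truth": "2",
}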

Eval

rllm eval math500 \
    --agent math \
    --evaluator math \
    --model Qwen/Qwen3-4B-Instruct-2507 \
    --base-url http://localhost:8000/v1 \
    --max-examples 20
Verified with gpt-5.4-mini: 100% (10/10) on a smoke run.

Training

# Tinker single-machine
bash cookbooks/math/train_tinker.sh

# Verl distributed
bash cookbooks/math/train_verl.sh
For LoRA-only training (the legacy gsm8k_lora use case), train on gsm8k and set --lora-rank (e.g. 32, as below):
rllm train gsm8k --agent math --evaluator math --lora-rank 32

Key code

@rllm.rollout(name="math")
async def math_flow(task: Task, config: AgentConfig) -> Episode:
    # Datasets store the problem under "question" or "problem"; fall back to the raw instruction.
    question = str(task.metadata.get("question") or task.metadata.get("problem") or task.instruction)
    client = AsyncOpenAI(base_url=config.base_url, api_key="EMPTY")

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    # Single LLM call; no tools, no extra turns.
    resp = await client.chat.completions.create(
        model=config.model, messages=messages,
        temperature=0.6, max_tokens=8192,
    )
    content = resp.choices[0].message.content or ""
    messages.append({"role": "assistant", "content": content})
    # One step holding the full chat; the whole response doubles as action and thought.
    step = Step(chat_completions=list(messages), model_response=content, action=content, thought=content)

    return Episode(
        trajectories=[Trajectory(name="math", steps=[step])],
        artifacts={"answer": content},
    )
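
SYSTEM_PROMPT is defined elsewhere in math_flow.py and not shown here; a plausible stand-in, assuming its only job is to enforce the \boxed{} convention:

# Hypothetical stand-in; the real prompt lives in cookbooks/math/math_flow.py.
SYSTEM_PROMPT = (
    "You are a careful mathematician. Reason step by step, then end with "
    "the final answer as \\boxed{ANSWER}."
)
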
The evaluator wraps the existing rllm.eval.reward_fns.math.evaluate:
@rllm.evaluator
def math_evaluator(task, episode) -> EvalOutput:
    # Accept a plain dict (e.g. a raw dataset row) by wrapping it in a Task.
    if isinstance(task, dict):
        task = Task(id="", instruction="", metadata=task, dataset_dir=Path("."))
    return _math_evaluate(task, episode)
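
Because of the dict branch, the wrapper can be exercised without constructing a full Task. An illustrative call, reusing only the constructors shown above (the exact fields of the returned EvalOutput depend on rllm):

# Made-up example; expect a reward of 1.0 if the grader accepts the boxed "4".
content = "2 + 2 = 4, so the answer is \\boxed{4}."
step = Step(chat_completions=[], model_response=content, action=content, thought=content)
episode = Episode(
    trajectories=[Trajectory(name="math", steps=[step])],
    artifacts={"answer": content},
)
result = math_evaluator({"ground_truth": "4"}, episode)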

Files

| File | Description |
| --- | --- |
| math_flow.py | The single-turn AgentFlow |
| math_eval.py | Wraps rllm.eval.reward_fns.math |
| train.py + train_{tinker,verl}.sh | Hydra entry points |
| pyproject.toml | Plugin entry-point declarations |
| test.py | 7 unit tests covering correct / wrong / no-fence / latex equivalence |

On GitHub

cookbooks/math

Full source, README, and runnable launch scripts