
A single-turn coding agent for competition-style programming problems. The model emits reasoning followed by a fenced ```python block; the evaluator extracts the last block and runs it against hidden test cases.

Pattern

| Aspect | Value |
| --- | --- |
| Loop shape | Single-turn (one LLM call per task) |
| Tools | None — code is parsed out of the response |
| Termination | Single LLM call returns; evaluator runs hidden tests |
| Reward shape | 1.0 if all hidden tests pass, 0.0 otherwise |

Architecture

AgentFlow.run(task, config)

  ├── one LLM call via OpenAI(base_url=config.base_url)
  │     model outputs reasoning + ```python ... ```

  └── store full response in episode.artifacts["answer"]

Evaluator.evaluate(task, episode)

  └── RewardCodeFn extracts last ```python``` block, runs against
      task.metadata["ground_truth"] (hidden tests)

Long chain-of-thought reasoning happens inside the assistant message — there is no multi-turn revise/feedback loop. This matches the original DeepCoder training setup.
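The actual parsing lives in rllm.rewards.code_reward.RewardCodeFn; as an illustration only, last-block extraction might look like this minimal sketch (the function name and regex are mine, not the library's):

```python
import re

# Build the fence token programmatically so this snippet itself
# stays easy to embed in fenced documentation.
FENCE = "`" * 3
LAST_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def extract_last_python_block(text: str):
    """Return the code inside the last python-tagged fenced block, or None."""
    blocks = LAST_BLOCK.findall(text)
    return blocks[-1] if blocks else None
```

The real parser may accept more fence variants; this sketch only grabs python-tagged blocks, taking the last one as the pattern table describes.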

Install

uv pip install -e ".[tinker]"                          # rllm + tinker backend
uv pip install --no-deps -e cookbooks/deepcoder        # this cookbook
rllm agent list                                        # should show "deepcoder"

Dataset

python cookbooks/deepcoder/prepare_data.py
# Smoke-size:
python cookbooks/deepcoder/prepare_data.py --train-size 200 --test-size 50
This pulls agentica-org/DeepCoder-Preview-Dataset (primeintellect + taco + lcbv5 train; codeforces + lcbv5 test) and normalizes the test schemas (TACO’s nested dict → flat list).
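The flattening step could be sketched as below; the exact field names ("inputs"/"outputs" nested, "input"/"output" flat) are assumptions about the schemas, not prepare_data.py's actual code:

```python
def normalize_taco_tests(tests: dict) -> list[dict]:
    """Flatten a TACO-style nested dict of parallel lists into a flat
    per-case list. Field names here are illustrative assumptions."""
    return [
        {"input": i, "output": o}
        for i, o in zip(tests["inputs"], tests["outputs"])
    ]
```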

Eval

rllm eval deepcoder \
    --agent deepcoder \
    --evaluator deepcoder \
    --model agentica-org/DeepCoder-14B-Preview \
    --base-url http://localhost:8000/v1 \
    --split test \
    --max-examples 20
Verified end-to-end: rllm eval deepcoder --max-examples 10 against gpt-5.4-mini reports 5/10 correct (50% accuracy), with per-item rewards a mix of 1.0 and 0.0.

Training

# Tinker (single-machine LoRA)
bash cookbooks/deepcoder/train_tinker.sh

# Verl (distributed GPU)
bash cookbooks/deepcoder/train_verl.sh

Key code

The flow is a single LLM call — all the reasoning lives inside the one assistant message:
import rllm
from openai import AsyncOpenAI

# Task, AgentConfig, Episode, Step, and Trajectory are rllm types
# (exact import paths omitted here).

@rllm.rollout(name="deepcoder")
async def deepcoder_flow(task: Task, config: AgentConfig) -> Episode:
    question = str(task.metadata.get("question") or task.instruction or "")
    client = AsyncOpenAI(base_url=config.base_url, api_key="EMPTY")
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    resp = await client.chat.completions.create(
        model=config.model, messages=messages,
        temperature=0.6, max_tokens=16384, timeout=600,
    )
    content = resp.choices[0].message.content or ""
    messages.append({"role": "assistant", "content": content})
    step = Step(chat_completions=list(messages), model_response=content, action=content, thought=content)

    return Episode(
        trajectories=[Trajectory(name="deepcoder", steps=[step])],
        artifacts={"answer": content},
    )
The evaluator delegates to rllm.rewards.code_reward.RewardCodeFn, which runs the extracted code against the hidden tests in a sandboxed subprocess.
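A minimal sketch of that grading loop, assuming stdin/stdout-style tests and a fresh interpreter per test case (the real RewardCodeFn handles more test schemas and stricter sandboxing):

```python
import os
import subprocess
import sys
import tempfile

def run_hidden_test(code: str, stdin_data: str, expected: str,
                    timeout: float = 10.0) -> bool:
    """Run candidate code in a fresh interpreter; compare trimmed stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            input=stdin_data, capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0 and proc.stdout.strip() == expected.strip()
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def binary_reward(code: str, tests: list[dict]) -> float:
    """All-or-nothing reward, matching the 1.0 / 0.0 shape above."""
    passed = all(run_hidden_test(code, t["input"], t["output"]) for t in tests)
    return 1.0 if passed else 0.0
```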

Files

| File | Description |
| --- | --- |
| deepcoder_flow.py | The single-turn AgentFlow |
| evaluator.py | Wraps RewardCodeFn for hidden-test grading |
| prepare_data.py | Pull + normalize Deepcoder splits via DatasetRegistry |
| train.py + train_{tinker,verl}.sh | Hydra entry points |
| pyproject.toml | Plugin entry-point declarations |
| test.py | 5 unit tests covering correct / wrong / no-fence / multi-block / Task vs dict |

On GitHub

cookbooks/deepcoder

Full source, README, and runnable launch scripts