The model writes its solution in a ```python block; the evaluator extracts the last block and runs it against hidden test cases.
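The extraction step can be sketched as follows. This is a minimal illustration that assumes fences are matched with a regular expression; the helper name `extract_last_python_block` is hypothetical and is not the parser the cookbook ships.

```python
import re
from typing import Optional

# Build the fence marker programmatically so this example does not
# contain a literal fence of its own.
FENCE = "`" * 3

# Capture the body of each python-tagged fence (non-greedy, across lines).
_FENCE_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_last_python_block(response: str) -> Optional[str]:
    """Return the body of the last python block, or None if there is none."""
    blocks = _FENCE_RE.findall(response)
    return blocks[-1].strip() if blocks else None


# A response with a draft block followed by a final block.
response = (
    "First attempt:\n" + FENCE + "python\nprint('draft')\n" + FENCE + "\n"
    "Final answer:\n" + FENCE + "python\nprint('final')\n" + FENCE + "\n"
)
print(extract_last_python_block(response))
print(extract_last_python_block("no code here"))
```

Taking the last block rather than the first lets the model draft and revise inside a single message while only its final answer is graded.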
Pattern
| Aspect | Value |
|---|---|
| Loop shape | Single-turn (one LLM call per task) |
| Tools | None — code is parsed out of the response |
| Termination | Single LLM call returns; evaluator runs hidden tests |
| Reward shape | 1.0 if all hidden tests pass, 0.0 otherwise |
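The all-or-nothing reward in the table can be sketched as a small aggregation over per-test results. The function name `binary_reward` and the choice to score an empty result list as 0.0 are assumptions made for illustration, not behavior confirmed by the source.

```python
from typing import Sequence


def binary_reward(test_results: Sequence[bool]) -> float:
    """All-or-nothing reward: 1.0 only if every hidden test passed."""
    # Assumption: no results (for example, no code could be extracted)
    # counts as a failure rather than a vacuous pass.
    if not test_results:
        return 0.0
    return 1.0 if all(test_results) else 0.0


print(binary_reward([True, True, True]))
print(binary_reward([True, False, True]))
```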
Architecture
Long chain-of-thought reasoning happens inside the assistant message; there is no multi-turn revise/feedback loop. This matches the original deepcoder training setup.
Install
Dataset
prepare_data.py pulls agentica-org/DeepCoder-Preview-Dataset (primeintellect + taco + lcbv5 train; codeforces + lcbv5 test) and normalizes the test schemas (TACO’s nested dict → flat list).
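The schema normalization might look roughly like the sketch below. The nested layout (parallel "inputs" and "outputs" lists) and the flat per-case keys ("input", "output") are assumptions chosen for illustration; the actual field names used by prepare_data.py may differ.

```python
from typing import Any, Dict, List


def flatten_tests(tests: Any) -> List[Dict[str, Any]]:
    """Normalize test cases to a flat list of {"input", "output"} dicts."""
    # Already flat: a list of per-case dicts passes through unchanged.
    if isinstance(tests, list):
        return tests
    # Nested layout (assumed): parallel "inputs" / "outputs" lists.
    inputs = tests.get("inputs", [])
    outputs = tests.get("outputs", [])
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    return [{"input": i, "output": o} for i, o in zip(inputs, outputs)]


nested = {"inputs": ["1 2\n", "3 4\n"], "outputs": ["3\n", "7\n"]}
print(flatten_tests(nested))
```

Converting every source to one flat shape means the evaluator only has to handle a single test format, whichever split a task came from.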
Eval
Running rllm eval deepcoder --max-examples 10 against gpt-5.4-mini reports 5/10 correct (50% accuracy), with per-item rewards a mix of 1.0 and 0.0.
Training
Key code
The flow is a single LLM call; all the reasoning lives inside the one assistant message.
The evaluator wraps rllm.rewards.code_reward.RewardCodeFn, which runs the extracted code against the hidden tests in a sandboxed subprocess.
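To illustrate what running code against tests in a subprocess involves, here is a simplified stand-in. It is not RewardCodeFn: the helper `run_tests` is hypothetical, the test format (stdin text in, expected stdout out) is assumed, and a plain subprocess with a timeout gives far weaker isolation than a real sandbox.

```python
import subprocess
import sys
from typing import Dict, List


def run_tests(code: str, tests: List[Dict[str, str]],
              timeout_s: float = 5.0) -> List[bool]:
    """Run `code` once per test in a fresh interpreter; report pass/fail."""
    results = []
    for case in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=case["input"],      # test input is fed on stdin
                capture_output=True,
                text=True,
                timeout=timeout_s,        # kill runaway solutions
            )
            passed = (proc.returncode == 0
                      and proc.stdout.strip() == case["output"].strip())
        except subprocess.TimeoutExpired:
            passed = False
        results.append(passed)
    return results


solution = "a, b = map(int, input().split())\nprint(a + b)"
tests = [{"input": "1 2\n", "output": "3\n"},
         {"input": "3 4\n", "output": "7\n"}]
print(run_tests(solution, tests))
```

Running each case in its own process keeps a crash or infinite loop in one test from affecting the others or the grader itself.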
Files
| File | Description |
|---|---|
| deepcoder_flow.py | The single-turn AgentFlow |
| evaluator.py | Wraps RewardCodeFn for hidden-test grading |
| prepare_data.py | Pull + normalize Deepcoder splits via DatasetRegistry |
| train.py + train_{tinker,verl}.sh | Hydra entry points |
| pyproject.toml | Plugin entry-point declarations |
| test.py | 5 unit tests covering correct / wrong / no-fence / multi-block / Task vs dict |
On GitHub
cookbooks/deepcoder
Full source, README, and runnable launch scripts

