Pattern
| Aspect | Value |
|---|---|
| Loop shape | Single-turn (one VLM call per task) |
| Tools | None — answer is parsed out of the response |
| Inputs | Multimodal — text question + base64-encoded diagram image |
| Termination | Single LLM call returns; evaluator extracts \boxed{…} |
| Reward shape | 1.0 if boxed answer matches ground truth (symbolic math), else 0.0 |
Architecture
The cookbook demonstrates the multimodal content-block pattern in an AgentFlow — themessages list contains a {"type": "image_url", "image_url": {"url": f"data:image/png;base64,…"}} block alongside the text content.
Install
Dataset
Eval
Training
Files
| File | Description |
|---|---|
geo3k_flow.py | Single-turn VLM AgentFlow with multimodal content blocks |
evaluator.py | \boxed{} extraction + symbolic math grading |
train.py + train_{tinker,verl}.sh | Hydra entry points |
pyproject.toml | Plugin entry-point declarations |
test.py | Unit tests |
On GitHub
cookbooks/geo3k
Full source, README, and runnable launch scripts

