# AgentFlow
An `AgentFlow` is any callable that takes a task and returns an Episode. It is the eval-side equivalent of a `Workflow` (used during training), but it has no training dependencies: it only needs a `base_url` and `model` to make LLM calls.
An AgentFlow exposes `run` (sync) or `arun` (async); the runner prefers `arun` when available.
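A minimal sketch of that contract, with `Task` and `Episode` stubbed as local placeholders (the real framework types will differ):

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class Task:
    """Placeholder stand-in for the framework's Task type."""
    data: dict[str, Any]

@dataclass
class Episode:
    """Placeholder stand-in for the framework's Episode type."""
    trajectories: list[dict[str, Any]] = field(default_factory=list)

class AgentFlow(Protocol):
    """Anything with run(task) -> Episode (or an async arun) qualifies."""
    def run(self, task: Task) -> Episode: ...

class EchoAgent:
    """Trivial AgentFlow: answers with the question text itself."""
    def run(self, task: Task) -> Episode:
        return Episode(trajectories=[{"response": task.data["question"]}])

episode = EchoAgent().run(Task(data={"question": "2 + 2"}))
```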
## Task and AgentConfig
These are the two arguments passed to every AgentFlow call. `Task` wraps a raw dataset row; agents can use `task.spec` for structured rendering or fall back to reading `task.data` directly.
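A hedged illustration of that fallback, with `Task` defined locally and `spec.render()` as a hypothetical method name (the real rendering API may differ):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Task:
    """Stand-in for the framework's Task; field names are assumptions."""
    data: dict[str, Any]   # the raw dataset row
    spec: Any = None       # structured rendering info, when available

task = Task(data={"question": "What is 7 * 6?", "answer": "42"})

# Use the structured spec when present; otherwise read the row directly.
# spec.render() is a hypothetical method name for illustration.
prompt = task.spec.render() if task.spec else task.data["question"]
```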
## Example AgentFlow
A minimal math agent that solves a problem in one turn:
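A sketch of what such an agent could look like, with the framework types stubbed locally and the LLM call replaced by a canned `complete` placeholder (a real agent would POST to an endpoint at `base_url`):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Task:
    """Placeholder for the framework's Task type."""
    data: dict[str, Any]

@dataclass
class Episode:
    """Placeholder for the framework's Episode type."""
    trajectories: list[dict[str, Any]] = field(default_factory=list)

def complete(base_url: str, model: str, prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned answer for the sketch."""
    return "4"

@dataclass
class MathAgent:
    base_url: str
    model: str

    def run(self, task: Task) -> Episode:
        prompt = task.data["question"]
        answer = complete(self.base_url, self.model, prompt)
        # One turn: a single prompt/response pair recorded in the episode.
        return Episode(trajectories=[{"prompt": prompt, "response": answer}])

agent = MathAgent(base_url="http://localhost:8000/v1", model="my-model")
episode = agent.run(Task(data={"question": "What is 2 + 2?"}))
```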
## Evaluator

An `Evaluator` scores an Episode produced by an AgentFlow. It examines the task and the episode's trajectories, then returns an `EvalOutput` with a reward and correctness flag.
## EvalOutput
The result of evaluating a single episode. The `signals` field lets an evaluator report multiple dimensions of quality; for example, an IFEval evaluator might return separate signals for instruction-following accuracy and format compliance.
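For illustration, a stand-in `EvalOutput` dataclass and an IFEval-style result reporting two quality dimensions alongside the scalar reward (field names follow the prose here; the real definition may differ):

```python
from dataclasses import dataclass, field

@dataclass
class EvalOutput:
    """Stand-in for the framework's EvalOutput; field names are assumptions."""
    reward: float                                   # scalar score
    is_correct: bool                                # correctness flag
    signals: dict[str, float] = field(default_factory=dict)  # extra dimensions

# The episode followed the instruction but broke the required format,
# so the two signals disagree and the reward lands in between.
out = EvalOutput(
    reward=0.5,
    is_correct=False,
    signals={"instruction_accuracy": 1.0, "format_compliance": 0.0},
)
```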
## Example Evaluator
A simple exact-match evaluator for math:
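A sketch of such an evaluator, with the framework types stubbed locally; the `is_correct` field name and the trajectory layout are assumptions:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Task:
    """Placeholder for the framework's Task type."""
    data: dict[str, Any]

@dataclass
class Episode:
    """Placeholder for the framework's Episode type."""
    trajectories: list[dict[str, Any]] = field(default_factory=list)

@dataclass
class EvalOutput:
    """Placeholder for the framework's EvalOutput type."""
    reward: float
    is_correct: bool

class ExactMatchEvaluator:
    """Compares the final response to the gold answer after whitespace trim."""
    def evaluate(self, task: Task, episode: Episode) -> EvalOutput:
        gold = str(task.data["answer"]).strip()
        pred = str(episode.trajectories[-1]["response"]).strip()
        correct = pred == gold
        return EvalOutput(reward=1.0 if correct else 0.0, is_correct=correct)

out = ExactMatchEvaluator().evaluate(
    Task(data={"answer": "4"}),
    Episode(trajectories=[{"response": " 4 "}]),
)
```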
## Built-in Evaluators

| Evaluator | Method | Use case |
|---|---|---|
| `ExactMatchEvaluator` | String comparison | Math, factoid QA |
| `LLMJudgeEvaluator` | LLM-as-judge | Open-ended generation, summarization |
| `BFCLEvaluator` | Function call matching | Tool-use benchmarks |
| `IFEvalEvaluator` | Instruction-following checks | Instruction-following benchmarks |
## How they work together
The eval runner connects AgentFlow and Evaluator in a simple pipeline. To score results, the runner passes each Episode to `evaluator.evaluate(task, episode)` to get an `EvalOutput`.
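That loop can be sketched as follows; the type names and the runner's internals are assumptions, and a toy agent and evaluator stand in for real ones:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Episode:
    """Placeholder Episode, reduced to a single response for the sketch."""
    response: str

@dataclass
class EvalOutput:
    """Placeholder EvalOutput, reduced to the scalar reward."""
    reward: float

def run_eval(
    agent_run: Callable[[dict], Episode],
    evaluate: Callable[[dict, Episode], EvalOutput],
    tasks: list[dict[str, Any]],
) -> list[EvalOutput]:
    """For each task: the AgentFlow produces an Episode, the Evaluator scores it."""
    results = []
    for task in tasks:
        episode = agent_run(task)                 # 1. run the agent
        results.append(evaluate(task, episode))   # 2. score the episode
    return results

# Toy agent and evaluator to exercise the loop.
agent_run = lambda task: Episode(response="4")
evaluate = lambda task, ep: EvalOutput(
    reward=1.0 if ep.response == task["answer"] else 0.0
)

results = run_eval(agent_run, evaluate, [{"question": "2 + 2 = ?", "answer": "4"}])
```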
## Connection to training

AgentFlow and Evaluator are designed to be reusable across eval and training. During training:

- The AgentFlow logic is replaced by a `Workflow` that also captures token-level data (prompt IDs, logprobs) needed for policy gradients.
- The Evaluator logic maps directly to reward functions: the same scoring logic that produces `EvalOutput.reward` during eval produces the reward signal during training.
- The Episode structure is identical in both paths, so evaluation results are directly comparable to training metrics.

