nat.eval.eval_callbacks#

Attributes#

logger

Classes#

EvalResultItem

Per-dataset-item result from evaluation.

EvalResult

Full result of a single evaluation run.

EvalCallback

Protocol for evaluation callbacks (on_dataset_loaded, on_eval_complete).

EvalCallbackManager

Functions#

build_eval_result(→ EvalResult)

Build an EvalResult from raw evaluation data.

Module Contents#

logger#
class EvalResultItem#

Per-dataset-item result from evaluation.

item_id: Any#
input_obj: Any#
expected_output: Any#
actual_output: Any#
scores: dict[str, float]#
reasoning: dict[str, Any]#
total_tokens: int | None = None#
llm_latency: float | None = None#
runtime: float | None = None#
root_span_id: int | None = None#
class EvalResult#

Full result of a single evaluation run.

metric_scores: dict[str, float]#
items: list[EvalResultItem]#
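
As a rough illustration of how the two result types fit together, the sketch below builds one per-item result and wraps it in a run-level result. The classes here are local stand-ins that only mirror the fields listed above; whether the real classes are dataclasses is an assumption, and the evaluator name "accuracy" and all values are made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


# Stand-ins mirroring the documented fields. The real classes live in
# nat.eval.eval_callbacks and may be implemented differently.
@dataclass
class EvalResultItem:
    item_id: Any
    input_obj: Any
    expected_output: Any
    actual_output: Any
    scores: dict[str, float]
    reasoning: dict[str, Any]
    total_tokens: int | None = None
    llm_latency: float | None = None
    runtime: float | None = None
    root_span_id: int | None = None


@dataclass
class EvalResult:
    metric_scores: dict[str, float]
    items: list[EvalResultItem]


# One dataset item scored by a hypothetical "accuracy" evaluator.
item = EvalResultItem(
    item_id="q1",
    input_obj="What is 2 + 2?",
    expected_output="4",
    actual_output="4",
    scores={"accuracy": 1.0},
    reasoning={"accuracy": "exact match"},
)

# Run-level result: aggregate metric scores plus the per-item details.
result = EvalResult(metric_scores={"accuracy": 1.0}, items=[item])
```

The optional fields (total_tokens, llm_latency, runtime, root_span_id) default to None, so a consumer should treat them as possibly missing.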
build_eval_result(
    *,
    eval_input_items: list,
    evaluation_results: list[tuple[str, Any]],
    metric_scores: dict[str, float],
    usage_stats: Any | None = None,
    item_span_ids: dict[str, int] | None = None,
) → EvalResult#

Build an EvalResult from raw evaluation data.

This is the single place that maps eval-input items and evaluator outputs into the callback-friendly EvalResult / EvalResultItem structure.

class EvalCallback#

Bases: Protocol

Protocol for evaluation callbacks.

EvalCallback is a typing.Protocol, so any object that defines on_dataset_loaded and on_eval_complete with matching signatures satisfies it through structural subtyping (see PEP 544); inheriting from EvalCallback is not required.

on_dataset_loaded(
    *,
    dataset_name: str,
    items: list[nat.eval.evaluator.evaluator_model.EvalInputItem],
) → None#
on_eval_complete(result: EvalResult) → None#
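
Since EvalCallback is listed with Bases: Protocol, a callback can be written as a plain class with the two methods above. The following is a minimal sketch under stated assumptions: EvalCallback and EvalResult are re-declared locally as stand-ins so the example runs on its own, items is typed as list[Any] in place of EvalInputItem, and SummaryCallback is a hypothetical name.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class EvalResult:
    # Stand-in with the two documented fields.
    metric_scores: dict[str, float]
    items: list[Any]


class EvalCallback(Protocol):
    # Stand-in mirroring the documented method signatures.
    def on_dataset_loaded(self, *, dataset_name: str, items: list[Any]) -> None: ...

    def on_eval_complete(self, result: EvalResult) -> None: ...


class SummaryCallback:
    """Hypothetical callback that records what it is told.

    It does not inherit from EvalCallback; matching method
    signatures are enough for a static type checker.
    """

    def __init__(self) -> None:
        self.dataset_name: str | None = None
        self.item_count = 0
        self.metric_scores: dict[str, float] = {}

    def on_dataset_loaded(self, *, dataset_name: str, items: list[Any]) -> None:
        self.dataset_name = dataset_name
        self.item_count = len(items)

    def on_eval_complete(self, result: EvalResult) -> None:
        # Copy so later mutation of the result does not affect the summary.
        self.metric_scores = dict(result.metric_scores)


def notify(callback: EvalCallback) -> None:
    # Accepts anything that structurally matches EvalCallback.
    callback.on_dataset_loaded(dataset_name="demo", items=["a", "b", "c"])
    callback.on_eval_complete(EvalResult(metric_scores={"accuracy": 0.5}, items=[]))


summary = SummaryCallback()
notify(summary)
```

Note that on_dataset_loaded takes keyword-only arguments, so callers must pass dataset_name= and items= by name.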
class EvalCallbackManager#
_callbacks: list[EvalCallback] = []#
register(callback: EvalCallback) → None#
property has_callbacks: bool#
property needs_root_span_ids: bool#

Check if any registered callback declares it needs pre-generated root span_ids.

on_dataset_loaded(
    *,
    dataset_name: str,
    items: list[nat.eval.evaluator.evaluator_model.EvalInputItem],
) → None#
on_eval_complete(result: EvalResult) → None#
get_eval_project_name() → str | None#

Get an eval-specific project name from the first callback that supports it.
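
The manager's documented surface suggests a simple fan-out over registered callbacks. The sketch below is not the library's implementation; it is one plausible reading of the descriptions above. In particular, how a callback "declares" that it needs root span_ids and how it "supports" a project name are not specified here, so the sketch assumes an optional needs_root_span_ids attribute and an optional get_eval_project_name method on each callback. SketchCallbackManager, TracingCallback, and PlainCallback are hypothetical names.

```python
from __future__ import annotations

from typing import Any


class SketchCallbackManager:
    """Illustrative fan-out manager; not the real EvalCallbackManager."""

    def __init__(self) -> None:
        self._callbacks: list[Any] = []

    def register(self, callback: Any) -> None:
        self._callbacks.append(callback)

    @property
    def has_callbacks(self) -> bool:
        return bool(self._callbacks)

    @property
    def needs_root_span_ids(self) -> bool:
        # Assumption: callbacks opt in through an optional boolean attribute.
        return any(getattr(cb, "needs_root_span_ids", False) for cb in self._callbacks)

    def on_dataset_loaded(self, *, dataset_name: str, items: list[Any]) -> None:
        for cb in self._callbacks:
            cb.on_dataset_loaded(dataset_name=dataset_name, items=items)

    def on_eval_complete(self, result: Any) -> None:
        for cb in self._callbacks:
            cb.on_eval_complete(result)

    def get_eval_project_name(self) -> str | None:
        # Assumption: "supports it" means the callback has this optional method.
        for cb in self._callbacks:
            getter = getattr(cb, "get_eval_project_name", None)
            if callable(getter):
                return getter()
        return None


class PlainCallback:
    def __init__(self) -> None:
        self.events: list[str] = []

    def on_dataset_loaded(self, *, dataset_name: str, items: list[Any]) -> None:
        self.events.append(f"loaded:{dataset_name}:{len(items)}")

    def on_eval_complete(self, result: Any) -> None:
        self.events.append("complete")


class TracingCallback(PlainCallback):
    needs_root_span_ids = True

    def get_eval_project_name(self) -> str:
        return "demo-project"


manager = SketchCallbackManager()
empty_before = manager.has_callbacks
plain, tracing = PlainCallback(), TracingCallback()
manager.register(plain)
manager.register(tracing)
manager.on_dataset_loaded(dataset_name="demo", items=[1, 2])
manager.on_eval_complete(result=None)
```

A production manager would likely also decide what to do when one callback raises, for example logging the error and continuing so that a faulty callback does not abort the evaluation; that choice is left out of the sketch.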