nat.plugins.eval.eval_callbacks#

Attributes#

logger

Classes#

EvalResultItem

Per-dataset-item result from evaluation.

EvalResult

Full result of a single evaluation run.

EvalCallback

Base class for protocol classes.

EvalCallbackManager

Dispatches eval lifecycle callbacks to registered integrations.

Functions#

build_eval_result(→ EvalResult)

Build an EvalResult from raw evaluation data.

Module Contents#

logger#
class EvalResultItem#

Per-dataset-item result from evaluation.

item_id: Any#
input_obj: Any#
expected_output: Any#
actual_output: Any#
scores: dict[str, float]#
reasoning: dict[str, Any]#
total_tokens: int | None = None#
llm_latency: float | None = None#
runtime: float | None = None#
root_span_id: int | None = None#
class EvalResult#

Full result of a single evaluation run.

The metric_scores and items fields are always populated. The remaining fields are optional context that exporters (e.g. FileEvalCallback) can use to persist richer output without breaking callbacks that only inspect scores.

metric_scores: dict[str, float]#
items: list[EvalResultItem]#
evaluation_outputs: list[tuple[str, Any]] = []#
workflow_output_json: str | None = None#
atif_workflow_output_json: str | None = None#
run_config: Any | None = None#
effective_config: Any | None = None#
output_dir: pathlib.Path | None = None#
build_eval_result(
*,
eval_input_items: list,
evaluation_results: list[tuple[str, Any]],
metric_scores: dict[str, float],
usage_stats: Any | None = None,
item_span_ids: dict[str, int] | None = None,
workflow_output_json: str | None = None,
atif_workflow_output_json: str | None = None,
run_config: Any | None = None,
effective_config: Any | None = None,
output_dir: pathlib.Path | None = None,
) → EvalResult#

Build an EvalResult from raw evaluation data.

This is the single place that maps eval-input items + evaluator outputs into the callback-friendly EvalResult / EvalResultItem structure.

class EvalCallback#

Bases: Protocol

Base class for protocol classes.

Protocol classes are defined as:

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing).

For example:

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic; they are defined as:

class GenProto[T](Protocol):
    def meth(self) -> T:
        ...

on_dataset_loaded(
*,
dataset_name: str,
items: list[nat.eval.evaluator.evaluator_model.EvalInputItem],
) → None#
on_eval_complete(result: EvalResult) → None#
on_eval_started(
*,
workflow_alias: str,
eval_input: Any,
config: Any,
job_id: str | None = None,
) → None#
on_prediction(*, item: Any, output: Any) → None#
async a_on_usage_stats(*, item: Any, usage_stats_item: Any) → None#
async a_on_evaluator_score(*, eval_output: Any, evaluator_name: str) → None#
async a_on_export_flush() → None#
on_eval_summary(
*,
usage_stats: Any,
evaluation_results: Any,
profiler_results: Any,
) → None#
evaluation_context()#
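Because EvalCallback is a Protocol, an integration does not have to inherit from it; any class with matching methods satisfies it structurally. A hedged sketch of a minimal recording callback (the Protocol stub below redeclares a small subset of the documented surface for illustration; it is not the real class):

```python
from typing import Any, Protocol, runtime_checkable

# Subset of the documented EvalCallback surface, redeclared for illustration.
@runtime_checkable
class EvalCallback(Protocol):
    def on_prediction(self, *, item: Any, output: Any) -> None: ...
    def on_eval_complete(self, result: Any) -> None: ...

class RecordingCallback:
    """Satisfies EvalCallback structurally; no inheritance required."""

    def __init__(self) -> None:
        self.predictions: list[tuple[Any, Any]] = []

    def on_prediction(self, *, item: Any, output: Any) -> None:
        self.predictions.append((item, output))

    def on_eval_complete(self, result: Any) -> None:
        print(f"eval finished with scores: {result}")

cb = RecordingCallback()
# runtime_checkable isinstance() only checks that the methods exist.
assert isinstance(cb, EvalCallback)
cb.on_prediction(item="q1", output="4")
cb.on_eval_complete({"accuracy": 1.0})
```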
class EvalCallbackManager#

Dispatches eval lifecycle callbacks to registered integrations.

Maintainer note: Keep this callback surface stable for provider plugins. If we later adopt an internal event-subscriber bus (typed events, async fan-out, retries), it can be introduced behind this manager as a near-term design evolution.

_callbacks: list[EvalCallback] = []#
register(callback: EvalCallback) → None#
property has_callbacks: bool#
property needs_root_span_ids: bool#

Check if any registered callback declares it needs pre-generated root span_ids.

on_dataset_loaded(
*,
dataset_name: str,
items: list[nat.eval.evaluator.evaluator_model.EvalInputItem],
) → None#
on_eval_started(
*,
workflow_alias: str,
eval_input: Any,
config: Any,
job_id: str | None = None,
) → None#
on_prediction(*, item: Any, output: Any) → None#
async a_on_usage_stats(*, item: Any, usage_stats_item: Any) → None#
async a_on_evaluator_score(*, eval_output: Any, evaluator_name: str) → None#
async a_on_export_flush() → None#
on_eval_summary(
*,
usage_stats: Any,
evaluation_results: Any,
profiler_results: Any,
) → None#
evaluation_context()#
on_eval_complete(result: EvalResult) → None#
get_eval_project_name() → str | None#

Get an eval-specific project name from the first callback that supports it.