nat.data_models.evaluate_runtime#

Runtime-only evaluation models used by the nat eval command and by programmatic evaluation runs.

Classes#

EndpointRetryConfig

Configuration for HTTP retry behavior on remote workflow endpoints.

EvaluationRunConfig

Parameters used for a single evaluation run. This is used by the nat eval command. It can also be used for programmatic evaluation.

UsageStatsLLM

Token usage counters aggregated for one LLM.

UsageStatsItem

Usage metrics for one evaluated input item.

UsageStats

Aggregated usage metrics across an evaluation run.

InferenceMetricsModel

Confidence intervals and percentiles for a sampled profiler metric.

WorkflowRuntimeMetrics

p90/p95/p99 workflow runtimes across evaluation examples.

ProfilerResults

High-level profiler output attached to an evaluation run.

EvaluationRunOutput

Output of a single evaluation run.

Module Contents#

class EndpointRetryConfig(/, **data: Any)#

Bases: pydantic.BaseModel

Configuration for HTTP retry behavior on remote workflow endpoints.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

do_auto_retry: bool = None#
max_retries: int = None#
retry_status_codes: list[int] = None#
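A minimal sketch of how a config like this could drive retry decisions against a remote workflow endpoint. The dataclass below is an illustrative stand-in for the real pydantic model (same field names; the defaults and the should_retry helper are assumptions of this sketch, not the library's implementation):

```python
from dataclasses import dataclass, field


# Illustrative stand-in for EndpointRetryConfig; the real class is a
# pydantic.BaseModel with these field names. Defaults here are assumptions.
@dataclass
class EndpointRetryConfig:
    do_auto_retry: bool = True
    max_retries: int = 3
    retry_status_codes: list[int] = field(
        default_factory=lambda: [429, 500, 502, 503])


def should_retry(cfg: EndpointRetryConfig, status_code: int, attempt: int) -> bool:
    """Decide whether a failed HTTP request should be retried
    under this configuration (attempt is zero-based)."""
    return (
        cfg.do_auto_retry
        and attempt < cfg.max_retries
        and status_code in cfg.retry_status_codes
    )


cfg = EndpointRetryConfig()
print(should_retry(cfg, 429, attempt=0))  # True: retryable status, budget left
print(should_retry(cfg, 404, attempt=0))  # False: 404 is not retryable here
```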
class EvaluationRunConfig(/, **data: Any)#

Bases: pydantic.BaseModel

Parameters used for a single evaluation run. This is used by the nat eval command. It can also be used for programmatic evaluation.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

config_file: pathlib.Path | pydantic.BaseModel = None#
dataset: str | None = None#
result_json_path: str = None#
skip_workflow: bool = None#
skip_completed_entries: bool = None#
endpoint: str | None = None#
endpoint_timeout: int = None#
endpoint_retry: EndpointRetryConfig = None#
reps: int = None#
override: tuple[tuple[str, str], ...] = None#
write_output: bool = None#
adjust_dataset_size: bool = None#
num_passes: int = None#
export_timeout: float = None#
user_id: str = None#
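The override field holds (dotted_path, value) string pairs that patch entries of the loaded workflow config. The helper and dotted paths below are hypothetical and only illustrate that shape; they are not the library's implementation:

```python
def apply_overrides(config: dict,
                    overrides: tuple[tuple[str, str], ...]) -> dict:
    """Apply dotted-path string overrides to a nested config dict
    (illustrative sketch of what the `override` tuples express)."""
    for path, value in overrides:
        node = config
        *parents, leaf = path.split(".")
        for key in parents:
            # Walk (or create) intermediate mappings along the dotted path.
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Hypothetical dotted path into a workflow config:
cfg = {"llms": {"my_llm": {"temperature": "0.0"}}}
apply_overrides(cfg, (("llms.my_llm.temperature", "0.7"),))
print(cfg["llms"]["my_llm"]["temperature"])  # 0.7
```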
class UsageStatsLLM(/, **data: Any)#

Bases: pydantic.BaseModel

Token usage counters aggregated for one LLM.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

prompt_tokens: int = 0#
completion_tokens: int = 0#
cached_tokens: int = 0#
reasoning_tokens: int = 0#
total_tokens: int = 0#
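A sketch of accumulating these counters across multiple calls to one LLM. The dataclass mirrors the documented field names (all counters default to 0); treating total_tokens as prompt plus completion tokens is an assumption of this sketch:

```python
from dataclasses import dataclass


# Illustrative stand-in for UsageStatsLLM; the real class is a
# pydantic.BaseModel with the same field names and zero defaults.
@dataclass
class UsageStatsLLM:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cached_tokens: int = 0
    reasoning_tokens: int = 0
    total_tokens: int = 0


def accumulate(stats: UsageStatsLLM, prompt: int, completion: int) -> None:
    """Fold one LLM call's token counts into the running counters.
    Counting total as prompt + completion is this sketch's assumption."""
    stats.prompt_tokens += prompt
    stats.completion_tokens += completion
    stats.total_tokens += prompt + completion


s = UsageStatsLLM()
accumulate(s, prompt=120, completion=30)
accumulate(s, prompt=80, completion=20)
print(s.total_tokens)  # 250
```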
class UsageStatsItem(/, **data: Any)#

Bases: pydantic.BaseModel

Usage metrics for one evaluated input item.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

usage_stats_per_llm: dict[str, UsageStatsLLM]#
total_tokens: int | None = None#
runtime: float = 0.0#
min_timestamp: float = 0.0#
max_timestamp: float = 0.0#
llm_latency: float = 0.0#
class UsageStats(/, **data: Any)#

Bases: pydantic.BaseModel

Aggregated usage metrics across an evaluation run.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

min_timestamp: float = 0.0#
max_timestamp: float = 0.0#
total_runtime: float = 0.0#
usage_stats_items: dict[object, UsageStatsItem]#
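One plausible way run-level values could be derived from the per-item timestamps: the run window spans the earliest min_timestamp to the latest max_timestamp. The field names come from the docs above; the aggregation rule itself is an assumption of this sketch:

```python
# Per-item timestamps, as plain dicts standing in for UsageStatsItem.
items = {
    "item-1": {"min_timestamp": 10.0, "max_timestamp": 14.5},
    "item-2": {"min_timestamp": 11.0, "max_timestamp": 18.0},
}

# Run-level window: earliest start to latest end across all items.
min_timestamp = min(i["min_timestamp"] for i in items.values())
max_timestamp = max(i["max_timestamp"] for i in items.values())
total_runtime = max_timestamp - min_timestamp

print(min_timestamp, max_timestamp, total_runtime)  # 10.0 18.0 8.0
```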
class InferenceMetricsModel(/, **data: Any)#

Bases: pydantic.BaseModel

Confidence intervals and percentiles for a sampled profiler metric.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

n: int = None#
mean: float = None#
ninetieth_interval: tuple[float, float] = None#
ninety_fifth_interval: tuple[float, float] = None#
ninety_ninth_interval: tuple[float, float] = None#
p90: float = None#
p95: float = None#
p99: float = None#
class WorkflowRuntimeMetrics(/, **data: Any)#

Bases: pydantic.BaseModel

p90/p95/p99 workflow runtimes across evaluation examples.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

p90: float#
p95: float#
p99: float#
class ProfilerResults(/, **data: Any)#

Bases: pydantic.BaseModel

High-level profiler output attached to an evaluation run.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

workflow_runtime_metrics: WorkflowRuntimeMetrics | None = None#
llm_latency_ci: InferenceMetricsModel | None = None#
class EvaluationRunOutput(/, **data: Any)#

Bases: pydantic.BaseModel

Output of a single evaluation run.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

workflow_output_file: pathlib.Path | None = None#
evaluator_output_files: list[pathlib.Path] = None#
workflow_interrupted: bool = None#
eval_input: nat.data_models.evaluator.EvalInput = None#
evaluation_results: list[tuple[str, nat.data_models.evaluator.EvalOutput]] = None#
usage_stats: UsageStats | None = None#
profiler_results: ProfilerResults = None#
config_original_file: pathlib.Path | None = None#
config_effective_file: pathlib.Path | None = None#
config_metadata_file: pathlib.Path | None = None#