nat.eval.runtime_evaluator.evaluate#

Classes#

_CallTiming

Internal helper that pairs the start and end timestamps of a single LLM call; its latency property reports the elapsed time once both timestamps are set.

AverageLLMLatencyEvaluator

Mean difference between connected LLM_START and LLM_END events (same UUID).

AverageWorkflowRuntimeEvaluator

Average workflow runtime per item: max(event_timestamp) - min(event_timestamp) across the trajectory.

AverageNumberOfLLMCallsEvaluator

Average number of LLM calls per item. The score is the count for the item.

AverageTokensPerLLMEndEvaluator

Average total tokens per LLM_END event: sum of prompt and completion tokens if available.

Module Contents#

class _CallTiming#
start_ts: float | None = None#
end_ts: float | None = None#
property latency: float | None#
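
The members above suggest _CallTiming is a small container recording when one LLM call started and ended. A minimal sketch of such a container follows; the dataclass decorator, docstring, and the end_ts - start_ts subtraction are assumptions inferred from the attribute names, not the module's actual implementation.

from dataclasses import dataclass


@dataclass
class _CallTiming:
    """Timestamps of one LLM call, matched via the UUID shared by its START/END events."""

    start_ts: float | None = None
    end_ts: float | None = None

    @property
    def latency(self) -> float | None:
        # Latency is only defined once both ends of the call have been observed.
        if self.start_ts is None or self.end_ts is None:
            return None
        return self.end_ts - self.start_ts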
class AverageLLMLatencyEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Mean difference between connected LLM_START and LLM_END events (same UUID). The score is the average latency in seconds for the item. Reasoning contains per-call latencies.

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
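
The UUID-based pairing described above can be illustrated with a standalone sketch. This is not the evaluator's actual code: the event objects with event_type, UUID, and event_timestamp attributes are assumptions based on the class description, and the real implementation presumably reads them from the item's trajectory.

def mean_llm_latency(events) -> tuple[float, list[float]]:
    """Pair LLM_START/LLM_END events by UUID and average their latencies.

    `events` is assumed to be an iterable of step objects exposing
    `event_type`, `UUID`, and `event_timestamp`; adapt to the real schema.
    """
    starts: dict[str, float] = {}
    per_call: list[float] = []  # per-call latencies, surfaced as the reasoning
    for event in events:
        if event.event_type == "LLM_START":
            starts[event.UUID] = event.event_timestamp
        elif event.event_type == "LLM_END" and event.UUID in starts:
            per_call.append(event.event_timestamp - starts.pop(event.UUID))
    score = sum(per_call) / len(per_call) if per_call else 0.0
    return score, per_call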

class AverageWorkflowRuntimeEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average workflow runtime per item: max(event_timestamp) - min(event_timestamp) across the trajectory. The score is the runtime in seconds for the item.

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
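
As described, the per-item score is simply the span of the trajectory's timestamps. A minimal sketch under the same assumed event schema as above; the 0.0 fallback for an empty trajectory is an assumption.

def workflow_runtime_seconds(events) -> float:
    """Span between the earliest and latest event timestamps of one item (0.0 if empty)."""
    timestamps = [event.event_timestamp for event in events]
    return max(timestamps) - min(timestamps) if timestamps else 0.0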

class AverageNumberOfLLMCallsEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average number of LLM calls per item. The score is the count for the item.

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
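
Counting calls can be sketched the same way. Counting LLM_START events is an assumption; the evaluator could equally count LLM_END events or completed START/END pairs.

def llm_call_count(events) -> int:
    """Number of LLM calls observed in one item's trajectory."""
    return sum(1 for event in events if event.event_type == "LLM_START")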

class AverageTokensPerLLMEndEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average total tokens per LLM_END event: sum of prompt and completion tokens if available. The score is the average tokens per LLM_END for the item (0 if none).

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
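
A sketch of the token accounting, again against an assumed schema: the usage_info, prompt_tokens, and completion_tokens names are placeholders for wherever token counts actually live on an LLM_END event, and missing counts are treated as zero ("if available").

def average_tokens_per_llm_end(events) -> float:
    """Average of (prompt + completion) tokens across LLM_END events; 0.0 if there are none."""
    totals: list[int] = []
    for event in events:
        if event.event_type != "LLM_END":
            continue
        usage = getattr(event, "usage_info", None)            # assumed attribute name
        prompt = getattr(usage, "prompt_tokens", 0) or 0      # treat missing counts as 0
        completion = getattr(usage, "completion_tokens", 0) or 0
        totals.append(prompt + completion)
    return sum(totals) / len(totals) if totals else 0.0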