nat.eval.runtime_evaluator.evaluate#

Classes#

_CallTiming

Internal helper that pairs the start and end timestamps of a single LLM call; its latency property reports the elapsed time once both timestamps are set.

AverageLLMLatencyEvaluator

Mean difference between connected LLM_START and LLM_END events (same UUID).

AverageWorkflowRuntimeEvaluator

Average workflow runtime per item: max(event_timestamp) - min(event_timestamp) across the trajectory.

AverageNumberOfLLMCallsEvaluator

Average number of LLM calls per item. The score is the count for the item.

AverageTokensPerLLMEndEvaluator

Average total tokens per LLM_END event: sum of prompt and completion tokens if available.

Module Contents#

class _CallTiming#
start_ts: float | None = None#
end_ts: float | None = None#
property latency: float | None#
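
The members above suggest _CallTiming is a small container recording when one LLM call started and ended. A minimal sketch of such a container follows; the dataclass decorator, docstring, and the end_ts - start_ts subtraction are assumptions inferred from the attribute names, not the module's actual implementation.

from dataclasses import dataclass


@dataclass
class _CallTiming:
    """Timestamps of one LLM call, matched via the UUID shared by its START/END events."""

    start_ts: float | None = None
    end_ts: float | None = None

    @property
    def latency(self) -> float | None:
        # Latency is only defined once both ends of the call have been observed.
        if self.start_ts is None or self.end_ts is None:
            return None
        return self.end_ts - self.start_ts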
class AverageLLMLatencyEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Mean difference between connected LLM_START and LLM_END events (same UUID). The score is the average latency in seconds for the item. Reasoning contains per-call latencies.

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
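
The UUID-based pairing described above can be illustrated with a standalone sketch. This is not the evaluator's actual code: the event objects with event_type, UUID, and event_timestamp attributes are assumptions based on the class description, and the real implementation presumably reads them from the item's trajectory.

def mean_llm_latency(events) -> tuple[float, list[float]]:
    """Pair LLM_START/LLM_END events by UUID and average their latencies.

    `events` is assumed to be an iterable of step objects exposing
    `event_type`, `UUID`, and `event_timestamp`; adapt to the real schema.
    """
    starts: dict[str, float] = {}
    per_call: list[float] = []  # per-call latencies, surfaced as the reasoning
    for event in events:
        if event.event_type == "LLM_START":
            starts[event.UUID] = event.event_timestamp
        elif event.event_type == "LLM_END" and event.UUID in starts:
            per_call.append(event.event_timestamp - starts.pop(event.UUID))
    score = sum(per_call) / len(per_call) if per_call else 0.0
    return score, per_call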

class AverageWorkflowRuntimeEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average workflow runtime per item: max(event_timestamp) - min(event_timestamp) across the trajectory. The score is the runtime in seconds for the item.

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
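
As described, the per-item score is simply the span of the trajectory's timestamps. A minimal sketch under the same assumed event schema as above; the 0.0 fallback for an empty trajectory is an assumption.

def workflow_runtime_seconds(events) -> float:
    """Span between the earliest and latest event timestamps of one item (0.0 if empty)."""
    timestamps = [event.event_timestamp for event in events]
    return max(timestamps) - min(timestamps) if timestamps else 0.0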

class AverageNumberOfLLMCallsEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average number of LLM calls per item. The score is the count for the item.

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
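
Counting calls can be sketched the same way. Counting LLM_START events is an assumption; the evaluator could equally count LLM_END events or completed START/END pairs.

def llm_call_count(events) -> int:
    """Number of LLM calls observed in one item's trajectory."""
    return sum(1 for event in events if event.event_type == "LLM_START")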

class AverageTokensPerLLMEndEvaluator(max_concurrency: int = 8)#

Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average total tokens per LLM_END event: sum of prompt and completion tokens if available. The score is the average tokens per LLM_END for the item (0 if none).

async evaluate_item(
item: nat.eval.evaluator.evaluator_model.EvalInputItem,
) → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this method for item-level evaluation.
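
A sketch of the token accounting, again against an assumed schema: the usage_info, prompt_tokens, and completion_tokens names are placeholders for wherever token counts actually live on an LLM_END event, and missing counts are treated as zero ("if available").

def average_tokens_per_llm_end(events) -> float:
    """Average of (prompt + completion) tokens across LLM_END events; 0.0 if there are none."""
    totals: list[int] = []
    for event in events:
        if event.event_type != "LLM_END":
            continue
        usage = getattr(event, "usage_info", None)            # assumed attribute name
        prompt = getattr(usage, "prompt_tokens", 0) or 0      # treat missing counts as 0
        completion = getattr(usage, "completion_tokens", 0) or 0
        totals.append(prompt + completion)
    return sum(totals) / len(totals) if totals else 0.0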