nat.eval.runtime_evaluator.evaluate#
Classes#
| Class | Description |
|---|---|
| AverageLLMLatencyEvaluator | Mean difference between connected LLM_START and LLM_END events (same UUID). |
| AverageWorkflowRuntimeEvaluator | Average workflow runtime per item: max(event_timestamp) - min(event_timestamp) across the trajectory. |
| AverageNumberOfLLMCallsEvaluator | Average number of LLM calls per item. The score is the count for the item. |
| AverageTokensPerLLMEndEvaluator | Average total tokens per LLM_END event: sum of prompt and completion tokens if available. |
Module Contents#
- class _CallTiming#
- class AverageLLMLatencyEvaluator(max_concurrency: int = 8)#
Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Mean difference between connected LLM_START and LLM_END events (same UUID). The score is the average latency in seconds for the item. Reasoning contains per-call latencies.
- async evaluate_item() → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this for item-level evaluation.
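As a rough illustration of the pairing logic described in the class docstring, the sketch below matches LLM_START and LLM_END events by UUID and averages the per-call latencies. The event structure (plain dicts with `type`, `uuid`, and `event_timestamp` fields) is an assumption made for illustration only, not the toolkit's actual intermediate-step schema.

```python
from statistics import mean

# Hypothetical trajectory events; field names are illustrative assumptions.
events = [
    {"type": "LLM_START", "uuid": "call-1", "event_timestamp": 10.0},
    {"type": "LLM_END", "uuid": "call-1", "event_timestamp": 11.5},
    {"type": "LLM_START", "uuid": "call-2", "event_timestamp": 12.0},
    {"type": "LLM_END", "uuid": "call-2", "event_timestamp": 12.8},
]

# Index START and END timestamps by the UUID that connects them.
starts = {e["uuid"]: e["event_timestamp"] for e in events if e["type"] == "LLM_START"}
ends = {e["uuid"]: e["event_timestamp"] for e in events if e["type"] == "LLM_END"}

# Per-call latency is END minus START for each UUID present in both maps.
latencies = [ends[u] - starts[u] for u in starts if u in ends]

# The item score is the mean latency in seconds; the per-call latencies
# would go into the reasoning field.
score = mean(latencies) if latencies else 0.0
print(score, latencies)  # mean ≈ 1.15 s; per-call ≈ [1.5, 0.8]
```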
- class AverageWorkflowRuntimeEvaluator(max_concurrency: int = 8)#
Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average workflow runtime per item: max(event_timestamp) - min(event_timestamp) across the trajectory. The score is the runtime in seconds for the item.
- async evaluate_item() → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this for item-level evaluation.
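The runtime computation described above reduces to the span between the earliest and latest timestamps in the item's trajectory. A minimal, self-contained sketch, assuming a simplified list of event dicts with an `event_timestamp` field (an illustrative assumption):

```python
# Hypothetical per-item trajectory timestamps in seconds; the
# "event_timestamp" field name is an illustrative assumption.
events = [
    {"event_timestamp": 100.0},
    {"event_timestamp": 101.2},
    {"event_timestamp": 104.5},
]

timestamps = [e["event_timestamp"] for e in events]

# Item score: workflow runtime is the span of the trajectory,
# i.e. latest event minus earliest event.
runtime = max(timestamps) - min(timestamps) if timestamps else 0.0
print(runtime)  # 4.5
```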
- class AverageNumberOfLLMCallsEvaluator(max_concurrency: int = 8)#
Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average number of LLM calls per item. The score is the count for the item.
- async evaluate_item() → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this for item-level evaluation.
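A minimal sketch of counting LLM calls in a trajectory. Treating each LLM_START event as one call is an assumption made here for illustration (the docstring only says "number of LLM calls"), and the dict-based event format is likewise hypothetical.

```python
# Hypothetical trajectory events; only the event type matters here.
events = [
    {"type": "LLM_START"}, {"type": "LLM_END"},
    {"type": "TOOL_START"}, {"type": "TOOL_END"},
    {"type": "LLM_START"}, {"type": "LLM_END"},
]

# Item score: one call per LLM_START event (assumed counting convention).
llm_calls = sum(1 for e in events if e["type"] == "LLM_START")
print(llm_calls)  # 2
```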
- class AverageTokensPerLLMEndEvaluator(max_concurrency: int = 8)#
Bases: nat.eval.evaluator.base_evaluator.BaseEvaluator

Average total tokens per LLM_END event: sum of prompt and completion tokens if available. The score is the average tokens per LLM_END for the item (0 if none).
- async evaluate_item() → nat.eval.evaluator.evaluator_model.EvalOutputItem#

Each evaluator must implement this for item-level evaluation.
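The token metric described above is the mean of (prompt + completion) tokens across LLM_END events, falling back to 0 when there are none. A minimal sketch under assumed field names (`prompt_tokens`, `completion_tokens` are illustrative, not necessarily the toolkit's usage schema):

```python
# Hypothetical usage payloads attached to LLM_END events; the token
# field names are illustrative assumptions.
llm_end_usage = [
    {"prompt_tokens": 120, "completion_tokens": 30},
    {"prompt_tokens": 200, "completion_tokens": 50},
]

# Total tokens per LLM_END event: prompt + completion when available.
totals = [u.get("prompt_tokens", 0) + u.get("completion_tokens", 0) for u in llm_end_usage]

# Item score: average total tokens per LLM_END, or 0 if there were none.
score = sum(totals) / len(totals) if totals else 0.0
print(score)  # 200.0
```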