AIPerf Metrics Reference

This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated.

Quick Reference

The sections below provide detailed descriptions, requirements, and notes for each metric.

Understanding Metric Types

AIPerf computes metrics in three distinct phases during benchmark execution: Record Metrics, Aggregate Metrics, and Derived Metrics.

The metric type also determines which stat fields appear for each metric in profile_export_aiperf.json. See JSON Export Schema for the per-field presence rules and version history.

Record Metrics

Record Metrics are computed individually for each request and its response(s) during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce statistical distributions (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests.

Example Metrics

request_latency, time_to_first_token, inter_token_latency, output_token_count, input_sequence_length

Dependencies

Record Metrics can depend on raw request/response data and other Record Metrics from the same request.

Example Scenario

request_latency measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests.
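As a rough illustration of how 100 per-request values become a distribution, the sketch below summarizes a list of latencies using only the standard library. The `summarize` and `percentile` helpers and the nearest-rank percentile rule are assumptions made for this example; AIPerf's own percentile method may differ.

```python
import math
import statistics


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an illustrative choice, not AIPerf's documented method)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(values: list[float]) -> dict[str, float]:
    """Collapse per-request values into the summary statistics a record metric reports."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
    }


# 100 hypothetical request latencies in milliseconds: 1.0, 2.0, ..., 100.0
latencies_ms = [float(i) for i in range(1, 101)]
stats = summarize(latencies_ms)
```

Each key in `stats` corresponds to one statistic of the distribution, which is why a record metric reports several numbers rather than one.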

Aggregate Metrics

Aggregate Metrics are computed by tracking or accumulating values across all requests in real time during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a single value representing the entire benchmark run.

Example Metrics

request_count, error_request_count, min_request_timestamp, max_response_timestamp

Dependencies

Aggregate Metrics can depend on raw request/response data, Record Metrics, and other Aggregate Metrics.

Example Scenario

request_count increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution).
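The accumulation pattern can be sketched as a small running-state object. The `RunAggregates` class and its `observe` method are hypothetical names used only for illustration, and the choice to update the timestamps for failed requests as well is an assumption rather than documented AIPerf behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunAggregates:
    """Hypothetical accumulator; field names mirror the example metrics above."""

    request_count: int = 0
    error_request_count: int = 0
    min_request_timestamp: Optional[int] = None
    max_response_timestamp: Optional[int] = None

    def observe(self, start_ns: int, end_ns: int, ok: bool) -> None:
        """Fold one finished request into the running totals."""
        if ok:
            self.request_count += 1
        else:
            self.error_request_count += 1
        # Assumption: timestamps are tracked for every request, successful or not.
        if self.min_request_timestamp is None or start_ns < self.min_request_timestamp:
            self.min_request_timestamp = start_ns
        if self.max_response_timestamp is None or end_ns > self.max_response_timestamp:
            self.max_response_timestamp = end_ns


aggregates = RunAggregates()
aggregates.observe(start_ns=100, end_ns=250, ok=True)
aggregates.observe(start_ns=50, end_ns=300, ok=False)
aggregates.observe(start_ns=120, end_ns=400, ok=True)
```

After the three calls, each field holds exactly one value for the whole run, in contrast to the per-request distributions of record metrics.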

Derived Metrics

Derived Metrics are computed by applying mathematical formulas to the results of other metrics rather than to individual records. They require one or more prerequisite metrics to be available first, and are calculated either after the benchmark completes (for final results) or in real time across all current data (for the live metrics display). Derived metrics can produce either single values or distributions, depending on their dependencies.

Example Metrics

request_throughput, output_token_throughput, benchmark_duration

Dependencies

Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics, but do not have any knowledge of the individual request/response data.

Example Scenario

request_throughput is computed from request_count / benchmark_duration_seconds. This requires both request_count and benchmark_duration to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec).
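A minimal sketch of this derivation follows. It assumes that the benchmark duration is the span between `min_request_timestamp` and `max_response_timestamp` in nanoseconds, which this section does not state explicitly; the function names are illustrative.

```python
NANOS_PER_SECOND = 1_000_000_000


def benchmark_duration_seconds(min_request_timestamp_ns: int, max_response_timestamp_ns: int) -> float:
    """Duration of the run in seconds (assumed to be last response minus first request)."""
    return (max_response_timestamp_ns - min_request_timestamp_ns) / NANOS_PER_SECOND


def request_throughput(request_count: int, duration_seconds: float) -> float:
    """Requests per second, computed only from previously computed metric results."""
    if duration_seconds <= 0:
        raise ValueError("benchmark duration must be positive")
    return request_count / duration_seconds


duration_s = benchmark_duration_seconds(1_000_000_000, 11_000_000_000)
throughput = request_throughput(105, duration_s)
```

Neither function touches individual requests or responses; both operate only on results that other metrics have already produced.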

Detailed Metric Descriptions

Streaming Metrics

All metrics in this section require the --streaming flag with a token-producing endpoint and at least one non-empty response chunk.

# nanoseconds
ttft_ns = request.content_responses[0].perf_ns - request.start_perf_ns

# Convert to milliseconds for display
ttft_ms = ttft_ns / 1e6

# Convert to seconds for throughput calculations
ttft_seconds = ttft_ns / 1e9

# nanoseconds
ttst_ns = request.content_responses[1].perf_ns - request.content_responses[0].perf_ns

# Convert to milliseconds for display
ttst_ms = ttst_ns / 1e6

# nanoseconds
# First non-reasoning token: TextResponseData with non-empty text, or
# ReasoningResponseData with non-empty content field
ttfo_ns = first_non_reasoning_token_perf_ns - request.start_perf_ns

# Convert to milliseconds for display
ttfo_ms = ttfo_ns / 1e6

# Calculate in nanoseconds, then convert to seconds
inter_token_latency_ns = (request_latency_ns - time_to_first_token_ns) / (output_sequence_length - 1)

# Convert to seconds for throughput calculations
inter_token_latency_seconds = inter_token_latency_ns / 1e9

# Convert to milliseconds for display
inter_token_latency_ms = inter_token_latency_ns / 1e6
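The inter-token latency formula above divides by `output_sequence_length - 1`, so it is undefined for responses with fewer than two output tokens. The sketch below wraps the formula in a function with an explicit guard. Returning `None` in that case is an assumption for illustration; this section does not say how AIPerf handles such records.

```python
from typing import Optional


def inter_token_latency_ns(
    request_latency_ns: int,
    time_to_first_token_ns: int,
    output_sequence_length: int,
) -> Optional[float]:
    """Average time per output token after the first one, in nanoseconds."""
    if output_sequence_length < 2:
        # Assumption: with fewer than two tokens there is no inter-token gap to measure.
        return None
    return (request_latency_ns - time_to_first_token_ns) / (output_sequence_length - 1)


# Hypothetical request: 1.1 s total, 0.1 s to first token, 11 output tokens
itl_ns = inter_token_latency_ns(1_100_000_000, 100_000_000, 11)
itl_ms = itl_ns / 1e6
single_token = inter_token_latency_ns(200_000_000, 200_000_000, 1)
```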

# OpenAI shape: nested under prompt_tokens_details
usage_prompt_cache_read_tokens = response.usage.prompt_tokens_details.cached_tokens  # from last non-None response
# Anthropic shape: top-level
usage_prompt_cache_read_tokens = response.usage.cache_read_input_tokens  # from last non-None response

# Gemini wraps usage in usageMetadata; the property reads through the envelope.
usage_tool_use_prompt_tokens = response.usage.toolUsePromptTokenCount  # from last non-None response

overall_usage_prompt_cache_read_pct = (
    total_usage_prompt_cache_read_tokens / total_usage_prompt_tokens
) * 100
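As a worked example of the percentage above, the sketch below applies the same ratio to concrete totals. The guard for a zero prompt-token total, and the decision to return `None` in that case, are assumptions added for illustration.

```python
from typing import Optional


def overall_usage_prompt_cache_read_pct(
    total_usage_prompt_cache_read_tokens: int,
    total_usage_prompt_tokens: int,
) -> Optional[float]:
    """Share of all prompt tokens that were served from the prompt cache, as a percentage."""
    if total_usage_prompt_tokens <= 0:
        # Assumption: the percentage is undefined when no prompt tokens were reported.
        return None
    return total_usage_prompt_cache_read_tokens / total_usage_prompt_tokens * 100


# Hypothetical run: 250 of 1000 prompt tokens were cache reads
cache_pct = overall_usage_prompt_cache_read_pct(250, 1000)
no_data = overall_usage_prompt_cache_read_pct(0, 0)
```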

# Evaluated per record. The effective threshold is capped so that it is tighter for large OSL values.
threshold_tokens = min(requested_osl * (pct_threshold / 100), max_token_threshold)
diff_tokens = abs(actual_osl - requested_osl)

# Count the records whose difference exceeds their own threshold
osl_mismatch_count = sum(1 for r in records if diff_tokens > threshold_tokens)
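The effect of the cap is easiest to see with numbers. In the sketch below, the `OslRecord` type, the helper names, and the threshold values (5% and 50 tokens) are hypothetical and chosen only to exercise both branches of the `min`.

```python
from dataclasses import dataclass


@dataclass
class OslRecord:
    """Hypothetical per-request record holding requested and actual output lengths."""

    requested_osl: int
    actual_osl: int


def is_osl_mismatch(record: OslRecord, pct_threshold: float, max_token_threshold: int) -> bool:
    """Apply the percentage threshold, capped at an absolute token count."""
    threshold_tokens = min(record.requested_osl * (pct_threshold / 100), max_token_threshold)
    return abs(record.actual_osl - record.requested_osl) > threshold_tokens


def osl_mismatch_count(records: list[OslRecord], pct_threshold: float, max_token_threshold: int) -> int:
    """Count records whose output length deviates by more than their threshold."""
    return sum(1 for r in records if is_osl_mismatch(r, pct_threshold, max_token_threshold))


records = [
    OslRecord(requested_osl=100, actual_osl=97),     # threshold 5 tokens, diff 3
    OslRecord(requested_osl=1000, actual_osl=940),   # threshold 50 tokens, diff 60
    OslRecord(requested_osl=4000, actual_osl=3960),  # 200 capped to 50 tokens, diff 40
    OslRecord(requested_osl=4000, actual_osl=3900),  # 200 capped to 50 tokens, diff 100
]
mismatches = osl_mismatch_count(records, pct_threshold=5.0, max_token_threshold=50)
```

Without the cap, the last record (100 tokens short of 4000) would fall inside a 200-token tolerance and would not be counted.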

Server	Parameter	Notes
vLLM	`min_tokens`	Default: 0
TensorRT-LLM	`min_tokens`	Default: 1
SGLang	`min_new_tokens`	Default: 0
TGI	`min_new_tokens`	Unclear API support; TGI in maintenance mode

attempted = request_count + error_request_count
good_request_fraction = good_request_count / attempted if attempted > 0 else 0.0

# Per GPU: average of gpu_power_usage gauge samples in the profiling window
# (warmup excluded). Summed across all GPUs that reported valid samples.
total_gpu_power_w = sum(
    avg(gpu_power_usage[start_ns:end_ns])
    for gpu in reporting_gpus
)

# Per GPU: delta of the energy_consumption monotonic counter over the
# profiling window, widened on the end by FINAL_SCRAPE_GRACE_NS so the
# trailing scrape that lands just after requests_end_ns is captured.
grace_ns = Environment.GPU.FINAL_SCRAPE_GRACE_NS  # default 666_000_000 (~666 ms)
total_gpu_energy_j = sum(
    delta(energy_consumption[start_ns : end_ns + grace_ns])
    for gpu in reporting_gpus
)
# Negative deltas are clamped to 0 to handle counter resets (DCGM restart).
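The windowed delta with clamping can be sketched over plain timestamped samples. This is a simplified model, not AIPerf's implementation: it assumes each GPU's samples are `(timestamp_ns, joules)` pairs sorted by time, that the delta is the last in-window sample minus the first, and that fewer than two in-window samples yield zero.

```python
from typing import Dict, List, Tuple

FINAL_SCRAPE_GRACE_NS = 666_000_000  # default quoted in the listing above

Sample = Tuple[int, float]  # (timestamp_ns, cumulative energy in joules)


def energy_delta_j(samples: List[Sample], start_ns: int, end_ns: int,
                   grace_ns: int = FINAL_SCRAPE_GRACE_NS) -> float:
    """Counter delta over [start_ns, end_ns + grace_ns], clamped at zero."""
    window = [value for ts, value in samples if start_ns <= ts <= end_ns + grace_ns]
    if len(window) < 2:
        return 0.0  # assumption: not enough samples to form a delta
    # A negative delta indicates a counter reset, so it is clamped to 0.
    return max(0.0, window[-1] - window[0])


def sum_gpu_energy_j(per_gpu_samples: Dict[str, List[Sample]], start_ns: int, end_ns: int) -> float:
    """Sum the clamped deltas across every GPU that reported samples."""
    return sum(energy_delta_j(s, start_ns, end_ns) for s in per_gpu_samples.values())


per_gpu = {
    # The 2.3 s scrape lands after end_ns but inside the grace window; the 3.0 s scrape does not.
    "gpu0": [(0, 100.0), (1_000_000_000, 150.0), (2_300_000_000, 210.0), (3_000_000_000, 260.0)],
    # Counter reset mid-run: the raw delta is negative and is clamped to 0.
    "gpu1": [(0, 500.0), (1_000_000_000, 20.0)],
}
total_j = sum_gpu_energy_j(per_gpu, start_ns=0, end_ns=2_000_000_000)
```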

# concurrency from the resolved profiling phase config
# (run.cfg.get_profiling_phases()[0].concurrency).
energy_per_user_j = total_gpu_energy_j / concurrency

Flag	Description	Impact
`NONE`	No flags set	Metric has default behavior with no special restrictions
`STREAMING_ONLY`	Only computed for streaming responses	Requires Server-Sent Events (SSE) with multiple response chunks; skipped for non-streaming requests
`ERROR_ONLY`	Only computed for error requests	Tracks error-specific information; computed only for invalid/failed requests
`PRODUCES_TOKENS_ONLY`	Only computed for token-producing endpoints	Requires endpoints that return text/token content; skipped for embeddings and non-generative endpoints
`LARGER_IS_BETTER`	Higher values indicate better performance	Used for throughput and count metrics to indicate optimization direction
`INTERNAL`	Internal AIPerf metric	Used for AIPerf system diagnostics; not displayed in console or exported without developer mode
`SUPPORTS_AUDIO_ONLY`	Only computed for audio endpoints	Requires audio-capable endpoints; skipped for other endpoint types
`SUPPORTS_IMAGE_ONLY`	Only computed for image endpoints	Requires image-capable endpoints; skipped for other endpoint types
`SUPPORTS_REASONING`	Requires reasoning token support	Only available for models and endpoints that expose reasoning content in separate fields
`EXPERIMENTAL`	Experimental/unstable metric	May change or be removed in future releases; not displayed in console or exported without developer mode
`GOODPUT`	Only computed when goodput is enabled	Requires SLO thresholds to be configured (e.g., `--goodput`); skipped otherwise
`NO_INDIVIDUAL_RECORDS`	Not exported for individual records	Aggregate metrics not relevant to individual records (e.g., request count, min/max timestamps); excluded from per-record exports
`TOKENIZES_INPUT_ONLY`	Only computed when endpoint tokenizes input	Requires endpoints that process and tokenize input text; skipped for non-text endpoints
`HTTP_TRACE_ONLY`	Only computed when HTTP trace data is available	Requires HTTP request tracing to be enabled; provides detailed HTTP lifecycle timing metrics
`SUPPORTS_VIDEO_ONLY`	Only computed for video endpoints	Requires video-capable endpoints; skipped for other endpoint types
`USAGE_DIFF_ONLY`	Only computed when usage field data is available	Requires API responses to include usage field with token counts for comparison with client-computed values
`PRODUCES_VIDEO_ONLY`	Only computed for video-producing endpoints	Requires endpoints that produce video output (e.g., SGLang video generation)
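Flags of this kind are commonly modeled as a bit-flag enum so that one metric can carry several restrictions at once. The sketch below uses Python's standard `enum.Flag` with a subset of the names from the table. The `MetricFlags` class and the `is_applicable` helper are illustrative assumptions and are not taken from AIPerf's source.

```python
from enum import Flag, auto


class MetricFlags(Flag):
    """Illustrative subset of the flags in the table above (not AIPerf's actual definition)."""

    NONE = 0
    STREAMING_ONLY = auto()
    PRODUCES_TOKENS_ONLY = auto()
    LARGER_IS_BETTER = auto()
    INTERNAL = auto()
    EXPERIMENTAL = auto()


def is_applicable(metric_flags: MetricFlags, streaming: bool, produces_tokens: bool) -> bool:
    """Return True when the run satisfies every restriction the metric declares."""
    if MetricFlags.STREAMING_ONLY in metric_flags and not streaming:
        return False
    if MetricFlags.PRODUCES_TOKENS_ONLY in metric_flags and not produces_tokens:
        return False
    return True


# A streaming latency metric would plausibly combine two restrictions.
ttft_flags = MetricFlags.STREAMING_ONLY | MetricFlags.PRODUCES_TOKENS_ONLY
```

With this model, `is_applicable(ttft_flags, streaming=False, produces_tokens=True)` evaluates to `False`, matching the "skipped for non-streaming requests" behavior described in the table.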

Group	Description
`MetricConsoleGroup.NONE`	Hidden from console; still exported to JSON/CSV/JSONL. Replaces the legacy `NO_CONSOLE` flag.
`MetricConsoleGroup.DEFAULT`	Standard `LLM Metrics` table. Default for new metrics.
`MetricConsoleGroup.USAGE`	API-reported usage token metrics (prompt/completion/total). Rendered as `LLM Metrics: Usage`.
`MetricConsoleGroup.CACHE`	Cache-related token metrics (e.g. prompt cache hits).
`MetricConsoleGroup.PREDICTION`	Speculative prediction token metrics (accepted/rejected).
`MetricConsoleGroup.AUDIO`	Audio token metrics (prompt/completion).
`MetricConsoleGroup.REASONING`	Reasoning token metrics.

class MyUsageMetric(BaseRecordMetric[int]):
    tag = "my_usage_metric"
    console_group = MetricConsoleGroup.USAGE

Metric Name	OTel Instrument	Unit	Description	`CreditPhaseStats` Field	Requirement
`aiperf.timing.requests.sent`	Counter	`1`	Total requests dispatched in this phase	`requests_sent`	13.2
`aiperf.timing.requests.completed`	Counter	`1`	Requests that received a complete response	`requests_completed`	13.2
`aiperf.timing.requests.cancelled`	Counter	`1`	Requests cancelled before completion	`requests_cancelled`	13.2
`aiperf.timing.requests.errors`	Counter	`1`	Requests that ended in error	`request_errors`	13.2
`aiperf.timing.sessions.sent`	Counter	`1`	Sessions initiated in this phase	`sent_sessions`	13.2
`aiperf.timing.sessions.completed`	Counter	`1`	Sessions that finished all turns	`completed_sessions`	13.2
`aiperf.timing.sessions.cancelled`	Counter	`1`	Sessions cancelled before completion	`cancelled_sessions`	13.2
`aiperf.timing.sessions.turns_total`	Counter	`1`	Cumulative session turns executed	`total_session_turns`	13.2

Metric Name	OTel Instrument	Unit	Description	`CreditPhaseStats` Field	Requirement
`aiperf.timing.requests.in_flight`	UpDownCounter	`1`	Requests currently awaiting a response	`in_flight_requests`	13.2
`aiperf.timing.sessions.in_flight`	UpDownCounter	`1`	Sessions with at least one turn in progress	`in_flight_sessions`	13.2
`aiperf.timing.phase.timeout_triggered`	UpDownCounter	`1`	Whether the phase hard-timeout fired (0 or 1)	`timeout_triggered`	13.2
`aiperf.timing.phase.grace_timeout_triggered`	UpDownCounter	`1`	Whether the grace-period timeout fired (0 or 1)	`grace_period_timeout_triggered`	13.2
`aiperf.timing.phase.was_cancelled`	UpDownCounter	`1`	Whether the phase was user-cancelled (0 or 1)	`was_cancelled`	13.2
`aiperf.timing.phase.elapsed_sec`	UpDownCounter	`s`	Wall-clock seconds elapsed in the phase	`requests_elapsed_time`	13.2

AIPerf Source	GenAI Spec Metric	Unit	Instrument
`request_latency`	`gen_ai.client.operation.duration`	s	Histogram
`time_to_first_token`	`gen_ai.client.operation.time_to_first_chunk`	s	Histogram
`inter_token_latency`	`gen_ai.client.operation.time_per_output_chunk`	s	Histogram
`input_token_count` + `output_token_count` (merged)	`gen_ai.client.token.usage` with `gen_ai.token.type=input|output`	{token}	Histogram

AIPerf `endpoint.type`	`gen_ai.operation.name`
`chat`	`chat`
`completions`	`text_completion`
`embeddings`	`embeddings`
anything else	`chat` (fallback)
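The mapping in this table amounts to a dictionary lookup with a default. A minimal sketch follows; the constant and function names are assumptions, and `rankings` stands in for any endpoint type not listed in the table.

```python
OPERATION_NAME_BY_ENDPOINT_TYPE = {
    "chat": "chat",
    "completions": "text_completion",
    "embeddings": "embeddings",
}


def operation_name(endpoint_type: str) -> str:
    """Map an AIPerf endpoint type to gen_ai.operation.name, falling back to 'chat'."""
    return OPERATION_NAME_BY_ENDPOINT_TYPE.get(endpoint_type, "chat")


names = [operation_name(t) for t in ("chat", "completions", "embeddings", "rankings")]
```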

Table of Contents

Quick Reference
Understanding Metric Types
  Record Metrics
    Example Metrics
    Dependencies
    Example Scenario
  Aggregate Metrics
    Example Metrics
    Dependencies
    Example Scenario
  Derived Metrics
    Example Metrics
    Dependencies
    Example Scenario
Detailed Metric Descriptions
  Streaming Metrics
    Time to First Token (TTFT)
    Time to Second Token (TTST)
    Time to First Output Token (TTFO)
    Inter Token Latency (ITL)
    Inter Chunk Latency (ICL)
    Output Token Throughput Per User
    Prefill Throughput Per User
  Token Based Metrics
    Output Token Count
    Output Sequence Length (OSL)
    Input Sequence Length (ISL)
    Total Output Tokens
    Total Output Sequence Length
    Total Input Sequence Length
    E2E Output Token Throughput
    Output Token Throughput
    Total Token Throughput
  Image Metrics
    Number of Images
    Image Throughput
    Image Latency
  Video Metrics
    Video Inference Time
    Video Peak Memory
  Audio Metrics
    Audio Duration
    Inverse Real-Time Factor (RTFx)
  Reasoning Metrics
    Reasoning Token Count
    Total Reasoning Tokens
  Usage Field Metrics
    Usage Prompt Tokens
    Usage Completion Tokens
    Usage Total Tokens
    Usage Reasoning Tokens
    Usage Prompt Cache Read Tokens
    Usage Prompt Cache Write Tokens
    Usage Prompt Audio Tokens
    Usage Completion Audio Tokens
    Usage Accepted Prediction Tokens
    Usage Rejected Prediction Tokens
    Usage Prompt Cache Miss Tokens
    Usage Tool Use Prompt Tokens
    Usage Prompt Audio Seconds
    Total Usage Prompt Tokens
    Total Usage Completion Tokens
    Total Usage Total Tokens
    Total Usage Reasoning Tokens
    Total Usage Prompt Cache Read Tokens
    Overall Usage Prompt Cache Read %
    Total Usage Prompt Cache Write Tokens
    Total Usage Prompt Audio Tokens
    Total Usage Completion Audio Tokens
    Total Usage Accepted Prediction Tokens
    Total Usage Rejected Prediction Tokens
    Total Usage Prompt Cache Miss Tokens
    Total Usage Tool Use Prompt Tokens
    Total Usage Prompt Audio Seconds
  Usage Discrepancy Metrics
    Usage Prompt Diff %
    Usage Completion Diff %
    Usage Reasoning Diff %
    Usage Discrepancy Count
Timing Namespace (`aiperf.timing.*`)

URL Host Pattern	Provider Value
`api.openai.com`	`openai`
`api.anthropic.com`	`anthropic`
`api.deepseek.com`	`deepseek`
`api.mistral.ai`	`mistral_ai`
`api.cohere.ai` / `api.cohere.com`	`cohere`
`api.x.ai`	`x_ai`
`api.groq.com`	`groq`
`api.perplexity.ai`	`perplexity`
`generativelanguage.googleapis.com`	`gcp.gemini`
`*-aiplatform.googleapis.com`	`gcp.vertex_ai`
`bedrock-runtime.*.amazonaws.com`	`aws.bedrock`
`*.openai.azure.com`	`azure.ai.openai`
`*.services.ai.azure.com`	`azure.ai.inference`
`*.ibm.com` (with Watsonx paths)	`ibm.watsonx.ai`
anything else	`_OTHER`
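Host patterns like these can be matched with shell-style wildcards. The sketch below uses `fnmatch.fnmatchcase` from the standard library against the URL's hostname. It is an illustration under stated assumptions: the names `PROVIDER_PATTERNS` and `provider_for_url` are hypothetical, first match wins, and the IBM row is omitted because it also depends on the URL path, whose exact form the table does not give.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# (host pattern, provider value) pairs copied from the table above, IBM row omitted.
PROVIDER_PATTERNS = [
    ("api.openai.com", "openai"),
    ("api.anthropic.com", "anthropic"),
    ("api.deepseek.com", "deepseek"),
    ("api.mistral.ai", "mistral_ai"),
    ("api.cohere.ai", "cohere"),
    ("api.cohere.com", "cohere"),
    ("api.x.ai", "x_ai"),
    ("api.groq.com", "groq"),
    ("api.perplexity.ai", "perplexity"),
    ("generativelanguage.googleapis.com", "gcp.gemini"),
    ("*-aiplatform.googleapis.com", "gcp.vertex_ai"),
    ("bedrock-runtime.*.amazonaws.com", "aws.bedrock"),
    ("*.openai.azure.com", "azure.ai.openai"),
    ("*.services.ai.azure.com", "azure.ai.inference"),
]


def provider_for_url(url: str) -> str:
    """Return the provider value for a URL, or '_OTHER' when no pattern matches."""
    host = (urlsplit(url).hostname or "").lower()
    for pattern, provider in PROVIDER_PATTERNS:
        if fnmatchcase(host, pattern):
            return provider
    return "_OTHER"
```

For example, `provider_for_url("http://localhost:8000/v1/chat/completions")` falls through every pattern and yields `_OTHER`, which is the expected result for a self-hosted server.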

AIPerf Condition	`error.type` Value
asyncio/HTTP timeout	`timeout`
HTTP 5xx response	`http_5xx`
HTTP 4xx response	`http_4xx`
JSON parse error	`parse_error`
User-initiated cancel	`cancelled`
anything else	`_OTHER`
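One way to read this table is as an ordered classification over the exception raised and the HTTP status received. The sketch below is an assumption-laden illustration: the `error_type` function, its parameters, the specific exception classes standing in for each condition, and the order of the checks are not taken from AIPerf's source.

```python
import asyncio
import json
from typing import Optional


def error_type(exc: Optional[BaseException] = None, status: Optional[int] = None) -> str:
    """Classify a failed request into one of the error.type values in the table above."""
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "timeout"
    if status is not None and 500 <= status <= 599:
        return "http_5xx"
    if status is not None and 400 <= status <= 499:
        return "http_4xx"
    if isinstance(exc, json.JSONDecodeError):
        return "parse_error"
    if isinstance(exc, asyncio.CancelledError):
        return "cancelled"
    return "_OTHER"


classified = [
    error_type(exc=asyncio.TimeoutError()),
    error_type(status=503),
    error_type(status=429),
    error_type(exc=json.JSONDecodeError("Expecting value", "", 0)),
    error_type(exc=asyncio.CancelledError()),
    error_type(exc=RuntimeError("connection reset")),
]
```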