This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated.
The sections below provide detailed descriptions, requirements, and notes for each metric.
AIPerf computes metrics in three distinct phases during benchmark execution: Record Metrics, Aggregate Metrics, and Derived Metrics.
The metric type also determines which stat fields appear in profile_export_aiperf.json per metric; see JSON Export Schema for the per-field presence rules and version history.
Record Metrics are computed individually for each request and its response(s) during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce statistical distributions (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests.
request_latency, time_to_first_token, inter_token_latency, output_token_count, input_sequence_length
Record Metrics can depend on raw request/response data and other Record Metrics from the same request.
request_latency measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests.
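As a rough illustration of how per-request values become a distribution, the sketch below summarizes a list of latencies. The `summarize` helper and its nearest-rank percentile rule are assumptions made for this example; AIPerf may compute percentiles differently.

```python
import math
import statistics


def summarize(values: list[float]) -> dict[str, float]:
    """Collapse per-request values into summary statistics (hypothetical helper)."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    n = len(ordered)

    def percentile(p: int) -> float:
        # Nearest-rank percentile: an assumption, not necessarily AIPerf's method.
        rank = max(1, math.ceil(p * n / 100))
        return ordered[rank - 1]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(90),
        "p99": percentile(99),
    }


# 100 requests produce 100 latency values (milliseconds here, for readability).
latencies_ms = [float(i) for i in range(1, 101)]
stats = summarize(latencies_ms)
```

Each key in `stats` corresponds to one stat field of a Record Metric; an Aggregate Metric would instead report a single number.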
Aggregate Metrics are computed by tracking or accumulating values across all requests in real-time during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a single value representing the entire benchmark run.
request_count, error_request_count, min_request_timestamp, max_response_timestamp
Aggregate Metrics can depend on raw request/response data, Record Metrics and other Aggregate Metrics.
request_count increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution).
Derived Metrics are computed by applying mathematical formulas to other metric results, but are not computed per-record like Record Metrics. Instead, these metrics depend on one or more prerequisite metrics being available first and are calculated either after the benchmark completes for final results or in real-time across all current data for live metrics display. Derived metrics can produce either single values or distributions depending on their dependencies.
request_throughput, output_token_throughput, benchmark_duration
Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics, but do not have any knowledge of the individual request/response data.
request_throughput is computed from request_count / benchmark_duration_seconds. This requires both request_count and benchmark_duration to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec).
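The formula above can be sketched as follows. The function name and the choice of nanoseconds as the input unit are assumptions for illustration.

```python
def request_throughput(request_count: int, benchmark_duration_ns: int) -> float:
    """Requests per second from two prerequisite metric values (illustrative sketch)."""
    # Assumption: the duration arrives in nanoseconds and is converted to seconds.
    duration_s = benchmark_duration_ns / 1e9
    if duration_s <= 0:
        raise ValueError("benchmark duration must be positive")
    return request_count / duration_s


# 105 successful requests over a 10-second run.
throughput = request_throughput(request_count=105, benchmark_duration_ns=10_000_000_000)
```

Because both inputs are already single values, the result is a single value as well, not a distribution.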
All metrics in this section require the --streaming flag with a token-producing endpoint and at least one non-empty response chunk.
Type: Record Metric
Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output.
Formula:
Notes:
Type: Record Metric
Measures the time gap between the first and second chunk of tokens. This metric helps identify generation startup overhead separate from steady-state streaming throughput.
Formula:
Notes:
Type: Record Metric
Calculates the time elapsed from request start to the first non-reasoning output token. This metric measures the latency from when a request is initiated to when the first actual output token (non-reasoning content) is received. It is particularly relevant for models that perform extended reasoning before generating output.
Formula:
Notes:
Type: Record Metric
Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate.
Formula:
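A minimal sketch of this calculation, assuming the commonly used form (request latency minus TTFT, divided by the number of tokens after the first). The function name, the nanosecond unit, and the `None` return for short outputs are assumptions for illustration.

```python
from typing import Optional


def inter_token_latency_ns(
    request_latency_ns: int,
    time_to_first_token_ns: int,
    output_sequence_length: int,
) -> Optional[float]:
    """Average gap between consecutive tokens after the first one (sketch)."""
    # With fewer than two tokens there is no inter-token gap to average.
    # Returning None here is an assumption, not documented AIPerf behavior.
    if output_sequence_length < 2:
        return None
    generation_time_ns = request_latency_ns - time_to_first_token_ns
    return generation_time_ns / (output_sequence_length - 1)


# 1 s total latency, 200 ms TTFT, 81 output tokens.
itl_ns = inter_token_latency_ns(1_000_000_000, 200_000_000, 81)
```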
Notes:
Requires the time_to_first_token, request_latency, and output_sequence_length metrics.
Type: Record Metric
Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size.
Formula:
Notes:
Type: Record Metric
This metric is computed per-request, and it excludes the TTFT from the equation, so it is not directly comparable to the Output Token Throughput metric.
The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance.
Formula:
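Since the description defines this metric as the inverse of inter-token latency, a sketch could look like the following. The function name and the nanosecond input unit are assumptions.

```python
def output_token_throughput_per_user(inter_token_latency_ns: float) -> float:
    """Tokens per second seen by one request, as the inverse of ITL (sketch)."""
    if inter_token_latency_ns <= 0:
        raise ValueError("inter-token latency must be positive")
    # Assumption: ITL is given in nanoseconds, so 1e9 ns make up one second.
    return 1e9 / inter_token_latency_ns


# An ITL of 10 ms corresponds to 100 tokens per second for that request.
per_user_tps = output_token_throughput_per_user(10_000_000)
```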
Notes:
Type: Record Metric
Measures the rate at which input tokens are processed during the prefill phase, calculated as input tokens per second based on TTFT. This is only applicable to streaming responses.
Formula:
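Read literally, the description divides the input token count by TTFT expressed in seconds. The sketch below follows that reading; the function name and the nanosecond input unit are assumptions.

```python
def prefill_throughput(input_sequence_length: int, time_to_first_token_ns: int) -> float:
    """Input tokens processed per second during prefill (illustrative sketch)."""
    if time_to_first_token_ns <= 0:
        raise ValueError("TTFT must be positive")
    # Assumption: TTFT arrives in nanoseconds and is converted to seconds.
    ttft_s = time_to_first_token_ns / 1e9
    return input_sequence_length / ttft_s


# 1000 prompt tokens with a 250 ms TTFT.
prefill_tps = prefill_throughput(1000, 250_000_000)
```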
Notes:
All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints.
Type: Record Metric
The number of output tokens generated for a single request, excluding reasoning tokens. This represents the output tokens returned to the user across all responses for the request.
Formula:
Notes:
Tokenized with add_special_tokens=False to count only content tokens, excluding special tokens added by the tokenizer.
When the API returns a separate reasoning_content field, this metric counts only non-reasoning output tokens.
When reasoning appears inside the regular content (e.g., <think> blocks), those tokens will be counted unless explicitly filtered.
Type: Record Metric
The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request.
Formula:
Notes:
Type: Record Metric
The number of input/prompt tokens for a single request. This represents the size of the input sent to the model.
Formula:
Notes:
Tokenized with add_special_tokens=False to count only content tokens, excluding special tokens added by the tokenizer.
Type: Derived Metric
The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total output token workload.
Formula:
Notes:
Type: Derived Metric
The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload.
Formula:
Notes:
Type: Derived Metric
The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model.
Formula:
Notes:
Type: Record Metric
Per-request output token throughput based on end-to-end request latency. Unlike Output Token Throughput Per User (which uses 1/ITL and excludes TTFT), this metric includes TTFT, queuing, and all other overhead in the denominator. Available for both streaming and non-streaming responses.
Formula:
Notes:
Flags: PRODUCES_TOKENS_ONLY | LARGER_IS_BETTER
Type: Derived Metric
This metric is computed as a single value across all requests and includes TTFT in the equation, so it is not directly comparable to the Output Token Throughput Per User metric.
The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system’s overall token generation capacity.
Formula:
Notes:
Type: Derived Metric
Calculates the total token throughput metric, combining both input and output token processing across all concurrent requests.
Formula:
Notes:
All metrics in this section require image-capable endpoints (e.g., image generation APIs). These metrics are not available for text-only or other non-image endpoints.
Type: Record Metric
The number of images in the request, summed across all turns. This is the foundation metric used by Image Throughput and Image Latency.
Formula:
Notes:
Hidden from the console (console_group = MetricConsoleGroup.NONE).
Type: Record Metric
Calculates the image throughput from the record by dividing the number of images by the request latency.
Formula:
Notes:
Type: Record Metric
Calculates the image latency from the record by dividing the request latency by the number of images.
Formula:
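Image Throughput and Image Latency are reciprocal views of the same two inputs. The sketch below shows both; the function names and the use of seconds as the latency unit are assumptions.

```python
def image_throughput(num_images: int, request_latency_s: float) -> float:
    """Images per second for one request (illustrative sketch)."""
    if request_latency_s <= 0:
        raise ValueError("request latency must be positive")
    return num_images / request_latency_s


def image_latency(request_latency_s: float, num_images: int) -> float:
    """Seconds per image for one request (illustrative sketch)."""
    if num_images <= 0:
        raise ValueError("at least one image is required")
    return request_latency_s / num_images


# A request carrying 4 images that took 2 seconds end to end.
throughput = image_throughput(4, 2.0)
latency = image_latency(2.0, 4)
```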
Notes:
All metrics in this section require video-producing endpoints (e.g., SGLang video generation). These metrics rely on server-reported fields in the response and are not available for non-video endpoints.
Type: Record Metric
Server-reported GPU generation time for video inference, extracted from the inference_time_s field in video generation responses (e.g., SGLang).
Formula:
Notes:
Type: Record Metric
Server-reported peak GPU memory usage during video generation, extracted from the peak_memory_mb field in video generation responses.
Formula:
Notes:
Metrics in this section require an audio input on the request (e.g., ASR datasets such as LibriSpeech, GigaSpeech, AMI, VoxPopuli). They are not computed for text-only or non-audio requests.
Type: Record Metric
Per-request input audio duration in seconds. Hidden from the console summary; available in JSON / CSV record exports for characterizing dataset shape and verifying RTFx calculations.
Notes:
audio_duration_seconds (e.g., ASR datasets such as LibriSpeech).
Type: Record Metric
The ratio of input audio duration to request latency. The standard ASR throughput metric, used by the HuggingFace Open ASR Leaderboard, NVIDIA Riva, and NVIDIA NeMo.
Formula:
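The ratio described above can be sketched as follows, with both inputs in seconds. The function name is an assumption.

```python
def rtfx(audio_duration_s: float, request_latency_s: float) -> float:
    """Ratio of input audio duration to request latency (illustrative sketch)."""
    if request_latency_s <= 0:
        raise ValueError("request latency must be positive")
    return audio_duration_s / request_latency_s


# A 30-second clip transcribed in 1.5 seconds.
value = rtfx(30.0, 1.5)
```

A value above 1 means the audio was processed faster than real time.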
Notes:
Requires the audio_duration and request_latency metrics to be computed first.
All metrics in this section require models and backends that expose reasoning content in a separate reasoning_content field, distinct from the regular content field.
Type: Record Metric
The number of reasoning tokens generated for a single request. These are tokens used for “thinking” or chain-of-thought reasoning before generating the final output.
Formula:
Notes:
Tokenized with add_special_tokens=False to count only content tokens, excluding special tokens added by the tokenizer.
AIPerf does not parse <think> tags or extract reasoning from within the regular content field.
Type: Derived Metric
The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload.
Formula:
Notes:
All metrics in this section track API-reported token counts from the usage field in API responses. These are not displayed in console output but are available in exports. These metrics are useful for comparing client-side token counts with server-reported counts to detect discrepancies.
Type: Record Metric
The number of input/prompt tokens as reported by the API’s usage.prompt_tokens field for a single request.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The number of completion tokens as reported by the API’s usage.completion_tokens field for a single request.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The total number of tokens (prompt + completion) as reported by the API’s usage.total_tokens field for a single request.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Corresponds to usage_prompt_tokens + usage_completion_tokens.
Type: Record Metric
The number of reasoning tokens as reported by the API’s usage.completion_tokens_details.reasoning_tokens field for a single request. Only available for reasoning-enabled models.
Formula:
Notes:
Type: Record Metric
The number of prompt tokens that were served from cache (cache hits) as reported by the API’s usage field for a single request.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
OpenAI: prompt_tokens_details.cached_tokens (or input_tokens_details.cached_tokens); writes are transparent and not reported.
Anthropic: cache_read_input_tokens; writes are reported separately as Usage Prompt Cache Write Tokens.
Type: Record Metric
The number of prompt tokens written to cache (cache creations) as reported by the API’s usage.cache_creation_input_tokens field for a single request. Anthropic-specific.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The number of audio tokens from the prompt as reported by the API’s usage.prompt_tokens_details.audio_tokens field for a single request.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The number of audio tokens in the completion as reported by the API’s usage.completion_tokens_details.audio_tokens field for a single request.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The number of accepted prediction tokens as reported by the API’s usage.completion_tokens_details.accepted_prediction_tokens field for a single request. These are tokens from a predicted completion that the model actually used.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The number of rejected prediction tokens as reported by the API’s usage.completion_tokens_details.rejected_prediction_tokens field for a single request. These are tokens from a predicted completion that the model did not use.
Formula:
Notes:
Taken from the API’s usage object, not computed by AIPerf.
Type: Record Metric
The number of prompt tokens that missed cache (and required fresh processing) as reported by the API’s usage.prompt_cache_miss_tokens field for a single request. DeepSeek-specific.
Formula:
Notes:
For other vendors the value can be derived as prompt_tokens - prompt_cache_read_tokens, but it’s not its own first-class field.
Type: Record Metric
The number of prompt tokens consumed by tool / function-call declarations sent in the request, separate from user-content prompt tokens. Gemini-specific.
Formula:
Notes:
Other vendors fold tool declarations into the prompt_tokens count, so this metric will raise NoMetricValue for OpenAI / Anthropic / etc.
Type: Record Metric
The audio duration of the input prompt in seconds (not tokens) as reported by the API’s usage.prompt_audio_seconds field for a single request. Mistral-specific.
Formula:
Notes:
Stored as float (so 12.5s is preserved exactly even when the API reports an integer).
Type: Derived Metric
The sum of all API-reported prompt tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported completion tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported total tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported reasoning tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported prompt cache-read tokens across all requests.
Formula:
Notes:
prompt_tokens_details.cached_tokens or Anthropic top-level cache_read_input_tokens).
Type: Derived Metric
Run-aggregate share of input tokens served from prompt cache, weighted by token volume. Computed from the run totals so a request with 10k prompt tokens contributes 100x as much weight as a request with 100 prompt tokens — the resulting number reflects the actual fraction of input tokens the API served from cache across the whole benchmark.
Formula:
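To show why the ratio is computed from run totals rather than by averaging per-request ratios, the sketch below compares the two. The function names, the `(prompt_tokens, cached_tokens)` tuple shape, and the `None` return for a zero denominator are assumptions; AIPerf's handling of the zero case is not stated in the text available here.

```python
from typing import Optional

# Each entry is (prompt_tokens, cached_tokens) for one request (hypothetical shape).
Usage = tuple[int, int]


def token_weighted_cache_hit_ratio(requests: list[Usage]) -> Optional[float]:
    """Share of all prompt tokens served from cache, computed from run totals."""
    total_prompt = sum(prompt for prompt, _ in requests)
    total_cached = sum(cached for _, cached in requests)
    if total_prompt == 0:
        # Placeholder behavior for illustration only.
        return None
    return total_cached / total_prompt


def mean_of_request_ratios(requests: list[Usage]) -> float:
    """Unweighted average of per-request ratios, shown only for contrast."""
    ratios = [cached / prompt for prompt, cached in requests if prompt > 0]
    return sum(ratios) / len(ratios)


# One large, mostly cached request and one small, uncached request.
run = [(10_000, 9_000), (100, 0)]
weighted = token_weighted_cache_hit_ratio(run)   # 9000 / 10100
unweighted = mean_of_request_ratios(run)         # (0.9 + 0.0) / 2
```

The weighted figure stays close to 0.89 because the large request dominates the token volume, while the unweighted mean drops to 0.45.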
Notes:
total_usage_prompt_tokens is zero (e.g. all requests errored before reporting usage).
Type: Derived Metric
The sum of all API-reported prompt cache-write (cache creation) tokens across all requests. Anthropic-specific.
Formula:
Notes:
cache_creation_input_tokens). Empty for OpenAI workloads.
Type: Derived Metric
The sum of all API-reported prompt audio tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported completion audio tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported accepted prediction tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported rejected prediction tokens across all requests.
Formula:
Notes:
Type: Derived Metric
The sum of all API-reported prompt cache-miss tokens across all requests. DeepSeek-specific.
Formula:
Notes:
Sums prompt_cache_miss_tokens across all requests. Empty for vendors that don’t surface a separate miss field.
Type: Derived Metric
The sum of all API-reported tool-use prompt tokens across all requests. Gemini-specific.
Formula:
Notes:
Sums toolUsePromptTokenCount across all requests. Useful for understanding what fraction of total prompt tokens were spent on tool/function declarations in tool-heavy agentic workloads.
Type: Derived Metric
The sum of all API-reported prompt audio durations across all requests, in seconds (not tokens). Mistral-specific.
Formula:
Notes:
Sums prompt_audio_seconds. Unit is seconds; do not confuse with Total Usage Prompt Audio Tokens.
These metrics measure the percentage difference between API-reported token counts (usage fields) and client-computed token counts. They are not displayed in console output but help identify tokenizer mismatches or counting discrepancies.
Type: Record Metric
The percentage difference between API-reported prompt tokens and client-computed Input Sequence Length.
Formula:
Notes:
Type: Record Metric
The percentage difference between API-reported completion tokens and client-computed Output Sequence Length.
Formula:
Notes:
Type: Record Metric
The percentage difference between API-reported reasoning tokens and client-computed Reasoning Token Count.
Formula:
Notes:
Type: Aggregate Metric
The number of requests where token count differences exceed a threshold (default 10%).
Formula:
Notes:
These metrics measure the difference between requested output sequence length (--osl/max_tokens) and actual output tokens generated. They help identify when the server is not honoring the requested output length, typically because EOS tokens stop generation early. These metrics are not displayed in console output but are available in exports and used by the end-of-benchmark warning.
Type: Record Metric
The signed percentage difference between actual output sequence length and requested OSL. Negative values mean the server stopped early (actual < requested), positive values mean it generated more than requested.
Formula:
Notes:
Type: Aggregate Metric
The count of requests where the absolute token difference exceeds the effective threshold. Used to trigger the end-of-benchmark warning panel.
Formula:
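The notes for this metric describe an effective threshold that is the smaller of a percentage of the requested OSL and a fixed token cap. A sketch of that rule follows; the function names are assumptions, and the 5% and 50-token values are taken from the worked example in the notes rather than from confirmed defaults.

```python
def effective_threshold(requested_osl: int, pct_threshold: float, max_token_threshold: int) -> float:
    """Allowed absolute token difference: min of the percentage and the fixed cap."""
    return min(requested_osl * pct_threshold / 100, max_token_threshold)


def is_osl_mismatch(
    requested_osl: int,
    actual_osl: int,
    pct_threshold: float = 5.0,      # value from the notes' example, not a confirmed default
    max_token_threshold: int = 50,   # value from the notes' example, not a confirmed default
) -> bool:
    """True when the absolute token difference exceeds the effective threshold."""
    limit = effective_threshold(requested_osl, pct_threshold, max_token_threshold)
    return abs(actual_osl - requested_osl) > limit


# Requesting 2000 tokens: 5% would allow 100, but the cap tightens it to 50.
large_limit = effective_threshold(2000, 5.0, 50)
# Requesting 200 tokens: 5% is 10 tokens, which is below the cap.
small_limit = effective_threshold(200, 5.0, 50)
```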
Notes:
Percentage threshold: set via AIPERF_METRICS_OSL_MISMATCH_PCT_THRESHOLD.
Maximum token threshold: set via AIPERF_METRICS_OSL_MISMATCH_MAX_TOKEN_THRESHOLD.
Taking the min() makes the threshold tighter for large OSL: requesting 2000 tokens caps the allowed difference at 50 tokens instead of 100 (5%).
To make the server honor --osl, use --extra-inputs ignore_eos:true or --extra-inputs min_tokens:<value>.
--use-server-token-count.
Server support for min_tokens:
Goodput metrics measure the throughput of requests that meet user-defined Service Level Objectives (SLOs). See the Goodput tutorial for configuration details.
Type: Aggregate Metric
The number of requests that meet all user-defined SLO thresholds during the benchmark.
Formula:
Notes:
Requires SLOs to be configured (--goodput).
Type: Derived Metric
Tag: good_request_fraction
The fraction of all attempted requests that satisfied every per-request SLO. Returns a ratio in [0.0, 1.0]. Errored requests count toward the denominator so a backend that drops traffic under load cannot look “good” simply because the surviving requests stayed under the latency budget.
Formula:
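Based on the description and the required upstream metrics listed for this tag, the calculation can be sketched as below. The function name is an assumption; the zero-denominator result follows the notes for this metric.

```python
def good_request_fraction(
    good_request_count: int,
    request_count: int,
    error_request_count: int = 0,
) -> float:
    """Fraction of attempted requests that met every SLO (illustrative sketch)."""
    # Errored requests count toward the denominator; error_request_count
    # defaults to 0 because it is absent on clean runs.
    attempted = request_count + error_request_count
    if attempted == 0:
        return 0.0
    return good_request_count / attempted


# 90 successful requests, 10 errors, 80 of the successes within SLO.
fraction = good_request_fraction(80, 90, 10)
```

Dividing by `request_count` alone would report about 0.89 here, which is the inflation the description warns about.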
Flags: GOODPUT | LARGER_IS_BETTER | NO_CONSOLE
Unit: RATIO (0.0–1.0)
Required upstream metrics: good_request_count, request_count. error_request_count is included in the denominator when present (it is ERROR_ONLY and absent on clean runs).
Notes:
Requires SLOs to be configured (--goodput); without SLOs, good_request_count is always 0 and this metric is 0.0.
Returns 0.0 when no requests were attempted (request_count + error_request_count == 0).
Hidden from the console (NO_CONSOLE); appears in JSON, CSV, and Parquet exports.
Required by the max-goodput-under-slo search recipe (good_request_fraction:avg:ge:<attainment>); without it, the recipe filter dereferences a missing tag and Bayesian optimization treats every iteration as infeasible.
Type: Derived Metric
The rate of SLO-compliant requests per second. This represents the effective throughput of requests meeting quality requirements.
Formula:
Notes:
These metrics are computed only for failed/error requests and are not displayed in console output.
Type: Record Metric
The number of input tokens for requests that resulted in errors. This helps analyze whether input size correlates with errors.
Formula:
Notes:
Type: Derived Metric
The sum of all input tokens from requests that resulted in errors.
Formula:
Notes:
Type: Record Metric
Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request.
Formula:
Notes:
Type: Derived Metric
The overall rate of completed requests per second across the entire benchmark. This represents the system’s ability to process requests under the given concurrency and load.
Formula:
Notes:
Type: Aggregate Metric
The total number of successfully completed requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode.
Formula:
Type: Aggregate Metric
The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures.
Formula:
Notes:
The error rate can be computed as error_request_count / (request_count + error_request_count).
Type: Aggregate Metric
The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run.
Formula:
Type: Aggregate Metric
The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run.
Formula:
Type: Derived Metric
The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run.
Formula:
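The duration follows from the two aggregate timestamps described above. In this sketch the `(request_start_ns, last_response_ns)` tuple shape and the function name are assumptions.

```python
def benchmark_duration_ns(records: list[tuple[int, int]]) -> int:
    """Elapsed time from the first request sent to the last response received."""
    if not records:
        raise ValueError("at least one record is required")
    # Aggregate step: track the global minimum and maximum across all records.
    min_request_timestamp = min(start for start, _ in records)
    max_response_timestamp = max(end for _, end in records)
    # Derived step: subtract one aggregate value from the other.
    return max_response_timestamp - min_request_timestamp


# Three overlapping requests as (request_start_ns, last_response_ns).
duration = benchmark_duration_ns([(100, 400), (50, 300), (200, 900)])
```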
Notes:
All metrics in this section require HTTP trace data to be collected during requests. These metrics provide detailed HTTP request lifecycle timing following k6 naming conventions. See the HTTP Trace Metrics tutorial for configuration details.
Type: Record Metric
Time spent blocked waiting for a free TCP connection slot from the pool. This metric measures the time a request spent waiting in the connection pool queue before a connection became available. High values indicate connection pool saturation.
Formula:
Notes:
k6 equivalent: http_req_blocked
HAR equivalent: blocked
Type: Record Metric
Time spent on DNS resolution. This metric measures the time spent resolving the hostname to an IP address.
Formula:
Notes:
k6 equivalent: http_req_looking_up
HAR equivalent: dns
Type: Record Metric
Time spent establishing TCP connection to the remote host. For HTTPS requests, this includes both TCP connection establishment and TLS handshake time (combined measurement from aiohttp).
Formula:
Notes:
k6 equivalent: http_req_connecting
HAR equivalent: connect
Type: Record Metric
Time spent sending data to the remote host. This metric measures the time from when the request started being sent to when the full request (headers + body) was transmitted.
Formula:
Notes:
k6 equivalent: http_req_sending
HAR equivalent: send
Type: Record Metric
Time to First Byte (TTFB) - time waiting for the server to respond. This metric measures the time from when the request was fully sent to when the first byte of the response body was received. This represents server processing time plus network latency.
Formula:
Notes:
k6 equivalent: http_req_waiting (also known as TTFB)
HAR equivalent: wait
Type: Record Metric
Time spent receiving response data from the remote host. This metric measures the time from when the first byte of the response was received to when the last byte was received.
Formula:
Notes:
k6 equivalent: http_req_receiving
HAR equivalent: receive
Type: Record Metric
Time for HTTP request/response exchange, excluding connection overhead. This measures only the request/response exchange time: sending + waiting + receiving.
Formula:
Notes:
k6 equivalent: http_req_duration
HAR equivalent: time
For the sum that includes connection overhead, see http_req_total.
Type: Record Metric
Total connection overhead time (blocked + dns_lookup + connecting). This metric combines all pre-request overhead.
Formula:
Notes:
Type: Record Metric
Sum of all HTTP timing phases from connection pool to last chunk received. This is the sum of all 6 timing components: blocked + dns_lookup + connecting + sending + waiting + receiving.
Formula:
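The three composite HTTP timings described in this section can be sketched together. The `HttpTimings` container and its property names are assumptions; only the component sums come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpTimings:
    """Six per-request timing phases in one shared unit (hypothetical container)."""

    blocked: float
    dns_lookup: float
    connecting: float
    sending: float
    waiting: float
    receiving: float

    @property
    def connection_overhead(self) -> float:
        # Pre-request overhead: blocked + dns_lookup + connecting.
        return self.blocked + self.dns_lookup + self.connecting

    @property
    def duration(self) -> float:
        # Request/response exchange only: sending + waiting + receiving.
        return self.sending + self.waiting + self.receiving

    @property
    def total(self) -> float:
        # Sum of all six phases.
        return self.connection_overhead + self.duration


timings = HttpTimings(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
```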
Notes:
Type: Record Metric
Total bytes sent in the HTTP request (headers + body).
Formula:
Notes:
k6 equivalent: data_sent (per request)
Type: Record Metric
Total bytes received in the HTTP response (headers + body).
Formula:
Notes:
k6 equivalent: data_received (per request)
Type: Record Metric
Whether the HTTP connection was reused from the connection pool. Returns 1 if reused, 0 if new connection was established.
Formula:
Notes:
Type: Record Metric
Number of transport-level write operations during the request. Useful for debugging chunked transfers.
Formula:
Notes:
Hidden from the console (console_group = MetricConsoleGroup.NONE).
Type: Record Metric
Number of transport-level read operations during the response. Useful for debugging chunked/streaming responses.
Formula:
Notes:
Hidden from the console (console_group = MetricConsoleGroup.NONE).
--num-profile-runs > 1 for confidence reporting.
When running multiple profile iterations with --num-profile-runs, AIPerf computes aggregate statistics across all runs to quantify measurement variance and repeatability. These statistics are written to aggregate/profile_export_aiperf_aggregate.json and aggregate/profile_export_aiperf_aggregate.csv.
For detailed information about aggregate statistics, their mathematical definitions, and interpretation guidelines, see the Multi-Run Confidence Tutorial.
The following aggregate statistics are computed for each metric:
The aggregate output also includes metadata about the multi-run benchmark:
Metric flags are used to control when and how metrics are computed, displayed, and grouped. Flags can be combined using bitwise operations to create composite behaviors.
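As an illustration of combining flags with bitwise operations, the sketch below uses Python's standard `enum.Flag`. The `SketchMetricFlags` class is a hypothetical stand-in, not AIPerf's actual flag type; its member names are borrowed from flags mentioned on this page, and `has_flags` is an assumed helper.

```python
from enum import Flag, auto


class SketchMetricFlags(Flag):
    """Hypothetical stand-in; member names mirror flags mentioned on this page."""

    ERROR_ONLY = auto()
    NO_CONSOLE = auto()
    GOODPUT = auto()
    LARGER_IS_BETTER = auto()
    PRODUCES_TOKENS_ONLY = auto()


def has_flags(flags: SketchMetricFlags, required: SketchMetricFlags) -> bool:
    """True when every bit in `required` is also set in `flags`."""
    return (flags & required) == required


# Bitwise OR builds a composite, like the Flags line shown for good_request_fraction.
goodput_flags = (
    SketchMetricFlags.GOODPUT
    | SketchMetricFlags.LARGER_IS_BETTER
    | SketchMetricFlags.NO_CONSOLE
)
hidden_from_console = has_flags(goodput_flags, SketchMetricFlags.NO_CONSOLE)
error_only = has_flags(goodput_flags, SketchMetricFlags.ERROR_ONLY)
```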
These flags are combinations of multiple individual flags for convenience:
The console_group class attribute on a metric controls which console table the metric appears in (or hides it entirely). It is independent of MetricFlags — flags filter by axis (ERROR_ONLY, INTERNAL, EXPERIMENTAL); console_group selects a display bucket.
Set as a class attribute on a BaseMetric subclass:
aiperf.timing.*)
The TimingResultsStrategy emits phase-level timing snapshots as OTel counters and up-down-counters under the aiperf.timing.* namespace. These metrics track credit-phase progression in real time and are sourced from CreditPhaseStats fields.
Notes:
These metrics carry the same Required attributes (gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model) so they can be joined with spec-named request metrics in dashboards.
AIPerf translates its internal metric names onto the OTel GenAI semantic conventions so that downstream dashboards and alerting can consume spec-standard metric names directly.
Duration metrics are converted from nanoseconds to seconds. Token counts use the identity conversion.
gen_ai.operation.name Mapping
Derived from the AIPerf endpoint.type configuration value:
gen_ai.provider.name Host Auto-Inference
The provider attribute is resolved using the following precedence:
--gen-ai-provider CLI override (highest priority)
_OTHER fallback
error.type Classification
Error conditions on individual requests are classified into spec-standard error.type attribute values:
The error.type attribute is only attached when an error is present; successful requests omit it entirely.
The aiperf.timing.* metrics retain AIPerf-specific names because the GenAI semantic convention specification has no equivalent phase-level timing metrics. However, these metrics receive the same Required attributes (gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model) as the spec-named request metrics so that downstream systems can join across both namespaces for correlation and alerting.
AIPerf is a client-side benchmarking tool and does not emit any server-side metrics:
No gen_ai.server.* metrics are produced.
AIPerf also does not emit any opt-in GenAI events:
gen_ai.input.messages
gen_ai.output.messages
gen_ai.system_instructions
gen_ai.tool.definitions
These events are excluded because AIPerf’s purpose is performance measurement, not request/response content logging.