AIPerf Metrics Reference
This document provides a comprehensive reference of all metrics available in AIPerf for benchmarking LLM inference performance. Metrics are organized by computation type to help you understand when and how each metric is calculated.
Table of Contents
- Quick Reference
- Understanding Metric Types
- Detailed Metric Descriptions
- Metric Flags Reference
Quick Reference
The sections below provide detailed descriptions, requirements, and notes for each metric.
Understanding Metric Types
AIPerf computes metrics in three distinct phases during benchmark execution: Record Metrics, Aggregate Metrics, and Derived Metrics.
Record Metrics
Record Metrics are computed individually for each request and its response(s) during the benchmark run. A single request may have one response (non-streaming) or multiple responses (streaming). These metrics capture per-request characteristics such as latency, token counts, and streaming behavior. Record metrics produce statistical distributions (min, max, mean, median, p90, p99, etc.) that reveal performance variability across requests.
Example Metrics
request_latency, time_to_first_token, inter_token_latency, output_token_count, input_sequence_length
Dependencies
Record Metrics can depend on raw request/response data and other Record Metrics from the same request.
Example Scenario
request_latency measures the time for each individual request from start to final response. If you send 100 requests, you get 100 latency values that form a distribution showing how latency varies across requests.
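The distribution-building step can be sketched as follows; `summarize_latencies` is a hypothetical helper for illustration, not AIPerf's actual implementation:

```python
import statistics

def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize a per-request latency sample the way a Record Metric is reported."""
    ordered = sorted(latencies_ms)

    def percentile(p: float) -> float:
        # Nearest-rank style percentile over the sorted sample.
        idx = round(p / 100 * (len(ordered) - 1))
        return ordered[idx]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(90),
        "p99": percentile(99),
    }
```

For 100 requests with latencies 1–100 ms, this yields a mean and median of 50.5 ms and a p90 of 90 ms, exposing the spread that a single average would hide.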
Aggregate Metrics
Aggregate Metrics are computed by tracking or accumulating values across all requests in real-time during the benchmark. These include counters, min/max timestamps, and other global statistics. Aggregate metrics produce a single value representing the entire benchmark run.
Example Metrics
request_count, error_request_count, min_request_timestamp, max_response_timestamp
Dependencies
Aggregate Metrics can depend on raw request/response data, Record Metrics and other Aggregate Metrics.
Example Scenario
request_count increments by 1 for each successful request. At the end of a benchmark with 100 successful requests, this metric equals 100 (a single value, not a distribution).
Derived Metrics
Derived Metrics are computed by applying mathematical formulas to other metric results rather than to individual records. They depend on one or more prerequisite metrics being available first, and are calculated either after the benchmark completes (for final results) or in real time across all current data (for live metrics display). Derived metrics can produce either single values or distributions, depending on their dependencies.
Example Metrics
request_throughput, output_token_throughput, benchmark_duration
Dependencies
Derived Metrics can depend on Record Metrics, Aggregate Metrics, and other Derived Metrics, but do not have any knowledge of the individual request/response data.
Example Scenario
request_throughput is computed from request_count / benchmark_duration_seconds. This requires both request_count and benchmark_duration to be available first, then applies a formula to produce a single throughput value (e.g., 10.5 requests/sec).
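The scenario above reduces to a one-line formula; the function name here is illustrative, not part of AIPerf:

```python
def request_throughput(request_count: int, benchmark_duration_s: float) -> float:
    """Derived Metric: completed requests per second over the full benchmark window."""
    if benchmark_duration_s <= 0:
        raise ValueError("benchmark duration must be positive")
    return request_count / benchmark_duration_s
```

For example, 105 successful requests over a 10-second benchmark yields 10.5 requests/sec.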
Detailed Metric Descriptions
Streaming Metrics
All metrics in this section require the --streaming flag with a token-producing endpoint and at least one non-empty response chunk.
Time to First Token (TTFT)
Type: Record Metric
Measures how long it takes to receive the first token (or chunk of tokens) after sending a request. This is critical for user-perceived responsiveness in streaming scenarios, as it represents how quickly the model begins generating output.
Formula:
Notes:
- Includes network latency, queuing time, prompt processing, and generation of the first token (or chunk of tokens).
- Raw timestamps are in nanoseconds; converted to milliseconds for display and seconds for rate calculations.
- Response chunks refer to individual messages with non-empty content received during streaming.
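The computation described above can be sketched as follows; the `(timestamp, content)` chunk representation and function name are simplifying assumptions, not AIPerf's internal types:

```python
NS_PER_MS = 1_000_000

def time_to_first_token_ms(request_start_ns: int,
                           chunks: list[tuple[int, str]]) -> float:
    """TTFT: delay from request start to the first non-empty response chunk."""
    for timestamp_ns, content in chunks:
        if content:  # skip empty chunks (e.g., role-only deltas)
            return (timestamp_ns - request_start_ns) / NS_PER_MS
    raise ValueError("no non-empty response chunk received")
```

Note how the raw nanosecond timestamps are converted to milliseconds for display, as described in the notes.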
Time to Second Token (TTST)
Type: Record Metric
Measures the time gap between the first and second chunk of tokens. This metric helps identify generation startup overhead separate from steady-state streaming throughput.
Formula:
Notes:
- Requires at least 2 non-empty response chunks to compute the time between first and second tokens.
- Raw timestamps are in nanoseconds; converted to milliseconds for display.
Time to First Output Token (TTFO)
Type: Record Metric
Calculates the time elapsed from request start to the first non-reasoning output token. This metric measures the latency from when a request is initiated to when the first actual output token (non-reasoning content) is received. It is particularly relevant for models that perform extended reasoning before generating output.
Formula:
Notes:
- TTFO vs TTFT: Time to First Output (TTFO) measures time to the first non-reasoning token, while Time to First Token (TTFT) measures time to any first token including reasoning tokens. For models without reasoning, TTFO and TTFT are equivalent.
- Non-reasoning tokens include `TextResponseData` with non-empty text, or `ReasoningResponseData` with a non-empty `content` field (regardless of the `reasoning` field).
- Requires at least one non-empty non-reasoning response chunk.
Inter Token Latency (ITL)
Type: Record Metric
Measures the average time between consecutive tokens during generation, excluding the initial TTFT overhead. This represents the steady-state token generation rate.
Formula:
Notes:
- Requires at least 2 non-empty response chunks and valid `time_to_first_token`, `request_latency`, and `output_sequence_length` metrics.
- Result is in seconds when used for throughput calculations (Output Token Throughput Per User).
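A common formulation consistent with the dependencies listed above (the function name is hypothetical, and AIPerf's exact implementation may differ):

```python
def inter_token_latency_ms(request_latency_ms: float, ttft_ms: float,
                           output_sequence_length: int) -> float:
    """Average gap between consecutive tokens, excluding the TTFT overhead."""
    if output_sequence_length < 2:
        raise ValueError("need at least 2 output tokens")
    return (request_latency_ms - ttft_ms) / (output_sequence_length - 1)
```

For example, a request with 1010 ms total latency, 10 ms TTFT, and 101 output tokens has an ITL of 10 ms.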
Inter Chunk Latency (ICL)
Type: Record Metric
Captures the time gaps between all consecutive response chunks in a streaming response, providing a distribution of chunk arrival times rather than a single average. Note that this is different from the ITL metric, which measures the time between consecutive tokens regardless of chunk size.
Formula:
Notes:
- Requires at least 2 response chunks.
- Unlike ITL (which produces a single average), ICL provides the full distribution of inter-chunk times.
- Useful for detecting variability, jitter, or issues in streaming delivery.
- Analyzing ICL distributions can reveal batching behavior, scheduling issues, or network variability.
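The per-chunk gap distribution can be sketched as (illustrative helper, assuming nanosecond chunk timestamps):

```python
def inter_chunk_latencies_ms(chunk_timestamps_ns: list[int]) -> list[float]:
    """Gaps between consecutive response chunks, as a full distribution."""
    if len(chunk_timestamps_ns) < 2:
        raise ValueError("need at least 2 response chunks")
    return [
        (later - earlier) / 1_000_000
        for earlier, later in zip(chunk_timestamps_ns, chunk_timestamps_ns[1:])
    ]
```

Unlike ITL's single average, the returned list preserves jitter: chunks arriving at 0, 2, and 5 ms produce gaps of [2.0, 3.0] ms.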
Output Token Throughput Per User
Type: Record Metric
This metric is computed per-request, and it excludes the TTFT from the equation, so it is not directly comparable to the Output Token Throughput metric.
The token generation rate experienced by an individual user/request, measured as the inverse of inter-token latency. This represents single-request streaming performance.
Formula:
Notes:
- Computes the inverse of ITL to show tokens per second from an individual user’s perspective.
- Differs from Output Token Throughput (aggregate across all concurrent requests) by focusing on single-request experience.
- Useful for understanding the user experience independent of concurrency effects.
Prefill Throughput Per User
Type: Record Metric
Measures the rate at which input tokens are processed during the prefill phase, calculated as input tokens per second based on TTFT. This is only applicable to streaming responses.
Formula:
Notes:
- Higher values indicate faster prompt processing.
- Useful for understanding input processing capacity and bottlenecks.
- Depends on Input Sequence Length and TTFT metrics.
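Based on the two dependencies above, the metric can be sketched as (hypothetical function name):

```python
def prefill_throughput_per_user(input_sequence_length: int, ttft_ms: float) -> float:
    """Input tokens processed per second during prefill, derived from ISL and TTFT."""
    if ttft_ms <= 0:
        raise ValueError("TTFT must be positive")
    # TTFT is in milliseconds; convert to seconds for a tokens/sec rate.
    return input_sequence_length / (ttft_ms / 1000.0)
```

For example, a 500-token prompt with a 250 ms TTFT corresponds to 2000 input tokens/sec of prefill throughput.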
Token Based Metrics
All metrics in this section require token-producing endpoints that return text content (chat, completion, etc.). These metrics are not available for embeddings or other non-generative endpoints.
Output Token Count
Type: Record Metric
The number of output tokens generated for a single request, excluding reasoning tokens. This represents the output tokens returned to the user across all responses for the request.
Formula:
Notes:
- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
- For streaming requests with multiple responses, the responses are joined together and then tokens are counted.
- For models that expose reasoning in a separate `reasoning_content` field, this metric counts only non-reasoning output tokens.
- If reasoning appears inside the regular `content` (e.g., `<think>` blocks), those tokens will be counted unless explicitly filtered.
Output Sequence Length (OSL)
Type: Record Metric
The total number of completion tokens (output + reasoning) generated for a single request across all its responses. This represents the complete token generation workload for the request.
Formula:
Notes:
- For models that do not support/separate reasoning tokens, OSL equals the output token count.
Input Sequence Length (ISL)
Type: Record Metric
The number of input/prompt tokens for a single request. This represents the size of the input sent to the model.
Formula:
Notes:
- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
- Useful for understanding the relationship between input size and latency/throughput.
Total Output Tokens
Type: Derived Metric
The sum of all output tokens (excluding reasoning tokens) generated across all requests. This represents the total output token workload.
Formula:
Notes:
- Aggregates output tokens across all successful requests.
- Useful for capacity planning and cost estimation.
Total Output Sequence Length
Type: Derived Metric
The sum of all completion tokens (output + reasoning) generated across all requests. This represents the complete token generation workload.
Formula:
Notes:
- Aggregates the complete token generation workload including both output and reasoning tokens.
- For models without reasoning tokens, this equals Total Output Tokens.
Total Input Sequence Length
Type: Derived Metric
The sum of all input/prompt tokens processed across all requests. This represents the total input workload sent to the model.
Formula:
Notes:
- Useful for understanding the input workload, capacity planning, and analyzing the relationship between input size and system performance.
E2E Output Token Throughput
Type: Record Metric
Per-request output token throughput based on end-to-end request latency. Unlike Output Token Throughput Per User (which uses 1/ITL and excludes TTFT), this metric includes TTFT, queuing, and all other overhead in the denominator. Available for both streaming and non-streaming responses.
Formula:
Notes:
- Uses total request latency (not ITL), so values will be slightly lower than Output Token Throughput Per User for streaming responses.
- Available for non-streaming responses (unlike Output Token Throughput Per User which requires streaming).
- Flags: `PRODUCES_TOKENS_ONLY | LARGER_IS_BETTER`
- Depends on Output Sequence Length and Request Latency metrics.
Output Token Throughput
Type: Derived Metric
This metric is computed as a single value across all requests and includes TTFT in the equation, so it is not directly comparable to the Output Token Throughput Per User metric.
The aggregate token generation rate across all concurrent requests, measured as total tokens per second. This represents the system’s overall token generation capacity.
Formula:
Notes:
- Measures aggregate throughput across all concurrent requests; represents the overall system token generation rate.
- Higher values indicate better system utilization and capacity.
Total Token Throughput
Type: Derived Metric
Calculates the total token throughput metric, combining both input and output token processing across all concurrent requests.
Formula:
Notes:
- Measures the combined input and output token processing rate.
- Includes reasoning tokens in the output count (via total_osl).
- Useful for understanding total system token processing capacity.
Image Metrics
All metrics in this section require image-capable endpoints (e.g., image generation APIs). These metrics are not available for text-only or other non-image endpoints.
Number of Images
Type: Record Metric
The number of images in the request, summed across all turns. This is the foundation metric used by Image Throughput and Image Latency.
Formula:
Notes:
- Requires at least one image in at least one turn.
- Not displayed in console output (`NO_CONSOLE` flag).
Image Throughput
Type: Record Metric
Calculates the image throughput from the record by dividing the number of images by the request latency.
Formula:
Notes:
- Higher values indicate faster image generation.
Image Latency
Type: Record Metric
Calculates the image latency from the record by dividing the request latency by the number of images.
Formula:
Notes:
- Lower values indicate faster per-image generation.
Video Metrics
All metrics in this section require video-producing endpoints (e.g., SGLang video generation). These metrics rely on server-reported fields in the response and are not available for non-video endpoints.
Video Inference Time
Type: Record Metric
Server-reported GPU generation time for video inference, extracted from the `inference_time_s` field in video generation responses (e.g., SGLang).
Formula:
Notes:
- Value comes from the server, not computed by AIPerf.
- Displayed in milliseconds.
Video Peak Memory
Type: Record Metric
Server-reported peak GPU memory usage during video generation, extracted from the `peak_memory_mb` field in video generation responses.
Formula:
Notes:
- Value comes from the server, not computed by AIPerf.
- Unit is megabytes.
Reasoning Metrics
All metrics in this section require models and backends that expose reasoning content in a separate `reasoning_content` field, distinct from the regular `content` field.
Reasoning Token Count
Type: Record Metric
The number of reasoning tokens generated for a single request. These are tokens used for “thinking” or chain-of-thought reasoning before generating the final output.
Formula:
Notes:
- Tokenization uses `add_special_tokens=False` to count only content tokens, excluding special tokens added by the tokenizer.
- Does not differentiate `<think>` tags or extract reasoning from within the regular `content` field.
Total Reasoning Tokens
Type: Derived Metric
The sum of all reasoning tokens generated across all requests. This represents the total reasoning/thinking workload.
Formula:
Notes:
- Useful for understanding the reasoning overhead and cost for reasoning-enabled models.
Usage Field Metrics
All metrics in this section track API-reported token counts from the usage field in API responses. These are not displayed in console output but are available in exports. These metrics are useful for comparing client-side token counts with server-reported counts to detect discrepancies.
Usage Prompt Tokens
Type: Record Metric
The number of input/prompt tokens as reported by the API’s `usage.prompt_tokens` field for a single request.
Formula:
Notes:
- Taken from the API response `usage` object, not computed by AIPerf.
- May differ from client-side Input Sequence Length due to different tokenizers or special tokens.
- For streaming responses, uses the last non-None value reported.
Usage Completion Tokens
Type: Record Metric
The number of completion tokens as reported by the API’s `usage.completion_tokens` field for a single request.
Formula:
Notes:
- Taken from the API response `usage` object, not computed by AIPerf.
- May differ from client-side Output Sequence Length due to different tokenizers or counting methods.
- For streaming responses, uses the last non-None value reported.
Usage Total Tokens
Type: Record Metric
The total number of tokens (prompt + completion) as reported by the API’s `usage.total_tokens` field for a single request.
Formula:
Notes:
- Taken from the API response `usage` object, not computed by AIPerf.
- Should generally equal `usage_prompt_tokens + usage_completion_tokens`.
- For streaming responses, uses the last non-None value reported.
Usage Reasoning Tokens
Type: Record Metric
The number of reasoning tokens as reported by the API’s `usage.completion_tokens_details.reasoning_tokens` field for a single request. Only available for reasoning-enabled models.
Formula:
Notes:
- Taken from the API response for reasoning-enabled models.
- May differ from client-side Reasoning Token Count due to different tokenizers.
- For streaming responses, uses the last non-None value reported.
Total Usage Prompt Tokens
Type: Derived Metric
The sum of all API-reported prompt tokens across all requests.
Formula:
Notes:
- Aggregates server-reported input tokens across all requests.
Total Usage Completion Tokens
Type: Derived Metric
The sum of all API-reported completion tokens across all requests.
Formula:
Notes:
- Aggregates server-reported completion tokens across all requests.
Total Usage Total Tokens
Type: Derived Metric
The sum of all API-reported total tokens across all requests.
Formula:
Notes:
- Aggregates server-reported total tokens across all requests.
Usage Discrepancy Metrics
These metrics measure the percentage difference between API-reported token counts (usage fields) and client-computed token counts. They are not displayed in console output but help identify tokenizer mismatches or counting discrepancies.
Usage Prompt Tokens Diff %
Type: Record Metric
The percentage difference between API-reported prompt tokens and client-computed Input Sequence Length.
Formula:
Notes:
- Values close to 0% indicate good agreement between client and server token counts.
- Large differences may indicate tokenizer mismatches or special token handling differences.
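One plausible formulation of the percentage difference; the sign convention (positive = server reported more than the client counted) and function name are assumptions for illustration:

```python
def usage_diff_pct(server_tokens: int, client_tokens: int) -> float:
    """% difference between a server-reported and a client-computed token count."""
    if client_tokens == 0:
        raise ValueError("client token count must be non-zero")
    return (server_tokens - client_tokens) / client_tokens * 100.0
```

A server report of 110 prompt tokens against a client-computed ISL of 100 gives a 10% difference, which may point at a tokenizer mismatch.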
Usage Completion Tokens Diff %
Type: Record Metric
The percentage difference between API-reported completion tokens and client-computed Output Sequence Length.
Formula:
Notes:
- Values close to 0% indicate good agreement between client and server token counts.
- Large differences may indicate tokenizer mismatches or different counting methods.
Usage Reasoning Tokens Diff %
Type: Record Metric
The percentage difference between API-reported reasoning tokens and client-computed Reasoning Token Count.
Formula:
Notes:
- Only available for reasoning-enabled models.
- Values close to 0% indicate good agreement between client and server reasoning token counts.
Usage Discrepancy Count
Type: Aggregate Metric
The number of requests where token count differences exceed a threshold (default 10%).
Formula:
Notes:
- Default threshold is 10% difference.
- Counts requests where prompt, completion, or reasoning token differences are significant.
- Useful for monitoring overall token count agreement quality.
OSL Mismatch Metrics
These metrics measure the difference between the requested output sequence length (`--osl`/`max_tokens`) and the actual output tokens generated. They help identify when the server is not honoring the requested output length, typically because EOS tokens stop generation early. These metrics are not displayed in console output but are available in exports and used by the end-of-benchmark warning.
OSL Mismatch Diff %
Type: Record Metric
The signed percentage difference between actual output sequence length and requested OSL. Negative values mean the server stopped early (actual < requested), positive values mean it generated more than requested.
Formula: `(actual_osl - requested_osl) / requested_osl * 100`
Notes:
- Negative = stopped early (hit EOS before max_tokens)
- Positive = generated more than requested
- 0% = exact match between requested and actual
- Example: Requested 100 tokens, got 50 → Diff = -50%
- Example: Requested 100 tokens, got 120 → Diff = 20%
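The examples above correspond to this signed formula (illustrative sketch; the function name is not part of AIPerf):

```python
def osl_mismatch_diff_pct(actual_osl: int, requested_osl: int) -> float:
    """Signed % difference between actual and requested output sequence length."""
    if requested_osl <= 0:
        raise ValueError("requested OSL must be positive")
    return (actual_osl - requested_osl) / requested_osl * 100.0
```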
OSL Mismatch Count
Type: Aggregate Metric
The count of requests where the absolute token difference exceeds the effective threshold. Used to trigger the end-of-benchmark warning panel.
Formula:
Notes:
- Default percentage threshold is 5% (`AIPERF_METRICS_OSL_MISMATCH_PCT_THRESHOLD`).
- Default max token threshold is 50 (`AIPERF_METRICS_OSL_MISMATCH_MAX_TOKEN_THRESHOLD`).
- The `min()` makes the threshold tighter for large OSL: requesting 2000 tokens caps at a 50-token diff instead of 100 (5%).
- Counts both early stops (negative diff) and over-generation (positive diff).
- When this count is non-zero, a warning panel is displayed at the end of the benchmark.
- To ensure servers honor `--osl`, use `--extra-inputs ignore_eos:true` or `--extra-inputs min_tokens:<value>`.
- If the discrepancy is due to a tokenizer mismatch between client and server, use `--use-server-token-count`.
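The effective-threshold logic described in the notes can be sketched as follows; the function name is hypothetical, and the defaults mirror the documented 5% and 50-token values:

```python
def osl_mismatch_exceeds_threshold(actual_osl: int, requested_osl: int,
                                   pct_threshold: float = 0.05,
                                   max_token_threshold: int = 50) -> bool:
    """True if |actual - requested| exceeds the effective threshold:
    min(pct_threshold * requested, max_token_threshold)."""
    effective_threshold = min(pct_threshold * requested_osl, max_token_threshold)
    return abs(actual_osl - requested_osl) > effective_threshold
```

With a requested OSL of 2000, the 5% bound would be 100 tokens, but the `min()` caps it at 50, so a 60-token shortfall counts as a mismatch.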
Server support for min_tokens:
Goodput Metrics
Goodput metrics measure the throughput of requests that meet user-defined Service Level Objectives (SLOs). See the Goodput tutorial for configuration details.
Good Request Count
Type: Aggregate Metric
The number of requests that meet all user-defined SLO thresholds during the benchmark.
Formula:
Notes:
- Requires SLO thresholds to be configured (e.g., `--goodput`).
- Only counts requests where ALL SLO constraints are satisfied.
- Used to calculate the Goodput metric.
Goodput
Type: Derived Metric
The rate of SLO-compliant requests per second. This represents the effective throughput of requests meeting quality requirements.
Formula: `good_request_count / benchmark_duration_seconds`
Notes:
- Requires SLO thresholds to be configured.
- Always less than or equal to Request Throughput.
- Useful for capacity planning and comparing systems based on quality-adjusted throughput.
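A simplified sketch with a single latency SLO; AIPerf supports multiple SLO constraints and a request must satisfy all of them to count as good:

```python
def goodput(request_latencies_ms: list[float], latency_slo_ms: float,
            benchmark_duration_s: float) -> float:
    """SLO-compliant requests per second under one latency constraint."""
    good = sum(1 for latency in request_latencies_ms if latency <= latency_slo_ms)
    return good / benchmark_duration_s
```

With latencies [100, 300, 150, 90] ms, a 200 ms SLO, and a 2-second benchmark, 3 of 4 requests qualify, giving a goodput of 1.5 requests/sec versus a raw throughput of 2.0.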
Error Metrics
These metrics are computed only for failed/error requests and are not displayed in console output.
Error Input Sequence Length
Type: Record Metric
The number of input tokens for requests that resulted in errors. This helps analyze whether input size correlates with errors.
Formula:
Notes:
- Only computed for requests that failed.
- Useful for identifying if certain input sizes trigger errors.
Total Error Input Sequence Length
Type: Derived Metric
The sum of all input tokens from requests that resulted in errors.
Formula:
Notes:
- Aggregates input tokens across all failed requests.
General Metrics
Metrics in this section are available for all benchmark runs with no special requirements.
Request Latency
Type: Record Metric
Measures the total end-to-end time from sending a request until receiving the final response. For streaming requests with multiple responses, this measures until the last response is received. This is the complete time experienced by the client for a single request.
Formula:
Notes:
- Includes all components: network time, queuing, prompt processing, token generation, and response transmission.
- For streaming requests, measures from request start to the final chunk received.
Request Throughput
Type: Derived Metric
The overall rate of completed requests per second across the entire benchmark. This represents the system’s ability to process requests under the given concurrency and load.
Formula: `request_count / benchmark_duration_seconds`
Notes:
- Captures the aggregate request processing rate; higher values indicate better system throughput.
- Affected by concurrency level, request complexity, output sequence length, and system capacity.
Request Count
Type: Aggregate Metric
The total number of successfully completed requests in the benchmark. This includes all requests that received valid responses, regardless of streaming mode.
Formula:
Error Request Count
Type: Aggregate Metric
The total number of failed/error requests encountered during the benchmark. This includes network errors, HTTP errors, timeout errors, and other failures.
Formula:
Notes:
- Error rate can be computed as `error_request_count / (request_count + error_request_count)`.
Minimum Request Timestamp
Type: Aggregate Metric
The wall-clock timestamp of the first request sent in the benchmark. This is used to calculate the benchmark duration and represents the start of the benchmark run.
Formula:
Maximum Response Timestamp
Type: Aggregate Metric
The wall-clock timestamp of the last response received in the benchmark. This is used to calculate the benchmark duration and represents the end of the benchmark run.
Formula:
Benchmark Duration
Type: Derived Metric
The total elapsed time from the first request sent to the last response received. This represents the complete wall-clock duration of the benchmark run.
Formula: `max_response_timestamp - min_request_timestamp`
Notes:
- Uses wall-clock timestamps representing real calendar time.
- Used as the denominator for throughput calculations; represents the effective measurement window.
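As a minimal sketch of the duration computation (illustrative function name; the timestamps are the nanosecond Aggregate Metrics described above):

```python
NS_PER_S = 1_000_000_000

def benchmark_duration_s(min_request_timestamp_ns: int,
                         max_response_timestamp_ns: int) -> float:
    """Wall-clock span from the first request sent to the last response received."""
    return (max_response_timestamp_ns - min_request_timestamp_ns) / NS_PER_S
```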
HTTP Trace Metrics
All metrics in this section require HTTP trace data to be collected during requests. These metrics provide detailed HTTP request lifecycle timing following k6 naming conventions. See the HTTP Trace Metrics tutorial for configuration details.
HTTP Blocked
Type: Record Metric
Time spent blocked waiting for a free TCP connection slot from the pool. This metric measures the time a request spent waiting in the connection pool queue before a connection became available. High values indicate connection pool saturation.
Formula:
Notes:
- k6 equivalent: `http_req_blocked`
- HAR equivalent: `blocked`
- Returns 0 if no pool wait occurred (connection immediately available).
- Only available for AioHttpTraceData.
HTTP DNS Lookup
Type: Record Metric
Time spent on DNS resolution. This metric measures the time spent resolving the hostname to an IP address.
Formula:
Notes:
- k6 equivalent: `http_req_looking_up`
- HAR equivalent: `dns`
- Returns 0 if DNS cache hit or connection was reused.
- Only available for AioHttpTraceData.
HTTP Connecting
Type: Record Metric
Time spent establishing TCP connection to the remote host. For HTTPS requests, this includes both TCP connection establishment and TLS handshake time (combined measurement from aiohttp).
Formula:
Notes:
- k6 equivalent: `http_req_connecting`
- HAR equivalent: `connect`
- Returns 0 if connection was reused.
- Only available for AioHttpTraceData.
HTTP Sending
Type: Record Metric
Time spent sending data to the remote host. This metric measures the time from when the request started being sent to when the full request (headers + body) was transmitted.
Formula:
Notes:
- k6 equivalent: `http_req_sending`
- HAR equivalent: `send`
HTTP Waiting (TTFB)
Type: Record Metric
Time to First Byte (TTFB) - time waiting for the server to respond. This metric measures the time from when the request was fully sent to when the first byte of the response body was received. This represents server processing time plus network latency.
Formula:
Notes:
- k6 equivalent: `http_req_waiting` (also known as TTFB)
- HAR equivalent: `wait`
- Note that this is not the same as the time to first token (TTFT), which is the time from request start to the first valid token received. The server may send non-token data first.
HTTP Receiving
Type: Record Metric
Time spent receiving response data from the remote host. This metric measures the time from when the first byte of the response was received to when the last byte was received.
Formula:
Notes:
- k6 equivalent: `http_req_receiving`
- HAR equivalent: `receive`
- Returns 0 if response was a single chunk.
HTTP Duration
Type: Record Metric
Time for HTTP request/response exchange, excluding connection overhead. This measures only the request/response exchange time: sending + waiting + receiving.
Formula: `sending + waiting + receiving`
Notes:
- k6 equivalent: `http_req_duration`
- HAR equivalent: `time`
- EXCLUDES connection overhead (blocked, dns_lookup, connecting).
- For full end-to-end time including connection setup, use `http_req_total`.
- Note: This uses trace-level timestamps for more accurate measurement than application-level request latency.
HTTP Connection Overhead
Type: Record Metric
Total connection overhead time (blocked + dns_lookup + connecting). This metric combines all pre-request overhead.
Formula: `blocked + dns_lookup + connecting`
Notes:
- Useful for identifying total connection establishment costs.
- Returns 0 if connection was reused with no pool wait.
- Only available for AioHttpTraceData.
HTTP Total Time
Type: Record Metric
Sum of all HTTP timing phases from connection pool to last chunk received. This is the sum of all 6 timing components: blocked + dns_lookup + connecting + sending + waiting + receiving.
Formula: `blocked + dns_lookup + connecting + sending + waiting + receiving`
Notes:
- This ensures the math adds up: individual timing metrics sum exactly to this total.
- Only available for AioHttpTraceData (requires connection overhead metrics).
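The relationship between the six phases and the composite metrics can be sketched as (hypothetical helper, phase names taken from the metrics above):

```python
def http_timings(blocked: float, dns_lookup: float, connecting: float,
                 sending: float, waiting: float, receiving: float) -> dict[str, float]:
    """Roll up the six HTTP trace phases into the documented composites."""
    duration = sending + waiting + receiving      # http_req_duration (no overhead)
    overhead = blocked + dns_lookup + connecting  # connection overhead
    return {"duration": duration, "overhead": overhead, "total": overhead + duration}
```

This makes the "math adds up" property explicit: overhead plus duration always equals the total.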
HTTP Data Sent
Type: Record Metric
Total bytes sent in the HTTP request (headers + body).
Formula:
Notes:
- k6 equivalent: `data_sent` (per request)
- Measures total bytes written to the transport layer.
HTTP Data Received
Type: Record Metric
Total bytes received in the HTTP response (headers + body).
Formula:
Notes:
- k6 equivalent: `data_received` (per request)
- Measures total bytes read from the transport layer.
HTTP Connection Reused
Type: Record Metric
Whether the HTTP connection was reused from the connection pool. Returns 1 if reused, 0 if new connection was established.
Formula:
Notes:
- Helps identify connection reuse patterns and keep-alive effectiveness.
- Only available for AioHttpTraceData.
HTTP Chunks Sent
Type: Record Metric
Number of transport-level write operations during the request. Useful for debugging chunked transfers.
Formula:
Notes:
- Not displayed in console output (`NO_CONSOLE` flag).
HTTP Chunks Received
Type: Record Metric
Number of transport-level read operations during the response. Useful for debugging chunked/streaming responses.
Formula:
Notes:
- Not displayed in console output (`NO_CONSOLE` flag).
Multi-Run Aggregate Metrics
These metrics are only available when using `--num-profile-runs > 1` for confidence reporting.
When running multiple profile iterations with `--num-profile-runs`, AIPerf computes aggregate statistics across all runs to quantify measurement variance and repeatability. These statistics are written to `aggregate/profile_export_aiperf_aggregate.json` and `aggregate/profile_export_aiperf_aggregate.csv`.
For detailed information about aggregate statistics, their mathematical definitions, and interpretation guidelines, see the Multi-Run Confidence Tutorial.
Quick Reference
The following aggregate statistics are computed for each metric:
- mean: Average value across all runs
- std: Standard deviation (measure of spread)
- min: Minimum value observed
- max: Maximum value observed
- cv: Coefficient of Variation (normalized variability)
- se: Standard Error (uncertainty in the mean)
- ci_low, ci_high: Confidence interval bounds
- t_critical: t-distribution critical value used
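The statistics above can be sketched as follows. This is an illustrative computation, not AIPerf's implementation: the two-sided 95% t critical values come from a small hardcoded lookup (keyed by degrees of freedom, `n - 1`), since the standard library has no t-distribution.

```python
import math
import statistics

# Two-sided 95% t critical values for small df = n - 1 (illustrative subset).
T_CRITICAL_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def confidence_stats(run_values: list[float]) -> dict[str, float]:
    """Aggregate statistics across runs, mirroring the fields listed above."""
    n = len(run_values)
    mean = statistics.fmean(run_values)
    std = statistics.stdev(run_values)   # sample standard deviation
    se = std / math.sqrt(n)              # standard error of the mean
    t = T_CRITICAL_95[n - 1]
    return {
        "mean": mean, "std": std, "min": min(run_values), "max": max(run_values),
        "cv": std / mean, "se": se,
        "ci_low": mean - t * se, "ci_high": mean + t * se, "t_critical": t,
    }
```

For three runs measuring 10, 12, and 11 requests/sec, the mean is 11.0 with a sample standard deviation of 1.0, and the 95% confidence interval brackets the mean.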
Aggregate Metadata
The aggregate output also includes metadata about the multi-run benchmark:
- aggregation_type: Always “confidence” for multi-run confidence reporting
- num_profile_runs: Total number of runs requested
- num_successful_runs: Number of runs that completed successfully
- failed_runs: List of failed runs with error details
- confidence_level: Confidence level used for intervals (e.g., 0.95)
- cooldown_seconds: Cooldown duration between runs
- run_labels: Labels for each run (e.g., [“run_0001”, “run_0002”, …])
Metric Flags Reference
Metric flags are used to control when and how metrics are computed, displayed, and grouped. Flags can be combined using bitwise operations to create composite behaviors.
Individual Flags
Composite Flags
These flags are combinations of multiple individual flags for convenience: