Server Metrics JSON Export Schema | NVIDIA AIPerf Documentation

This document describes the structure and semantics of every field in the AIPerf server metrics JSON export format.

Overview

The server metrics JSON export provides aggregated statistics from Prometheus metrics collected during a benchmark run.

Data Organization

Metrics are grouped by name across all endpoints. When scraping multiple servers (e.g., prefill worker at :10000 and decode worker at :10001), metrics with the same name appear under a single key.

Each unique combination of endpoint URL and Prometheus labels is a separate series with independent statistics. Within each metric, the series array contains one entry per combination.

For example, if vllm:num_requests_running is scraped from 3 endpoints with 2 label sets each, its series array contains 6 entries.
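
The grouping described above can be checked from the consumer side. The following is a minimal sketch, assuming the export has already been parsed into a dict; the helper name `series_counts`, the label values, and the example fragment are illustrative and not part of AIPerf.

```python
from collections import defaultdict

def series_counts(data):
    """Return {metric_name: {endpoint_url: number_of_series}} for a parsed export."""
    counts = {}
    for name, metric in data.get("metrics", {}).items():
        per_endpoint = defaultdict(int)
        for series in metric.get("series", []):
            per_endpoint[series["endpoint_url"]] += 1
        counts[name] = dict(per_endpoint)
    return counts

# Hypothetical fragment: 3 endpoints x 2 label sets = 6 series under one metric key
example = {
    "metrics": {
        "vllm:num_requests_running": {
            "type": "gauge",
            "series": [
                {"endpoint_url": f"http://localhost:{port}/metrics",
                 "labels": {"model": model}}
                for port in (10000, 10001, 10002)
                for model in ("model-a", "model-b")
            ],
        }
    }
}

counts = series_counts(example)
total_series = sum(counts["vllm:num_requests_running"].values())
print(total_series)
```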

Example Command

$ aiperf profile \
>   -m Qwen/Qwen3-0.6B \
>   --url localhost:10000 \
>   --server-metrics localhost:10001 localhost:10002 \
>   --request-count 50 \
>   --concurrency 50

Note: The --url endpoint (localhost:10000) is automatically scraped for server metrics.

Format selection: By default, AIPerf generates JSON and CSV exports. This document describes the JSON format. To control which formats are generated, use --server-metrics-formats:

Default: --server-metrics-formats json csv (JSONL and Parquet excluded to avoid large files)
Include JSONL: --server-metrics-formats json csv jsonl
Include Parquet: --server-metrics-formats json csv parquet
JSON only: --server-metrics-formats json

The Parquet format exports raw time-series data with delta calculations in columnar format, optimized for SQL analytics with DuckDB, pandas, or Polars. See Parquet Schema Reference for the complete schema.

Related documentation:

Server Metrics Tutorial - Quick start guide and usage examples
Server Metrics Reference - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo, Triton)
Parquet Schema Reference - Raw time-series data schema

Data Access

Metrics are organized for O(1) lookup by name with nested stats within each series:

data["metrics"]["metric_name"]["series"][0]["stats"]["p99"]

Top-Level Structure

{
  "schema_version": "1.0",
  "aiperf_version": "0.11.0",
  "benchmark_id": "550e8400-e29b-41d4-a716-446655440000",
  "summary": { ... },
  "metrics": { ... },
  "input_config": { ... }
}

Field	Type	Description
`schema_version`	string	Schema version for this export format (e.g., `"1.0"`)
`aiperf_version`	string or null	AIPerf version that generated this export (e.g., `"0.11.0"`). `null` if version unavailable.
`benchmark_id`	string or null	Unique UUID identifying this benchmark run. `null` if not available.
`summary`	object	Collection metadata and endpoint information
`metrics`	object	Metrics keyed by name, each containing type info and series data
`input_config`	object	Serialized user configuration used for this benchmark run

Summary Section

"summary": {
  "endpoints_configured": [
    "http://localhost:10000/metrics",
    "http://localhost:10001/metrics"
  ],
  "endpoints_successful": [
    "http://localhost:10000/metrics",
    "http://localhost:10001/metrics"
  ],
  "start_time": "2025-12-10T16:07:13.596361",
  "end_time": "2025-12-10T16:07:35.749758",
  "endpoint_info": { ... }
}

Field	Type	Description
`endpoints_configured`	array[string]	Full endpoint URLs that were configured for scraping
`endpoints_successful`	array[string]	Full endpoint URLs that returned data
`start_time`	datetime	When metrics collection started (ISO 8601)
`end_time`	datetime	When metrics collection ended (ISO 8601)
`endpoint_info`	object	Per-endpoint collection metadata

Endpoint Info

"endpoint_info": {
  "http://localhost:10000/metrics": {
    "total_fetches": 144,
    "first_fetch_ns": 1765529006843416914,
    "last_fetch_ns": 1765529029508409301,
    "avg_fetch_latency_ms": 296.8633202916667,
    "unique_updates": 72,
    "first_update_ns": 1765529006843416914,
    "last_update_ns": 1765529029508409301,
    "duration_seconds": 22.664992387,
    "avg_update_interval_ms": 319.225244887324,
    "median_update_interval_ms": 334.0127105
  }
}

Field	Type	Description
`total_fetches`	int	Total number of HTTP fetches from this endpoint
`first_fetch_ns`	int	Timestamp of first fetch in nanoseconds
`last_fetch_ns`	int	Timestamp of last fetch in nanoseconds
`avg_fetch_latency_ms`	float	Average time to fetch metrics from this endpoint in milliseconds
`unique_updates`	int	Number of fetches that returned changed metrics
`first_update_ns`	int	Timestamp of first unique update in nanoseconds
`last_update_ns`	int	Timestamp of last unique update in nanoseconds
`duration_seconds`	float	Time span from first to last unique update in seconds
`avg_update_interval_ms`	float	Average time between unique metric updates in milliseconds
`median_update_interval_ms`	float or null	Median time between unique metric updates in milliseconds. More robust to outliers than average. `null` if fewer than 2 intervals.
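
The timing fields can be cross-checked against each other. The sketch below recomputes `duration_seconds` and `avg_update_interval_ms` from the raw timestamps using the example values above. The relationship "average interval = duration / (unique_updates - 1)" is inferred from those example numbers, not stated by the schema, and the helper name `check_endpoint_info` is hypothetical.

```python
def check_endpoint_info(info):
    """Recompute duration and mean update interval from the raw timestamps."""
    # Integer subtraction first, so nanosecond precision is not lost to floats
    duration_s = (info["last_update_ns"] - info["first_update_ns"]) / 1e9
    intervals = info["unique_updates"] - 1  # N updates span N-1 intervals (assumption)
    avg_interval_ms = duration_s * 1000 / intervals if intervals > 0 else None
    return duration_s, avg_interval_ms

info = {
    "unique_updates": 72,
    "first_update_ns": 1765529006843416914,
    "last_update_ns": 1765529029508409301,
}
duration_s, avg_interval_ms = check_endpoint_info(info)
print(duration_s, avg_interval_ms)
```

Both results agree with the `duration_seconds` and `avg_update_interval_ms` values in the example to within floating-point rounding.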

Metrics Section

Each metric entry has this structure:

"metrics": {
  "metric_name": {
    "type": "gauge|counter|histogram|unknown",
    "description": "Metric description from HELP text",
    "unit": "seconds|tokens|requests|...",
    "series": [ ... ]
  }
}

Field	Type	Description
`type`	string	Prometheus metric type: `gauge`, `counter`, `histogram`, or `unknown`
`description`	string	Human-readable description from Prometheus HELP text
`unit`	string or null	Unit inferred from metric name suffix. See Unit Inference for complete mapping of suffixes to unit values.
`series`	array	Statistics for each unique endpoint + label combination

Series Fields (Common)

Every series entry contains these common fields:

{
  "endpoint_url": "http://localhost:10000/metrics",
  "labels": {"model": "Qwen/Qwen3-0.6B", "dynamo_component": "prefill"}
}

Field	Type	Description
`endpoint_url`	string	Full endpoint URL (e.g., `http://localhost:10000/metrics`)
`labels`	object or null	Prometheus labels for this time series. `null` or missing if metric has no labels.

Gauge Metrics

Gauges represent point-in-time values that can go up or down (e.g., current queue depth, memory usage).

Gauge Series Fields

Field	Type	Description
`endpoint_url`	string	Full endpoint URL
`labels`	object or null	Prometheus labels for this series
`stats`	object	Nested statistics object (always present)
`timeslices`	array	Optional: Statistics broken down by time window

Gauge Stats Fields

Field	Type	Description
`avg`	float	Mean of all observed values during collection
`min`	float	Minimum observed value
`max`	float	Maximum observed value
`std`	float	Standard deviation of observed values
`p1`	float	1st percentile
`p5`	float	5th percentile
`p10`	float	10th percentile
`p25`	float	25th percentile
`p50`	float	50th percentile (median)
`p75`	float	75th percentile
`p90`	float	90th percentile
`p95`	float	95th percentile
`p99`	float	99th percentile

Gauge with Variation

{
  "endpoint_url": "http://localhost:10002/metrics",
  "labels": {
    "dynamo_component": "backend",
    "dynamo_endpoint": "generate",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "avg": 36.68055555555556,
    "min": 0.0,
    "max": 50.0,
    "std": 16.87887786545273,
    "p1": 0.0,
    "p5": 2.0,
    "p10": 8.0,
    "p25": 25.0,
    "p50": 45.5,
    "p75": 47.0,
    "p90": 48.0,
    "p95": 49.0,
    "p99": 50.0
  },
  "timeslices": [
    {
      "start_ns": 1765411635639590410,
      "end_ns": 1765411637639590410,
      "avg": 5.0,
      "min": 0.0,
      "max": 15.0
    },
    {
      "start_ns": 1765411637639590410,
      "end_ns": 1765411639639590410,
      "avg": 31.67,
      "min": 24.0,
      "max": 35.0
    }
  ]
}

Example interpretation (dynamo_component_inflight_requests):

“On average, 36.7 requests were in-flight”
“In-flight requests ranged from 0 to 50”
“99% of the time, in-flight requests were at or below 50”

Gauge with No Variation (constant)

When a gauge never changes during collection (standard deviation = 0), stats are still provided for API consistency. All percentiles equal the constant value:

{
  "endpoint_url": "http://localhost:11001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_namespace": "acasagrande_sglang_acasagrande_sglang_disagg"
  },
  "stats": {
    "avg": 1024.0,
    "min": 1024.0,
    "max": 1024.0,
    "std": 0.0,
    "p1": 1024.0,
    "p5": 1024.0,
    "p10": 1024.0,
    "p25": 1024.0,
    "p50": 1024.0,
    "p75": 1024.0,
    "p90": 1024.0,
    "p95": 1024.0,
    "p99": 1024.0
  }
}

Gauge Timeslices

Each gauge timeslice contains statistics for a fixed time window:

Field	Type	Description
`start_ns`	int	Timeslice start timestamp in nanoseconds
`end_ns`	int	Timeslice end timestamp in nanoseconds
`is_complete`	bool	Only present when `false` (partial timeslice, typically the final slice). Omitted for complete timeslices.
`avg`	float	Average value during this timeslice
`min`	float	Minimum value during this timeslice
`max`	float	Maximum value during this timeslice

{
  "start_ns": 1765411635639590410,
  "end_ns": 1765411637639590410,
  "avg": 5.0,
  "min": 0.0,
  "max": 15.0
}

Unknown Metrics

Prometheus families declared # TYPE foo untyped — and families that ship with no # TYPE line at all, which the parser also classifies as untyped — appear in the export with type: "unknown". node-exporter’s node_netstat_Icmp_*, node_netstat_Tcp_*, and node_netstat_IpExt_* families are typical examples.

AIPerf treats unknown as gauge-equivalent for storage and statistics: the series shape and stat fields are identical to a Gauge. The dedicated type: "unknown" tag is preserved (rather than flattened to "gauge") so a real gauge and an exporter-untyped scalar remain distinguishable for downstream consumers — e.g., to flag that the exporter is explicitly not asserting monotonic or rate semantics.

Unknown Series Fields

Identical to Gauge Series Fields.

Unknown Stats Fields

Identical to Gauge Stats Fields.

Counter Metrics

Counters are monotonically increasing values (e.g., total requests processed, total bytes transferred).

Counter Series Fields

Field	Type	Description
`endpoint_url`	string	Full endpoint URL
`labels`	object or null	Prometheus labels for this series
`stats`	object	Nested statistics object (always present)
`timeslices`	array	Optional: Statistics broken down by time window

Counter Stats Fields

Field	Type	Description
`total`	float	Total increase in counter value during collection period
`rate`	float	Overall rate: `total / duration_seconds`
`rate_avg`	float	Mean of the per-timeslice rates (see Counter Normalization)
`rate_min`	float	Minimum per-timeslice rate observed
`rate_max`	float	Maximum per-timeslice rate observed
`rate_std`	float	Standard deviation of the per-timeslice rates

Counter with Activity

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "generate",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "total": 318092.0,
    "rate": 14206.446174934012,
    "rate_avg": 14458.727272727272,
    "rate_min": 0.0,
    "rate_max": 69626.0,
    "rate_std": 25812.771107887304
  },
  "timeslices": [
    {
      "start_ns": 1765411635103733481,
      "end_ns": 1765411637103733481,
      "total": 104707.0,
      "rate": 52353.5
    },
    {
      "start_ns": 1765411637103733481,
      "end_ns": 1765411639103733481,
      "total": 74133.0,
      "rate": 37066.5
    }
  ]
}

Example interpretation (dynamo_component_request_bytes):

stats.total: 318092 → “318,092 bytes were received during the benchmark”
stats.rate: 14206.4 → “Overall throughput was 14,206 bytes/second”
stats.rate_avg: 14458.7 → “Average instantaneous rate was 14,459 bytes/second”
stats.rate_min: 0.0 → “Slowest period saw 0 bytes/second (idle)”
stats.rate_max: 69626.0 → “Fastest burst reached 69,626 bytes/second”

Counter with No Activity

When a counter doesn’t change during the collection period (total = 0), stats are still provided for API consistency:

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "clear_kv_blocks",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "total": 0.0,
    "rate": 0.0
  }
}

Counter Timeslices

Each counter timeslice contains the delta and rate for a fixed time window:

Field	Type	Description
`start_ns`	int	Timeslice start timestamp in nanoseconds
`end_ns`	int	Timeslice end timestamp in nanoseconds
`is_complete`	bool	Only present when `false` (partial timeslice, typically the final slice). Omitted for complete timeslices.
`total`	float	Total increase in counter value during this timeslice
`rate`	float	Rate of counter value increase per second during this timeslice

{
  "start_ns": 1765411635103733481,
  "end_ns": 1765411637103733481,
  "total": 104707.0,
  "rate": 52353.5
}

Histogram Metrics

Histograms track distributions of values (e.g., request latencies, token counts). Prometheus histograms maintain cumulative bucket counts and a running sum.

Histogram Series Fields

Field	Type	Description
`endpoint_url`	string	Full endpoint URL
`labels`	object or null	Prometheus labels for this series
`stats`	object	Nested statistics object (always present for histograms)
`buckets`	object or null	Map of bucket upper bounds to delta counts. All zeros when the histogram has no observations; may be `null` if a counter reset is detected.
`timeslices`	array	Optional: Statistics broken down by time window

Histogram Stats Fields

Field	Type	Description
`count`	int	Total count change over collection period (number of observations)
`sum`	float	Total sum change over collection period
`avg`	float	Overall average value: `sum / count`
`count_rate`	float	Average count change per second (observations per second)
`sum_rate`	float	Average sum change per second
`p1_estimate`	float	Estimated 1st percentile
`p5_estimate`	float	Estimated 5th percentile
`p10_estimate`	float	Estimated 10th percentile
`p25_estimate`	float	Estimated 25th percentile
`p50_estimate`	float	Estimated 50th percentile (median)
`p75_estimate`	float	Estimated 75th percentile
`p90_estimate`	float	Estimated 90th percentile
`p95_estimate`	float	Estimated 95th percentile
`p99_estimate`	float	Estimated 99th percentile

Note: Percentiles are estimates interpolated from histogram buckets.

Histogram with Observations

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "generate",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "count": 50,
    "sum": 2.2072624189999814,
    "avg": 0.04414524837999963,
    "count_rate": 2.233071906073402,
    "sum_rate": 0.09857951394400953,
    "p1_estimate": 0.025,
    "p5_estimate": 0.028,
    "p10_estimate": 0.030,
    "p25_estimate": 0.033,
    "p50_estimate": 0.038245593313299506,
    "p75_estimate": 0.052658494249919106,
    "p90_estimate": 0.07715849424991911,
    "p95_estimate": 0.08532516091658578,
    "p99_estimate": 0.0918584942499191
  },
  "buckets": {
    "0.005": 0,
    "0.01": 0,
    "0.025": 0,
    "0.05": 35,
    "0.1": 50,
    "0.25": 50,
    "0.5": 50,
    "1": 50,
    "2.5": 50,
    "5": 50,
    "10": 50,
    "+Inf": 50
  },
  "timeslices": [
    {
      "start_ns": 1765411635103733481,
      "end_ns": 1765411637103733481,
      "count": 15,
      "sum": 0.5630153879999966,
      "avg": 0.03753435919999978,
      "buckets": {
        "0.005": 0,
        "0.025": 0,
        "0.05": 10,
        "0.1": 15,
        "0.25": 15,
        "0.5": 15,
        "1": 15,
        "2.5": 15,
        "5": 15,
        "10": 15,
        "+Inf": 15
      }
    },
    {
      "start_ns": 1765411637103733481,
      "end_ns": 1765411639103733481,
      "count": 12,
      "sum": 0.631630536000003,
      "avg": 0.05263587800000025,
      "buckets": {
        "0.005": 0,
        "0.025": 0,
        "0.05": 8,
        "0.1": 12,
        "0.25": 12,
        "0.5": 12,
        "1": 12,
        "2.5": 12,
        "5": 12,
        "10": 12,
        "+Inf": 12
      }
    }
  ]
}

Histogram with No Observations

When a histogram has no observations, stats contains only count: 0, and buckets contains all zeros:

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "clear_kv_blocks",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "count": 0
  },
  "buckets": {
    "0.005": 0,
    "0.01": 0,
    "0.025": 0,
    "0.05": 0,
    "0.1": 0,
    "0.25": 0,
    "0.5": 0,
    "1": 0,
    "2.5": 0,
    "5": 0,
    "10": 0,
    "+Inf": 0
  }
}

Bucket Data

Bucket keys are upper bounds (as strings). Values are deltas of the cumulative bucket counts: the number of new observations at or below that upper bound during the collection period. The +Inf bucket therefore contains the total delta count.

"buckets": {
  "0.005": 0,
  "0.05": 35,
  "0.1": 50,
  "+Inf": 50
}
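
Because the +Inf bucket holds the total, the values in this example behave like Prometheus cumulative (`le`) counts. To see how many observations fell inside each individual bucket, subtract the previous bucket's value. A minimal sketch follows; the helper name `per_bucket_counts` is illustrative.

```python
def per_bucket_counts(buckets):
    """Convert cumulative bucket deltas into counts that fall inside each bucket."""
    # Sort by numeric upper bound; float("+Inf") parses to infinity and sorts last.
    ordered = sorted(buckets.items(), key=lambda item: float(item[0]))
    result = {}
    previous = 0
    for upper_bound, cumulative in ordered:
        result[upper_bound] = cumulative - previous
        previous = cumulative
    return result

buckets = {"0.005": 0, "0.05": 35, "0.1": 50, "+Inf": 50}
counts = per_bucket_counts(buckets)
print(counts)
```

For the example above, 35 observations landed at or below 0.05 and the remaining 15 landed between 0.05 and 0.1.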

Histogram Timeslices

Each histogram timeslice contains count, sum, average, and bucket deltas for a fixed time window:

Field	Type	Description
`start_ns`	int	Timeslice start timestamp in nanoseconds
`end_ns`	int	Timeslice end timestamp in nanoseconds
`is_complete`	bool	Only present when `false` (partial timeslice, typically the final slice). Omitted for complete timeslices.
`count`	int	Change in count during this timeslice
`sum`	float	Change in sum during this timeslice
`avg`	float	Average value during this timeslice: `sum / count`
`buckets`	object or null	Map of bucket upper bounds to delta counts during this timeslice

{
  "start_ns": 1765411635103733481,
  "end_ns": 1765411637103733481,
  "count": 15,
  "sum": 0.5630153879999966,
  "avg": 0.03753435919999978,
  "buckets": {
    "0.005": 0,
    "0.05": 10,
    "0.1": 15,
    "+Inf": 15
  }
}

Histogram Field Semantics by Use Case

The meaning of histogram fields depends on what the histogram measures:

Request-Level Histograms (e.g., `vllm:e2e_request_latency_seconds`)

Field	Semantic Meaning	Example
`stats.count`	Number of requests	50 requests
`stats.count_rate`	Request throughput	2.23 requests/second
`stats.avg`	Mean request duration	0.044 seconds
`stats.sum`	Total time spent on requests	2.21 seconds
`stats.sum_rate`	Concurrency metric: seconds of request time per second of real time	0.099 (≈0.1 concurrent requests)
`stats.p99_estimate`	99th percentile latency	0.092 seconds

Token-Level Histograms (e.g., `input_sequence_tokens`)

Field	Semantic Meaning	Example
`stats.count`	Number of requests	50 requests
`stats.count_rate`	Request throughput	2.29 requests/second
`stats.avg`	Mean tokens per request	986 tokens
`stats.sum`	Total tokens processed	49,311 tokens
`stats.sum_rate`	Token throughput	2,264 tokens/second
`stats.p99_estimate`	99th percentile tokens	2,193 tokens

Field Presence Rules

Fields are omitted when not applicable to reduce JSON size. All series use the same nested stats format.

Condition	Fields Present
Gauge (any)	`endpoint_url`, `labels`, `stats` (with all percentiles), `timeslices` (optional)
Gauge with no variation (std=0)	Same as above, but all percentiles equal the constant value and std=0
Counter (any)	`endpoint_url`, `labels`, `stats` (with total, rate, rate_* fields), `timeslices` (optional)
Counter with no activity (total=0)	Same as above, but total=0 and all rates=0
Histogram with no observations (count=0)	`endpoint_url`, `labels`, `stats` (count=0 only), `buckets` (all zeros)
Histogram with observations (count>0)	`endpoint_url`, `labels`, `stats` (all fields), `buckets`, `timeslices` (optional)
Metric has no labels	`labels` is `null` or omitted
Unit cannot be inferred	`unit` is `null` or omitted
Timeslices not requested	`timeslices` omitted
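
Since several fields can be `null` or omitted, consumer code should read them defensively. The following is a minimal sketch under that assumption; the helper name `summarize_series` and the output keys are hypothetical.

```python
def summarize_series(metric, series):
    """Flatten one series into a dict, tolerating omitted or null fields."""
    stats = series.get("stats", {})
    return {
        "endpoint": series["endpoint_url"],
        "labels": series.get("labels") or {},            # null or omitted -> {}
        "unit": metric.get("unit"),                      # null or omitted -> None
        "n_timeslices": len(series.get("timeslices") or []),
        "avg": stats.get("avg"),                         # absent for count=0 histograms
    }

# Histogram with no observations: stats holds only count=0, no labels or timeslices
metric = {"type": "histogram", "description": "example"}
series = {
    "endpoint_url": "http://localhost:10001/metrics",
    "stats": {"count": 0},
    "buckets": {"0.005": 0, "+Inf": 0},
}
summary = summarize_series(metric, series)
print(summary)
```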

Unit Inference

Units are inferred from metric name suffixes. Longer suffixes are matched first to handle compound suffixes correctly (e.g., _tokens_total matches before _total).

The “JSON Unit Value” column shows the actual string that appears in the unit field of the exported JSON (computed via enum.name.lower().replace("_per_second", "/s")).

Metric Name Suffix	JSON Unit Value
Time
`_seconds`, `_seconds_total`	`seconds`
`_milliseconds`, `_ms`, `_ms_total`	`milliseconds`
`_nanoseconds`, `_ns`, `_ns_total`	`nanoseconds`
Size
`_bytes`, `_bytes_total`	`bytes`
`_kilobytes`	`kilobytes`
`_megabytes`	`megabytes`
`_gigabytes`	`gigabytes`
Counts
`_total`, `_count`	`count`
`_tokens`, `_tokens_total`	`tokens`
`_requests`, `_requests_total`, `_reqs`	`requests`
`request_success`	`requests` (special case: no underscore prefix)
`_errors`, `_errors_total`, `_error_count`, `_error_count_total`	`errors`
`_blocks`, `_blocks_total`, `_block_count`	`blocks`
Rates
`_gb_s`	`gb/s`
Ratios
`_ratio`	`ratio`
`_percent`, `_perc`	`percent`
Physical
`_celsius`	`celsius`
`_joules`	`joule`
`_watts`	`watt`

Note: Additional units may be inferred from metric description text (e.g., “in milliseconds”, “(GB/s)”). Description-based inference takes priority when both suffix and description are present.
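
Longest-suffix-first matching can be sketched as follows. This is an illustration of the matching rule only, using a subset of the table above; the names `SUFFIX_UNITS` and `infer_unit` are hypothetical, and description-based inference is not modeled.

```python
# Subset of the suffix table above (illustrative, not the full mapping)
SUFFIX_UNITS = {
    "_seconds": "seconds", "_seconds_total": "seconds",
    "_bytes": "bytes", "_bytes_total": "bytes",
    "_total": "count", "_count": "count",
    "_tokens": "tokens", "_tokens_total": "tokens",
    "_requests": "requests", "_requests_total": "requests",
    "request_success": "requests",  # special case: no underscore prefix
    "_ratio": "ratio", "_percent": "percent",
}

# Try longer suffixes first so "_tokens_total" wins over "_total"
_ORDERED_SUFFIXES = sorted(SUFFIX_UNITS, key=len, reverse=True)

def infer_unit(metric_name):
    """Return the unit string for a metric name, or None if no suffix matches."""
    for suffix in _ORDERED_SUFFIXES:
        if metric_name.endswith(suffix):
            return SUFFIX_UNITS[suffix]
    return None

print(infer_unit("vllm:prompt_tokens_total"))
print(infer_unit("jobs_total"))
```

Sorting once by suffix length keeps the lookup simple; a metric whose name matches no suffix yields `None`, which corresponds to a `null` or omitted `unit` in the export.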

Data Normalization

All statistics in the export are computed over the collection period, which may exclude warmup time based on configuration. Understanding how each metric type is normalized is critical for correct interpretation.

Counter Normalization

Counters are cumulative values in Prometheus—they only increase (except on server restart). The export normalizes them to deltas (changes) over the collection period:

Export Field	Calculation	Example
`total`	`final_value - reference_value`	If counter went from 1000 to 1500, `total = 500`
`rate`	`total / duration_seconds`	Overall rate: `500 / 22.0 = 22.7/second`
`rate_avg`	Mean of per-timeslice rates	Average instantaneous rate across all timeslices
`rate_min`	Minimum per-timeslice rate	Slowest period (may be 0 during idle)
`rate_max`	Maximum per-timeslice rate	Fastest burst
`rate_std`	Standard deviation of rates	Variability of rate over time

Counter reset handling: If a counter decreases (server restart), the delta is clamped to 0 to avoid negative totals.

Reference point: The reference value for delta calculation is the last sample before the collection period starts (after warmup exclusion), ensuring accurate deltas at the period boundary.
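
The delta and reset rules above can be sketched in a few lines. This is a simplified illustration, not AIPerf's implementation: applying the clamp to each consecutive pair of samples is an assumption, and the function name `counter_stats` is hypothetical.

```python
def counter_stats(reference_value, samples, duration_seconds):
    """Sum clamped deltas, starting from the reference sample."""
    total = 0.0
    previous = reference_value
    for value in samples:
        delta = value - previous
        total += max(delta, 0.0)  # counter reset: clamp a negative delta to 0
        previous = value
    rate = total / duration_seconds if duration_seconds > 0 else 0.0
    return {"total": total, "rate": rate}

# No reset: equivalent to final_value - reference_value (1500 - 1000)
normal = counter_stats(1000, [1100, 1300, 1500], 22.0)

# Reset between the first and second sample: the negative delta contributes 0
with_reset = counter_stats(1000, [1200, 50, 150], 10.0)
print(normal, with_reset)
```

Without a reset, the result matches the table: a counter that goes from 1000 to 1500 over 22.0 seconds yields `total = 500` and a rate of about 22.7 per second.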

Gauge Normalization

Gauges are point-in-time values. Statistics are computed from all samples within the collection period:

Export Field	Calculation	Notes
`avg`	Arithmetic mean of all samples	Simple average, not time-weighted
`min`, `max`	Minimum/maximum observed	Extreme values seen
`std`	Sample standard deviation (ddof=1)	Unbiased estimate using Bessel’s correction
`p1` - `p99`	Exact percentiles	Computed from raw sample data using NumPy

Constant gauge handling: If standard deviation = 0 (gauge never varied), all percentiles will equal the constant value.
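
The gauge statistics can be reproduced from raw samples. The sketch below uses Python's standard `statistics` module as a stand-in for NumPy: `statistics.stdev` applies Bessel's correction (ddof=1), and `quantiles(..., method="inclusive")` uses the same linear interpolation as NumPy's default percentile method. The function name `gauge_stats` and the single-sample handling are assumptions.

```python
import statistics

PERCENTILES = (1, 5, 10, 25, 50, 75, 90, 95, 99)

def gauge_stats(samples):
    """Compute avg/min/max/std and percentiles from raw gauge samples."""
    values = sorted(float(v) for v in samples)
    if len(values) < 2:
        # Assumption: a single sample has std 0 and all percentiles equal to it
        cuts = [values[0]] * 99
        std = 0.0
    else:
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    stats = {
        "avg": statistics.fmean(values),  # simple mean, not time-weighted
        "min": values[0],
        "max": values[-1],
        "std": std,
    }
    for p in PERCENTILES:
        stats[f"p{p}"] = cuts[p - 1]  # cuts[i] is the (i+1)-th percentile
    return stats

varying = gauge_stats([0, 10, 20, 30, 40])
constant = gauge_stats([1024.0] * 5)
print(varying["p50"], constant["std"])
```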

Histogram Normalization

Histograms are cumulative in Prometheus—both bucket counts and sum only increase. The export normalizes to deltas:

Export Field	Calculation	Notes
`count`	`final_count - reference_count`	Number of observations during period
`sum`	`final_sum - reference_sum`	Sum of observed values during period
`avg`	`sum / count`	Average value per observation
`count_rate`	`count / duration_seconds`	Observations per second
`sum_rate`	`sum / duration_seconds`	Sum increase per second
`buckets`	Per-bucket deltas	Each bucket shows count increase during period
`p*_estimate`	Estimated percentiles	See Histogram Percentile Estimation
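
The table above translates directly into subtraction of two cumulative snapshots. A minimal sketch, assuming each snapshot is a dict with `count`, `sum`, and cumulative `buckets`; the snapshot shape and the function name `histogram_stats` are hypothetical, and percentile estimation is left out.

```python
def histogram_stats(reference, final, duration_seconds):
    """Normalize two cumulative histogram snapshots into export-style deltas."""
    count = final["count"] - reference["count"]
    total = final["sum"] - reference["sum"]
    stats = {"count": count}
    if count > 0:  # with no observations, only count is reported
        stats.update({
            "sum": total,
            "avg": total / count,
            "count_rate": count / duration_seconds,
            "sum_rate": total / duration_seconds,
        })
    buckets = {
        upper_bound: final["buckets"][upper_bound] - reference["buckets"].get(upper_bound, 0)
        for upper_bound in final["buckets"]
    }
    return stats, buckets

reference = {"count": 100, "sum": 4.0, "buckets": {"0.05": 80, "+Inf": 100}}
final = {"count": 150, "sum": 6.5, "buckets": {"0.05": 115, "+Inf": 150}}
stats, buckets = histogram_stats(reference, final, duration_seconds=20.0)
print(stats, buckets)
```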

Timeslice Normalization

When --slice-duration is configured (default: 2 seconds), the collection period is divided into fixed-duration windows. Each timeslice contains:

Gauges: avg, min, max for that window
Counters: total (delta) and rate for that window
Histograms: count, sum, avg, and optional buckets for that window

Fallback behavior: If the configured slice duration is smaller than the actual metric update interval, the system falls back to per-interval mode where each sample interval becomes its own “timeslice”.
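
Fixed-window slicing for a gauge can be sketched as follows. Several details here are assumptions rather than documented behavior: windows are treated as half-open `[start, end)`, the final partial window is clipped to the collection end, and empty windows are skipped. The function name `gauge_timeslices` is hypothetical, and the per-interval fallback is not modeled.

```python
def gauge_timeslices(samples, start_ns, end_ns, slice_duration_s=2.0):
    """Group (timestamp_ns, value) samples into fixed-duration windows."""
    slice_ns = int(slice_duration_s * 1e9)
    slices = []
    window_start = start_ns
    while window_start < end_ns:
        window_end = window_start + slice_ns
        values = [v for t, v in samples if window_start <= t < window_end]
        if values:
            entry = {
                "start_ns": window_start,
                "end_ns": min(window_end, end_ns),
                "avg": sum(values) / len(values),
                "min": min(values),
                "max": max(values),
            }
            if window_end > end_ns:
                entry["is_complete"] = False  # only emitted for partial slices
            slices.append(entry)
        window_start = window_end
    return slices

samples = [
    (0, 1.0), (1_000_000_000, 3.0),
    (2_000_000_000, 5.0), (3_000_000_000, 7.0),
    (4_500_000_000, 9.0),
]
slices = gauge_timeslices(samples, start_ns=0, end_ns=5_000_000_000)
print(len(slices))
```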

Warmup Exclusion

When warmup time is configured, metrics collected during warmup are excluded from all statistics. The reference_value for delta calculations is taken from the last sample before the warmup period ends.
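
Selecting the reference sample amounts to a search over sorted timestamps. A minimal sketch, assuming samples are stored as parallel sorted lists and that a sample landing exactly on the boundary counts as the reference; the function name `reference_sample` is hypothetical.

```python
from bisect import bisect_right

def reference_sample(timestamps_ns, values, collection_start_ns):
    """Return the last sample at or before the collection start, or None."""
    index = bisect_right(timestamps_ns, collection_start_ns) - 1
    return values[index] if index >= 0 else None

timestamps_ns = [0, 10, 20, 30]
values = [1000, 1100, 1250, 1500]

# Warmup ends at t=15: the sample at t=10 is the reference
reference = reference_sample(timestamps_ns, values, 15)
total = values[-1] - reference
print(reference, total)
```

Counting from the sample just before the boundary, rather than the first sample inside the period, keeps increases that happen between the boundary and the first in-period scrape.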

Histogram Percentile Estimation

Histogram percentiles are estimates because Prometheus histograms only store cumulative bucket counts, not individual observations. AIPerf uses a polynomial histogram algorithm for significantly improved accuracy over standard linear interpolation.

Why Standard Interpolation Fails

Standard Prometheus histogram interpolation assumes observations are uniformly distributed within each bucket. This assumption fails badly when:

Observations cluster near boundaries: Real latency distributions often cluster near 0 or near bucket edges
+Inf bucket contains data: The unbounded bucket makes interpolation impossible
Bucket widths are large: Wide buckets hide the true distribution shape

Standard interpolation can produce errors of 5-10x on P99 estimates for typical LLM inference workloads.
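
For comparison, the standard approach can be sketched as below. This is the baseline uniform-within-bucket interpolation, not AIPerf's algorithm; the function name `linear_percentile` is illustrative, and the lower edge of the first bucket is assumed to be 0.

```python
def linear_percentile(buckets, q):
    """Estimate the q-th percentile (0-100) assuming a uniform spread per bucket."""
    ordered = sorted((float(upper), count) for upper, count in buckets.items())
    total = ordered[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return None
    rank = q / 100 * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in ordered:
        if cumulative >= rank and cumulative > lower_count:
            if upper_bound == float("inf"):
                return lower_bound  # no upper edge to interpolate toward
            fraction = (rank - lower_count) / (cumulative - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound

buckets = {
    "0.005": 0, "0.01": 0, "0.025": 0, "0.05": 35, "0.1": 50,
    "0.25": 50, "0.5": 50, "1": 50, "2.5": 50, "5": 50, "10": 50, "+Inf": 50,
}
p50 = linear_percentile(buckets, 50)
p99 = linear_percentile(buckets, 99)
print(p50, p99)
```

On the bucket data from the "Histogram with Observations" example, this baseline gives roughly 0.0429 for p50 and 0.0983 for p99, compared with the exported `p50_estimate` of about 0.0382 and `p99_estimate` of about 0.0919.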

Polynomial Histogram Algorithm

AIPerf implements a four-phase algorithm that provides ~5x reduction in percentile estimation error:

Phase 1 - Per-bucket mean learning: When a scrape interval has all observations in a single bucket, the exact mean for that bucket can be computed: mean = sum_delta / count_delta. These learned means are accumulated over time via accumulate_bucket_statistics().

Phase 2 - Estimate bucket sums: For each finite bucket, estimate the sum using learned means (or midpoint fallback). This gives estimated_finite_sum.

Phase 3 - +Inf bucket back-calculation: The +Inf bucket sum is calculated as total_sum - estimated_finite_sum. Observations are spread around the estimated mean inf_avg = inf_sum / inf_count within the +Inf range.

Phase 4 - Generate finite observations with sum constraint: For each bucket, observations are placed using one of several strategies based on learned statistics:

F3 two-point mass: When variance is extremely tight (< 1% of bucket width)
Blended distribution: When variance is tight (< 20%) and mean is near center (< 30% offset)
Variance-aware distribution: When variance is moderate
Shifted uniform: Fallback when only mean is learned (no variance data)
Pure uniform: Final fallback using bucket midpoint

After initial placement, positions are adjusted proportionally across all buckets to match the adjusted target sum (total_sum - inf_sum_estimate), with each bucket’s adjustment capped at ±40% of bucket width.
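
Phases 1 through 3 can be illustrated with a simplified sketch. This is not AIPerf's implementation: the function names, the per-bucket (non-cumulative) input shape, and the `bounds` mapping are assumptions made for the example, and Phase 4 (observation placement and sum adjustment) is omitted.

```python
def learn_bucket_mean(count_deltas, sum_delta):
    """Phase 1: if one scrape interval put every observation in a single bucket,
    return (bucket_key, exact_mean); otherwise return None.

    count_deltas: per-bucket (non-cumulative) counts for one interval.
    """
    hit = [key for key, n in count_deltas.items() if n > 0]
    if len(hit) != 1:
        return None
    return hit[0], sum_delta / count_deltas[hit[0]]

def estimate_inf_sum(per_bucket_counts, bounds, learned_means, total_sum):
    """Phases 2-3: estimate finite bucket sums, then back-calculate the +Inf sum.

    bounds: {bucket_key: (lower, upper)} for finite buckets only.
    """
    estimated_finite_sum = 0.0
    for key, (lower, upper) in bounds.items():
        mean = learned_means.get(key, (lower + upper) / 2)  # midpoint fallback
        estimated_finite_sum += per_bucket_counts.get(key, 0) * mean
    inf_sum = total_sum - estimated_finite_sum
    inf_count = per_bucket_counts.get("+Inf", 0)
    inf_avg = inf_sum / inf_count if inf_count > 0 else None
    return estimated_finite_sum, inf_sum, inf_avg

learned = learn_bucket_mean({"0.05": 4, "0.1": 0}, sum_delta=0.16)
finite_sum, inf_sum, inf_avg = estimate_inf_sum(
    per_bucket_counts={"0.05": 10, "0.1": 5, "+Inf": 2},
    bounds={"0.05": (0.025, 0.05), "0.1": (0.05, 0.1)},
    learned_means={"0.05": 0.04},
    total_sum=30.775,
)
print(learned, finite_sum, inf_sum, inf_avg)
```

In this made-up input, the learned mean (0.04) replaces the midpoint for the 0.05 bucket, and the sum left over after the finite buckets is attributed to the two +Inf observations.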

Percentile Field Naming

Histogram percentiles use the _estimate suffix to indicate they are approximations:

Field	Description
`p1_estimate` - `p99_estimate`	Estimated percentiles using polynomial algorithm

Gauge percentiles (computed from raw samples) do not have the _estimate suffix because they are exact.

Example Queries

Find all metrics with p99 > 1 second

# `data` is the parsed export, e.g. loaded with json.load()
for name, metric in data["metrics"].items():
    if metric.get("unit") != "seconds":
        continue
    for series in metric["series"]:
        stats = series.get("stats", {})
        # Gauge percentiles use "p99"; histograms use "p99_estimate"
        p99 = stats.get("p99", stats.get("p99_estimate"))
        if p99 is not None and p99 > 1.0:
            print(f"{name}: p99={p99:.2f}s")

Calculate total bytes transferred across all endpoints

# Sum the per-series deltas; series without stats contribute 0
total = sum(
    series.get("stats", {}).get("total", 0)
    for series in data["metrics"]["dynamo_component_request_bytes"]["series"]
)

Find highest throughput endpoint

# max() compares (rate, endpoint_url) tuples, so the highest rate wins
max_rate, max_endpoint = max(
    (series.get("stats", {}).get("rate", 0), series["endpoint_url"])
    for series in data["metrics"]["dynamo_component_requests"]["series"]
)

Access timeslice data

metric = data["metrics"]["dynamo_component_inflight_requests"]
for series in metric["series"]:
    # "timeslices" is omitted when timeslices were not requested
    for ts in series.get("timeslices") or []:
        duration_s = (ts["end_ns"] - ts["start_ns"]) / 1e9
        print(f"  {duration_s:.1f}s window: avg={ts['avg']:.2f}")

Minimal Example

{
  "schema_version": "1.0",
  "aiperf_version": "0.11.0",
  "benchmark_id": "550e8400-e29b-41d4-a716-446655440000",
  "summary": {
    "endpoints_configured": [
      "http://localhost:10000/metrics",
      "http://localhost:10001/metrics",
      "http://localhost:10002/metrics"
    ],
    "endpoints_successful": [
      "http://localhost:10000/metrics",
      "http://localhost:10001/metrics",
      "http://localhost:10002/metrics"
    ],
    "start_time": "2025-12-10T16:07:13.596361",
    "end_time": "2025-12-10T16:07:35.749758",
    "endpoint_info": {
      "http://localhost:10000/metrics": {
        "total_fetches": 144,
        "first_fetch_ns": 1765529006843416914,
        "last_fetch_ns": 1765529029508409301,
        "avg_fetch_latency_ms": 296.86,
        "unique_updates": 72,
        "first_update_ns": 1765529006843416914,
        "last_update_ns": 1765529029508409301,
        "duration_seconds": 22.66,
        "avg_update_interval_ms": 319.23,
        "median_update_interval_ms": 334.01
      },
      "http://localhost:10001/metrics": {
        "total_fetches": 140,
        "first_fetch_ns": 1765529007434057293,
        "last_fetch_ns": 1765529029554057293,
        "avg_fetch_latency_ms": 285.42,
        "unique_updates": 70,
        "first_update_ns": 1765529007434057293,
        "last_update_ns": 1765529029554057293,
        "duration_seconds": 22.12,
        "avg_update_interval_ms": 316.00,
        "median_update_interval_ms": 320.50
      },
      "http://localhost:10002/metrics": {
        "total_fetches": 142,
        "first_fetch_ns": 1765529006950000000,
        "last_fetch_ns": 1765529029400000000,
        "avg_fetch_latency_ms": 290.15,
        "unique_updates": 71,
        "first_update_ns": 1765529006950000000,
        "last_update_ns": 1765529029400000000,
        "duration_seconds": 22.45,
        "avg_update_interval_ms": 318.10,
        "median_update_interval_ms": 325.75
      }
    }
  },
  "metrics": {
    "vllm:num_requests_running": {
      "type": "gauge",
      "description": "Number of requests currently in the model execution batch",
      "unit": "requests",
      "series": [
        {
          "endpoint_url": "http://localhost:10000/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "avg": 36.68,
            "min": 0.0,
            "max": 50.0,
            "std": 16.88,
            "p1": 0.0,
            "p5": 2.0,
            "p10": 8.0,
            "p25": 25.0,
            "p50": 45.5,
            "p75": 47.0,
            "p90": 48.0,
            "p95": 49.0,
            "p99": 50.0
          }
        },
        {
          "endpoint_url": "http://localhost:10001/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "avg": 12.34,
            "min": 0.0,
            "max": 25.0,
            "std": 8.21,
            "p1": 0.0,
            "p5": 1.0,
            "p10": 3.0,
            "p25": 6.0,
            "p50": 14.0,
            "p75": 18.0,
            "p90": 22.0,
            "p95": 24.0,
            "p99": 25.0
          }
        },
        {
          "endpoint_url": "http://localhost:10002/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "avg": 8.92,
            "min": 0.0,
            "max": 18.0,
            "std": 5.67,
            "p1": 0.0,
            "p5": 1.0,
            "p10": 2.0,
            "p25": 4.0,
            "p50": 10.0,
            "p75": 13.0,
            "p90": 16.0,
            "p95": 17.0,
            "p99": 18.0
          }
        }
      ]
    },
    "vllm:request_success": {
      "type": "counter",
      "description": "Count of successfully completed requests",
      "unit": "requests",
      "series": [
        {
          "endpoint_url": "http://localhost:10000/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "total": 50.0,
            "rate": 2.23,
            "rate_avg": 2.27,
            "rate_min": 0.0,
            "rate_max": 11.5,
            "rate_std": 4.09
          }
        },
        {
          "endpoint_url": "http://localhost:10001/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "total": 50.0,
            "rate": 2.26,
            "rate_avg": 2.30,
            "rate_min": 0.0,
            "rate_max": 10.8,
            "rate_std": 3.95
          }
        },
        {
          "endpoint_url": "http://localhost:10002/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "total": 50.0,
            "rate": 2.23,
            "rate_avg": 2.25,
            "rate_min": 0.0,
            "rate_max": 11.2,
            "rate_std": 4.01
161           }
162         }
163       ]
164     },
165     "vllm:e2e_request_latency_seconds": {
166       "type": "histogram",
167       "description": "End-to-end request latency from arrival to completion",
168       "unit": "seconds",
169       "series": [
170         {
171           "endpoint_url": "http://localhost:10000/metrics",
172           "labels": {"model": "Qwen/Qwen3-0.6B"},
173           "stats": {
174             "count": 50,
175             "sum": 2.21,
176             "avg": 0.044,
177             "count_rate": 2.23,
178             "sum_rate": 0.099,
179             "p1_estimate": 0.025,
180             "p5_estimate": 0.028,
181             "p10_estimate": 0.030,
182             "p25_estimate": 0.033,
183             "p50_estimate": 0.038,
184             "p75_estimate": 0.052,
185             "p90_estimate": 0.077,
186             "p95_estimate": 0.085,
187             "p99_estimate": 0.092
188           },
189           "buckets": {"0.005": 0, "0.05": 35, "0.1": 50, "+Inf": 50}
190         },
191         {
192           "endpoint_url": "http://localhost:10001/metrics",
193           "labels": {"model": "Qwen/Qwen3-0.6B"},
194           "stats": {
195             "count": 50,
196             "sum": 1.85,
197             "avg": 0.037,
198             "count_rate": 2.26,
199             "sum_rate": 0.084,
200             "p1_estimate": 0.020,
201             "p5_estimate": 0.023,
202             "p10_estimate": 0.025,
203             "p25_estimate": 0.028,
204             "p50_estimate": 0.032,
205             "p75_estimate": 0.045,
206             "p90_estimate": 0.065,
207             "p95_estimate": 0.072,
208             "p99_estimate": 0.078
209           },
210           "buckets": {"0.005": 0, "0.05": 42, "0.1": 50, "+Inf": 50}
211         },
212         {
213           "endpoint_url": "http://localhost:10002/metrics",
214           "labels": {"model": "Qwen/Qwen3-0.6B"},
215           "stats": {
216             "count": 50,
217             "sum": 2.05,
218             "avg": 0.041,
219             "count_rate": 2.23,
220             "sum_rate": 0.091,
221             "p1_estimate": 0.022,
222             "p5_estimate": 0.025,
223             "p10_estimate": 0.027,
224             "p25_estimate": 0.030,
225             "p50_estimate": 0.035,
226             "p75_estimate": 0.048,
227             "p90_estimate": 0.072,
228             "p95_estimate": 0.080,
229             "p99_estimate": 0.086
230           },
231           "buckets": {"0.005": 0, "0.05": 38, "0.1": 50, "+Inf": 50}
232         }
233       ]
234     }
235   },
236   "input_config": {
237     "model": "Qwen/Qwen3-0.6B",
238     "url": "http://localhost:10000",
239     "loadgen": {
240       "concurrency": 50,
241       "request_count": 50
242     },
243     "cli_command": "aiperf profile --model 'Qwen/Qwen3-0.6B' --url 'http://localhost:10000' --server-metrics 'http://localhost:10001/metrics' 'http://localhost:10002/metrics' --request-count 50 --concurrency 50",
244     "benchmark_id": "550e8400-e29b-41d4-a716-446655440000",
245     "server_metrics": [
246       "http://localhost:10001/metrics",
247       "http://localhost:10002/metrics"
248     ]
249   }
250 }
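
To show how the metrics and series layout in the example above can be consumed, here is a minimal Python sketch. It is an illustration under stated assumptions, not part of AIPerf: the helper names (iter_series, counter_totals_by_endpoint, per_bucket_counts) are hypothetical, the inline sample is a trimmed copy of the example with most fields omitted, and the bucket values are assumed to be cumulative counts keyed by upper bound (as in Prometheus histograms and as the sample values suggest).

```python
import json

# Trimmed copy of the example export above; only the fields used below are kept.
EXPORT_JSON = """
{
  "metrics": {
    "vllm:request_success": {
      "type": "counter",
      "series": [
        {"endpoint_url": "http://localhost:10000/metrics",
         "labels": {"model": "Qwen/Qwen3-0.6B"},
         "stats": {"total": 50.0, "rate": 2.23}},
        {"endpoint_url": "http://localhost:10001/metrics",
         "labels": {"model": "Qwen/Qwen3-0.6B"},
         "stats": {"total": 50.0, "rate": 2.26}}
      ]
    },
    "vllm:e2e_request_latency_seconds": {
      "type": "histogram",
      "series": [
        {"endpoint_url": "http://localhost:10000/metrics",
         "labels": {"model": "Qwen/Qwen3-0.6B"},
         "stats": {"count": 50, "sum": 2.21, "avg": 0.044},
         "buckets": {"0.005": 0, "0.05": 35, "0.1": 50, "+Inf": 50}}
      ]
    }
  }
}
"""


def iter_series(export, metric_type=None):
    """Yield (metric_name, series) pairs, optionally filtered by metric type."""
    for name, metric in export["metrics"].items():
        if metric_type is None or metric["type"] == metric_type:
            for series in metric["series"]:
                yield name, series


def counter_totals_by_endpoint(export):
    """Map each counter metric to {endpoint_url: total}.

    Series from the same endpoint with different label sets are summed.
    """
    totals = {}
    for name, series in iter_series(export, "counter"):
        per_endpoint = totals.setdefault(name, {})
        url = series["endpoint_url"]
        per_endpoint[url] = per_endpoint.get(url, 0.0) + series["stats"]["total"]
    return totals


def per_bucket_counts(buckets):
    """Convert cumulative bucket counts (assumed) into per-bucket counts."""
    counts = {}
    previous = 0
    # float() parses "+Inf" as infinity, so it sorts after every finite bound.
    for bound in sorted(buckets, key=float):
        counts[bound] = buckets[bound] - previous
        previous = buckets[bound]
    return counts


export = json.loads(EXPORT_JSON)
print(counter_totals_by_endpoint(export))
for name, series in iter_series(export, "histogram"):
    print(name, series["endpoint_url"], per_bucket_counts(series["buckets"]))
```

For a real run, the inline string would be replaced by the contents of the generated JSON export file. Keying the totals by endpoint_url keeps the prefill and decode workers separate, matching the per-endpoint series described in the overview.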