AIPerf Server Metrics JSON Export Schema


This document describes the structure and semantics of every field in the AIPerf server metrics JSON export format.

Overview

The server metrics JSON export provides aggregated statistics from Prometheus metrics collected during a benchmark run.

Data Organization

Metrics are grouped by name across all endpoints. When scraping multiple servers (e.g., prefill worker at :10000 and decode worker at :10001), metrics with the same name appear under a single key.

Each unique endpoint + label combination keeps its own separate series. Within each metric, the series array contains one entry for every distinct combination of endpoint URL and Prometheus labels, with independent statistics.

For example, if vllm:num_requests_running is scraped from 3 endpoints with 2 label sets each, you get 6 per-endpoint series.

Example Command

$ aiperf profile \
> -m Qwen/Qwen3-0.6B \
> --url localhost:10000 \
> --server-metrics localhost:10001 localhost:10002 \
> --request-count 50 \
> --concurrency 50

Note: The --url endpoint (localhost:10000) is automatically scraped for server metrics.

Format selection: By default, AIPerf generates JSON and CSV exports. This document describes the JSON format. To control which formats are generated, use --server-metrics-formats:

  • Default: --server-metrics-formats json csv (JSONL and Parquet excluded to avoid large files)
  • Include JSONL: --server-metrics-formats json csv jsonl
  • Include Parquet: --server-metrics-formats json csv parquet
  • JSON only: --server-metrics-formats json

The Parquet format exports raw time-series data with delta calculations in columnar format, optimized for SQL analytics with DuckDB, pandas, or Polars. See Parquet Schema Reference for the complete schema.

Data Access

Metrics are organized for O(1) lookup by name with nested stats within each series:

data["metrics"]["metric_name"]["series"][0]["stats"]["p99"]
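A minimal sketch of this lookup pattern, using an inline stand-in for a loaded export (normally you would `json.load()` the export file; the values shown are illustrative):

```python
# Stand-in for a loaded export; normally: data = json.load(open(path))
data = {
    "metrics": {
        "vllm:num_requests_running": {
            "type": "gauge",
            "series": [
                {
                    "endpoint_url": "http://localhost:10000/metrics",
                    "stats": {"p99": 50.0},
                }
            ],
        }
    }
}

# O(1) lookup by metric name, then index into the per-endpoint series
p99 = data["metrics"]["vllm:num_requests_running"]["series"][0]["stats"]["p99"]
print(p99)  # 50.0
```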

Top-Level Structure

{
  "schema_version": "1.0",
  "aiperf_version": "0.8.0",
  "benchmark_id": "550e8400-e29b-41d4-a716-446655440000",
  "summary": { ... },
  "metrics": { ... },
  "input_config": { ... }
}

| Field | Type | Description |
| --- | --- | --- |
| schema_version | string | Schema version for this export format (e.g., "1.0") |
| aiperf_version | string or null | AIPerf version that generated this export (e.g., "0.8.0"). null if version unavailable. |
| benchmark_id | string or null | Unique UUID identifying this benchmark run. null if not available. |
| summary | object | Collection metadata and endpoint information |
| metrics | object | Metrics keyed by name, each containing type info and series data |
| input_config | object | Serialized user configuration used for this benchmark run |

Summary Section

"summary": {
  "endpoints_configured": [
    "http://localhost:10000/metrics",
    "http://localhost:10001/metrics"
  ],
  "endpoints_successful": [
    "http://localhost:10000/metrics",
    "http://localhost:10001/metrics"
  ],
  "start_time": "2025-12-10T16:07:13.596361",
  "end_time": "2025-12-10T16:07:35.749758",
  "endpoint_info": { ... }
}

| Field | Type | Description |
| --- | --- | --- |
| endpoints_configured | array[string] | Full endpoint URLs that were configured for scraping |
| endpoints_successful | array[string] | Full endpoint URLs that returned data |
| start_time | datetime | When metrics collection started (ISO 8601) |
| end_time | datetime | When metrics collection ended (ISO 8601) |
| endpoint_info | object | Per-endpoint collection metadata |

Endpoint Info

"endpoint_info": {
  "http://localhost:10000/metrics": {
    "total_fetches": 144,
    "first_fetch_ns": 1765529006843416914,
    "last_fetch_ns": 1765529029508409301,
    "avg_fetch_latency_ms": 296.8633202916667,
    "unique_updates": 72,
    "first_update_ns": 1765529006843416914,
    "last_update_ns": 1765529029508409301,
    "duration_seconds": 22.664992387,
    "avg_update_interval_ms": 319.225244887324,
    "median_update_interval_ms": 334.0127105
  }
}

| Field | Type | Description |
| --- | --- | --- |
| total_fetches | int | Total number of HTTP fetches from this endpoint |
| first_fetch_ns | int | Timestamp of first fetch in nanoseconds |
| last_fetch_ns | int | Timestamp of last fetch in nanoseconds |
| avg_fetch_latency_ms | float | Average time to fetch metrics from this endpoint in milliseconds |
| unique_updates | int | Number of fetches that returned changed metrics |
| first_update_ns | int | Timestamp of first unique update in nanoseconds |
| last_update_ns | int | Timestamp of last unique update in nanoseconds |
| duration_seconds | float | Time span from first to last unique update in seconds |
| avg_update_interval_ms | float | Average time between unique metric updates in milliseconds |
| median_update_interval_ms | float or null | Median time between unique metric updates in milliseconds. More robust to outliers than average. null if fewer than 2 intervals. |
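The timing fields are internally consistent: duration_seconds is the span between the first and last unique update. Checking this against the values from the example above:

```python
info = {
    "first_update_ns": 1765529006843416914,
    "last_update_ns": 1765529029508409301,
    "duration_seconds": 22.664992387,
}

# duration_seconds = (last_update_ns - first_update_ns) / 1e9
derived = (info["last_update_ns"] - info["first_update_ns"]) / 1e9
print(derived)  # 22.664992387
```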

Metrics Section

Each metric entry has this structure:

"metrics": {
  "metric_name": {
    "type": "gauge|counter|histogram",
    "description": "Metric description from HELP text",
    "unit": "seconds|tokens|requests|...",
    "series": [ ... ]
  }
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Prometheus metric type: gauge, counter, or histogram |
| description | string | Human-readable description from Prometheus HELP text |
| unit | string or null | Unit inferred from metric name suffix. See Unit Inference for complete mapping of suffixes to unit values. |
| series | array | Statistics for each unique endpoint + label combination |

Series Fields (Common)

Every series entry contains these common fields:

{
  "endpoint_url": "http://localhost:10000/metrics",
  "labels": {"model": "Qwen/Qwen3-0.6B", "dynamo_component": "prefill"}
}

| Field | Type | Description |
| --- | --- | --- |
| endpoint_url | string | Full endpoint URL (e.g., http://localhost:10000/metrics) |
| labels | object or null | Prometheus labels for this time series. null or missing if metric has no labels. |

Gauge Metrics

Gauges represent point-in-time values that can go up or down (e.g., current queue depth, memory usage).

Gauge Series Fields

| Field | Type | Description |
| --- | --- | --- |
| endpoint_url | string | Full endpoint URL |
| labels | object/null | Prometheus labels for this series |
| stats | object | Nested statistics object (always present) |
| timeslices | array | Optional: Statistics broken down by time window |

Gauge Stats Fields

| Field | Type | Description |
| --- | --- | --- |
| avg | float | Mean of all observed values during collection |
| min | float | Minimum observed value |
| max | float | Maximum observed value |
| std | float | Standard deviation of observed values |
| p1 | float | 1st percentile |
| p5 | float | 5th percentile |
| p10 | float | 10th percentile |
| p25 | float | 25th percentile |
| p50 | float | 50th percentile (median) |
| p75 | float | 75th percentile |
| p90 | float | 90th percentile |
| p95 | float | 95th percentile |
| p99 | float | 99th percentile |

Gauge with Variation

{
  "endpoint_url": "http://localhost:10002/metrics",
  "labels": {
    "dynamo_component": "backend",
    "dynamo_endpoint": "generate",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "avg": 36.68055555555556,
    "min": 0.0,
    "max": 50.0,
    "std": 16.87887786545273,
    "p1": 0.0,
    "p5": 2.0,
    "p10": 8.0,
    "p25": 25.0,
    "p50": 45.5,
    "p75": 47.0,
    "p90": 48.0,
    "p95": 49.0,
    "p99": 50.0
  },
  "timeslices": [
    {
      "start_ns": 1765411635639590410,
      "end_ns": 1765411637639590410,
      "avg": 5.0,
      "min": 0.0,
      "max": 15.0
    },
    {
      "start_ns": 1765411637639590410,
      "end_ns": 1765411639639590410,
      "avg": 31.67,
      "min": 24.0,
      "max": 35.0
    }
  ]
}

Example interpretation (dynamo_component_inflight_requests):

  • “On average, 36.7 requests were in-flight”
  • “In-flight requests ranged from 0 to 50”
  • “99% of the time, in-flight requests were at or below 50”

Gauge with No Variation (constant)

When a gauge never changes during collection (standard deviation = 0), stats are still provided for API consistency. All percentiles equal the constant value:

{
  "endpoint_url": "http://localhost:11001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_namespace": "acasagrande_sglang_acasagrande_sglang_disagg"
  },
  "stats": {
    "avg": 1024.0,
    "min": 1024.0,
    "max": 1024.0,
    "std": 0.0,
    "p1": 1024.0,
    "p5": 1024.0,
    "p10": 1024.0,
    "p25": 1024.0,
    "p50": 1024.0,
    "p75": 1024.0,
    "p90": 1024.0,
    "p95": 1024.0,
    "p99": 1024.0
  }
}

Gauge Timeslices

Each gauge timeslice contains statistics for a fixed time window:

| Field | Type | Description |
| --- | --- | --- |
| start_ns | int | Timeslice start timestamp in nanoseconds |
| end_ns | int | Timeslice end timestamp in nanoseconds |
| is_complete | bool | Only present when false (partial timeslice, typically the final slice). Omitted for complete timeslices. |
| avg | float | Average value during this timeslice |
| min | float | Minimum value during this timeslice |
| max | float | Maximum value during this timeslice |

{
  "start_ns": 1765411635639590410,
  "end_ns": 1765411637639590410,
  "avg": 5.0,
  "min": 0.0,
  "max": 15.0
}

Counter Metrics

Counters are monotonically increasing values (e.g., total requests processed, total bytes transferred).

Counter Series Fields

| Field | Type | Description |
| --- | --- | --- |
| endpoint_url | string | Full endpoint URL |
| labels | object/null | Prometheus labels for this series |
| stats | object | Nested statistics object (always present) |
| timeslices | array | Optional: Statistics broken down by time window |

Counter Stats Fields

| Field | Type | Description |
| --- | --- | --- |
| total | float | Total increase in counter value during collection period |
| rate | float | Overall rate: total / duration_seconds |
| rate_avg | float | Time-weighted average rate between change points |
| rate_min | float | Minimum instantaneous rate observed between consecutive scrapes |
| rate_max | float | Maximum instantaneous rate observed between consecutive scrapes |
| rate_std | float | Standard deviation of point-to-point rates |

Counter with Activity

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "generate",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "total": 318092.0,
    "rate": 14206.446174934012,
    "rate_avg": 14458.727272727272,
    "rate_min": 0.0,
    "rate_max": 69626.0,
    "rate_std": 25812.771107887304
  },
  "timeslices": [
    {
      "start_ns": 1765411635103733481,
      "end_ns": 1765411637103733481,
      "total": 104707.0,
      "rate": 52353.5
    },
    {
      "start_ns": 1765411637103733481,
      "end_ns": 1765411639103733481,
      "total": 74133.0,
      "rate": 37066.5
    }
  ]
}

Example interpretation (dynamo_component_request_bytes):

  • stats.total: 318092 → “318,092 bytes were received during the benchmark”
  • stats.rate: 14206.4 → “Overall throughput was 14,206 bytes/second”
  • stats.rate_avg: 14458.7 → “Average instantaneous rate was 14,459 bytes/second”
  • stats.rate_min: 0.0 → “Slowest period saw 0 bytes/second (idle)”
  • stats.rate_max: 69626.0 → “Fastest burst reached 69,626 bytes/second”

Counter with No Activity

When a counter doesn’t change during the collection period (total = 0), stats are still provided for API consistency:

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "clear_kv_blocks",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "total": 0.0,
    "rate": 0.0
  }
}

Counter Timeslices

Each counter timeslice contains the delta and rate for a fixed time window:

| Field | Type | Description |
| --- | --- | --- |
| start_ns | int | Timeslice start timestamp in nanoseconds |
| end_ns | int | Timeslice end timestamp in nanoseconds |
| is_complete | bool | Only present when false (partial timeslice, typically the final slice). Omitted for complete timeslices. |
| total | float | Total increase in counter value during this timeslice |
| rate | float | Rate of counter value increase per second during this timeslice |

{
  "start_ns": 1765411635103733481,
  "end_ns": 1765411637103733481,
  "total": 104707.0,
  "rate": 52353.5
}
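A timeslice's rate is its total divided by the window length. Checking the timeslice from the example above (a 2-second window):

```python
ts = {
    "start_ns": 1765411635103733481,
    "end_ns": 1765411637103733481,
    "total": 104707.0,
    "rate": 52353.5,
}

window_s = (ts["end_ns"] - ts["start_ns"]) / 1e9  # 2.0 seconds
rate = ts["total"] / window_s
print(rate)  # 52353.5
```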

Histogram Metrics

Histograms track distributions of values (e.g., request latencies, token counts). Prometheus histograms maintain cumulative bucket counts and a running sum.

Histogram Series Fields

| Field | Type | Description |
| --- | --- | --- |
| endpoint_url | string | Full endpoint URL |
| labels | object/null | Prometheus labels for this series |
| stats | object | Nested statistics object (always present for histograms) |
| buckets | object/null | Map of bucket upper bounds to delta counts. Present when count > 0, may be null if counter reset detected. |
| timeslices | array | Optional: Statistics broken down by time window |

Histogram Stats Fields

| Field | Type | Description |
| --- | --- | --- |
| count | int | Total count change over collection period (number of observations) |
| sum | float | Total sum change over collection period |
| avg | float | Overall average value: sum / count |
| count_rate | float | Average count change per second (observations per second) |
| sum_rate | float | Average sum change per second |
| p1_estimate | float | Estimated 1st percentile |
| p5_estimate | float | Estimated 5th percentile |
| p10_estimate | float | Estimated 10th percentile |
| p25_estimate | float | Estimated 25th percentile |
| p50_estimate | float | Estimated 50th percentile (median) |
| p75_estimate | float | Estimated 75th percentile |
| p90_estimate | float | Estimated 90th percentile |
| p95_estimate | float | Estimated 95th percentile |
| p99_estimate | float | Estimated 99th percentile |

Note: Percentiles are estimates interpolated from histogram buckets.

Histogram with Observations

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "generate",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "count": 50,
    "sum": 2.2072624189999814,
    "avg": 0.04414524837999963,
    "count_rate": 2.233071906073402,
    "sum_rate": 0.09857951394400953,
    "p1_estimate": 0.025,
    "p5_estimate": 0.028,
    "p10_estimate": 0.030,
    "p25_estimate": 0.033,
    "p50_estimate": 0.038245593313299506,
    "p75_estimate": 0.052658494249919106,
    "p90_estimate": 0.07715849424991911,
    "p95_estimate": 0.08532516091658578,
    "p99_estimate": 0.0918584942499191
  },
  "buckets": {
    "0.005": 0,
    "0.01": 0,
    "0.025": 0,
    "0.05": 35,
    "0.1": 50,
    "0.25": 50,
    "0.5": 50,
    "1": 50,
    "2.5": 50,
    "5": 50,
    "10": 50,
    "+Inf": 50
  },
  "timeslices": [
    {
      "start_ns": 1765411635103733481,
      "end_ns": 1765411637103733481,
      "count": 15,
      "sum": 0.5630153879999966,
      "avg": 0.03753435919999978,
      "buckets": {
        "0.005": 0,
        "0.025": 0,
        "0.05": 10,
        "0.1": 15,
        "0.25": 15,
        "0.5": 15,
        "1": 15,
        "2.5": 15,
        "5": 15,
        "10": 15,
        "+Inf": 15
      }
    },
    {
      "start_ns": 1765411637103733481,
      "end_ns": 1765411639103733481,
      "count": 12,
      "sum": 0.631630536000003,
      "avg": 0.05263587800000025,
      "buckets": {
        "0.005": 0,
        "0.025": 0,
        "0.05": 8,
        "0.1": 12,
        "0.25": 12,
        "0.5": 12,
        "1": 12,
        "2.5": 12,
        "5": 12,
        "10": 12,
        "+Inf": 12
      }
    }
  ]
}
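The derived histogram fields can be cross-checked against one another: avg is sum / count. Using the stats from the example above:

```python
stats = {"count": 50, "sum": 2.2072624189999814, "avg": 0.04414524837999963}

# avg = sum / count
derived_avg = stats["sum"] / stats["count"]
print(derived_avg)  # 0.04414524837999963
```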

Histogram with No Observations

When a histogram has no observations, stats contains only count: 0, and buckets contains all zeros:

{
  "endpoint_url": "http://localhost:10001/metrics",
  "labels": {
    "dynamo_component": "prefill",
    "dynamo_endpoint": "clear_kv_blocks",
    "model": "Qwen/Qwen3-0.6B"
  },
  "stats": {
    "count": 0
  },
  "buckets": {
    "0.005": 0,
    "0.01": 0,
    "0.025": 0,
    "0.05": 0,
    "0.1": 0,
    "0.25": 0,
    "0.5": 0,
    "1": 0,
    "2.5": 0,
    "5": 0,
    "10": 0,
    "+Inf": 0
  }
}

Bucket Data

Bucket keys are upper bounds (as strings); values are delta counts, cumulative by bound (the number of new observations at or below that bound during the collection period). The +Inf bucket therefore contains the total delta count.

"buckets": {
  "0.005": 0,
  "0.05": 35,
  "0.1": 50,
  "+Inf": 50
}
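Because bucket values are cumulative by upper bound, the count that landed in each individual bucket is the difference between adjacent bounds. A sketch that de-cumulates the example above (assuming, as in the export examples, that bounds appear in ascending order):

```python
buckets = {"0.005": 0, "0.05": 35, "0.1": 50, "+Inf": 50}

# Difference adjacent cumulative counts to get per-bucket counts
per_bucket = {}
prev = 0
for bound, cumulative in buckets.items():  # bounds in ascending order
    per_bucket[bound] = cumulative - prev
    prev = cumulative

print(per_bucket)  # {'0.005': 0, '0.05': 35, '0.1': 15, '+Inf': 0}
```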

Histogram Timeslices

Each histogram timeslice contains count, sum, average, and bucket deltas for a fixed time window:

| Field | Type | Description |
| --- | --- | --- |
| start_ns | int | Timeslice start timestamp in nanoseconds |
| end_ns | int | Timeslice end timestamp in nanoseconds |
| is_complete | bool | Only present when false (partial timeslice, typically the final slice). Omitted for complete timeslices. |
| count | int | Change in count during this timeslice |
| sum | float | Change in sum during this timeslice |
| avg | float | Average value during this timeslice: sum / count |
| buckets | object/null | Map of bucket upper bounds to delta counts during this timeslice |

{
  "start_ns": 1765411635103733481,
  "end_ns": 1765411637103733481,
  "count": 15,
  "sum": 0.5630153879999966,
  "avg": 0.03753435919999978,
  "buckets": {
    "0.005": 0,
    "0.05": 10,
    "0.1": 15,
    "+Inf": 15
  }
}

Histogram Field Semantics by Use Case

The meaning of histogram fields depends on what the histogram measures:

Request-Level Histograms (e.g., vllm:e2e_request_latency_seconds)

| Field | Semantic Meaning | Example |
| --- | --- | --- |
| stats.count | Number of requests | 50 requests |
| stats.count_rate | Request throughput | 2.23 requests/second |
| stats.avg | Mean request duration | 0.044 seconds |
| stats.sum | Total time spent on requests | 2.21 seconds |
| stats.sum_rate | Concurrency metric: seconds of request time per second of real time | 0.099 (≈0.1 concurrent requests) |
| stats.p99_estimate | 99th percentile latency | 0.092 seconds |

Token-Level Histograms (e.g., input_sequence_tokens)

| Field | Semantic Meaning | Example |
| --- | --- | --- |
| stats.count | Number of requests | 50 requests |
| stats.count_rate | Request throughput | 2.29 requests/second |
| stats.avg | Mean tokens per request | 986 tokens |
| stats.sum | Total tokens processed | 49,311 tokens |
| stats.sum_rate | Token throughput | 2,264 tokens/second |
| stats.p99_estimate | 99th percentile tokens | 2,193 tokens |

Field Presence Rules

Fields are omitted when not applicable to reduce JSON size. All series use a consistent nested stats object.

| Condition | Fields Present |
| --- | --- |
| Gauge (any) | endpoint_url, labels, stats (with all percentiles), timeslices (optional) |
| Gauge with no variation (std=0) | Same as above, but all percentiles equal the constant value and std=0 |
| Counter (any) | endpoint_url, labels, stats (with total, rate, rate_* fields), timeslices (optional) |
| Counter with no activity (total=0) | Same as above, but total=0 and all rates=0 |
| Histogram with no observations (count=0) | endpoint_url, labels, stats (count=0 only), buckets (all zeros) |
| Histogram with observations (count>0) | endpoint_url, labels, stats (all fields), buckets, timeslices (optional) |
| Metric has no labels | labels is null or omitted |
| Unit cannot be inferred | unit is null or omitted |
| Timeslices not requested | timeslices omitted |

Unit Inference

Units are inferred from metric name suffixes. Longer suffixes are matched first to handle compound suffixes correctly (e.g., _tokens_total matches before _total).

The “JSON Unit Value” column shows the actual string that appears in the unit field of the exported JSON (computed via enum.name.lower().replace("_per_second", "/s")).

| Metric Name Suffix | JSON Unit Value |
| --- | --- |
| **Time** | |
| _seconds, _seconds_total | seconds |
| _milliseconds, _ms, _ms_total | milliseconds |
| _nanoseconds, _ns, _ns_total | nanoseconds |
| **Size** | |
| _bytes, _bytes_total | bytes |
| _kilobytes | kilobytes |
| _megabytes | megabytes |
| _gigabytes | gigabytes |
| **Counts** | |
| _total, _count | count |
| _tokens, _tokens_total | tokens |
| _requests, _requests_total, _reqs | requests |
| request_success | requests (special case: no underscore prefix) |
| _errors, _errors_total, _error_count, _error_count_total | errors |
| _blocks, _blocks_total, _block_count | blocks |
| **Rates** | |
| _gb_s | gb/s |
| **Ratios** | |
| _ratio | ratio |
| _percent, _perc | percent |
| **Physical** | |
| _celsius | celsius |
| _joules | joule |
| _watts | watt |

Note: Additional units may be inferred from metric description text (e.g., “in milliseconds”, “(GB/s)”). Description-based inference takes priority when both suffix and description are present.


Data Normalization

All statistics in the export are computed over the collection period, which may exclude warmup time based on configuration. Understanding how each metric type is normalized is critical for correct interpretation.

Counter Normalization

Counters are cumulative values in Prometheus—they only increase (except on server restart). The export normalizes them to deltas (changes) over the collection period:

| Export Field | Calculation | Example |
| --- | --- | --- |
| total | final_value - reference_value | If counter went from 1000 to 1500, total = 500 |
| rate | total / duration_seconds | Overall rate: 500 / 22.0 = 22.7/second |
| rate_avg | Mean of per-timeslice rates | Average instantaneous rate across all timeslices |
| rate_min | Minimum per-timeslice rate | Slowest period (may be 0 during idle) |
| rate_max | Maximum per-timeslice rate | Fastest burst |
| rate_std | Standard deviation of rates | Variability of rate over time |

Counter reset handling: If a counter decreases (server restart), the delta is clamped to 0 to avoid negative totals.

Reference point: The reference value for delta calculation is the last sample before the collection period starts (after warmup exclusion), ensuring accurate deltas at the period boundary.
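The delta logic above can be sketched in a few lines (the function name is illustrative, not AIPerf's API):

```python
def counter_delta(reference_value: float, final_value: float) -> float:
    """Delta over the collection period, clamped to 0 on counter reset."""
    return max(final_value - reference_value, 0.0)

print(counter_delta(1000.0, 1500.0))  # 500.0 (normal increase)
print(counter_delta(1500.0, 200.0))   # 0.0 (counter decreased: server restart, clamp)
```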

Gauge Normalization

Gauges are point-in-time values. Statistics are computed from all samples within the collection period:

| Export Field | Calculation | Notes |
| --- | --- | --- |
| avg | Arithmetic mean of all samples | Simple average, not time-weighted |
| min, max | Minimum/maximum observed | Extreme values seen |
| std | Sample standard deviation (ddof=1) | Unbiased estimate using Bessel's correction |
| p1 - p99 | Exact percentiles | Computed from raw sample data using NumPy |

Constant gauge handling: If standard deviation = 0 (gauge never varied), all percentiles will equal the constant value.
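These gauge statistics can be reproduced from raw samples with NumPy; a minimal sketch using made-up sample values:

```python
import numpy as np

# Hypothetical gauge samples collected over the benchmark
samples = np.array([0.0, 15.0, 24.0, 35.0, 45.5, 47.0, 48.0, 50.0])

stats = {
    "avg": samples.mean(),                 # simple mean, not time-weighted
    "min": samples.min(),
    "max": samples.max(),
    "std": samples.std(ddof=1),            # sample std dev (Bessel's correction)
    **{f"p{q}": np.percentile(samples, q)  # exact percentiles from raw samples
       for q in (1, 5, 10, 25, 50, 75, 90, 95, 99)},
}
print(stats["p50"])  # 40.25
```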

Histogram Normalization

Histograms are cumulative in Prometheus—both bucket counts and sum only increase. The export normalizes to deltas:

| Export Field | Calculation | Notes |
| --- | --- | --- |
| count | final_count - reference_count | Number of observations during period |
| sum | final_sum - reference_sum | Sum of observed values during period |
| avg | sum / count | Average value per observation |
| count_rate | count / duration_seconds | Observations per second |
| sum_rate | sum / duration_seconds | Sum increase per second |
| buckets | Per-bucket deltas | Each bucket shows count increase during period |
| p*_estimate | Estimated percentiles | See Histogram Percentile Estimation |

Timeslice Normalization

When --slice-duration is configured (default: 2 seconds), the collection period is divided into fixed-duration windows. Each timeslice contains:

  • Gauges: avg, min, max for that window
  • Counters: total (delta) and rate for that window
  • Histograms: count, sum, avg, and optional buckets for that window

Fallback behavior: If the configured slice duration is smaller than the actual metric update interval, the system falls back to per-interval mode where each sample interval becomes its own “timeslice”.

Warmup Exclusion

When warmup time is configured, metrics collected during warmup are excluded from all statistics. The reference_value for delta calculations is taken from the last sample before the warmup period ends.


Histogram Percentile Estimation

Histogram percentiles are estimates because Prometheus histograms only store cumulative bucket counts, not individual observations. AIPerf uses a polynomial histogram algorithm for significantly improved accuracy over standard linear interpolation.

Why Standard Interpolation Fails

Standard Prometheus histogram interpolation assumes observations are uniformly distributed within each bucket. This assumption fails badly when:

  1. Observations cluster near boundaries: Real latency distributions often cluster near 0 or near bucket edges
  2. +Inf bucket contains data: The unbounded bucket makes interpolation impossible
  3. Bucket widths are large: Wide buckets hide the true distribution shape

Standard interpolation can produce errors of 5-10x on P99 estimates for typical LLM inference workloads.
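For reference, the standard uniform-within-bucket estimator looks roughly like this (a sketch of the baseline approach, not AIPerf's polynomial algorithm):

```python
def percentile_uniform(buckets: dict, q: float) -> float:
    """Estimate a percentile assuming observations are uniform within each bucket."""
    finite = sorted((float(b), c) for b, c in buckets.items() if b != "+Inf")
    rank = q / 100 * buckets["+Inf"]  # +Inf bucket holds the total count
    lower, prev = 0.0, 0
    for bound, count in finite:
        if count >= rank:
            frac = (rank - prev) / (count - prev)  # uniform-distribution assumption
            return lower + frac * (bound - lower)
        lower, prev = bound, count
    return lower  # rank falls in +Inf: clamp to the largest finite bound

p50 = percentile_uniform({"0.005": 0, "0.05": 35, "0.1": 50, "+Inf": 50}, 50)
```

With clustered real-world latencies, the uniform assumption inside the wide 0.005-0.05 bucket is exactly where this estimator drifts from the true percentile.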

Polynomial Histogram Algorithm

AIPerf implements a four-phase algorithm that provides ~5x reduction in percentile estimation error:

Phase 1 - Per-bucket mean learning: When a scrape interval has all observations in a single bucket, the exact mean for that bucket can be computed: mean = sum_delta / count_delta. These learned means are accumulated over time via accumulate_bucket_statistics().

Phase 2 - Estimate bucket sums: For each finite bucket, estimate the sum using learned means (or midpoint fallback). This gives estimated_finite_sum.

Phase 3 - +Inf bucket back-calculation: The +Inf bucket sum is calculated as total_sum - estimated_finite_sum. Observations are spread around the estimated mean inf_avg = inf_sum / inf_count within the +Inf range.

Phase 4 - Generate finite observations with sum constraint: For each bucket, observations are placed using one of several strategies based on learned statistics:

  • F3 two-point mass: When variance is extremely tight (< 1% of bucket width)
  • Blended distribution: When variance is tight (< 20%) and mean is near center (< 30% offset)
  • Variance-aware distribution: When variance is moderate
  • Shifted uniform: Fallback when only mean is learned (no variance data)
  • Pure uniform: Final fallback using bucket midpoint

After initial placement, positions are adjusted proportionally across all buckets to match the adjusted target sum (total_sum - inf_sum_estimate), with each bucket’s adjustment capped at ±40% of bucket width.

Percentile Field Naming

Histogram percentiles use the _estimate suffix to indicate they are approximations:

| Field | Description |
| --- | --- |
| p1_estimate - p99_estimate | Estimated percentiles using polynomial algorithm |

Gauge percentiles (computed from raw samples) do not have the _estimate suffix because they are exact.


Example Queries

Find all metrics with p99 > 1 second

for name, metric in data["metrics"].items():
    for series in metric["series"]:
        stats = series.get("stats", {})
        # Gauge percentiles use "p99", histogram uses "p99_estimate"
        p99 = stats.get("p99") or stats.get("p99_estimate")
        if p99 and p99 > 1.0 and metric.get("unit") == "seconds":
            print(f"{name}: p99={p99:.2f}s")

Calculate total bytes transferred across all endpoints

total = sum(
    series.get("stats", {}).get("total", 0)
    for series in data["metrics"]["dynamo_component_request_bytes"]["series"]
)

Find highest throughput endpoint

max_throughput = max(
    (series.get("stats", {}).get("rate", 0), series["endpoint_url"])
    for series in data["metrics"]["dynamo_component_requests"]["series"]
)

Access timeslice data

metric = data["metrics"]["dynamo_component_inflight_requests"]
for series in metric["series"]:
    if series.get("timeslices"):
        for ts in series["timeslices"]:
            duration_ns = ts["end_ns"] - ts["start_ns"]
            duration_s = duration_ns / 1e9
            print(f"  {duration_s:.1f}s window: avg={ts['avg']:.2f}")

Minimal Example

{
  "schema_version": "1.0",
  "aiperf_version": "0.8.0",
  "benchmark_id": "550e8400-e29b-41d4-a716-446655440000",
  "summary": {
    "endpoints_configured": [
      "http://localhost:10000/metrics",
      "http://localhost:10001/metrics",
      "http://localhost:10002/metrics"
    ],
    "endpoints_successful": [
      "http://localhost:10000/metrics",
      "http://localhost:10001/metrics",
      "http://localhost:10002/metrics"
    ],
    "start_time": "2025-12-10T16:07:13.596361",
    "end_time": "2025-12-10T16:07:35.749758",
    "endpoint_info": {
      "http://localhost:10000/metrics": {
        "total_fetches": 144,
        "first_fetch_ns": 1765529006843416914,
        "last_fetch_ns": 1765529029508409301,
        "avg_fetch_latency_ms": 296.86,
        "unique_updates": 72,
        "first_update_ns": 1765529006843416914,
        "last_update_ns": 1765529029508409301,
        "duration_seconds": 22.66,
        "avg_update_interval_ms": 319.23,
        "median_update_interval_ms": 334.01
      },
      "http://localhost:10001/metrics": {
        "total_fetches": 140,
        "first_fetch_ns": 1765529007434057293,
        "last_fetch_ns": 1765529029554057293,
        "avg_fetch_latency_ms": 285.42,
        "unique_updates": 70,
        "first_update_ns": 1765529007434057293,
        "last_update_ns": 1765529029554057293,
        "duration_seconds": 22.12,
        "avg_update_interval_ms": 316.00,
        "median_update_interval_ms": 320.50
      },
      "http://localhost:10002/metrics": {
        "total_fetches": 142,
        "first_fetch_ns": 1765529006950000000,
        "last_fetch_ns": 1765529029400000000,
        "avg_fetch_latency_ms": 290.15,
        "unique_updates": 71,
        "first_update_ns": 1765529006950000000,
        "last_update_ns": 1765529029400000000,
        "duration_seconds": 22.45,
        "avg_update_interval_ms": 318.10,
        "median_update_interval_ms": 325.75
      }
    }
  },
  "metrics": {
    "vllm:num_requests_running": {
      "type": "gauge",
      "description": "Number of requests currently in the model execution batch",
      "unit": "requests",
      "series": [
        {
          "endpoint_url": "http://localhost:10000/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "avg": 36.68,
            "min": 0.0,
            "max": 50.0,
            "std": 16.88,
            "p1": 0.0,
            "p5": 2.0,
            "p10": 8.0,
            "p25": 25.0,
            "p50": 45.5,
            "p75": 47.0,
            "p90": 48.0,
            "p95": 49.0,
            "p99": 50.0
          }
        },
        {
          "endpoint_url": "http://localhost:10001/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "avg": 12.34,
            "min": 0.0,
            "max": 25.0,
            "std": 8.21,
            "p1": 0.0,
            "p5": 1.0,
            "p10": 3.0,
            "p25": 6.0,
            "p50": 14.0,
            "p75": 18.0,
            "p90": 22.0,
            "p95": 24.0,
            "p99": 25.0
          }
        },
        {
          "endpoint_url": "http://localhost:10002/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "avg": 8.92,
            "min": 0.0,
            "max": 18.0,
            "std": 5.67,
            "p1": 0.0,
            "p5": 1.0,
            "p10": 2.0,
            "p25": 4.0,
            "p50": 10.0,
            "p75": 13.0,
            "p90": 16.0,
            "p95": 17.0,
            "p99": 18.0
          }
        }
      ]
    },
    "vllm:request_success": {
      "type": "counter",
      "description": "Count of successfully completed requests",
      "unit": "requests",
      "series": [
        {
          "endpoint_url": "http://localhost:10000/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "total": 50.0,
            "rate": 2.23,
            "rate_avg": 2.27,
            "rate_min": 0.0,
            "rate_max": 11.5,
            "rate_std": 4.09
          }
        },
        {
          "endpoint_url": "http://localhost:10001/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "total": 50.0,
            "rate": 2.26,
            "rate_avg": 2.30,
            "rate_min": 0.0,
            "rate_max": 10.8,
            "rate_std": 3.95
          }
        },
        {
          "endpoint_url": "http://localhost:10002/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "total": 50.0,
            "rate": 2.23,
            "rate_avg": 2.25,
            "rate_min": 0.0,
            "rate_max": 11.2,
            "rate_std": 4.01
          }
        }
      ]
    },
    "vllm:e2e_request_latency_seconds": {
      "type": "histogram",
      "description": "End-to-end request latency from arrival to completion",
      "unit": "seconds",
      "series": [
        {
          "endpoint_url": "http://localhost:10000/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "count": 50,
            "sum": 2.21,
            "avg": 0.044,
            "count_rate": 2.23,
            "sum_rate": 0.099,
            "p1_estimate": 0.025,
            "p5_estimate": 0.028,
            "p10_estimate": 0.030,
            "p25_estimate": 0.033,
            "p50_estimate": 0.038,
            "p75_estimate": 0.052,
            "p90_estimate": 0.077,
            "p95_estimate": 0.085,
            "p99_estimate": 0.092
          },
          "buckets": {"0.005": 0, "0.05": 35, "0.1": 50, "+Inf": 50}
        },
        {
          "endpoint_url": "http://localhost:10001/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "count": 50,
            "sum": 1.85,
            "avg": 0.037,
            "count_rate": 2.26,
            "sum_rate": 0.084,
            "p1_estimate": 0.020,
            "p5_estimate": 0.023,
            "p10_estimate": 0.025,
            "p25_estimate": 0.028,
            "p50_estimate": 0.032,
            "p75_estimate": 0.045,
            "p90_estimate": 0.065,
            "p95_estimate": 0.072,
            "p99_estimate": 0.078
          },
          "buckets": {"0.005": 0, "0.05": 42, "0.1": 50, "+Inf": 50}
        },
        {
          "endpoint_url": "http://localhost:10002/metrics",
          "labels": {"model": "Qwen/Qwen3-0.6B"},
          "stats": {
            "count": 50,
            "sum": 2.05,
            "avg": 0.041,
            "count_rate": 2.23,
            "sum_rate": 0.091,
            "p1_estimate": 0.022,
            "p5_estimate": 0.025,
            "p10_estimate": 0.027,
            "p25_estimate": 0.030,
            "p50_estimate": 0.035,
            "p75_estimate": 0.048,
            "p90_estimate": 0.072,
            "p95_estimate": 0.080,
            "p99_estimate": 0.086
          },
          "buckets": {"0.005": 0, "0.05": 38, "0.1": 50, "+Inf": 50}
        }
      ]
    }
  },
  "input_config": {
    "model": "Qwen/Qwen3-0.6B",
    "url": "http://localhost:10000",
    "loadgen": {
      "concurrency": 50,
      "request_count": 50
    },
    "cli_command": "aiperf profile --model 'Qwen/Qwen3-0.6B' --url 'http://localhost:10000' --server-metrics 'http://localhost:10001/metrics' 'http://localhost:10002/metrics' --request-count 50 --concurrency 50",
    "benchmark_id": "550e8400-e29b-41d4-a716-446655440000",
    "server_metrics": [
      "http://localhost:10001/metrics",
      "http://localhost:10002/metrics"
    ]
  }
}