AIPerf Server Metrics JSON Export Schema
This document describes the structure and semantics of every field in the AIPerf server metrics JSON export format.
Overview
The server metrics JSON export provides aggregated statistics from Prometheus metrics collected during a benchmark run.
Data Organization
Metrics are grouped by name across all endpoints. When scraping multiple servers (e.g., prefill worker at :10000 and decode worker at :10001), metrics with the same name appear under a single key.
Each unique endpoint + label combination keeps its own separate series. Within each metric, the series array contains one entry for every distinct combination of endpoint URL and Prometheus labels, with independent statistics.
For example, if vllm:num_requests_running is scraped from 3 endpoints with 2 label sets each, you get 6 per-endpoint series.
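This grouping can be sketched as follows. The `endpoint` and `labels` key names are assumptions for illustration, not confirmed schema:

```python
# Illustrative only: 3 endpoints x 2 label sets -> 6 series under one key.
# The "endpoint" and "labels" key names are assumptions, not confirmed schema.
metric = {
    "vllm:num_requests_running": {
        "series": [
            {"endpoint": f"http://localhost:{port}/metrics", "labels": {"engine": eng}}
            for port in (10000, 10001, 10002)
            for eng in ("0", "1")
        ]
    }
}
num_series = len(metric["vllm:num_requests_running"]["series"])
assert num_series == 6
```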
Example Command
Note: The --url endpoint (localhost:10000) is automatically scraped for server metrics.
Format selection: By default, AIPerf generates JSON and CSV exports. This document describes the JSON format. To control which formats are generated, use --server-metrics-formats:
- Default: `--server-metrics-formats json csv` (JSONL and Parquet are excluded to avoid large files)
- Include JSONL: `--server-metrics-formats json csv jsonl`
- Include Parquet: `--server-metrics-formats json csv parquet`
- JSON only: `--server-metrics-formats json`
The Parquet format exports raw time-series data with delta calculations in columnar format, optimized for SQL analytics with DuckDB, pandas, or Polars. See Parquet Schema Reference for the complete schema.
Related documentation:
- Server Metrics Tutorial - Quick start guide and usage examples
- Server Metrics Reference - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo)
- Parquet Schema Reference - Raw time-series data schema
Data Access
Metrics are organized for O(1) lookup by name with nested stats within each series:
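A minimal access sketch, assuming a trimmed-down export payload (real exports contain many more fields):

```python
import json

# Hypothetical, heavily trimmed export snippet for illustration.
export = json.loads("""
{
  "metrics": {
    "vllm:num_requests_running": {
      "series": [
        {"stats": {"avg": 36.7, "min": 0.0, "max": 50.0}}
      ]
    }
  }
}
""")

# O(1) lookup by metric name, then index into its series:
series = export["metrics"]["vllm:num_requests_running"]["series"][0]
avg_running = series["stats"]["avg"]
assert avg_running == 36.7
```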
Top-Level Structure
Summary Section
Endpoint Info
Metrics Section
Each metric entry has this structure:
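A sketch of the nesting, shown as a Python dict mirroring the JSON. Only `metrics`, `series`, and `unit` follow this document; the `summary` key contents and the example unit value are assumptions:

```python
# Hypothetical shape; "metrics", "series", and "unit" follow this document,
# while the "summary" contents and the unit value are assumptions.
export = {
    "summary": {},  # run-level summary (see "Summary Section"); contents assumed
    "metrics": {
        "vllm:num_requests_running": {
            "unit": "requests",  # inferred unit (see "Unit Inference"); assumed value
            "series": [],        # one entry per endpoint + label combination
        }
    },
}
entry = export["metrics"]["vllm:num_requests_running"]
assert "unit" in entry and "series" in entry
```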
Series Fields (Common)
Every series entry contains these common fields:
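A sketch of one series entry. Only `stats` naming is taken from this document; `endpoint`, `labels`, and `timeslices` are plausible key names used here as assumptions:

```python
# Sketch of one series entry; "endpoint", "labels", and "timeslices" key
# names are assumptions -- only "stats" naming follows this document.
series_entry = {
    "endpoint": "http://localhost:10000/metrics",  # scrape target (assumed key)
    "labels": {"model_name": "llama"},             # Prometheus labels (assumed key)
    "stats": {},        # per-type statistics (gauge/counter/histogram)
    "timeslices": [],   # fixed-window stats when --slice-duration is configured
}
assert {"endpoint", "labels", "stats"} <= set(series_entry)
```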
Gauge Metrics
Gauges represent point-in-time values that can go up or down (e.g., current queue depth, memory usage).
Gauge Series Fields
Gauge Stats Fields
Gauge with Variation
Example interpretation (dynamo_component_inflight_requests):
- “On average, 36.7 requests were in-flight”
- “In-flight requests ranged from 0 to 50”
- “99% of the time, in-flight requests were at or below 50”
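These interpretations correspond to stats along the lines of the sketch below. The percentile key names (`p50`, `p99`) and the `std` value are assumptions for illustration:

```python
# Hedged example matching the interpretation above; "p50"/"p99" key names
# and the std value are assumptions, not confirmed schema.
gauge_stats = {
    "avg": 36.7,   # mean of all samples in the collection period
    "min": 0.0,
    "max": 50.0,
    "std": 18.2,   # hypothetical value
    "p50": 42.0,   # hypothetical value
    "p99": 50.0,   # 99% of samples at or below this
}
assert gauge_stats["min"] <= gauge_stats["avg"] <= gauge_stats["max"]
```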
Gauge with No Variation (constant)
When a gauge never changes during collection (standard deviation = 0), stats are still provided for API consistency. All percentiles equal the constant value:
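For instance, a constant gauge collapses to a single value everywhere (percentile key names are assumptions, as above):

```python
# A constant gauge: std == 0 and every percentile equals the constant value.
constant = 8192.0
gauge_stats = {"avg": constant, "min": constant, "max": constant,
               "std": 0.0, "p50": constant, "p99": constant}
assert len({gauge_stats[k] for k in ("avg", "min", "max", "p50", "p99")}) == 1
```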
Gauge Timeslices
Each gauge timeslice contains statistics for a fixed time window:
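A sketch of one gauge timeslice, assuming the default 2-second window; the window-boundary key names are assumptions:

```python
# Sketch of one gauge timeslice; "start"/"end" key names are assumptions.
timeslice = {
    "start": 0.0,  # window start, seconds from collection start (assumed key)
    "end": 2.0,    # default --slice-duration is 2 seconds
    "avg": 12.5,
    "min": 10.0,
    "max": 15.0,
}
assert timeslice["min"] <= timeslice["avg"] <= timeslice["max"]
```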
Counter Metrics
Counters are monotonically increasing values (e.g., total requests processed, total bytes transferred).
Counter Series Fields
Counter Stats Fields
Counter with Activity
Example interpretation (dynamo_component_request_bytes):
- `stats.total: 318092` → "318,092 bytes were received during the benchmark"
- `stats.rate: 14206.4` → "Overall throughput was 14,206 bytes/second"
- `stats.rate_avg: 14458.7` → "Average instantaneous rate was 14,459 bytes/second"
- `stats.rate_min: 0.0` → "Slowest period saw 0 bytes/second (idle)"
- `stats.rate_max: 69626.0` → "Fastest burst reached 69,626 bytes/second"
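Put together, those values look like this (assuming `rate` is the period total divided by the collection duration, consistent with "overall throughput"):

```python
# The documented example values; the rate = total / duration relationship
# is an interpretation of "overall throughput", not confirmed schema text.
counter_stats = {
    "total": 318092,      # delta over the collection period (bytes)
    "rate": 14206.4,      # overall throughput: total / collection duration
    "rate_avg": 14458.7,  # mean of per-interval instantaneous rates
    "rate_min": 0.0,      # idle interval
    "rate_max": 69626.0,  # burst
}
# The overall rate implies a collection period of total / rate seconds:
duration_s = counter_stats["total"] / counter_stats["rate"]
assert 22 < duration_s < 23  # roughly a 22-second collection period
```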
Counter with No Activity
When a counter doesn’t change during the collection period (total = 0), stats are still provided for API consistency:
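An idle counter therefore reports zeros across the board:

```python
# Counter with no activity: all stats present, all zero.
idle_stats = {"total": 0, "rate": 0.0, "rate_avg": 0.0,
              "rate_min": 0.0, "rate_max": 0.0}
assert all(v == 0 for v in idle_stats.values())
```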
Counter Timeslices
Each counter timeslice contains the delta and rate for a fixed time window:
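A sketch of one counter timeslice, assuming the default 2-second window and assumed window-boundary key names:

```python
# Sketch of one counter timeslice; "start"/"end" key names are assumptions.
timeslice = {"start": 0.0, "end": 2.0, "total": 28412, "rate": 14206.0}
# The window rate is its delta divided by the window length:
window = timeslice["end"] - timeslice["start"]
assert timeslice["total"] / window == timeslice["rate"]
```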
Histogram Metrics
Histograms track distributions of values (e.g., request latencies, token counts). Prometheus histograms maintain cumulative bucket counts and a running sum.
Histogram Series Fields
Histogram Stats Fields
Note: Percentiles are estimates interpolated from histogram buckets.
Histogram with Observations
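A sketch of histogram stats with activity. The `_estimate` suffix is documented; the specific `p50_estimate`/`p99_estimate` key names and all values are assumptions consistent with it:

```python
# Hypothetical histogram stats; the "_estimate" suffix is documented,
# exact percentile key names and values are assumptions.
hist_stats = {
    "count": 1250,          # observations during the collection period
    "sum": 312.5,           # e.g. total seconds of latency observed
    "avg": 0.25,            # sum / count
    "p50_estimate": 0.21,   # interpolated from buckets (estimate)
    "p99_estimate": 0.93,
}
assert hist_stats["avg"] == hist_stats["sum"] / hist_stats["count"]
```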
Histogram with No Observations
When a histogram has no observations, stats contains only count: 0, and buckets contains all zeros:
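For example (bucket bounds invented for illustration):

```python
# Histogram with no observations: stats is only {"count": 0},
# and every bucket delta is zero. Bucket bounds are invented.
empty = {"stats": {"count": 0},
         "buckets": {"0.1": 0, "0.5": 0, "1.0": 0, "+Inf": 0}}
assert empty["stats"] == {"count": 0}
assert all(v == 0 for v in empty["buckets"].values())
```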
Bucket Data
Bucket keys are the upper bound (as strings), values are delta counts (number of new observations in each bucket during the collection period). The +Inf bucket contains the total delta count.
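A sketch of bucket data, assuming Prometheus-style cumulative buckets (so `+Inf` carries the total delta count); the bounds and counts are invented:

```python
# Keys are bucket upper bounds (strings); values are delta counts.
# Assuming Prometheus-style cumulative buckets, "+Inf" holds the total.
buckets = {"0.1": 320, "0.5": 890, "1.0": 1180, "+Inf": 1250}
bounds = ["0.1", "0.5", "1.0", "+Inf"]
cum = [buckets[b] for b in bounds]
# Non-cumulative per-bucket counts come from successive differences:
per_bucket = [c - p for c, p in zip(cum, [0] + cum[:-1])]
assert per_bucket == [320, 570, 290, 70]
assert buckets["+Inf"] == sum(per_bucket)
```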
Histogram Timeslices
Each histogram timeslice contains count, sum, average, and bucket deltas for a fixed time window:
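A sketch of one histogram timeslice (all values invented; bucket bounds assumed cumulative as above):

```python
# Sketch of one histogram timeslice; values invented for illustration.
timeslice = {"count": 40, "sum": 10.0, "avg": 0.25,
             "buckets": {"0.1": 8, "0.5": 30, "1.0": 38, "+Inf": 40}}
assert timeslice["avg"] == timeslice["sum"] / timeslice["count"]
assert timeslice["buckets"]["+Inf"] == timeslice["count"]
```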
Histogram Field Semantics by Use Case
The meaning of histogram fields depends on what the histogram measures:
Request-Level Histograms (e.g., vllm:e2e_request_latency_seconds)
Token-Level Histograms (e.g., input_sequence_tokens)
Field Presence Rules
Fields are omitted when not applicable, to reduce JSON size. All series use a consistent stats format.
Unit Inference
Units are inferred from metric name suffixes. Longer suffixes are matched first to handle compound suffixes correctly (e.g., _tokens_total matches before _total).
The “JSON Unit Value” column shows the actual string that appears in the unit field of the exported JSON (computed via enum.name.lower().replace("_per_second", "/s")).
Note: Additional units may be inferred from metric description text (e.g., “in milliseconds”, “(GB/s)”). Description-based inference takes priority when both suffix and description are present.
Data Normalization
All statistics in the export are computed over the collection period, which may exclude warmup time based on configuration. Understanding how each metric type is normalized is critical for correct interpretation.
Counter Normalization
Counters are cumulative values in Prometheus—they only increase (except on server restart). The export normalizes them to deltas (changes) over the collection period:
Counter reset handling: If a counter decreases (server restart), the delta is clamped to 0 to avoid negative totals.
Reference point: The reference value for delta calculation is the last sample before the collection period starts (after warmup exclusion), ensuring accurate deltas at the period boundary.
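A minimal sketch of the delta computation described above, including reset clamping:

```python
def counter_delta(reference: float, latest: float) -> float:
    """Delta over the collection period, clamped at 0 on counter reset.

    `reference` is the last sample before the collection period starts;
    `latest` is the final sample within the period.
    """
    return max(0.0, latest - reference)

assert counter_delta(1000.0, 1318092.0) == 1317092.0  # normal growth
assert counter_delta(1000.0, 42.0) == 0.0             # server restart: clamped
```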
Gauge Normalization
Gauges are point-in-time values. Statistics are computed from all samples within the collection period:
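For example, a minimal sketch of gauge stats computed directly from in-period samples:

```python
# Gauge stats are computed over raw samples in the collection period.
samples = [10.0, 20.0, 30.0, 40.0]
stats = {"avg": sum(samples) / len(samples),
         "min": min(samples),
         "max": max(samples)}
assert stats == {"avg": 25.0, "min": 10.0, "max": 40.0}
```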
Constant gauge handling: If standard deviation = 0 (gauge never varied), all percentiles will equal the constant value.
Histogram Normalization
Histograms are cumulative in Prometheus—both bucket counts and sum only increase. The export normalizes to deltas:
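A sketch of that normalization, comparing cumulative state at the period boundary against the final scrape (values invented; buckets assumed cumulative):

```python
# Cumulative histogram state at period start (reference) vs period end.
start = {"sum": 100.0, "buckets": {"0.5": 400, "+Inf": 500}}
end   = {"sum": 412.5, "buckets": {"0.5": 1290, "+Inf": 1750}}

delta_sum = end["sum"] - start["sum"]
delta_buckets = {b: end["buckets"][b] - start["buckets"][b]
                 for b in end["buckets"]}
assert delta_sum == 312.5
assert delta_buckets == {"0.5": 890, "+Inf": 1250}
```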
Timeslice Normalization
When --slice-duration is configured (default: 2 seconds), the collection period is divided into fixed-duration windows. Each timeslice contains:
- Gauges: `avg`, `min`, and `max` for that window
- Counters: `total` (delta) and `rate` for that window
- Histograms: `count`, `sum`, `avg`, and optional `buckets` for that window
Fallback behavior: If the configured slice duration is smaller than the actual metric update interval, the system falls back to per-interval mode where each sample interval becomes its own “timeslice”.
Warmup Exclusion
When warmup time is configured, metrics collected during warmup are excluded from all statistics. The reference_value for delta calculations is taken from the last sample before the warmup period ends.
Histogram Percentile Estimation
Histogram percentiles are estimates because Prometheus histograms only store cumulative bucket counts, not individual observations. AIPerf uses a polynomial histogram algorithm for significantly improved accuracy over standard linear interpolation.
Why Standard Interpolation Fails
Standard Prometheus histogram interpolation assumes observations are uniformly distributed within each bucket. This assumption fails badly when:
- Observations cluster near boundaries: Real latency distributions often cluster near 0 or near bucket edges
- +Inf bucket contains data: The unbounded bucket makes interpolation impossible
- Bucket widths are large: Wide buckets hide the true distribution shape
Standard interpolation can produce errors of 5-10x on P99 estimates for typical LLM inference workloads.
Polynomial Histogram Algorithm
AIPerf implements a four-phase algorithm that provides ~5x reduction in percentile estimation error:
Phase 1 - Per-bucket mean learning:
When a scrape interval has all observations in a single bucket, the exact mean for that bucket can be computed: mean = sum_delta / count_delta. These learned means are accumulated over time via accumulate_bucket_statistics().
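The Phase 1 computation reduces to a one-liner:

```python
# Phase 1 sketch: when one scrape interval's observations all land in a
# single bucket, that interval yields the exact mean for that bucket.
def learn_bucket_mean(sum_delta: float, count_delta: int) -> float:
    return sum_delta / count_delta

# e.g. 12 new observations adding 3.0s of latency, all in one bucket:
assert learn_bucket_mean(3.0, 12) == 0.25
```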
Phase 2 - Estimate bucket sums:
For each finite bucket, estimate the sum using learned means (or midpoint fallback). This gives estimated_finite_sum.
Phase 3 - +Inf bucket back-calculation:
The +Inf bucket sum is calculated as total_sum - estimated_finite_sum. Observations are spread around the estimated mean inf_avg = inf_sum / inf_count within the +Inf range.
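A sketch of the Phase 3 back-calculation (all values invented for illustration):

```python
# Phase 3 sketch: the +Inf bucket sum is whatever the finite buckets
# cannot account for. All values are invented for illustration.
total_sum = 500.0             # histogram sum delta for the period
estimated_finite_sum = 462.5  # Phase 2 estimate over finite buckets
inf_count = 5                 # delta count in the +Inf bucket

inf_sum = total_sum - estimated_finite_sum
inf_avg = inf_sum / inf_count
assert (inf_sum, inf_avg) == (37.5, 7.5)
```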
Phase 4 - Generate finite observations with sum constraint:
For each bucket, observations are placed using one of several strategies based on learned statistics:
- F3 two-point mass: When variance is extremely tight (< 1% of bucket width)
- Blended distribution: When variance is tight (< 20%) and mean is near center (< 30% offset)
- Variance-aware distribution: When variance is moderate
- Shifted uniform: Fallback when only mean is learned (no variance data)
- Pure uniform: Final fallback using bucket midpoint
After initial placement, positions are adjusted proportionally across all buckets to match the adjusted target sum (total_sum - inf_sum_estimate), with each bucket’s adjustment capped at ±40% of bucket width.
Percentile Field Naming
Histogram percentiles use the _estimate suffix to indicate they are approximations:
Gauge percentiles (computed from raw samples) do not have the _estimate suffix because they are exact.