This document describes the structure and semantics of every field in the AIPerf server metrics JSON export format.
The server metrics JSON export provides aggregated statistics from Prometheus metrics collected during a benchmark run.
Metrics are grouped by name across all endpoints. When scraping multiple servers (e.g., prefill worker at :10000 and decode worker at :10001), metrics with the same name appear under a single key.
Each unique endpoint + label combination keeps its own separate series. Within each metric, the series array contains one entry for every distinct combination of endpoint URL and Prometheus labels, with independent statistics.
For example, if vllm:num_requests_running is scraped from 3 endpoints with 2 label sets each, you get 6 per-endpoint series.
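The grouping rule above can be sketched in a few lines. This is a minimal illustration and not AIPerf's implementation: the sample tuples, the `model` label, and the `group_series` helper are hypothetical names introduced only for this example.

```python
from collections import defaultdict

# Hypothetical scraped samples: (endpoint_url, metric_name, labels, value).
samples = [
    ("http://localhost:10000/metrics", "vllm:num_requests_running", {"model": "a"}, 3.0),
    ("http://localhost:10000/metrics", "vllm:num_requests_running", {"model": "b"}, 1.0),
    ("http://localhost:10001/metrics", "vllm:num_requests_running", {"model": "a"}, 2.0),
    ("http://localhost:10001/metrics", "vllm:num_requests_running", {"model": "a"}, 4.0),
]


def group_series(samples):
    """Group values by metric name, then by (endpoint, sorted label pairs)."""
    metrics = defaultdict(lambda: defaultdict(list))
    for endpoint, name, labels, value in samples:
        # Sorting the label items makes the key independent of dict ordering.
        series_key = (endpoint, tuple(sorted(labels.items())))
        metrics[name][series_key].append(value)
    return metrics


grouped = group_series(samples)
```

Here one metric name maps to three series, because two of the four samples share the same endpoint and label set.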
Note: The --url endpoint (e.g., localhost:10000) is automatically scraped for server metrics.
Format selection: By default, AIPerf generates JSON and CSV exports. This document describes the JSON format. To control which formats are generated, use --server-metrics-formats:
--server-metrics-formats json csv (default; JSONL and Parquet excluded to avoid large files)
--server-metrics-formats json csv jsonl
--server-metrics-formats json csv parquet
--server-metrics-formats json
The Parquet format exports raw time-series data with delta calculations in columnar format, optimized for SQL analytics with DuckDB, pandas, or Polars. See Parquet Schema Reference for the complete schema.
Metrics are organized for O(1) lookup by name with nested stats within each series:
Each metric entry has this structure:
Every series entry contains these common fields:
Gauges represent point-in-time values that can go up or down (e.g., current queue depth, memory usage).
Example interpretation (dynamo_component_inflight_requests):
When a gauge never changes during collection (standard deviation = 0), stats are still provided for API consistency. All percentiles equal the constant value:
Each gauge timeslice contains statistics for a fixed time window:
Prometheus families declared # TYPE foo untyped — and families that ship with no # TYPE line at all, which the parser also classifies as untyped — appear in the export with type: "unknown". node-exporter’s node_netstat_Icmp_*, node_netstat_Tcp_*, and node_netstat_IpExt_* families are typical examples.
AIPerf treats unknown as gauge-equivalent for storage and statistics: the series shape and stat fields are identical to a Gauge. The dedicated type: "unknown" tag is preserved (rather than flattened to "gauge") so a real gauge and an exporter-untyped scalar remain distinguishable for downstream consumers — e.g., to flag that the exporter is explicitly not asserting monotonic or rate semantics.
Identical to Gauge Series Fields.
Identical to Gauge Stats Fields.
Counters are monotonically increasing values (e.g., total requests processed, total bytes transferred).
Example interpretation (dynamo_component_request_bytes):
stats.total: 318092 → “318,092 bytes were received during the benchmark”
stats.rate: 14206.4 → “Overall throughput was 14,206 bytes/second”
stats.rate_avg: 14458.7 → “Average instantaneous rate was 14,459 bytes/second”
stats.rate_min: 0.0 → “Slowest period saw 0 bytes/second (idle)”
stats.rate_max: 69626.0 → “Fastest burst reached 69,626 bytes/second”
When a counter doesn’t change during the collection period (total = 0), stats are still provided for API consistency:
Each counter timeslice contains the delta and rate for a fixed time window:
Histograms track distributions of values (e.g., request latencies, token counts). Prometheus histograms maintain cumulative bucket counts and a running sum.
Note: Percentiles are estimates interpolated from histogram buckets.
When a histogram has no observations, stats contains only count: 0, and buckets contains all zeros:
Bucket keys are the upper bound (as strings), values are delta counts (number of new observations in each bucket during the collection period). The +Inf bucket contains the total delta count.
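As a rough sketch of how such bucket deltas relate to the raw Prometheus data, the snippet below subtracts the cumulative counts at the start of the collection period from those at the end. The helper names and sample values are hypothetical. The second function is an extra step that is not part of the export: it derives how many observations fell strictly between adjacent bounds.

```python
def bucket_deltas(start, end):
    """Delta of cumulative bucket counts over the collection period.

    Keys are upper bounds as strings (e.g. "0.5", "+Inf").
    """
    return {le: end[le] - start.get(le, 0) for le in end}


def per_interval_counts(deltas):
    """Observations between adjacent bounds, derived from the deltas."""
    ordered = sorted(deltas, key=float)  # float("+Inf") parses as infinity
    counts, previous = {}, 0
    for le in ordered:
        counts[le] = deltas[le] - previous
        previous = deltas[le]
    return counts


# Hypothetical cumulative bucket counts at the period boundaries.
start = {"0.1": 10, "0.5": 40, "1.0": 50, "+Inf": 52}
end = {"0.1": 15, "0.5": 70, "1.0": 90, "+Inf": 95}

deltas = bucket_deltas(start, end)
intervals = per_interval_counts(deltas)
```

In this example the `+Inf` delta is 43, which equals the total number of new observations, matching the description above.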
Each histogram timeslice contains count, sum, average, and bucket deltas for a fixed time window:
The meaning of histogram fields depends on what the histogram measures:
For example, vllm:e2e_request_latency_seconds measures request latencies, while input_sequence_tokens measures token counts.
Fields are omitted when not applicable to reduce JSON size. All series use a consistent stats format.
Units are inferred from metric name suffixes. Longer suffixes are matched first to handle compound suffixes correctly (e.g., _tokens_total matches before _total).
The “JSON Unit Value” column shows the actual string that appears in the unit field of the exported JSON (computed via enum.name.lower().replace("_per_second", "/s")).
Note: Additional units may be inferred from metric description text (e.g., “in milliseconds”, “(GB/s)”). Description-based inference takes priority when both suffix and description are present.
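The suffix rule and the JSON unit string can be illustrated as follows. This is a sketch under stated assumptions: the `Unit` enum members and the `SUFFIX_UNITS` table are invented for the example and are not AIPerf's actual tables. Only the longest-suffix-first ordering and the `name.lower().replace("_per_second", "/s")` expression come from the text above. Description-based inference is not modeled.

```python
from enum import Enum, auto


class Unit(Enum):
    """Hypothetical unit enum; member names are illustrative only."""
    COUNT = auto()
    TOKENS = auto()
    SECONDS = auto()
    BYTES = auto()
    TOKENS_PER_SECOND = auto()


# Hypothetical suffix table (an assumption, not the real mapping).
SUFFIX_UNITS = {
    "_total": Unit.COUNT,
    "_tokens_total": Unit.TOKENS,
    "_seconds": Unit.SECONDS,
    "_bytes": Unit.BYTES,
}


def infer_unit(metric_name):
    """Return the unit for the longest matching suffix, or None."""
    for suffix in sorted(SUFFIX_UNITS, key=len, reverse=True):
        if metric_name.endswith(suffix):
            return SUFFIX_UNITS[suffix]
    return None


def json_unit_value(unit):
    """String written to the `unit` field, per the expression in the text."""
    return unit.name.lower().replace("_per_second", "/s")
```

Checking suffixes longest-first is what keeps a name ending in `_tokens_total` from being classified by the shorter `_total` suffix.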
All statistics in the export are computed over the collection period, which may exclude warmup time based on configuration. Understanding how each metric type is normalized is critical for correct interpretation.
Counters are cumulative values in Prometheus—they only increase (except on server restart). The export normalizes them to deltas (changes) over the collection period:
Counter reset handling: If a counter decreases (server restart), the delta is clamped to 0 to avoid negative totals.
Reference point: The reference value for delta calculation is the last sample before the collection period starts (after warmup exclusion), ensuring accurate deltas at the period boundary.
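A minimal sketch of the counter normalization described above, assuming samples arrive as `(timestamp_seconds, cumulative_value)` pairs sorted by time. The function name is hypothetical, and computing the rate over the span from the reference sample to the last sample is an assumption of this example, since the exact duration used by the export is not stated here.

```python
def counter_total_and_rate(samples, period_start):
    """Return (total, rate) for a cumulative counter over the collection period.

    samples: list of (timestamp_seconds, cumulative_value), sorted by time.
    """
    before = [s for s in samples if s[0] < period_start]
    inside = [s for s in samples if s[0] >= period_start]
    if not inside:
        return 0.0, 0.0

    # Reference point: last sample before the period starts, if one exists.
    ref_ts, ref_value = before[-1] if before else inside[0]
    last_ts, last_value = inside[-1]

    # Counter reset handling: a decrease is clamped to zero.
    total = max(0.0, float(last_value - ref_value))

    # Assumption: rate is measured from the reference sample to the last sample.
    duration = last_ts - ref_ts
    rate = total / duration if duration > 0 else 0.0
    return total, rate


normal = counter_total_and_rate([(0, 100), (5, 150), (10, 200), (15, 260)], period_start=6)
after_reset = counter_total_and_rate([(0, 500), (5, 20)], period_start=1)
```

In the first call the reference is the sample at t=5, so the total is 260 - 150 = 110 over 10 seconds. In the second call the counter dropped, so the total is clamped to 0.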
Gauges are point-in-time values. Statistics are computed from all samples within the collection period:
Constant gauge handling: If standard deviation = 0 (gauge never varied), all percentiles will equal the constant value.
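The constant-gauge case can be seen in a small sketch that computes gauge statistics directly from raw samples. The helper names and the `std`, `p50`, and `p99` keys are hypothetical, and the linear-interpolation percentile used here is one common definition, not necessarily the one AIPerf uses.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) over sorted values."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def gauge_stats(values):
    """Summary statistics over all samples in the collection period."""
    ordered = sorted(values)
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "std": statistics.pstdev(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
    }


constant = gauge_stats([7.0] * 5)
varying = gauge_stats([1.0, 2.0, 3.0, 4.0])
```

For the constant series, the standard deviation is 0 and every percentile equals the sampled value, as the text above describes.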
Histograms are cumulative in Prometheus—both bucket counts and sum only increase. The export normalizes to deltas:
When --slice-duration is configured (default: 2 seconds), the collection period is divided into fixed-duration windows. Each timeslice contains:
Gauges: avg, min, max for that window
Counters: total (delta) and rate for that window
Histograms: count, sum, avg, and optional buckets for that window
Fallback behavior: If the configured slice duration is smaller than the actual metric update interval, the system falls back to per-interval mode where each sample interval becomes its own “timeslice”.
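To make the fixed-window idea concrete, here is a hedged sketch for gauge samples only. The function name and the `start`/`end` keys are hypothetical, and the per-interval fallback is not modeled.

```python
def gauge_timeslices(samples, period_start, slice_duration=2.0):
    """Split (timestamp, value) samples into fixed windows with avg/min/max."""
    windows = {}
    for timestamp, value in samples:
        if timestamp < period_start:
            continue  # samples before the collection period are ignored
        index = int((timestamp - period_start) // slice_duration)
        windows.setdefault(index, []).append(value)

    slices = []
    for index in sorted(windows):
        values = windows[index]
        slices.append({
            "start": period_start + index * slice_duration,
            "end": period_start + (index + 1) * slice_duration,
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        })
    return slices


slices = gauge_timeslices(
    [(0.0, 1.0), (1.0, 3.0), (2.5, 10.0), (3.9, 20.0)],
    period_start=0.0,
)
```

With the default 2-second duration, the four samples land in two windows. Windows with no samples are simply absent in this sketch.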
When warmup time is configured, metrics collected during warmup are excluded from all statistics. The reference_value for delta calculations is taken from the last sample before the warmup period ends.
Histogram percentiles are estimates because Prometheus histograms only store cumulative bucket counts, not individual observations. AIPerf uses a polynomial histogram algorithm for significantly improved accuracy over standard linear interpolation.
Standard Prometheus histogram interpolation assumes observations are uniformly distributed within each bucket. This assumption fails badly when:
Standard interpolation can produce errors of 5-10x on P99 estimates for typical LLM inference workloads.
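For reference, the standard linear interpolation that this section contrasts against can be sketched as below. It follows the general approach of Prometheus's `histogram_quantile` (uniform spread inside a bucket, lower bound of the first bucket taken as 0, highest finite bound returned when the quantile falls in `+Inf`). The function name and input layout are assumptions of this example.

```python
import math


def linear_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    with the last entry being the +Inf bucket (math.inf).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate inside +Inf
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket if in_bucket else 0.0
            # Uniformity assumption: observations spread evenly in the bucket.
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.1, 10), (0.5, 60), (1.0, 100), (math.inf, 100)]
p50 = linear_quantile(0.50, buckets)
p99 = linear_quantile(0.99, buckets)
```

If the 40 observations in the (0.5, 1.0] bucket were actually clustered just above 0.5, the P99 estimate here would still land close to 1.0, which is the kind of error the uniformity assumption can introduce.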
AIPerf implements a four-phase algorithm that provides ~5x reduction in percentile estimation error:
Phase 1 - Per-bucket mean learning:
When a scrape interval has all observations in a single bucket, the exact mean for that bucket can be computed: mean = sum_delta / count_delta. These learned means are accumulated over time via accumulate_bucket_statistics().
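The learning step can be illustrated with a small sketch. It is not the body of `accumulate_bucket_statistics()`, whose signature is not shown in this document. The `accumulate` and `learned_mean` helpers and the `[sum, count]` storage layout are hypothetical.

```python
def accumulate(learned, count_deltas, sum_delta):
    """Update learned[le] = [sum, count] for single-bucket scrape intervals.

    count_deltas: per-bucket (non-cumulative) count deltas for one interval.
    sum_delta: change in the histogram's running sum over the same interval.
    """
    active = [(le, n) for le, n in count_deltas.items() if n > 0]
    if len(active) != 1:
        return  # observations span several buckets: no exact mean available
    le, count_delta = active[0]
    entry = learned.setdefault(le, [0.0, 0])
    entry[0] += sum_delta
    entry[1] += count_delta


def learned_mean(learned, le):
    """Accumulated mean for a bucket, or None if nothing was learned."""
    total, count = learned.get(le, (0.0, 0))
    return total / count if count else None


learned = {}
accumulate(learned, {"0.5": 0, "1.0": 4, "+Inf": 0}, sum_delta=3.0)  # learned
accumulate(learned, {"0.5": 2, "1.0": 1, "+Inf": 0}, sum_delta=1.2)  # skipped
accumulate(learned, {"0.5": 0, "1.0": 6, "+Inf": 0}, sum_delta=4.2)  # learned
```

The second interval is skipped because its sum cannot be attributed to a single bucket.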
Phase 2 - Estimate bucket sums:
For each finite bucket, estimate the sum using learned means (or midpoint fallback). This gives estimated_finite_sum.
Phase 3 - +Inf bucket back-calculation:
The +Inf bucket sum is calculated as total_sum - estimated_finite_sum. Observations are spread around the estimated mean inf_avg = inf_sum / inf_count within the +Inf range.
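Phases 2 and 3 amount to simple arithmetic, sketched below under stated assumptions: the function name and argument layout are hypothetical, the lower bound of the first bucket is taken as 0, and counts are per-bucket (non-cumulative).

```python
def estimate_inf_bucket(bounds, counts, inf_count, total_sum, learned_means):
    """Estimate finite bucket sums, then back-calculate the +Inf bucket.

    bounds: finite upper bounds, sorted ascending.
    counts: per-bucket observation counts aligned with bounds.
    learned_means: {upper_bound: mean} for buckets with a learned mean.
    """
    estimated_finite_sum = 0.0
    lower = 0.0  # assumption: the first bucket starts at 0
    for upper, count in zip(bounds, counts):
        midpoint = (lower + upper) / 2.0
        mean = learned_means.get(upper, midpoint)  # midpoint fallback
        estimated_finite_sum += mean * count
        lower = upper

    inf_sum = total_sum - estimated_finite_sum
    inf_avg = inf_sum / inf_count if inf_count else None
    return estimated_finite_sum, inf_sum, inf_avg


finite_sum, inf_sum, inf_avg = estimate_inf_bucket(
    bounds=[1.0, 2.0],
    counts=[4, 2],
    inf_count=2,
    total_sum=16.0,
    learned_means={1.0: 0.75},
)
```

In this example the first bucket uses its learned mean (0.75) and the second falls back to its midpoint (1.5), so the finite buckets account for 6.0 of the total sum and the remaining 10.0 is attributed to the two `+Inf` observations.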
Phase 4 - Generate finite observations with sum constraint: For each bucket, observations are placed using one of several strategies based on learned statistics:
After initial placement, positions are adjusted proportionally across all buckets to match the adjusted target sum (total_sum - inf_sum_estimate), with each bucket’s adjustment capped at ±40% of bucket width.
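The adjustment step is described only at a high level, so the following is one plausible reading and should not be taken as the actual implementation. It assumes every observation in a bucket is shifted by the same fraction `k` of that bucket's width, with `k` chosen to close the gap to the target sum and then capped at ±40%. The function name, the dictionary layout, and the final clamp to bucket bounds are all assumptions of this sketch.

```python
def adjust_to_target(buckets, target_sum, cap=0.40):
    """Shift placed observations toward target_sum, capped per bucket.

    buckets: list of {"lower": float, "upper": float, "positions": [float]}.
    """
    current_sum = sum(sum(b["positions"]) for b in buckets)
    # Total shift produced if every observation moved by one full bucket width.
    weight = sum(len(b["positions"]) * (b["upper"] - b["lower"]) for b in buckets)
    if weight == 0:
        return buckets

    k = (target_sum - current_sum) / weight
    k = max(-cap, min(cap, k))  # cap at +/-40% of bucket width

    adjusted = []
    for b in buckets:
        shift = k * (b["upper"] - b["lower"])
        # Assumption: positions are kept inside their own bucket.
        positions = [min(b["upper"], max(b["lower"], p + shift)) for p in b["positions"]]
        adjusted.append({**b, "positions": positions})
    return adjusted


initial = [
    {"lower": 0.0, "upper": 1.0, "positions": [0.5, 0.5]},
    {"lower": 1.0, "upper": 3.0, "positions": [2.0]},
]
matched = adjust_to_target(initial, target_sum=3.4)
capped = adjust_to_target(initial, target_sum=10.0)
```

When the cap is reached, as in the second call, the adjusted positions cannot fully reach the target sum. That is the trade-off the cap makes to keep observations plausible within their buckets.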
Histogram percentiles use the _estimate suffix to indicate they are approximations:
Gauge percentiles (computed from raw samples) do not have the _estimate suffix because they are exact.