Server Metrics Parquet Export Schema | NVIDIA AIPerf Documentation

Schema reference for the server_metrics_export.parquet file. Optimized for SQL analytics with DuckDB, pandas, and Polars.

Overview

The Parquet export provides raw time-series data, with cumulative deltas computed at each timestamp. It uses a normalized schema in which histogram buckets are separate rows rather than wide columns, which produces ~50% smaller files.

Enable Parquet Export

$ aiperf profile --model MODEL ... --server-metrics-formats json csv parquet

Delta Calculations

Counter and histogram values are deltas from a reference point (the last sample before the profiling period); gauge values are reported as-is:

Metric Type	Value Semantics
Gauge	Raw value at timestamp (no delta)
Counter	Cumulative delta from reference (`value[t] - value[ref]`)
Histogram	Cumulative deltas for `sum`, `count`, and each `bucket_count`

Negative deltas (counter resets) are clamped to 0.
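
The delta rules above can be sketched in a few lines of Python. This is a minimal illustration, not the exporter's actual code: the function name `cumulative_deltas` and the plain-list inputs are hypothetical, and the reference value is simply passed in by the caller.

```python
def cumulative_deltas(samples, reference):
    """Return value[t] - value[ref] for each sample, clamping negatives to 0.

    `samples` and `reference` are hypothetical inputs for illustration:
    raw counter readings during profiling and the last reading before it.
    """
    return [max(0.0, float(v) - float(reference)) for v in samples]


# Raw counter readings; the drop to 5 mimics a counter reset.
raw = [100, 120, 150, 5]
deltas = cumulative_deltas(raw, reference=100)
print(deltas)  # [0.0, 20.0, 50.0, 0.0]
```

Gauges skip this step entirely, and histograms would apply the same subtraction to `sum`, `count`, and each `bucket_count`.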

Schema Definition

Fixed Columns

Column	Type	Nullable	Description
`endpoint_url`	`string`	No	Prometheus endpoint URL (e.g., `http://localhost:8000/metrics`)
`metric_name`	`string`	No	Metric name (e.g., `vllm:kv_cache_usage_perc`)
`metric_type`	`string`	No	`gauge`, `counter`, `histogram`, or `unknown` (Prometheus `# TYPE foo untyped`)
`unit`	`string`	Yes	Inferred unit (`seconds`, `tokens`, `requests`, `ratio`, etc.)
`description`	`string`	Yes	Metric HELP text from Prometheus
`timestamp_ns`	`int64`	No	Collection timestamp in nanoseconds since epoch

Value Columns

Column	Type	Nullable	Used By	Description
`value`	`float64`	Yes	Gauge, Counter	Metric value (raw for gauge, delta for counter)
`sum`	`float64`	Yes	Histogram	Cumulative sum delta from reference
`count`	`float64`	Yes	Histogram	Cumulative count delta from reference
`bucket_le`	`string`	Yes	Histogram	Bucket upper bound (e.g., `0.1`, `+Inf`)
`bucket_count`	`float64`	Yes	Histogram	Cumulative bucket count delta (observations <= `bucket_le`)

Dynamic Label Columns

Prometheus labels become individual columns (alphabetically sorted):

Column	Type	Nullable	Description
`engine`	`string`	Yes	vLLM engine ID
`engine_type`	`string`	Yes	Engine type (`trtllm`, `unified`, `prefill`, `decode`)
`finished_reason`	`string`	Yes	Request completion reason
`reason`	`string`	Yes	vLLM waiting reason or Triton failure reason
`sleep_state`	`string`	Yes	vLLM engine sleep state
`source`	`string`	Yes	vLLM prompt-token source
`position`	`string`	Yes	vLLM speculative-decoding draft position
`transfer_type`	`string`	Yes	vLLM KV offload transfer type
`model_name`	`string`	Yes	Model identifier
`dynamo_component`	`string`	Yes	Dynamo worker component
`worker_id`	`string`	Yes	Dynamo worker identifier
`worker_type`	`string`	Yes	Dynamo worker type (`prefill`, `decode`, etc.)
`router_id`	`string`	Yes	Dynamo router identifier
`operation`	`string`	Yes	Dynamo operation name
`migration_type`	`string`	Yes	Dynamo request migration type
`event_type`	`string`	Yes	Dynamo KV publisher event type
`worker`	`string`	Yes	Tokio worker index
`pool`	`string`	Yes	Dynamo KVBM logical pool name
`instance_id`	`string`	Yes	Dynamo KVBM external instance label
`tp_rank`	`string`	Yes	Tensor parallel rank
`pp_rank`	`string`	Yes	Pipeline parallel rank
`moe_ep_rank`	`string`	Yes	SGLang MoE expert-parallel rank
`dp_rank`	`string`	Yes	SGLang data-parallel rank
`priority`	`string`	Yes	SGLang priority-scheduling value
`stage`	`string`	Yes	SGLang processing stage
`mode`	`string`	Yes	SGLang token/CUDA graph mode
`category`	`string`	Yes	SGLang forward execution category
`cache_source`	`string`	Yes	SGLang cache source (`device`, `host`, `storage_*`, `total`)
`num_prefill_ranks`	`string`	Yes	SGLang DP cooperation prefill-rank count
`input_estimation`	`string`	Yes	SGLang prefill-delayer input estimate
`output_allow`	`string`	Yes	SGLang prefill-delayer output allowance
`output_reason`	`string`	Yes	SGLang prefill-delayer output reason
`actual_execution`	`string`	Yes	SGLang prefill-delayer execution outcome
`forward_mode`	`string`	Yes	SGLang expert-parallel forward mode
`layer`	`string`	Yes	SGLang model layer
`request_type`	`string`	Yes	Triton/TensorRT-LLM backend request type
`model_namespace`	`string`	Yes	Triton model namespace
`gpu_uuid`	`string`	Yes	Triton GPU UUID
`_custom_tag`	`string`	Yes	Triton model tag labels (actual column name uses the configured tag name prefixed with `_`)
`memory_type`	`string`	Yes	TensorRT-LLM backend memory type
`kv_cache_block_type`	`string`	Yes	TensorRT-LLM backend KV-cache block type
`disaggregated_serving_type`	`string`	Yes	TensorRT-LLM backend disaggregated-serving metric type
`version`	`string`	Yes	Triton model version
(others)	`string`	Yes	Any additional Prometheus labels

Label columns vary by endpoint/model. Use union_by_name=true for cross-file queries.

Note: Prometheus labels that conflict with reserved column names (endpoint_url, metric_name, metric_type, unit, description, timestamp_ns, value, sum, count, bucket_le, bucket_count) are silently excluded.

Row Structure by Metric Type

Column order: fixed columns → label columns (alphabetically) → value columns.
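
As a rough sketch of the ordering and reserved-name rules described above, the expected column layout for a given set of Prometheus labels could be derived as follows. The helper name `expected_column_order` is hypothetical, and the order within the fixed and value groups is assumed from the tables and sample rows in this document.

```python
FIXED_COLUMNS = [
    "endpoint_url", "metric_name", "metric_type",
    "unit", "description", "timestamp_ns",
]
VALUE_COLUMNS = ["value", "sum", "count", "bucket_le", "bucket_count"]
RESERVED = set(FIXED_COLUMNS) | set(VALUE_COLUMNS)


def expected_column_order(label_names):
    """Fixed columns, then sorted labels (reserved names dropped), then value columns."""
    labels = sorted(set(label_names) - RESERVED)
    return FIXED_COLUMNS + labels + VALUE_COLUMNS


# A label literally named "count" collides with a reserved column and is dropped.
order = expected_column_order(["model_name", "engine", "count"])
print(order[6:8])  # ['engine', 'model_name']
```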

Gauge/Counter: One Row per Timestamp

endpoint_url | metric_name              | metric_type | unit  | description | timestamp_ns        | model_name      | value | sum  | count | bucket_le | bucket_count
-------------|--------------------------|-------------|-------|-------------|---------------------|-----------------|-------|------|-------|-----------|-------------
http://...   | vllm:kv_cache_usage_perc | gauge       | ratio | KV-cache... | 1765793061967310848 | Qwen/Qwen3-0.6B | 0.72  | null | null  | null      | null
http://...   | vllm:request_success     | counter     | null  | Count of... | 1765793061967310848 | Qwen/Qwen3-0.6B | 150.0 | null | null  | null      | null
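
Because counter rows hold cumulative deltas rather than per-interval increments, a per-second rate has to be derived from consecutive rows. The following is a minimal sketch under that assumption; `interval_rates` is a hypothetical helper and the timestamps and values are made-up inputs, not exporter output.

```python
def interval_rates(timestamps_ns, deltas):
    """Per-second rate between consecutive samples of a cumulative counter delta."""
    rates = []
    for i in range(1, len(deltas)):
        dt_s = (timestamps_ns[i] - timestamps_ns[i - 1]) / 1e9
        if dt_s <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((deltas[i] - deltas[i - 1]) / dt_s)
    return rates


ts = [0, 1_000_000_000, 3_000_000_000]  # samples at 0 s, 1 s, and 3 s
vals = [0.0, 50.0, 150.0]               # cumulative deltas for one counter series
print(interval_rates(ts, vals))         # [50.0, 50.0]
```

The inputs are expected to belong to a single series (one endpoint and one label combination), sorted by `timestamp_ns`.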

Histogram: N Rows per Timestamp (One per Bucket)

endpoint_url | metric_name                      | metric_type | unit    | description  | timestamp_ns        | model_name      | value | sum    | count | bucket_le | bucket_count
-------------|----------------------------------|-------------|---------|--------------|---------------------|-----------------|-------|--------|-------|-----------|-------------
http://...   | vllm:e2e_request_latency_seconds | histogram   | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null  | 259.87 | 19.0  | 0.3       | 0.0
http://...   | vllm:e2e_request_latency_seconds | histogram   | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null  | 259.87 | 19.0  | 1.0       | 1.0
http://...   | vllm:e2e_request_latency_seconds | histogram   | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null  | 259.87 | 19.0  | 5.0       | 3.0
http://...   | vllm:e2e_request_latency_seconds | histogram   | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null  | 259.87 | 19.0  | +Inf      | 19.0
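
Since `bucket_count` is cumulative, the number of observations that fall inside each individual bucket is the difference between neighbouring rows. A small sketch using the sample rows above; the helper name `per_bucket_counts` is hypothetical.

```python
def per_bucket_counts(bucket_les, cumulative_counts):
    """Convert cumulative bucket counts into per-bucket observation counts."""
    # float() parses '+Inf' as infinity, so it sorts after every finite bound.
    pairs = sorted(zip(bucket_les, cumulative_counts), key=lambda p: float(p[0]))
    result = {}
    previous = 0.0
    for le, cumulative in pairs:
        result[le] = cumulative - previous
        previous = cumulative
    return result


rows = [("0.3", 0.0), ("1.0", 1.0), ("5.0", 3.0), ("+Inf", 19.0)]
counts = per_bucket_counts(*zip(*rows))
print(counts)  # {'0.3': 0.0, '1.0': 1.0, '5.0': 2.0, '+Inf': 16.0}

# Mean latency over the profiling period comes from sum / count.
mean_latency = 259.87 / 19.0
print(round(mean_latency, 2))  # 13.68
```

In this sample, 16 of the 19 requests took longer than 5 seconds, which is consistent with the mean of roughly 13.68 seconds.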

File Metadata

Parquet file metadata (accessible via pq.read_metadata()) includes:

Key	Description
`aiperf.schema_version`	Schema version (`1.0`)
`aiperf.version`	AIPerf version
`aiperf.benchmark_id`	Unique benchmark UUID
`aiperf.exporter`	Exporter class name (`ServerMetricsParquetExporter`)
`aiperf.export_timestamp_utc`	Export timestamp (ISO 8601)
`aiperf.time_filter_start_ns`	Profiling period start (nanoseconds)
`aiperf.time_filter_end_ns`	Profiling period end (nanoseconds)
`aiperf.profiling_duration_ns`	Profiling duration (nanoseconds)
`aiperf.profiling_duration_seconds`	Profiling duration (seconds)
`aiperf.endpoint_urls`	JSON array of endpoint URLs
`aiperf.endpoint_count`	Number of endpoints
`aiperf.label_columns`	JSON array of label column names
`aiperf.label_count`	Number of label columns
`aiperf.metric_count`	Total unique metrics
`aiperf.metric_type_counts`	JSON object: `{"gauge": N, "counter": N, "histogram": N, "unknown": N}`
`aiperf.model_names`	JSON array of model names
`aiperf.concurrency`	Benchmark concurrency setting
`aiperf.request_rate`	Benchmark request rate (if set)
`aiperf.input_config`	Full user configuration (JSON)
`aiperf.hostname`	Collection host
`aiperf.python_version`	Python version
`aiperf.pyarrow_version`	PyArrow version
`aiperf.schema_note`	Cross-file query hint

Compression: Snappy (good compression ratio with fast decompression)

Example Queries

DuckDB

-- Time-series for a specific metric
SELECT timestamp_ns, value
FROM 'server_metrics_export.parquet'
WHERE metric_name = 'vllm:kv_cache_usage_perc'
ORDER BY timestamp_ns;

-- Filter by label
SELECT timestamp_ns, value
FROM 'server_metrics_export.parquet'
WHERE metric_name = 'vllm:request_success'
  AND model_name = 'Qwen/Qwen3-0.6B'
ORDER BY timestamp_ns;

-- Histogram bucket distribution at final timestamp
SELECT bucket_le, bucket_count
FROM 'server_metrics_export.parquet'
WHERE metric_name = 'vllm:e2e_request_latency_seconds'
  AND timestamp_ns = (SELECT MAX(timestamp_ns) FROM 'server_metrics_export.parquet'
                      WHERE metric_name = 'vllm:e2e_request_latency_seconds')
ORDER BY CAST(REPLACE(bucket_le, '+Inf', '999999') AS DOUBLE);

-- Aggregate across multiple runs (handles schema differences)
SELECT metric_name, AVG(value) as avg_value
FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)
WHERE metric_type = 'gauge'
GROUP BY metric_name;

-- Compare endpoints
SELECT endpoint_url, metric_name, AVG(value) as avg_value
FROM 'server_metrics_export.parquet'
WHERE metric_type = 'gauge'
GROUP BY endpoint_url, metric_name;

pandas

import pandas as pd

df = pd.read_parquet('server_metrics_export.parquet')

# Filter to gauge metrics
gauges = df[df['metric_type'] == 'gauge']

# Time-series plot (requires matplotlib)
kv_usage = df[df['metric_name'] == 'vllm:kv_cache_usage_perc']
kv_usage.plot(x='timestamp_ns', y='value', title='KV Cache Usage')

# Pivot histogram buckets into one column per bucket_le.
# pivot() raises ValueError if a (timestamp_ns, bucket_le) pair repeats,
# e.g. with several endpoints or label sets; filter to one series first.
hist = df[df['metric_name'] == 'vllm:e2e_request_latency_seconds']
pivot = hist.pivot(index='timestamp_ns', columns='bucket_le', values='bucket_count')

Polars

import polars as pl

df = pl.read_parquet('server_metrics_export.parquet')

# Filter and aggregate
result = (
    df.filter(pl.col('metric_type') == 'gauge')
    .group_by('metric_name')
    .agg([
        pl.col('value').mean().alias('avg'),
        pl.col('value').max().alias('max'),
    ])
)

# Lazy scan for large files
lazy = pl.scan_parquet('artifacts/*/server_metrics_export.parquet')
result = lazy.filter(pl.col('metric_name') == 'vllm:kv_cache_usage_perc').collect()

Reading Metadata

import pyarrow.parquet as pq
import json

metadata = pq.read_metadata('server_metrics_export.parquet')
schema_metadata = metadata.schema.to_arrow_schema().metadata

# Access specific fields
benchmark_id = schema_metadata[b'aiperf.benchmark_id'].decode()
config = json.loads(schema_metadata[b'aiperf.input_config'])
label_columns = json.loads(schema_metadata[b'aiperf.label_columns'])
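
Metadata keys and values arrive as bytes, so decoding them all at once can be convenient. The sketch below assumes that only the keys described as JSON in the metadata table hold JSON and leaves every other value as a plain string. The helper `decode_metadata` and the sample dictionary are hypothetical and stand in for a real `schema_metadata` mapping.

```python
import json

# Keys whose values the metadata table describes as JSON (assumption).
JSON_KEYS = {
    "aiperf.endpoint_urls",
    "aiperf.label_columns",
    "aiperf.metric_type_counts",
    "aiperf.model_names",
    "aiperf.input_config",
}


def decode_metadata(raw):
    """Decode a bytes-to-bytes metadata mapping, keeping only aiperf.* keys."""
    decoded = {}
    for key, value in raw.items():
        name = key.decode()
        if not name.startswith("aiperf."):
            continue  # skip non-AIPerf entries such as writer-internal keys
        text = value.decode()
        decoded[name] = json.loads(text) if name in JSON_KEYS else text
    return decoded


# Made-up sample standing in for the mapping read from a real file.
sample = {
    b"aiperf.schema_version": b"1.0",
    b"aiperf.label_columns": b'["engine", "model_name"]',
    b"other.key": b"ignored",
}
decoded = decode_metadata(sample)
print(decoded["aiperf.label_columns"])  # ['engine', 'model_name']
```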

Best Practices

Cross-File Analysis

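Label columns vary by endpoint and model. In DuckDB, always pass union_by_name=true; in pandas, concatenate the per-file frames so that columns are aligned by name and missing labels become nulls: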

-- DuckDB
SELECT * FROM read_parquet('run_*/server_metrics_export.parquet', union_by_name=true);

# pandas
import pandas as pd
from pathlib import Path

dfs = [pd.read_parquet(p) for p in Path('.').glob('run_*/server_metrics_export.parquet')]
combined = pd.concat(dfs, ignore_index=True)
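
Conceptually, combining files by column name means taking the union of all columns and filling the gaps with nulls. The following pure-Python sketch illustrates that idea on lists of row dictionaries. It is not how DuckDB or pandas implement it, and the function name `union_rows_by_name` and the sample rows are hypothetical.

```python
def union_rows_by_name(*tables):
    """Combine row dicts from several tables, filling missing columns with None."""
    columns = []
    for rows in tables:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    return [
        {name: row.get(name) for name in columns}
        for rows in tables
        for row in rows
    ]


run_a = [{"metric_name": "m", "value": 1.0, "engine": "0"}]
run_b = [{"metric_name": "m", "value": 2.0, "worker_id": "w1"}]
combined_rows = union_rows_by_name(run_a, run_b)
print(combined_rows[1])
# {'metric_name': 'm', 'value': 2.0, 'engine': None, 'worker_id': 'w1'}
```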

Histogram Percentile Estimation

Reconstruct percentiles from bucket data. Note that bucket_count values are cumulative (each bucket includes all observations with value <= bucket_le), matching Prometheus histogram semantics:

import numpy as np

def estimate_percentile(bucket_les, bucket_counts, percentile):
    """Estimate a percentile from cumulative histogram buckets using linear interpolation.

    Inputs must be sorted by ascending bucket bound, with '+Inf' last.
    """
    # Convert bucket_le strings to floats (handle +Inf)
    bounds = [float(b) if b != '+Inf' else np.inf for b in bucket_les]
    counts = np.asarray(bucket_counts, dtype=float)

    total = counts[-1]  # +Inf bucket has cumulative total
    if total <= 0:
        return float('nan')  # no observations in the profiling period
    target = total * (percentile / 100)

    for i, (le, count) in enumerate(zip(bounds, counts)):
        if count >= target:
            if i == 0:
                return le  # no lower bound known; report the first bucket's upper bound
            prev_le, prev_count = bounds[i - 1], counts[i - 1]
            if np.isinf(le):
                return prev_le  # target falls in the +Inf bucket: return last finite bound
            # Linear interpolation within bucket
            fraction = (target - prev_count) / (count - prev_count) if count > prev_count else 0
            return prev_le + fraction * (le - prev_le)
    return bounds[-2]  # Fallback (percentile > 100): last finite bound

Memory-Efficient Processing

For large files, use lazy evaluation:

# Polars lazy scan
import polars as pl
df = (
    pl.scan_parquet('server_metrics_export.parquet')
    .filter(pl.col('metric_name') == 'vllm:kv_cache_usage_perc')
    .collect()
)

# DuckDB direct query (doesn't load entire file)
import duckdb
result = duckdb.query("""
    SELECT AVG(value) FROM 'server_metrics_export.parquet'
    WHERE metric_name = 'vllm:kv_cache_usage_perc'
""").fetchone()

Schema Version History

Version	Changes
`1.0`	Initial schema with normalized histogram buckets

For aggregated statistics, see JSON Schema. For metric definitions, see Server Metrics Reference. For usage examples, see the Server Metrics Tutorial.