
Copyright © 2026, NVIDIA Corporation.


AIPerf Server Metrics Parquet Export Schema

Schema reference for the server_metrics_export.parquet file. The file is optimized for analytical queries with DuckDB, pandas, and Polars.

Overview

The Parquet export provides raw time-series data with cumulative delta calculations applied at each timestamp. It uses a normalized schema in which each histogram bucket is a separate row rather than a wide column, which produces files roughly 50% smaller than a wide layout.

Enable Parquet Export

$ aiperf profile --model MODEL ... --server-metrics-formats json csv parquet

Delta Calculations

Counter and histogram values are deltas from a reference point (the last sample before the profiling period). Gauge values are reported as-is:

Metric Type | Value Semantics
------------|----------------
Gauge | Raw value at timestamp (no delta)
Counter | Cumulative delta from reference (value[t] - value[ref])
Histogram | Cumulative deltas for sum, count, and each bucket_count

Negative deltas (counter resets) are clamped to 0.
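
The delta rule above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not AIPerf's implementation; the function name `cumulative_deltas` and its arguments are hypothetical.

```python
def cumulative_deltas(reference, samples):
    """Return value[t] - value[ref] for each sample, clamping negatives to 0.

    `reference` is the last raw counter reading before the profiling period;
    `samples` are the raw readings collected during it (hypothetical inputs).
    """
    return [max(0.0, float(v) - float(reference)) for v in samples]


# Raw counter readings; the drop to 3 simulates a counter reset.
raw = [100, 120, 150, 3]
print(cumulative_deltas(100, raw))  # [0.0, 20.0, 50.0, 0.0]
```

The last reading falls below the reference, so its delta would be negative and is reported as 0 instead.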


Schema Definition

Fixed Columns

Column | Type | Nullable | Description
-------|------|----------|------------
endpoint_url | string | No | Prometheus endpoint URL (e.g., http://localhost:8000/metrics)
metric_name | string | No | Metric name (e.g., vllm:kv_cache_usage_perc)
metric_type | string | No | gauge, counter, histogram, or unknown (Prometheus # TYPE foo untyped)
unit | string | Yes | Inferred unit (seconds, tokens, requests, ratio, etc.)
description | string | Yes | Metric HELP text from Prometheus
timestamp_ns | int64 | No | Collection timestamp in nanoseconds since epoch
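
Since timestamp_ns is an integer nanosecond count, it usually needs converting before plotting or comparing against wall-clock time. A minimal sketch using only the standard library follows; the helper names `ns_to_datetime` and `elapsed_seconds` are illustrative and not part of AIPerf.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(timestamp_ns):
    """Convert nanoseconds since epoch to a UTC datetime (microsecond precision)."""
    # Integer division avoids the float rounding that timestamp_ns / 1e9 would introduce.
    return EPOCH + timedelta(microseconds=timestamp_ns // 1_000)


def elapsed_seconds(timestamp_ns, start_ns):
    """Seconds between a sample and a chosen start timestamp."""
    return (timestamp_ns - start_ns) / 1e9


ts = 1765793061967310848
print(ns_to_datetime(ts).isoformat())
print(elapsed_seconds(ts + 2_500_000_000, ts))  # 2.5
```

Python's datetime only holds microseconds, so the last three digits of the nanosecond value are dropped in the conversion.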

Value Columns

Column | Type | Nullable | Used By | Description
-------|------|----------|---------|------------
value | float64 | Yes | Gauge, Counter | Metric value (raw for gauge, delta for counter)
sum | float64 | Yes | Histogram | Cumulative sum delta from reference
count | float64 | Yes | Histogram | Cumulative count delta from reference
bucket_le | string | Yes | Histogram | Bucket upper bound (e.g., 0.1, +Inf)
bucket_count | float64 | Yes | Histogram | Cumulative bucket count delta (observations <= bucket_le)

Dynamic Label Columns

Prometheus labels become individual columns (alphabetically sorted):

Column | Type | Nullable | Description
-------|------|----------|------------
engine | string | Yes | vLLM engine ID
engine_type | string | Yes | Engine type (trtllm, unified, prefill, decode)
finished_reason | string | Yes | Request completion reason
reason | string | Yes | vLLM waiting reason or Triton failure reason
sleep_state | string | Yes | vLLM engine sleep state
source | string | Yes | vLLM prompt-token source
position | string | Yes | vLLM speculative-decoding draft position
transfer_type | string | Yes | vLLM KV offload transfer type
model_name | string | Yes | Model identifier
dynamo_component | string | Yes | Dynamo worker component
worker_id | string | Yes | Dynamo worker identifier
worker_type | string | Yes | Dynamo worker type (prefill, decode, etc.)
router_id | string | Yes | Dynamo router identifier
operation | string | Yes | Dynamo operation name
migration_type | string | Yes | Dynamo request migration type
event_type | string | Yes | Dynamo KV publisher event type
worker | string | Yes | Tokio worker index
pool | string | Yes | Dynamo KVBM logical pool name
instance_id | string | Yes | Dynamo KVBM external instance label
tp_rank | string | Yes | Tensor parallel rank
pp_rank | string | Yes | Pipeline parallel rank
moe_ep_rank | string | Yes | SGLang MoE expert-parallel rank
dp_rank | string | Yes | SGLang data-parallel rank
priority | string | Yes | SGLang priority-scheduling value
stage | string | Yes | SGLang processing stage
mode | string | Yes | SGLang token/CUDA graph mode
category | string | Yes | SGLang forward execution category
cache_source | string | Yes | SGLang cache source (device, host, storage_*, total)
num_prefill_ranks | string | Yes | SGLang DP cooperation prefill-rank count
input_estimation | string | Yes | SGLang prefill-delayer input estimate
output_allow | string | Yes | SGLang prefill-delayer output allowance
output_reason | string | Yes | SGLang prefill-delayer output reason
actual_execution | string | Yes | SGLang prefill-delayer execution outcome
forward_mode | string | Yes | SGLang expert-parallel forward mode
layer | string | Yes | SGLang model layer
request_type | string | Yes | Triton/TensorRT-LLM backend request type
model_namespace | string | Yes | Triton model namespace
gpu_uuid | string | Yes | Triton GPU UUID
_custom_tag | string | Yes | Triton model tag labels (actual column name uses the configured tag name prefixed with _)
memory_type | string | Yes | TensorRT-LLM backend memory type
kv_cache_block_type | string | Yes | TensorRT-LLM backend KV-cache block type
disaggregated_serving_type | string | Yes | TensorRT-LLM backend disaggregated-serving metric type
version | string | Yes | Triton model version
(others) | string | Yes | Any additional Prometheus labels

Label columns vary by endpoint and model. In DuckDB, pass union_by_name=true to read_parquet for cross-file queries.

Note: Prometheus labels that conflict with reserved column names (endpoint_url, metric_name, metric_type, unit, description, timestamp_ns, value, sum, count, bucket_le, bucket_count) are silently excluded.
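
The label-to-column mapping described above can be sketched as follows. This is an illustrative reconstruction from the rules stated in this section, not the exporter's actual code; `label_columns` and `label_values` are hypothetical helper names, and the `count` label in the sample data is made up to show a conflict.

```python
RESERVED = {
    "endpoint_url", "metric_name", "metric_type", "unit", "description",
    "timestamp_ns", "value", "sum", "count", "bucket_le", "bucket_count",
}


def label_columns(samples):
    """Return sorted label column names across all samples, minus reserved names."""
    names = set()
    for labels in samples:
        names.update(k for k in labels if k not in RESERVED)
    return sorted(names)


def label_values(labels, columns):
    """Project one sample's labels onto the columns; absent labels become None."""
    return {c: labels.get(c) for c in columns}


samples = [
    {"model_name": "Qwen/Qwen3-0.6B", "engine": "0"},
    {"model_name": "Qwen/Qwen3-0.6B", "count": "5"},  # 'count' is reserved, so it is dropped
]
columns = label_columns(samples)
print(columns)  # ['engine', 'model_name']
print(label_values(samples[1], columns))  # {'engine': None, 'model_name': 'Qwen/Qwen3-0.6B'}
```

This also shows why every label column is nullable: a row only carries the labels its own metric exposes.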


Row Structure by Metric Type

Column order: fixed columns → label columns (alphabetically) → value columns.

Gauge/Counter: One Row per Timestamp

endpoint_url | metric_name | metric_type | unit | description | timestamp_ns | model_name | value | sum | count | bucket_le | bucket_count
-------------|--------------------------|-------------|-------|-------------|---------------------|--------------|-------|------|-------|-----------|-------------
http://... | vllm:kv_cache_usage_perc | gauge | ratio | KV-cache... | 1765793061967310848 | Qwen/Qwen3-0.6B | 0.72 | null | null | null | null
http://... | vllm:request_success | counter | null | Count of... | 1765793061967310848 | Qwen/Qwen3-0.6B | 150.0 | null | null | null | null
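
Because counter rows hold cumulative deltas, a per-second rate has to be derived by differencing consecutive rows of the same series. The sketch below shows the arithmetic in plain Python; `interval_rates` is a hypothetical helper, and it assumes the points belong to a single series (one endpoint and one label set) sorted by timestamp.

```python
def interval_rates(points):
    """Turn (timestamp_ns, cumulative_value) pairs into per-second rates.

    Returns one (timestamp_ns, rate) pair per interval, stamped at the
    interval's end. Points must be sorted by timestamp.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        seconds = (t1 - t0) / 1e9
        if seconds > 0:  # skip duplicate timestamps
            rates.append((t1, (v1 - v0) / seconds))
    return rates


points = [(0, 0.0), (1_000_000_000, 50.0), (3_000_000_000, 150.0)]
print(interval_rates(points))  # [(1000000000, 50.0), (3000000000, 50.0)]
```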

Histogram: N Rows per Timestamp (One per Bucket)

endpoint_url | metric_name | metric_type | unit | description | timestamp_ns | model_name | value | sum | count | bucket_le | bucket_count
-------------|----------------------------------|-------------|---------|--------------|---------------------|-----------------|-------|--------|-------|-----------|-------------
http://... | vllm:e2e_request_latency_seconds | histogram | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null | 259.87 | 19.0 | 0.3 | 0.0
http://... | vllm:e2e_request_latency_seconds | histogram | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null | 259.87 | 19.0 | 1.0 | 1.0
http://... | vllm:e2e_request_latency_seconds | histogram | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null | 259.87 | 19.0 | 5.0 | 3.0
http://... | vllm:e2e_request_latency_seconds | histogram | seconds | Histogram... | 1765793061967310848 | Qwen/Qwen3-0.6B | null | 259.87 | 19.0 | +Inf | 19.0
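
Since bucket_count is cumulative, the number of observations that landed inside each individual bucket is the difference between adjacent rows. A small sketch using the sample rows above follows; `per_bucket_counts` is an illustrative name, not an AIPerf function.

```python
def per_bucket_counts(rows):
    """Convert cumulative (bucket_le, bucket_count) rows to per-bucket counts.

    Rows must be sorted by ascending upper bound, with '+Inf' last.
    """
    result = []
    previous = 0.0
    for bucket_le, cumulative in rows:
        result.append((bucket_le, cumulative - previous))
        previous = cumulative
    return result


rows = [("0.3", 0.0), ("1.0", 1.0), ("5.0", 3.0), ("+Inf", 19.0)]
print(per_bucket_counts(rows))
# [('0.3', 0.0), ('1.0', 1.0), ('5.0', 2.0), ('+Inf', 16.0)]

# sum and count repeat on every bucket row, so read them from any single row.
mean_latency = 259.87 / 19.0
print(round(mean_latency, 2))  # 13.68
```

In this sample, 16 of the 19 requests took longer than 5 seconds, which is consistent with the mean of roughly 13.7 seconds.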

File Metadata

Parquet file metadata (accessible via pq.read_metadata()) includes:

Key | Description
----|------------
aiperf.schema_version | Schema version (1.0)
aiperf.version | AIPerf version
aiperf.benchmark_id | Unique benchmark UUID
aiperf.exporter | Exporter class name (ServerMetricsParquetExporter)
aiperf.export_timestamp_utc | Export timestamp (ISO 8601)
aiperf.time_filter_start_ns | Profiling period start (nanoseconds)
aiperf.time_filter_end_ns | Profiling period end (nanoseconds)
aiperf.profiling_duration_ns | Profiling duration (nanoseconds)
aiperf.profiling_duration_seconds | Profiling duration (seconds)
aiperf.endpoint_urls | JSON array of endpoint URLs
aiperf.endpoint_count | Number of endpoints
aiperf.label_columns | JSON array of label column names
aiperf.label_count | Number of label columns
aiperf.metric_count | Total unique metrics
aiperf.metric_type_counts | JSON object: {"gauge": N, "counter": N, "histogram": N, "unknown": N}
aiperf.model_names | JSON array of model names
aiperf.concurrency | Benchmark concurrency setting
aiperf.request_rate | Benchmark request rate (if set)
aiperf.input_config | Full user configuration (JSON)
aiperf.hostname | Collection host
aiperf.python_version | Python version
aiperf.pyarrow_version | PyArrow version
aiperf.schema_note | Cross-file query hint

Compression: Snappy (good compression ratio with fast decompression)


Example Queries

DuckDB

-- Time-series for a specific metric
SELECT timestamp_ns, value
FROM 'server_metrics_export.parquet'
WHERE metric_name = 'vllm:kv_cache_usage_perc'
ORDER BY timestamp_ns;

-- Filter by label
SELECT timestamp_ns, value
FROM 'server_metrics_export.parquet'
WHERE metric_name = 'vllm:request_success'
  AND model_name = 'Qwen/Qwen3-0.6B'
ORDER BY timestamp_ns;

-- Histogram bucket distribution at final timestamp
SELECT bucket_le, bucket_count
FROM 'server_metrics_export.parquet'
WHERE metric_name = 'vllm:e2e_request_latency_seconds'
  AND timestamp_ns = (SELECT MAX(timestamp_ns) FROM 'server_metrics_export.parquet'
                      WHERE metric_name = 'vllm:e2e_request_latency_seconds')
ORDER BY CAST(REPLACE(bucket_le, '+Inf', '999999') AS DOUBLE);

-- Aggregate across multiple runs (handles schema differences)
SELECT metric_name, AVG(value) AS avg_value
FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)
WHERE metric_type = 'gauge'
GROUP BY metric_name;

-- Compare endpoints
SELECT endpoint_url, metric_name, AVG(value) AS avg_value
FROM 'server_metrics_export.parquet'
WHERE metric_type = 'gauge'
GROUP BY endpoint_url, metric_name;

pandas

import pandas as pd

df = pd.read_parquet('server_metrics_export.parquet')

# Filter to gauge metrics
gauges = df[df['metric_type'] == 'gauge']

# Time-series plot
kv_usage = df[df['metric_name'] == 'vllm:kv_cache_usage_perc']
kv_usage.plot(x='timestamp_ns', y='value', title='KV Cache Usage')

# Pivot histogram buckets.
# pivot() raises on duplicate (timestamp_ns, bucket_le) pairs, so filter to a
# single endpoint and label set first if the file contains more than one series.
hist = df[df['metric_name'] == 'vllm:e2e_request_latency_seconds']
pivot = hist.pivot(index='timestamp_ns', columns='bucket_le', values='bucket_count')

Polars

import polars as pl

df = pl.read_parquet('server_metrics_export.parquet')

# Filter and aggregate
result = (
    df.filter(pl.col('metric_type') == 'gauge')
    .group_by('metric_name')
    .agg([
        pl.col('value').mean().alias('avg'),
        pl.col('value').max().alias('max'),
    ])
)

# Lazy scan for large files
lazy = pl.scan_parquet('artifacts/*/server_metrics_export.parquet')
result = lazy.filter(pl.col('metric_name') == 'vllm:kv_cache_usage_perc').collect()

Reading Metadata

import pyarrow.parquet as pq
import json

metadata = pq.read_metadata('server_metrics_export.parquet')
schema_metadata = metadata.schema.to_arrow_schema().metadata

# Access specific fields (keys and values are bytes)
benchmark_id = schema_metadata[b'aiperf.benchmark_id'].decode()
config = json.loads(schema_metadata[b'aiperf.input_config'])
label_columns = json.loads(schema_metadata[b'aiperf.label_columns'])
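
PyArrow returns schema metadata as a mapping of bytes to bytes, so decoding every field by hand gets repetitive. One possible convenience wrapper is sketched below. The function `decode_aiperf_metadata` is hypothetical, and it assumes that only values beginning with `[` or `{` are JSON, leaving everything else as a plain string so that version strings such as 1.0 are not turned into floats.

```python
import json


def decode_aiperf_metadata(raw):
    """Decode a bytes-to-bytes metadata mapping into a dict of Python values.

    Keeps only keys starting with 'aiperf.'. Values that look like a JSON
    array or object are parsed; everything else stays a string.
    """
    decoded = {}
    for key, value in (raw or {}).items():
        name = key.decode()
        if not name.startswith("aiperf."):
            continue  # skip metadata written by other tools
        text = value.decode()
        if text[:1] in ("[", "{"):
            try:
                decoded[name] = json.loads(text)
                continue
            except json.JSONDecodeError:
                pass  # not valid JSON after all; fall through to the raw string
        decoded[name] = text
    return decoded


# Hand-written sample standing in for schema_metadata from a real file.
sample = {
    b"aiperf.schema_version": b"1.0",
    b"aiperf.endpoint_urls": b'["http://localhost:8000/metrics"]',
    b"aiperf.exporter": b"ServerMetricsParquetExporter",
    b"other.key": b"{}",
}
info = decode_aiperf_metadata(sample)
print(info["aiperf.endpoint_urls"])  # ['http://localhost:8000/metrics']
```

Numeric fields such as the profiling duration remain strings under this scheme and need an explicit int() or float() where they are used.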

Best Practices

Cross-File Analysis

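Label columns vary by endpoint and model. In DuckDB, always pass union_by_name=true. In pandas, pd.concat aligns columns by name and fills labels missing from a file with NaN: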

-- DuckDB
SELECT * FROM read_parquet('run_*/server_metrics_export.parquet', union_by_name=true);

# pandas
import pandas as pd
from pathlib import Path

dfs = [pd.read_parquet(p) for p in Path('.').glob('run_*/server_metrics_export.parquet')]
combined = pd.concat(dfs, ignore_index=True)
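
To make the effect of a by-name union concrete, here is a dependency-free sketch of the same idea applied to plain row dictionaries. It illustrates the concept only and is not how DuckDB or pandas implement it; `union_by_name` here is a hypothetical Python function, and the sample rows are invented.

```python
def union_by_name(tables):
    """Combine row dicts from several files, filling absent columns with None."""
    columns = []
    for rows in tables:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    return [{c: row.get(c) for c in columns} for rows in tables for row in rows]


run_a = [{"metric_name": "vllm:kv_cache_usage_perc", "engine": "0", "value": 0.72}]
run_b = [{"metric_name": "vllm:kv_cache_usage_perc", "model_name": "Qwen/Qwen3-0.6B", "value": 0.55}]

for row in union_by_name([run_a, run_b]):
    print(row)
```

Each output row carries every column seen in any input, which is why label columns that exist in only one run show up as nulls for the others.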

Histogram Percentile Estimation

Reconstruct percentiles from bucket data. Note that bucket_count values are cumulative (each bucket includes all observations with value <= bucket_le), matching Prometheus histogram semantics:

import numpy as np

def estimate_percentile(bucket_les, bucket_counts, percentile):
    """Estimate a percentile from cumulative histogram buckets using linear interpolation.

    Buckets must be sorted by ascending upper bound, with '+Inf' last.
    """
    # Convert bucket_le strings to floats (handle +Inf)
    bounds = [np.inf if b == '+Inf' else float(b) for b in bucket_les]
    counts = np.asarray(bucket_counts, dtype=float)

    total = counts[-1]  # +Inf bucket has cumulative total
    if total == 0:
        return float('nan')  # no observations during the profiling period
    target = total * (percentile / 100)

    for i, (le, count) in enumerate(zip(bounds, counts)):
        if count >= target:
            if i == 0:
                return le
            prev_le = bounds[i - 1]
            prev_count = counts[i - 1]
            if np.isinf(le):
                # Target falls in the +Inf bucket, which has no upper edge to
                # interpolate toward: return the last finite bound instead
                return prev_le
            # Linear interpolation within bucket
            fraction = (target - prev_count) / (count - prev_count) if count > prev_count else 0.0
            return prev_le + fraction * (le - prev_le)
    return float('nan')  # unreachable for valid cumulative input
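
Interpolating over cumulative buckets only works if the buckets are ordered by ascending upper bound. Because bucket_le is stored as a string, a plain sort orders it lexicographically, which is wrong for values such as 10.0 and 5.0. The sketch below shows one way to sort numerically; `sort_buckets` is an illustrative helper name and the sample pairs are invented.

```python
def sort_buckets(pairs):
    """Sort (bucket_le, bucket_count) pairs by numeric upper bound."""
    # float() accepts '+Inf' directly and returns positive infinity.
    return sorted(pairs, key=lambda pair: float(pair[0]))


pairs = [("+Inf", 19.0), ("10.0", 7.0), ("0.3", 0.0), ("5.0", 3.0)]
print(sort_buckets(pairs))
# [('0.3', 0.0), ('5.0', 3.0), ('10.0', 7.0), ('+Inf', 19.0)]

# A plain string sort puts the buckets in the wrong order:
print(sorted(p[0] for p in pairs))  # ['+Inf', '0.3', '10.0', '5.0']
```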

Memory-Efficient Processing

For large files, use lazy evaluation:

# Polars lazy scan
import polars as pl
df = (
    pl.scan_parquet('server_metrics_export.parquet')
    .filter(pl.col('metric_name') == 'vllm:kv_cache_usage_perc')
    .collect()
)

# DuckDB direct query (doesn't load entire file)
import duckdb
result = duckdb.query("""
    SELECT AVG(value) FROM 'server_metrics_export.parquet'
    WHERE metric_name = 'vllm:kv_cache_usage_perc'
""").fetchone()

Schema Version History

Version | Changes
--------|--------
1.0 | Initial schema with normalized histogram buckets

For aggregated statistics, see JSON Schema. For metric definitions, see Server Metrics Reference. For usage examples, see the Server Metrics Tutorial.