Server Metrics Collection


AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo, etc.).

Quick Reference

| Feature | Description | Default |
|---|---|---|
| Auto-discovery | Automatically finds `/metrics` endpoint on server URL | Enabled |
| Collection | Scrapes metrics every 333ms during benchmark | Enabled |
| Outputs | JSON (aggregated), CSV (tabular), JSONL (time-series), Parquet (cumulative deltas) | JSON + CSV |
| Custom endpoints | `--server-metrics URL [URL...]` for additional endpoints | None |
| Disable | `--no-server-metrics` to turn off collection | Enabled |
| Windowed stats | `--slice-duration SECONDS` for time-sliced analysis | Off |

Key metrics by server:

**vLLM**

| Metric | Type | What to Watch |
|---|---|---|
| `vllm:num_requests_running` | gauge | Active batch size (`stats.avg`) |
| `vllm:num_requests_waiting` | gauge | Queue depth—growing = saturation (`stats.max`) |
| `vllm:kv_cache_usage_perc` | gauge | >0.9 = capacity limit (`stats.max`) |
| `vllm:num_preemptions` | counter | >0 = memory pressure (`stats.total`) |
| `vllm:e2e_request_latency_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `vllm:time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `vllm:inter_token_latency_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `vllm:generation_tokens` | counter | Decode throughput (`stats.rate`) |
**Dynamo**

| Metric | Type | What to Watch |
|---|---|---|
| `dynamo_frontend_inflight_requests` | gauge | Active requests (`stats.avg`) |
| `dynamo_frontend_queued_requests` | gauge | Requests awaiting first token (`stats.avg`) |
| `dynamo_frontend_request_duration_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `dynamo_frontend_time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `dynamo_frontend_inter_token_latency_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `dynamo_frontend_requests` | counter | Throughput (`stats.rate`) |
| `dynamo_component_kvstats_gpu_cache_usage_percent` | gauge | Backend cache usage (`stats.max`) |
**SGLang**

| Metric | Type | What to Watch |
|---|---|---|
| `sglang:num_running_reqs` | gauge | Active batch size (`stats.avg`) |
| `sglang:num_queue_reqs` | gauge | Queue depth—growing = saturation (`stats.max`) |
| `sglang:token_usage` | gauge | >0.9 = capacity limit (`stats.max`) |
| `sglang:cache_hit_rate` | gauge | Prefix cache efficiency (`stats.avg`) |
| `sglang:gen_throughput` | gauge | Real-time tokens/s (`stats.avg`) |
| `sglang:queue_time_seconds` | histogram | Queue wait (`stats.p99_estimate`) |
**TRT-LLM**

| Metric | Type | What to Watch |
|---|---|---|
| `trtllm:e2e_request_latency_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `trtllm:time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `trtllm:time_per_output_token_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `trtllm:request_queue_time_seconds` | histogram | Queue wait (`stats.p99_estimate`) |
| `trtllm:request_success` | counter | Completed requests (`stats.rate`) |

Quick Start

Server metrics are collected by default; just run AIPerf normally:

$aiperf profile \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat \
> --endpoint /v1/chat/completions \
> --url localhost:8000 \
> --concurrency 4 \
> --request-count 100

AIPerf automatically:

  1. Discovers the /metrics endpoint on your inference server (base URL + /metrics)
  2. Tests endpoint reachability before profiling starts
  3. Captures baseline metrics before the warmup period begins (reference point for deltas)
  4. Collects metrics at configurable intervals during warmup and profiling
  5. Performs final scrape after profiling completes (captures end state)
  6. Exports selected formats (default: JSON + CSV):
    • server_metrics_export.json - Aggregated statistics (profiling period only)
    • server_metrics_export.csv - Tabular format (profiling period only)
    • server_metrics_export.jsonl - Time-series data (all scrapes, opt-in only)
    • server_metrics_export.parquet - Raw time-series with delta calculations (opt-in only)

Custom file naming: The --profile-export-prefix (or --profile-export-file) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:

$aiperf profile --model MODEL ... --profile-export-prefix my_benchmark
$# Produces: my_benchmark_server_metrics.json, my_benchmark_server_metrics.csv, etc.
$
$# --profile-export-file is an alias for --profile-export-prefix, so this is equivalent:
$aiperf profile --model MODEL ... --profile-export-file my_benchmark.json
$# Produces the same files (the .json extension is stripped automatically)

Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.
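When the JSONL export is enabled, warmup scrapes can be filtered out manually by timestamp. A minimal sketch (the profiling-start timestamp must come from your own records or the JSON summary; `scrapes_after` is an illustrative helper, not part of AIPerf):

```python
import json

def scrapes_after(path, start_ns):
    """Yield only scrapes taken at or after a nanosecond timestamp,
    e.g. to drop warmup-period samples from the raw time series."""
    with open(path) as f:
        for line in f:
            snap = json.loads(line)
            if snap["timestamp_ns"] >= start_ns:
                yield snap
```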

Format selection: By default, only JSON and CSV formats are generated to avoid large JSONL files. To include JSONL for time-series analysis:

$aiperf profile --model MODEL ... --server-metrics-formats json csv jsonl

Adding Custom Endpoints

$# Single endpoint
$aiperf profile --model MODEL ... --server-metrics http://localhost:8081
$
$# Multiple endpoints (distributed deployment)
$aiperf profile --model MODEL ... --server-metrics \
> http://node1:8081 \
> http://node2:8081

Disabling Server Metrics

$aiperf profile --model MODEL ... --no-server-metrics

Selecting Output Formats

$# Default: JSON + CSV only
$aiperf profile --model MODEL ...
$
$# Add time-series formats as needed
$aiperf profile --model MODEL ... --server-metrics-formats json csv parquet
$aiperf profile --model MODEL ... --server-metrics-formats json csv jsonl parquet

| Format | Use Case | Size |
|---|---|---|
| JSON/CSV (default) | Summary statistics, CI/CD thresholds | Small |
| Parquet | SQL queries, pandas/DuckDB analytics | Compressed |
| JSONL | Debugging, raw Prometheus snapshots | 10-100x larger |

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `AIPERF_SERVER_METRICS_COLLECTION_INTERVAL` | 0.333s | Collection frequency (333ms, ~3Hz) |
| `AIPERF_SERVER_METRICS_COLLECTION_FLUSH_PERIOD` | 2.0s | Wait time for final metrics after benchmark |
| `AIPERF_SERVER_METRICS_REACHABILITY_TIMEOUT` | 10s | Timeout for endpoint reachability tests |
| `AIPERF_SERVER_METRICS_EXPORT_BATCH_SIZE` | 100 | Batch size for JSONL writer |
| `AIPERF_SERVER_METRICS_SHUTDOWN_DELAY` | 5.0s | Shutdown delay for command response transmission |

Output Files

The filenames below are defaults. When --profile-export-prefix <prefix> is used, server metrics files are named <prefix>_server_metrics.{json,csv,jsonl,parquet} (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (--artifact-directory, default: ./artifacts/<run_info>).

1. Time-Series: server_metrics_export.jsonl

Line-delimited JSON with metrics snapshots over time:

{
  "endpoint_url": "http://localhost:8000/metrics",
  "timestamp_ns": 1763591215220757503,
  "endpoint_latency_ns": 719764167,
  "metrics": {
    "vllm:num_requests_running": [{"value": 12.0}],
    "vllm:kv_cache_usage_perc": [{"value": 0.72}],
    "vllm:request_success": [{"value": 1500.0}],
    "vllm:time_to_first_token_seconds": [{
      "buckets": {"0.01": 145.0, "0.1": 1498.0, "+Inf": 1500.0},
      "sum": 32.456,
      "count": 1500.0
    }]
  },
  "request_sent_ns": 1763591214500993336,
  "first_byte_ns": 1763591215220757503
}

Fields:

  • endpoint_url: Source Prometheus endpoint
  • timestamp_ns: Collection timestamp in nanoseconds
  • endpoint_latency_ns: HTTP round-trip time in nanoseconds
  • metrics: All metrics from this endpoint
    • Counter/Gauge: {"value": N} or {"labels": {...}, "value": N}
    • Histogram: {"buckets": {"le": count}, "sum": N, "count": N} with optional labels
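
For downstream analysis, the JSONL lines can be folded into per-metric time series with a few lines of Python. A sketch assuming the field names shown above (`load_series` is an illustrative helper, not part of AIPerf):

```python
import json

def load_series(path, metric):
    """Collect (timestamp_ns, value) pairs for one counter/gauge metric
    from a server_metrics_export.jsonl file. Histogram samples (which
    have buckets instead of a "value" key) are skipped."""
    points = []
    with open(path) as f:
        for line in f:
            snap = json.loads(line)
            for sample in snap.get("metrics", {}).get(metric, []):
                if "value" in sample:
                    points.append((snap["timestamp_ns"], sample["value"]))
    return points
```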

2. Aggregated Statistics: server_metrics_export.json

Aggregated statistics from the profiling period. Metrics from all endpoints are merged, with each series tagged with its endpoint_url.

{
  "schema_version": "1.0",
  "aiperf_version": "0.3.0",
  "benchmark_id": "2900a136-3c1a-4520-adaa-5719822b729b",
  "summary": {
    "endpoints_configured": ["http://localhost:8000/metrics"],
    "endpoints_successful": ["http://localhost:8000/metrics"],
    "start_time": "2025-12-15T02:04:23.028529",
    "end_time": "2025-12-15T02:05:15.294690",
    "endpoint_info": {
      "http://localhost:8000/metrics": {
        "total_fetches": 157,
        "first_fetch_ns": 1765793061967310848,
        "last_fetch_ns": 1765793114960054143,
        "avg_fetch_latency_ms": 246.83,
        "unique_updates": 157,
        "first_update_ns": 1765793061967310848,
        "last_update_ns": 1765793114960054143,
        "duration_seconds": 52.99,
        "avg_update_interval_ms": 339.70,
        "median_update_interval_ms": 333.48
      }
    }
  },
  "metrics": {
    "vllm:kv_cache_usage_perc": {
      "type": "gauge",
      "description": "KV-cache usage. 1 means 100 percent usage.",
      "unit": "ratio",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "avg": 0.191, "min": 0.0, "max": 0.202, "std": 0.038,
          "p1": 0.003, "p5": 0.178, "p10": 0.191, "p25": 0.198,
          "p50": 0.202, "p75": 0.202, "p90": 0.202, "p95": 0.202, "p99": 0.202
        },
        "timeslices": [
          { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "avg": 0.107, "min": 0.0, "max": 0.191 },
          { "start_ns": 1765793068028529452, "end_ns": 1765793073028529452, "avg": 0.192, "min": 0.191, "max": 0.194 }
        ]
      }]
    },
    "vllm:request_success": {
      "type": "counter",
      "description": "Count of successfully processed requests.",
      "unit": "requests",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "finished_reason": "length", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "total": 19.0, "rate": 0.359,
          "rate_avg": 0.38, "rate_min": 0.0, "rate_max": 1.8, "rate_std": 0.751
        },
        "timeslices": [
          { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "total": 0.0, "rate": 0.0 },
          { "start_ns": 1765793073028529452, "end_ns": 1765793078028529452, "total": 9.0, "rate": 1.8 }
        ]
      }]
    },
    "vllm:e2e_request_latency_seconds": {
      "type": "histogram",
      "description": "Histogram of e2e request latency in seconds.",
      "unit": "seconds",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "count": 19, "sum": 259.87, "avg": 13.68,
          "count_rate": 0.359, "sum_rate": 4.90,
          "p1_estimate": 2.25, "p5_estimate": 5.77, "p10_estimate": 8.26,
          "p25_estimate": 10.82, "p50_estimate": 13.75, "p75_estimate": 15.35,
          "p90_estimate": 17.24, "p95_estimate": 19.51, "p99_estimate": 31.77
        },
        "buckets": {
          "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 1, "5.0": 1,
          "10.0": 3, "15.0": 11, "20.0": 18, "30.0": 18, "+Inf": 19
        },
        "timeslices": [
          {
            "start_ns": 1765793063028529452, "end_ns": 1765793068028529452,
            "count": 0, "sum": 0.0, "avg": 0.0,
            "buckets": { "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 0, "5.0": 0, "10.0": 0, "15.0": 0, "20.0": 0, "+Inf": 0 }
          }
        ]
      }]
    }
  },
  "input_config": {
    "endpoint": { "model_names": ["Qwen/Qwen3-0.6B"], "streaming": true },
    "loadgen": { "concurrency": 400, "request_rate": 5000.0, "request_count": 30000 },
    "output": { "slice_duration": 5.0 }
  }
}

Query with jq:

$jq '.metrics["vllm:e2e_request_latency_seconds"].series[0].stats.p99_estimate' server_metrics_export.json

3. CSV Export: server_metrics_export.csv

Tabular export organized in four sections (separated by blank lines): gauge, counter, histogram, info.

  • Labels expanded into individual columns for easy filtering/pivoting
  • Open directly in Excel/Sheets or load with pandas
from io import StringIO
import pandas as pd

with open("server_metrics_export.csv") as f:
    sections = [pd.read_csv(StringIO(s)) for s in f.read().strip().split('\n\n') if s.strip()]

4. Parquet Export: server_metrics_export.parquet

Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.

Schema overview:

| Column | Type | Description |
|---|---|---|
| `endpoint_url` | string | Source Prometheus endpoint |
| `metric_name` | string | Metric name |
| `metric_type` | string | `gauge`, `counter`, or `histogram` |
| `timestamp_ns` | int64 | Collection timestamp (nanoseconds) |
| `value` | float64 | Gauge/counter value (delta for counters) |
| `sum`, `count` | float64 | Histogram sum/count deltas |
| `bucket_le`, `bucket_count` | string, float64 | Histogram bucket bound and delta count |
| (label columns) | string | Dynamic columns from Prometheus labels |

See Parquet Schema Reference for complete schema, metadata, and query examples.

Quick examples:

$# DuckDB queries
$duckdb -c "SELECT * FROM 'server_metrics_export.parquet' WHERE metric_name LIKE 'vllm:%' ORDER BY timestamp_ns"
$duckdb -c "SELECT metric_name, AVG(value) FROM '*.parquet' WHERE metric_type='gauge' GROUP BY metric_name"
$
$# Combine multiple runs (handles schema differences)
$duckdb -c "SELECT * FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)"
import pandas as pd
df = pd.read_parquet('server_metrics_export.parquet')
df[df['metric_name'] == 'vllm:kv_cache_usage_perc'].plot(x='timestamp_ns', y='value')

Statistics by Metric Type

Now that you understand the output formats, let’s examine how statistics are structured within each metric type.

Statistics are nested under a stats field within each series item. All metrics use the stats format for consistent API access.

Gauge (point-in-time values)

Statistics: avg, min, max, std, p1, p5, p10, p25, p50, p75, p90, p95, p99

Gauge percentiles are computed from actual collected samples (not estimated from buckets).
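
Sample-based percentiles of this kind can be sketched with standard linear interpolation between ranks (an illustration only; AIPerf's exact interpolation method is not specified here):

```python
def gauge_percentile(samples, p):
    """Percentile p (0-100) over raw gauge samples, using linear
    interpolation between the two nearest ranks."""
    xs = sorted(samples)
    if not xs:
        raise ValueError("no samples")
    k = (len(xs) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)
```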

Counter (cumulative totals)

Statistics: total, rate, and when --slice-duration is set: rate_avg, rate_min, rate_max, rate_std

  • total: Change during profiling period (uses last pre-profiling sample as reference)
  • rate: Increase per second (total/duration)
  • Counter resets are detected and handled (negative deltas → total = 0)
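
The delta-and-clamp rule above can be sketched as follows (a simplification for illustration; AIPerf's actual reset detection may operate per scrape rather than only on the endpoints of the period):

```python
def counter_total_and_rate(first, last, duration_s):
    """Change in a cumulative counter over the profiling period.
    A negative delta signals a counter reset and is clamped to 0,
    per the handling described above."""
    total = max(last - first, 0.0)
    rate = total / duration_s if duration_s > 0 else 0.0
    return total, rate
```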

Histogram (distributions)

Statistics (stats): count, count_rate, sum, sum_rate, avg, p1_estimate, p5_estimate, p10_estimate, p25_estimate, p50_estimate, p75_estimate, p90_estimate, p95_estimate, p99_estimate

Series-level field: buckets (per-bucket delta counts, not cumulative)

  • avg (sum/count) is exact
  • Percentiles are estimates from bucket interpolation
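
Bucket interpolation of the kind behind the `*_estimate` fields can be sketched like this (an illustration in the spirit of Prometheus `histogram_quantile`, not AIPerf's exact implementation; the lowest bucket is assumed to start at 0):

```python
def histogram_percentile(buckets, p):
    """Estimate percentile p from per-period (non-cumulative) bucket
    counts keyed by upper bound, with "+Inf" as the last bucket.
    Interpolates linearly inside the bucket containing the target rank."""
    bounds = sorted(
        ((float("inf") if le == "+Inf" else float(le), c)
         for le, c in buckets.items()),
        key=lambda b: b[0],
    )
    total = sum(c for _, c in bounds)
    if total == 0:
        return None
    target = total * p / 100.0
    cum, lower = 0.0, 0.0
    for le, count in bounds:
        if cum + count >= target:
            if le == float("inf"):
                return lower  # cannot interpolate into the +Inf bucket
            frac = (target - cum) / count if count else 0.0
            return lower + (le - lower) * frac
        cum += count
        lower = le
    return lower
```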

Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Major LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo) use Histograms instead, which allow period-specific percentile estimation.

Timesliced Statistics

When configured with --slice-duration, AIPerf computes windowed statistics over fixed time intervals. Each series includes a timeslices array with per-window statistics:

{
  "stats": { "avg": 25.5, "min": 0.0, "max": 50.0 },
  "timeslices": [
    { "start_ns": 1765615837721140145, "end_ns": 1765615839721140145, "avg": 22.9, "min": 0.0, "max": 42.0 },
    { "start_ns": 1765615839721140145, "end_ns": 1765615841721140145, "avg": 49.8, "min": 49.0, "max": 50.0 }
  ]
}

  • Gauges: Each timeslice contains avg, min, max
  • Counters: Each timeslice contains total, rate
  • Histograms: Each timeslice contains count, sum, avg, buckets

Partial timeslices (at the end of the collection period) are marked with is_complete: false and excluded from aggregate statistics (e.g., rate_avg, rate_min) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.
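
The windowing described above can be sketched as follows (a simplified illustration of the `is_complete` marking, not AIPerf's implementation):

```python
def make_timeslices(start_ns, end_ns, slice_duration_s):
    """Split [start_ns, end_ns) into fixed windows of slice_duration_s.
    A trailing window that the collection period does not fill is
    truncated and marked is_complete=False."""
    step = int(slice_duration_s * 1e9)
    slices, t = [], start_ns
    while t < end_ns:
        slice_end = t + step
        slices.append({
            "start_ns": t,
            "end_ns": min(slice_end, end_ns),
            "is_complete": slice_end <= end_ns,
        })
        t = slice_end
    return slices
```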


Labeled Metrics

Prometheus metrics with labels (e.g., model, status) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by endpoint_url.

Unit Inference

AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (_seconds, _bytes, _requests, etc.). Units appear in both JSON and CSV exports. The unit field is optional—if no unit can be inferred, it’s omitted.
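
Suffix-based inference can be sketched with a small lookup table (the suffix map below is illustrative of the Prometheus convention, not AIPerf's actual table; AIPerf also consults metric descriptions):

```python
# Common Prometheus metric-name suffixes and the units they imply.
SUFFIX_UNITS = {
    "_seconds": "seconds",
    "_bytes": "bytes",
    "_requests": "requests",
    "_tokens": "tokens",
}

def infer_unit(metric_name):
    """Return a unit inferred from the metric-name suffix, or None
    when no convention matches (the unit field is then omitted)."""
    for suffix, unit in SUFFIX_UNITS.items():
        if metric_name.endswith(suffix):
            return unit
    return None
```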

Common Metrics by Server

vLLM

| Metric | Type | Description |
|---|---|---|
| `vllm:num_requests_running` | gauge | Requests in execution batches |
| `vllm:num_requests_waiting` | gauge | Requests in queue (saturation indicator) |
| `vllm:kv_cache_usage_perc` | gauge | KV-cache usage (0.0-1.0, >0.9 = capacity limit) |
| `vllm:num_preemptions` | counter | Requests preempted due to memory pressure |
| `vllm:prefix_cache_hits` | counter | Tokens served from prefix cache |
| `vllm:prefix_cache_queries` | counter | Tokens queried (hit_rate = hits/queries) |
| `vllm:time_to_first_token_seconds` | histogram | Time to first token (TTFT) |
| `vllm:e2e_request_latency_seconds` | histogram | End-to-end latency |
| `vllm:inter_token_latency_seconds` | histogram | Time between output tokens (ITL) |
| `vllm:request_queue_time_seconds` | histogram | Time spent waiting in queue |
| `vllm:request_prefill_time_seconds` | histogram | Time spent in prefill phase |
| `vllm:request_decode_time_seconds` | histogram | Time spent in decode phase |
| `vllm:request_success` | counter | Completed requests |
| `vllm:prompt_tokens` | counter | Total prompt tokens (rate = prefill throughput) |
| `vllm:generation_tokens` | counter | Total generated tokens (rate = decode throughput) |

Dynamo

| Metric | Type | Description |
|---|---|---|
| `dynamo_frontend_requests` | counter | Requests by endpoint/model/status |
| `dynamo_frontend_inflight_requests` | gauge | Requests currently processing |
| `dynamo_frontend_queued_requests` | gauge | Requests awaiting first token |
| `dynamo_frontend_request_duration_seconds` | histogram | End-to-end HTTP latency |
| `dynamo_frontend_time_to_first_token_seconds` | histogram | TTFT including routing overhead |
| `dynamo_frontend_inter_token_latency_seconds` | histogram | Inter-token latency (ITL) |
| `dynamo_frontend_input_sequence_tokens` | histogram | Prompt token distribution |
| `dynamo_frontend_output_sequence_tokens` | histogram | Response token distribution |
| `dynamo_component_requests` | counter | Per-component (prefill/decode) requests |
| `dynamo_component_request_duration_seconds` | histogram | Per-component processing time |
| `dynamo_component_inflight_requests` | gauge | Active requests per worker |
| `dynamo_component_errors` | counter | Errors by component/type |
| `dynamo_component_kvstats_gpu_cache_usage_percent` | gauge | Backend KV-cache usage |

SGLang

| Metric | Type | Description |
|---|---|---|
| `sglang:num_running_reqs` | gauge | Running requests |
| `sglang:num_queue_reqs` | gauge | Queued requests (saturation indicator) |
| `sglang:token_usage` | gauge | Memory utilization (>0.9 = capacity limit) |
| `sglang:cache_hit_rate` | gauge | Prefix cache hit rate |
| `sglang:gen_throughput` | gauge | Real-time generation tokens/s |
| `sglang:queue_time_seconds` | histogram | Queue wait time |
| `sglang:per_stage_req_latency_seconds` | histogram | Latency by stage (prefill_/decode_) |

TRT-LLM

| Metric | Type | Description |
|---|---|---|
| `trtllm:time_to_first_token_seconds` | histogram | Time to first token (TTFT) |
| `trtllm:e2e_request_latency_seconds` | histogram | End-to-end latency |
| `trtllm:time_per_output_token_seconds` | histogram | Per-token generation time (ITL) |
| `trtllm:request_queue_time_seconds` | histogram | Time in WAITING phase |
| `trtllm:request_success` | counter | Completed requests |

Troubleshooting

| Problem | Check | Solution |
|---|---|---|
| High p99, good p50 | `vllm:num_requests_waiting` spikes | Queue buildup—reduce concurrency or increase server capacity |
| OOM crashes | `vllm:kv_cache_usage_perc` approaching 1.0 | Reduce max_model_len or increase gpu_memory_utilization |
| Low throughput | `vllm:num_requests_running` vs `vllm:num_requests_waiting` | Both low = client bottleneck; high waiting = server bottleneck |
| Endpoint unreachable | `curl http://localhost:8000/metrics` | Check server running, network, firewall; use explicit `--server-metrics URL` |

CI/CD Integration

import json

with open('server_metrics_export.json') as f:
    data = json.load(f)

latency = data['metrics']['vllm:e2e_request_latency_seconds']['series'][0]['stats']
assert latency['p99_estimate'] < 5.0, f"P99 latency too high: {latency['p99_estimate']}"