Server Metrics Collection | NVIDIA AIPerf Documentation

AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers and serving frontends (vLLM, SGLang, TRT-LLM, Dynamo, Triton, etc.).

Quick Reference

Feature	Description	Default
Auto-discovery	Automatically finds `/metrics` endpoint on server URL	Enabled
Collection	Scrapes metrics every 333ms during benchmark	Enabled
Outputs	JSON (aggregated), CSV (tabular), JSONL (time-series), Parquet (cumulative deltas)	JSON + CSV + Parquet
Custom endpoints	`--server-metrics URL [URL...]` for additional endpoints	None
Disable	`--no-server-metrics` to turn off collection	Enabled
Windowed stats	`--slice-duration SECONDS` for time-sliced analysis	Off

Key metrics by server:

vLLM

Metric	Type	What to Watch
`vllm:num_requests_running`	gauge	Active batch size (`stats.avg`)
`vllm:num_requests_waiting`	gauge	Queue depth—growing = saturation (`stats.max`)
`vllm:num_requests_waiting_by_reason`	gauge	Queue depth split into `capacity` and `deferred` (`stats.max`)
`vllm:kv_cache_usage_perc`	gauge	>0.9 = capacity limit (`stats.max`)
`vllm:num_preemptions`	counter	>0 = memory pressure (`stats.total`)
`vllm:e2e_request_latency_seconds`	histogram	E2E latency (`stats.p99_estimate`)
`vllm:time_to_first_token_seconds`	histogram	TTFT (`stats.p99_estimate`)
`vllm:inter_token_latency_seconds`	histogram	ITL (`stats.p99_estimate`)
`vllm:prompt_tokens_by_source`	counter	Prompt-token source mix (`source` label)
`vllm:generation_tokens`	counter	Decode throughput (`stats.rate`)

Dynamo

Metric	Type	What to Watch
`dynamo_frontend_active_requests`	gauge	HTTP handler active requests (`stats.avg`)
`dynamo_frontend_inflight_requests`	gauge	Engine-bound active requests (`stats.avg`)
`dynamo_frontend_queued_requests`	gauge	HTTP requests awaiting first token (`stats.avg`)
`dynamo_frontend_request_duration_seconds`	histogram	E2E latency (`stats.p99_estimate`)
`dynamo_frontend_time_to_first_token_seconds`	histogram	TTFT (`stats.p99_estimate`)
`dynamo_frontend_inter_token_latency_seconds`	histogram	ITL (`stats.p99_estimate`)
`dynamo_frontend_requests`	counter	Completed request throughput (`stats.rate`)
`dynamo_frontend_output_tokens`	counter	Decode throughput (`stats.rate`)
`dynamo_component_gpu_cache_usage_percent`	gauge	Backend cache usage (`stats.max`)

SGLang

Metric	Type	What to Watch
`sglang:num_running_reqs`	gauge	Active batch size (`stats.avg`)
`sglang:num_queue_reqs`	gauge	Queue depth—growing = saturation (`stats.max`)
`sglang:token_usage`	gauge	>0.9 = capacity limit (`stats.max`)
`sglang:cache_hit_rate`	gauge	Prefix cache efficiency (`stats.avg`)
`sglang:gen_throughput`	gauge	Real-time tokens/s (`stats.avg`)
`sglang:time_to_first_token_seconds`	histogram	TTFT (`stats.p99_estimate`)
`sglang:inter_token_latency_seconds`	histogram	ITL (`stats.p99_estimate`)
`sglang:e2e_request_latency_seconds`	histogram	E2E latency (`stats.p99_estimate`)
`sglang:queue_time_seconds`	histogram	Queue wait (`stats.p99_estimate`)
`sglang:prompt_tokens`	counter	Prefill throughput (`stats.rate`)
`sglang:generation_tokens`	counter	Decode throughput (`stats.rate`)

SGLang server-side setup is required. SGLang does not expose Prometheus exposition format at /metrics by default. Launch the server with --enable-metrics:

$ python -m sglang.launch_server \
>   --model-path <model> \
>   --enable-metrics \
>   --host 0.0.0.0 --port 8000

Without this flag, GET /metrics returns 404 and AIPerf’s auto-discovery silently falls back to no server-metrics collection (server_metrics_export.* files will not be produced).

TRT-LLM

Metric	Type	What to Watch
`trtllm_e2e_request_latency_seconds`	histogram	E2E latency (`stats.p99_estimate`)
`trtllm_time_to_first_token_seconds`	histogram	TTFT (`stats.p99_estimate`)
`trtllm_time_per_output_token_seconds`	histogram	ITL (`stats.p99_estimate`)
`trtllm_request_queue_time_seconds`	histogram	Queue wait (`stats.p99_estimate`)
`trtllm_request_prefill_time_seconds`	histogram	Prefill duration (`stats.p99_estimate`)
`trtllm_request_decode_time_seconds`	histogram	Decode duration (`stats.p99_estimate`)
`trtllm_request_success`	counter	Completed requests (`stats.rate`)
`trtllm_prompt_tokens`	counter	Prefill throughput (`stats.rate`)
`trtllm_generation_tokens`	counter	Decode throughput (`stats.rate`)
`trtllm_num_requests_running`	gauge	Active requests (`stats.avg`)
`trtllm_num_requests_waiting`	gauge	Queued requests (`stats.max`)
`trtllm_kv_cache_utilization`	gauge	KV cache usage (`stats.max`)
`trtllm_kv_cache_hit_rate`	gauge	KV cache reuse efficiency (`stats.avg`)

TRT-LLM server-side setup is required. Unlike vLLM, trtllm-serve does not expose Prometheus exposition format at /metrics by default — the default /metrics returns an iteration-stats JSON array (application/json), which is not parseable as Prometheus. Two consequences:

Enable Prometheus on the server. Pass return_perf_metrics: true in your extra_llm_api_options.yaml. This mounts the proper Prometheus exposition at /prometheus/metrics (a non-standard path). Add enable_iter_perf_stats: true when you want iteration-derived queue/KV/memory metrics from the PyTorch backend.
AIPerf auto-detects and falls back. When AIPerf hits /metrics and gets application/json, it automatically probes <base>/prometheus/metrics once. If the alt path serves Prometheus, AIPerf swaps the URL and continues — no manual override needed. If the alt path also fails (e.g. return_perf_metrics was not set), the collector auto-disables for the remainder of the run with a single warning.

Example extra_llm_api_options.yaml snippet:

1 return_perf_metrics: true
2 enable_iter_perf_stats: true

Triton Inference Server

Metric	Type	What to Watch
`nv_inference_request_success`	counter	Successful request throughput (`stats.rate`)
`nv_inference_request_failure`	counter	Failed requests by `reason` (`stats.total`)
`nv_inference_count`	counter	Inference throughput and average batch size numerator (`stats.rate`)
`nv_inference_exec_count`	counter	Execution throughput and average batch size denominator (`stats.rate`)
`nv_inference_pending_request_count`	gauge	Backend queue depth (`stats.max`)
`nv_inference_request_duration_us`	counter	Cumulative E2E request time (`stats.total`, microseconds)
`nv_inference_queue_duration_us`	counter	Cumulative queue time (`stats.total`, microseconds)
`nv_inference_first_response_histogram_ms`	histogram	First-response latency when histogram latencies are enabled (`stats.p99_estimate`)
`nv_gpu_utilization`	gauge	GPU utilization (`stats.avg`)
`nv_gpu_memory_used_bytes`	gauge	GPU memory pressure (`stats.max`)
`nv_cache_num_hits_per_model`	counter	Response-cache hits (`stats.total`)
`nv_cache_num_misses_per_model`	counter	Response-cache misses (`stats.total`)

Triton serves Prometheus metrics at http://localhost:8002/metrics by default, not on the inference HTTP port. Use --server-metrics http://HOST:8002/metrics when the inference URL and metrics URL differ. Triton latency summaries are ignored by AIPerf; enable --metrics-config histogram_latencies=true for first-response histogram percentiles.

Quick Start

Server metrics are collected by default - just run AIPerf normally:

$ aiperf profile \
>     --model Qwen/Qwen3-0.6B \
>     --endpoint-type chat \
>     --endpoint /v1/chat/completions \
>     --url localhost:8000 \
>     --concurrency 4 \
>     --request-count 100

AIPerf automatically:

Discovers the /metrics endpoint on your inference server (base URL + /metrics)
Tests endpoint reachability before profiling starts
Captures baseline metrics before warmup period begins (reference point for deltas) — also where AIPerf first parses the response and validates it as Prometheus exposition format; see Compatibility & auto-disable for what happens when an endpoint returns non-Prometheus content
Collects metrics at configurable intervals during warmup and profiling
Performs final scrape after profiling completes (captures end state)
Exports selected formats (default: JSON + CSV + Parquet):
- server_metrics_export.json - Aggregated statistics (profiling period only)
- server_metrics_export.csv - Tabular format (profiling period only)
- server_metrics_export.parquet - Raw time-series with delta calculations
- server_metrics_export.jsonl - Time-series data (all scrapes, opt-in only)

Custom file naming: The --profile-export-prefix (or --profile-export-file) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:

$ aiperf profile --model MODEL ... --profile-export-prefix my_benchmark
$ # Produces: my_benchmark_server_metrics.json, my_benchmark_server_metrics.csv, etc.
$ 
$ # --profile-export-file is an alias for --profile-export-prefix, so this is equivalent:
$ aiperf profile --model MODEL ... --profile-export-file my_benchmark.json
$ # Produces the same files (the .json extension is stripped automatically)

Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.

Format selection: By default, JSON, CSV, and Parquet formats are generated (JSONL is opt-in to avoid large files). To opt out of Parquet, or to include JSONL for time-series analysis:

$ # Disable Parquet (JSON + CSV only)
$ aiperf profile --model MODEL ... --server-metrics-formats json csv
$ 
$ # Add JSONL for raw time-series snapshots
$ aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl

Adding Custom Endpoints

$ # Single endpoint
$ aiperf profile --model MODEL ... --server-metrics http://localhost:8081
$ 
$ # Multiple endpoints (distributed deployment)
$ aiperf profile --model MODEL ... --server-metrics \
>     http://node1:8081 \
>     http://node2:8081

Disabling Server Metrics

$ aiperf profile --model MODEL ... --no-server-metrics

Selecting Output Formats

$ # Default: JSON + CSV + Parquet
$ aiperf profile --model MODEL ...
$ 
$ # Opt out of Parquet (JSON + CSV only)
$ aiperf profile --model MODEL ... --server-metrics-formats json csv
$ 
$ # Add JSONL for raw time-series snapshots
$ aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl

Format	Use Case	Size
JSON/CSV (default)	Summary statistics, CI/CD thresholds	Small
Parquet (default)	SQL queries, pandas/DuckDB analytics	Compressed
JSONL (opt-in)	Debugging, raw Prometheus snapshots	10-100x larger

Compatibility & auto-disable

AIPerf scrapes /metrics at ~3 Hz and parses the response as Prometheus exposition format. When a server speaks something else at that path (most commonly TRT-LLM, which serves an iteration-stats JSON array), AIPerf does not retry-and-spam — it detects the mismatch on the first scrape and disables collection for that endpoint with a single log line. This avoids the failure mode where parse errors at the scrape interval inflate run time by 10×+.

Detection. A response is treated as non-Prometheus when either:

the HTTP Content-Type is application/json (the response body is never read in this case — the rejection is cheaper than parsing); or
the body fails to parse as Prometheus exposition format (prometheus_client.parser.text_string_to_metric_families raises ValueError — e.g. a server returns text/plain with garbage, or a JSON body without a content-type).

TRT-LLM /prometheus/metrics fallback. Before disabling, AIPerf probes <base>/prometheus/metrics exactly once — TRT-LLM mounts the proper Prometheus path there when launched with return_perf_metrics: true (see the TRT-LLM entry in the Quick Reference table above). If the probe succeeds, the collector swaps its URL there and the run continues with the alt endpoint. The probe is attempted whenever the configured URL ends with /metrics and is not already /prometheus/metrics itself — so /metrics, /v1/metrics, and /api/metrics all trigger the fallback probe. URLs that don’t end in /metrics (e.g. /stats, /telemetry) are left untouched, and /prometheus/metrics is excluded to avoid probing the same path it would swap to.

On auto-disable. A single WARNING is emitted naming the endpoint and the suppression flag. Subsequent scrape cycles short-circuit, the collector emits no further log noise, and the rest of the benchmark proceeds normally — other configured endpoints (DCGM telemetry, additional --server-metrics URLs) are unaffected.

WARNING  Disabling server metrics collection for http://127.0.0.1:60000/metrics:
         endpoint 'http://127.0.0.1:60000/metrics' returned non-Prometheus
         content-type 'application/json'; expected text/plain (Prometheus
         exposition format). To suppress this warning, pass --no-server-metrics.

To suppress the warning entirely, pass --no-server-metrics — collection is skipped, no probe is attempted, no warning is logged.

Configuration

Environment Variable	Default	Description
`AIPERF_SERVER_METRICS_COLLECTION_INTERVAL`	0.333s	Collection frequency (333ms, ~3Hz)
`AIPERF_SERVER_METRICS_COLLECTION_FLUSH_PERIOD`	2.0s	Wait time for final metrics after benchmark
`AIPERF_SERVER_METRICS_REACHABILITY_TIMEOUT`	10s	Timeout for endpoint reachability tests
`AIPERF_SERVER_METRICS_EXPORT_BATCH_SIZE`	100	Batch size for JSONL writer
`AIPERF_SERVER_METRICS_SHUTDOWN_DELAY`	5.0s	Shutdown delay for command response transmission

Output Files

The filenames below are defaults. When --profile-export-prefix <prefix> is used, server metrics files are named <prefix>_server_metrics.{json,csv,jsonl,parquet} (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (--artifact-dir / --output-artifact-dir, default: ./artifacts/<run_info>).

1. Time-Series: `server_metrics_export.jsonl`

Line-delimited JSON with metrics snapshots over time:

1 {
2   "endpoint_url": "http://localhost:8000/metrics",
3   "timestamp_ns": 1763591215220757503,
4   "endpoint_latency_ns": 719764167,
5   "metrics": {
6     "vllm:num_requests_running": [{"value": 12.0}],
7     "vllm:kv_cache_usage_perc": [{"value": 0.72}],
8     "vllm:request_success": [{"value": 1500.0}],
9     "vllm:time_to_first_token_seconds": [{
10       "buckets": {"0.01": 145.0, "0.1": 1498.0, "+Inf": 1500.0},
11       "sum": 32.456,
12       "count": 1500.0
13     }]
14   },
15   "request_sent_ns": 1763591214500993336,
16   "first_byte_ns": 1763591215220757503
17 }

Fields:

endpoint_url: Source Prometheus endpoint
timestamp_ns: Collection timestamp in nanoseconds
endpoint_latency_ns: HTTP round-trip time in nanoseconds
metrics: All metrics from this endpoint
- Counter/Gauge: {"value": N} or {"labels": {...}, "value": N}
- Histogram: {"buckets": {"le": count}, "sum": N, "count": N} with optional labels

2. Aggregated Statistics: `server_metrics_export.json`

Aggregated statistics from profiling period. Metrics from all endpoints are merged, each series tagged with endpoint_url.

1 {
2   "schema_version": "1.0",
3   "aiperf_version": "0.3.0",
4   "benchmark_id": "2900a136-3c1a-4520-adaa-5719822b729b",
5   "summary": {
6     "endpoints_configured": ["http://localhost:8000/metrics"],
7     "endpoints_successful": ["http://localhost:8000/metrics"],
8     "start_time": "2025-12-15T02:04:23.028529",
9     "end_time": "2025-12-15T02:05:15.294690",
10     "endpoint_info": {
11       "http://localhost:8000/metrics": {
12         "total_fetches": 157,
13         "first_fetch_ns": 1765793061967310848,
14         "last_fetch_ns": 1765793114960054143,
15         "avg_fetch_latency_ms": 246.83,
16         "unique_updates": 157,
17         "first_update_ns": 1765793061967310848,
18         "last_update_ns": 1765793114960054143,
19         "duration_seconds": 52.99,
20         "avg_update_interval_ms": 339.70,
21         "median_update_interval_ms": 333.48
22       }
23     }
24   },
25   "metrics": {
26     "vllm:kv_cache_usage_perc": {
27       "type": "gauge",
28       "description": "KV-cache usage. 1 means 100 percent usage.",
29       "unit": "ratio",
30       "series": [{
31         "endpoint_url": "http://localhost:8000/metrics",
32         "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
33         "stats": {
34           "avg": 0.191, "min": 0.0, "max": 0.202, "std": 0.038,
35           "p1": 0.003, "p5": 0.178, "p10": 0.191, "p25": 0.198,
36           "p50": 0.202, "p75": 0.202, "p90": 0.202, "p95": 0.202, "p99": 0.202
37         },
38         "timeslices": [
39           { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "avg": 0.107, "min": 0.0, "max": 0.191 },
40           { "start_ns": 1765793068028529452, "end_ns": 1765793073028529452, "avg": 0.192, "min": 0.191, "max": 0.194 }
41         ]
42       }]
43     },
44     "vllm:request_success": {
45       "type": "counter",
46       "description": "Count of successfully processed requests.",
47       "unit": "requests",
48       "series": [{
49         "endpoint_url": "http://localhost:8000/metrics",
50         "labels": { "engine": "0", "finished_reason": "length", "model_name": "Qwen/Qwen3-0.6B" },
51         "stats": {
52           "total": 19.0, "rate": 0.359,
53           "rate_avg": 0.38, "rate_min": 0.0, "rate_max": 1.8, "rate_std": 0.751
54         },
55         "timeslices": [
56           { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "total": 0.0, "rate": 0.0 },
57           { "start_ns": 1765793073028529452, "end_ns": 1765793078028529452, "total": 9.0, "rate": 1.8 }
58         ]
59       }]
60     },
61     "vllm:e2e_request_latency_seconds": {
62       "type": "histogram",
63       "description": "Histogram of e2e request latency in seconds.",
64       "unit": "seconds",
65       "series": [{
66         "endpoint_url": "http://localhost:8000/metrics",
67         "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
68         "stats": {
69           "count": 19, "sum": 259.87, "avg": 13.68,
70           "count_rate": 0.359, "sum_rate": 4.90,
71           "p1_estimate": 2.25, "p5_estimate": 5.77, "p10_estimate": 8.26,
72           "p25_estimate": 10.82, "p50_estimate": 13.75, "p75_estimate": 15.35,
73           "p90_estimate": 17.24, "p95_estimate": 19.51, "p99_estimate": 31.77
74         },
75         "buckets": {
76           "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 1, "5.0": 1,
77           "10.0": 3, "15.0": 11, "20.0": 18, "30.0": 18, "+Inf": 19
78         },
79         "timeslices": [
80           {
81             "start_ns": 1765793063028529452, "end_ns": 1765793068028529452,
82             "count": 0, "sum": 0.0, "avg": 0.0,
83             "buckets": { "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 0, "5.0": 0, "10.0": 0, "15.0": 0, "20.0": 0, "+Inf": 0 }
84           }
85         ]
86       }]
87     }
88   },
89   "input_config": {
90     "models": ["Qwen/Qwen3-0.6B"],
91     "endpoint": { "urls": ["http://localhost:8000"], "streaming": true },
92     "datasets": [{ "name": "default", "type": "synthetic", "count": 30000 }],
93     "phases": [
94       { "name": "profiling", "type": "concurrency", "concurrency": 400, "requests": 30000 }
95     ],
96     "artifacts": { "slice_duration": 5.0 }
97   }
98 }

Query with jq:

$ jq '.metrics["vllm:e2e_request_latency_seconds"].series[0].stats.p99_estimate' server_metrics_export.json

3. CSV Export: `server_metrics_export.csv`

Tabular export organized in five sections (separated by blank lines): gauge, counter, histogram, unknown, info. The unknown section holds families that the Prometheus server declared as # TYPE foo untyped (or with no # TYPE line at all); they use the same statistics columns as gauges.

Labels expanded into individual columns for easy filtering/pivoting
Open directly in Excel/Sheets or load with pandas

1 from io import StringIO
2 import pandas as pd
3 
4 with open("server_metrics_export.csv") as f:
5     sections = [pd.read_csv(StringIO(s)) for s in f.read().strip().split('\n\n') if s.strip()]

4. Parquet Export: `server_metrics_export.parquet`

Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.

Schema overview:

Column	Type	Description
`endpoint_url`	string	Source Prometheus endpoint
`metric_name`	string	Metric name
`metric_type`	string	`gauge`, `unknown`, `counter`, or `histogram`
`timestamp_ns`	int64	Collection timestamp (nanoseconds)
`value`	float64	Gauge/counter value (delta for counters)
`sum`, `count`	float64	Histogram sum/count deltas
`bucket_le`, `bucket_count`	string, float64	Histogram bucket bound and delta count
(label columns)	string	Dynamic columns from Prometheus labels

See Parquet Schema Reference for complete schema, metadata, and query examples.

Related documentation:

JSON Schema Reference - Complete JSON export format specification
Server Metrics Reference - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo, Triton)
Parquet Schema Reference - Raw time-series data schema

Quick examples:

$ # DuckDB queries
$ duckdb -c "SELECT * FROM 'server_metrics_export.parquet' WHERE metric_name LIKE 'vllm:%' ORDER BY timestamp_ns"
$ duckdb -c "SELECT metric_name, AVG(value) FROM '*.parquet' WHERE metric_type='gauge' GROUP BY metric_name"
$ 
$ # Combine multiple runs (handles schema differences)
$ duckdb -c "SELECT * FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)"

1 import pandas as pd
2 df = pd.read_parquet('server_metrics_export.parquet')
3 df[df['metric_name'] == 'vllm:kv_cache_usage_perc'].plot(x='timestamp_ns', y='value')

Statistics by Metric Type

Now that you understand the output formats, let’s examine how statistics are structured within each metric type.

Statistics are nested under a stats field within each series item. All metrics use the stats format for consistent API access.

Gauge (point-in-time values)

Statistics: avg, min, max, std, p1, p5, p10, p25, p50, p75, p90, p95, p99

Gauge percentiles are computed from actual collected samples (not estimated from buckets).

Counter (cumulative totals)

Statistics: total, rate, and when --slice-duration is set: rate_avg, rate_min, rate_max, rate_std

total: Change during profiling period (uses last pre-profiling sample as reference)
rate: Increase per second (total/duration)
Counter resets are detected and handled (negative deltas → total = 0)

Histogram (distributions)

Statistics (stats): count, count_rate, sum, sum_rate, avg, p1_estimate, p5_estimate, p10_estimate, p25_estimate, p50_estimate, p75_estimate, p90_estimate, p95_estimate, p99_estimate

Series-level field: buckets (per-bucket delta counts, not cumulative)

avg (sum/count) is exact
Percentiles are estimates from bucket interpolation

Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Use Histogram families for percentile estimation when the server offers them. Rare optional Summary families, such as SGLang’s sglang:eplb_balancedness, Triton’s nv_inference_*_summary_us, or Triton’s response-cache nv_cache_*_summary_per_model, are ignored by AIPerf exports.

Timesliced Statistics

When configured with --slice-duration, AIPerf computes windowed statistics over fixed time intervals. Each series includes a timeslices array with per-window statistics:

1 {
2   "stats": { "avg": 25.5, "min": 0.0, "max": 50.0 },
3   "timeslices": [
4     { "start_ns": 1765615837721140145, "end_ns": 1765615839721140145, "avg": 22.9, "min": 0.0, "max": 42.0 },
5     { "start_ns": 1765615839721140145, "end_ns": 1765615841721140145, "avg": 49.8, "min": 49.0, "max": 50.0 }
6   ]
7 }

Gauges: Each timeslice contains avg, min, max
Counters: Each timeslice contains total, rate
Histograms: Each timeslice contains count, sum, avg, buckets

Partial timeslices (at the end of the collection period) are marked with is_complete: false and excluded from aggregate statistics (e.g., rate_avg, rate_min) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.

Labeled Metrics

Prometheus metrics with labels (e.g., model, status) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by endpoint_url.

Unit Inference

AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (_seconds, _bytes, _requests, etc.). Units appear in both JSON and CSV exports. The unit field is optional—if no unit can be inferred, it’s omitted.

Common Metrics by Server

vLLM

Metric	Type	Description
`vllm:num_requests_running`	gauge	Requests in execution batches
`vllm:num_requests_waiting`	gauge	Requests in queue (saturation indicator)
`vllm:num_requests_waiting_by_reason`	gauge	Waiting requests split by `capacity` vs `deferred`
`vllm:engine_sleep_state`	gauge	Engine sleep/offload state
`vllm:kv_cache_usage_perc`	gauge	KV-cache usage (0.0-1.0, >0.9 = capacity limit)
`vllm:num_preemptions`	counter	Requests preempted due to memory pressure
`vllm:prefix_cache_hits`	counter	Tokens served from prefix cache
`vllm:prefix_cache_queries`	counter	Tokens queried (hit_rate = hits/queries)
`vllm:external_prefix_cache_hits`	counter	Tokens served from external KV connector cache
`vllm:external_prefix_cache_queries`	counter	Tokens queried from external KV connector cache
`vllm:mm_cache_hits`	counter	Multi-modal cache hits
`vllm:mm_cache_queries`	counter	Multi-modal cache queries
`vllm:time_to_first_token_seconds`	histogram	Time to first token (TTFT)
`vllm:e2e_request_latency_seconds`	histogram	End-to-end latency
`vllm:inter_token_latency_seconds`	histogram	Time between output tokens (ITL)
`vllm:request_queue_time_seconds`	histogram	Time spent waiting in queue
`vllm:request_prefill_time_seconds`	histogram	Time spent in prefill phase
`vllm:request_decode_time_seconds`	histogram	Time spent in decode phase
`vllm:request_prefill_kv_computed_tokens`	histogram	New KV tokens computed during prefill, excluding cached tokens
`vllm:request_success`	counter	Completed requests
`vllm:prompt_tokens`	counter	Total prompt tokens (rate = prefill throughput)
`vllm:prompt_tokens_by_source`	counter	Prompt tokens by `local_compute`, `local_cache_hit`, or `external_kv_transfer`
`vllm:prompt_tokens_cached`	counter	Cached prompt tokens (local + external)
`vllm:generation_tokens`	counter	Total generated tokens (rate = decode throughput)

Dynamo

Metric	Type	Description
`dynamo_frontend_requests`	counter	Requests by endpoint/model/status
`dynamo_frontend_inflight_requests`	gauge	Requests currently processing
`dynamo_frontend_queued_requests`	gauge	Requests awaiting first token
`dynamo_frontend_request_duration_seconds`	histogram	End-to-end HTTP latency
`dynamo_frontend_time_to_first_token_seconds`	histogram	TTFT including routing overhead
`dynamo_frontend_inter_token_latency_seconds`	histogram	Inter-token latency (ITL)
`dynamo_frontend_input_sequence_tokens`	histogram	Prompt token distribution
`dynamo_frontend_output_sequence_tokens`	histogram	Response token distribution
`dynamo_component_requests`	counter	Per-component (prefill/decode) requests
`dynamo_component_request_duration_seconds`	histogram	Per-component processing time
`dynamo_component_inflight_requests`	gauge	Active requests per worker
`dynamo_component_errors`	counter	Errors by component/type
`dynamo_component_gpu_cache_usage_percent`	gauge	Backend KV-cache usage
`dynamo_component_embedding_cache_hits`	counter	Multimodal embedding-cache hits
`dynamo_component_embedding_cache_misses`	counter	Multimodal embedding-cache misses
`dynamo_component_kv_publisher_zmq_events`	counter	KV publisher relay events
`dynamo_tokio_global_queue_depth`	gauge	Tokio runtime global queue depth
`dynamo_frontend_event_loop_delay_seconds`	histogram	Event-loop delay canary

SGLang

Metric	Type	Description
`sglang:num_running_reqs`	gauge	Running requests
`sglang:num_queue_reqs`	gauge	Queued requests (saturation indicator)
`sglang:token_usage`	gauge	Memory utilization (>0.9 = capacity limit)
`sglang:cache_hit_rate`	gauge	Prefix cache hit rate
`sglang:gen_throughput`	gauge	Real-time generation tokens/s
`sglang:prompt_tokens`	counter	Total prompt tokens (rate = prefill throughput)
`sglang:generation_tokens`	counter	Total generated tokens (rate = decode throughput)
`sglang:time_to_first_token_seconds`	histogram	Time to first token (TTFT)
`sglang:inter_token_latency_seconds`	histogram	Time between output tokens (ITL)
`sglang:e2e_request_latency_seconds`	histogram	End-to-end latency
`sglang:queue_time_seconds`	histogram	Queue wait time
`sglang:per_stage_req_latency_seconds`	histogram	Latency by observed stage (`request_process`, `prefill_forward`, `decode_waiting`, etc.)

TRT-LLM

Metric	Type	Description
`trtllm_time_to_first_token_seconds`	histogram	Time to first token (TTFT)
`trtllm_e2e_request_latency_seconds`	histogram	End-to-end latency
`trtllm_time_per_output_token_seconds`	histogram	Per-token generation time (ITL)
`trtllm_request_queue_time_seconds`	histogram	Time in waiting phase
`trtllm_request_prefill_time_seconds`	histogram	Prefill/context phase duration
`trtllm_request_decode_time_seconds`	histogram	Decode/generation phase duration
`trtllm_request_inference_time_seconds`	histogram	Total scheduled inference duration
`trtllm_request_success`	counter	Completed requests by `finished_reason`
`trtllm_prompt_tokens`	counter	Total prompt tokens (rate = prefill throughput)
`trtllm_generation_tokens`	counter	Total generated tokens (rate = decode throughput)
`trtllm_num_requests_running`	gauge	Active requests
`trtllm_num_requests_waiting`	gauge	Queued requests
`trtllm_kv_cache_utilization`	gauge	KV cache utilization
`trtllm_kv_cache_hit_rate`	gauge	KV cache hit rate
`trtllm_num_aborted_requests`	counter	Dynamo-TRTLLM additional aborted/cancelled requests
`trtllm_kv_transfer_latency_seconds`	histogram	Dynamo-TRTLLM additional KV-transfer latency
`trtllm_kv_transfer_bytes`	histogram	Dynamo-TRTLLM additional KV-transfer size
`trtllm_kv_transfer_speed_gb_s`	histogram	Dynamo-TRTLLM additional KV-transfer speed

Triton Inference Server

Metric	Type	Description
`nv_inference_request_success`	counter	Successful inference requests
`nv_inference_request_failure`	counter	Failed inference requests by `reason`
`nv_inference_count`	counter	Inferences performed; divide by `nv_inference_exec_count` for average batch size
`nv_inference_exec_count`	counter	Backend batch executions
`nv_inference_pending_request_count`	gauge	Requests received by Triton but not yet executing
`nv_inference_request_duration_us`	counter	Cumulative end-to-end request handling time
`nv_inference_queue_duration_us`	counter	Cumulative scheduler queue time
`nv_inference_first_response_histogram_ms`	histogram	Optional first-response latency histogram
`nv_gpu_utilization`	gauge	GPU utilization
`nv_gpu_memory_used_bytes`	gauge	Used GPU memory
`nv_cache_num_hits_per_model`	counter	Response-cache hits per model (when response cache is enabled)
`nv_cache_num_misses_per_model`	counter	Response-cache misses per model (when response cache is enabled)

Troubleshooting

Problem	Check	Solution
High p99, good p50	`vllm:num_requests_waiting` spikes	Queue buildup—reduce concurrency or increase server capacity
OOM crashes	`vllm:kv_cache_usage_perc` approaching 1.0	Reduce `max_model_len` or increase `gpu_memory_utilization`
Low throughput	`vllm:num_requests_running` vs `vllm:num_requests_waiting`	Low both = client bottleneck; high waiting = server bottleneck
Endpoint unreachable	`curl http://localhost:8000/metrics` or `curl http://localhost:8002/metrics` for Triton	Check server running, network, firewall; use explicit `--server-metrics` URL
`WARNING ... non-Prometheus content-type 'application/json'`	`curl -i <base>/metrics` shows `Content-Type: application/json`	Server isn’t serving Prometheus at `/metrics`. For TRT-LLM, set `return_perf_metrics: true` in `extra_llm_api_options.yaml` so AIPerf’s auto-probe finds `/prometheus/metrics`. To silence the warning entirely, pass `--no-server-metrics`. See Compatibility & auto-disable.

CI/CD Integration

1 import json
2 
3 with open('server_metrics_export.json') as f:
4     data = json.load(f)
5 
6 latency = data['metrics']['vllm:e2e_request_latency_seconds']['series'][0]['stats']
7 assert latency['p99_estimate'] < 5.0, f"P99 latency too high: {latency['p99_estimate']}"

Quick Reference

vLLM

Dynamo

SGLang

TRT-LLM

Triton Inference Server

Quick Start

Adding Custom Endpoints

Disabling Server Metrics

Selecting Output Formats

Compatibility & auto-disable

Configuration

Output Files

1. Time-Series: server_metrics_export.jsonl

2. Aggregated Statistics: server_metrics_export.json

3. CSV Export: server_metrics_export.csv

4. Parquet Export: server_metrics_export.parquet

Statistics by Metric Type

Gauge (point-in-time values)

Counter (cumulative totals)

Histogram (distributions)

Timesliced Statistics

Labeled Metrics

Unit Inference

Common Metrics by Server

vLLM

Dynamo

SGLang

TRT-LLM

Triton Inference Server

Troubleshooting

CI/CD Integration

1. Time-Series: `server_metrics_export.jsonl`

2. Aggregated Statistics: `server_metrics_export.json`

3. CSV Export: `server_metrics_export.csv`

4. Parquet Export: `server_metrics_export.parquet`