
Copyright © 2026, NVIDIA Corporation.


Server Metrics Collection


AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers and serving frontends (vLLM, SGLang, TRT-LLM, Dynamo, Triton, etc.).

Quick Reference

| Feature | Description | Default |
|---|---|---|
| Auto-discovery | Automatically finds /metrics endpoint on server URL | Enabled |
| Collection | Scrapes metrics every 333ms during benchmark | Enabled |
| Outputs | JSON (aggregated), CSV (tabular), JSONL (time-series), Parquet (cumulative deltas) | JSON + CSV + Parquet |
| Custom endpoints | --server-metrics URL [URL...] for additional endpoints | None |
| Disable | --no-server-metrics to turn off collection | Enabled |
| Windowed stats | --slice-duration SECONDS for time-sliced analysis | Off |

Key metrics by server:

vLLM

| Metric | Type | What to Watch |
|---|---|---|
| vllm:num_requests_running | gauge | Active batch size (stats.avg) |
| vllm:num_requests_waiting | gauge | Queue depth; growing = saturation (stats.max) |
| vllm:num_requests_waiting_by_reason | gauge | Queue depth split into capacity and deferred (stats.max) |
| vllm:kv_cache_usage_perc | gauge | >0.9 = capacity limit (stats.max) |
| vllm:num_preemptions | counter | >0 = memory pressure (stats.total) |
| vllm:e2e_request_latency_seconds | histogram | E2E latency (stats.p99_estimate) |
| vllm:time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| vllm:inter_token_latency_seconds | histogram | ITL (stats.p99_estimate) |
| vllm:prompt_tokens_by_source | counter | Prompt-token source mix (source label) |
| vllm:generation_tokens | counter | Decode throughput (stats.rate) |

Dynamo

| Metric | Type | What to Watch |
|---|---|---|
| dynamo_frontend_active_requests | gauge | HTTP handler active requests (stats.avg) |
| dynamo_frontend_inflight_requests | gauge | Engine-bound active requests (stats.avg) |
| dynamo_frontend_queued_requests | gauge | HTTP requests awaiting first token (stats.avg) |
| dynamo_frontend_request_duration_seconds | histogram | E2E latency (stats.p99_estimate) |
| dynamo_frontend_time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| dynamo_frontend_inter_token_latency_seconds | histogram | ITL (stats.p99_estimate) |
| dynamo_frontend_requests | counter | Completed request throughput (stats.rate) |
| dynamo_frontend_output_tokens | counter | Decode throughput (stats.rate) |
| dynamo_component_gpu_cache_usage_percent | gauge | Backend cache usage (stats.max) |

SGLang

| Metric | Type | What to Watch |
|---|---|---|
| sglang:num_running_reqs | gauge | Active batch size (stats.avg) |
| sglang:num_queue_reqs | gauge | Queue depth; growing = saturation (stats.max) |
| sglang:token_usage | gauge | >0.9 = capacity limit (stats.max) |
| sglang:cache_hit_rate | gauge | Prefix cache efficiency (stats.avg) |
| sglang:gen_throughput | gauge | Real-time tokens/s (stats.avg) |
| sglang:time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| sglang:inter_token_latency_seconds | histogram | ITL (stats.p99_estimate) |
| sglang:e2e_request_latency_seconds | histogram | E2E latency (stats.p99_estimate) |
| sglang:queue_time_seconds | histogram | Queue wait (stats.p99_estimate) |
| sglang:prompt_tokens | counter | Prefill throughput (stats.rate) |
| sglang:generation_tokens | counter | Decode throughput (stats.rate) |

TRT-LLM

| Metric | Type | What to Watch |
|---|---|---|
| trtllm_e2e_request_latency_seconds | histogram | E2E latency (stats.p99_estimate) |
| trtllm_time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| trtllm_time_per_output_token_seconds | histogram | ITL (stats.p99_estimate) |
| trtllm_request_queue_time_seconds | histogram | Queue wait (stats.p99_estimate) |
| trtllm_request_prefill_time_seconds | histogram | Prefill duration (stats.p99_estimate) |
| trtllm_request_decode_time_seconds | histogram | Decode duration (stats.p99_estimate) |
| trtllm_request_success | counter | Completed requests (stats.rate) |
| trtllm_prompt_tokens | counter | Prefill throughput (stats.rate) |
| trtllm_generation_tokens | counter | Decode throughput (stats.rate) |
| trtllm_num_requests_running | gauge | Active requests (stats.avg) |
| trtllm_num_requests_waiting | gauge | Queued requests (stats.max) |
| trtllm_kv_cache_utilization | gauge | KV cache usage (stats.max) |
| trtllm_kv_cache_hit_rate | gauge | KV cache reuse efficiency (stats.avg) |

TRT-LLM server-side setup is required. Unlike vLLM and SGLang, trtllm-serve does not expose Prometheus exposition format at /metrics by default — the default /metrics returns an iteration-stats JSON array (application/json), which is not parseable as Prometheus. Two consequences:

  1. Enable Prometheus on the server. Pass return_perf_metrics: true in your extra_llm_api_options.yaml. This mounts the proper Prometheus exposition at /prometheus/metrics (a non-standard path). Add enable_iter_perf_stats: true when you want iteration-derived queue/KV/memory metrics from the PyTorch backend.
  2. AIPerf auto-detects and falls back. When AIPerf hits /metrics and gets application/json, it automatically probes <base>/prometheus/metrics once. If the alt path serves Prometheus, AIPerf swaps the URL and continues — no manual override needed. If the alt path also fails (e.g. return_perf_metrics was not set), the collector auto-disables for the remainder of the run with a single warning.

Example extra_llm_api_options.yaml snippet:

    return_perf_metrics: true
    enable_iter_perf_stats: true

Triton Inference Server

| Metric | Type | What to Watch |
|---|---|---|
| nv_inference_request_success | counter | Successful request throughput (stats.rate) |
| nv_inference_request_failure | counter | Failed requests by reason (stats.total) |
| nv_inference_count | counter | Inference throughput and average batch size numerator (stats.rate) |
| nv_inference_exec_count | counter | Execution throughput and average batch size denominator (stats.rate) |
| nv_inference_pending_request_count | gauge | Backend queue depth (stats.max) |
| nv_inference_request_duration_us | counter | Cumulative E2E request time (stats.total, microseconds) |
| nv_inference_queue_duration_us | counter | Cumulative queue time (stats.total, microseconds) |
| nv_inference_first_response_histogram_ms | histogram | First-response latency when histogram latencies are enabled (stats.p99_estimate) |
| nv_gpu_utilization | gauge | GPU utilization (stats.avg) |
| nv_gpu_memory_used_bytes | gauge | GPU memory pressure (stats.max) |
| nv_cache_num_hits_per_model | counter | Response-cache hits (stats.total) |
| nv_cache_num_misses_per_model | counter | Response-cache misses (stats.total) |

Triton serves Prometheus metrics at http://localhost:8002/metrics by default, not on the inference HTTP port. Use --server-metrics http://HOST:8002/metrics when the inference URL and metrics URL differ. AIPerf ignores Triton latency summaries; to get first-response histogram percentiles, start Triton with --metrics-config histogram_latencies=true.

Quick Start

Server metrics are collected by default, so just run AIPerf as usual:

    aiperf profile \
      --model Qwen/Qwen3-0.6B \
      --endpoint-type chat \
      --endpoint /v1/chat/completions \
      --url localhost:8000 \
      --concurrency 4 \
      --request-count 100

AIPerf automatically:

  1. Discovers the /metrics endpoint on your inference server (base URL + /metrics)
  2. Tests endpoint reachability before profiling starts
  3. Captures baseline metrics before warmup period begins (reference point for deltas) — also where AIPerf first parses the response and validates it as Prometheus exposition format; see Compatibility & auto-disable for what happens when an endpoint returns non-Prometheus content
  4. Collects metrics at configurable intervals during warmup and profiling
  5. Performs final scrape after profiling completes (captures end state)
  6. Exports selected formats (default: JSON + CSV + Parquet):
    • server_metrics_export.json - Aggregated statistics (profiling period only)
    • server_metrics_export.csv - Tabular format (profiling period only)
    • server_metrics_export.parquet - Raw time-series with delta calculations
    • server_metrics_export.jsonl - Time-series data (all scrapes, opt-in only)

Custom file naming: The --profile-export-prefix (or --profile-export-file) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:

    aiperf profile --model MODEL ... --profile-export-prefix my_benchmark
    # Produces: my_benchmark_server_metrics.json, my_benchmark_server_metrics.csv, etc.

    # --profile-export-file is an alias for --profile-export-prefix, so this is equivalent:
    aiperf profile --model MODEL ... --profile-export-file my_benchmark.json
    # Produces the same files (the .json extension is stripped automatically)

Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.

Format selection: By default, JSON, CSV, and Parquet formats are generated (JSONL is opt-in to avoid large files). To opt out of Parquet, or to include JSONL for time-series analysis:

    # Disable Parquet (JSON + CSV only)
    aiperf profile --model MODEL ... --server-metrics-formats json csv

    # Add JSONL for raw time-series snapshots
    aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl

Adding Custom Endpoints

    # Single endpoint
    aiperf profile --model MODEL ... --server-metrics http://localhost:8081

    # Multiple endpoints (distributed deployment)
    aiperf profile --model MODEL ... --server-metrics \
      http://node1:8081 \
      http://node2:8081

Disabling Server Metrics

    aiperf profile --model MODEL ... --no-server-metrics

Selecting Output Formats

    # Default: JSON + CSV + Parquet
    aiperf profile --model MODEL ...

    # Opt out of Parquet (JSON + CSV only)
    aiperf profile --model MODEL ... --server-metrics-formats json csv

    # Add JSONL for raw time-series snapshots
    aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl

| Format | Use Case | Size |
|---|---|---|
| JSON/CSV (default) | Summary statistics, CI/CD thresholds | Small |
| Parquet (default) | SQL queries, pandas/DuckDB analytics | Compressed |
| JSONL (opt-in) | Debugging, raw Prometheus snapshots | 10-100x larger |

Compatibility & auto-disable

AIPerf scrapes /metrics at ~3 Hz and parses the response as Prometheus exposition format. When a server speaks something else at that path (most commonly TRT-LLM, which serves an iteration-stats JSON array), AIPerf does not retry-and-spam — it detects the mismatch on the first scrape and disables collection for that endpoint with a single log line. This avoids the failure mode where parse errors at the scrape interval inflate run time by 10×+.

Detection. A response is treated as non-Prometheus when either:

  • the HTTP Content-Type is application/json (the response body is never read in this case — the rejection is cheaper than parsing); or
  • the body fails to parse as Prometheus exposition format (prometheus_client.parser.text_string_to_metric_families raises ValueError — e.g. a server returns text/plain with garbage, or a JSON body without a content-type).

TRT-LLM /prometheus/metrics fallback. Before disabling, AIPerf probes <base>/prometheus/metrics exactly once — TRT-LLM mounts the proper Prometheus path there when launched with return_perf_metrics: true (see the TRT-LLM entry in the Quick Reference table above). If the probe succeeds, the collector swaps its URL there and the run continues with the alt endpoint. The probe is attempted whenever the configured URL ends with /metrics and is not already /prometheus/metrics itself — so /metrics, /v1/metrics, and /api/metrics all trigger the fallback probe. URLs that don’t end in /metrics (e.g. /stats, /telemetry) are left untouched, and /prometheus/metrics is excluded to avoid probing the same path it would swap to.
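
The URL rule described above can be sketched in a few lines of Python. This is an illustration only, not AIPerf's implementation: the function name `fallback_probe_url` is hypothetical, and the sketch assumes that `<base>` means the scheme and host of the configured URL, so any path prefix such as `/v1` is dropped.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

FALLBACK_PATH = "/prometheus/metrics"


def fallback_probe_url(url: str) -> Optional[str]:
    """Return the alternate URL to probe once, or None if no probe applies."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    # Already the alternate path: probing would target the same endpoint.
    if path.endswith(FALLBACK_PATH):
        return None
    # Only URLs whose path ends with /metrics trigger the probe.
    if not path.endswith("/metrics"):
        return None
    # Assumption: <base> is scheme + host, so any path prefix is dropped.
    return urlunsplit((parts.scheme, parts.netloc, FALLBACK_PATH, "", ""))
```

Under these assumptions, `http://localhost:8000/metrics` and `http://localhost:8000/v1/metrics` both map to `http://localhost:8000/prometheus/metrics`, while `/stats` and `/prometheus/metrics` return `None`.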

On auto-disable. A single WARNING is emitted naming the endpoint and the suppression flag. Subsequent scrape cycles short-circuit, the collector emits no further log noise, and the rest of the benchmark proceeds normally — other configured endpoints (DCGM telemetry, additional --server-metrics URLs) are unaffected.

WARNING Disabling server metrics collection for http://127.0.0.1:60000/metrics:
endpoint 'http://127.0.0.1:60000/metrics' returned non-Prometheus
content-type 'application/json'; expected text/plain (Prometheus
exposition format). To suppress this warning, pass --no-server-metrics.

To suppress the warning entirely, pass --no-server-metrics — collection is skipped, no probe is attempted, no warning is logged.

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| AIPERF_SERVER_METRICS_COLLECTION_INTERVAL | 0.333s | Collection frequency (333ms, ~3Hz) |
| AIPERF_SERVER_METRICS_COLLECTION_FLUSH_PERIOD | 2.0s | Wait time for final metrics after benchmark |
| AIPERF_SERVER_METRICS_REACHABILITY_TIMEOUT | 10s | Timeout for endpoint reachability tests |
| AIPERF_SERVER_METRICS_EXPORT_BATCH_SIZE | 100 | Batch size for JSONL writer |
| AIPERF_SERVER_METRICS_SHUTDOWN_DELAY | 5.0s | Shutdown delay for command response transmission |

Output Files

The filenames below are defaults. When --profile-export-prefix <prefix> is used, server metrics files are named <prefix>_server_metrics.{json,csv,jsonl,parquet} (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (--artifact-dir / --output-artifact-dir, default: ./artifacts/<run_info>).

1. Time-Series: server_metrics_export.jsonl

Line-delimited JSON with metrics snapshots over time:

    {
      "endpoint_url": "http://localhost:8000/metrics",
      "timestamp_ns": 1763591215220757503,
      "endpoint_latency_ns": 719764167,
      "metrics": {
        "vllm:num_requests_running": [{"value": 12.0}],
        "vllm:kv_cache_usage_perc": [{"value": 0.72}],
        "vllm:request_success": [{"value": 1500.0}],
        "vllm:time_to_first_token_seconds": [{
          "buckets": {"0.01": 145.0, "0.1": 1498.0, "+Inf": 1500.0},
          "sum": 32.456,
          "count": 1500.0
        }]
      },
      "request_sent_ns": 1763591214500993336,
      "first_byte_ns": 1763591215220757503
    }

Fields:

  • endpoint_url: Source Prometheus endpoint
  • timestamp_ns: Collection timestamp in nanoseconds
  • endpoint_latency_ns: HTTP round-trip time in nanoseconds
  • metrics: All metrics from this endpoint
    • Counter/Gauge: {"value": N} or {"labels": {...}, "value": N}
    • Histogram: {"buckets": {"le": count}, "sum": N, "count": N} with optional labels
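
As a rough sketch of how these records can be consumed, the helper below walks JSONL lines and pulls out one gauge or counter over time. The helper name `gauge_series` is hypothetical and the two inline records are made-up illustrative data that follow the field layout above; only the first series of each metric is read.

```python
import json
from typing import Iterable, Iterator, Tuple


def gauge_series(lines: Iterable[str], metric: str) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp_ns, value) for the first series of a gauge or counter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        samples = record["metrics"].get(metric)
        # Histogram entries carry buckets/sum/count instead of "value"; skip them.
        if samples and "value" in samples[0]:
            yield record["timestamp_ns"], samples[0]["value"]


# Illustrative records only; real files hold one such object per scrape.
sample = [
    '{"endpoint_url": "http://localhost:8000/metrics", "timestamp_ns": 1, '
    '"metrics": {"vllm:num_requests_running": [{"value": 12.0}]}}',
    '{"endpoint_url": "http://localhost:8000/metrics", "timestamp_ns": 2, '
    '"metrics": {"vllm:num_requests_running": [{"value": 14.0}]}}',
]
points = list(gauge_series(sample, "vllm:num_requests_running"))
```

With a real export, pass the open file object instead of the `sample` list, since iterating a text file yields one line at a time.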

2. Aggregated Statistics: server_metrics_export.json

Aggregated statistics from profiling period. Metrics from all endpoints are merged, each series tagged with endpoint_url.

    {
      "schema_version": "1.0",
      "aiperf_version": "0.3.0",
      "benchmark_id": "2900a136-3c1a-4520-adaa-5719822b729b",
      "summary": {
        "endpoints_configured": ["http://localhost:8000/metrics"],
        "endpoints_successful": ["http://localhost:8000/metrics"],
        "start_time": "2025-12-15T02:04:23.028529",
        "end_time": "2025-12-15T02:05:15.294690",
        "endpoint_info": {
          "http://localhost:8000/metrics": {
            "total_fetches": 157,
            "first_fetch_ns": 1765793061967310848,
            "last_fetch_ns": 1765793114960054143,
            "avg_fetch_latency_ms": 246.83,
            "unique_updates": 157,
            "first_update_ns": 1765793061967310848,
            "last_update_ns": 1765793114960054143,
            "duration_seconds": 52.99,
            "avg_update_interval_ms": 339.70,
            "median_update_interval_ms": 333.48
          }
        }
      },
      "metrics": {
        "vllm:kv_cache_usage_perc": {
          "type": "gauge",
          "description": "KV-cache usage. 1 means 100 percent usage.",
          "unit": "ratio",
          "series": [{
            "endpoint_url": "http://localhost:8000/metrics",
            "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
            "stats": {
              "avg": 0.191, "min": 0.0, "max": 0.202, "std": 0.038,
              "p1": 0.003, "p5": 0.178, "p10": 0.191, "p25": 0.198,
              "p50": 0.202, "p75": 0.202, "p90": 0.202, "p95": 0.202, "p99": 0.202
            },
            "timeslices": [
              { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "avg": 0.107, "min": 0.0, "max": 0.191 },
              { "start_ns": 1765793068028529452, "end_ns": 1765793073028529452, "avg": 0.192, "min": 0.191, "max": 0.194 }
            ]
          }]
        },
        "vllm:request_success": {
          "type": "counter",
          "description": "Count of successfully processed requests.",
          "unit": "requests",
          "series": [{
            "endpoint_url": "http://localhost:8000/metrics",
            "labels": { "engine": "0", "finished_reason": "length", "model_name": "Qwen/Qwen3-0.6B" },
            "stats": {
              "total": 19.0, "rate": 0.359,
              "rate_avg": 0.38, "rate_min": 0.0, "rate_max": 1.8, "rate_std": 0.751
            },
            "timeslices": [
              { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "total": 0.0, "rate": 0.0 },
              { "start_ns": 1765793073028529452, "end_ns": 1765793078028529452, "total": 9.0, "rate": 1.8 }
            ]
          }]
        },
        "vllm:e2e_request_latency_seconds": {
          "type": "histogram",
          "description": "Histogram of e2e request latency in seconds.",
          "unit": "seconds",
          "series": [{
            "endpoint_url": "http://localhost:8000/metrics",
            "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
            "stats": {
              "count": 19, "sum": 259.87, "avg": 13.68,
              "count_rate": 0.359, "sum_rate": 4.90,
              "p1_estimate": 2.25, "p5_estimate": 5.77, "p10_estimate": 8.26,
              "p25_estimate": 10.82, "p50_estimate": 13.75, "p75_estimate": 15.35,
              "p90_estimate": 17.24, "p95_estimate": 19.51, "p99_estimate": 31.77
            },
            "buckets": {
              "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 1, "5.0": 1,
              "10.0": 3, "15.0": 11, "20.0": 18, "30.0": 18, "+Inf": 19
            },
            "timeslices": [
              {
                "start_ns": 1765793063028529452, "end_ns": 1765793068028529452,
                "count": 0, "sum": 0.0, "avg": 0.0,
                "buckets": { "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 0, "5.0": 0, "10.0": 0, "15.0": 0, "20.0": 0, "+Inf": 0 }
              }
            ]
          }]
        }
      },
      "input_config": {
        "models": ["Qwen/Qwen3-0.6B"],
        "endpoint": { "urls": ["http://localhost:8000"], "streaming": true },
        "datasets": [{ "name": "default", "type": "synthetic", "count": 30000 }],
        "phases": [
          { "name": "profiling", "type": "concurrency", "concurrency": 400, "requests": 30000 }
        ],
        "artifacts": { "slice_duration": 5.0 }
      }
    }

Query with jq:

    jq '.metrics["vllm:e2e_request_latency_seconds"].series[0].stats.p99_estimate' server_metrics_export.json

3. CSV Export: server_metrics_export.csv

Tabular export organized in five sections (separated by blank lines): gauge, counter, histogram, unknown, info. The unknown section holds families that the Prometheus server declared as # TYPE foo untyped (or with no # TYPE line at all); they use the same statistics columns as gauges.

  • Labels expanded into individual columns for easy filtering/pivoting
  • Open directly in Excel/Sheets or load with pandas

    from io import StringIO
    import pandas as pd

    # Sections are separated by blank lines, so split on them and parse
    # each section into its own DataFrame (columns differ per section).
    with open("server_metrics_export.csv") as f:
        sections = [pd.read_csv(StringIO(s)) for s in f.read().strip().split('\n\n') if s.strip()]

4. Parquet Export: server_metrics_export.parquet

Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.

Schema overview:

| Column | Type | Description |
|---|---|---|
| endpoint_url | string | Source Prometheus endpoint |
| metric_name | string | Metric name |
| metric_type | string | gauge, unknown, counter, or histogram |
| timestamp_ns | int64 | Collection timestamp (nanoseconds) |
| value | float64 | Gauge/counter value (delta for counters) |
| sum, count | float64 | Histogram sum/count deltas |
| bucket_le, bucket_count | string, float64 | Histogram bucket bound and delta count |
| (label columns) | string | Dynamic columns from Prometheus labels |

See Parquet Schema Reference for complete schema, metadata, and query examples.

Related documentation:

  • JSON Schema Reference - Complete JSON export format specification
  • Server Metrics Reference - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo, Triton)
  • Parquet Schema Reference - Raw time-series data schema

Quick examples:

    # DuckDB queries
    duckdb -c "SELECT * FROM 'server_metrics_export.parquet' WHERE metric_name LIKE 'vllm:%' ORDER BY timestamp_ns"
    duckdb -c "SELECT metric_name, AVG(value) FROM '*.parquet' WHERE metric_type='gauge' GROUP BY metric_name"

    # Combine multiple runs (handles schema differences)
    duckdb -c "SELECT * FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)"

    import pandas as pd
    df = pd.read_parquet('server_metrics_export.parquet')
    df[df['metric_name'] == 'vllm:kv_cache_usage_perc'].plot(x='timestamp_ns', y='value')

Statistics by Metric Type

Now that you understand the output formats, let’s examine how statistics are structured within each metric type.

Statistics are nested under a stats field within each series item. All metrics use the stats format for consistent API access.

Gauge (point-in-time values)

Statistics: avg, min, max, std, p1, p5, p10, p25, p50, p75, p90, p95, p99

Gauge percentiles are computed from actual collected samples (not estimated from buckets).

Counter (cumulative totals)

Statistics: total, rate, and when --slice-duration is set: rate_avg, rate_min, rate_max, rate_std

  • total: Change during profiling period (uses last pre-profiling sample as reference)
  • rate: Increase per second (total/duration)
  • Counter resets are detected and handled (negative deltas → total = 0)
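
The counter rules above amount to a small calculation, sketched here under stated assumptions: the function name `counter_stats` is hypothetical, `reference` stands for the last pre-profiling sample, and a reset is modeled simply as a negative overall delta. AIPerf's actual reset handling may be more detailed.

```python
from typing import Tuple


def counter_stats(reference: float, final: float, duration_s: float) -> Tuple[float, float]:
    """Return (total, rate) for a counter over the profiling period."""
    delta = final - reference
    # A negative delta indicates the counter was reset; per the rule above,
    # the total is then reported as 0 rather than a negative number.
    total = delta if delta >= 0 else 0.0
    # rate = total / duration, guarding against a zero-length period.
    rate = total / duration_s if duration_s > 0 else 0.0
    return total, rate
```

For example, a counter that moves from 1481 to 1500 over 52.99 seconds gives a total of 19 and a rate of roughly 0.36 per second.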

Histogram (distributions)

Statistics (stats): count, count_rate, sum, sum_rate, avg, p1_estimate, p5_estimate, p10_estimate, p25_estimate, p50_estimate, p75_estimate, p90_estimate, p95_estimate, p99_estimate

Series-level field: buckets (per-bucket delta counts, not cumulative)

  • avg (sum/count) is exact
  • Percentiles are estimates from bucket interpolation
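
To make "bucket interpolation" concrete, here is a minimal sketch of the linear interpolation used by Prometheus-style quantile estimation. It is not AIPerf's estimator, which is not specified here and may produce different values. The function name `estimate_quantile` is hypothetical, and the sketch assumes the bucket counts are cumulative across upper bounds (each count includes all smaller buckets, with "+Inf" equal to the total), as the values in the JSON example appear to be.

```python
import math
from typing import Dict


def estimate_quantile(q: float, buckets: Dict[str, float]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from le-cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Sort by upper bound; "+Inf" parses to infinity and therefore sorts last.
    bounds = sorted((float(le), count) for le, count in buckets.items())
    total = bounds[-1][1]
    if total == 0:
        return math.nan  # no observations in this window
    rank = q * total
    lower, prev_count = 0.0, 0.0
    for upper, count in bounds:
        if count >= rank:
            if math.isinf(upper):
                # No finite bound to interpolate toward: fall back to the
                # highest finite bound, which is `lower` at this point.
                return lower
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count
    return lower


buckets = {"0.3": 0, "0.5": 0, "1.0": 0, "2.5": 1, "5.0": 1,
           "10.0": 3, "15.0": 11, "20.0": 18, "30.0": 18, "+Inf": 19}
p50 = estimate_quantile(0.5, buckets)
```

The resolution of any such estimate is limited by the bucket boundaries, which is why the exported fields carry the `_estimate` suffix.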

Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Use Histogram families for percentile estimation when the server offers them. Rare optional Summary families, such as SGLang’s sglang:eplb_balancedness, Triton’s nv_inference_*_summary_us, or Triton’s response-cache nv_cache_*_summary_per_model, are ignored by AIPerf exports.

Timesliced Statistics

When configured with --slice-duration, AIPerf computes windowed statistics over fixed time intervals. Each series includes a timeslices array with per-window statistics:

    {
      "stats": { "avg": 25.5, "min": 0.0, "max": 50.0 },
      "timeslices": [
        { "start_ns": 1765615837721140145, "end_ns": 1765615839721140145, "avg": 22.9, "min": 0.0, "max": 42.0 },
        { "start_ns": 1765615839721140145, "end_ns": 1765615841721140145, "avg": 49.8, "min": 49.0, "max": 50.0 }
      ]
    }

  • Gauges: Each timeslice contains avg, min, max
  • Counters: Each timeslice contains total, rate
  • Histograms: Each timeslice contains count, sum, avg, buckets

Partial timeslices (at the end of the collection period) are marked with is_complete: false and excluded from aggregate statistics (e.g., rate_avg, rate_min) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.
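
The exclusion of partial slices can be sketched as follows. The function name `rate_aggregates` is hypothetical, the slice values are made up for illustration, and the sketch assumes that a slice without an `is_complete` field counts as complete. `rate_std` is left out because the exact standard-deviation formula is not stated here.

```python
from statistics import mean
from typing import Dict, List


def rate_aggregates(timeslices: List[dict]) -> Dict[str, float]:
    """Aggregate per-slice counter rates over complete timeslices only."""
    # Assumption: slices lacking is_complete are treated as complete.
    rates = [ts["rate"] for ts in timeslices if ts.get("is_complete", True)]
    if not rates:
        return {}
    return {"rate_avg": mean(rates), "rate_min": min(rates), "rate_max": max(rates)}


# Illustrative slices: two full 5-second windows and one partial 2-second window.
slices = [
    {"start_ns": 0, "end_ns": 5_000_000_000, "total": 0.0, "rate": 0.0},
    {"start_ns": 5_000_000_000, "end_ns": 10_000_000_000, "total": 9.0, "rate": 1.8},
    {"start_ns": 10_000_000_000, "end_ns": 12_000_000_000, "total": 8.0, "rate": 4.0,
     "is_complete": False},
]
aggregates = rate_aggregates(slices)
```

Here the partial window's rate of 4.0 stays visible in the `timeslices` array but does not influence `rate_max`, which remains 1.8.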


Labeled Metrics

Prometheus metrics with labels (e.g., model, status) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by endpoint_url.
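
Because each label combination becomes its own series, reading the JSON export usually means picking a series by label rather than by index. The sketch below shows one way to do that; the helper name `select_series` is hypothetical and the metric dictionary is illustrative data shaped like an entry under `metrics` in the JSON export.

```python
from typing import List


def select_series(metric: dict, **labels: str) -> List[dict]:
    """Return the series whose labels include every requested key/value pair."""
    return [
        s for s in metric.get("series", [])
        if all(s.get("labels", {}).get(k) == v for k, v in labels.items())
    ]


# Illustrative data: one counter split into two series by finished_reason.
metric = {
    "type": "counter",
    "series": [
        {"endpoint_url": "http://node1:8081/metrics",
         "labels": {"finished_reason": "length"}, "stats": {"total": 19.0}},
        {"endpoint_url": "http://node1:8081/metrics",
         "labels": {"finished_reason": "stop"}, "stats": {"total": 4.0}},
    ],
}
matched = select_series(metric, finished_reason="length")
```

Filtering on `endpoint_url` works the same way with a plain comparison on each series, since that field sits next to `labels` rather than inside it.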

Unit Inference

AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (_seconds, _bytes, _requests, etc.). Units appear in both JSON and CSV exports. The unit field is optional—if no unit can be inferred, it’s omitted.
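
A simplified picture of name-based inference is a suffix lookup, sketched below. The table `SUFFIX_UNITS` and the function `infer_unit` are hypothetical and cover only the three suffixes named above; AIPerf's real rules also use metric descriptions, which is how a metric such as vllm:kv_cache_usage_perc can still be reported with a unit in the export.

```python
from typing import Optional

# Hypothetical suffix table for illustration; the real rules cover more cases
# and also consult the metric description.
SUFFIX_UNITS = {"_seconds": "seconds", "_bytes": "bytes", "_requests": "requests"}


def infer_unit(metric_name: str) -> Optional[str]:
    """Return a unit based on the metric name suffix, or None if unknown."""
    for suffix, unit in SUFFIX_UNITS.items():
        if metric_name.endswith(suffix):
            return unit
    return None  # mirrors the optional unit field: nothing inferred, nothing emitted
```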

Common Metrics by Server

vLLM

| Metric | Type | Description |
|---|---|---|
| vllm:num_requests_running | gauge | Requests in execution batches |
| vllm:num_requests_waiting | gauge | Requests in queue (saturation indicator) |
| vllm:num_requests_waiting_by_reason | gauge | Waiting requests split by capacity vs deferred |
| vllm:engine_sleep_state | gauge | Engine sleep/offload state |
| vllm:kv_cache_usage_perc | gauge | KV-cache usage (0.0-1.0, >0.9 = capacity limit) |
| vllm:num_preemptions | counter | Requests preempted due to memory pressure |
| vllm:prefix_cache_hits | counter | Tokens served from prefix cache |
| vllm:prefix_cache_queries | counter | Tokens queried (hit_rate = hits/queries) |
| vllm:external_prefix_cache_hits | counter | Tokens served from external KV connector cache |
| vllm:external_prefix_cache_queries | counter | Tokens queried from external KV connector cache |
| vllm:mm_cache_hits | counter | Multi-modal cache hits |
| vllm:mm_cache_queries | counter | Multi-modal cache queries |
| vllm:time_to_first_token_seconds | histogram | Time to first token (TTFT) |
| vllm:e2e_request_latency_seconds | histogram | End-to-end latency |
| vllm:inter_token_latency_seconds | histogram | Time between output tokens (ITL) |
| vllm:request_queue_time_seconds | histogram | Time spent waiting in queue |
| vllm:request_prefill_time_seconds | histogram | Time spent in prefill phase |
| vllm:request_decode_time_seconds | histogram | Time spent in decode phase |
| vllm:request_prefill_kv_computed_tokens | histogram | New KV tokens computed during prefill, excluding cached tokens |
| vllm:request_success | counter | Completed requests |
| vllm:prompt_tokens | counter | Total prompt tokens (rate = prefill throughput) |
| vllm:prompt_tokens_by_source | counter | Prompt tokens by local_compute, local_cache_hit, or external_kv_transfer |
| vllm:prompt_tokens_cached | counter | Cached prompt tokens (local + external) |
| vllm:generation_tokens | counter | Total generated tokens (rate = decode throughput) |
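
Derived ratios such as the prefix cache hit rate (hits/queries) are not exported directly but can be computed from counter totals in the JSON export. The sketch below is one way to do it; the helper name `counter_ratio` is hypothetical, the totals are made-up illustrative values, and the sketch sums across all series of each metric, which assumes every label combination should be counted.

```python
def counter_ratio(export: dict, numerator: str, denominator: str) -> float:
    """Divide the summed counter totals of two metrics across all series."""
    def total(name: str) -> float:
        series = export["metrics"].get(name, {}).get("series", [])
        return sum(s["stats"]["total"] for s in series)

    denom = total(denominator)
    # Return NaN instead of raising when the denominator saw no activity.
    return total(numerator) / denom if denom else float("nan")


# Illustrative export fragment with made-up totals.
export = {"metrics": {
    "vllm:prefix_cache_hits": {"series": [{"stats": {"total": 300.0}}]},
    "vllm:prefix_cache_queries": {"series": [{"stats": {"total": 1200.0}}]},
}}
hit_rate = counter_ratio(export, "vllm:prefix_cache_hits", "vllm:prefix_cache_queries")
```

The same helper applies to other counter pairs, for example Triton's average batch size as nv_inference_count divided by nv_inference_exec_count.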

Dynamo

MetricTypeDescription
dynamo_frontend_requestscounterRequests by endpoint/model/status
dynamo_frontend_inflight_requestsgaugeRequests currently processing
dynamo_frontend_queued_requestsgaugeRequests awaiting first token
dynamo_frontend_request_duration_secondshistogramEnd-to-end HTTP latency
dynamo_frontend_time_to_first_token_secondshistogramTTFT including routing overhead
dynamo_frontend_inter_token_latency_secondshistogramInter-token latency (ITL)
dynamo_frontend_input_sequence_tokenshistogramPrompt token distribution
dynamo_frontend_output_sequence_tokenshistogramResponse token distribution
dynamo_component_requestscounterPer-component (prefill/decode) requests
dynamo_component_request_duration_secondshistogramPer-component processing time
dynamo_component_inflight_requestsgaugeActive requests per worker
dynamo_component_errorscounterErrors by component/type
dynamo_component_gpu_cache_usage_percentgaugeBackend KV-cache usage
dynamo_component_embedding_cache_hitscounterMultimodal embedding-cache hits
dynamo_component_embedding_cache_missescounterMultimodal embedding-cache misses
dynamo_component_kv_publisher_zmq_eventscounterKV publisher relay events
dynamo_tokio_global_queue_depthgaugeTokio runtime global queue depth
dynamo_frontend_event_loop_delay_secondshistogramEvent-loop delay canary

SGLang

MetricTypeDescription
sglang:num_running_reqsgaugeRunning requests
sglang:num_queue_reqsgaugeQueued requests (saturation indicator)
sglang:token_usagegaugeMemory utilization (>0.9 = capacity limit)
sglang:cache_hit_rategaugePrefix cache hit rate
sglang:gen_throughputgaugeReal-time generation tokens/s
sglang:prompt_tokenscounterTotal prompt tokens (rate = prefill throughput)
sglang:generation_tokenscounterTotal generated tokens (rate = decode throughput)
sglang:time_to_first_token_secondshistogramTime to first token (TTFT)
sglang:inter_token_latency_secondshistogramTime between output tokens (ITL)
sglang:e2e_request_latency_secondshistogramEnd-to-end latency
sglang:queue_time_secondshistogramQueue wait time
sglang:per_stage_req_latency_secondshistogramLatency by observed stage (request_process, prefill_forward, decode_waiting, etc.)

TRT-LLM

MetricTypeDescription
trtllm_time_to_first_token_secondshistogramTime to first token (TTFT)
trtllm_e2e_request_latency_secondshistogramEnd-to-end latency
trtllm_time_per_output_token_secondshistogramPer-token generation time (ITL)
trtllm_request_queue_time_secondshistogramTime in waiting phase
trtllm_request_prefill_time_secondshistogramPrefill/context phase duration
trtllm_request_decode_time_secondshistogramDecode/generation phase duration
trtllm_request_inference_time_secondshistogramTotal scheduled inference duration
trtllm_request_successcounterCompleted requests by finished_reason
trtllm_prompt_tokenscounterTotal prompt tokens (rate = prefill throughput)
trtllm_generation_tokenscounterTotal generated tokens (rate = decode throughput)
trtllm_num_requests_runninggaugeActive requests
trtllm_num_requests_waitinggaugeQueued requests
trtllm_kv_cache_utilizationgaugeKV cache utilization
trtllm_kv_cache_hit_rategaugeKV cache hit rate
trtllm_num_aborted_requestscounterDynamo-TRTLLM additional aborted/cancelled requests
trtllm_kv_transfer_latency_secondshistogramDynamo-TRTLLM additional KV-transfer latency
trtllm_kv_transfer_byteshistogramDynamo-TRTLLM additional KV-transfer size
trtllm_kv_transfer_speed_gb_shistogramDynamo-TRTLLM additional KV-transfer speed

Triton Inference Server

MetricTypeDescription
nv_inference_request_successcounterSuccessful inference requests
nv_inference_request_failurecounterFailed inference requests by reason
nv_inference_countcounterInferences performed; divide by nv_inference_exec_count for average batch size
nv_inference_exec_countcounterBackend batch executions
nv_inference_pending_request_countgaugeRequests received by Triton but not yet executing
nv_inference_request_duration_uscounterCumulative end-to-end request handling time
nv_inference_queue_duration_uscounterCumulative scheduler queue time
nv_inference_first_response_histogram_mshistogramOptional first-response latency histogram
nv_gpu_utilizationgaugeGPU utilization
nv_gpu_memory_used_bytesgaugeUsed GPU memory
nv_cache_num_hits_per_modelcounterResponse-cache hits per model (when response cache is enabled)
nv_cache_num_misses_per_modelcounterResponse-cache misses per model (when response cache is enabled)

Troubleshooting

| Problem | Check | Solution |
|---|---|---|
| High p99, good p50 | vllm:num_requests_waiting spikes | Queue buildup: reduce concurrency or increase server capacity |
| OOM crashes | vllm:kv_cache_usage_perc approaching 1.0 | Reduce max_model_len or increase gpu_memory_utilization |
| Low throughput | vllm:num_requests_running vs vllm:num_requests_waiting | Both low = client bottleneck; high waiting = server bottleneck |
| Endpoint unreachable | curl http://localhost:8000/metrics or curl http://localhost:8002/metrics for Triton | Check that the server is running and reachable (network, firewall); use an explicit --server-metrics URL |
| WARNING ... non-Prometheus content-type 'application/json' | curl -i <base>/metrics shows Content-Type: application/json | Server isn’t serving Prometheus at /metrics. For TRT-LLM, set return_perf_metrics: true in extra_llm_api_options.yaml so AIPerf’s auto-probe finds /prometheus/metrics. To silence the warning entirely, pass --no-server-metrics. See Compatibility & auto-disable. |

CI/CD Integration

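    import json

    # Load the aggregated server metrics export produced by the benchmark run.
    with open('server_metrics_export.json') as f:
        data = json.load(f)

    # Fail the pipeline if the estimated P99 end-to-end latency exceeds 5 seconds.
    latency = data['metrics']['vllm:e2e_request_latency_seconds']['series'][0]['stats']
    assert latency['p99_estimate'] < 5.0, f"P99 latency too high: {latency['p99_estimate']}"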