Server Metrics Collection


AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo, etc.).

Quick Reference

| Feature | Description | Default |
| --- | --- | --- |
| Auto-discovery | Automatically finds /metrics endpoint on server URL | Enabled |
| Collection | Scrapes metrics every 333ms during benchmark | Enabled |
| Outputs | JSON (aggregated), CSV (tabular), JSONL (time-series), Parquet (cumulative deltas) | JSON + CSV + Parquet |
| Custom endpoints | --server-metrics URL [URL...] for additional endpoints | None |
| Disable | --no-server-metrics to turn off collection | Enabled |
| Windowed stats | --slice-duration SECONDS for time-sliced analysis | Off |

Key metrics by server:

vLLM

| Metric | Type | What to Watch |
| --- | --- | --- |
| vllm:num_requests_running | gauge | Active batch size (stats.avg) |
| vllm:num_requests_waiting | gauge | Queue depth; growing = saturation (stats.max) |
| vllm:kv_cache_usage_perc | gauge | >0.9 = capacity limit (stats.max) |
| vllm:num_preemptions | counter | >0 = memory pressure (stats.total) |
| vllm:e2e_request_latency_seconds | histogram | E2E latency (stats.p99_estimate) |
| vllm:time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| vllm:inter_token_latency_seconds | histogram | ITL (stats.p99_estimate) |
| vllm:generation_tokens | counter | Decode throughput (stats.rate) |

Dynamo

| Metric | Type | What to Watch |
| --- | --- | --- |
| dynamo_frontend_inflight_requests | gauge | Active requests (stats.avg) |
| dynamo_frontend_queued_requests | gauge | Requests awaiting first token (stats.avg) |
| dynamo_frontend_request_duration_seconds | histogram | E2E latency (stats.p99_estimate) |
| dynamo_frontend_time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| dynamo_frontend_inter_token_latency_seconds | histogram | ITL (stats.p99_estimate) |
| dynamo_frontend_requests | counter | Throughput (stats.rate) |
| dynamo_component_kvstats_gpu_cache_usage_percent | gauge | Backend cache usage (stats.max) |

SGLang

| Metric | Type | What to Watch |
| --- | --- | --- |
| sglang:num_running_reqs | gauge | Active batch size (stats.avg) |
| sglang:num_queue_reqs | gauge | Queue depth; growing = saturation (stats.max) |
| sglang:token_usage | gauge | >0.9 = capacity limit (stats.max) |
| sglang:cache_hit_rate | gauge | Prefix cache efficiency (stats.avg) |
| sglang:gen_throughput | gauge | Real-time tokens/s (stats.avg) |
| sglang:queue_time_seconds | histogram | Queue wait (stats.p99_estimate) |

TRT-LLM

| Metric | Type | What to Watch |
| --- | --- | --- |
| trtllm:e2e_request_latency_seconds | histogram | E2E latency (stats.p99_estimate) |
| trtllm:time_to_first_token_seconds | histogram | TTFT (stats.p99_estimate) |
| trtllm:time_per_output_token_seconds | histogram | ITL (stats.p99_estimate) |
| trtllm:request_queue_time_seconds | histogram | Queue wait (stats.p99_estimate) |
| trtllm:request_success | counter | Completed requests (stats.rate) |

TRT-LLM server-side setup is required. Unlike vLLM and SGLang, trtllm-serve does not expose Prometheus exposition format at /metrics by default — the default /metrics returns an iteration-stats JSON array (application/json), which is not parseable as Prometheus. Two consequences:

  1. Enable Prometheus on the server. Pass return_perf_metrics: true in your extra_llm_api_options.yaml. This mounts the proper Prometheus exposition at /prometheus/metrics (a non-standard path).
  2. AIPerf auto-detects and falls back. When AIPerf hits /metrics and gets application/json, it automatically probes <base>/prometheus/metrics once. If the alt path serves Prometheus, AIPerf swaps the URL and continues — no manual override needed. If the alt path also fails (e.g. return_perf_metrics was not set), the collector auto-disables for the remainder of the run with a single warning.

Example extra_llm_api_options.yaml snippet:

return_perf_metrics: true

Quick Start

Server metrics are collected by default, so just run AIPerf normally:

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url localhost:8000 \
  --concurrency 4 \
  --request-count 100

AIPerf automatically:

  1. Discovers the /metrics endpoint on your inference server (base URL + /metrics)
  2. Tests endpoint reachability before profiling starts
  3. Captures baseline metrics before warmup period begins (reference point for deltas) — also where AIPerf first parses the response and validates it as Prometheus exposition format; see Compatibility & auto-disable for what happens when an endpoint returns non-Prometheus content
  4. Collects metrics at configurable intervals during warmup and profiling
  5. Performs final scrape after profiling completes (captures end state)
  6. Exports selected formats (default: JSON + CSV + Parquet):
    • server_metrics_export.json - Aggregated statistics (profiling period only)
    • server_metrics_export.csv - Tabular format (profiling period only)
    • server_metrics_export.parquet - Raw time-series with delta calculations
    • server_metrics_export.jsonl - Time-series data (all scrapes, opt-in only)

Custom file naming: The --profile-export-prefix (or --profile-export-file) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:

aiperf profile --model MODEL ... --profile-export-prefix my_benchmark
# Produces: my_benchmark_server_metrics.json, my_benchmark_server_metrics.csv, etc.

# --profile-export-file is an alias for --profile-export-prefix, so this is equivalent:
aiperf profile --model MODEL ... --profile-export-file my_benchmark.json
# Produces the same files (the .json extension is stripped automatically)

Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.

Format selection: By default, JSON, CSV, and Parquet formats are generated (JSONL is opt-in to avoid large files). To opt out of Parquet, or to include JSONL for time-series analysis:

# Disable Parquet (JSON + CSV only)
aiperf profile --model MODEL ... --server-metrics-formats json csv

# Add JSONL for raw time-series snapshots
aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl

Adding Custom Endpoints

# Single endpoint
aiperf profile --model MODEL ... --server-metrics http://localhost:8081

# Multiple endpoints (distributed deployment)
aiperf profile --model MODEL ... --server-metrics \
  http://node1:8081 \
  http://node2:8081

Disabling Server Metrics

aiperf profile --model MODEL ... --no-server-metrics

Selecting Output Formats

# Default: JSON + CSV + Parquet
aiperf profile --model MODEL ...

# Opt out of Parquet (JSON + CSV only)
aiperf profile --model MODEL ... --server-metrics-formats json csv

# Add JSONL for raw time-series snapshots
aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl

| Format | Use Case | Size |
| --- | --- | --- |
| JSON/CSV (default) | Summary statistics, CI/CD thresholds | Small |
| Parquet (default) | SQL queries, pandas/DuckDB analytics | Compressed |
| JSONL (opt-in) | Debugging, raw Prometheus snapshots | 10-100x larger |

Compatibility & auto-disable

AIPerf scrapes /metrics at ~3 Hz and parses the response as Prometheus exposition format. When a server speaks something else at that path (most commonly TRT-LLM, which serves an iteration-stats JSON array), AIPerf does not retry-and-spam — it detects the mismatch on the first scrape and disables collection for that endpoint with a single log line. This avoids the failure mode where parse errors at the scrape interval inflate run time by 10×+.

Detection. A response is treated as non-Prometheus when either:

  • the HTTP Content-Type is application/json (the response body is never read in this case — the rejection is cheaper than parsing); or
  • the body fails to parse as Prometheus exposition format (prometheus_client.parser.text_string_to_metric_families raises ValueError — e.g. a server returns text/plain with garbage, or a JSON body without a content-type).
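
The two detection rules can be sketched as a small function. This is a minimal illustration, not AIPerf's actual code: the name non_prometheus_reason and the injected read_body and parse callables are hypothetical, chosen so the sketch runs without third-party packages. In practice parse could be prometheus_client.parser.text_string_to_metric_families, the parser named above.

```python
from typing import Callable, Iterable, Optional


def non_prometheus_reason(
    content_type: Optional[str],
    read_body: Callable[[], str],
    parse: Callable[[str], Iterable],
) -> Optional[str]:
    """Return a reason string if the response is not Prometheus, else None.

    Hypothetical helper: `read_body` and `parse` are injected so the body is
    only fetched and parsed when the content type does not already rule it out.
    """
    # Rule 1: a JSON content type is rejected without reading the body.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/json":
        return "content-type is application/json"
    # Rule 2: otherwise the body must parse as Prometheus exposition format.
    try:
        # Parsers may be lazy generators, so consume the result fully.
        list(parse(read_body()))
    except ValueError as exc:
        return f"body is not Prometheus exposition format: {exc}"
    return None
```

Checking the header first keeps the common TRT-LLM case cheap, because the JSON array is never downloaded or parsed.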

TRT-LLM /prometheus/metrics fallback. Before disabling, AIPerf probes <base>/prometheus/metrics exactly once — TRT-LLM mounts the proper Prometheus path there when launched with return_perf_metrics: true (see the TRT-LLM entry in the Quick Reference table above). If the probe succeeds, the collector swaps its URL there and the run continues with the alt endpoint. The probe is attempted whenever the configured URL ends with /metrics and is not already /prometheus/metrics itself — so /metrics, /v1/metrics, and /api/metrics all trigger the fallback probe. URLs that don’t end in /metrics (e.g. /stats, /telemetry) are left untouched, and /prometheus/metrics is excluded to avoid probing the same path it would swap to.
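
The rule for when the fallback probe is attempted can be written as a predicate over the URL path. The sketch below is an assumption-laden illustration: should_probe_fallback is a hypothetical name, and it treats any path ending in /prometheus/metrics as already being the alternate path. It deliberately does not build the probe URL, since the text does not spell out how <base> is derived for nested paths.

```python
from urllib.parse import urlsplit


def should_probe_fallback(url: str) -> bool:
    """Return True if the URL qualifies for the /prometheus/metrics probe.

    Hypothetical helper mirroring the rule described above: the path must end
    with /metrics and must not already be the /prometheus/metrics path.
    """
    path = urlsplit(url).path
    if not path.endswith("/metrics"):
        return False  # e.g. /stats or /telemetry are left untouched
    # Assumption: any path ending in /prometheus/metrics counts as "already there".
    return not path.endswith("/prometheus/metrics")
```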

On auto-disable. A single WARNING is emitted naming the endpoint and the suppression flag. Subsequent scrape cycles short-circuit, the collector emits no further log noise, and the rest of the benchmark proceeds normally — other configured endpoints (DCGM telemetry, additional --server-metrics URLs) are unaffected.

WARNING Disabling server metrics collection for http://127.0.0.1:60000/metrics:
endpoint 'http://127.0.0.1:60000/metrics' returned non-Prometheus
content-type 'application/json'; expected text/plain (Prometheus
exposition format). To suppress this warning, pass --no-server-metrics.

To suppress the warning entirely, pass --no-server-metrics — collection is skipped, no probe is attempted, no warning is logged.

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| AIPERF_SERVER_METRICS_COLLECTION_INTERVAL | 0.333s | Collection frequency (333ms, ~3Hz) |
| AIPERF_SERVER_METRICS_COLLECTION_FLUSH_PERIOD | 2.0s | Wait time for final metrics after benchmark |
| AIPERF_SERVER_METRICS_REACHABILITY_TIMEOUT | 10s | Timeout for endpoint reachability tests |
| AIPERF_SERVER_METRICS_EXPORT_BATCH_SIZE | 100 | Batch size for JSONL writer |
| AIPERF_SERVER_METRICS_SHUTDOWN_DELAY | 5.0s | Shutdown delay for command response transmission |

Output Files

The filenames below are defaults. When --profile-export-prefix <prefix> is used, server metrics files are named <prefix>_server_metrics.{json,csv,jsonl,parquet} (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (--artifact-dir / --output-artifact-dir, default: ./artifacts/<run_info>).

1. Time-Series: server_metrics_export.jsonl

Line-delimited JSON with metrics snapshots over time:

{
  "endpoint_url": "http://localhost:8000/metrics",
  "timestamp_ns": 1763591215220757503,
  "endpoint_latency_ns": 719764167,
  "metrics": {
    "vllm:num_requests_running": [{"value": 12.0}],
    "vllm:kv_cache_usage_perc": [{"value": 0.72}],
    "vllm:request_success": [{"value": 1500.0}],
    "vllm:time_to_first_token_seconds": [{
      "buckets": {"0.01": 145.0, "0.1": 1498.0, "+Inf": 1500.0},
      "sum": 32.456,
      "count": 1500.0
    }]
  },
  "request_sent_ns": 1763591214500993336,
  "first_byte_ns": 1763591215220757503
}

Fields:

  • endpoint_url: Source Prometheus endpoint
  • timestamp_ns: Collection timestamp in nanoseconds
  • endpoint_latency_ns: HTTP round-trip time in nanoseconds
  • metrics: All metrics from this endpoint
    • Counter/Gauge: {"value": N} or {"labels": {...}, "value": N}
    • Histogram: {"buckets": {"le": count}, "sum": N, "count": N} with optional labels
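
As an illustration of consuming this format, the following sketch extracts a (timestamp, value) series for one counter or gauge from JSONL lines. It assumes only the record layout shown above; gauge_series is a hypothetical helper name, not part of AIPerf.

```python
import json
from typing import Iterable, Iterator, Tuple


def gauge_series(lines: Iterable[str], metric: str) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp_ns, value) for each sample of `metric` in JSONL lines."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for sample in record["metrics"].get(metric, []):
            # Counters and gauges carry "value"; histogram samples do not.
            if "value" in sample:
                yield record["timestamp_ns"], sample["value"]
```

Because an open text file iterates line by line, a file handle for server_metrics_export.jsonl can be passed directly as lines. Labeled metrics produce one tuple per label combination per scrape.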

2. Aggregated Statistics: server_metrics_export.json

Aggregated statistics from profiling period. Metrics from all endpoints are merged, each series tagged with endpoint_url.

{
  "schema_version": "1.0",
  "aiperf_version": "0.3.0",
  "benchmark_id": "2900a136-3c1a-4520-adaa-5719822b729b",
  "summary": {
    "endpoints_configured": ["http://localhost:8000/metrics"],
    "endpoints_successful": ["http://localhost:8000/metrics"],
    "start_time": "2025-12-15T02:04:23.028529",
    "end_time": "2025-12-15T02:05:15.294690",
    "endpoint_info": {
      "http://localhost:8000/metrics": {
        "total_fetches": 157,
        "first_fetch_ns": 1765793061967310848,
        "last_fetch_ns": 1765793114960054143,
        "avg_fetch_latency_ms": 246.83,
        "unique_updates": 157,
        "first_update_ns": 1765793061967310848,
        "last_update_ns": 1765793114960054143,
        "duration_seconds": 52.99,
        "avg_update_interval_ms": 339.70,
        "median_update_interval_ms": 333.48
      }
    }
  },
  "metrics": {
    "vllm:kv_cache_usage_perc": {
      "type": "gauge",
      "description": "KV-cache usage. 1 means 100 percent usage.",
      "unit": "ratio",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "avg": 0.191, "min": 0.0, "max": 0.202, "std": 0.038,
          "p1": 0.003, "p5": 0.178, "p10": 0.191, "p25": 0.198,
          "p50": 0.202, "p75": 0.202, "p90": 0.202, "p95": 0.202, "p99": 0.202
        },
        "timeslices": [
          { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "avg": 0.107, "min": 0.0, "max": 0.191 },
          { "start_ns": 1765793068028529452, "end_ns": 1765793073028529452, "avg": 0.192, "min": 0.191, "max": 0.194 }
        ]
      }]
    },
    "vllm:request_success": {
      "type": "counter",
      "description": "Count of successfully processed requests.",
      "unit": "requests",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "finished_reason": "length", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "total": 19.0, "rate": 0.359,
          "rate_avg": 0.38, "rate_min": 0.0, "rate_max": 1.8, "rate_std": 0.751
        },
        "timeslices": [
          { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "total": 0.0, "rate": 0.0 },
          { "start_ns": 1765793073028529452, "end_ns": 1765793078028529452, "total": 9.0, "rate": 1.8 }
        ]
      }]
    },
    "vllm:e2e_request_latency_seconds": {
      "type": "histogram",
      "description": "Histogram of e2e request latency in seconds.",
      "unit": "seconds",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "count": 19, "sum": 259.87, "avg": 13.68,
          "count_rate": 0.359, "sum_rate": 4.90,
          "p1_estimate": 2.25, "p5_estimate": 5.77, "p10_estimate": 8.26,
          "p25_estimate": 10.82, "p50_estimate": 13.75, "p75_estimate": 15.35,
          "p90_estimate": 17.24, "p95_estimate": 19.51, "p99_estimate": 31.77
        },
        "buckets": {
          "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 1, "5.0": 1,
          "10.0": 3, "15.0": 11, "20.0": 18, "30.0": 18, "+Inf": 19
        },
        "timeslices": [
          {
            "start_ns": 1765793063028529452, "end_ns": 1765793068028529452,
            "count": 0, "sum": 0.0, "avg": 0.0,
            "buckets": { "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 0, "5.0": 0, "10.0": 0, "15.0": 0, "20.0": 0, "+Inf": 0 }
          }
        ]
      }]
    }
  },
  "input_config": {
    "models": ["Qwen/Qwen3-0.6B"],
    "endpoint": { "urls": ["http://localhost:8000"], "streaming": true },
    "datasets": [{ "name": "default", "type": "synthetic", "count": 30000 }],
    "phases": [
      { "name": "profiling", "type": "concurrency", "concurrency": 400, "requests": 30000 }
    ],
    "artifacts": { "slice_duration": 5.0 }
  }
}

Query with jq:

jq '.metrics["vllm:e2e_request_latency_seconds"].series[0].stats.p99_estimate' server_metrics_export.json

3. CSV Export: server_metrics_export.csv

Tabular export organized in five sections (separated by blank lines): gauge, counter, histogram, unknown, info. The unknown section holds families that the Prometheus server declared as # TYPE foo untyped (or with no # TYPE line at all); they use the same statistics columns as gauges.

  • Labels expanded into individual columns for easy filtering/pivoting
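  • Open directly in Excel/Sheets or load with pandas

from io import StringIO
import pandas as pd

# Sections are separated by blank lines, so split on them and parse each
# section (gauge, counter, histogram, unknown, info) into its own DataFrame.
with open("server_metrics_export.csv") as f:
    sections = [pd.read_csv(StringIO(s)) for s in f.read().strip().split('\n\n') if s.strip()]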

4. Parquet Export: server_metrics_export.parquet

Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.

Schema overview:

| Column | Type | Description |
| --- | --- | --- |
| endpoint_url | string | Source Prometheus endpoint |
| metric_name | string | Metric name |
| metric_type | string | gauge, unknown, counter, or histogram |
| timestamp_ns | int64 | Collection timestamp (nanoseconds) |
| value | float64 | Gauge/counter value (delta for counters) |
| sum, count | float64 | Histogram sum/count deltas |
| bucket_le, bucket_count | string, float64 | Histogram bucket bound and delta count |
| (label columns) | string | Dynamic columns from Prometheus labels |

See Parquet Schema Reference for complete schema, metadata, and query examples.

Quick examples:

# DuckDB queries
duckdb -c "SELECT * FROM 'server_metrics_export.parquet' WHERE metric_name LIKE 'vllm:%' ORDER BY timestamp_ns"
duckdb -c "SELECT metric_name, AVG(value) FROM '*.parquet' WHERE metric_type='gauge' GROUP BY metric_name"

# Combine multiple runs (handles schema differences)
duckdb -c "SELECT * FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)"

import pandas as pd

# Plot KV-cache usage over time from the Parquet export
df = pd.read_parquet('server_metrics_export.parquet')
df[df['metric_name'] == 'vllm:kv_cache_usage_perc'].plot(x='timestamp_ns', y='value')

Statistics by Metric Type

Now that you understand the output formats, let’s examine how statistics are structured within each metric type.

Statistics are nested under a stats field within each series item. All metrics use the stats format for consistent API access.

Gauge (point-in-time values)

Statistics: avg, min, max, std, p1, p5, p10, p25, p50, p75, p90, p95, p99

Gauge percentiles are computed from actual collected samples (not estimated from buckets).

Counter (cumulative totals)

Statistics: total, rate, and when --slice-duration is set: rate_avg, rate_min, rate_max, rate_std

  • total: Change during profiling period (uses last pre-profiling sample as reference)
  • rate: Increase per second (total/duration)
  • Counter resets are detected and handled (negative deltas → total = 0)
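
The counter arithmetic described above can be sketched as follows. This is a simplified illustration under stated assumptions: counter_stats is a hypothetical helper, it works from a single baseline and a single final sample, and it collapses any negative delta to zero as the text describes. AIPerf's actual reset handling may be more granular.

```python
def counter_stats(baseline: float, final: float, duration_s: float) -> dict:
    """Compute total and rate for a counter over the profiling period.

    baseline:   last cumulative value observed before profiling started
    final:      last cumulative value observed during profiling
    duration_s: length of the profiling period in seconds
    """
    total = final - baseline
    if total < 0:
        # A negative delta indicates a counter reset; report zero instead.
        total = 0.0
    rate = total / duration_s if duration_s > 0 else 0.0
    return {"total": total, "rate": rate}
```

For example, a counter that moves from 100 to 130 over a 60 second profiling period yields a total of 30 and a rate of 0.5 per second.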

Histogram (distributions)

Statistics (stats): count, count_rate, sum, sum_rate, avg, p1_estimate, p5_estimate, p10_estimate, p25_estimate, p50_estimate, p75_estimate, p90_estimate, p95_estimate, p99_estimate

Series-level field: buckets (per-bucket delta counts, not cumulative)

  • avg (sum/count) is exact
  • Percentiles are estimates from bucket interpolation
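
To show why histogram percentiles are only estimates, here is a sketch of linear interpolation within buckets, in the style of Prometheus's histogram_quantile. It is not AIPerf's estimator, which may treat the +Inf bucket and other edge cases differently. The function name estimate_percentile is hypothetical, and the input is assumed to be cumulative le counts as they appear in a raw Prometheus scrape.

```python
import math
from typing import Dict


def estimate_percentile(cumulative: Dict[str, float], q: float) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative `le` bucket counts."""
    # float("+Inf") parses to infinity, so the +Inf bucket sorts last.
    bounds = sorted((float(le), count) for le, count in cumulative.items())
    if not bounds or bounds[-1][1] <= 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * bounds[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in bounds:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread uniformly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

The uniform-spread assumption is the source of the error: with wide buckets, the true percentile can sit anywhere between the two bounds.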

Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Major LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo) use Histograms instead, which allow period-specific percentile estimation.

Timesliced Statistics

When configured with --slice-duration, AIPerf computes windowed statistics over fixed time intervals. Each series includes a timeslices array with per-window statistics:

{
  "stats": { "avg": 25.5, "min": 0.0, "max": 50.0 },
  "timeslices": [
    { "start_ns": 1765615837721140145, "end_ns": 1765615839721140145, "avg": 22.9, "min": 0.0, "max": 42.0 },
    { "start_ns": 1765615839721140145, "end_ns": 1765615841721140145, "avg": 49.8, "min": 49.0, "max": 50.0 }
  ]
}

  • Gauges: Each timeslice contains avg, min, max
  • Counters: Each timeslice contains total, rate
  • Histograms: Each timeslice contains count, sum, avg, buckets

Partial timeslices (at the end of the collection period) are marked with is_complete: false and excluded from aggregate statistics (e.g., rate_avg, rate_min) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.
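
The exclusion of partial slices can be illustrated with a short sketch that recomputes the counter rate aggregates from a timeslices array. The helper name rate_aggregates is hypothetical, slices without an is_complete field are assumed to be complete, and the use of a population standard deviation is an assumption; the text does not say which variant AIPerf uses.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rate_aggregates(timeslices: List[dict]) -> Dict[str, float]:
    """Aggregate per-slice counter rates, skipping partial slices."""
    # Assumption: a missing is_complete field means the slice is complete.
    rates = [ts["rate"] for ts in timeslices if ts.get("is_complete", True)]
    if not rates:
        return {}
    return {
        "rate_avg": mean(rates),
        "rate_min": min(rates),
        "rate_max": max(rates),
        "rate_std": pstdev(rates),  # population std dev (an assumption)
    }
```

Dropping the trailing partial slice matters because a short window with few events can produce an artificially low or high rate.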


Labeled Metrics

Prometheus metrics with labels (e.g., model, status) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by endpoint_url.
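
Because each label combination is its own series, reading a value from the JSON export usually means selecting the series first. A minimal sketch, assuming the series layout shown in the JSON export section; find_series is a hypothetical helper name.

```python
from typing import List


def find_series(metric: dict, endpoint_url: str = "", **labels: str) -> List[dict]:
    """Return the series of one exported metric that match all given labels.

    `metric` is one entry of the export's "metrics" object. An empty
    endpoint_url matches series from every endpoint.
    """
    matches = []
    for series in metric["series"]:
        if endpoint_url and series.get("endpoint_url") != endpoint_url:
            continue
        series_labels = series.get("labels", {})
        if all(series_labels.get(k) == v for k, v in labels.items()):
            matches.append(series)
    return matches
```

For the vllm:request_success example above, find_series(metric, finished_reason="length") would return only the series for requests that stopped at the length limit.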

Unit Inference

AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (_seconds, _bytes, _requests, etc.). Units appear in both JSON and CSV exports. The unit field is optional—if no unit can be inferred, it’s omitted.
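
A suffix-only version of this inference might look like the sketch below. It covers just the three suffixes named above and ignores descriptions, so it is narrower than what AIPerf does; infer_unit and the mapping are hypothetical.

```python
from typing import Optional

# Suffix-to-unit pairs limited to the conventions named in the text above.
# The real inference also consults descriptions and likely covers more cases.
_SUFFIX_UNITS = (
    ("_seconds", "seconds"),
    ("_bytes", "bytes"),
    ("_requests", "requests"),
)


def infer_unit(metric_name: str) -> Optional[str]:
    """Return a unit inferred from the metric name suffix, or None."""
    for suffix, unit in _SUFFIX_UNITS:
        if metric_name.endswith(suffix):
            return unit
    return None  # mirrors the export omitting the unit field
```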

Common Metrics by Server

vLLM

| Metric | Type | Description |
| --- | --- | --- |
| vllm:num_requests_running | gauge | Requests in execution batches |
| vllm:num_requests_waiting | gauge | Requests in queue (saturation indicator) |
| vllm:kv_cache_usage_perc | gauge | KV-cache usage (0.0-1.0, >0.9 = capacity limit) |
| vllm:num_preemptions | counter | Requests preempted due to memory pressure |
| vllm:prefix_cache_hits | counter | Tokens served from prefix cache |
| vllm:prefix_cache_queries | counter | Tokens queried (hit_rate = hits/queries) |
| vllm:time_to_first_token_seconds | histogram | Time to first token (TTFT) |
| vllm:e2e_request_latency_seconds | histogram | End-to-end latency |
| vllm:inter_token_latency_seconds | histogram | Time between output tokens (ITL) |
| vllm:request_queue_time_seconds | histogram | Time spent waiting in queue |
| vllm:request_prefill_time_seconds | histogram | Time spent in prefill phase |
| vllm:request_decode_time_seconds | histogram | Time spent in decode phase |
| vllm:request_success | counter | Completed requests |
| vllm:prompt_tokens | counter | Total prompt tokens (rate = prefill throughput) |
| vllm:generation_tokens | counter | Total generated tokens (rate = decode throughput) |
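
The hit_rate = hits/queries relation can be evaluated from the JSON export by dividing the two counter totals. A minimal sketch, assuming the export layout shown earlier and summing across all series (label combinations and endpoints); prefix_cache_hit_rate is a hypothetical helper name.

```python
from typing import Optional


def _counter_total(metrics: dict, name: str) -> float:
    """Sum stats.total over every series of a counter (0.0 if absent)."""
    series = metrics.get(name, {}).get("series", [])
    return sum(s["stats"]["total"] for s in series)


def prefix_cache_hit_rate(metrics: dict) -> Optional[float]:
    """Return hits/queries for the profiling period, or None without queries."""
    hits = _counter_total(metrics, "vllm:prefix_cache_hits")
    queries = _counter_total(metrics, "vllm:prefix_cache_queries")
    return hits / queries if queries else None
```

Here metrics is the "metrics" object of server_metrics_export.json. Because the totals are deltas over the profiling period, the result excludes cache activity from warmup.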

Dynamo

| Metric | Type | Description |
| --- | --- | --- |
| dynamo_frontend_requests | counter | Requests by endpoint/model/status |
| dynamo_frontend_inflight_requests | gauge | Requests currently processing |
| dynamo_frontend_queued_requests | gauge | Requests awaiting first token |
| dynamo_frontend_request_duration_seconds | histogram | End-to-end HTTP latency |
| dynamo_frontend_time_to_first_token_seconds | histogram | TTFT including routing overhead |
| dynamo_frontend_inter_token_latency_seconds | histogram | Inter-token latency (ITL) |
| dynamo_frontend_input_sequence_tokens | histogram | Prompt token distribution |
| dynamo_frontend_output_sequence_tokens | histogram | Response token distribution |
| dynamo_component_requests | counter | Per-component (prefill/decode) requests |
| dynamo_component_request_duration_seconds | histogram | Per-component processing time |
| dynamo_component_inflight_requests | gauge | Active requests per worker |
| dynamo_component_errors | counter | Errors by component/type |
| dynamo_component_kvstats_gpu_cache_usage_percent | gauge | Backend KV-cache usage |

SGLang

| Metric | Type | Description |
| --- | --- | --- |
| sglang:num_running_reqs | gauge | Running requests |
| sglang:num_queue_reqs | gauge | Queued requests (saturation indicator) |
| sglang:token_usage | gauge | Memory utilization (>0.9 = capacity limit) |
| sglang:cache_hit_rate | gauge | Prefix cache hit rate |
| sglang:gen_throughput | gauge | Real-time generation tokens/s |
| sglang:queue_time_seconds | histogram | Queue wait time |
| sglang:per_stage_req_latency_seconds | histogram | Latency by stage (prefill_/decode_) |

TRT-LLM

| Metric | Type | Description |
| --- | --- | --- |
| trtllm:time_to_first_token_seconds | histogram | Time to first token (TTFT) |
| trtllm:e2e_request_latency_seconds | histogram | End-to-end latency |
| trtllm:time_per_output_token_seconds | histogram | Per-token generation time (ITL) |
| trtllm:request_queue_time_seconds | histogram | Time in WAITING phase |
| trtllm:request_success | counter | Completed requests |

Troubleshooting

| Problem | Check | Solution |
| --- | --- | --- |
| High p99, good p50 | vllm:num_requests_waiting spikes | Queue buildup: reduce concurrency or increase server capacity |
| OOM crashes | vllm:kv_cache_usage_perc approaching 1.0 | Reduce max_model_len or increase gpu_memory_utilization |
| Low throughput | vllm:num_requests_running vs vllm:num_requests_waiting | Low both = client bottleneck; high waiting = server bottleneck |
| Endpoint unreachable | curl http://localhost:8000/metrics | Check server running, network, firewall; use explicit --server-metrics URL |
| WARNING ... non-Prometheus content-type 'application/json' | curl -i <base>/metrics shows Content-Type: application/json | Server isn't serving Prometheus at /metrics. For TRT-LLM, set return_perf_metrics: true in extra_llm_api_options.yaml so AIPerf's auto-probe finds /prometheus/metrics. To silence the warning entirely, pass --no-server-metrics. See Compatibility & auto-disable. |

CI/CD Integration

import json

with open('server_metrics_export.json') as f:
    data = json.load(f)

# Fail the pipeline if the estimated P99 end-to-end latency exceeds 5 seconds
latency = data['metrics']['vllm:e2e_request_latency_seconds']['series'][0]['stats']
assert latency['p99_estimate'] < 5.0, f"P99 latency too high: {latency['p99_estimate']}"
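
The check above inspects only series[0]. With several label combinations or endpoints, a threshold is safer when applied to the worst series. A possible extension, assuming the export layout documented earlier; worst_p99 is a hypothetical helper name.

```python
def worst_p99(export: dict, metric: str) -> float:
    """Return the highest p99_estimate across all series of a histogram metric."""
    series = export["metrics"][metric]["series"]
    # Series without a p99_estimate (e.g. no observations) are skipped.
    estimates = [
        s["stats"]["p99_estimate"] for s in series if "p99_estimate" in s["stats"]
    ]
    if not estimates:
        raise ValueError(f"no p99_estimate available for {metric}")
    return max(estimates)
```

In a pipeline, export would be the parsed server_metrics_export.json, and the assertion becomes worst_p99(export, 'vllm:e2e_request_latency_seconds') < 5.0.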