Server Metrics Collection
AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo, etc.).
Quick Reference
Key metrics by server:
vLLM
Dynamo
SGLang
TRT-LLM
TRT-LLM server-side setup is required. Unlike vLLM and SGLang, trtllm-serve does not expose Prometheus exposition format at /metrics by default — the default /metrics returns an iteration-stats JSON array (application/json), which is not parseable as Prometheus. Two consequences:
- Enable Prometheus on the server. Pass return_perf_metrics: true in your extra_llm_api_options.yaml. This mounts the proper Prometheus exposition at /prometheus/metrics (a non-standard path).
- AIPerf auto-detects and falls back. When AIPerf hits /metrics and gets application/json, it automatically probes <base>/prometheus/metrics once. If the alt path serves Prometheus, AIPerf swaps the URL and continues; no manual override is needed. If the alt path also fails (e.g. return_perf_metrics was not set), the collector auto-disables for the remainder of the run with a single warning.
Example extra_llm_api_options.yaml snippet:
Quick Start
Server metrics are collected by default - just run AIPerf normally:
AIPerf automatically:
- Discovers the /metrics endpoint on your inference server (base URL + /metrics)
- Tests endpoint reachability before profiling starts
- Captures baseline metrics before warmup period begins (reference point for deltas) — also where AIPerf first parses the response and validates it as Prometheus exposition format; see Compatibility & auto-disable for what happens when an endpoint returns non-Prometheus content
- Collects metrics at configurable intervals during warmup and profiling
- Performs final scrape after profiling completes (captures end state)
- Exports selected formats (default: JSON + CSV + Parquet):
  - server_metrics_export.json - Aggregated statistics (profiling period only)
  - server_metrics_export.csv - Tabular format (profiling period only)
  - server_metrics_export.parquet - Raw time-series with delta calculations
  - server_metrics_export.jsonl - Time-series data (all scrapes, opt-in only)
Custom file naming: The --profile-export-prefix (or --profile-export-file) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:
Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.
Format selection: By default, JSON, CSV, and Parquet formats are generated (JSONL is opt-in to avoid large files). To opt out of Parquet, or to include JSONL for time-series analysis:
Adding Custom Endpoints
Disabling Server Metrics
Selecting Output Formats
Compatibility & auto-disable
AIPerf scrapes /metrics at ~3 Hz and parses the response as Prometheus exposition format. When a server speaks something else at that path (most commonly TRT-LLM, which serves an iteration-stats JSON array), AIPerf does not retry-and-spam — it detects the mismatch on the first scrape and disables collection for that endpoint with a single log line. This avoids the failure mode where parse errors at the scrape interval inflate run time by 10×+.
Detection. A response is treated as non-Prometheus when either:
- the HTTP Content-Type is application/json (the response body is never read in this case, because the rejection is cheaper than parsing); or
- the body fails to parse as Prometheus exposition format (prometheus_client.parser.text_string_to_metric_families raises ValueError, e.g. a server returns text/plain with garbage, or a JSON body without a content-type).
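The two detection rules can be sketched roughly as below. This is an illustration under stated assumptions, not AIPerf's actual code: the function name is_non_prometheus and the injectable parse argument are hypothetical, and only prometheus_client.parser.text_string_to_metric_families comes from the text above.

```python
from typing import Callable, Iterable, Optional


def is_non_prometheus(
    content_type: str,
    body: str,
    parse: Optional[Callable[[str], Iterable]] = None,
) -> bool:
    """Return True when a scrape response should be rejected (hypothetical helper)."""
    # Rule 1: a JSON content type is rejected without looking at the body.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        return True

    if parse is None:
        # Imported lazily so the content-type check works without the library.
        from prometheus_client.parser import text_string_to_metric_families

        parse = text_string_to_metric_families

    # Rule 2: the body must parse as Prometheus exposition format.
    try:
        # The parser is a generator, so it has to be consumed to surface errors.
        for _family in parse(body):
            pass
    except ValueError:
        return True
    return False
```

Draining the generator matters here: a parse error only surfaces while iterating, so merely calling the parser would not detect a bad body.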
TRT-LLM /prometheus/metrics fallback. Before disabling, AIPerf probes <base>/prometheus/metrics exactly once — TRT-LLM mounts the proper Prometheus path there when launched with return_perf_metrics: true (see the TRT-LLM entry in the Quick Reference table above). If the probe succeeds, the collector swaps its URL there and the run continues with the alt endpoint. The probe is attempted whenever the configured URL ends with /metrics and is not already /prometheus/metrics itself — so /metrics, /v1/metrics, and /api/metrics all trigger the fallback probe. URLs that don’t end in /metrics (e.g. /stats, /telemetry) are left untouched, and /prometheus/metrics is excluded to avoid probing the same path it would swap to.
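The rule for when a fallback probe is attempted can be sketched as a small URL helper. The function name fallback_probe_url is hypothetical, and the sketch assumes that <base> means the configured URL with its trailing /metrics segment removed; the text above does not say whether a path prefix such as /v1 is kept, so treat that detail as a guess.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def fallback_probe_url(configured_url: str) -> Optional[str]:
    """Return the alternate URL to probe once, or None if no probe applies."""
    parts = urlsplit(configured_url)
    path = parts.path.rstrip("/")

    # Only URLs whose path ends in /metrics are eligible for the probe.
    if not path.endswith("/metrics"):
        return None
    # Never probe the path that the collector would swap to anyway.
    if path.endswith("/prometheus/metrics"):
        return None

    # ASSUMPTION: <base> is everything before the trailing /metrics segment.
    base_path = path[: -len("/metrics")]
    return urlunsplit(
        (parts.scheme, parts.netloc, base_path + "/prometheus/metrics",
         parts.query, parts.fragment)
    )
```

Under this assumption, http://localhost:8000/metrics maps to http://localhost:8000/prometheus/metrics, while /stats and /prometheus/metrics yield no probe.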
On auto-disable. A single WARNING is emitted naming the endpoint and the suppression flag. Subsequent scrape cycles short-circuit, the collector emits no further log noise, and the rest of the benchmark proceeds normally — other configured endpoints (DCGM telemetry, additional --server-metrics URLs) are unaffected.
To suppress the warning entirely, pass --no-server-metrics — collection is skipped, no probe is attempted, no warning is logged.
Configuration
Output Files
The filenames below are defaults. When --profile-export-prefix <prefix> is used, server metrics files are named <prefix>_server_metrics.{json,csv,jsonl,parquet} (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (--artifact-dir / --output-artifact-dir, default: ./artifacts/<run_info>).
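As a rough illustration of the naming rule, the sketch below derives the four filenames from an optional prefix. The helper name server_metrics_filenames is hypothetical, and it assumes that only the final extension of the prefix is stripped and that any directory part of the prefix is ignored.

```python
from pathlib import Path
from typing import Dict, Optional

FORMATS = ("json", "csv", "jsonl", "parquet")


def server_metrics_filenames(prefix: Optional[str] = None) -> Dict[str, str]:
    """Map each export format to its filename (hypothetical helper)."""
    if prefix is None:
        # Default names, as listed in this section.
        return {ext: f"server_metrics_export.{ext}" for ext in FORMATS}
    # ASSUMPTION: only the last extension is dropped ("my_run.json" -> "my_run").
    stem = Path(prefix).stem
    return {ext: f"{stem}_server_metrics.{ext}" for ext in FORMATS}


names = server_metrics_filenames("my_run.json")
# names["csv"] is "my_run_server_metrics.csv"
```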
1. Time-Series: server_metrics_export.jsonl
Line-delimited JSON with metrics snapshots over time:
Fields:
- endpoint_url: Source Prometheus endpoint
- timestamp_ns: Collection timestamp in nanoseconds
- endpoint_latency_ns: HTTP round-trip time in nanoseconds
- metrics: All metrics from this endpoint
  - Counter/Gauge: {"value": N} or {"labels": {...}, "value": N}
  - Histogram: {"buckets": {"le": count}, "sum": N, "count": N} with optional labels
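A minimal sketch of reading the JSONL file one scrape at a time, using only the four top-level fields listed above. The helper name summarize_scrapes is hypothetical, the sample record is made up for illustration, and the sketch assumes metrics is a JSON object keyed by metric name.

```python
import json
from typing import Iterable, List, Tuple


def summarize_scrapes(lines: Iterable[str]) -> List[Tuple[str, int, float, int]]:
    """Return (endpoint_url, timestamp_ns, latency_ms, metric_count) per scrape."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        latency_ms = record["endpoint_latency_ns"] / 1e6  # ns -> ms
        # ASSUMPTION: "metrics" is an object keyed by metric name.
        rows.append(
            (record["endpoint_url"], record["timestamp_ns"],
             latency_ms, len(record["metrics"]))
        )
    return rows


# Made-up record, for illustration only.
sample_lines = [
    '{"endpoint_url": "http://localhost:8000/metrics", '
    '"timestamp_ns": 1700000000000000000, "endpoint_latency_ns": 2500000, '
    '"metrics": {"example_gauge": [{"value": 3.0}]}}'
]
rows = summarize_scrapes(sample_lines)
```

For a real run, an open file handle on server_metrics_export.jsonl can be passed in place of sample_lines, since iterating a file yields one line at a time.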
2. Aggregated Statistics: server_metrics_export.json
Aggregated statistics from profiling period. Metrics from all endpoints are merged, each series tagged with endpoint_url.
Query with jq:
3. CSV Export: server_metrics_export.csv
Tabular export organized in five sections (separated by blank lines): gauge, counter, histogram, unknown, info. The unknown section holds families that the Prometheus server declared as # TYPE foo untyped (or with no # TYPE line at all); they use the same statistics columns as gauges.
- Labels expanded into individual columns for easy filtering/pivoting
- Open directly in Excel/Sheets or load with pandas
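Because the sections are separated by blank lines, the file cannot be read as one uniform table. The sketch below splits it into sections first, using only the standard library. It assumes each section starts with its own header row; the helper name and the column names in the sample are hypothetical.

```python
import csv
import io
import re
from typing import Dict, List


def split_csv_sections(text: str) -> List[List[Dict[str, str]]]:
    """Split a multi-section CSV on blank lines and parse each section."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # ASSUMPTION: the first row of every section is that section's header.
    return [
        list(csv.DictReader(io.StringIO(block)))
        for block in blocks
        if block.strip()
    ]


# Made-up two-section sample; real column names may differ.
sample_csv = (
    "metric,unit,avg,min,max\n"
    "example_gauge,requests,2.5,1,4\n"
    "\n"
    "metric,unit,total,rate\n"
    "example_counter,requests,300,5.0\n"
)
sections = split_csv_sections(sample_csv)
```

Each entry in sections is a list of row dictionaries, so the gauge and counter sections can keep their different column sets.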
4. Parquet Export: server_metrics_export.parquet
Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.
Schema overview:
See Parquet Schema Reference for complete schema, metadata, and query examples.
Related documentation:
- JSON Schema Reference - Complete JSON export format specification
- Server Metrics Reference - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo)
- Parquet Schema Reference - Raw time-series data schema
Quick examples:
Statistics by Metric Type
Now that you understand the output formats, let’s examine how statistics are structured within each metric type.
Statistics are nested under a stats field within each series item. All metrics use the stats format for consistent API access.
Gauge (point-in-time values)
Statistics: avg, min, max, std, p1, p5, p10, p25, p50, p75, p90, p95, p99
Gauge percentiles are computed from actual collected samples (not estimated from buckets).
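To make "computed from actual collected samples" concrete, the sketch below derives the listed gauge statistics directly from a list of sample values. It assumes linear interpolation between ranks for percentiles and a population standard deviation; the method AIPerf actually uses is not stated here, so both choices are assumptions.

```python
import math
import statistics
from typing import Dict, Sequence

PERCENTILES = (1, 5, 10, 25, 50, 75, 90, 95, 99)


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Percentile by linear interpolation between closest ranks (assumed method)."""
    rank = (p / 100) * (len(sorted_values) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def gauge_stats(samples: Sequence[float]) -> Dict[str, float]:
    """Compute gauge statistics from raw samples; requires at least one sample."""
    values = sorted(samples)
    stats = {
        "avg": statistics.fmean(values),
        "min": values[0],
        "max": values[-1],
        "std": statistics.pstdev(values),  # ASSUMPTION: population std
    }
    for p in PERCENTILES:
        stats[f"p{p}"] = percentile(values, p)
    return stats


stats = gauge_stats([10, 20, 30, 40, 50])
# stats["p50"] is 30.0 and stats["avg"] is 30.0 for this input
```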
Counter (cumulative totals)
Statistics: total, rate, and when --slice-duration is set: rate_avg, rate_min, rate_max, rate_std
- total: Change during profiling period (uses last pre-profiling sample as reference)
- rate: Increase per second (total/duration)
- Counter resets are detected and handled (negative deltas → total = 0)
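The counter rules above reduce to a few lines of arithmetic. This is a sketch of the stated behavior only: the function name counter_stats and its parameters are hypothetical, and clamping the rate to 0 for a non-positive duration is an added assumption.

```python
from typing import Dict


def counter_stats(reference_value: float, final_value: float,
                  duration_s: float) -> Dict[str, float]:
    """Compute total and rate for a counter over the profiling period.

    reference_value is the last sample taken before profiling started;
    final_value is the last sample of the profiling period.
    """
    delta = final_value - reference_value
    # A negative delta indicates a counter reset; the total is reported as 0.
    total = delta if delta >= 0 else 0.0
    # ASSUMPTION: a non-positive duration yields a rate of 0 instead of an error.
    rate = total / duration_s if duration_s > 0 else 0.0
    return {"total": total, "rate": rate}


normal = counter_stats(reference_value=100, final_value=400, duration_s=60)
# normal == {"total": 300, "rate": 5.0}
after_reset = counter_stats(reference_value=100, final_value=20, duration_s=60)
# after_reset == {"total": 0.0, "rate": 0.0}
```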
Histogram (distributions)
Statistics (stats): count, count_rate, sum, sum_rate, avg, p1_estimate, p5_estimate, p10_estimate, p25_estimate, p50_estimate, p75_estimate, p90_estimate, p95_estimate, p99_estimate
Series-level field: buckets (per-bucket delta counts, not cumulative)
- avg (sum/count) is exact
- Percentiles are estimates from bucket interpolation
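Bucket interpolation can be sketched in the style of Prometheus's histogram_quantile: find the bucket that contains the target rank, then interpolate linearly inside it. This is not necessarily the exact estimator AIPerf uses. The sketch assumes the input counts are cumulative across le boundaries (as in the exposition format) and that the lowest bucket starts at 0; per-bucket counts would need a running sum first.

```python
from typing import Dict, Optional


def estimate_percentile(cumulative_buckets: Dict[str, float],
                        q: float) -> Optional[float]:
    """Estimate the q-quantile (0..1) from cumulative le -> count buckets."""
    buckets = sorted((float(le), count) for le, count in cumulative_buckets.items())
    if not buckets or buckets[-1][1] <= 0:
        return None  # no observations in the period
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0  # ASSUMPTION: lowest bucket starts at 0
    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                # The +Inf bucket has no upper edge; fall back to the last finite bound.
                return prev_bound
            if count == prev_count:
                return upper
            # Linear interpolation within the bucket holding the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


example = {"0.1": 10, "0.5": 60, "1.0": 90, "+Inf": 100}
p50 = estimate_percentile(example, 0.50)  # about 0.42
p95 = estimate_percentile(example, 0.95)  # 1.0, the highest finite bound
```

The p95 case shows why these values are labeled as estimates: once the rank lands in the +Inf bucket, the result is capped at the highest finite bucket boundary.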
Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Major LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo) use Histograms instead, which allow period-specific percentile estimation.
Timesliced Statistics
When configured with --slice-duration, AIPerf computes windowed statistics over fixed time intervals. Each series includes a timeslices array with per-window statistics:
- Gauges: Each timeslice contains avg, min, max
- Counters: Each timeslice contains total, rate
- Histograms: Each timeslice contains count, sum, avg, buckets
Partial timeslices (at the end of the collection period) are marked with is_complete: false and excluded from aggregate statistics (e.g., rate_avg, rate_min) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.
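The exclusion of partial slices can be illustrated with a counter's per-slice rates. The sketch assumes that a slice without an is_complete field counts as complete and that rate_std is a population standard deviation; the helper name and the sample values are made up.

```python
import statistics
from typing import Dict, List, Optional


def aggregate_rates(timeslices: List[dict]) -> Optional[Dict[str, float]]:
    """Aggregate per-slice counter rates, skipping partial slices."""
    # ASSUMPTION: a missing is_complete field means the slice is complete.
    rates = [s["rate"] for s in timeslices if s.get("is_complete", True)]
    if not rates:
        return None
    return {
        "rate_avg": statistics.fmean(rates),
        "rate_min": min(rates),
        "rate_max": max(rates),
        "rate_std": statistics.pstdev(rates),  # ASSUMPTION: population std
    }


slices = [
    {"total": 40, "rate": 4.0},
    {"total": 60, "rate": 6.0},
    {"total": 3, "rate": 1.0, "is_complete": False},  # short final window
]
aggregated = aggregate_rates(slices)
# rate_avg is 5.0; the partial slice's rate of 1.0 does not drag it down
```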
Labeled Metrics
Prometheus metrics with labels (e.g., model, status) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by endpoint_url.
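One way to picture "aggregated separately for each unique label combination" is a grouping key built from the endpoint and the sorted label pairs. The sketch below is illustrative only; the flat sample layout and the helper name are assumptions rather than AIPerf's internal representation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def group_by_series(samples: List[dict]) -> Dict[SeriesKey, List[float]]:
    """Group sample values by (endpoint_url, label combination)."""
    series: Dict[SeriesKey, List[float]] = defaultdict(list)
    for sample in samples:
        # Sorting the label pairs makes the key independent of label order.
        labels = tuple(sorted(sample.get("labels", {}).items()))
        series[(sample["endpoint_url"], labels)].append(sample["value"])
    return dict(series)


# Made-up samples: two status values on one endpoint, one on another.
samples = [
    {"endpoint_url": "http://a:8000/metrics", "labels": {"model": "m", "status": "ok"}, "value": 1.0},
    {"endpoint_url": "http://a:8000/metrics", "labels": {"status": "ok", "model": "m"}, "value": 2.0},
    {"endpoint_url": "http://a:8000/metrics", "labels": {"model": "m", "status": "error"}, "value": 5.0},
    {"endpoint_url": "http://b:8000/metrics", "labels": {"model": "m", "status": "ok"}, "value": 9.0},
]
grouped = group_by_series(samples)
```

Here the first two samples land in the same series despite the different label order, while the differing status value and the second endpoint each produce their own series.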
Unit Inference
AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (_seconds, _bytes, _requests, etc.). Units appear in both JSON and CSV exports. The unit field is optional—if no unit can be inferred, it’s omitted.
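A name-only version of suffix-based inference might look like the sketch below. The suffix table covers just the three conventions named above, the _total handling follows the usual Prometheus counter naming, and inference from descriptions is left out. The helper name and the example metric names are hypothetical.

```python
from typing import Optional

# Only the suffixes mentioned in this section; a real table is likely longer.
SUFFIX_UNITS = {
    "_seconds": "seconds",
    "_bytes": "bytes",
    "_requests": "requests",
}


def infer_unit(metric_name: str) -> Optional[str]:
    """Infer a unit from a metric name, or return None if nothing matches."""
    name = metric_name
    # Counters conventionally end in _total after the unit (e.g. *_bytes_total).
    if name.endswith("_total"):
        name = name[: -len("_total")]
    for suffix, unit in SUFFIX_UNITS.items():
        if name.endswith(suffix):
            return unit
    return None  # the unit field would simply be omitted


unit_latency = infer_unit("example_request_latency_seconds")  # "seconds"
unit_traffic = infer_unit("example_network_bytes_total")      # "bytes"
unit_unknown = infer_unit("example_queue_depth")              # None
```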