AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers and serving frontends (vLLM, SGLang, TRT-LLM, Dynamo, Triton, etc.).
Key metrics by server:
TRT-LLM server-side setup is required. Unlike vLLM and SGLang, trtllm-serve does not expose Prometheus exposition format at /metrics by default — the default /metrics returns an iteration-stats JSON array (application/json), which is not parseable as Prometheus. Two consequences:
- Set return_perf_metrics: true in your extra_llm_api_options.yaml. This mounts the proper Prometheus exposition at /prometheus/metrics (a non-standard path). Add enable_iter_perf_stats: true when you want iteration-derived queue/KV/memory metrics from the PyTorch backend.
- When AIPerf scrapes /metrics and gets application/json, it automatically probes <base>/prometheus/metrics once. If the alt path serves Prometheus, AIPerf swaps the URL and continues, with no manual override needed. If the alt path also fails (e.g. return_perf_metrics was not set), the collector auto-disables for the remainder of the run with a single warning.

Example extra_llm_api_options.yaml snippet:
Triton serves Prometheus metrics at http://localhost:8002/metrics by default, not on the inference HTTP port. Use --server-metrics http://HOST:8002/metrics when the inference URL and metrics URL differ. Triton latency summaries are ignored by AIPerf; enable --metrics-config histogram_latencies=true for first-response histogram percentiles.
Server metrics are collected by default - just run AIPerf normally:
AIPerf automatically:
- Scrapes the /metrics endpoint on your inference server (base URL + /metrics)
- Writes the following export files:
  - server_metrics_export.json - Aggregated statistics (profiling period only)
  - server_metrics_export.csv - Tabular format (profiling period only)
  - server_metrics_export.parquet - Raw time-series with delta calculations
  - server_metrics_export.jsonl - Time-series data (all scrapes, opt-in only)

Custom file naming: The --profile-export-prefix (or --profile-export-file) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:
Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.
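Because the JSONL file keeps warmup scrapes, a reader who wants only profiling-phase samples has to filter it. The sketch below shows one way to do that, assuming each line carries the timestamp_ns field described under the JSONL format. The profiling_start_ns cutoff is a hypothetical value the caller must supply, and the empty metrics objects in the sample lines are placeholders rather than real export content.

```python
import json
from typing import Iterable, Iterator


def profiling_scrapes(lines: Iterable[str], profiling_start_ns: int) -> Iterator[dict]:
    """Yield JSONL scrape records collected at or after profiling_start_ns."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["timestamp_ns"] >= profiling_start_ns:
            yield record


# Two hand-written records standing in for lines of the JSONL export.
sample_lines = [
    '{"endpoint_url": "http://localhost:8000/metrics", "timestamp_ns": 1000, '
    '"endpoint_latency_ns": 500, "metrics": {}}',
    '{"endpoint_url": "http://localhost:8000/metrics", "timestamp_ns": 2000, '
    '"endpoint_latency_ns": 700, "metrics": {}}',
]

# profiling_start_ns=1500 is a made-up cutoff for illustration.
kept = list(profiling_scrapes(sample_lines, profiling_start_ns=1500))
```

With a real export, the same function can be fed an open file handle instead of the in-memory list.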
Format selection: By default, JSON, CSV, and Parquet formats are generated (JSONL is opt-in to avoid large files). To opt out of Parquet, or to include JSONL for time-series analysis:
AIPerf scrapes /metrics at ~3 Hz and parses the response as Prometheus exposition format. When a server speaks something else at that path (most commonly TRT-LLM, which serves an iteration-stats JSON array), AIPerf does not retry-and-spam — it detects the mismatch on the first scrape and disables collection for that endpoint with a single log line. This avoids the failure mode where parse errors at the scrape interval inflate run time by 10×+.
Detection. A response is treated as non-Prometheus when either:
- Content-Type is application/json (the response body is never read in this case, since rejecting on the header is cheaper than parsing); or
- the body fails to parse as Prometheus text (prometheus_client.parser.text_string_to_metric_families raises ValueError, e.g. a server returns text/plain with garbage, or a JSON body without a content-type).

TRT-LLM /prometheus/metrics fallback. Before disabling, AIPerf probes <base>/prometheus/metrics exactly once: TRT-LLM mounts the proper Prometheus path there when launched with return_perf_metrics: true (see the TRT-LLM entry in the Quick Reference table above). If the probe succeeds, the collector swaps its URL there and the run continues with the alt endpoint. The probe is attempted whenever the configured URL ends with /metrics and is not already /prometheus/metrics itself, so /metrics, /v1/metrics, and /api/metrics all trigger the fallback probe. URLs that don’t end in /metrics (e.g. /stats, /telemetry) are left untouched, and /prometheus/metrics is excluded to avoid probing the same path it would swap to.
On auto-disable. A single WARNING is emitted naming the endpoint and the suppression flag. Subsequent scrape cycles short-circuit, the collector emits no further log noise, and the rest of the benchmark proceeds normally — other configured endpoints (DCGM telemetry, additional --server-metrics URLs) are unaffected.
To suppress the warning entirely, pass --no-server-metrics — collection is skipped, no probe is attempted, no warning is logged.
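The detection and fallback rules in this section can be sketched in a few lines. This is an illustration, not AIPerf's implementation: the function names are hypothetical, ignoring Content-Type parameters such as charset is an assumption, and so is forming the alternate URL by replacing the trailing /metrics segment (for nested paths like /v1/metrics, AIPerf's actual <base> may differ).

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

ALT_SUFFIX = "/prometheus/metrics"


def is_json_content_type(content_type: str) -> bool:
    """True when the header names application/json (parameters ignored; an assumption)."""
    return content_type.split(";")[0].strip().lower() == "application/json"


def fallback_probe_url(url: str) -> Optional[str]:
    """Return the alternate URL to probe once, or None when no probe applies."""
    parts = urlsplit(url)
    path = parts.path
    # Probe only when the path ends in /metrics and is not already the alt path.
    if not path.endswith("/metrics") or path.endswith(ALT_SUFFIX):
        return None
    # Assumption: swap the trailing /metrics segment for /prometheus/metrics.
    new_path = path[: -len("/metrics")] + ALT_SUFFIX
    return urlunsplit(parts._replace(path=new_path))


probe = fallback_probe_url("http://localhost:8000/metrics")
no_probe_stats = fallback_probe_url("http://localhost:8000/stats")
no_probe_alt = fallback_probe_url("http://localhost:8000/prometheus/metrics")
```

Here probe is the /prometheus/metrics URL on the same host, while both other calls return None, matching the exclusions listed above.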
The filenames below are defaults. When --profile-export-prefix <prefix> is used, server metrics files are named <prefix>_server_metrics.{json,csv,jsonl,parquet} (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (--artifact-dir / --output-artifact-dir, default: ./artifacts/<run_info>).
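As a rough illustration of the prefix rule, the sketch below derives the four server metrics filenames from a user-supplied prefix. The helper name is hypothetical, and stripping exactly one trailing extension with pathlib is an assumption about how the stripping behaves.

```python
from pathlib import PurePosixPath
from typing import Dict

FORMATS = ("json", "csv", "jsonl", "parquet")


def server_metrics_filenames(prefix: str) -> Dict[str, str]:
    """Map each export format to <prefix>_server_metrics.<format>."""
    # Assumption: only the final extension (e.g. ".json") is stripped.
    stem = str(PurePosixPath(prefix).with_suffix(""))
    return {fmt: f"{stem}_server_metrics.{fmt}" for fmt in FORMATS}


names = server_metrics_filenames("my_run.json")
```

Under these assumptions, a prefix of my_run.json and a prefix of my_run produce the same four filenames.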
server_metrics_export.jsonl

Line-delimited JSON with metrics snapshots over time:
Fields:
- endpoint_url: Source Prometheus endpoint
- timestamp_ns: Collection timestamp in nanoseconds
- endpoint_latency_ns: HTTP round-trip time in nanoseconds
- metrics: All metrics from this endpoint
  - Gauges and counters: {"value": N} or {"labels": {...}, "value": N}
  - Histograms: {"buckets": {"le": count}, "sum": N, "count": N} with optional labels

server_metrics_export.json

Aggregated statistics from the profiling period. Metrics from all endpoints are merged, with each series tagged with endpoint_url.
Query with jq:
server_metrics_export.csv

Tabular export organized in five sections (separated by blank lines): gauge, counter, histogram, unknown, info. The unknown section holds families that the Prometheus server declared as # TYPE foo untyped (or with no # TYPE line at all); they use the same statistics columns as gauges.
server_metrics_export.parquet

Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.
Schema overview:
See Parquet Schema Reference for complete schema, metadata, and query examples.
Related documentation:
Quick examples:
Now that you understand the output formats, let’s examine how statistics are structured within each metric type.
Statistics are nested under a stats field within each series item. All metrics use the stats format for consistent API access.
Gauge statistics: avg, min, max, std, p1, p5, p10, p25, p50, p75, p90, p95, p99
Gauge percentiles are computed from actual collected samples (not estimated from buckets).
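To make "computed from actual collected samples" concrete, here is a minimal sketch that derives the gauge statistics from a list of scraped values. The interpolation method and the use of population standard deviation are assumptions; AIPerf's exact percentile and std definitions are not stated here and may differ.

```python
import statistics
from typing import Dict, List, Sequence

PERCENTILES = (1, 5, 10, 25, 50, 75, 90, 95, 99)


def percentile(ordered: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile of pre-sorted samples, q in [0, 100]."""
    rank = (len(ordered) - 1) * q / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def gauge_stats(samples: List[float]) -> Dict[str, float]:
    """Compute avg/min/max/std and p1..p99 directly from raw samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    stats = {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "std": statistics.pstdev(ordered),  # population std (assumption)
    }
    for q in PERCENTILES:
        stats[f"p{q}"] = percentile(ordered, q)
    return stats


stats = gauge_stats([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because every percentile is read off the sorted samples, no bucket boundaries are involved, which is the contrast with the histogram estimates.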
Counter statistics: total, rate, and when --slice-duration is set: rate_avg, rate_min, rate_max, rate_std
- total: Change during profiling period (uses last pre-profiling sample as reference)
- rate: Increase per second (total/duration)

Histogram statistics (stats): count, count_rate, sum, sum_rate, avg, p1_estimate, p5_estimate, p10_estimate, p25_estimate, p50_estimate, p75_estimate, p90_estimate, p95_estimate, p99_estimate
Series-level field: buckets (per-bucket delta counts, not cumulative)
avg (sum/count) is exact.

Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Use Histogram families for percentile estimation when the server offers them. Rare optional Summary families, such as SGLang’s sglang:eplb_balancedness, Triton’s nv_inference_*_summary_us, or Triton’s response-cache nv_cache_*_summary_per_model, are ignored by AIPerf exports.
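For the histogram statistics above, the sketch below shows how per-bucket delta counts can be derived from two cumulative Prometheus snapshots, why avg is exact, and how a percentile estimate can be interpolated from buckets. The interpolation follows the common Prometheus-style linear approach and is an assumption; AIPerf's estimator may differ. All function names and the sample numbers are hypothetical.

```python
import math
from typing import Dict, Optional


def bucket_deltas(start: Dict[str, int], end: Dict[str, int]) -> Dict[str, int]:
    """Turn two cumulative {le: count} snapshots into per-bucket (non-cumulative) deltas."""
    deltas: Dict[str, int] = {}
    prev_cumulative = 0
    for le in sorted(end, key=float):  # float("+Inf") parses as infinity
        cumulative = end[le] - start.get(le, 0)
        deltas[le] = cumulative - prev_cumulative
        prev_cumulative = cumulative
    return deltas


def estimate_percentile(deltas: Dict[str, int], q: float) -> Optional[float]:
    """Estimate the q-th percentile by linear interpolation inside the target bucket."""
    total = sum(deltas.values())
    if total == 0:
        return None
    target = total * q / 100
    seen, lower = 0, 0.0
    for le in sorted(deltas, key=float):
        upper, in_bucket = float(le), deltas[le]
        if in_bucket and seen + in_bucket >= target:
            if math.isinf(upper):
                return lower  # cannot interpolate into the +Inf bucket
            return lower + (upper - lower) * (target - seen) / in_bucket
        seen += in_bucket
        lower = upper
    return None


# Hypothetical snapshots taken just before and at the end of profiling.
start = {"0.1": 10, "0.5": 20, "+Inf": 20}
end = {"0.1": 30, "0.5": 60, "+Inf": 60}
deltas = bucket_deltas(start, end)

count = sum(deltas.values())
avg = (10.0 - 2.0) / count  # sum delta / count delta: exact, no bucket estimate
p75 = estimate_percentile(deltas, 75)
```

The avg depends only on the sum and count deltas, so it is exact, while the percentile can only be as precise as the bucket boundaries allow.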
When configured with --slice-duration, AIPerf computes windowed statistics over fixed time intervals. Each series includes a timeslices array with per-window statistics:
- Gauge: avg, min, max
- Counter: total, rate
- Histogram: count, sum, avg, buckets

Partial timeslices (at the end of the collection period) are marked with is_complete: false and excluded from aggregate statistics (e.g., rate_avg, rate_min) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.
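The handling of partial timeslices can be illustrated with a small sketch for counters. It cuts a collection period into fixed windows, marks the trailing partial window, and aggregates rates over complete windows only. The start_ns and end_ns field names and both helper names are hypothetical, and the per-window totals are made-up numbers; only is_complete, total, rate and the rate_* aggregates come from the text above.

```python
import statistics
from typing import Dict, List


def slice_windows(start_ns: int, end_ns: int, slice_ns: int) -> List[dict]:
    """Cut [start_ns, end_ns) into fixed windows; the last one may be partial."""
    windows = []
    t = start_ns
    while t < end_ns:
        stop = min(t + slice_ns, end_ns)
        windows.append(
            {"start_ns": t, "end_ns": stop, "is_complete": stop - t == slice_ns}
        )
        t = stop
    return windows


def aggregate_rates(timeslices: List[dict]) -> Dict[str, float]:
    """Aggregate per-window rates, skipping partial (is_complete: false) windows."""
    rates = [s["rate"] for s in timeslices if s["is_complete"]]
    if not rates:
        return {}
    return {
        "rate_avg": statistics.fmean(rates),
        "rate_min": min(rates),
        "rate_max": max(rates),
        "rate_std": statistics.pstdev(rates),  # population std (assumption)
    }


# A 25 s collection period with 10 s slices yields two complete windows and one partial.
timeslices = slice_windows(0, 25_000_000_000, 10_000_000_000)
for window, total in zip(timeslices, [100, 200, 30]):  # made-up counter deltas
    duration_s = (window["end_ns"] - window["start_ns"]) / 1e9
    window["total"] = total
    window["rate"] = total / duration_s

aggregates = aggregate_rates(timeslices)
```

The partial window still appears in timeslices with its own total and rate, but it does not pull rate_min down in the aggregates.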
Prometheus metrics with labels (e.g., model, status) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by endpoint_url.
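One way to picture this grouping is to key each sample by its endpoint and its full label set, as in the sketch below. The flat sample layout and the function name are hypothetical and chosen only for illustration; they are not the export schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def group_series(samples: Iterable[dict]) -> Dict[SeriesKey, List[float]]:
    """Group values by (endpoint_url, sorted label pairs) so each combination is its own series."""
    groups: Dict[SeriesKey, List[float]] = defaultdict(list)
    for sample in samples:
        labels = tuple(sorted(sample.get("labels", {}).items()))
        groups[(sample["endpoint_url"], labels)].append(sample["value"])
    return dict(groups)


# Made-up samples: two label combinations on one endpoint.
samples = [
    {"endpoint_url": "http://a:8000/metrics", "labels": {"model": "m1", "status": "ok"}, "value": 1.0},
    {"endpoint_url": "http://a:8000/metrics", "labels": {"status": "ok", "model": "m1"}, "value": 2.0},
    {"endpoint_url": "http://a:8000/metrics", "labels": {"model": "m1", "status": "error"}, "value": 5.0},
]
series = group_series(samples)
```

Sorting the label pairs makes the key independent of label order, so the first two samples land in the same series.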
AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (_seconds, _bytes, _requests, etc.). Units appear in both JSON and CSV exports. The unit field is optional—if no unit can be inferred, it’s omitted.
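A simplified sketch of name-based unit inference follows. It covers only the three suffixes named above and looks at metric names, not descriptions. The unit strings, the stripping of Prometheus trailers such as _total, and the function name are assumptions rather than AIPerf's actual rules.

```python
from typing import Optional

# Only the suffixes named in the text; the real mapping is likely larger.
UNIT_SUFFIXES = {"_seconds": "seconds", "_bytes": "bytes", "_requests": "requests"}
# Common Prometheus trailers that follow the unit (stripping them is an assumption).
TRAILERS = ("_total", "_sum", "_count", "_bucket")


def infer_unit(metric_name: str) -> Optional[str]:
    """Return a unit inferred from the metric name, or None when nothing matches."""
    base = metric_name
    for trailer in TRAILERS:
        if base.endswith(trailer):
            base = base[: -len(trailer)]
            break
    for suffix, unit in UNIT_SUFFIXES.items():
        if base.endswith(suffix):
            return unit
    return None  # the unit field would be omitted in this case


unit_latency = infer_unit("request_latency_seconds")
unit_counter = infer_unit("network_received_bytes_total")
unit_missing = infer_unit("queue_depth")
```

Returning None rather than a placeholder string mirrors the optional unit field: when nothing can be inferred, the field is left out.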