# Server Metrics Collection
AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo, etc.).
## Quick Reference

Key metrics by server:

- vLLM
- Dynamo
- SGLang
- TRT-LLM
## Quick Start
Server metrics are collected by default - just run AIPerf normally:
AIPerf automatically:

- Discovers the `/metrics` endpoint on your inference server (base URL + `/metrics`)
- Tests endpoint reachability before profiling starts
- Captures baseline metrics before the warmup period begins (reference point for deltas)
- Collects metrics at configurable intervals during warmup and profiling
- Performs a final scrape after profiling completes (captures end state)
- Exports the selected formats (default: JSON + CSV):
  - `server_metrics_export.json` - Aggregated statistics (profiling period only)
  - `server_metrics_export.csv` - Tabular format (profiling period only)
  - `server_metrics_export.jsonl` - Time-series data (all scrapes, opt-in only)
  - `server_metrics_export.parquet` - Raw time-series with delta calculations (opt-in only)
Custom file naming: The `--profile-export-prefix` (or `--profile-export-file`) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:
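The extension-stripping rule can be sketched with a small hypothetical helper (this mirrors the documented behavior, not AIPerf's actual code):

```python
from pathlib import Path

def strip_extension(prefix: str) -> str:
    """Hypothetical helper mirroring the documented behavior:
    any file extension on the supplied prefix is dropped."""
    p = Path(prefix)
    return str(p.with_suffix("")) if p.suffix else prefix

# Both spellings yield the same server metrics filenames:
for prefix in ("my_run.json", "my_run"):
    print(f"{strip_extension(prefix)}_server_metrics.json")
```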
Time filtering: Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.
Format selection: By default, only JSON and CSV formats are generated to avoid large JSONL files. To include JSONL for time-series analysis:
### Adding Custom Endpoints

### Disabling Server Metrics

### Selecting Output Formats
## Configuration

## Output Files
The filenames below are defaults. When `--profile-export-prefix <prefix>` is used, server metrics files are named `<prefix>_server_metrics.{json,csv,jsonl,parquet}` (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (`--artifact-directory`, default: `./artifacts/<run_info>`).
### 1. Time-Series: `server_metrics_export.jsonl`
Line-delimited JSON with metrics snapshots over time:
Fields:

- `endpoint_url`: Source Prometheus endpoint
- `timestamp_ns`: Collection timestamp in nanoseconds
- `endpoint_latency_ns`: HTTP round-trip time in nanoseconds
- `metrics`: All metrics from this endpoint
  - Counter/Gauge: `{"value": N}` or `{"labels": {...}, "value": N}`
  - Histogram: `{"buckets": {"le": count}, "sum": N, "count": N}` with optional labels
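A record can be parsed line by line with nothing more than the `json` module; a sketch using a made-up sample record (the metric names are illustrative):

```python
import json

# Illustrative JSONL record shaped like the fields above; metric
# names and values are examples, not an exact export.
sample_line = json.dumps({
    "endpoint_url": "http://localhost:8000/metrics",
    "timestamp_ns": 1_700_000_000_000_000_000,
    "endpoint_latency_ns": 2_500_000,
    "metrics": {
        "num_requests_running": {"value": 4},
        "request_latency_seconds": {
            "buckets": {"0.5": 10, "1.0": 25, "+Inf": 30},
            "sum": 21.7,
            "count": 30,
        },
    },
})

# Each line of the .jsonl file parses independently.
record = json.loads(sample_line)
latency_ms = record["endpoint_latency_ns"] / 1e6
print(f"{record['endpoint_url']}: scrape took {latency_ms:.2f} ms")
```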
### 2. Aggregated Statistics: `server_metrics_export.json`

Aggregated statistics from the profiling period. Metrics from all endpoints are merged, with each series tagged with `endpoint_url`.
Query with jq:
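Where jq isn't available, the same kind of lookup can be sketched in Python. The export shape below is inferred from the field descriptions in this document, so treat it as an approximation rather than the exact schema:

```python
import json

# Assumed shape: a series list per metric, each series tagged with
# its endpoint_url and carrying a nested "stats" object.
export = json.loads("""
{
  "num_requests_running": {
    "series": [
      {
        "endpoint_url": "http://localhost:8000/metrics",
        "stats": {"avg": 3.2, "max": 8.0, "p99": 7.0}
      }
    ]
  }
}
""")

for name, metric in export.items():
    for series in metric["series"]:
        print(name, series["endpoint_url"], series["stats"]["avg"])
```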
### 3. CSV Export: `server_metrics_export.csv`
Tabular export organized in four sections (separated by blank lines): gauge, counter, histogram, info.
- Labels expanded into individual columns for easy filtering/pivoting
- Open directly in Excel/Sheets or load with pandas
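Because of the blank-line-separated sections, the file needs one extra split before loading; a sketch using the standard library (the column names are illustrative, not the exact export schema):

```python
import csv
import io

# Inline stand-in for server_metrics_export.csv: sections separated
# by a blank line, each with its own header row.
csv_text = """\
metric,endpoint_url,avg,max
gpu_cache_usage_perc,http://localhost:8000/metrics,0.41,0.92

metric,endpoint_url,total,rate
request_success_total,http://localhost:8000/metrics,1500,25.0
"""

# Parse each section independently, since the headers differ.
sections = [s for s in csv_text.split("\n\n") if s.strip()]
tables = [list(csv.DictReader(io.StringIO(s))) for s in sections]
print(tables[1][0]["total"])
```

The same split works before handing each section to `pandas.read_csv` via `io.StringIO`.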
### 4. Parquet Export: `server_metrics_export.parquet`
Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.
Schema overview:
See Parquet Schema Reference for complete schema, metadata, and query examples.
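To make the normalized idea concrete, here is a plain-Python picture of the row layout (column names are assumptions for illustration, not the published schema):

```python
# One row per histogram bucket per scrape, with each label ("model")
# promoted to its own column.
rows = [
    {"timestamp_ns": 1, "metric": "request_latency_seconds",
     "model": "llama-3", "le": "0.5", "bucket_count": 10},
    {"timestamp_ns": 1, "metric": "request_latency_seconds",
     "model": "llama-3", "le": "1.0", "bucket_count": 25},
    {"timestamp_ns": 1, "metric": "request_latency_seconds",
     "model": "llama-3", "le": "+Inf", "bucket_count": 30},
]

# Label columns allow SQL-style filtering without unpacking
# nested structures.
total = sum(r["bucket_count"] for r in rows
            if r["model"] == "llama-3" and r["le"] == "+Inf")
print(total)
```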
Related documentation:
- JSON Schema Reference - Complete JSON export format specification
- Server Metrics Reference - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo)
- Parquet Schema Reference - Raw time-series data schema
Quick examples:
## Statistics by Metric Type
Now that you understand the output formats, let’s examine how statistics are structured within each metric type.
Statistics are nested under a `stats` field within each series item. All metrics use the `stats` format for consistent API access.
### Gauge (point-in-time values)

Statistics: `avg`, `min`, `max`, `std`, `p1`, `p5`, `p10`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99`
Gauge percentiles are computed from actual collected samples (not estimated from buckets).
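The sample-based computation can be reproduced with the standard library; a sketch with made-up sample values:

```python
from statistics import mean, quantiles

# Gauge samples as collected at each scrape (values illustrative).
samples = [0.12, 0.18, 0.25, 0.31, 0.29, 0.44, 0.38, 0.27, 0.33, 0.41]

# Percentiles come from the raw samples themselves, not from
# histogram-bucket estimation.
cuts = quantiles(samples, n=100)   # 99 cut points: cuts[49] is p50
p50, p90 = cuts[49], cuts[89]
print(round(mean(samples), 3), round(p50, 2), round(p90, 3))
```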
### Counter (cumulative totals)

Statistics: `total`, `rate`, and, when `--slice-duration` is set: `rate_avg`, `rate_min`, `rate_max`, `rate_std`

- `total`: Change during the profiling period (uses the last pre-profiling sample as the reference)
- `rate`: Increase per second (total / duration)
- Counter resets are detected and handled (negative deltas → `total` = 0)
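The reset rule can be sketched as a clamp on the period delta (a simplified model of the documented behavior, not AIPerf's actual code):

```python
def counter_total(baseline: float, final: float) -> float:
    """Delta over the profiling period, clamping counter resets.

    A negative delta means the server restarted and the counter
    reset, so the total is reported as 0 (per the rule above).
    """
    delta = final - baseline
    return delta if delta >= 0 else 0.0

duration_s = 60.0
total = counter_total(baseline=1200, final=2700)  # 1500 requests
rate = total / duration_s                         # 25.0 req/s
print(total, rate)
assert counter_total(baseline=1200, final=300) == 0.0  # reset detected
```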
### Histogram (distributions)

Statistics (`stats`): `count`, `count_rate`, `sum`, `sum_rate`, `avg`, `p1_estimate`, `p5_estimate`, `p10_estimate`, `p25_estimate`, `p50_estimate`, `p75_estimate`, `p90_estimate`, `p95_estimate`, `p99_estimate`

Series-level field: `buckets` (per-bucket delta counts, not cumulative)

- `avg` (sum/count) is exact
- Percentiles are estimates from bucket interpolation
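Bucket interpolation follows the same idea as PromQL's `histogram_quantile`: locate the bucket containing the target rank, then interpolate linearly within it. A simplified sketch over cumulative bucket counts (not AIPerf's implementation):

```python
def estimate_percentile(buckets: list[tuple[float, float]], q: float) -> float:
    """Linear interpolation inside cumulative buckets.

    buckets: sorted (upper_bound, cumulative_count) pairs; q in (0, 1).
    """
    total = buckets[-1][1]
    rank = q * total
    lo_bound, lo_count = 0.0, 0.0
    for upper, count in buckets:
        if rank <= count:
            frac = (rank - lo_count) / (count - lo_count)
            return lo_bound + frac * (upper - lo_bound)
        lo_bound, lo_count = upper, count
    return buckets[-1][0]

# 30 observations: 10 under 0.5s, 25 under 1.0s, all under 2.0s.
buckets = [(0.5, 10), (1.0, 25), (2.0, 30)]
print(estimate_percentile(buckets, 0.50))  # p50 estimate
```

This is why histogram percentiles carry the `_estimate` suffix: the true values inside each bucket are unknown, so the result is only as fine-grained as the bucket boundaries.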
Prometheus Summary metrics are not supported. Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Major LLM inference servers (vLLM, SGLang, TRT-LLM, Dynamo) use Histograms instead, which allow period-specific percentile estimation.
## Timesliced Statistics

When configured with `--slice-duration`, AIPerf computes windowed statistics over fixed time intervals. Each series includes a `timeslices` array with per-window statistics:
- Gauges: Each timeslice contains `avg`, `min`, `max`
- Counters: Each timeslice contains `total`, `rate`
- Histograms: Each timeslice contains `count`, `sum`, `avg`, `buckets`
Partial timeslices (at the end of the collection period) are marked with `is_complete: false` and excluded from aggregate statistics (e.g., `rate_avg`, `rate_min`) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for completeness.
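The windowing rule can be sketched for a single gauge series (a simplified model of the behavior described above, not AIPerf's implementation):

```python
from statistics import mean

def timeslice_gauge(samples, slice_duration_s, collection_end_s):
    """Group (t_seconds, value) samples into fixed windows.

    The final window is flagged is_complete=False when the collection
    period ends before the window does, matching the rule above.
    """
    slices = {}
    for t, v in samples:
        slices.setdefault(int(t // slice_duration_s), []).append(v)
    out = []
    for idx in sorted(slices):
        window_end = (idx + 1) * slice_duration_s
        out.append({
            "avg": mean(slices[idx]),
            "min": min(slices[idx]),
            "max": max(slices[idx]),
            "is_complete": window_end <= collection_end_s,
        })
    return out

samples = [(1, 2.0), (5, 4.0), (11, 6.0), (21, 8.0)]
result = timeslice_gauge(samples, slice_duration_s=10, collection_end_s=25)
print([s["is_complete"] for s in result])
```

In this sketch, only the first two windows would feed aggregates like `rate_avg`; the trailing partial window is reported but excluded.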
## Labeled Metrics

Prometheus metrics with labels (e.g., `model`, `status`) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together, with each tagged with `endpoint_url`.
## Unit Inference

AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (`_seconds`, `_bytes`, `_requests`, etc.). Units appear in both JSON and CSV exports. The `unit` field is optional; if no unit can be inferred, it is omitted.
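A suffix-driven inference pass might look like the following sketch (the suffix table is illustrative, not AIPerf's full rule set):

```python
# Illustrative suffix → unit table based on Prometheus naming
# conventions; the real rules also consult metric descriptions.
SUFFIX_UNITS = {
    "_seconds": "seconds",
    "_bytes": "bytes",
    "_requests": "requests",
    "_tokens": "tokens",
}

def infer_unit(metric_name: str):
    """Return an inferred unit, or None when nothing matches."""
    base = metric_name.removesuffix("_total")  # counters end in _total
    for suffix, unit in SUFFIX_UNITS.items():
        if base.endswith(suffix):
            return unit
    return None  # the unit field is simply omitted in the export

print(infer_unit("request_latency_seconds"))   # seconds
print(infer_unit("generated_tokens_total"))    # tokens
print(infer_unit("num_requests_running"))      # None
```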