> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/aiperf/llms.txt.
> For full documentation content, see https://docs.nvidia.com/aiperf/llms-full.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/aiperf/_mcp/server.

# Server Metrics Collection

AIPerf automatically collects metrics from Prometheus-compatible endpoints exposed by LLM inference servers and serving frontends (vLLM, SGLang, TRT-LLM, Dynamo, Triton, etc.).

## Quick Reference

| Feature | Description | Default |
|---------|-------------|---------|
| **Auto-discovery** | Automatically finds `/metrics` endpoint on server URL | Enabled |
| **Collection** | Scrapes metrics every 333ms during benchmark | Enabled |
| **Outputs** | JSON (aggregated), CSV (tabular), JSONL (time-series), Parquet (cumulative deltas) | JSON + CSV + Parquet |
| **Custom endpoints** | `--server-metrics URL [URL...]` for additional endpoints | None |
| **Disable** | `--no-server-metrics` to turn off collection | Enabled |
| **Windowed stats** | `--slice-duration SECONDS` for time-sliced analysis | Off |

**Key metrics by server:**

<Accordion title="vLLM">

| Metric | Type | What to Watch |
|--------|------|---------------|
| `vllm:num_requests_running` | gauge | Active batch size (`stats.avg`) |
| `vllm:num_requests_waiting` | gauge | Queue depth—growing = saturation (`stats.max`) |
| `vllm:num_requests_waiting_by_reason` | gauge | Queue depth split into `capacity` and `deferred` (`stats.max`) |
| `vllm:kv_cache_usage_perc` | gauge | >0.9 = capacity limit (`stats.max`) |
| `vllm:num_preemptions` | counter | >0 = memory pressure (`stats.total`) |
| `vllm:e2e_request_latency_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `vllm:time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `vllm:inter_token_latency_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `vllm:prompt_tokens_by_source` | counter | Prompt-token source mix (`source` label) |
| `vllm:generation_tokens` | counter | Decode throughput (`stats.rate`) |

</Accordion>

<Accordion title="Dynamo">

| Metric | Type | What to Watch |
|--------|------|---------------|
| `dynamo_frontend_active_requests` | gauge | HTTP handler active requests (`stats.avg`) |
| `dynamo_frontend_inflight_requests` | gauge | Engine-bound active requests (`stats.avg`) |
| `dynamo_frontend_queued_requests` | gauge | HTTP requests awaiting first token (`stats.avg`) |
| `dynamo_frontend_request_duration_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `dynamo_frontend_time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `dynamo_frontend_inter_token_latency_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `dynamo_frontend_requests` | counter | Completed request throughput (`stats.rate`) |
| `dynamo_frontend_output_tokens` | counter | Decode throughput (`stats.rate`) |
| `dynamo_component_gpu_cache_usage_percent` | gauge | Backend cache usage (`stats.max`) |

</Accordion>

<Accordion title="SGLang">

| Metric | Type | What to Watch |
|--------|------|---------------|
| `sglang:num_running_reqs` | gauge | Active batch size (`stats.avg`) |
| `sglang:num_queue_reqs` | gauge | Queue depth—growing = saturation (`stats.max`) |
| `sglang:token_usage` | gauge | >0.9 = capacity limit (`stats.max`) |
| `sglang:cache_hit_rate` | gauge | Prefix cache efficiency (`stats.avg`) |
| `sglang:gen_throughput` | gauge | Real-time tokens/s (`stats.avg`) |
| `sglang:time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `sglang:inter_token_latency_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `sglang:e2e_request_latency_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `sglang:queue_time_seconds` | histogram | Queue wait (`stats.p99_estimate`) |
| `sglang:prompt_tokens` | counter | Prefill throughput (`stats.rate`) |
| `sglang:generation_tokens` | counter | Decode throughput (`stats.rate`) |

</Accordion>

<Accordion title="TRT-LLM">

| Metric | Type | What to Watch |
|--------|------|---------------|
| `trtllm_e2e_request_latency_seconds` | histogram | E2E latency (`stats.p99_estimate`) |
| `trtllm_time_to_first_token_seconds` | histogram | TTFT (`stats.p99_estimate`) |
| `trtllm_time_per_output_token_seconds` | histogram | ITL (`stats.p99_estimate`) |
| `trtllm_request_queue_time_seconds` | histogram | Queue wait (`stats.p99_estimate`) |
| `trtllm_request_prefill_time_seconds` | histogram | Prefill duration (`stats.p99_estimate`) |
| `trtllm_request_decode_time_seconds` | histogram | Decode duration (`stats.p99_estimate`) |
| `trtllm_request_success` | counter | Completed requests (`stats.rate`) |
| `trtllm_prompt_tokens` | counter | Prefill throughput (`stats.rate`) |
| `trtllm_generation_tokens` | counter | Decode throughput (`stats.rate`) |
| `trtllm_num_requests_running` | gauge | Active requests (`stats.avg`) |
| `trtllm_num_requests_waiting` | gauge | Queued requests (`stats.max`) |
| `trtllm_kv_cache_utilization` | gauge | KV cache usage (`stats.max`) |
| `trtllm_kv_cache_hit_rate` | gauge | KV cache reuse efficiency (`stats.avg`) |

<Info>
**TRT-LLM server-side setup is required.** Unlike vLLM and SGLang, `trtllm-serve` does not expose Prometheus exposition format at `/metrics` by default — the default `/metrics` returns an iteration-stats JSON array (`application/json`), which is not parseable as Prometheus. Two consequences:

1. **Enable Prometheus on the server.** Pass `return_perf_metrics: true` in your `extra_llm_api_options.yaml`. This mounts the proper Prometheus exposition at `/prometheus/metrics` (a non-standard path). Add `enable_iter_perf_stats: true` when you want iteration-derived queue/KV/memory metrics from the PyTorch backend.
2. **AIPerf auto-detects and falls back.** When AIPerf hits `/metrics` and gets `application/json`, it automatically probes `<base>/prometheus/metrics` once. If the alt path serves Prometheus, AIPerf swaps the URL and continues — no manual override needed. If the alt path also fails (e.g. `return_perf_metrics` was not set), the collector auto-disables for the remainder of the run with a single warning.

Example `extra_llm_api_options.yaml` snippet:
```yaml
return_perf_metrics: true
enable_iter_perf_stats: true
```
</Info>

</Accordion>

<Accordion title="Triton Inference Server">

| Metric | Type | What to Watch |
|--------|------|---------------|
| `nv_inference_request_success` | counter | Successful request throughput (`stats.rate`) |
| `nv_inference_request_failure` | counter | Failed requests by `reason` (`stats.total`) |
| `nv_inference_count` | counter | Inference throughput and average batch size numerator (`stats.rate`) |
| `nv_inference_exec_count` | counter | Execution throughput and average batch size denominator (`stats.rate`) |
| `nv_inference_pending_request_count` | gauge | Backend queue depth (`stats.max`) |
| `nv_inference_request_duration_us` | counter | Cumulative E2E request time (`stats.total`, microseconds) |
| `nv_inference_queue_duration_us` | counter | Cumulative queue time (`stats.total`, microseconds) |
| `nv_inference_first_response_histogram_ms` | histogram | First-response latency when histogram latencies are enabled (`stats.p99_estimate`) |
| `nv_gpu_utilization` | gauge | GPU utilization (`stats.avg`) |
| `nv_gpu_memory_used_bytes` | gauge | GPU memory pressure (`stats.max`) |
| `nv_cache_num_hits_per_model` | counter | Response-cache hits (`stats.total`) |
| `nv_cache_num_misses_per_model` | counter | Response-cache misses (`stats.total`) |

Triton serves Prometheus metrics at `http://localhost:8002/metrics` by default, not on the inference HTTP port. Use `--server-metrics http://HOST:8002/metrics` when the inference URL and metrics URL differ. Triton latency summaries are ignored by AIPerf; enable `--metrics-config histogram_latencies=true` for first-response histogram percentiles.

</Accordion>

## Quick Start

Server metrics are **collected by default** - just run AIPerf normally:

```bash
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --url localhost:8000 \
    --concurrency 4 \
    --request-count 100
```

AIPerf automatically:
1. Discovers the `/metrics` endpoint on your inference server (base URL + `/metrics`)
2. Tests endpoint reachability before profiling starts
3. Captures baseline metrics before warmup period begins (reference point for deltas) — also where AIPerf first parses the response and validates it as Prometheus exposition format; see [Compatibility & auto-disable](#compatibility--auto-disable) for what happens when an endpoint returns non-Prometheus content
4. Collects metrics at configurable intervals during warmup and profiling
5. Performs final scrape after profiling completes (captures end state)
6. Exports selected formats (default: JSON + CSV + Parquet):
   - `server_metrics_export.json` - Aggregated statistics (profiling period only)
   - `server_metrics_export.csv` - Tabular format (profiling period only)
   - `server_metrics_export.parquet` - Raw time-series with delta calculations
   - `server_metrics_export.jsonl` - Time-series data (all scrapes, opt-in only)

<Note>
**Custom file naming:** The `--profile-export-prefix` (or `--profile-export-file`) flag changes the prefix for all export files, including server metrics. Any file extension is automatically stripped from the provided value. For example:
```bash
aiperf profile --model MODEL ... --profile-export-prefix my_benchmark
# Produces: my_benchmark_server_metrics.json, my_benchmark_server_metrics.csv, etc.

# --profile-export-file is an alias for --profile-export-prefix, so this is equivalent:
aiperf profile --model MODEL ... --profile-export-file my_benchmark.json
# Produces the same files (the .json extension is stripped automatically)
```
</Note>

**Time filtering:** Statistics in JSON/CSV exports exclude the warmup period, showing only metrics from the profiling phase. The JSONL file contains all scrapes (including warmup) for complete time-series analysis.

**Format selection:** By default, JSON, CSV, and Parquet formats are generated (JSONL is opt-in to avoid large files). To opt out of Parquet, or to include JSONL for time-series analysis:
```bash
# Disable Parquet (JSON + CSV only)
aiperf profile --model MODEL ... --server-metrics-formats json csv

# Add JSONL for raw time-series snapshots
aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl
```

### Adding Custom Endpoints

```bash
# Single endpoint
aiperf profile --model MODEL ... --server-metrics http://localhost:8081

# Multiple endpoints (distributed deployment)
aiperf profile --model MODEL ... --server-metrics \
    http://node1:8081 \
    http://node2:8081
```

### Disabling Server Metrics

```bash
aiperf profile --model MODEL ... --no-server-metrics
```

### Selecting Output Formats

```bash
# Default: JSON + CSV + Parquet
aiperf profile --model MODEL ...

# Opt out of Parquet (JSON + CSV only)
aiperf profile --model MODEL ... --server-metrics-formats json csv

# Add JSONL for raw time-series snapshots
aiperf profile --model MODEL ... --server-metrics-formats json csv parquet jsonl
```

| Format | Use Case | Size |
|--------|----------|------|
| **JSON/CSV** (default) | Summary statistics, CI/CD thresholds | Small |
| **Parquet** (default) | SQL queries, pandas/DuckDB analytics | Compressed |
| **JSONL** (opt-in) | Debugging, raw Prometheus snapshots | 10-100x larger |

## Compatibility & auto-disable

AIPerf scrapes `/metrics` at ~3 Hz and parses the response as Prometheus exposition format. When a server speaks something else at that path (most commonly TRT-LLM, which serves an iteration-stats JSON array), AIPerf does not retry-and-spam — it detects the mismatch on the first scrape and disables collection for that endpoint with a single log line. This avoids the failure mode where parse errors at the scrape interval inflate run time by 10×+.

**Detection.** A response is treated as non-Prometheus when either:
- the HTTP `Content-Type` is `application/json` (the response body is never read in this case — the rejection is cheaper than parsing); or
- the body fails to parse as Prometheus exposition format (`prometheus_client.parser.text_string_to_metric_families` raises `ValueError` — e.g. a server returns `text/plain` with garbage, or a JSON body without a content-type).

**TRT-LLM `/prometheus/metrics` fallback.** Before disabling, AIPerf probes `<base>/prometheus/metrics` exactly once — TRT-LLM mounts the proper Prometheus path there when launched with `return_perf_metrics: true` (see the TRT-LLM entry in the [Quick Reference table](#quick-reference) above). If the probe succeeds, the collector swaps its URL there and the run continues with the alt endpoint. The probe is attempted whenever the configured URL ends with `/metrics` and is not already `/prometheus/metrics` itself — so `/metrics`, `/v1/metrics`, and `/api/metrics` all trigger the fallback probe. URLs that don't end in `/metrics` (e.g. `/stats`, `/telemetry`) are left untouched, and `/prometheus/metrics` is excluded to avoid probing the same path it would swap to.

**On auto-disable.** A single `WARNING` is emitted naming the endpoint and the suppression flag. Subsequent scrape cycles short-circuit, the collector emits no further log noise, and the rest of the benchmark proceeds normally — other configured endpoints (DCGM telemetry, additional `--server-metrics` URLs) are unaffected.

```text
WARNING  Disabling server metrics collection for http://127.0.0.1:60000/metrics:
         endpoint 'http://127.0.0.1:60000/metrics' returned non-Prometheus
         content-type 'application/json'; expected text/plain (Prometheus
         exposition format). To suppress this warning, pass --no-server-metrics.
```

**To suppress the warning entirely**, pass `--no-server-metrics` — collection is skipped, no probe is attempted, no warning is logged.

## Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `AIPERF_SERVER_METRICS_COLLECTION_INTERVAL` | 0.333s | Collection frequency (333ms, ~3Hz) |
| `AIPERF_SERVER_METRICS_COLLECTION_FLUSH_PERIOD` | 2.0s | Wait time for final metrics after benchmark |
| `AIPERF_SERVER_METRICS_REACHABILITY_TIMEOUT` | 10s | Timeout for endpoint reachability tests |
| `AIPERF_SERVER_METRICS_EXPORT_BATCH_SIZE` | 100 | Batch size for JSONL writer |
| `AIPERF_SERVER_METRICS_SHUTDOWN_DELAY` | 5.0s | Shutdown delay for command response transmission |

## Output Files

<Note>
The filenames below are defaults. When `--profile-export-prefix <prefix>` is used, server metrics files are named `<prefix>_server_metrics.{json,csv,jsonl,parquet}` (any file extension in the prefix is stripped automatically). All files are written to the artifact directory (`--artifact-dir` / `--output-artifact-dir`, default: `./artifacts/<run_info>`).
</Note>

### 1. Time-Series: `server_metrics_export.jsonl`

Line-delimited JSON with metrics snapshots over time:

```json
{
  "endpoint_url": "http://localhost:8000/metrics",
  "timestamp_ns": 1763591215220757503,
  "endpoint_latency_ns": 719764167,
  "metrics": {
    "vllm:num_requests_running": [{"value": 12.0}],
    "vllm:kv_cache_usage_perc": [{"value": 0.72}],
    "vllm:request_success": [{"value": 1500.0}],
    "vllm:time_to_first_token_seconds": [{
      "buckets": {"0.01": 145.0, "0.1": 1498.0, "+Inf": 1500.0},
      "sum": 32.456,
      "count": 1500.0
    }]
  },
  "request_sent_ns": 1763591214500993336,
  "first_byte_ns": 1763591215220757503
}
```

**Fields:**
- `endpoint_url`: Source Prometheus endpoint
- `timestamp_ns`: Collection timestamp in nanoseconds
- `endpoint_latency_ns`: HTTP round-trip time in nanoseconds
- `metrics`: All metrics from this endpoint
  - Counter/Gauge: `{"value": N}` or `{"labels": {...}, "value": N}`
  - Histogram: `{"buckets": {"le": count}, "sum": N, "count": N}` with optional labels

### 2. Aggregated Statistics: `server_metrics_export.json`

Aggregated statistics from profiling period. Metrics from all endpoints are merged, each series tagged with `endpoint_url`.

```json
{
  "schema_version": "1.0",
  "aiperf_version": "0.3.0",
  "benchmark_id": "2900a136-3c1a-4520-adaa-5719822b729b",
  "summary": {
    "endpoints_configured": ["http://localhost:8000/metrics"],
    "endpoints_successful": ["http://localhost:8000/metrics"],
    "start_time": "2025-12-15T02:04:23.028529",
    "end_time": "2025-12-15T02:05:15.294690",
    "endpoint_info": {
      "http://localhost:8000/metrics": {
        "total_fetches": 157,
        "first_fetch_ns": 1765793061967310848,
        "last_fetch_ns": 1765793114960054143,
        "avg_fetch_latency_ms": 246.83,
        "unique_updates": 157,
        "first_update_ns": 1765793061967310848,
        "last_update_ns": 1765793114960054143,
        "duration_seconds": 52.99,
        "avg_update_interval_ms": 339.70,
        "median_update_interval_ms": 333.48
      }
    }
  },
  "metrics": {
    "vllm:kv_cache_usage_perc": {
      "type": "gauge",
      "description": "KV-cache usage. 1 means 100 percent usage.",
      "unit": "ratio",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "avg": 0.191, "min": 0.0, "max": 0.202, "std": 0.038,
          "p1": 0.003, "p5": 0.178, "p10": 0.191, "p25": 0.198,
          "p50": 0.202, "p75": 0.202, "p90": 0.202, "p95": 0.202, "p99": 0.202
        },
        "timeslices": [
          { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "avg": 0.107, "min": 0.0, "max": 0.191 },
          { "start_ns": 1765793068028529452, "end_ns": 1765793073028529452, "avg": 0.192, "min": 0.191, "max": 0.194 }
        ]
      }]
    },
    "vllm:request_success": {
      "type": "counter",
      "description": "Count of successfully processed requests.",
      "unit": "requests",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "finished_reason": "length", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "total": 19.0, "rate": 0.359,
          "rate_avg": 0.38, "rate_min": 0.0, "rate_max": 1.8, "rate_std": 0.751
        },
        "timeslices": [
          { "start_ns": 1765793063028529452, "end_ns": 1765793068028529452, "total": 0.0, "rate": 0.0 },
          { "start_ns": 1765793073028529452, "end_ns": 1765793078028529452, "total": 9.0, "rate": 1.8 }
        ]
      }]
    },
    "vllm:e2e_request_latency_seconds": {
      "type": "histogram",
      "description": "Histogram of e2e request latency in seconds.",
      "unit": "seconds",
      "series": [{
        "endpoint_url": "http://localhost:8000/metrics",
        "labels": { "engine": "0", "model_name": "Qwen/Qwen3-0.6B" },
        "stats": {
          "count": 19, "sum": 259.87, "avg": 13.68,
          "count_rate": 0.359, "sum_rate": 4.90,
          "p1_estimate": 2.25, "p5_estimate": 5.77, "p10_estimate": 8.26,
          "p25_estimate": 10.82, "p50_estimate": 13.75, "p75_estimate": 15.35,
          "p90_estimate": 17.24, "p95_estimate": 19.51, "p99_estimate": 31.77
        },
        "buckets": {
          "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 1, "5.0": 1,
          "10.0": 3, "15.0": 11, "20.0": 18, "30.0": 18, "+Inf": 19
        },
        "timeslices": [
          {
            "start_ns": 1765793063028529452, "end_ns": 1765793068028529452,
            "count": 0, "sum": 0.0, "avg": 0.0,
            "buckets": { "0.3": 0, "0.5": 0, "1.0": 0, "2.5": 0, "5.0": 0, "10.0": 0, "15.0": 0, "20.0": 0, "+Inf": 0 }
          }
        ]
      }]
    }
  },
  "input_config": {
    "models": ["Qwen/Qwen3-0.6B"],
    "endpoint": { "urls": ["http://localhost:8000"], "streaming": true },
    "datasets": [{ "name": "default", "type": "synthetic", "count": 30000 }],
    "phases": [
      { "name": "profiling", "type": "concurrency", "concurrency": 400, "requests": 30000 }
    ],
    "artifacts": { "slice_duration": 5.0 }
  }
}
```

Query with jq:
```bash
jq '.metrics["vllm:e2e_request_latency_seconds"].series[0].stats.p99_estimate' server_metrics_export.json
```

### 3. CSV Export: `server_metrics_export.csv`

Tabular export organized in five sections (separated by blank lines): **gauge**, **counter**, **histogram**, **unknown**, **info**. The **unknown** section holds families that the Prometheus server declared as `# TYPE foo untyped` (or with no `# TYPE` line at all); they use the same statistics columns as gauges.

- Labels expanded into individual columns for easy filtering/pivoting
- Open directly in Excel/Sheets or load with pandas

```python
from io import StringIO
import pandas as pd

with open("server_metrics_export.csv") as f:
    sections = [pd.read_csv(StringIO(s)) for s in f.read().strip().split('\n\n') if s.strip()]
```

### 4. Parquet Export: `server_metrics_export.parquet`

Raw time-series data with delta calculations applied. Uses a normalized schema (~50% smaller than wide format) where histogram buckets are separate rows. Each label becomes a column for SQL filtering.

**Schema overview:**

| Column | Type | Description |
|--------|------|-------------|
| `endpoint_url` | string | Source Prometheus endpoint |
| `metric_name` | string | Metric name |
| `metric_type` | string | `gauge`, `unknown`, `counter`, or `histogram` |
| `timestamp_ns` | int64 | Collection timestamp (nanoseconds) |
| `value` | float64 | Gauge/counter value (delta for counters) |
| `sum`, `count` | float64 | Histogram sum/count deltas |
| `bucket_le`, `bucket_count` | string, float64 | Histogram bucket bound and delta count |
| *(label columns)* | string | Dynamic columns from Prometheus labels |

See [Parquet Schema Reference](/aiperf/dev/server-metrics/ai-perf-server-metrics-parquet-export-schema) for complete schema, metadata, and query examples.

**Related documentation:**
- [JSON Schema Reference](/aiperf/dev/server-metrics/ai-perf-server-metrics-json-export-schema) - Complete JSON export format specification
- [Server Metrics Reference](/aiperf/dev/server-metrics/ai-perf-server-metrics-reference) - Metric definitions by backend (vLLM, SGLang, TRT-LLM, Dynamo, Triton)
- [Parquet Schema Reference](/aiperf/dev/server-metrics/ai-perf-server-metrics-parquet-export-schema) - Raw time-series data schema

**Quick examples:**

```bash
# DuckDB queries
duckdb -c "SELECT * FROM 'server_metrics_export.parquet' WHERE metric_name LIKE 'vllm:%' ORDER BY timestamp_ns"
duckdb -c "SELECT metric_name, AVG(value) FROM '*.parquet' WHERE metric_type='gauge' GROUP BY metric_name"

# Combine multiple runs (handles schema differences)
duckdb -c "SELECT * FROM read_parquet('artifacts/*/server_metrics_export.parquet', union_by_name=true)"
```

```python
import pandas as pd
df = pd.read_parquet('server_metrics_export.parquet')
df[df['metric_name'] == 'vllm:kv_cache_usage_perc'].plot(x='timestamp_ns', y='value')
```

---

## Statistics by Metric Type

Now that you understand the output formats, let's examine how statistics are structured within each metric type.

Statistics are nested under a `stats` field within each series item. All metrics use the `stats` format for consistent API access.

### Gauge (point-in-time values)

Statistics: `avg`, `min`, `max`, `std`, `p1`, `p5`, `p10`, `p25`, `p50`, `p75`, `p90`, `p95`, `p99`

Gauge percentiles are computed from **actual collected samples** (not estimated from buckets).

### Counter (cumulative totals)

Statistics: `total`, `rate`, and when `--slice-duration` is set: `rate_avg`, `rate_min`, `rate_max`, `rate_std`

- `total`: Change during profiling period (uses last pre-profiling sample as reference)
- `rate`: Increase per second (total/duration)
- Counter resets are detected and handled (negative deltas → total = 0)

### Histogram (distributions)

Statistics (`stats`): `count`, `count_rate`, `sum`, `sum_rate`, `avg`, `p1_estimate`, `p5_estimate`, `p10_estimate`, `p25_estimate`, `p50_estimate`, `p75_estimate`, `p90_estimate`, `p95_estimate`, `p99_estimate`

Series-level field: `buckets` (per-bucket delta counts, not cumulative)

- `avg` (sum/count) is **exact**
- Percentiles are **estimates** from bucket interpolation

<Note>
**Prometheus Summary metrics are not supported.** Summary quantiles are computed cumulatively over the entire server lifetime, making them unsuitable for benchmark-specific analysis. Use Histogram families for percentile estimation when the server offers them. Rare optional Summary families, such as SGLang's `sglang:eplb_balancedness`, Triton's `nv_inference_*_summary_us`, or Triton's response-cache `nv_cache_*_summary_per_model`, are ignored by AIPerf exports.
</Note>

## Timesliced Statistics

When configured with `--slice-duration`, AIPerf computes windowed statistics over fixed time intervals. Each series includes a `timeslices` array with per-window statistics:

```json
{
  "stats": { "avg": 25.5, "min": 0.0, "max": 50.0 },
  "timeslices": [
    { "start_ns": 1765615837721140145, "end_ns": 1765615839721140145, "avg": 22.9, "min": 0.0, "max": 42.0 },
    { "start_ns": 1765615839721140145, "end_ns": 1765615841721140145, "avg": 49.8, "min": 49.0, "max": 50.0 }
  ]
}
```

- **Gauges**: Each timeslice contains `avg`, `min`, `max`
- **Counters**: Each timeslice contains `total`, `rate`
- **Histograms**: Each timeslice contains `count`, `sum`, `avg`, `buckets`

Partial timeslices (at the end of the collection period) are marked with `is_complete: false` and excluded from aggregate statistics (e.g., `rate_avg`, `rate_min`) to ensure fair comparison. Individual timeslice data includes both complete and partial slices for data completeness.

---

## Labeled Metrics

Prometheus metrics with labels (e.g., `model`, `status`) are aggregated separately for each unique label combination. When collecting from multiple endpoints, series are merged together with each tagged by `endpoint_url`.

## Unit Inference

AIPerf automatically infers units from metric names and descriptions using standard Prometheus conventions (`_seconds`, `_bytes`, `_requests`, etc.). Units appear in both JSON and CSV exports. The `unit` field is optional—if no unit can be inferred, it's omitted.

## Common Metrics by Server

### vLLM

| Metric | Type | Description |
|--------|------|-------------|
| `vllm:num_requests_running` | gauge | Requests in execution batches |
| `vllm:num_requests_waiting` | gauge | Requests in queue (saturation indicator) |
| `vllm:num_requests_waiting_by_reason` | gauge | Waiting requests split by `capacity` vs `deferred` |
| `vllm:engine_sleep_state` | gauge | Engine sleep/offload state |
| `vllm:kv_cache_usage_perc` | gauge | KV-cache usage (0.0-1.0, >0.9 = capacity limit) |
| `vllm:num_preemptions` | counter | Requests preempted due to memory pressure |
| `vllm:prefix_cache_hits` | counter | Tokens served from prefix cache |
| `vllm:prefix_cache_queries` | counter | Tokens queried (hit_rate = hits/queries) |
| `vllm:external_prefix_cache_hits` | counter | Tokens served from external KV connector cache |
| `vllm:external_prefix_cache_queries` | counter | Tokens queried from external KV connector cache |
| `vllm:mm_cache_hits` | counter | Multi-modal cache hits |
| `vllm:mm_cache_queries` | counter | Multi-modal cache queries |
| `vllm:time_to_first_token_seconds` | histogram | Time to first token (TTFT) |
| `vllm:e2e_request_latency_seconds` | histogram | End-to-end latency |
| `vllm:inter_token_latency_seconds` | histogram | Time between output tokens (ITL) |
| `vllm:request_queue_time_seconds` | histogram | Time spent waiting in queue |
| `vllm:request_prefill_time_seconds` | histogram | Time spent in prefill phase |
| `vllm:request_decode_time_seconds` | histogram | Time spent in decode phase |
| `vllm:request_prefill_kv_computed_tokens` | histogram | New KV tokens computed during prefill, excluding cached tokens |
| `vllm:request_success` | counter | Completed requests |
| `vllm:prompt_tokens` | counter | Total prompt tokens (rate = prefill throughput) |
| `vllm:prompt_tokens_by_source` | counter | Prompt tokens by `local_compute`, `local_cache_hit`, or `external_kv_transfer` |
| `vllm:prompt_tokens_cached` | counter | Cached prompt tokens (local + external) |
| `vllm:generation_tokens` | counter | Total generated tokens (rate = decode throughput) |

### Dynamo

| Metric | Type | Description |
|--------|------|-------------|
| `dynamo_frontend_requests` | counter | Requests by endpoint/model/status |
| `dynamo_frontend_inflight_requests` | gauge | Requests currently processing |
| `dynamo_frontend_queued_requests` | gauge | Requests awaiting first token |
| `dynamo_frontend_request_duration_seconds` | histogram | End-to-end HTTP latency |
| `dynamo_frontend_time_to_first_token_seconds` | histogram | TTFT including routing overhead |
| `dynamo_frontend_inter_token_latency_seconds` | histogram | Inter-token latency (ITL) |
| `dynamo_frontend_input_sequence_tokens` | histogram | Prompt token distribution |
| `dynamo_frontend_output_sequence_tokens` | histogram | Response token distribution |
| `dynamo_component_requests` | counter | Per-component (prefill/decode) requests |
| `dynamo_component_request_duration_seconds` | histogram | Per-component processing time |
| `dynamo_component_inflight_requests` | gauge | Active requests per worker |
| `dynamo_component_errors` | counter | Errors by component/type |
| `dynamo_component_gpu_cache_usage_percent` | gauge | Backend KV-cache usage |
| `dynamo_component_embedding_cache_hits` | counter | Multimodal embedding-cache hits |
| `dynamo_component_embedding_cache_misses` | counter | Multimodal embedding-cache misses |
| `dynamo_component_kv_publisher_zmq_events` | counter | KV publisher relay events |
| `dynamo_tokio_global_queue_depth` | gauge | Tokio runtime global queue depth |
| `dynamo_frontend_event_loop_delay_seconds` | histogram | Event-loop delay canary |

### SGLang

| Metric | Type | Description |
|--------|------|-------------|
| `sglang:num_running_reqs` | gauge | Running requests |
| `sglang:num_queue_reqs` | gauge | Queued requests (saturation indicator) |
| `sglang:token_usage` | gauge | Memory utilization (>0.9 = capacity limit) |
| `sglang:cache_hit_rate` | gauge | Prefix cache hit rate |
| `sglang:gen_throughput` | gauge | Real-time generation tokens/s |
| `sglang:prompt_tokens` | counter | Total prompt tokens (rate = prefill throughput) |
| `sglang:generation_tokens` | counter | Total generated tokens (rate = decode throughput) |
| `sglang:time_to_first_token_seconds` | histogram | Time to first token (TTFT) |
| `sglang:inter_token_latency_seconds` | histogram | Time between output tokens (ITL) |
| `sglang:e2e_request_latency_seconds` | histogram | End-to-end latency |
| `sglang:queue_time_seconds` | histogram | Queue wait time |
| `sglang:per_stage_req_latency_seconds` | histogram | Latency by observed stage (`request_process`, `prefill_forward`, `decode_waiting`, etc.) |

### TRT-LLM

| Metric | Type | Description |
|--------|------|-------------|
| `trtllm_time_to_first_token_seconds` | histogram | Time to first token (TTFT) |
| `trtllm_e2e_request_latency_seconds` | histogram | End-to-end latency |
| `trtllm_time_per_output_token_seconds` | histogram | Per-token generation time (ITL) |
| `trtllm_request_queue_time_seconds` | histogram | Time in waiting phase |
| `trtllm_request_prefill_time_seconds` | histogram | Prefill/context phase duration |
| `trtllm_request_decode_time_seconds` | histogram | Decode/generation phase duration |
| `trtllm_request_inference_time_seconds` | histogram | Total scheduled inference duration |
| `trtllm_request_success` | counter | Completed requests by `finished_reason` |
| `trtllm_prompt_tokens` | counter | Total prompt tokens (rate = prefill throughput) |
| `trtllm_generation_tokens` | counter | Total generated tokens (rate = decode throughput) |
| `trtllm_num_requests_running` | gauge | Active requests |
| `trtllm_num_requests_waiting` | gauge | Queued requests |
| `trtllm_kv_cache_utilization` | gauge | KV cache utilization |
| `trtllm_kv_cache_hit_rate` | gauge | KV cache hit rate |
| `trtllm_num_aborted_requests` | counter | Dynamo-TRTLLM additional aborted/cancelled requests |
| `trtllm_kv_transfer_latency_seconds` | histogram | Dynamo-TRTLLM additional KV-transfer latency |
| `trtllm_kv_transfer_bytes` | histogram | Dynamo-TRTLLM additional KV-transfer size |
| `trtllm_kv_transfer_speed_gb_s` | histogram | Dynamo-TRTLLM additional KV-transfer speed |

### Triton Inference Server

| Metric | Type | Description |
|--------|------|-------------|
| `nv_inference_request_success` | counter | Successful inference requests |
| `nv_inference_request_failure` | counter | Failed inference requests by `reason` |
| `nv_inference_count` | counter | Inferences performed; divide by `nv_inference_exec_count` for average batch size |
| `nv_inference_exec_count` | counter | Backend batch executions |
| `nv_inference_pending_request_count` | gauge | Requests received by Triton but not yet executing |
| `nv_inference_request_duration_us` | counter | Cumulative end-to-end request handling time |
| `nv_inference_queue_duration_us` | counter | Cumulative scheduler queue time |
| `nv_inference_first_response_histogram_ms` | histogram | Optional first-response latency histogram |
| `nv_gpu_utilization` | gauge | GPU utilization |
| `nv_gpu_memory_used_bytes` | gauge | Used GPU memory |
| `nv_cache_num_hits_per_model` | counter | Response-cache hits per model (when response cache is enabled) |
| `nv_cache_num_misses_per_model` | counter | Response-cache misses per model (when response cache is enabled) |

---

## Troubleshooting

| Problem | Check | Solution |
|---------|-------|----------|
| High p99, good p50 | `vllm:num_requests_waiting` spikes | Queue buildup—reduce concurrency or increase server capacity |
| OOM crashes | `vllm:kv_cache_usage_perc` approaching 1.0 | Reduce `max_model_len` or increase `gpu_memory_utilization` |
| Low throughput | `vllm:num_requests_running` vs `vllm:num_requests_waiting` | Low both = client bottleneck; high waiting = server bottleneck |
| Endpoint unreachable | `curl http://localhost:8000/metrics` or `curl http://localhost:8002/metrics` for Triton | Check server running, network, firewall; use explicit `--server-metrics` URL |
| `WARNING ... non-Prometheus content-type 'application/json'` | `curl -i <base>/metrics` shows `Content-Type: application/json` | Server isn't serving Prometheus at `/metrics`. For TRT-LLM, set `return_perf_metrics: true` in `extra_llm_api_options.yaml` so AIPerf's auto-probe finds `/prometheus/metrics`. To silence the warning entirely, pass `--no-server-metrics`. See [Compatibility & auto-disable](#compatibility--auto-disable). |

---

## CI/CD Integration

```python
import json

with open('server_metrics_export.json') as f:
    data = json.load(f)

latency = data['metrics']['vllm:e2e_request_latency_seconds']['series'][0]['stats']
assert latency['p99_estimate'] < 5.0, f"P99 latency too high: {latency['p99_estimate']}"
```