AIPerf Server Metrics Reference
Comprehensive reference for server metrics collected during AIPerf benchmark runs from NVIDIA Dynamo, vLLM, SGLang, and TensorRT-LLM inference servers.
Table of Contents
- Quick Reference: Common Questions
- Backend Comparison Matrix
- Metric Interpretation Guide
- Detailed Metric Definitions
- Appendix
Quick Reference: Common Questions
“What is my throughput?”
“What is my latency?”
“Am I hitting capacity limits?”
“What does my workload look like?”
“Where is time being spent?”
vLLM latency breakdown:
SGLang latency breakdown (via `sglang:per_stage_req_latency_seconds` with `stage` label):
TensorRT-LLM latency breakdown:
Backend Comparison Matrix
Key equivalent metrics across backends:
Key insight: Dynamo metrics measure at the HTTP/routing layer (user-facing), while backend metrics measure inside the inference engine (debugging). Use both for complete visibility.
Metric Interpretation Guide
Metric Types
Counter (cumulative, monotonically increasing):
- `stats.total` = Total change during benchmark
- `stats.rate` = Rate of change (per second)
- Example: `vllm:prompt_tokens` with `stats.rate` = prefill throughput
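The Counter semantics above can be sketched in a few lines. A minimal example, assuming two scrape samples taken at benchmark start and end (the metric name and sample values are hypothetical):

```python
# Minimal sketch of how Counter stats are derived from two scrape
# samples taken at benchmark start and end. Values are hypothetical.

def counter_stats(first_value, last_value, elapsed_seconds):
    """Return (total, rate) for a monotonically increasing counter."""
    total = last_value - first_value      # stats.total: change over the run
    rate = total / elapsed_seconds        # stats.rate: per-second rate
    return total, rate

# e.g. vllm:prompt_tokens over a 60-second run
total, rate = counter_stats(first_value=10_000, last_value=130_000,
                            elapsed_seconds=60.0)
# total = 120_000 prompt tokens, rate = 2000.0 tokens/sec (prefill throughput)
```

Note that counter resets (e.g., a server restart mid-benchmark) would make the naive difference negative; real collectors handle that case separately.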
Gauge (point-in-time snapshot):
- `stats.avg` = Typical value
- `stats.max` = Peak value
- `stats.min` = Minimum value
- `stats.p50`, `stats.p90`, `stats.p99` = Percentile values
- Example: `vllm:num_requests_waiting` with `stats.max` = worst-case queue depth
Histogram (distribution):
- `stats.total` = Total count of observations
- `stats.sum` = Sum of all observed values
- `stats.avg` = Mean (sum/count)
- `stats.p50_estimate`, `stats.p90_estimate`, `stats.p95_estimate`, `stats.p99_estimate` = Estimated percentiles from buckets
- Example: `vllm:e2e_request_latency_seconds` with `stats.p99_estimate` = tail latency
Info (static labels):
- Only `stats.avg` is meaningful (value is typically 1.0)
- Labels contain the actual configuration data
- Example: `vllm:cache_config_info` exposes cache settings as labels
Understanding Percentiles
Histogram percentiles are estimated from bucket boundaries, not exact values. Accuracy depends on bucket granularity. See Histogram Buckets for bucket definitions.
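To make the estimation concrete, here is a sketch of the standard linear-interpolation approach over cumulative bucket counts. The bucket boundaries and counts are hypothetical, and AIPerf's exact estimator may differ in detail:

```python
# Sketch: estimating a percentile from Prometheus-style cumulative
# histogram buckets via linear interpolation within the target bucket.
# Bucket boundaries and counts below are hypothetical.

def estimate_percentile(buckets, q):
    """buckets: sorted list of (upper_bound, cumulative_count); q in [0, 1]."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if count == prev_count:
                return bound
            # interpolate between the bucket's lower and upper bound
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# e.g. an e2e latency histogram with 100 observations
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (2.0, 100)]
p50 = estimate_percentile(buckets, 0.50)  # falls in the (0.1, 0.5] bucket
p99 = estimate_percentile(buckets, 0.99)  # falls in the (1.0, 2.0] bucket
```

Because the true values inside a bucket are unknown, the estimate can be off by up to the bucket width; coarse buckets in the tail make `p99_estimate` the least precise statistic.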
Multiple Endpoints
When scraping multiple server instances, each series includes an `endpoint_url` label to identify the source.
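When post-processing exported series, that label can be used to split results per instance. A small sketch, where the record layout and URLs are hypothetical:

```python
# Sketch: splitting scraped samples per server instance using the
# endpoint_url label. Record layout and URLs are hypothetical.
from collections import defaultdict

samples = [
    {"metric": "vllm:num_requests_running",
     "endpoint_url": "http://node-0:8000/metrics", "value": 12},
    {"metric": "vllm:num_requests_running",
     "endpoint_url": "http://node-1:8000/metrics", "value": 9},
]

by_endpoint = defaultdict(list)
for s in samples:
    by_endpoint[s["endpoint_url"]].append(s["value"])
# each instance's series can now be analyzed independently
```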
Detailed Metric Definitions
Dynamo Frontend
The Dynamo frontend is the HTTP entry point that receives client requests and routes them to backend workers. These metrics provide user-facing visibility into request processing.
Request Flow
Label values:
- `endpoint`: `chat_completions`, `completions`
- `request_type`: `stream`, `unary`
- `status`: `success`, `error`
Latency
Tokens
Model Configuration (Static Gauges)
These are constant values that don’t change during the benchmark. Only `stats.avg` is meaningful.
Dynamo Component
Dynamo components are backend workers that execute inference. These metrics come from the worker process level and provide visibility into backend-level request processing.
Request Processing
Data Transfer
KV Cache Statistics
NATS Messaging (Internal)
NATS metrics track the internal messaging system used for component communication within Dynamo.
vLLM
vLLM is a high-performance inference engine. These metrics provide deep visibility into model execution, cache usage, and request processing phases.
Cache & Memory
Queue State
Token Throughput
Common `finished_reason` values: `length`, `stop`, `error`
Request-Level Latency Breakdown
These histograms show where time is spent for each request. Together they decompose the end-to-end latency.
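As a sanity check, the per-phase means should roughly sum to the end-to-end mean. A sketch of that decomposition, where the metric names follow vLLM's naming and the `stats.avg` values are hypothetical:

```python
# Sketch: verifying that per-phase request latencies roughly decompose
# the end-to-end latency. The stats.avg values below are hypothetical.

phase_means = {
    "vllm:request_queue_time_seconds":   0.012,
    "vllm:request_prefill_time_seconds": 0.085,
    "vllm:request_decode_time_seconds":  1.410,
}
e2e_mean = 1.512  # vllm:e2e_request_latency_seconds stats.avg

reconstructed = sum(phase_means.values())
gap = e2e_mean - reconstructed  # unattributed time (scheduling overhead, etc.)
```

A large gap suggests time spent outside the instrumented phases and is itself a useful debugging signal.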
Token-Level Latency
Request Parameters
These histograms show the distribution of request parameters processed by vLLM.
Configuration Info
Common cache config labels:
- `block_size`: KV cache block size in tokens (e.g., `16`)
- `cache_dtype`: Cache data type (e.g., `auto`)
- `enable_prefix_caching`: Whether prefix caching is enabled (`True`/`False`)
- `gpu_memory_utilization`: GPU memory utilization target (e.g., `0.9`)
- `num_gpu_blocks`: Total GPU blocks allocated (e.g., `71671`)
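These labels combine into a useful derived figure: total KV cache capacity in tokens. A sketch using the example label values listed above (label values arrive as strings in the scraped data):

```python
# Sketch: deriving total KV cache capacity from vllm:cache_config_info
# labels. Label values are the examples listed above, exposed as strings.

labels = {"block_size": "16", "num_gpu_blocks": "71671"}

# total tokens the GPU KV cache can hold at once
capacity_tokens = int(labels["block_size"]) * int(labels["num_gpu_blocks"])
# 16 * 71671 = 1,146,736 tokens
```

Comparing this capacity against observed cache usage helps explain preemptions and queueing under load.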
SGLang
SGLang is a fast inference engine with RadixAttention for efficient prefix caching. These metrics provide visibility into SGLang’s scheduling, execution, and advanced features like disaggregated inference and speculative decoding.
Throughput & Performance
Queue State
Disaggregated Inference Queues
For disaggregated prefill/decode deployments where prefill and decode run on separate instances.
Request Latency Breakdown
Histogram buckets for `sglang:per_stage_req_latency_seconds`:
Stage labels for `sglang:per_stage_req_latency_seconds`:
KV Cache Transfer (Disaggregated)
For disaggregated prefill/decode deployments, these metrics track KV cache transfer between instances.
Speculative Decoding
System Configuration
Common label values:
- `engine_type`: `unified`
- `model_name`: Model identifier (e.g., `Qwen/Qwen3-0.6B`)
- `tp_rank`: Tensor parallel rank (e.g., `0`, `1`, …)
- `pp_rank`: Pipeline parallel rank (e.g., `0`, `1`, …)
- `pid`: Process ID
TensorRT-LLM
TensorRT-LLM (trtllm) is NVIDIA’s high-performance inference engine optimized for NVIDIA GPUs. These metrics focus on request latency and completion tracking.
Request Latency
Request Completion
Common label values:
- `engine_type`: `trtllm`
- `model_name`: Model identifier (e.g., `Qwen/Qwen3-0.6B`)
- `finished_reason`: `length` (reached max_tokens), `stop` (stop sequence), `error` (error occurred)
KVBM (KV Block Manager)
Note: These metrics are only available with Dynamo deployments using the KV Block Manager feature for advanced KV cache management.
Block Transfer Operations
All metrics are counters tracking cumulative block movement operations.
Block transfer patterns:
- d2d: Device ↔ Disk (direct, fast path)
- d2h: Device → Host (offload to CPU memory)
- h2d: Host → Device (onboard from CPU memory)
- h2d (disk): Host → Disk (persist to storage)
Appendix
Common Metric Labels
Labels that appear across multiple metrics:
Notes on Metric Usage
- Dynamo vs backend metrics: Dynamo metrics measure at the HTTP/routing layer (user-facing), while vLLM/SGLang/TensorRT-LLM metrics measure inside the inference engine. Use Dynamo for user-facing SLAs, backend metrics for debugging performance.
- Counter vs Gauge interpretation:
  - Counters: Use `stats.total` for total change during benchmark, `stats.rate` for rate of change (per second)
  - Gauges: Use `stats.avg` for typical value, `stats.max` for peak, `stats.p99` for tail behavior
- Histogram percentiles: Histogram percentiles (`stats.p50_estimate`, `stats.p90_estimate`, `stats.p95_estimate`, `stats.p99_estimate`) are estimated from bucket boundaries. Exact values depend on bucket configuration.
- Multiple endpoints: When scraping multiple instances, each series includes an `endpoint_url` label to identify the source.
- Backend-specific capabilities:
  - vLLM: Most comprehensive metrics including full request phase breakdown, cache statistics, and batch efficiency
  - SGLang: RadixAttention cache metrics, disaggregated inference support, speculative decoding stats, per-stage latency breakdowns
  - TensorRT-LLM: Focused on core latency metrics (queue, TTFT, e2e) with minimal overhead
For detailed implementation and usage examples, see the Server Metrics Tutorial. For aggregated statistics, see the JSON Schema Reference. For raw time-series analysis, see the Parquet Schema Reference.