Comprehensive reference for server metrics collected during AIPerf benchmark runs from NVIDIA Dynamo, vLLM, SGLang, TensorRT-LLM, and Triton Inference Server endpoints.
vLLM latency breakdown:
SGLang latency breakdown (via sglang:per_stage_req_latency_seconds with stage label):
TensorRT-LLM latency breakdown:
Key equivalent metrics across backends:
Key insight: Dynamo metrics measure at the HTTP/routing layer (user-facing), while backend metrics measure inside the inference engine (debugging). Use both for complete visibility.
Counter (cumulative, monotonically increasing):
- stats.total = Total change during benchmark
- stats.rate = Rate of change (per second)
- Example: vllm:prompt_tokens with stats.rate = prefill throughput
- AIPerf groups counter samples under the family name without the trailing _total suffix, so upstream *_total counter samples usually appear as * in AIPerf exports.

Gauge (point-in-time snapshot):
- stats.avg = Typical value
- stats.max = Peak value
- stats.min = Minimum value
- stats.p50, stats.p90, stats.p99 = Percentile values
- Example: vllm:num_requests_waiting with stats.max = worst-case queue depth

Histogram (distribution):
- stats.total = Total count of observations
- stats.sum = Sum of all observed values
- stats.avg = Mean (sum/count)
- stats.p50_estimate, stats.p90_estimate, stats.p95_estimate, stats.p99_estimate = Estimated percentiles from buckets
- Example: vllm:e2e_request_latency_seconds with stats.p99_estimate = tail latency

Info (static labels):
- Only stats.avg is meaningful (value is typically 1.0)
- Example: vllm:cache_config_info exposes cache settings as labels

Histogram percentiles are estimated from bucket boundaries, not exact values. Accuracy depends on bucket granularity. See Histogram Buckets for bucket definitions.
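To make the bucket-based estimation concrete, the sketch below interpolates linearly inside the bucket that contains the target rank, in the style of Prometheus's histogram_quantile. This is an assumption about the general technique, not a description of AIPerf's exact algorithm; the function name, the input layout, and the choice of 0 as the lower bound of the first bucket are all illustrative.

```python
from math import inf


def estimate_percentile(buckets, q):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, as in Prometheus exposition.
    Hypothetical helper for illustration only.
    """
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations, nothing to estimate
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assume the first bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == inf:
                # Cannot interpolate into +Inf; report the last finite bound.
                return prev_bound
            # Linear interpolation within the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


example = [(0.1, 10), (0.5, 60), (1.0, 90), (inf, 100)]
p50 = estimate_percentile(example, 0.50)  # about 0.42
p99 = estimate_percentile(example, 0.99)  # 1.0, clamped to the last finite bound
```

The p99 case shows why bucket granularity matters: once the target rank falls into the +Inf bucket, the estimate cannot exceed the largest finite boundary.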
When scraping multiple server instances, each series includes an endpoint_url label to identify the source.
The Dynamo frontend is the HTTP entry point that receives client requests and routes them to backend workers. These metrics provide user-facing visibility into request processing.
Label values:
- endpoint: completions, chat_completions, embeddings, images, videos, audios, responses, anthropic_messages, tensor
- request_type: stream, unary
- status: success, error
- error_type: empty string for success, or validation, not_found, overload, cancelled, response_timeout, internal, not_implemented

Histogram buckets:
- dynamo_frontend_request_duration_seconds: 0.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 130.0, 260.0, 510.0, +Inf
- dynamo_frontend_time_to_first_token_seconds: 0.0, 0.0022, 0.0047, 0.01, 0.022, 0.047, 0.1, 0.22, 0.47, 1.0, 2.2, 4.7, 10.0, 22.0, 48.0, 100.0, 220.0, 480.0, +Inf
- dynamo_frontend_inter_token_latency_seconds: 0.0, 0.0019, 0.0035, 0.0067, 0.013, 0.024, 0.045, 0.084, 0.16, 0.3, 0.56, 1.1, 2.0, +Inf

Histogram buckets:
- dynamo_frontend_cached_tokens: Same as dynamo_frontend_input_sequence_tokens
- dynamo_frontend_tokenizer_latency_ms: 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, +Inf
- dynamo_frontend_input_sequence_tokens: 0.0, 100.0, 210.0, 430.0, 870.0, 1800.0, 3600.0, 7400.0, 15000.0, 31000.0, 63000.0, 130000.0, +Inf
- dynamo_frontend_output_sequence_tokens: 0.0, 100.0, 210.0, 430.0, 880.0, 1800.0, 3700.0, 7600.0, 16000.0, 32000.0, +Inf

These are constant values that don’t change during the benchmark. Only stats.avg is meaningful.
Histogram buckets:
- dynamo_frontend_stage_duration_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, +Inf
- dynamo_frontend_tokenize_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
- dynamo_frontend_template_seconds: 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, +Inf

Router request metrics are component-scoped and therefore also carry dynamo_namespace, dynamo_component, optional dynamo_endpoint, worker_id, and router_id labels.
Histogram buckets:
- dynamo_component_router_time_to_first_token_seconds: Same as dynamo_frontend_time_to_first_token_seconds
- dynamo_component_router_inter_token_latency_seconds: Same as dynamo_frontend_inter_token_latency_seconds
- dynamo_component_router_input_sequence_tokens: Same as dynamo_frontend_input_sequence_tokens
- dynamo_component_router_output_sequence_tokens: Same as dynamo_frontend_output_sequence_tokens
- dynamo_component_router_kv_hit_rate: 0.0, 0.05, 0.1, ... 1.0, +Inf
- dynamo_component_router_kv_transfer_estimated_latency_seconds: 0.0, 0.0019, 0.0037, 0.0072, 0.014, 0.027, 0.052, 0.1, 0.19, 0.37, 0.72, 1.4, 2.7, 5.2, 10.0, +Inf
- dynamo_component_router_shared_cache_hit_rate: 0.0, 0.05, 0.1, ... 1.0, +Inf
- dynamo_component_router_shared_cache_beyond_blocks: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, +Inf
- dynamo_router_overhead_block_hashing_ms: exponential 0.001 * 2^n, 15 buckets
- dynamo_router_overhead_indexer_find_matches_ms: exponential 0.01 * 3^n, 17 buckets
- dynamo_router_overhead_seq_hashing_ms: exponential 0.001 * 2^n, 15 buckets
- dynamo_router_overhead_scheduling_ms: exponential 0.01 * 3^n, 17 buckets
- dynamo_router_overhead_total_ms: exponential 0.01 * 3^n, 17 buckets
- dynamo_router_overhead_shared_cache_query_ms: exponential 0.01 * 3^n, 17 buckets

These component-scoped metrics track Dynamo’s KV-event publisher and relay path.
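The exponential notation used for the dynamo_router_overhead_*_ms histograms can be expanded into explicit boundaries with a short helper. This is a minimal sketch that assumes n starts at 0, which is the usual convention for exponential bucket helpers in Prometheus client libraries; the function name is hypothetical and the source does not state the starting index.

```python
def exponential_buckets(start, factor, count):
    """Return `count` finite upper bounds: start * factor**n for n = 0..count-1.

    Assumes n starts at 0; the implicit +Inf bucket is not included.
    """
    return [start * factor**n for n in range(count)]


# "exponential 0.001 * 2^n, 15 buckets" (for example, block hashing overhead)
hashing_ms = exponential_buckets(0.001, 2, 15)
# "exponential 0.01 * 3^n, 17 buckets" (for example, total router overhead)
total_ms = exponential_buckets(0.01, 3, 17)

print(hashing_ms[0], hashing_ms[-1])
print(total_ms[0], total_ms[-1])
```

Under this assumption the 2^n family spans roughly 0.001 ms to 16.4 ms, while the 3^n family reaches into the hundreds of seconds, which is worth keeping in mind when reading percentile estimates near the top of either range.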
Dynamo component metrics come from worker, router, and backend processes. Metrics created through Dynamo’s hierarchy usually carry dynamo_namespace, dynamo_component, optional dynamo_endpoint, and worker_id labels; endpoint handlers may also add engine labels such as model.
Histogram buckets:
- dynamo_component_request_duration_seconds: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, 300.0, 600.0, +Inf

Histogram buckets:
- dynamo_work_handler_network_transit_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
- dynamo_work_handler_time_to_first_response_seconds: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, +Inf
- dynamo_work_handler_permit_wait_seconds: 0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, +Inf

Dynamo’s current in-code NATS metric is a transport error counter. Older dynamo_component_nats_client_* and dynamo_component_nats_service_* families were not verified in current upstream code and are not documented as current.
Histogram buckets:
- dynamo_request_plane_queue_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
- dynamo_request_plane_send_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
- dynamo_request_plane_roundtrip_ttft_seconds: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, +Inf

vLLM is a high-performance inference engine. These metrics provide deep visibility into model execution, cache usage, and request processing phases. Current vLLM v1 Prometheus metrics use model_name and engine labels unless noted otherwise.
Common finished_reason values: stop, length, abort, error, repetition
These histograms show where time is spent for each request. Together they decompose the end-to-end latency.
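One way to use this decomposition is to compare the stats.sum of each phase histogram against the end-to-end sum. The sketch below assumes that queue, prefill, and decode time together account for most of the end-to-end latency, and it reports whatever is left over as unaccounted. The dictionary keys and the function name are hypothetical, not AIPerf field names.

```python
def latency_shares(sums):
    """Return each phase's share of total end-to-end time.

    `sums` maps a phase name to that histogram's stats.sum in seconds.
    Keys used here ("e2e", "queue", "prefill", "decode") are illustrative.
    """
    e2e = sums["e2e"]
    if e2e <= 0:
        return {}  # no completed requests in the window
    shares = {phase: sums[phase] / e2e for phase in ("queue", "prefill", "decode")}
    # Time not covered by the three phases (overhead, rounding, and so on).
    shares["unaccounted"] = 1.0 - sum(shares.values())
    return shares


shares = latency_shares(
    {"e2e": 100.0, "queue": 10.0, "prefill": 20.0, "decode": 65.0}
)
for phase, share in shares.items():
    print(f"{phase}: {share:.0%}")
```

A large queue share usually points at saturation rather than slow model execution, while a large unaccounted share suggests the phases do not line up as assumed for that deployment.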
Histogram buckets:
- vllm:e2e_request_latency_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- vllm:request_queue_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- vllm:request_prefill_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- vllm:request_decode_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- vllm:request_inference_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf

Histogram buckets:
- vllm:time_to_first_token_seconds: 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
- vllm:inter_token_latency_seconds: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf
- vllm:request_time_per_output_token_seconds: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf

These histograms show the distribution of request parameters processed by vLLM.
Histogram buckets:
- vllm:request_prompt_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
- vllm:request_generation_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
- vllm:request_max_num_generation_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
- vllm:request_params_max_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
- vllm:request_params_n: 1, 2, 5, 10, 20, +Inf
- vllm:iteration_tokens_total: 1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, +Inf
- vllm:request_prefill_kv_computed_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf

Histogram buckets:
- vllm:kv_block_lifetime_seconds: 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300, 600, 1200, 1800, +Inf
- vllm:kv_block_idle_before_evict_seconds: Same as vllm:kv_block_lifetime_seconds
- vllm:kv_block_reuse_gap_seconds: Same as vllm:kv_block_lifetime_seconds
- vllm:kv_offload_size: 1000000, 5000000, 10000000, 20000000, 40000000, 60000000, 80000000, 100000000, 150000000, 200000000, +Inf

Common cache config labels:
- block_size: KV cache block size in tokens (e.g., 16)
- cache_dtype: Cache data type (e.g., auto)
- enable_prefix_caching: Whether prefix caching is enabled (True/False)
- gpu_memory_utilization: GPU memory utilization target (e.g., 0.9)
- num_gpu_blocks: Total GPU blocks allocated (e.g., 71671)

SGLang is a fast inference engine with RadixAttention for efficient prefix caching. These metrics provide visibility into SGLang’s scheduling, execution, token accounting, disaggregated inference, speculative decoding, and optional cache features.
Unless noted otherwise, scheduler metrics use labels model_name, engine_type, tp_rank, pp_rank, and moe_ep_rank. dp_rank is added when data parallel rank is present, priority is added when priority scheduling is enabled, and user-configured extra_metric_labels may add more labels.
Histogram buckets:
- sglang:prompt_tokens_histogram: 100, 300, 500, 700, 1000, 1500, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 12500, 15000, 17500, 20000, 22500, 25000, 27500, 30000, 35000, 40000, 60000, 80000, 100000, 200000, 300000, 400000, 600000, 800000, 1000000, 1100000, +Inf
- sglang:uncached_prompt_tokens_histogram: Same as sglang:prompt_tokens_histogram
- sglang:generation_tokens_histogram: Same as sglang:prompt_tokens_histogram by default
- sglang:get_loads_duration_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf

Histogram buckets:
- sglang:time_to_first_token_seconds: 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40, 60, 80, 100, 200, 400, +Inf
- sglang:inter_token_latency_seconds: 0.002, 0.004, 0.006, 0.008, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.060, 0.080, 0.100, 0.200, 0.400, 0.600, 0.800, 1.000, 2.000, 4.000, 6.000, 8.000, +Inf
- sglang:e2e_request_latency_seconds: 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40, 60, 80, 100, 200, 400, 600, 1200, 1800, 2400, +Inf
- sglang:queue_time_seconds: 0.0, 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000, +Inf
- sglang:per_stage_req_latency_seconds: (see below)

Histogram buckets for sglang:per_stage_req_latency_seconds:
Observed stage labels for sglang:per_stage_req_latency_seconds:
These metrics apply to disaggregated prefill/decode deployments, where prefill and decode run on separate instances.
Histogram buckets:
- sglang:kv_transfer_latency_ms: 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, +Inf
- sglang:kv_transfer_speed_gb_s: 0.1, 0.5, 1, 5, 10, 25, 50, 100, 200, 400, +Inf
- sglang:kv_transfer_total_mb: 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, +Inf
- sglang:kv_transfer_alloc_ms: 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, +Inf
- sglang:kv_transfer_bootstrap_ms: 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, +Inf

These metric families are emitted only when the corresponding feature is enabled.
These are constant gauges emitted once at startup.
Common label values:
- engine_type: unified, prefill, or decode
- model_name: Model identifier (e.g., Qwen/Qwen3-0.6B)
- tp_rank: Tensor parallel rank (e.g., 0, 1, …)
- pp_rank: Pipeline parallel rank (e.g., 0, 1, …)
- moe_ep_rank: MoE expert-parallel rank
- dp_rank: Data-parallel rank when present
- priority: empty string for totals, or a priority value for per-priority queue gauges

TensorRT-LLM (trtllm) is NVIDIA’s high-performance inference engine optimized for NVIDIA GPUs. These metrics cover request latency, token accounting, queue/load state, KV cache behavior, memory usage, and optional speculative decoding stats. Dynamo-TRTLLM does not rename the engine’s native trtllm_ metrics, but it can emit additional Python-side metrics with the same trtllm_ prefix so they pass the same prefix filters.
TRT-LLM exposes Prometheus at a non-standard path. By default trtllm-serve serves an iteration-stats JSON array at /metrics (not Prometheus exposition format). The metrics below are only available when the server is launched with return_perf_metrics: true in extra_llm_api_options.yaml, which mounts the proper Prometheus exposition at /prometheus/metrics. Iteration-derived metrics additionally require iteration stats to be enabled (enable_iter_perf_stats: true for the PyTorch backend; TensorRT backend iteration stats are enabled by default). AIPerf detects the JSON response on /metrics, probes the alt path automatically, and swaps the collector’s URL on success — see Compatibility & auto-disable.
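The detect-and-probe behaviour described above can be pictured roughly as follows. This is an illustrative sketch, not AIPerf's actual implementation: the function names are hypothetical, and `fetch` stands in for any callable that returns the response body as text, or None when the request fails.

```python
import json


def looks_like_json_array(body):
    """Return True if the body parses as a JSON array (iteration stats)."""
    stripped = body.lstrip()
    if not stripped.startswith("["):
        return False
    try:
        return isinstance(json.loads(stripped), list)
    except json.JSONDecodeError:
        return False


def pick_metrics_url(base_url, fetch):
    """Prefer /metrics; fall back to /prometheus/metrics if /metrics is JSON.

    `fetch` is a hypothetical callable: url -> body text, or None on failure.
    Returns the usable URL, or None if neither path serves Prometheus text.
    """
    for path in ("/metrics", "/prometheus/metrics"):
        url = base_url + path
        body = fetch(url)
        if body is not None and not looks_like_json_array(body):
            return url
    return None


# Simulated server: JSON iteration stats on /metrics, Prometheus text on the alt path.
pages = {
    "http://host:8000/metrics": '[{"iter": 1}]',
    "http://host:8000/prometheus/metrics": "# TYPE trtllm_request_success counter\n",
}
chosen = pick_metrics_url("http://host:8000", pages.get)
print(chosen)
```

Passing the fetcher in as a parameter keeps the decision logic testable without a live server.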
AIPerf records Prometheus family names as exposed by the server, with Prometheus counter samples grouped under the counter family name without the sample’s trailing _total suffix. For example, upstream trtllm_request_success_total samples appear under trtllm_request_success in AIPerf outputs.
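The naming rule can be sketched as a small normalization step. The helper below is hypothetical and only illustrates the behaviour described in this section; it strips the suffix for counters only, since a non-counter family may legitimately end in _total.

```python
def family_name(sample_name, metric_type):
    """Map a Prometheus sample name to the family name used for grouping.

    Only counter samples lose a trailing "_total"; other types are unchanged.
    Hypothetical helper for illustration.
    """
    suffix = "_total"
    if metric_type == "counter" and sample_name.endswith(suffix):
        return sample_name[: -len(suffix)]
    return sample_name


counter_family = family_name("trtllm_request_success_total", "counter")
histogram_family = family_name("vllm:iteration_tokens_total", "histogram")
print(counter_family)    # trtllm_request_success
print(histogram_family)  # vllm:iteration_tokens_total
```

When joining AIPerf exports with dashboards built on raw Prometheus names, applying the same mapping in the other direction avoids silent mismatches.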
Histogram buckets:
- trtllm_e2e_request_latency_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- trtllm_request_queue_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- trtllm_time_to_first_token_seconds: 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
- trtllm_time_per_output_token_seconds: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf
- trtllm_request_prefill_time_seconds: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
- trtllm_request_decode_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
- trtllm_request_inference_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf

Common label values:
- engine_type: pytorch, _autodeploy, or unknown from the configured backend (not always trtllm).
- model_name: Model identifier (e.g., Qwen/Qwen3-0.6B).
- finished_reason: stop, length, timeout, or cancelled. Upstream code does not emit error as a finished_reason value for trtllm_request_success.

These are emitted by Dynamo’s TRT-LLM worker integration in addition to the engine-native TensorRT-LLM metrics above. They intentionally use the trtllm_ prefix.
Triton Inference Server exposes Prometheus text metrics on a dedicated metrics service, by default http://localhost:8002/metrics. The endpoint is enabled unless tritonserver --allow-metrics=false is set; --allow-gpu-metrics=false and --allow-cpu-metrics=false disable only those metric groups. Use --metrics-port, --metrics-address, and --metrics-interval-ms to change where interval metrics are served and how often they refresh.
By default, Triton exposes cumulative latency counters in microseconds. AIPerf reports stats.total for the benchmark-window increase and stats.rate as microseconds accumulated per second. Optional histogram and summary latency families are controlled with --metrics-config; AIPerf exports histograms but skips Prometheus summary metrics. Model-level metrics use model and version labels, and can also include model_namespace, model tag labels prefixed with _, and gpu_uuid when configured by Triton.
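Because these latency families are cumulative counters rather than histograms, a mean latency for the benchmark window has to be derived by dividing one counter's stats.total by another's. The sketch below assumes the duration comes from a counter such as nv_inference_request_duration_us and the request count from nv_inference_request_success; check the server's own /metrics output for the families it actually exposes. The function name is hypothetical.

```python
def mean_latency_ms(duration_us_total, request_count_total):
    """Mean per-request latency in milliseconds over the benchmark window.

    Both arguments are stats.total values (benchmark-window increases):
    cumulative microseconds and the number of requests they cover.
    """
    if request_count_total == 0:
        return None  # no requests completed in the window
    return duration_us_total / request_count_total / 1000.0


# Example: 4.5 seconds of accumulated request time across 300 requests.
mean_ms = mean_latency_ms(4_500_000, 300)
print(mean_ms)  # 15.0
```

This yields only an average; tail behaviour requires the optional histogram families enabled through --metrics-config.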
Response-cache metrics are emitted only when Triton’s response cache is enabled.
When TensorRT-LLM runs as a Triton backend, the backend can expose additional custom families using the nv_trt_llm_* and nv_llm_* prefixes.
Note: These metrics are only available with Dynamo deployments using the KV Block Manager feature for advanced KV cache management.
All metrics are counters tracking cumulative block movement operations.
Block transfer patterns:
Dynamo’s logical KVBM pool collector also exports pool-scoped counters and gauges. These carry a pool label and may include external deployment labels such as instance_id.
Labels that appear across multiple metrics:
Dynamo vs backend metrics: Dynamo metrics measure at the HTTP/routing layer (user-facing), while vLLM/SGLang/TensorRT-LLM metrics measure inside the inference engine. Triton metrics measure Triton core/backend scheduling plus system telemetry. Use Dynamo for user-facing SLAs, backend/Triton metrics for debugging performance.
Counter vs Gauge interpretation:
- Counters: stats.total for total change during benchmark, stats.rate for rate of change (per second)
- Gauges: stats.avg for typical value, stats.max for peak, stats.p99 for tail behavior

Histogram percentiles: The estimates (stats.p50_estimate, stats.p90_estimate, stats.p95_estimate, stats.p99_estimate) are computed from bucket boundaries, so their accuracy depends on the bucket configuration.
Multiple endpoints: When scraping multiple instances, each series includes an endpoint_url label to identify the source.
Backend-specific capabilities:
For detailed implementation and usage examples, see the Server Metrics Tutorial. For aggregated statistics, see the JSON Schema Reference. For raw time-series analysis, see the Parquet Schema Reference.