AIPerf Server Metrics Reference

Comprehensive reference for server metrics collected during AIPerf benchmark runs from NVIDIA Dynamo, vLLM, SGLang, and TensorRT-LLM inference servers.

Table of Contents

  1. Quick Reference: Common Questions
  2. Backend Comparison Matrix
  3. Metric Interpretation Guide
  4. Detailed Metric Definitions
  5. Appendix

Quick Reference: Common Questions

“What is my throughput?”

| Metric | Field | Description |
| --- | --- | --- |
| dynamo_frontend_requests | stats.rate | Requests per second |
| dynamo_frontend_output_tokens | stats.rate | Output tokens per second |
| vllm:prompt_tokens | stats.rate | Input tokens per second (vLLM) |
| vllm:generation_tokens | stats.rate | Generation throughput (vLLM) |
| sglang:gen_throughput | stats.avg | Real-time generation throughput (SGLang) |

“What is my latency?”

| Metric | Field | Description |
| --- | --- | --- |
| dynamo_frontend_request_duration_seconds | stats.p99_estimate | End-to-end p99 latency |
| dynamo_frontend_request_duration_seconds | stats.avg | Average request latency |
| dynamo_frontend_time_to_first_token_seconds | stats.p99_estimate | Time to first token (TTFT) p99 |
| dynamo_frontend_inter_token_latency_seconds | stats.p99_estimate | Inter-token latency (ITL) p99 |
| vllm:time_to_first_token_seconds | stats.p99_estimate | TTFT p99 (vLLM) |
| sglang:queue_time_seconds | stats.p99_estimate | Queue time p99 (SGLang) |
| trtllm:time_to_first_token_seconds | stats.p99_estimate | TTFT p99 (TensorRT-LLM) |

“Am I hitting capacity limits?”

| Metric | Field | Threshold | Meaning |
| --- | --- | --- | --- |
| vllm:kv_cache_usage_perc | stats.max | >0.9 | KV cache near full capacity |
| vllm:num_preemptions | stats.total | >0 | Memory pressure causing preemptions |
| vllm:num_requests_waiting | stats.avg | Growing | Queue building up |
| dynamo_frontend_queued_requests | stats.avg | High | Requests awaiting first token |
| sglang:token_usage | stats.max | >0.9 | High memory utilization (SGLang) |
| sglang:num_queue_reqs | stats.avg | Growing | Saturation (SGLang) |
| trtllm:request_queue_time_seconds | stats.avg | High | Saturation (TensorRT-LLM) |

“What does my workload look like?”

| Metric | Field | Description |
| --- | --- | --- |
| dynamo_frontend_input_sequence_tokens | stats.avg | Average prompt length |
| dynamo_frontend_input_sequence_tokens | stats.p99_estimate | Longest prompts (p99) |
| dynamo_frontend_output_sequence_tokens | stats.avg | Average response length |
| dynamo_frontend_output_sequence_tokens | stats.p99_estimate | Longest responses (p99) |

“Where is time being spent?”

vLLM latency breakdown:

Total latency = Queue + Prefill + Decode
vllm:e2e_request_latency_seconds ≈
vllm:request_queue_time_seconds +
vllm:request_prefill_time_seconds +
vllm:request_decode_time_seconds
| Phase | Metric | What it means |
| --- | --- | --- |
| Queue | vllm:request_queue_time_seconds | Waiting for GPU resources |
| Prefill | vllm:request_prefill_time_seconds | Processing input tokens |
| Decode | vllm:request_decode_time_seconds | Generating output tokens |
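As a rough sanity check of this decomposition, the phase averages can be compared against the end-to-end average. The numbers below are hypothetical stats.avg values for illustration only, not output from any real run:

```python
# Hypothetical vLLM stats.avg values (seconds) from one benchmark run.
queue = 0.12    # vllm:request_queue_time_seconds
prefill = 0.35  # vllm:request_prefill_time_seconds
decode = 4.10   # vllm:request_decode_time_seconds
e2e = 4.60      # vllm:e2e_request_latency_seconds

# The sum only approximates e2e: scheduling gaps between phases are not
# attributed to any of the three phase histograms.
overhead = e2e - (queue + prefill + decode)
print(f"unattributed time: {overhead:.2f}s")
```

With these numbers, roughly 0.03 s of the end-to-end average falls outside the three phase histograms; a large residual here would point at per-step scheduling overhead rather than any single phase.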

SGLang latency breakdown (via sglang:per_stage_req_latency_seconds with stage label):

| Stage Label | What it means |
| --- | --- |
| prefill_waiting | Waiting before prefill |
| prefill_bootstrap | Prefill scheduling overhead |
| prefill_prepare | Preparing prefill batch |
| prefill_forward | Prefill forward pass execution |
| prefill_transfer_kv_cache | KV cache transfer (disaggregated mode) |
| decode_waiting | Waiting before decode |
| decode_transferred | Decode phase execution |

TensorRT-LLM latency breakdown:

| Phase | Metric | What it means |
| --- | --- | --- |
| Queue | trtllm:request_queue_time_seconds | Waiting for GPU resources |
| TTFT | trtllm:time_to_first_token_seconds | Time to first output token |
| Total | trtllm:e2e_request_latency_seconds | Complete request duration |

Backend Comparison Matrix

Key equivalent metrics across backends:

| Capability | Dynamo Frontend | vLLM | SGLang | TensorRT-LLM |
| --- | --- | --- | --- | --- |
| End-to-end latency | dynamo_frontend_request_duration_seconds | vllm:e2e_request_latency_seconds | | trtllm:e2e_request_latency_seconds |
| TTFT | dynamo_frontend_time_to_first_token_seconds | vllm:time_to_first_token_seconds | | trtllm:time_to_first_token_seconds |
| ITL | dynamo_frontend_inter_token_latency_seconds | vllm:inter_token_latency_seconds | | trtllm:time_per_output_token_seconds |
| Queue time | | vllm:request_queue_time_seconds | sglang:queue_time_seconds | trtllm:request_queue_time_seconds |
| KV cache usage | dynamo_component_kvstats_gpu_cache_usage_percent | vllm:kv_cache_usage_perc | sglang:token_usage | |
| Requests running | dynamo_frontend_inflight_requests | vllm:num_requests_running | sglang:num_running_reqs | |
| Requests queued | dynamo_frontend_queued_requests | vllm:num_requests_waiting | sglang:num_queue_reqs | |
| Successful requests | dynamo_frontend_requests | vllm:request_success | | trtllm:request_success |
| Prompt tokens | dynamo_frontend_input_sequence_tokens | vllm:request_prompt_tokens | | |
| Generation tokens | dynamo_frontend_output_sequence_tokens | vllm:request_generation_tokens | | |

Key insight: Dynamo metrics measure at the HTTP/routing layer (user-facing), while backend metrics measure inside the inference engine (debugging). Use both for complete visibility.


Metric Interpretation Guide

Metric Types

Counter (cumulative, monotonically increasing):

  • stats.total = Total change during benchmark
  • stats.rate = Rate of change (per second)
  • Example: vllm:prompt_tokens with stats.rate = prefill throughput
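As a sketch of how the counter-derived fields could be computed from the first and last scraped samples (an illustration under assumed values, not AIPerf's actual implementation):

```python
# Sketch: deriving counter stats from two scrapes of a cumulative counter.
def counter_stats(first_value: float, last_value: float, duration_s: float) -> dict:
    """Counters only increase, so the benchmark-window delta and per-second
    rate follow directly from the first and last samples."""
    total = last_value - first_value  # stats.total
    rate = total / duration_s         # stats.rate
    return {"total": total, "rate": rate}

# e.g. vllm:prompt_tokens going from 1_000 to 61_000 over a 60 s run:
stats = counter_stats(1_000, 61_000, 60.0)
print(stats)  # {'total': 60000, 'rate': 1000.0}
```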

Gauge (point-in-time snapshot):

  • stats.avg = Typical value
  • stats.max = Peak value
  • stats.min = Minimum value
  • stats.p50, stats.p90, stats.p99 = Percentile values
  • Example: vllm:num_requests_waiting with stats.max = worst-case queue depth

Histogram (distribution):

  • stats.total = Total count of observations
  • stats.sum = Sum of all observed values
  • stats.avg = Mean (sum/count)
  • stats.p50_estimate, stats.p90_estimate, stats.p95_estimate, stats.p99_estimate = Estimated percentiles from buckets
  • Example: vllm:e2e_request_latency_seconds with stats.p99_estimate = tail latency

Info (static labels):

  • Only stats.avg is meaningful (value is typically 1.0)
  • Labels contain the actual configuration data
  • Example: vllm:cache_config_info exposes cache settings as labels

Understanding Percentiles

Histogram percentiles are estimated from bucket boundaries, not exact values. Accuracy depends on bucket granularity. See Histogram Buckets for bucket definitions.

Multiple Endpoints

When scraping multiple server instances, each series includes an endpoint_url label to identify the source.


Detailed Metric Definitions

Dynamo Frontend

The Dynamo frontend is the HTTP entry point that receives client requests and routes them to backend workers. These metrics provide user-facing visibility into request processing.

Request Flow

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_frontend_requests | counter | requests | endpoint, model, request_type, status | Total LLM requests processed. Use stats.total for count during benchmark, stats.rate for throughput (req/s). |
| dynamo_frontend_inflight_requests | gauge | requests | model | Requests currently being processed. High values indicate saturation. |
| dynamo_frontend_queued_requests | gauge | requests | model | Requests that have not yet received the first token. |
| dynamo_frontend_disconnected_clients | gauge | clients | | Client connections that disconnected (possibly due to timeouts). |

Label values:

  • endpoint: chat_completions, completions
  • request_type: stream, unary
  • status: success, error

Latency

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| dynamo_frontend_request_duration_seconds | histogram | seconds | model | 0.0, 1.9, 3.4, 6.3, 12.0, 22.0, 40.0, 75.0, 140.0, 260.0, +Inf | End-to-end request latency from HTTP receive to response complete. Key metric for SLA compliance. Use stats.p99_estimate for tail latency. |
| dynamo_frontend_time_to_first_token_seconds | histogram | seconds | model | 0.0, 0.0022, 0.0047, 0.01, 0.022, 0.047, 0.1, 0.22, 0.47, 1.0, 2.2, 4.7, 10.0, 22.0, 48.0, 100.0, 220.0, 480.0, +Inf | Time to first token (TTFT): latency until first token is generated. Critical for perceived responsiveness. |
| dynamo_frontend_inter_token_latency_seconds | histogram | seconds | model | 0.0, 0.0019, 0.0035, 0.0067, 0.013, 0.024, 0.045, 0.084, 0.16, 0.3, 0.56, 1.1, 2.0, +Inf | Inter-token latency (ITL): time between consecutive tokens. Lower is better for streaming UX. |

Tokens

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| dynamo_frontend_output_tokens | counter | tokens | model | | Total output tokens generated. stats.rate = output token throughput (tokens/s). |
| dynamo_frontend_input_sequence_tokens | histogram | tokens | model | 0.0, 100.0, 210.0, 430.0, 870.0, 1800.0, 3600.0, 7400.0, 15000.0, 31000.0, 63000.0, 130000.0, +Inf | Input sequence length distribution. stats.avg = mean prompt length, stats.p99_estimate = longest prompts. |
| dynamo_frontend_output_sequence_tokens | histogram | tokens | model | 0.0, 100.0, 210.0, 430.0, 880.0, 1800.0, 3700.0, 7600.0, 16000.0, 32000.0, +Inf | Output sequence length distribution. stats.avg = mean response length. |

Model Configuration (Static Gauges)

These are constant values that don’t change during the benchmark. Only stats.avg is meaningful.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dynamo_frontend_model_context_length | gauge | model | Maximum context window size in tokens (e.g., 40960). |
| dynamo_frontend_model_kv_cache_block_size | gauge | model | KV cache block size in tokens (e.g., 16). |
| dynamo_frontend_model_max_num_batched_tokens | gauge | model | Maximum tokens that can be batched together. |
| dynamo_frontend_model_max_num_seqs | gauge | model | Maximum concurrent sequences per worker. (vLLM, TensorRT-LLM only) |
| dynamo_frontend_model_total_kv_blocks | gauge | model | Total KV cache blocks available per worker. (vLLM, SGLang only) |
| dynamo_frontend_model_migration_limit | gauge | model | Maximum request migrations allowed (0 = disabled). |

Dynamo Component

Dynamo components are backend workers that execute inference. These metrics come from the worker process level and provide visibility into backend-level request processing.

Request Processing

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_requests | counter | requests | dynamo_component, dynamo_endpoint, dynamo_namespace, model | Requests processed by this worker. Compare across workers to check load balancing. |
| dynamo_component_inflight_requests | gauge | requests | dynamo_component, dynamo_endpoint, dynamo_namespace, model | Requests currently executing on this worker. |
| dynamo_component_errors | counter | errors | dynamo_component, dynamo_endpoint, dynamo_namespace, error_type, model | Errors in work handler. Non-zero indicates problems. |

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| dynamo_component_request_duration_seconds | histogram | seconds | dynamo_component, dynamo_endpoint, dynamo_namespace, model | 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, +Inf | Worker-level request processing time. Compare to frontend duration to measure routing overhead. |

Data Transfer

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_request_bytes | counter | bytes | dynamo_component, dynamo_endpoint, dynamo_namespace, model | Total bytes received in requests. stats.rate = inbound bandwidth. |
| dynamo_component_response_bytes | counter | bytes | dynamo_component, dynamo_endpoint, dynamo_namespace, model | Total bytes sent in responses. stats.rate = outbound bandwidth. |

KV Cache Statistics

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_kvstats_active_blocks | gauge | blocks | dynamo_component, dynamo_namespace | KV cache blocks currently in use. |
| dynamo_component_kvstats_total_blocks | gauge | blocks | dynamo_component, dynamo_namespace | Total KV cache blocks available. |
| dynamo_component_kvstats_gpu_cache_usage_percent | gauge | ratio | dynamo_component, dynamo_namespace | GPU cache utilization (0.0-1.0). High values (>0.9) may cause preemptions. |
| dynamo_component_kvstats_gpu_prefix_cache_hit_rate | gauge | ratio | dynamo_component, dynamo_namespace | Prefix cache hit rate (0.0-1.0). Higher = better reuse of cached prefixes. |

NATS Messaging (Internal)

NATS metrics track the internal messaging system used for component communication within Dynamo.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| dynamo_component_nats_client_connection_state | gauge | | Connection state: 0=disconnected, 1=connected, 2=reconnecting. |
| dynamo_component_nats_client_current_connections | gauge | connections | Active NATS connections. |
| dynamo_component_nats_client_in_messages | gauge | messages | Messages received via NATS. |
| dynamo_component_nats_client_out_messages | gauge | messages | Messages sent via NATS. |
| dynamo_component_nats_client_in_total_bytes | gauge | bytes | Bytes received via NATS. |
| dynamo_component_nats_client_out_overhead_bytes | gauge | bytes | Bytes sent via NATS (including protocol overhead). |

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_nats_service_active_services | gauge | services | dynamo_component, dynamo_namespace, service_name | Active NATS services in component. |
| dynamo_component_nats_service_active_endpoints | gauge | endpoints | dynamo_component, dynamo_namespace, service_name | Active NATS endpoints. |
| dynamo_component_nats_service_requests_total | gauge | requests | dynamo_component, dynamo_namespace, service_name | Total NATS service requests. |
| dynamo_component_nats_service_errors_total | gauge | errors | dynamo_component, dynamo_namespace, service_name | NATS service errors. |
| dynamo_component_nats_service_processing_ms_total | gauge | milliseconds | dynamo_component, dynamo_namespace, service_name | Total NATS processing time. |
| dynamo_component_nats_service_processing_ms_avg | gauge | milliseconds | dynamo_component, dynamo_namespace, service_name | Average NATS processing time. |

vLLM

vLLM is a high-performance inference engine. These metrics provide deep visibility into model execution, cache usage, and request processing phases.

Cache & Memory

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:kv_cache_usage_perc | gauge | ratio | engine, model_name | KV cache utilization (0.0-1.0). Key capacity indicator. Values near 1.0 cause performance degradation. Monitor stats.max. |
| vllm:prefix_cache_hits | counter | tokens | engine, model_name | Tokens served from prefix cache. Higher = better prompt reuse. |
| vllm:prefix_cache_queries | counter | tokens | engine, model_name | Tokens queried against prefix cache. hits/queries = hit rate. |
| vllm:num_preemptions | counter | preemptions | engine, model_name | Requests preempted due to memory pressure. Non-zero indicates capacity issues. |
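The hit rate follows from the two counters above. A minimal sketch with hypothetical stats.total values:

```python
# Sketch: prefix-cache hit rate from the two vLLM counters
# (stats.total deltas over the benchmark window; numbers are hypothetical).
hits = 48_000      # vllm:prefix_cache_hits
queries = 120_000  # vllm:prefix_cache_queries

# Guard against a run where the prefix cache was never queried.
hit_rate = hits / queries if queries else 0.0
print(f"prefix cache hit rate: {hit_rate:.1%}")  # prefix cache hit rate: 40.0%
```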

Queue State

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:num_requests_running | gauge | requests | engine, model_name | Requests currently in model execution batch. Indicates batch size. |
| vllm:num_requests_waiting | gauge | requests | engine, model_name | Requests queued waiting for execution. High values indicate saturation. |

Token Throughput

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:prompt_tokens | counter | tokens | engine, model_name | Prefill tokens processed. stats.rate = prefill throughput. |
| vllm:generation_tokens | counter | tokens | engine, model_name | Generation tokens produced. stats.rate = decode throughput. |
| vllm:request_success | counter | requests | engine, finished_reason, model_name | Successfully completed requests. |

Common finished_reason values: length, stop, error

Request-Level Latency Breakdown

These histograms show where time is spent for each request. Together they decompose the end-to-end latency.

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| vllm:e2e_request_latency_seconds | histogram | seconds | engine, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | Total request latency inside vLLM (queue + inference). |
| vllm:request_queue_time_seconds | histogram | seconds | engine, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | Time spent in WAITING phase (queued before execution). |
| vllm:request_prefill_time_seconds | histogram | seconds | engine, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | Time spent in PREFILL phase (processing input tokens). |
| vllm:request_decode_time_seconds | histogram | seconds | engine, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | Time spent in DECODE phase (generating output tokens). |
| vllm:request_inference_time_seconds | histogram | seconds | engine, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | Time spent in RUNNING phase (prefill + decode). |

Token-Level Latency

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| vllm:time_to_first_token_seconds | histogram | seconds | engine, model_name | 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf | TTFT: time from request start to first output token. |
| vllm:inter_token_latency_seconds | histogram | seconds | engine, model_name | 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf | ITL: time between consecutive output tokens. |
| vllm:request_time_per_output_token_seconds | histogram | seconds | engine, model_name | 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf | Average time per token for each request (total_time / num_tokens). |
| vllm:time_per_output_token_seconds | histogram | seconds | engine, model_name | 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf | (Deprecated) Use vllm:inter_token_latency_seconds instead. |

Request Parameters

These histograms show the distribution of request parameters processed by vLLM.

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| vllm:request_prompt_tokens | histogram | tokens | engine, model_name | 1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, +Inf | Input token count per request. Same as dynamo_frontend_input_sequence_tokens. |
| vllm:request_generation_tokens | histogram | tokens | engine, model_name | 1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, +Inf | Output token count per request. Same as dynamo_frontend_output_sequence_tokens. |
| vllm:request_max_num_generation_tokens | histogram | tokens | engine, model_name | 1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, +Inf | Maximum tokens requested per request (max_tokens parameter). |
| vllm:request_params_max_tokens | histogram | tokens | engine, model_name | 1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, +Inf | Distribution of max_tokens API parameter. |
| vllm:request_params_n | histogram | | engine, model_name | 1.0, 2.0, 5.0, 10.0, 20.0, +Inf | Distribution of n parameter (number of completions per request). |
| vllm:iteration_tokens_total | histogram | tokens | engine, model_name | 1.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0, 2048.0, 4096.0, 8192.0, 16384.0, +Inf | Tokens processed per engine step. Indicates batch efficiency. |

Configuration Info

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vllm:cache_config_info | info | engine, block_size, cache_dtype, enable_prefix_caching, gpu_memory_utilization, num_gpu_blocks, etc. | Static cache configuration. Info is exposed as labels on a gauge metric with value 1.0. |

Common cache config labels:

  • block_size: KV cache block size in tokens (e.g., 16)
  • cache_dtype: Cache data type (e.g., auto)
  • enable_prefix_caching: Whether prefix caching is enabled (True/False)
  • gpu_memory_utilization: GPU memory utilization target (e.g., 0.9)
  • num_gpu_blocks: Total GPU blocks allocated (e.g., 71671)
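These labels allow a quick back-of-the-envelope capacity calculation: total KV cache capacity in tokens is the block size times the number of GPU blocks. Using the example label values above:

```python
# Sketch: KV cache token capacity from vllm:cache_config_info labels
# (the example label values shown above).
block_size = 16         # tokens per KV cache block
num_gpu_blocks = 71671  # total GPU blocks allocated

capacity_tokens = block_size * num_gpu_blocks
print(capacity_tokens)  # 1146736
```

This is the denominator behind vllm:kv_cache_usage_perc: roughly 1.15 M tokens of KV cache in this example before the cache is full and preemptions begin.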

SGLang

SGLang is a fast inference engine with RadixAttention for efficient prefix caching. These metrics provide visibility into SGLang’s scheduling, execution, and advanced features like disaggregated inference and speculative decoding.

Throughput & Performance

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:gen_throughput | gauge | tokens/s | engine_type, model_name, pp_rank, tp_rank | Generation throughput in tokens per second. Real-time throughput indicator. |
| sglang:cache_hit_rate | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Prefix cache hit rate (0.0-1.0). Higher = better prompt reuse via RadixAttention. |
| sglang:token_usage | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Token usage ratio (0.0-1.0). Indicates memory utilization. |
| sglang:utilization | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Overall utilization. -1.0 indicates idle, 0.0+ indicates active. |

Queue State

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:num_running_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests currently executing in the batch. |
| sglang:num_queue_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests in the waiting queue. High values indicate saturation. |
| sglang:num_used_tokens | gauge | tokens | engine_type, model_name, pp_rank, tp_rank | Total tokens currently in use across all requests. |
| sglang:num_running_reqs_offline_batch | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Low-priority offline batch requests running. |
| sglang:num_paused_reqs | gauge | requests | engine_type, model_name, pid, pp_rank, tp_rank | Requests paused by async weight sync. |
| sglang:num_retracted_reqs | gauge | requests | engine_type, model_name, pid, pp_rank, tp_rank | Requests that were retracted/preempted. |
| sglang:num_grammar_queue_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests waiting for grammar processing. |

Disaggregated Inference Queues

For disaggregated prefill/decode deployments where prefill and decode run on separate instances.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:num_prefill_prealloc_queue_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests in prefill preallocation queue. |
| sglang:num_prefill_inflight_queue_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests in prefill inflight queue. |
| sglang:num_decode_prealloc_queue_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests in decode preallocation queue. |
| sglang:num_decode_transfer_queue_reqs | gauge | requests | engine_type, model_name, pp_rank, tp_rank | Requests in decode transfer queue. |

Request Latency Breakdown

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| sglang:queue_time_seconds | histogram | seconds | engine_type, model_name, pp_rank, tp_rank | 0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 2000.0, 2500.0, 3000.0, +Inf | Time spent in WAITING queue before execution starts. |
| sglang:per_stage_req_latency_seconds | histogram | seconds | engine_type, model_name, pp_rank, stage, tp_rank | (see below) | Per-stage latency breakdown. stage label identifies the phase. |

Histogram buckets for sglang:per_stage_req_latency_seconds:

0.001, 0.0016, 0.0026, 0.0043, 0.0069, 0.0112, 0.0181, 0.0293, 0.0474, 0.0768, 0.1245, 0.2017, 0.3267, 0.5293, 0.8575, 1.3891, 2.2503, 3.6455, 5.9057, 9.5672, 15.4989, 25.1082, 40.6753, 65.8939, 106.7481, 172.9320, 280.1498, 453.8427, 735.2252, 1191.0649, +Inf

Stage labels for sglang:per_stage_req_latency_seconds:

| Stage | Description |
| --- | --- |
| prefill_waiting | Time waiting before prefill begins |
| prefill_bootstrap | Time to bootstrap prefill (scheduling overhead) |
| prefill_prepare | Time preparing prefill batch |
| prefill_forward | Time executing prefill forward pass |
| prefill_transfer_kv_cache | Time transferring KV cache (disaggregated mode) |
| decode_waiting | Time waiting before decode begins |
| decode_bootstrap | Time to bootstrap decode |
| decode_prepare | Time preparing decode batch |
| decode_transferred | Total time in transferred/decode phase |

KV Cache Transfer (Disaggregated)

For disaggregated prefill/decode deployments, these metrics track KV cache transfer between instances.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:kv_transfer_latency_ms | gauge | milliseconds | engine_type, model_name, pp_rank, tp_rank | KV cache transfer latency. |
| sglang:kv_transfer_speed_gb_s | gauge | GB/s | engine_type, model_name, pp_rank, tp_rank | KV cache transfer throughput. |
| sglang:kv_transfer_alloc_ms | gauge | milliseconds | engine_type, model_name, pp_rank, tp_rank | Time waiting for KV cache allocation. |
| sglang:kv_transfer_bootstrap_ms | gauge | milliseconds | engine_type, model_name, pp_rank, tp_rank | KV transfer bootstrap time. |
| sglang:pending_prealloc_token_usage | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Token usage for pending preallocated tokens (not preallocated yet). |

Speculative Decoding

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:spec_accept_rate | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Speculative decoding acceptance rate (accepted tokens / total draft tokens in batch). Higher = better speculation. |
| sglang:spec_accept_length | gauge | tokens | engine_type, model_name, pp_rank, tp_rank | Average acceptance length of speculative decoding. |
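A quick worked example of the acceptance-rate definition above, using hypothetical token counts (not output from any real run):

```python
# Sketch: sglang:spec_accept_rate per its definition above
# (accepted tokens / total draft tokens; counts are hypothetical).
accepted_tokens = 3_600  # draft tokens the target model accepted
draft_tokens = 4_800     # total draft tokens proposed in the batch

accept_rate = accepted_tokens / draft_tokens
print(f"speculative accept rate: {accept_rate:.0%}")  # speculative accept rate: 75%
```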

System Configuration

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:is_cuda_graph | gauge | | engine_type, model_name, pp_rank, tp_rank | Whether batch is using CUDA graph (1=yes, 0=no). |
| sglang:engine_startup_time | gauge | seconds | engine_type, model_name, pp_rank, tp_rank | Engine startup time. |
| sglang:engine_load_weights_time | gauge | seconds | engine_type, model_name, pp_rank, tp_rank | Time to load model weights. |
| sglang:mamba_usage | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Token usage for Mamba layers (hybrid models). |
| sglang:swa_token_usage | gauge | ratio | engine_type, model_name, pp_rank, tp_rank | Token usage for sliding window attention layers. |

Common label values:

  • engine_type: unified
  • model_name: Model identifier (e.g., Qwen/Qwen3-0.6B)
  • tp_rank: Tensor parallel rank (e.g., 0, 1, …)
  • pp_rank: Pipeline parallel rank (e.g., 0, 1, …)
  • pid: Process ID

TensorRT-LLM

TensorRT-LLM (trtllm) is NVIDIA’s high-performance inference engine optimized for NVIDIA GPUs. These metrics focus on request latency and completion tracking.

Request Latency

| Metric | Type | Unit | Labels | Histogram Buckets | Description |
| --- | --- | --- | --- | --- | --- |
| trtllm:e2e_request_latency_seconds | histogram | seconds | engine_type, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | End-to-end request latency from submission to completion. Use stats.p99_estimate for tail latency. |
| trtllm:request_queue_time_seconds | histogram | seconds | engine_type, model_name | 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf | Time spent in WAITING phase (queued before execution). |
| trtllm:time_to_first_token_seconds | histogram | seconds | engine_type, model_name | 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf | TTFT: time from request start to first output token. |
| trtllm:time_per_output_token_seconds | histogram | seconds | engine_type, model_name | 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf | Time per output token (inter-token latency). |

Request Completion

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm:request_success | counter | requests | engine_type, finished_reason, model_name | Successfully completed requests. finished_reason label indicates completion reason. |

Common label values:

  • engine_type: trtllm
  • model_name: Model identifier (e.g., Qwen/Qwen3-0.6B)
  • finished_reason: length (reached max_tokens), stop (stop sequence), error (error occurred)

KVBM (KV Block Manager)

Note: These metrics are only available with Dynamo deployments using the KV Block Manager feature for advanced KV cache management.

Block Transfer Operations

All metrics are counters tracking cumulative block movement operations.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| kvbm_matched_tokens | counter | tokens | The number of matched tokens (prefix cache hits). |
| kvbm_offload_blocks_d2d | counter | blocks | The number of offload blocks from device to disk (bypassing host memory). |
| kvbm_offload_blocks_d2h | counter | blocks | The number of offload blocks from device to host memory. |
| kvbm_offload_blocks_h2d | counter | blocks | The number of offload blocks from host memory to disk. |
| kvbm_onboard_blocks_d2d | counter | blocks | The number of onboard blocks from disk to device (bypassing host memory). |
| kvbm_onboard_blocks_h2d | counter | blocks | The number of onboard blocks from host memory to device. |

Block transfer patterns (note that h2d means host-to-disk in the offload metric but host-to-device in the onboard metric):

  • d2d: Device ↔ Disk (direct path, bypassing host memory)
  • d2h: Device → Host (offload to CPU memory)
  • h2d (onboard): Host → Device (onboard from CPU memory)
  • h2d (offload): Host → Disk (persist to storage)

Appendix

Common Metric Labels

Labels that appear across multiple metrics:

| Label | Description | Example Values |
| --- | --- | --- |
| model | Model identifier (Dynamo) | qwen/qwen3-0.6b |
| model_name | Model identifier (backends) | Qwen/Qwen3-0.6B |
| endpoint | API endpoint | chat_completions, completions |
| request_type | Request type | stream, unary |
| status | Request outcome | success, error |
| engine | Engine identifier (vLLM) | 0, 1, … |
| engine_type | Engine type | trtllm, unified |
| tp_rank | Tensor parallel rank | 0, 1, … |
| pp_rank | Pipeline parallel rank | 0, 1, … |
| stage | Processing stage (SGLang) | prefill_forward, decode_transferred |
| finished_reason | Completion reason | length, stop, error |
| dynamo_component | Component identifier | Worker name/ID |
| dynamo_endpoint | Internal endpoint | Internal routing info |
| dynamo_namespace | Namespace | Deployment namespace |
| error_type | Error classification | Error category |
| service_name | NATS service name | Service identifier |

Notes on Metric Usage

  1. Dynamo vs backend metrics: Dynamo metrics measure at the HTTP/routing layer (user-facing), while vLLM/SGLang/TensorRT-LLM metrics measure inside the inference engine. Use Dynamo for user-facing SLAs, backend metrics for debugging performance.

  2. Counter vs Gauge interpretation:

    • Counters: Use stats.total for total change during benchmark, stats.rate for rate of change (per second)
    • Gauges: Use stats.avg for typical value, stats.max for peak, stats.p99 for tail behavior
  3. Histogram percentiles: Histogram percentiles (stats.p50_estimate, stats.p90_estimate, stats.p95_estimate, stats.p99_estimate) are estimated from bucket boundaries. Exact values depend on bucket configuration.

  4. Multiple endpoints: When scraping multiple instances, each series includes an endpoint_url label to identify the source.

  5. Backend-specific capabilities:

    • vLLM: Most comprehensive metrics including full request phase breakdown, cache statistics, and batch efficiency
    • SGLang: RadixAttention cache metrics, disaggregated inference support, speculative decoding stats, per-stage latency breakdowns
    • TensorRT-LLM: Focused on core latency metrics (queue, TTFT, e2e) with minimal overhead

For detailed implementation and usage examples, see the Server Metrics Tutorial. For aggregated statistics, see the JSON Schema Reference. For raw time-series analysis, see the Parquet Schema Reference.