Copyright © 2026, NVIDIA Corporation.

AIPerf Server Metrics Reference

Comprehensive reference for server metrics collected during AIPerf benchmark runs from NVIDIA Dynamo, vLLM, SGLang, TensorRT-LLM, and Triton Inference Server endpoints.

Table of Contents

  1. Quick Reference: Common Questions
  2. Backend Comparison Matrix
  3. Metric Interpretation Guide
  4. Detailed Metric Definitions
    • Dynamo Frontend
    • Dynamo Component
    • vLLM
    • SGLang
    • TensorRT-LLM
    • Triton Inference Server
    • KVBM (KV Block Manager)
  5. Appendix

Quick Reference: Common Questions

“What is my throughput?”

| Metric | Field | Description |
| --- | --- | --- |
| dynamo_frontend_requests | stats.rate | Requests per second |
| dynamo_frontend_output_tokens | stats.rate | Output tokens per second |
| vllm:prompt_tokens | stats.rate | Input tokens per second (vLLM) |
| vllm:generation_tokens | stats.rate | Generation throughput (vLLM) |
| sglang:prompt_tokens | stats.rate | Prefill throughput (SGLang) |
| sglang:generation_tokens | stats.rate | Generation throughput (SGLang) |
| sglang:gen_throughput | stats.avg | Real-time generation throughput (SGLang) |
| nv_inference_request_success | stats.rate | Successful requests per second (Triton) |
| nv_inference_count | stats.rate | Inferences per second (Triton) |

“What is my latency?”

| Metric | Field | Description |
| --- | --- | --- |
| dynamo_frontend_request_duration_seconds | stats.p99_estimate | End-to-end p99 latency |
| dynamo_frontend_request_duration_seconds | stats.avg | Average request latency |
| dynamo_frontend_time_to_first_token_seconds | stats.p99_estimate | Time to first token (TTFT) p99 |
| dynamo_frontend_inter_token_latency_seconds | stats.p99_estimate | Inter-token latency (ITL) p99 |
| vllm:time_to_first_token_seconds | stats.p99_estimate | TTFT p99 (vLLM) |
| sglang:time_to_first_token_seconds | stats.p99_estimate | TTFT p99 (SGLang) |
| sglang:e2e_request_latency_seconds | stats.p99_estimate | End-to-end p99 latency (SGLang) |
| sglang:inter_token_latency_seconds | stats.p99_estimate | ITL p99 (SGLang) |
| sglang:queue_time_seconds | stats.p99_estimate | Queue time p99 (SGLang) |
| trtllm_time_to_first_token_seconds | stats.p99_estimate | TTFT p99 (TensorRT-LLM) |
| nv_inference_request_duration_us | stats.rate | End-to-end request time accumulation rate (Triton, microseconds/s) |
| nv_inference_first_response_histogram_ms | stats.p99_estimate | First-response latency p99 when Triton histogram latencies are enabled |

“Am I hitting capacity limits?”

| Metric | Field | Threshold | Meaning |
| --- | --- | --- | --- |
| vllm:kv_cache_usage_perc | stats.max | >0.9 | KV cache near full capacity |
| vllm:num_preemptions | stats.total | >0 | Memory pressure causing preemptions |
| vllm:num_requests_waiting | stats.avg | Growing | Queue building up |
| dynamo_frontend_queued_requests | stats.avg | High | Requests awaiting first token |
| sglang:token_usage | stats.max | >0.9 | High memory utilization (SGLang) |
| sglang:num_queue_reqs | stats.avg | Growing | Saturation (SGLang) |
| trtllm_request_queue_time_seconds | stats.avg | High | Saturation (TensorRT-LLM) |
| nv_inference_pending_request_count | stats.max | Growing | Triton backend queue saturation |
| nv_gpu_memory_used_bytes | stats.max | Near total | Triton GPU memory pressure |
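
The vLLM thresholds in this table can be turned into a simple automated check. The sketch below is illustrative only: the function name and its arguments are hypothetical, and the rule used for "Growing" (the mean of the second half of the samples exceeds the mean of the first half) is an assumption, not something AIPerf defines.

```python
from typing import List, Sequence


def capacity_warnings(
    kv_cache_usage_max: float,
    num_preemptions_total: float,
    waiting_samples: Sequence[float],
) -> List[str]:
    """Flag capacity problems using the vLLM thresholds from the table."""
    warnings = []
    if kv_cache_usage_max > 0.9:  # vllm:kv_cache_usage_perc, stats.max
        warnings.append("KV cache near full capacity")
    if num_preemptions_total > 0:  # vllm:num_preemptions, stats.total
        warnings.append("Memory pressure causing preemptions")
    # Assumed heuristic for "Growing": compare the mean queue depth of the
    # second half of the run against the first half.
    half = len(waiting_samples) // 2
    if half > 0:
        first = sum(waiting_samples[:half]) / half
        second = sum(waiting_samples[half:]) / (len(waiting_samples) - half)
        if second > first:
            warnings.append("Queue building up")
    return warnings


# Illustrative values, not measured data.
print(capacity_warnings(0.95, 3, [0, 1, 2, 5, 9, 14]))
```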

“What does my workload look like?”

| Metric | Field | Description |
| --- | --- | --- |
| dynamo_frontend_input_sequence_tokens | stats.avg | Average prompt length |
| dynamo_frontend_input_sequence_tokens | stats.p99_estimate | Longest prompts (p99) |
| dynamo_frontend_output_sequence_tokens | stats.avg | Average response length |
| dynamo_frontend_output_sequence_tokens | stats.p99_estimate | Longest responses (p99) |
| nv_inference_count / nv_inference_exec_count | stats.total | Triton average batch size (inference_count / exec_count) |
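
The Triton batch-size ratio in the last row can be computed from the two stats.total values. This is a minimal sketch; the function name and the sample numbers are hypothetical.

```python
def average_batch_size(inference_count_total: float, exec_count_total: float) -> float:
    """Average batch size = inferences performed / batched executions."""
    if exec_count_total <= 0:
        return 0.0  # no executions observed during the benchmark
    return inference_count_total / exec_count_total


# Illustrative values: 4800 inferences served by 600 batched executions.
print(average_batch_size(4800, 600))  # 8.0
```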

“Where is time being spent?”

vLLM latency breakdown:

Total latency = Queue + Prefill + Decode

vllm:e2e_request_latency_seconds ≈
    vllm:request_queue_time_seconds +
    vllm:request_prefill_time_seconds +
    vllm:request_decode_time_seconds

| Phase | Metric | What it means |
| --- | --- | --- |
| Queue | vllm:request_queue_time_seconds | Waiting for GPU resources |
| Prefill | vllm:request_prefill_time_seconds | Processing input tokens |
| Decode | vllm:request_decode_time_seconds | Generating output tokens |
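
To see which phase dominates, the vLLM decomposition can be applied to the stats.avg of each histogram. The sketch below assumes you have already read those four averages from the export; the function name and the sample values are hypothetical. Because the relationship is approximate, the remainder is reported separately rather than forced to zero.

```python
def phase_shares(queue_s: float, prefill_s: float, decode_s: float, e2e_s: float) -> dict:
    """Return each phase's share of end-to-end latency plus the unexplained remainder."""
    if e2e_s <= 0:
        raise ValueError("e2e_s must be positive")
    accounted = queue_s + prefill_s + decode_s
    return {
        "queue": queue_s / e2e_s,
        "prefill": prefill_s / e2e_s,
        "decode": decode_s / e2e_s,
        "unaccounted": (e2e_s - accounted) / e2e_s,
    }


# Illustrative averages in seconds, not measured data.
print(phase_shares(queue_s=0.5, prefill_s=0.25, decode_s=1.0, e2e_s=2.0))
# {'queue': 0.25, 'prefill': 0.125, 'decode': 0.5, 'unaccounted': 0.125}
```

A large queue share points to saturation, while a large decode share is expected for long responses.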

SGLang latency breakdown (via sglang:per_stage_req_latency_seconds with stage label):

| Stage Label | What it means |
| --- | --- |
| request_process | Unified-mode request processing before queue entry |
| prefill_bootstrap | Prefill bootstrap queue time in disaggregated prefill mode |
| prefill_forward | Prefill forward pass execution |
| chunked_prefill | Additional chunked-prefill forward slices |
| prefill_transfer_kv_cache | KV cache transfer from prefill to decode worker |
| decode_prepare | Decode preallocation preparation |
| decode_bootstrap | Decode bootstrap/transfer setup |
| decode_waiting | Waiting before decode forward execution |
| decode_transferred | Decode-side transferred request processing before queue entry |
| fake_output | Fake-output/prebuilt decode stage |

TensorRT-LLM latency breakdown:

| Phase | Metric | What it means |
| --- | --- | --- |
| Queue | trtllm_request_queue_time_seconds | Waiting for GPU resources |
| TTFT | trtllm_time_to_first_token_seconds | Time to first output token |
| Total | trtllm_e2e_request_latency_seconds | Complete request duration |

Backend Comparison Matrix

Key equivalent metrics across backends:

| Capability | Dynamo Frontend | vLLM | SGLang | TensorRT-LLM | Triton |
| --- | --- | --- | --- | --- | --- |
| End-to-end latency | dynamo_frontend_request_duration_seconds | vllm:e2e_request_latency_seconds | sglang:e2e_request_latency_seconds | trtllm_e2e_request_latency_seconds | nv_inference_request_duration_us |
| TTFT / first response | dynamo_frontend_time_to_first_token_seconds | vllm:time_to_first_token_seconds | sglang:time_to_first_token_seconds | trtllm_time_to_first_token_seconds | nv_inference_first_response_histogram_ms |
| ITL | dynamo_frontend_inter_token_latency_seconds | vllm:inter_token_latency_seconds | sglang:inter_token_latency_seconds | trtllm_time_per_output_token_seconds | — |
| Queue time | — | vllm:request_queue_time_seconds | sglang:queue_time_seconds | trtllm_request_queue_time_seconds | nv_inference_queue_duration_us |
| KV/cache usage | dynamo_component_gpu_cache_usage_percent | vllm:kv_cache_usage_perc | sglang:token_usage | trtllm_kv_cache_utilization | response cache nv_cache_* |
| Requests running | dynamo_frontend_inflight_requests | vllm:num_requests_running | sglang:num_running_reqs | trtllm_num_requests_running | — |
| Requests queued | dynamo_frontend_queued_requests | vllm:num_requests_waiting | sglang:num_queue_reqs | trtllm_num_requests_waiting | nv_inference_pending_request_count |
| Successful requests | dynamo_frontend_requests | vllm:request_success | sglang:num_requests | trtllm_request_success | nv_inference_request_success |
| Prompt tokens | dynamo_frontend_input_sequence_tokens | vllm:request_prompt_tokens | sglang:prompt_tokens_histogram | trtllm_prompt_tokens | — |
| Generation tokens | dynamo_frontend_output_sequence_tokens | vllm:request_generation_tokens | sglang:generation_tokens_histogram | trtllm_generation_tokens | — |

Key insight: Dynamo metrics are measured at the HTTP/routing layer and reflect what users experience, while backend metrics are measured inside the inference engine and are better suited to debugging. Use both for complete visibility.


Metric Interpretation Guide

Metric Types

Counter (cumulative, monotonically increasing):

  • stats.total = Total change during benchmark
  • stats.rate = Rate of change (per second)
  • Example: vllm:prompt_tokens with stats.rate = prefill throughput
  • AIPerf stores Prometheus counter family names without the trailing _total suffix used by the exposition samples, so an upstream *_total counter usually appears as * in AIPerf exports. For example, vLLM's vllm:prompt_tokens_total appears as vllm:prompt_tokens.
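
The two counter fields can be illustrated with a minimal sketch. It assumes a counter is scraped as (timestamp, value) pairs and that no counter reset occurs during the benchmark; the function name and sample layout are hypothetical and do not describe AIPerf's internal implementation.

```python
from typing import Sequence, Tuple


def counter_stats(samples: Sequence[Tuple[float, float]]) -> dict:
    """Return total change and per-second rate for (timestamp_s, value) samples.

    Assumes a monotonically increasing counter with no resets.
    """
    if len(samples) < 2:
        return {"total": 0.0, "rate": 0.0}
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    total = v1 - v0  # change during the benchmark
    duration = t1 - t0
    rate = total / duration if duration > 0 else 0.0
    return {"total": total, "rate": rate}


# Illustrative scrapes of a token counter at 0 s, 5 s, and 10 s.
print(counter_stats([(0.0, 1000.0), (5.0, 6000.0), (10.0, 13000.0)]))
# {'total': 12000.0, 'rate': 1200.0}
```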

Gauge (point-in-time snapshot):

  • stats.avg = Typical value
  • stats.max = Peak value
  • stats.min = Minimum value
  • stats.p50, stats.p90, stats.p99 = Percentile values
  • Example: vllm:num_requests_waiting with stats.max = worst-case queue depth

Histogram (distribution):

  • stats.total = Total count of observations
  • stats.sum = Sum of all observed values
  • stats.avg = Mean (sum/count)
  • stats.p50_estimate, stats.p90_estimate, stats.p95_estimate, stats.p99_estimate = Estimated percentiles from buckets
  • Example: vllm:e2e_request_latency_seconds with stats.p99_estimate = tail latency

Info (static labels):

  • Only stats.avg is meaningful (value is typically 1.0)
  • Labels contain the actual configuration data
  • Example: vllm:cache_config_info exposes cache settings as labels

Understanding Percentiles

Histogram percentiles are estimated from bucket boundaries, not computed from exact values, so their accuracy depends on bucket granularity. See the "Histogram buckets" lists under each metric group for the bucket definitions.
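
To show why bucket granularity matters, here is a sketch of percentile estimation from cumulative bucket counts. It uses linear interpolation within the target bucket, the approach taken by Prometheus' histogram_quantile function; whether AIPerf's *_estimate fields use exactly this method is an assumption. The function name and the sample buckets are hypothetical.

```python
import math
from typing import Sequence, Tuple


def estimate_percentile(buckets: Sequence[Tuple[float, int]], q: float) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative cumulative counts for a latency histogram (seconds).
buckets = [(0.5, 50), (1.0, 90), (2.0, 99), (math.inf, 100)]
print(estimate_percentile(buckets, 0.75))  # 0.8125
```

Any observation between 0.5 s and 1.0 s lands in the same bucket, so the estimate cannot distinguish a cluster at 0.51 s from one at 0.99 s. Wider buckets mean larger possible error.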

Multiple Endpoints

When scraping multiple server instances, each series includes an endpoint_url label to identify the source.
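
As a sketch of how the endpoint_url label can be used, the snippet below groups series by source. The record layout (a dict with "metric", "labels", and "stats" keys) and the URLs are hypothetical placeholders for illustration; consult the export schema pages for the real structure.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_endpoint(series: List[dict]) -> Dict[str, List[dict]]:
    """Group series records by their endpoint_url label."""
    grouped = defaultdict(list)
    for entry in series:
        url = entry.get("labels", {}).get("endpoint_url", "unknown")
        grouped[url].append(entry)
    return dict(grouped)


# Hypothetical records for two scraped servers.
series = [
    {"metric": "vllm:num_requests_waiting",
     "labels": {"endpoint_url": "http://host-a:8000/metrics"},
     "stats": {"max": 4}},
    {"metric": "vllm:num_requests_waiting",
     "labels": {"endpoint_url": "http://host-b:8000/metrics"},
     "stats": {"max": 11}},
]
for url, entries in group_by_endpoint(series).items():
    print(url, max(e["stats"]["max"] for e in entries))
```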


Detailed Metric Definitions

Dynamo Frontend

The Dynamo frontend is the HTTP entry point that receives client requests and routes them to backend workers. These metrics provide user-facing visibility into request processing.

Request Flow

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_frontend_requests_started | counter | requests | endpoint, model, request_type | Requests accepted by the frontend handler. |
| dynamo_frontend_requests | counter | requests | endpoint, error_type, model, request_type, status | Completed LLM requests. Use stats.total for count during benchmark, stats.rate for throughput (req/s). |
| dynamo_frontend_active_requests | gauge | requests | model | Requests currently being handled by the frontend, from HTTP handler entry to response completion. |
| dynamo_frontend_inflight_requests | gauge | requests | model | Engine-bound requests currently being processed. |
| dynamo_frontend_queued_requests | gauge | requests | model | HTTP-processing queue: requests from handler start until first token generation. |
| dynamo_frontend_disconnected_clients | gauge | clients | — | Client connections that disconnected. |

Label values:

  • endpoint: completions, chat_completions, embeddings, images, videos, audios, responses, anthropic_messages, tensor
  • request_type: stream, unary
  • status: success, error
  • error_type: empty string for success, or validation, not_found, overload, cancelled, response_timeout, internal, not_implemented

Latency

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_frontend_request_duration_seconds | histogram | seconds | model | End-to-end request latency from HTTP handler entry to response completion. Key metric for SLA compliance. Use stats.p99_estimate for tail latency. |
| dynamo_frontend_time_to_first_token_seconds | histogram | seconds | model | Time to first token (TTFT) - latency until first token is generated. Critical for perceived responsiveness. |
| dynamo_frontend_inter_token_latency_seconds | histogram | seconds | model | Inter-token latency (ITL) - time between consecutive tokens. Lower is better for streaming UX. |

Histogram buckets:

  • dynamo_frontend_request_duration_seconds: 0.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 130.0, 260.0, 510.0, +Inf
  • dynamo_frontend_time_to_first_token_seconds: 0.0, 0.0022, 0.0047, 0.01, 0.022, 0.047, 0.1, 0.22, 0.47, 1.0, 2.2, 4.7, 10.0, 22.0, 48.0, 100.0, 220.0, 480.0, +Inf
  • dynamo_frontend_inter_token_latency_seconds: 0.0, 0.0019, 0.0035, 0.0067, 0.013, 0.024, 0.045, 0.084, 0.16, 0.3, 0.56, 1.1, 2.0, +Inf

Tokens

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_frontend_output_tokens | counter | tokens | model | Total output tokens generated. stats.rate = output token throughput (tokens/s). |
| dynamo_frontend_cached_tokens | histogram | tokens | model | Cached tokens (prefix cache hits) per request. |
| dynamo_frontend_tokenizer_latency_ms | histogram | milliseconds | operation | Tokenizer latency. operation: tokenize, detokenize. |
| dynamo_frontend_input_sequence_tokens | histogram | tokens | model | Input sequence length distribution. stats.avg = mean prompt length, stats.p99_estimate = longest prompts. |
| dynamo_frontend_output_sequence_tokens | histogram | tokens | model | Output sequence length distribution. stats.avg = mean response length. |

Histogram buckets:

  • dynamo_frontend_cached_tokens: Same as dynamo_frontend_input_sequence_tokens
  • dynamo_frontend_tokenizer_latency_ms: 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, +Inf
  • dynamo_frontend_input_sequence_tokens: 0.0, 100.0, 210.0, 430.0, 870.0, 1800.0, 3600.0, 7400.0, 15000.0, 31000.0, 63000.0, 130000.0, +Inf
  • dynamo_frontend_output_sequence_tokens: 0.0, 100.0, 210.0, 430.0, 880.0, 1800.0, 3700.0, 7600.0, 16000.0, 32000.0, +Inf

Model Configuration (Static Gauges)

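The gauges in this table are constant values that don’t change during the benchmark, so only stats.avg is meaningful for them. The counters at the end of the table (migrations, cancellations, rejections) accumulate during the run; read them with stats.total or stats.rate.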

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dynamo_frontend_model_context_length | gauge | model | Maximum context window size in tokens. |
| dynamo_frontend_model_kv_cache_block_size | gauge | model | KV cache block size in tokens. |
| dynamo_frontend_model_max_num_batched_tokens | gauge | model | Maximum tokens that can be batched together. |
| dynamo_frontend_model_max_num_seqs | gauge | model | Maximum concurrent sequences per worker. |
| dynamo_frontend_model_total_kv_blocks | gauge | model | Total KV cache blocks available per worker. |
| dynamo_frontend_model_migration_limit | gauge | model | Maximum request migrations allowed for the model. |
| dynamo_frontend_model_migration | counter | migration_type, model | Request migrations due to worker unavailability. migration_type: new_request, ongoing_request. |
| dynamo_frontend_model_migration_max_seq_len_exceeded | counter | model | Migrations disabled because the sequence length exceeded the configured limit. |
| dynamo_frontend_model_cancellation | counter | endpoint, model, request_type | Request cancellations. |
| dynamo_frontend_model_rejection | counter | endpoint, model | Requests rejected due to resource exhaustion. |

Frontend Pipeline, Routing, and Worker Load

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_frontend_stage_requests | gauge | requests | phase, stage | Requests currently in a frontend pipeline stage. stage: preprocess, route, dispatch; phase: empty string, prefill, decode, or aggregated. |
| dynamo_frontend_stage_duration_seconds | histogram | seconds | stage | Pipeline stage duration. |
| dynamo_frontend_tokenize_seconds | histogram | seconds | — | Tokenization time in the preprocessor. |
| dynamo_frontend_template_seconds | histogram | seconds | — | Chat-template application time in the preprocessor. |
| dynamo_frontend_detokenize_total_us | counter | microseconds | — | Cumulative detokenization time. |
| dynamo_frontend_detokenize_token_count | counter | tokens | — | Tokens detokenized. |
| dynamo_frontend_worker_active_decode_blocks | gauge | blocks | dp_rank, worker_id, worker_type | Active KV-cache decode blocks per worker. |
| dynamo_frontend_worker_active_prefill_tokens | gauge | tokens | dp_rank, worker_id, worker_type | Active prefill tokens queued per worker. |
| dynamo_frontend_worker_last_time_to_first_token_seconds | gauge | seconds | dp_rank, worker_id, worker_type | Last observed TTFT for a worker. |
| dynamo_frontend_worker_last_input_sequence_tokens | gauge | tokens | dp_rank, worker_id, worker_type | Input-token count from the same request as the last observed worker TTFT. |
| dynamo_frontend_worker_last_inter_token_latency_seconds | gauge | seconds | dp_rank, worker_id, worker_type | Last observed ITL for a worker. |
| dynamo_frontend_router_queue_pending_requests | gauge | requests | worker_type | Requests pending in the router scheduler queue. |
| dynamo_frontend_router_queue_pending_isl_tokens | gauge | tokens | worker_type | Sum of input-sequence tokens for pending router scheduler requests. |

Histogram buckets:

  • dynamo_frontend_stage_duration_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, +Inf
  • dynamo_frontend_tokenize_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
  • dynamo_frontend_template_seconds: 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, +Inf
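
The two detokenization counters in this group are most useful together: dividing cumulative time by token count gives a mean cost per token. The sketch below assumes both values are the stats.total of their counters over the same benchmark window; the function name and the numbers are hypothetical.

```python
def detokenize_us_per_token(detokenize_total_us: float, detokenize_token_count: float) -> float:
    """Mean detokenization time per token, in microseconds."""
    if detokenize_token_count <= 0:
        return 0.0  # nothing was detokenized in the window
    return detokenize_total_us / detokenize_token_count


# Illustrative values: 1.5 s of cumulative detokenization over 500,000 tokens.
print(detokenize_us_per_token(1_500_000, 500_000))  # 3.0
```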

Tokio Runtime and Event Loop Metrics

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_tokio_global_queue_depth | gauge | tasks | — | Tokio runtime global queue depth. |
| dynamo_tokio_budget_forced_yield | counter | yields | — | Tasks forced to yield after exhausting Tokio’s cooperative budget. |
| dynamo_tokio_blocking_threads | gauge | threads | — | Threads in Tokio’s blocking pool. |
| dynamo_tokio_blocking_idle_threads | gauge | threads | — | Idle threads in Tokio’s blocking pool. |
| dynamo_tokio_blocking_queue_depth | gauge | tasks | — | Blocking-pool queue depth. |
| dynamo_tokio_alive_tasks | gauge | tasks | — | Alive Tokio tasks. |
| dynamo_tokio_worker_mean_poll_time_ns | gauge | nanoseconds | worker | Worker mean poll time. |
| dynamo_tokio_worker_busy_ratio | gauge | ratio | worker | Worker busy ratio. |
| dynamo_tokio_worker_park_count | counter | parks | worker | Worker park count. |
| dynamo_tokio_worker_local_queue_depth | gauge | tasks | worker | Worker local queue depth. |
| dynamo_tokio_worker_steal_count | counter | steals | worker | Worker steal count. |
| dynamo_tokio_worker_overflow_count | counter | overflows | worker | Worker local-queue overflow count. |
| dynamo_frontend_event_loop_delay_seconds | histogram | seconds | — | Event-loop delay canary. |
| dynamo_frontend_event_loop_stall | counter | stalls | — | Event-loop stalls over the configured threshold. |

Router Request and Overhead Metrics

Router request metrics are component-scoped and therefore also carry dynamo_namespace, dynamo_component, optional dynamo_endpoint, worker_id, and router_id labels.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_router_requests | counter | requests | hierarchy labels + router_id | Requests processed by the router. |
| dynamo_component_router_time_to_first_token_seconds | histogram | seconds | hierarchy labels + router_id | Time to first token observed at the router. |
| dynamo_component_router_inter_token_latency_seconds | histogram | seconds | hierarchy labels + router_id | Average inter-token latency observed at the router. |
| dynamo_component_router_input_sequence_tokens | histogram | tokens | hierarchy labels + router_id | Input sequence length observed at the router. |
| dynamo_component_router_output_sequence_tokens | histogram | tokens | hierarchy labels + router_id | Output sequence length observed at the router. |
| dynamo_component_router_kv_hit_rate | histogram | ratio | hierarchy labels + router_id | Predicted KV cache hit rate at routing time. |
| dynamo_component_router_kv_transfer_estimated_latency_seconds | histogram | seconds | hierarchy labels + router_id | Upper-bound estimate of KV transfer latency in disaggregated serving. |
| dynamo_component_router_shared_cache_hit_rate | histogram | ratio | hierarchy labels + router_id | Fraction of request blocks found in shared KV cache. |
| dynamo_component_router_shared_cache_beyond_blocks | histogram | blocks | hierarchy labels + router_id | Shared cache blocks beyond device overlap for the selected worker. |
| dynamo_component_router_remote_indexer_query_failures | counter | errors | hierarchy labels + router_id | Remote indexer overlap queries that failed. |
| dynamo_component_router_remote_indexer_write_failures | counter | errors | hierarchy labels + router_id | Remote indexer routing-decision writes that failed. |
| dynamo_router_overhead_block_hashing_ms | histogram | milliseconds | router_id | Time spent computing block hashes. |
| dynamo_router_overhead_indexer_find_matches_ms | histogram | milliseconds | router_id | Time spent in indexer find_matches. |
| dynamo_router_overhead_seq_hashing_ms | histogram | milliseconds | router_id | Time spent computing sequence hashes. |
| dynamo_router_overhead_scheduling_ms | histogram | milliseconds | router_id | Time spent in scheduler worker selection. |
| dynamo_router_overhead_total_ms | histogram | milliseconds | router_id | Total routing overhead per request. |
| dynamo_router_overhead_shared_cache_query_ms | histogram | milliseconds | router_id | Time spent querying shared KV cache. |
| dynamo_router_shared_cache_errors | counter | errors | router_id | Shared cache query errors. |

Histogram buckets:

  • dynamo_component_router_time_to_first_token_seconds: Same as dynamo_frontend_time_to_first_token_seconds
  • dynamo_component_router_inter_token_latency_seconds: Same as dynamo_frontend_inter_token_latency_seconds
  • dynamo_component_router_input_sequence_tokens: Same as dynamo_frontend_input_sequence_tokens
  • dynamo_component_router_output_sequence_tokens: Same as dynamo_frontend_output_sequence_tokens
  • dynamo_component_router_kv_hit_rate: 0.0, 0.05, 0.1, ... 1.0, +Inf
  • dynamo_component_router_kv_transfer_estimated_latency_seconds: 0.0, 0.0019, 0.0037, 0.0072, 0.014, 0.027, 0.052, 0.1, 0.19, 0.37, 0.72, 1.4, 2.7, 5.2, 10.0, +Inf
  • dynamo_component_router_shared_cache_hit_rate: 0.0, 0.05, 0.1, ... 1.0, +Inf
  • dynamo_component_router_shared_cache_beyond_blocks: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, +Inf
  • dynamo_router_overhead_block_hashing_ms: exponential 0.001 * 2^n, 15 buckets
  • dynamo_router_overhead_indexer_find_matches_ms: exponential 0.01 * 3^n, 17 buckets
  • dynamo_router_overhead_seq_hashing_ms: exponential 0.001 * 2^n, 15 buckets
  • dynamo_router_overhead_scheduling_ms: exponential 0.01 * 3^n, 17 buckets
  • dynamo_router_overhead_total_ms: exponential 0.01 * 3^n, 17 buckets
  • dynamo_router_overhead_shared_cache_query_ms: exponential 0.01 * 3^n, 17 buckets
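
The exponential bucket notation above can be expanded into explicit boundaries. This sketch assumes n runs from 0 to count - 1 and that the stated bucket count covers only the finite bounds, with +Inf implied on top; neither detail is confirmed by this reference. The function name is hypothetical.

```python
from typing import List


def exponential_buckets(start: float, factor: float, count: int) -> List[float]:
    """Return `count` upper bounds: start * factor**n for n = 0 .. count - 1."""
    return [start * factor**n for n in range(count)]


# "exponential 0.001 * 2^n, 15 buckets" (milliseconds)
hashing = exponential_buckets(0.001, 2, 15)
print([round(b, 3) for b in hashing[:4]], "...", round(hashing[-1], 3))

# "exponential 0.01 * 3^n, 17 buckets" (milliseconds)
scheduling = exponential_buckets(0.01, 3, 17)
print(round(scheduling[-1], 2))
```

Under these assumptions the hashing histograms top out near 16.4 ms, while the factor-3 histograms extend to several hundred seconds, so the latter trade resolution for range.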

KV Publisher Metrics

These component-scoped metrics track Dynamo’s KV-event publisher and relay path.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_kv_publisher_engines_dropped_events | counter | events | hierarchy labels | Raw KV events dropped by engines before reaching the publisher, detected through event ID gaps. |
| dynamo_component_kv_publisher_zmq_events | counter | events | hierarchy labels + stage, event_type | ZMQ KV events seen by the relay. |
| dynamo_component_kv_publisher_zmq_filtered_events | counter | events | hierarchy labels + event_type, reason | ZMQ KV events filtered before conversion. |
| dynamo_component_kv_publisher_zmq_conversion_issues | counter | events | hierarchy labels + event_type, reason | ZMQ KV events dropped due to conversion issues. |
| dynamo_component_kv_publisher_zmq_suspicious_events | counter | events | hierarchy labels + event_type, reason | Suspicious ZMQ KV events that were forwarded. |

Dynamo Component

Dynamo component metrics come from worker, router, and backend processes. Metrics created through Dynamo’s hierarchy usually carry dynamo_namespace, dynamo_component, optional dynamo_endpoint, and worker_id labels; endpoint handlers may also add engine labels such as model.

Work Handler Request Processing

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_requests | counter | requests | hierarchy labels plus engine labels | Requests processed by the work handler. Compare across workers to check load balancing. |
| dynamo_component_inflight_requests | gauge | requests | hierarchy labels plus engine labels | Requests currently being processed by the work handler. |
| dynamo_component_errors | counter | errors | hierarchy labels plus engine labels, error_type | Work-handler errors. error_type: deserialization, invalid_message, response_stream, generate, publish_response, publish_final. |
| dynamo_component_cancellation | counter | requests | hierarchy labels plus engine labels | Requests cancelled by the work handler. |
| dynamo_component_request_duration_seconds | histogram | seconds | hierarchy labels plus engine labels | Worker-level request processing time. Compare to frontend duration to measure routing overhead. |

Histogram buckets:

  • dynamo_component_request_duration_seconds: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, 300.0, 600.0, +Inf
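
The comparison between frontend and worker duration can be sketched as a simple difference of averages. This assumes both values are the stats.avg of dynamo_frontend_request_duration_seconds and dynamo_component_request_duration_seconds from the same run; the function name and the sample values are hypothetical.

```python
from typing import Tuple


def routing_overhead(frontend_avg_s: float, component_avg_s: float) -> Tuple[float, float]:
    """Return (overhead in seconds, overhead as a fraction of frontend latency)."""
    overhead = max(0.0, frontend_avg_s - component_avg_s)
    fraction = overhead / frontend_avg_s if frontend_avg_s > 0 else 0.0
    return overhead, fraction


# Illustrative averages: 1.25 s at the frontend, 1.0 s at the worker.
print(routing_overhead(1.25, 1.0))  # (0.25, 0.2)
```

Because both inputs are averages over histograms with different bucket layouts, treat the result as a rough indicator rather than a per-request measurement.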

Work Handler Data Transfer, Queue, and Pool Saturation

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_request_bytes | counter | bytes | hierarchy labels plus engine labels | Total bytes received in requests. stats.rate = inbound bandwidth. |
| dynamo_component_response_bytes | counter | bytes | hierarchy labels plus engine labels | Total bytes sent in responses. stats.rate = outbound bandwidth. |
| dynamo_work_handler_network_transit_seconds | histogram | seconds | — | Frontend-to-backend network transit time. |
| dynamo_work_handler_time_to_first_response_seconds | histogram | seconds | — | Backend processing time from payload handling to first response. |
| dynamo_work_handler_queue_depth | gauge | requests | — | Items in the bounded work queue awaiting dispatcher pickup. |
| dynamo_work_handler_queue_capacity | gauge | requests | — | Configured capacity of the bounded work queue. |
| dynamo_work_handler_enqueue_rejected | counter | requests | — | Times enqueuing failed because the dispatcher channel was closed. |
| dynamo_work_handler_permit_wait_seconds | histogram | seconds | — | Time spent waiting for a worker-pool permit. |
| dynamo_work_handler_pool_active_tasks | gauge | tasks | — | Active worker-pool tasks holding permits. |
| dynamo_work_handler_pool_capacity | gauge | tasks | — | Configured worker-pool capacity. |

Histogram buckets:

  • dynamo_work_handler_network_transit_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
  • dynamo_work_handler_time_to_first_response_seconds: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, +Inf
  • dynamo_work_handler_permit_wait_seconds: 0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, +Inf
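
The depth/capacity and active/capacity gauge pairs in this group can be combined into saturation ratios. A minimal sketch, assuming you take stats.max of the depth or active-task gauge and stats.avg of the matching capacity gauge; the function name and the values are hypothetical.

```python
def saturation(current: float, capacity: float) -> float:
    """Fraction of capacity in use, or 0.0 when capacity is unknown or zero."""
    return current / capacity if capacity > 0 else 0.0


# Illustrative values, not measured data.
queue_saturation = saturation(current=48, capacity=64)  # queue_depth / queue_capacity
pool_saturation = saturation(current=32, capacity=32)   # pool_active_tasks / pool_capacity
print(queue_saturation, pool_saturation)  # 0.75 1.0
```

A pool ratio pinned at 1.0 together with rising dynamo_work_handler_permit_wait_seconds suggests the worker pool is the bottleneck.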

Backend KV Cache and Model Info

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_component_total_blocks | gauge | blocks | dynamo_component, dp_rank, model | Total KV cache blocks available on a worker. |
| dynamo_component_gpu_cache_usage_percent | gauge | ratio | dynamo_component, dp_rank, model | GPU cache utilization (0.0-1.0). High values (>0.9) indicate capacity pressure. |
| dynamo_component_model_load_time_seconds | gauge | seconds | dynamo_component, model | Model load time. |
| dynamo_component_embedding_cache_hits | counter | hits | dynamo_component, model | Multimodal embedding-cache hits. |
| dynamo_component_embedding_cache_misses | counter | misses | dynamo_component, model | Multimodal embedding-cache misses. |
| dynamo_component_embedding_cache_evictions | counter | evictions | dynamo_component, model | Multimodal embedding-cache evictions. |
| dynamo_component_embedding_cache_utilization | gauge | ratio | dynamo_component, model | Multimodal embedding-cache memory utilization (0.0-1.0). |
| dynamo_component_embedding_cache_current_bytes | gauge | bytes | dynamo_component, model | Current multimodal embedding-cache memory usage. |
| dynamo_component_embedding_cache_entries | gauge | entries | dynamo_component, model | Current number of multimodal embedding-cache entries. |

Transport and NATS Messaging

Dynamo’s current in-code NATS metric is a transport error counter. Older dynamo_component_nats_client_* and dynamo_component_nats_service_* families were not verified in current upstream code and are not documented as current.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| dynamo_transport_nats_errors | counter | errors | error_type | NATS request errors. Current error_type value: request_failed. |
| dynamo_transport_tcp_bytes_sent | counter | bytes | — | Bytes sent by the TCP request client. |
| dynamo_transport_tcp_bytes_received | counter | bytes | — | Bytes received by the TCP request client. |
| dynamo_transport_tcp_errors | counter | errors | — | TCP request send failures or timeouts. |
| dynamo_request_plane_queue_seconds | histogram | seconds | — | Time from generate() entry to send_request(). |
| dynamo_request_plane_send_seconds | histogram | seconds | — | Time for send_request() to complete. |
| dynamo_request_plane_roundtrip_ttft_seconds | histogram | seconds | — | Time from send_request() to first response item. |
| dynamo_request_plane_inflight_requests | gauge | requests | — | Currently in-flight requests at AddressedPushRouter. |

Histogram buckets:

  • dynamo_request_plane_queue_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
  • dynamo_request_plane_send_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf
  • dynamo_request_plane_roundtrip_ttft_seconds: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, +Inf

vLLM

vLLM is a high-performance inference engine. These metrics provide deep visibility into model execution, cache usage, and request processing phases. Current vLLM v1 Prometheus metrics use model_name and engine labels unless noted otherwise.

Cache & Memory

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:kv_cache_usage_perc | gauge | ratio | model_name, engine | KV cache utilization (0.0-1.0). Key capacity indicator. Values near 1.0 cause performance degradation. Monitor stats.max. |
| vllm:prefix_cache_hits | counter | tokens | model_name, engine | Prefix cache hits, in terms of number of cached tokens. |
| vllm:prefix_cache_queries | counter | tokens | model_name, engine | Prefix cache queries, in terms of number of queried tokens. hits/queries = hit rate. |
| vllm:external_prefix_cache_hits | counter | tokens | model_name, engine | External prefix cache hits from KV connector cross-instance cache sharing, in terms of number of cached tokens. |
| vllm:external_prefix_cache_queries | counter | tokens | model_name, engine | External prefix cache queries from KV connector cross-instance cache sharing, in terms of number of queried tokens. |
| vllm:prompt_tokens_cached | counter | tokens | model_name, engine | Cached prompt tokens (local + external). |
| vllm:mm_cache_hits | counter | items | model_name, engine | Multi-modal cache hits, in terms of number of cached items. |
| vllm:mm_cache_queries | counter | items | model_name, engine | Multi-modal cache queries, in terms of number of queried items. |
| vllm:num_preemptions | counter | preemptions | model_name, engine | Cumulative number of preemptions from the engine. Non-zero indicates capacity pressure. |
| vllm:corrupted_requests | counter | requests | model_name, engine | Requests with NaNs in logits. Only emitted when VLLM_COMPUTE_NANS_IN_LOGITS is enabled. |
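
The hits/queries relationship noted for the prefix cache can be computed from the stats.total of the two counters. A minimal sketch; the function name and the numbers are hypothetical, and the same ratio applies to the external and multi-modal hit/query pairs.

```python
def cache_hit_rate(hits_total: float, queries_total: float) -> float:
    """Hit rate over the benchmark window: hits / queries, or 0.0 with no queries."""
    if queries_total <= 0:
        return 0.0
    return hits_total / queries_total


# Illustrative values: 3,000 cached tokens out of 12,000 queried tokens.
print(cache_hit_rate(3000, 12000))  # 0.25
```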

Queue & Engine State

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:num_requests_running | gauge | requests | model_name, engine | Requests currently in model execution batches. Indicates batch size. |
| vllm:num_requests_waiting | gauge | requests | model_name, engine | Requests queued waiting for execution. High values indicate saturation. |
| vllm:num_requests_waiting_by_reason | gauge | requests | model_name, engine, reason | Waiting requests split by reason. capacity means waiting for scheduling capacity; deferred means deferred by transient constraints such as LoRA budget, KV transfer, or blocked status. |
| vllm:engine_sleep_state | gauge | — | model_name, engine, sleep_state | Engine sleep state. sleep_state values are awake, weights_offloaded, and discard_all; the active state is reported as 1. |

Token Throughput

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:prompt_tokens | counter | tokens | model_name, engine | Number of prefill tokens processed. stats.rate = prefill throughput. |
| vllm:prompt_tokens_by_source | counter | tokens | model_name, engine, source | Number of prompt tokens by source. source values are local_compute, local_cache_hit, and external_kv_transfer. |
| vllm:generation_tokens | counter | tokens | model_name, engine | Number of generation tokens processed. stats.rate = decode throughput. |
| vllm:request_success | counter | requests | model_name, engine, finished_reason | Successfully completed requests. |

Common finished_reason values: stop, length, abort, error, repetition
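
The stats.rate notes on the token counters above come down to dividing a counter's increase by the elapsed time. The following is a minimal sketch of that arithmetic, not AIPerf's implementation; the function name and the two scrape values are hypothetical.

```python
def counter_rate(first_value, last_value, first_ts, last_ts):
    """Average per-second increase of a monotonically increasing counter."""
    elapsed = last_ts - first_ts
    if elapsed <= 0:
        raise ValueError("timestamps must be strictly increasing")
    delta = last_value - first_value
    if delta < 0:
        # A decrease usually means the server restarted and the counter reset.
        raise ValueError("counter decreased; treat the window as a reset")
    return delta / elapsed


# Hypothetical scrapes of vllm:generation_tokens taken 10 seconds apart.
decode_tokens_per_second = counter_rate(12_000, 19_500, 100.0, 110.0)
print(decode_tokens_per_second)
```

Applying the same function to vllm:prompt_tokens would give prefill throughput in tokens per second.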

Request-Level Latency Breakdown

These histograms show where time is spent for each request. Together they decompose the end-to-end latency.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:e2e_request_latency_seconds | histogram | seconds | model_name, engine | Histogram of e2e request latency in seconds. |
| vllm:request_queue_time_seconds | histogram | seconds | model_name, engine | Histogram of time spent in WAITING phase for request. |
| vllm:request_prefill_time_seconds | histogram | seconds | model_name, engine | Histogram of time spent in PREFILL phase for request. |
| vllm:request_decode_time_seconds | histogram | seconds | model_name, engine | Histogram of time spent in DECODE phase for request. |
| vllm:request_inference_time_seconds | histogram | seconds | model_name, engine | Histogram of time spent in RUNNING phase for request. |

Histogram buckets:

  • vllm:e2e_request_latency_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • vllm:request_queue_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • vllm:request_prefill_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • vllm:request_decode_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • vllm:request_inference_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
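
Because these histograms expose only cumulative bucket counts, any percentile derived from them is an estimate bounded by the bucket edges listed above. The sketch below shows one common way to estimate a quantile: linear interpolation inside the bucket that contains the target rank, the general approach used by PromQL's histogram_quantile. The function name and the sample counts are hypothetical, and this is not presented as AIPerf's own algorithm.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Interpolate linearly between the bucket's lower and upper edge.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for the first e2e latency buckets.
e2e_buckets = [(0.3, 10), (0.5, 40), (0.8, 80), (1.0, 95), (1.5, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, e2e_buckets)
p99 = estimate_quantile(0.99, e2e_buckets)
print(p50, p99)
```

With a lowest bucket edge of 0.3 seconds, any request faster than that is interpolated between 0 and 0.3, so low percentiles of short requests are coarse.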

Token-Level Latency

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:time_to_first_token_seconds | histogram | seconds | model_name, engine | TTFT - histogram of time to first token in seconds. |
| vllm:inter_token_latency_seconds | histogram | seconds | model_name, engine | ITL - histogram of inter-token latency in seconds. |
| vllm:request_time_per_output_token_seconds | histogram | seconds | model_name, engine | Histogram of time_per_output_token_seconds per request. |

Histogram buckets:

  • vllm:time_to_first_token_seconds: 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
  • vllm:inter_token_latency_seconds: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf
  • vllm:request_time_per_output_token_seconds: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf

Request Parameters

These histograms show the distribution of request parameters processed by vLLM.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:request_prompt_tokens | histogram | tokens | model_name, engine | Number of prefill tokens processed per request. Bucket maximum is derived from the configured model length. |
| vllm:request_generation_tokens | histogram | tokens | model_name, engine | Number of generation tokens processed per request. Bucket maximum is derived from the configured model length. |
| vllm:request_max_num_generation_tokens | histogram | tokens | model_name, engine | Histogram of maximum number of requested generation tokens. |
| vllm:request_params_max_tokens | histogram | tokens | model_name, engine | Histogram of the max_tokens request parameter. |
| vllm:request_params_n | histogram | — | model_name, engine | Histogram of the n request parameter. |
| vllm:iteration_tokens_total | histogram | tokens | model_name, engine | Histogram of number of tokens per engine step. |
| vllm:request_prefill_kv_computed_tokens | histogram | tokens | model_name, engine | Histogram of new KV tokens computed during prefill, excluding cached tokens. |

Histogram buckets:

  • vllm:request_prompt_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
  • vllm:request_generation_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
  • vllm:request_max_num_generation_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
  • vllm:request_params_max_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf
  • vllm:request_params_n: 1, 2, 5, 10, 20, +Inf
  • vllm:iteration_tokens_total: 1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, +Inf
  • vllm:request_prefill_kv_computed_tokens: 1, 2, 5, 10, 20, 50, ... up to max_model_len, +Inf

Speculative Decoding

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:spec_decode_num_drafts | counter | drafts | model_name, engine | Number of spec decoding drafts. |
| vllm:spec_decode_num_draft_tokens | counter | tokens | model_name, engine | Number of draft tokens. |
| vllm:spec_decode_num_accepted_tokens | counter | tokens | model_name, engine | Number of accepted tokens. |
| vllm:spec_decode_num_accepted_tokens_per_pos | counter | tokens | model_name, engine, position | Accepted tokens per draft position. |
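
The four counters above are usually combined into ratios rather than read on their own. The sketch below shows conventional derivations (acceptance rate, mean acceptance length, and per-position acceptance). These formulas are assumptions about how the counters relate, not definitions taken from this page, and all names and values are hypothetical.

```python
def spec_decode_summary(num_drafts, num_draft_tokens, num_accepted_tokens, accepted_per_pos):
    """Derive speculative-decoding ratios from counter deltas over one window."""
    # Share of proposed draft tokens that the target model accepted.
    acceptance_rate = num_accepted_tokens / num_draft_tokens if num_draft_tokens else 0.0
    # Assumption: each draft step also yields one token from the target model,
    # so the mean tokens produced per step is 1 + accepted tokens per draft.
    mean_acceptance_length = 1 + num_accepted_tokens / num_drafts if num_drafts else 0.0
    # Fraction of drafts whose token at a given position was accepted.
    per_position = (
        {pos: count / num_drafts for pos, count in sorted(accepted_per_pos.items())}
        if num_drafts
        else {}
    )
    return {
        "acceptance_rate": acceptance_rate,
        "mean_acceptance_length": mean_acceptance_length,
        "per_position": per_position,
    }


# Hypothetical counter deltas: 1000 drafts of 3 tokens each.
summary = spec_decode_summary(1000, 3000, 1800, {0: 900, 1: 600, 2: 300})
print(summary)
```

A per-position rate that drops sharply at later positions suggests that a shorter draft length may waste less work.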

Optional KV and Performance Metrics

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| vllm:kv_block_lifetime_seconds | histogram | seconds | model_name, engine | KV cache block lifetime from allocation to eviction. Only emitted when KV cache metrics are enabled. |
| vllm:kv_block_idle_before_evict_seconds | histogram | seconds | model_name, engine | Idle time before KV cache block eviction. Only emitted when KV cache metrics are enabled. |
| vllm:kv_block_reuse_gap_seconds | histogram | seconds | model_name, engine | Time gaps between consecutive KV cache block accesses. Only emitted when KV cache metrics are enabled. |
| vllm:kv_offload_size | histogram | bytes | model_name, engine, transfer_type | KV offload transfer size, in bytes. |
| vllm:kv_offload_total_bytes | counter | bytes | model_name, engine, transfer_type | Number of bytes offloaded by KV connector. |
| vllm:kv_offload_total_time | counter | seconds | model_name, engine, transfer_type | Total time measured by all KV offloading operations. |
| vllm:estimated_flops_per_gpu_total | counter | operations | model_name, engine | Estimated number of floating point operations per GPU for Model Flops Utilization calculations. Available via --enable-mfu-metrics. |
| vllm:estimated_read_bytes_per_gpu_total | counter | bytes | model_name, engine | Estimated number of bytes read from memory per GPU for Model Flops Utilization calculations. Available via --enable-mfu-metrics. |
| vllm:estimated_write_bytes_per_gpu_total | counter | bytes | model_name, engine | Estimated number of bytes written to memory per GPU for Model Flops Utilization calculations. Available via --enable-mfu-metrics. |

Histogram buckets:

  • vllm:kv_block_lifetime_seconds: 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, 300, 600, 1200, 1800, +Inf
  • vllm:kv_block_idle_before_evict_seconds: same as above
  • vllm:kv_block_reuse_gap_seconds: same as above
  • vllm:kv_offload_size: 1000000, 5000000, 10000000, 20000000, 40000000, 60000000, 80000000, 100000000, 150000000, 200000000, +Inf
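
The estimated FLOPs and byte counters in this section are meant to feed utilization calculations. As a rough sketch, utilization is the achieved per-second rate of a counter divided by the hardware peak. The peak values below are placeholders rather than figures for any specific GPU, and the function name is hypothetical.

```python
def utilization(counter_start, counter_end, seconds, peak_per_second):
    """Achieved rate of a cumulative counter as a fraction of a hardware peak."""
    if seconds <= 0 or peak_per_second <= 0:
        raise ValueError("seconds and peak_per_second must be positive")
    achieved_per_second = (counter_end - counter_start) / seconds
    return achieved_per_second / peak_per_second


# Hypothetical deltas of vllm:estimated_flops_per_gpu_total over a 10 s window,
# compared against a placeholder peak of 1e15 FLOP/s.
mfu = utilization(2.0e15, 6.0e15, 10.0, 1.0e15)

# The same arithmetic applied to read plus write bytes gives a memory-bandwidth
# utilization estimate; 2e12 bytes/s is again a placeholder peak.
bytes_moved = (3.0e12 - 1.0e12) + (1.5e12 - 0.5e12)
mbu = utilization(0.0, bytes_moved, 10.0, 2.0e12)
print(mfu, mbu)
```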

Configuration Info

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vllm:cache_config_info | gauge | engine, cache config labels such as block_size, cache_dtype, enable_prefix_caching, gpu_memory_utilization, num_gpu_blocks, etc. | Static cache configuration. Exposed as a gauge with value 1.0. |
| vllm:lora_requests_info | gauge | max_lora, waiting_lora_adapters, running_lora_adapters | Running stats on LoRA requests. Only emitted when LoRA is configured. |

Common cache config labels:

  • block_size: KV cache block size in tokens (e.g., 16)
  • cache_dtype: Cache data type (e.g., auto)
  • enable_prefix_caching: Whether prefix caching is enabled (True/False)
  • gpu_memory_utilization: GPU memory utilization target (e.g., 0.9)
  • num_gpu_blocks: Total GPU blocks allocated (e.g., 71671)
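
Because an info-style gauge carries its data in labels rather than in its value, reading it means parsing the label set of a sample. The sketch below parses one Prometheus text-format line with a small regular expression. The sample line is constructed from the example values above and is hypothetical, as are the function name and the derived capacity figure.

```python
import re

# Matches name="value" pairs, allowing backslash-escaped characters in values.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_info_labels(sample_line):
    """Return the labels of one Prometheus text-format sample as a dict."""
    start = sample_line.index("{")
    end = sample_line.rindex("}")
    return dict(LABEL_RE.findall(sample_line[start + 1:end]))


# Hypothetical sample built from the example label values listed above.
line = (
    'vllm:cache_config_info{block_size="16",cache_dtype="auto",'
    'enable_prefix_caching="True",gpu_memory_utilization="0.9",'
    'num_gpu_blocks="71671",engine="0"} 1.0'
)
labels = parse_info_labels(line)

# Assumption: total KV capacity in tokens is block size times block count.
kv_capacity_tokens = int(labels["block_size"]) * int(labels["num_gpu_blocks"])
print(labels["cache_dtype"], kv_capacity_tokens)
```

Label values are always strings, so numeric settings such as gpu_memory_utilization need an explicit conversion before use.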

SGLang

SGLang is a fast inference engine with RadixAttention for efficient prefix caching. These metrics provide visibility into SGLang’s scheduling, execution, token accounting, disaggregated inference, speculative decoding, and optional cache features.

Unless noted otherwise, scheduler metrics use labels model_name, engine_type, tp_rank, pp_rank, and moe_ep_rank. dp_rank is added when data parallel rank is present, priority is added when priority scheduling is enabled, and user-configured extra_metric_labels may add more labels.

Throughput, Tokens & Requests

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:gen_throughput | gauge | tokens/s | scheduler labels | Generation throughput in tokens per second. |
| sglang:realtime_tokens | counter | tokens | scheduler labels + mode | Tokens processed on each log interval. mode: prefill_compute, prefill_cache, decode. |
| sglang:dp_cooperation_realtime_tokens | counter | tokens | scheduler labels + mode, num_prefill_ranks | Token counts with DP cooperation labels. |
| sglang:prompt_tokens | counter | tokens | model_name, engine_type | Number of prefill tokens processed. |
| sglang:generation_tokens | counter | tokens | model_name, engine_type | Number of generation tokens processed. |
| sglang:cached_tokens | counter | tokens | model_name, engine_type, cache_source | Cached prompt tokens split by source. cache_source values include device, host, storage_<backend>, and total. |
| sglang:prompt_tokens_histogram | histogram | tokens | model_name, engine_type | Prompt token length distribution. Buckets can be overridden by server args. |
| sglang:uncached_prompt_tokens_histogram | histogram | tokens | model_name, engine_type | Uncached prompt token length distribution. |
| sglang:generation_tokens_histogram | histogram | tokens | model_name, engine_type | Generation token length distribution. Buckets can be overridden by server args. |
| sglang:num_requests | counter | requests | model_name, engine_type | Number of requests processed. |
| sglang:num_aborted_requests | counter | requests | model_name, engine_type | Number of requests aborted. |
| sglang:num_so_requests | counter | requests | model_name, engine_type | Number of structured-output requests processed. |
| sglang:get_loads_duration_seconds | histogram | seconds | model_name, engine_type | Time spent serving /v1/loads requests. |

Histogram buckets:

  • sglang:prompt_tokens_histogram: 100, 300, 500, 700, 1000, 1500, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 12500, 15000, 17500, 20000, 22500, 25000, 27500, 30000, 35000, 40000, 60000, 80000, 100000, 200000, 300000, 400000, 600000, 800000, 1000000, 1100000, +Inf
  • sglang:uncached_prompt_tokens_histogram: Same as sglang:prompt_tokens_histogram
  • sglang:generation_tokens_histogram: Same as sglang:prompt_tokens_histogram by default
  • sglang:get_loads_duration_seconds: 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf

Queue, Cache & Memory State

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:num_running_reqs | gauge | requests | scheduler labels | Requests currently executing in the batch. With priority scheduling, totals use priority="" and per-priority series use priority="<int>". |
| sglang:num_queue_reqs | gauge | requests | scheduler labels | Requests in the waiting queue. High values indicate saturation. |
| sglang:num_grammar_queue_reqs | gauge | requests | scheduler labels | Requests waiting for grammar processing. |
| sglang:num_used_tokens | gauge | tokens | scheduler labels | Number of used tokens; for hybrid-SWA models this is the max of full-attention and SWA pools, and it does not include the Mamba pool. |
| sglang:decode_sum_seq_lens | gauge | tokens | scheduler labels | Sum of all sequence lengths in decode. |
| sglang:cache_hit_rate | gauge | ratio | scheduler labels | Prefix cache hit rate. Higher = better prompt reuse via RadixAttention. |
| sglang:token_usage | gauge | ratio | scheduler labels | Bottleneck token usage ratio across full, SWA, and Mamba pools. |
| sglang:full_token_usage | gauge | ratio | scheduler labels | Full-attention KV cache pool usage ratio. |
| sglang:swa_token_usage | gauge | ratio | scheduler labels | Sliding-window attention token pool usage ratio. |
| sglang:mamba_usage | gauge | ratio | scheduler labels | Mamba SSM state pool usage ratio. |
| sglang:kv_available_tokens | gauge | tokens | scheduler labels | Free token slots in the KV cache pool. |
| sglang:kv_evictable_tokens | gauge | tokens | scheduler labels | Evictable radix-cached token slots in the KV cache pool. |
| sglang:kv_used_tokens | gauge | tokens | scheduler labels | Actively used token slots in the KV cache pool. |
| sglang:swa_available_tokens | gauge | tokens | scheduler labels | Free token slots in the SWA pool. |
| sglang:swa_evictable_tokens | gauge | tokens | scheduler labels | Evictable radix-cached token slots in the SWA pool. |
| sglang:swa_used_tokens | gauge | tokens | scheduler labels | Actively used token slots in the SWA pool. |
| sglang:mamba_available_tokens | gauge | tokens | scheduler labels | Free state slots in the Mamba SSM pool. |
| sglang:mamba_evictable_tokens | gauge | tokens | scheduler labels | Evictable radix-cached state slots in the Mamba SSM pool. |
| sglang:mamba_used_tokens | gauge | tokens | scheduler labels | Actively used state slots in the Mamba SSM pool. |
| sglang:num_retracted_reqs | gauge | requests | scheduler labels | Current number of retracted requests. |
| sglang:num_retracted_requests | counter | requests | scheduler labels | Total retracted requests. |
| sglang:num_retracted_input_tokens | counter | tokens | scheduler labels | Total retracted input tokens. |
| sglang:num_retracted_output_tokens | counter | tokens | scheduler labels | Total retracted output tokens. |
| sglang:num_paused_reqs | gauge | requests | scheduler labels | Requests paused by async weight sync. |

Request Latency Breakdown

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:time_to_first_token_seconds | histogram | seconds | model_name, engine_type | Time to first token. Buckets can be overridden by server args. |
| sglang:inter_token_latency_seconds | histogram | seconds | model_name, engine_type | Inter-token latency. Buckets can be overridden by server args. |
| sglang:e2e_request_latency_seconds | histogram | seconds | model_name, engine_type | End-to-end request latency. Buckets can be overridden by server args. |
| sglang:queue_time_seconds | histogram | seconds | scheduler labels | Time spent in the waiting queue before execution starts. |
| sglang:per_stage_req_latency_seconds | histogram | seconds | scheduler labels + stage | Per-stage latency breakdown. stage label identifies the phase. |

Histogram buckets:

  • sglang:time_to_first_token_seconds: 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40, 60, 80, 100, 200, 400, +Inf
  • sglang:inter_token_latency_seconds: 0.002, 0.004, 0.006, 0.008, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.060, 0.080, 0.100, 0.200, 0.400, 0.600, 0.800, 1.000, 2.000, 4.000, 6.000, 8.000, +Inf
  • sglang:e2e_request_latency_seconds: 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40, 60, 80, 100, 200, 400, 600, 1200, 1800, 2400, +Inf
  • sglang:queue_time_seconds: 0.0, 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1, 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000, +Inf
  • sglang:per_stage_req_latency_seconds: (see below)

Histogram buckets for sglang:per_stage_req_latency_seconds:

0.001, 0.0016, 0.0026, 0.0043, 0.0069, 0.0112, 0.0181, 0.0293, 0.0474, 0.0768, 0.1245, 0.2017, 0.3267, 0.5293, 0.8575, 1.3891, 2.2503, 3.6455, 5.9057, 9.5672, 15.4989, 25.1082, 40.6753, 65.8939, 106.7481, 172.9319, 280.1497, 453.8426, 735.2250, 1191.0646, +Inf

Observed stage labels for sglang:per_stage_req_latency_seconds:

| Stage | Description |
| --- | --- |
| request_process | Unified-mode request processing before queue entry |
| prefill_bootstrap | Prefill bootstrap queue time in disaggregated prefill mode |
| prefill_forward | Time executing prefill forward pass |
| chunked_prefill | Time executing a chunked-prefill slice |
| prefill_transfer_kv_cache | Time transferring KV cache from prefill to decode worker |
| decode_prepare | Decode preallocation preparation time |
| decode_bootstrap | Decode bootstrap/transfer setup time |
| decode_waiting | Time waiting before decode forward execution |
| decode_transferred | Decode-side transferred request processing before queue entry |
| fake_output | Fake-output/prebuilt decode stage |

Disaggregated Inference Queues and KV Transfer

These metrics apply to disaggregated prefill/decode deployments, where prefill and decode run on separate instances.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:num_prefill_bootstrap_queue_reqs | gauge | requests | scheduler labels | Requests in the prefill bootstrap queue. |
| sglang:num_prefill_inflight_queue_reqs | gauge | requests | scheduler labels | Requests in the prefill inflight queue. |
| sglang:num_decode_prealloc_queue_reqs | gauge | requests | scheduler labels | Requests in the decode preallocation queue. |
| sglang:num_decode_transfer_queue_reqs | gauge | requests | scheduler labels | Requests in the decode transfer queue. |
| sglang:pending_prealloc_token_usage | gauge | ratio | scheduler labels | Token usage for pending preallocated tokens. |
| sglang:kv_transfer_latency_ms | histogram | milliseconds | scheduler labels | KV cache transfer latency. |
| sglang:kv_transfer_speed_gb_s | histogram | GB/s | scheduler labels | KV cache transfer throughput. |
| sglang:kv_transfer_total_mb | histogram | megabytes | scheduler labels | KV cache transfer size. |
| sglang:kv_transfer_alloc_ms | histogram | milliseconds | scheduler labels | Time waiting for KV cache allocation. |
| sglang:kv_transfer_bootstrap_ms | histogram | milliseconds | scheduler labels | KV transfer bootstrap time. |
| sglang:num_bootstrap_failed_reqs | counter | requests | scheduler labels | Number of bootstrap-failed requests. |
| sglang:num_transfer_failed_reqs | counter | requests | scheduler labels | Number of transfer-failed requests. |
| sglang:num_prefill_retries | counter | requests | scheduler labels | Total number of prefill retries. |

Histogram buckets:

  • sglang:kv_transfer_latency_ms: 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, +Inf
  • sglang:kv_transfer_speed_gb_s: 0.1, 0.5, 1, 5, 10, 25, 50, 100, 200, 400, +Inf
  • sglang:kv_transfer_total_mb: 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, +Inf
  • sglang:kv_transfer_alloc_ms: 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, +Inf
  • sglang:kv_transfer_bootstrap_ms: 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, +Inf

Speculative Decoding

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:spec_accept_rate | gauge | ratio | scheduler labels | Speculative acceptance rate (accepted drafts / proposed drafts in batch). |
| sglang:spec_accept_length | gauge | tokens | scheduler labels | Mean acceptance length of speculative decoding (accepted drafts plus bonus token per forward). |
| sglang:spec_verify_calls | counter | calls | model_name, engine_type | Number of speculative decoding verification calls. |

Execution, CUDA Graph, and Estimated Performance

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:utilization | gauge | ratio | scheduler labels | Scheduler utilization. |
| sglang:fwd_occupancy | gauge | percent | scheduler labels | Forward-pass GPU occupancy percentage. |
| sglang:new_token_ratio | gauge | ratio | scheduler labels | New-token ratio from the scheduler policy. |
| sglang:is_cuda_graph | gauge | — | scheduler labels | Whether the batch is using CUDA graph (1=yes, 0=no). |
| sglang:cuda_graph_passes | counter | passes | scheduler labels + mode | Forward passes categorized by graph use. mode: decode_cuda_graph, decode_none, prefill_cuda_graph, prefill_none. |
| sglang:num_unique_running_routing_keys | gauge | keys | scheduler labels | Unique routing keys present in the running batch. |
| sglang:routing_key_running_req_count | histogram | requests | scheduler labels | Distribution of routing keys by running request count. |
| sglang:routing_key_all_req_count | histogram | requests | scheduler labels | Distribution of routing keys by running plus waiting request count. |
| sglang:forward_execution_seconds | counter | seconds | scheduler labels + category | Total GPU-busy time executing model forward passes. |
| sglang:dp_cooperation_forward_execution_seconds | counter | seconds | scheduler labels + category, num_prefill_ranks | Forward execution time with DP cooperation labels. |
| sglang:estimated_flops_per_gpu | counter | FLOPs | scheduler labels | Estimated floating-point operations per GPU; requires --enable-mfu-metrics. |
| sglang:estimated_read_bytes_per_gpu | counter | bytes | scheduler labels | Estimated bytes read from memory per GPU; requires --enable-mfu-metrics. |
| sglang:estimated_write_bytes_per_gpu | counter | bytes | scheduler labels | Estimated bytes written to memory per GPU; requires --enable-mfu-metrics. |

Optional Feature Metrics

These metric families are emitted only when the corresponding feature is enabled.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:lora_pool_slots_used | gauge | slots | scheduler labels | LoRA adapter slots currently occupied in GPU memory. |
| sglang:lora_pool_slots_total | gauge | slots | scheduler labels | Total LoRA adapter slots available. |
| sglang:lora_pool_utilization | gauge | ratio | scheduler labels | LoRA pool utilization ratio. |
| sglang:hicache_host_used_tokens | gauge | tokens | scheduler labels | Tokens currently used in the host KV cache. |
| sglang:hicache_host_total_tokens | gauge | tokens | scheduler labels | Total host KV-cache capacity in tokens. |
| sglang:num_streaming_sessions | gauge | sessions | scheduler labels | Number of streaming sessions. |
| sglang:streaming_session_held_tokens | gauge | tokens | scheduler labels | KV tokens held by streaming session slots. |
| sglang:grammar_compilation_time_seconds | histogram | seconds | scheduler labels | Grammar compilation time for structured-output requests. |
| sglang:num_grammar_cache_hit | counter | requests | scheduler labels | Grammar cache hits. |
| sglang:num_grammar_aborted | counter | requests | scheduler labels | Grammar-aborted requests. |
| sglang:num_grammar_timeout | counter | requests | scheduler labels | Grammar timeouts. |
| sglang:num_grammar_total | counter | requests | scheduler labels | Total grammar requests. |
| sglang:grammar_schema_count | histogram | schemas | scheduler labels | Number of grammar schemas. |
| sglang:grammar_ebnf_size | histogram | bytes | scheduler labels | Grammar EBNF size. |
| sglang:grammar_tree_traversal_time_avg | histogram | seconds | scheduler labels | Average grammar tree traversal time. |
| sglang:grammar_tree_traversal_time_max | histogram | seconds | scheduler labels | Maximum grammar tree traversal time. |
| sglang:prefill_delayer_wait_forward_passes | histogram | passes | scheduler labels | Forward passes spent waiting in the prefill delayer. |
| sglang:prefill_delayer_wait_seconds | histogram | seconds | scheduler labels | Time spent waiting in the prefill delayer. |
| sglang:prefill_delayer_outcomes | counter | outcomes | scheduler labels + input_estimation, output_allow, output_reason, actual_execution | Prefill-delayer scheduling outcomes. |
| sglang:eplb_gpu_physical_count | histogram | GPUs | scheduler labels + layer | Physical GPU count distribution for expert-parallel load balancing. |
| sglang:prefetched_tokens | counter | tokens | scheduler labels | Prompt tokens prefetched from storage. |
| sglang:backuped_tokens | counter | tokens | scheduler labels | Tokens backed up to storage. |
| sglang:prefetch_pgs | histogram | pages | scheduler labels | Prefetch pages per batch. |
| sglang:backup_pgs | histogram | pages | scheduler labels | Backup pages per batch. |
| sglang:prefetch_bandwidth | histogram | GB/s | scheduler labels | Prefetch bandwidth. |
| sglang:backup_bandwidth | histogram | GB/s | scheduler labels | Backup bandwidth. |
| sglang:eviction_duration_seconds | histogram | seconds | scheduler labels | Time to evict memory from GPU to CPU. |
| sglang:evicted_tokens | counter | tokens | scheduler labels | Tokens evicted from GPU to CPU. |
| sglang:load_back_duration_seconds | histogram | seconds | scheduler labels | Time to load memory back from CPU to GPU. |
| sglang:load_back_tokens | counter | tokens | scheduler labels | Tokens loaded back from CPU to GPU. |

System Configuration

These are constant gauges emitted once at startup.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| sglang:max_total_num_tokens | gauge | tokens | scheduler labels | Maximum total tokens in the KV cache pool. |
| sglang:max_running_requests_under_SLO | gauge | requests | scheduler labels | Maximum running requests under SLO, when configured. |
| sglang:engine_startup_time | gauge | seconds | scheduler labels | Engine startup time. |
| sglang:engine_load_weights_time | gauge | seconds | scheduler labels | Time to load model weights. |
| sglang:page_size | gauge | tokens | scheduler labels | KV cache page size in tokens. |
| sglang:num_pages | gauge | pages | scheduler labels | Number of KV cache pages. |
| sglang:context_len | gauge | tokens | scheduler labels | Maximum context length. |
| sglang:startup_available_gpu_memory_gb | gauge | GB | scheduler labels | Available GPU memory at startup. |

Common label values:

  • engine_type: unified, prefill, or decode
  • model_name: Model identifier (e.g., Qwen/Qwen3-0.6B)
  • tp_rank: Tensor parallel rank (e.g., 0, 1, …)
  • pp_rank: Pipeline parallel rank (e.g., 0, 1, …)
  • moe_ep_rank: MoE expert-parallel rank
  • dp_rank: Data-parallel rank when present
  • priority: empty string for totals, or a priority value for per-priority queue gauges
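
Since the priority label distinguishes totals (empty string) from per-priority series, summing every series of a gauge such as sglang:num_running_reqs would count requests twice when priority scheduling is enabled. A minimal sketch of selecting only the total series follows; the function name and sample values are hypothetical, and samples are represented as simple (labels, value) pairs.

```python
def total_series(samples):
    """Keep only the series that represent totals for a priority-aware gauge."""
    has_priority = any("priority" in labels for labels, _ in samples)
    if not has_priority:
        # Priority scheduling is off: every series is already a total.
        return list(samples)
    return [(labels, value) for labels, value in samples if labels.get("priority") == ""]


# Hypothetical series for sglang:num_running_reqs on one scheduler rank.
samples = [
    ({"tp_rank": "0", "priority": ""}, 12.0),   # total across priorities
    ({"tp_rank": "0", "priority": "0"}, 5.0),   # per-priority breakdown
    ({"tp_rank": "0", "priority": "1"}, 7.0),
]
running_total = sum(value for _, value in total_series(samples))
print(running_total)
```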

TensorRT-LLM

TensorRT-LLM (trtllm) is NVIDIA’s high-performance inference engine optimized for NVIDIA GPUs. These metrics cover request latency, token accounting, queue/load state, KV cache behavior, memory usage, and optional speculative decoding stats. Dynamo-TRTLLM does not rename the engine’s native trtllm_ metrics, but it can emit additional Python-side metrics with the same trtllm_ prefix so they pass the same prefix filters.

TRT-LLM exposes Prometheus metrics at a non-standard path. By default, trtllm-serve serves an iteration-stats JSON array at /metrics, not the Prometheus exposition format. The metrics below are available only when the server is launched with return_perf_metrics: true in extra_llm_api_options.yaml, which mounts the Prometheus exposition at /prometheus/metrics. Iteration-derived metrics additionally require iteration stats to be enabled (enable_iter_perf_stats: true for the PyTorch backend; for the TensorRT backend, iteration stats are enabled by default). AIPerf detects the JSON response on /metrics, probes the alternate path automatically, and swaps the collector’s URL on success; see Compatibility & auto-disable.
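
To illustrate the kind of check involved in telling the two response formats apart, the sketch below classifies a response body as a JSON array or not and picks a path accordingly. It is only an illustration under simplifying assumptions and is not AIPerf's actual detection logic; the function names and the sample bodies are hypothetical.

```python
import json


def looks_like_json_array(body):
    """Return True when the body parses as a JSON array."""
    text = body.lstrip()
    if not text.startswith("["):
        return False
    try:
        return isinstance(json.loads(text), list)
    except ValueError:
        return False


def choose_metrics_path(metrics_body, alt_body=None):
    """Pick the path to scrape, given the bodies served at both paths."""
    if not looks_like_json_array(metrics_body):
        return "/metrics"
    if alt_body and not looks_like_json_array(alt_body):
        return "/prometheus/metrics"
    # Neither path served something usable: the caller may disable collection.
    return None


# Hypothetical bodies: iteration-stats JSON on /metrics, exposition text on the alt path.
json_body = '[{"iter": 1}]'
prom_body = "# TYPE trtllm_num_requests_running gauge\ntrtllm_num_requests_running 3\n"
print(choose_metrics_path(json_body, prom_body))
```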

AIPerf records Prometheus family names as exposed by the server. Counter samples are grouped under the counter family name, which drops the sample’s trailing _total suffix. For example, upstream trtllm_request_success_total samples appear under trtllm_request_success in AIPerf outputs.
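
The naming rule can be sketched as a small helper that maps a sample name to the family name it is grouped under. This mirrors the behavior described above under the assumption that only counters are affected; the function name is hypothetical and this is not AIPerf's code.

```python
def counter_family_name(sample_name, metric_type):
    """Map a Prometheus sample name to the family name it is grouped under."""
    suffix = "_total"
    if metric_type == "counter" and sample_name.endswith(suffix):
        return sample_name[: -len(suffix)]
    # Non-counter families keep their name even when it ends in _total.
    return sample_name


print(counter_family_name("trtllm_request_success_total", "counter"))
print(counter_family_name("vllm:iteration_tokens_total", "histogram"))
```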

Request Latency

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm_e2e_request_latency_seconds | histogram | seconds | engine_type, model_name | End-to-end request latency in seconds. |
| trtllm_request_queue_time_seconds | histogram | seconds | engine_type, model_name | Time spent in the waiting phase before scheduling. |
| trtllm_time_to_first_token_seconds | histogram | seconds | engine_type, model_name | Time to first token in seconds. |
| trtllm_time_per_output_token_seconds | histogram | seconds | engine_type, model_name | Time per output token in seconds. |
| trtllm_request_prefill_time_seconds | histogram | seconds | engine_type, model_name | Prefill/context phase duration (first_token_time - first_scheduled_time). |
| trtllm_request_decode_time_seconds | histogram | seconds | engine_type, model_name | Decode/generation phase duration (last_token_time - first_token_time). |
| trtllm_request_inference_time_seconds | histogram | seconds | engine_type, model_name | Total inference duration (last_token_time - first_scheduled_time). |

Histogram buckets:

  • trtllm_e2e_request_latency_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • trtllm_request_queue_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • trtllm_time_to_first_token_seconds: 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
  • trtllm_time_per_output_token_seconds: 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 20.0, 40.0, 80.0, +Inf
  • trtllm_request_prefill_time_seconds: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 640.0, 2560.0, +Inf
  • trtllm_request_decode_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
  • trtllm_request_inference_time_seconds: 0.3, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 7680.0, +Inf
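
The prefill, decode, and inference durations in the Request Latency table are defined from three per-request timestamps, which means that inference time is the sum of the other two. A short sketch of that arithmetic follows; the dataclass and the timestamp values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    first_scheduled_time: float
    first_token_time: float
    last_token_time: float


def phase_durations(ts):
    """Compute the per-request phase durations as defined in the table."""
    return {
        "prefill": ts.first_token_time - ts.first_scheduled_time,
        "decode": ts.last_token_time - ts.first_token_time,
        "inference": ts.last_token_time - ts.first_scheduled_time,
    }


# Hypothetical timestamps, in seconds.
durations = phase_durations(RequestTimestamps(10.0, 10.25, 12.75))
print(durations)
```

Note that prefill time uses finer buckets (starting at 0.001 seconds) than decode and inference time (starting at 0.3 seconds), so short prefills are resolved more precisely than short decodes.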

Request Completion and Tokens

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm_request_success | counter | requests | engine_type, finished_reason, model_name | Successfully completed requests. |
| trtllm_prompt_tokens | counter | tokens | engine_type, model_name | Cumulative number of prompt/input tokens processed. |
| trtllm_generation_tokens | counter | tokens | engine_type, model_name | Cumulative number of generation/output tokens produced. |

Common label values:

  • engine_type: pytorch, _autodeploy, or unknown, taken from the configured backend (not always trtllm).
  • model_name: Model identifier (e.g., Qwen/Qwen3-0.6B).
  • finished_reason: stop, length, timeout, or cancelled. Upstream code does not emit error as a finished_reason value for trtllm_request_success.

Queue, Batch, and Memory State

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm_num_requests_running | gauge | requests | engine_type, model_name | Number of active requests. |
| trtllm_num_requests_waiting | gauge | requests | engine_type, model_name | Number of queued requests. |
| trtllm_num_requests_completed | counter | requests | engine_type, model_name | Total completed requests reported by iteration stats. |
| trtllm_max_num_active_requests | gauge | requests | engine_type, model_name | Maximum number of active requests. |
| trtllm_iteration_latency_seconds | gauge | seconds | engine_type, model_name | Iteration latency converted from milliseconds to seconds. |
| trtllm_gpu_memory_usage_bytes | gauge | bytes | engine_type, model_name | GPU memory usage in bytes. |
| trtllm_cpu_memory_usage_bytes | gauge | bytes | engine_type, model_name | CPU memory usage in bytes. |
| trtllm_pinned_memory_usage_bytes | gauge | bytes | engine_type, model_name | Pinned memory usage in bytes. |
| trtllm_max_batch_size_static | gauge | requests | engine_type, model_name | Static maximum batch size. |
| trtllm_max_batch_size_runtime | gauge | requests | engine_type, model_name | Runtime maximum batch size. |
| trtllm_max_num_tokens_runtime | gauge | tokens | engine_type, model_name | Runtime maximum number of tokens. |
| trtllm_num_context_requests | gauge | requests | engine_type, model_name | Number of context/prefill requests. |
| trtllm_num_generation_requests | gauge | requests | engine_type, model_name | Number of generation/decode requests. |
| trtllm_num_paused_requests | gauge | requests | engine_type, model_name | Number of paused requests. |
| trtllm_num_scheduled_requests | gauge | requests | engine_type, model_name | Number of scheduled requests. |
| trtllm_total_context_tokens | gauge | tokens | engine_type, model_name | Total context tokens in the current iteration stats. |
| trtllm_avg_decoded_tokens_per_iter | gauge | tokens | engine_type, model_name | Average decoded tokens per iteration. |

KV Cache Metrics

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm_kv_cache_hit_rate | gauge | ratio | engine_type, model_name | KV cache hit rate. |
| trtllm_kv_cache_utilization | gauge | ratio | engine_type, model_name | Used KV cache blocks divided by max KV cache blocks. |
| trtllm_kv_cache_host_utilization | gauge | ratio | engine_type, model_name | Secondary/host KV cache utilization. |
| trtllm_kv_cache_iter_reuse_rate | gauge | ratio | engine_type, model_name | Per-iteration KV cache block reuse rate. |
| trtllm_kv_cache_reused_blocks | counter | blocks | engine_type, model_name | Cumulative reused KV cache blocks. |
| trtllm_kv_cache_missed_blocks | counter | blocks | engine_type, model_name | Cumulative missed KV cache blocks. |
| trtllm_kv_cache_iter_reused_blocks | counter | blocks | engine_type, model_name | Total reused KV cache blocks per iteration stats. |
| trtllm_kv_cache_iter_full_reused_blocks | counter | blocks | engine_type, model_name | Total fully reused KV cache blocks. |
| trtllm_kv_cache_iter_partial_reused_blocks | counter | blocks | engine_type, model_name | Total partially reused KV cache blocks. |
| trtllm_kv_cache_iter_missed_blocks | counter | blocks | engine_type, model_name | Total missed KV cache blocks in context phase. |
| trtllm_kv_cache_gen_alloc_blocks | counter | blocks | engine_type, model_name | Blocks allocated during generation phase. |
| trtllm_kv_cache_onboard_bytes | counter | bytes | engine_type, model_name | Bytes transferred from host to GPU. |
| trtllm_kv_cache_offload_bytes | counter | bytes | engine_type, model_name | Bytes transferred from GPU to host. |
| trtllm_kv_cache_intra_device_copy_bytes | counter | bytes | engine_type, model_name | Bytes copied within GPU. |
| trtllm_kv_cache_max_blocks | gauge | blocks | engine_type, model_name | Maximum number of KV cache blocks. |
| trtllm_kv_cache_free_blocks | gauge | blocks | engine_type, model_name | Number of free KV cache blocks. |
| trtllm_kv_cache_used_blocks | gauge | blocks | engine_type, model_name | Number of used KV cache blocks. |
| trtllm_kv_cache_tokens_per_block | gauge | tokens | engine_type, model_name | Number of tokens per KV cache block. |
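
The block gauges above can be combined to reproduce the utilization ratio and to express capacity in tokens. The sketch below assumes that every block holds tokens_per_block tokens, so the used-token figure is an upper bound when blocks are partially filled; the function name and sample values are hypothetical.

```python
def kv_cache_summary(used_blocks, max_blocks, tokens_per_block):
    """Derive utilization and token capacity from the KV cache block gauges."""
    # Matches the stated definition: used blocks divided by max blocks.
    utilization = used_blocks / max_blocks if max_blocks else 0.0
    return {
        "utilization": utilization,
        "used_tokens_upper_bound": used_blocks * tokens_per_block,
        "capacity_tokens": max_blocks * tokens_per_block,
    }


# Hypothetical gauge readings from a single scrape.
kv_summary = kv_cache_summary(used_blocks=1536, max_blocks=4096, tokens_per_block=32)
print(kv_summary)
```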

Speculative Decoding and Config Info

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm_spec_decode_num_draft_tokens | counter | tokens | engine_type, model_name | Total draft tokens in speculative decoding. |
| trtllm_spec_decode_num_accepted_tokens | counter | tokens | engine_type, model_name | Total accepted tokens in speculative decoding. |
| trtllm_spec_decode_acceptance_length | gauge | tokens | engine_type, model_name | Acceptance length in speculative decoding. |
| trtllm_spec_decode_draft_overhead | gauge | ratio | engine_type, model_name | Draft overhead in speculative decoding. |
| trtllm_model_config_info | gauge | — | engine_type, model_name, model, served_model_name, dtype, quantization, max_model_len, gpu_type | Static model configuration as labels, value 1. |
| trtllm_parallel_config_info | gauge | — | engine_type, model_name, tensor_parallel_size, pipeline_parallel_size, context_parallel_size, gpu_count, expert_parallel_size | Static parallelism configuration as labels, value 1. |
| trtllm_speculative_config_info | gauge | — | engine_type, model_name, spec_enabled, spec_method, spec_num_tokens, spec_draft_model | Static speculative-decoding configuration as labels, value 1; emitted only when speculative config exists. |
| trtllm_kv_cache_config_info | gauge | — | engine_type, model_name, page_size, enable_block_reuse, enable_partial_reuse, free_gpu_memory_fraction, cache_dtype | Static KV cache configuration as labels, value 1; emitted only when KV cache config exists. |
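
The two speculative-decoding counters can be combined into a draft-token acceptance rate over a benchmark window. The sketch below assumes the inputs are the window increases of trtllm_spec_decode_num_accepted_tokens and trtllm_spec_decode_num_draft_tokens (for example, each counter's stats.total); the function name is hypothetical and the ratio is a derived value, not a metric that TensorRT-LLM emits.

```python
def spec_decode_acceptance_rate(accepted_delta: float, draft_delta: float) -> float:
    """Fraction of drafted tokens accepted during the window.

    Both arguments are counter increases over the same window (assumption).
    Returns 0.0 when no draft tokens were produced, to avoid dividing by zero.
    """
    if draft_delta <= 0:
        return 0.0
    return accepted_delta / draft_delta


# Hypothetical window: 4000 tokens drafted, 3000 of them accepted.
rate = spec_decode_acceptance_rate(3000, 4000)
```

A rate near 1.0 suggests the draft model rarely wastes work, while a low rate suggests drafting costs more than it saves.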

Dynamo-TRTLLM Additional Metrics

These are emitted by Dynamo’s TRT-LLM worker integration in addition to the engine-native TensorRT-LLM metrics above. They intentionally use the trtllm_ prefix.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| trtllm_num_aborted_requests | counter | requests | Dynamo-TRTLLM labels such as model_name, disaggregation_mode, engine_type | Aborted or cancelled requests. |
| trtllm_request_type_image | counter | requests | Dynamo-TRTLLM labels | Requests containing image or multimodal content. |
| trtllm_request_type_structured_output | counter | requests | Dynamo-TRTLLM labels | Requests using guided or structured decoding. |
| trtllm_kv_transfer_success | counter | transfers | Dynamo-TRTLLM labels | Successful KV cache transfers. |
| trtllm_kv_transfer_latency_seconds | histogram | seconds | Dynamo-TRTLLM labels | KV cache transfer latency per request. |
| trtllm_kv_transfer_bytes | histogram | bytes | Dynamo-TRTLLM labels | KV cache transfer size per request. |
| trtllm_kv_transfer_speed_gb_s | histogram | GB/s | Dynamo-TRTLLM labels | KV cache transfer speed per request. |

Triton Inference Server

Triton Inference Server exposes Prometheus text metrics on a dedicated metrics service, by default http://localhost:8002/metrics. The endpoint is enabled unless tritonserver --allow-metrics=false is set; --allow-gpu-metrics=false and --allow-cpu-metrics=false disable only those metric groups. Use --metrics-port, --metrics-address, and --metrics-interval-ms to change where interval metrics are served and how often they refresh.
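
To illustrate what the metrics endpoint returns, here is a minimal sketch that parses Prometheus text exposition lines into (name, labels, value) tuples. The sample text, the model name demo, and the parse_metrics helper are hypothetical; a real deployment would fetch the text from the metrics URL, and a production parser should use a maintained Prometheus client library rather than these regular expressions.

```python
import re

# name, optional {labels}, then the sample value (any trailing timestamp is ignored)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative sample only; values and the model name are made up.
SAMPLE_TEXT = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="demo",version="1"} 42
nv_inference_pending_request_count{model="demo",version="1"} 3
nv_cpu_utilization 0.25
"""


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse Prometheus text format, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple sketch cannot parse
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE_TEXT)
```

Each tuple keeps the label set, so series for different models or GPUs stay distinguishable after parsing.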

Request Counts and Queue State

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| nv_inference_request_success | counter | requests | model, version | Successful inference requests received by Triton. Each request counts as one, even when batched. |
| nv_inference_request_failure | counter | requests | model, reason, version | Failed inference requests. reason values include REJECTED, CANCELED, BACKEND, and OTHER. |
| nv_inference_count | counter | inferences | model, version | Inferences performed; a batch of n counts as n inferences and cached requests are excluded. |
| nv_inference_exec_count | counter | executions | model, version | Backend batch executions. nv_inference_count / nv_inference_exec_count approximates average batch size. |
| nv_inference_pending_request_count | gauge | requests | model, version | Requests received by Triton core but not yet executing in a backend. Use as Triton’s queue-depth signal. |
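
The batch-size approximation noted for nv_inference_exec_count can be sketched as follows. It assumes both inputs are counter increases over the same benchmark window (for example, each counter's stats.total); the function name is hypothetical.

```python
def average_batch_size(inference_delta: float, exec_delta: float) -> float:
    """Approximate average batch size: inferences per backend execution.

    Returns 0.0 when no executions happened in the window.
    """
    if exec_delta <= 0:
        return 0.0
    return inference_delta / exec_delta


# Hypothetical window: 1200 inferences served by 150 batch executions.
batch_size = average_batch_size(1200, 150)
```

Using window increases rather than raw counter values keeps traffic from before the benchmark out of the ratio.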

Latency Counters and Optional Histograms

By default, Triton exposes cumulative latency counters in microseconds. AIPerf reports stats.total for the benchmark-window increase and stats.rate as microseconds accumulated per second. Optional histogram and summary latency families are controlled with --metrics-config; AIPerf exports histograms but skips Prometheus summary metrics. Model-level metrics use model and version labels, and can also include model_namespace, model tag labels prefixed with _, and gpu_uuid when configured by Triton.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| nv_inference_request_duration_us | counter | microseconds | model, version | Cumulative end-to-end request handling time, including cached requests. |
| nv_inference_queue_duration_us | counter | microseconds | model, version | Cumulative time requests spent waiting in Triton’s scheduling queue. |
| nv_inference_compute_input_duration_us | counter | microseconds | model, version | Cumulative backend input-processing time, excluding cached requests. |
| nv_inference_compute_infer_duration_us | counter | microseconds | model, version | Cumulative backend model execution time, excluding cached requests. |
| nv_inference_compute_output_duration_us | counter | microseconds | model, version | Cumulative backend output-processing time, excluding cached requests. |
| nv_inference_first_response_histogram_ms | histogram | milliseconds | model, version | Optional first-response latency histogram. Enable with --metrics-config histogram_latencies=true; default buckets are 100, 500, 2000, 5000, +Inf unless overridden per model. |
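
Because the latency counters are cumulative sums, a per-request average has to be derived by dividing by a request count. The sketch below shows two such derivations under stated assumptions: the duration and request inputs are window increases (stats.total) of a duration counter and of nv_inference_request_success, and the second helper reads stats.rate as an approximate average number of requests in that phase at once, which only holds for roughly steady traffic. Both function names are hypothetical.

```python
def mean_latency_us(duration_delta_us: float, request_delta: float) -> float:
    """Average microseconds per request for one latency counter over the window."""
    if request_delta <= 0:
        return 0.0
    return duration_delta_us / request_delta


def mean_in_flight(rate_us_per_s: float) -> float:
    """Convert microseconds accumulated per second into average concurrency.

    1,000,000 us accumulated per wall-clock second is one request's worth of
    continuous time, so the quotient approximates requests in that phase at once.
    """
    return rate_us_per_s / 1_000_000


# Hypothetical window: 1.5 s of queue time spread over 300 requests.
avg_queue_us = mean_latency_us(1_500_000, 300)
# Hypothetical rate: 2,000,000 us of queue time accumulated per second.
avg_queued = mean_in_flight(2_000_000)
```

For the compute counters, which exclude cached requests, the success count may overstate the denominator when the response cache is enabled.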

GPU, CPU, Pinned Memory, and Response Cache

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| nv_gpu_power_usage | gauge | watts | gpu_uuid | Instantaneous GPU power. |
| nv_gpu_power_limit | gauge | watts | gpu_uuid | GPU power limit. |
| nv_energy_consumption | counter | joules | gpu_uuid | GPU energy consumption since Triton started. |
| nv_gpu_utilization | gauge | ratio | gpu_uuid | GPU utilization from 0.0 to 1.0. |
| nv_gpu_memory_total_bytes | gauge | bytes | gpu_uuid | Total GPU memory. |
| nv_gpu_memory_used_bytes | gauge | bytes | gpu_uuid | Used GPU memory. |
| nv_cpu_utilization | gauge | ratio | — | Total CPU utilization from 0.0 to 1.0. Linux only. |
| nv_cpu_memory_total_bytes | gauge | bytes | — | Total system memory. Linux only. |
| nv_cpu_memory_used_bytes | gauge | bytes | — | Used system memory. Linux only. |
| nv_pinned_memory_pool_total_bytes | gauge | bytes | — | Total pinned-memory pool capacity. |
| nv_pinned_memory_pool_used_bytes | gauge | bytes | — | Used pinned-memory pool. |

Response-cache metrics are emitted only when Triton’s response cache is enabled.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| nv_cache_num_hits_per_model | counter | requests | model, version | Response-cache hits per model. |
| nv_cache_num_misses_per_model | counter | requests | model, version | Response-cache misses per model. |
| nv_cache_hit_duration_per_model | counter | microseconds | model, version | Cumulative cache-hit lookup duration. |
| nv_cache_miss_duration_per_model | counter | microseconds | model, version | Cumulative cache-miss lookup/insert duration. |

TensorRT-LLM Triton Backend Custom Metrics

When TensorRT-LLM runs as a Triton backend, the backend can expose additional custom families using the nv_trt_llm_* and nv_llm_* prefixes.

| Metric | Type | Unit | Labels | Description |
| --- | --- | --- | --- | --- |
| nv_trt_llm_request_metrics | gauge | requests | model, version, request_type | TensorRT-LLM backend request counts by request type. |
| nv_trt_llm_runtime_memory_metrics | gauge | bytes | model, version, memory_type | Runtime memory usage by memory type. |
| nv_trt_llm_kv_cache_block_metrics | gauge | blocks | model, version, kv_cache_block_type | KV-cache block counts by block type. |
| nv_trt_llm_disaggregated_serving_metrics | gauge | — | model, version, disaggregated_serving_type | Disaggregated-serving state and transfer metrics. |
| nv_trt_llm_v1_metrics | gauge | — | model, version, metric-specific labels | TensorRT-LLM v1 backend metrics. |
| nv_trt_llm_inflight_batcher_metrics | gauge | — | model, version, metric-specific labels | TensorRT-LLM inflight-batcher backend metrics. |
| nv_trt_llm_general_metrics | gauge | — | model, version, metric-specific labels | General TensorRT-LLM backend metrics. |
| nv_llm_output_token_len | histogram | tokens | model, version | Output-token length distribution. |
| nv_llm_input_token_len | histogram | tokens | model, version | Input-token length distribution. |

KVBM (KV Block Manager)

Note: These metrics are only available with Dynamo deployments using the KV Block Manager feature for advanced KV cache management.

Block Transfer Operations

Most of these metrics are counters tracking cumulative block movement operations; the cache hit rates are gauges computed over a sliding window.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| kvbm_matched_tokens | counter | tokens | The number of matched tokens (prefix cache hits). |
| kvbm_host_cache_hit_rate | gauge | ratio | Host cache hit rate from the sliding window. |
| kvbm_disk_cache_hit_rate | gauge | ratio | Disk cache hit rate from the sliding window. |
| kvbm_object_cache_hit_rate | gauge | ratio | Object-storage cache hit rate from the sliding window. |
| kvbm_offload_blocks_d2d | counter | blocks | The number of offload blocks from device to disk (bypassing host memory). |
| kvbm_offload_blocks_d2h | counter | blocks | The number of offload blocks from device to host memory. |
| kvbm_offload_blocks_h2d | counter | blocks | The number of offload blocks from host memory to disk. |
| kvbm_offload_blocks_d2o | counter | blocks | The number of blocks offloaded from device to object storage. |
| kvbm_onboard_blocks_d2d | counter | blocks | The number of onboard blocks from disk to device (bypassing host memory). |
| kvbm_onboard_blocks_h2d | counter | blocks | The number of onboard blocks from host memory to device. |
| kvbm_onboard_blocks_o2d | counter | blocks | The number of blocks onboarded from object storage to device. |
| kvbm_object_read_failures | counter | blocks | Failed object-storage read operations. |
| kvbm_object_write_failures | counter | blocks | Failed object-storage write operations. |

Block transfer patterns:

  • d2d: Device ↔ Disk (direct, fast path)
  • d2h: Device → Host (offload to CPU memory)
  • h2d: Host → Device (onboard from CPU memory) or Host → Disk for offload persistence
  • d2o: Device → Object storage
  • o2d: Object storage → Device

Logical Pool Metrics

Dynamo’s logical KVBM pool collector also exports pool-scoped counters and gauges. These carry a pool label and may include external deployment labels such as instance_id.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| kvbm_allocations_total | counter | allocations | Blocks allocated from logical pools. |
| kvbm_allocations_from_reset_total | counter | allocations | Blocks allocated from the reset pool. |
| kvbm_evictions_total | counter | evictions | Blocks evicted from the inactive pool. |
| kvbm_registrations_total | counter | registrations | CompleteBlock to ImmutableBlock registrations. |
| kvbm_duplicate_blocks_total | counter | blocks | Duplicate blocks created by the allow-duplicates policy. |
| kvbm_registration_dedup_total | counter | registrations | Registrations deduplicated by the reject-duplicates policy. |
| kvbm_stagings_total | counter | stagings | MutableBlock to CompleteBlock transitions. |
| kvbm_match_hashes_requested_total | counter | hashes | Hashes requested in match_blocks. |
| kvbm_match_blocks_returned_total | counter | blocks | Blocks returned from match_blocks. |
| kvbm_scan_hashes_requested_total | counter | hashes | Hashes requested in scan_matches. |
| kvbm_scan_blocks_returned_total | counter | blocks | Blocks returned from scan_matches. |
| kvbm_eager_primary_to_inactive_total | counter | transitions | Lookup-driven Primary-to-Inactive race-window transitions. |
| kvbm_allocate_atomic_rollback_total | counter | rollbacks | Allocation rollbacks after inactive backend under-allocation. |
| kvbm_release_primary_noop_total | counter | releases | Primary drop no-ops after concurrent transition or resurrection. |
| kvbm_release_duplicate_noop_total | counter | releases | Duplicate drop no-ops due to slot identity mismatch. |
| kvbm_inflight_mutable | gauge | blocks | Mutable blocks currently held outside the pool. |
| kvbm_inflight_immutable | gauge | blocks | Immutable blocks currently held outside the pool. |
| kvbm_reset_pool_size | gauge | blocks | Current reset-pool size. |
| kvbm_inactive_pool_size | gauge | blocks | Current inactive-pool size. |

Appendix

Common Metric Labels

Labels that appear across multiple metrics:

| Label | Description | Example Values |
| --- | --- | --- |
| model | Model identifier (Dynamo/Triton) | qwen/qwen3-0.6b |
| model_namespace | Triton model namespace | namespace configured in Triton |
| _custom_tag | Triton model tag label | tag labels are prefixed with _ |
| gpu_uuid | Triton GPU UUID | GPU UUID string |
| model_name | Model identifier (backends) | Qwen/Qwen3-0.6B |
| endpoint | API endpoint | chat_completions, completions |
| request_type | Request type | stream, unary |
| status | Request outcome | success, error |
| engine | Engine identifier (vLLM) | 0, 1, … |
| engine_type | Engine type | pytorch, _autodeploy, unified, prefill, decode |
| tp_rank | Tensor parallel rank | 0, 1, … |
| pp_rank | Pipeline parallel rank | 0, 1, … |
| moe_ep_rank | SGLang MoE expert-parallel rank | 0, 1, … |
| dp_rank | Data-parallel rank | 0, 1, … |
| priority | SGLang priority scheduling value | empty string, 0, 1, … |
| stage | Processing stage (SGLang) | prefill_forward, decode_transferred |
| finished_reason | Completion reason | stop, length, abort, error, repetition, timeout, cancelled |
| version | Triton model version | 1, … |
| reason | vLLM waiting reason or Triton failure reason | capacity, deferred, REJECTED, CANCELED, BACKEND, OTHER |
| source | vLLM prompt-token source | local_compute, local_cache_hit, external_kv_transfer |
| sleep_state | vLLM engine sleep state | awake, weights_offloaded, discard_all |
| position | Speculative-decoding draft position | 0, 1, … |
| transfer_type | KV offload transfer type | Backend-specific transfer type |
| cache_source | SGLang cache source | device, host, storage_<backend>, total |
| forward_mode | SGLang forward mode | Backend-specific forward mode |
| layer | SGLang model layer | 0, 1, … |
| dynamo_component | Component identifier | Worker name/ID |
| dynamo_endpoint | Internal endpoint | Internal routing info |
| dynamo_namespace | Namespace | Deployment namespace |
| worker_id | Dynamo worker identifier | Worker ID |
| worker_type | Dynamo worker type | prefill, decode |
| router_id | Dynamo router identifier | Router ID |
| operation | Dynamo operation name | tokenize, detokenize |
| migration_type | Dynamo request migration type | new_request, ongoing_request |
| event_type | Dynamo KV publisher event type | Event kind |
| worker | Tokio worker index | 0, 1, … |
| pool | Dynamo KVBM logical pool name | Pool identifier |
| instance_id | Dynamo KVBM external instance label | Deployment instance ID |
| error_type | Error classification | Error category |
| service_name | NATS service name | Service identifier |

Notes on Metric Usage

  1. Dynamo vs backend metrics: Dynamo metrics measure at the HTTP/routing layer (user-facing), while vLLM/SGLang/TensorRT-LLM metrics measure inside the inference engine. Triton metrics cover Triton core and backend scheduling plus system telemetry. Use Dynamo metrics for user-facing SLAs, and backend or Triton metrics for debugging performance.

  2. Counter vs Gauge interpretation:

    • Counters: Use stats.total for total change during benchmark, stats.rate for rate of change (per second)
    • Gauges: Use stats.avg for typical value, stats.max for peak, stats.p99 for tail behavior
  3. Histogram percentiles: Histogram percentiles (stats.p50_estimate, stats.p90_estimate, stats.p95_estimate, stats.p99_estimate) are estimated from bucket boundaries. Exact values depend on bucket configuration.

  4. Multiple endpoints: When scraping multiple instances, each series includes an endpoint_url label to identify the source.

  5. Backend-specific capabilities:

    • vLLM: Most comprehensive metrics including full request phase breakdown, cache statistics, and batch efficiency
    • SGLang: RadixAttention cache metrics, disaggregated inference support, speculative decoding stats, per-stage latency breakdowns
    • TensorRT-LLM: Core latency, queue, token, KV-cache, memory, and speculative decoding metrics when Prometheus output is enabled
    • Triton: Triton core request counts, queue depth, cumulative latency counters, optional first-response histograms, GPU/CPU/pinned-memory telemetry, and response-cache metrics
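
Note 3 above says histogram percentiles are estimated from bucket boundaries. The sketch below shows one common way to do that: linear interpolation inside the bucket that contains the target rank, the approach Prometheus's histogram_quantile uses. Whether AIPerf computes its p50_estimate and related values the same way is an assumption; the function name and the bucket counts are hypothetical.

```python
import math


def estimate_percentile(buckets: list[tuple[float, float]], q: float) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical counts on Triton's default first-response buckets (milliseconds).
example = [(100, 10), (500, 60), (2000, 90), (5000, 100), (math.inf, 100)]
p50 = estimate_percentile(example, 0.50)  # falls in the 100-500 ms bucket
```

With these counts the median lands 80% of the way through the 100 to 500 ms bucket, which shows why coarse buckets give coarse estimates: the true median could be anywhere in that range.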

For detailed implementation and usage examples, see the Server Metrics Tutorial. For aggregated statistics, see the JSON Schema Reference. For raw time-series analysis, see the Parquet Schema Reference.