Dynamo provides built-in metrics capabilities through the Dynamo metrics API, which is automatically available whenever you use the DistributedRuntime framework. This document serves as a reference for all available metrics in Dynamo.
For visualization setup instructions, see the Prometheus and Grafana Setup Guide.
For creating custom metrics, see the Metrics Developer Guide.
This is a single machine example.
For visualizing metrics with Prometheus and Grafana, start the observability stack. See Observability Getting Started for instructions.
Launch a frontend and vLLM backend to test metrics:
Wait for the vLLM worker to start, then send requests and check metrics:
Dynamo exposes metrics in Prometheus Exposition Format text at the /metrics HTTP endpoint. All Dynamo-generated metrics use the dynamo_* prefix and include labels (dynamo_namespace, dynamo_component, dynamo_endpoint) to identify the source component.
Example Prometheus Exposition Format text:
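A hand-written sample in this format, together with a minimal Python sketch that parses it, might look as follows. The sample text, its label values and numbers, and the helper name parse_samples are illustrative assumptions, not output captured from a running Dynamo deployment. The parser is deliberately simplified: it does not handle timestamps or escaped characters inside label values.

```python
import re

# Hand-written sample; label values and numbers are illustrative only.
SAMPLE = """\
# HELP dynamo_component_requests_total Total requests processed
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{dynamo_namespace="example_ns",dynamo_component="backend",dynamo_endpoint="generate"} 42
# HELP dynamo_component_inflight_requests Requests currently being processed
# TYPE dynamo_component_inflight_requests gauge
dynamo_component_inflight_requests{dynamo_namespace="example_ns",dynamo_component="backend",dynamo_endpoint="generate"} 3
"""

# name, optional {labels}, then a single value token
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for each simple sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a plain "name{labels} value" line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_samples(SAMPLE):
    print(name, labels["dynamo_endpoint"], value)
```

Filtering the parsed tuples on the dynamo_namespace, dynamo_component, or dynamo_endpoint label is then an ordinary dictionary lookup.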
Dynamo exposes several categories of metrics:
- Frontend metrics (dynamo_frontend_*) - Request handling, token processing, and latency measurements
- Component metrics (dynamo_component_*) - Request counts, processing times, byte transfers, and system uptime
- Specialized component metrics (for example, dynamo_preprocessor_*) - Component-specific metrics
- Engine metrics - vLLM (vllm:*), SGLang (sglang:*), TensorRT-LLM (trtllm_*)
The Dynamo metrics API is available on DistributedRuntime, Namespace, Component, and Endpoint, providing a hierarchical approach to metric collection that matches Dynamo’s distributed architecture:
- DistributedRuntime: Global metrics across the entire runtime
- Namespace: Metrics scoped to a specific dynamo_namespace
- Component: Metrics for a specific dynamo_component within a namespace
- Endpoint: Metrics for an individual dynamo_endpoint within a component
This hierarchical structure allows you to create metrics at the appropriate level of granularity for your monitoring needs.
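To make the nesting concrete, the sketch below models how identifying labels could accumulate from one level to the next. It is a hypothetical illustration in plain Python and is not the Dynamo metrics API: the Scope class, its methods, and the assumption that each level contributes exactly one label are all invented for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model for illustration only; this is NOT the Dynamo metrics API.
# Assumption: each level below the runtime contributes one identifying label.
LEVEL_LABELS = {
    "namespace": "dynamo_namespace",
    "component": "dynamo_component",
    "endpoint": "dynamo_endpoint",
}


@dataclass(frozen=True)
class Scope:
    level: str                       # "runtime", "namespace", "component", or "endpoint"
    name: str = ""
    parent: Optional["Scope"] = None

    def child(self, level, name):
        """Create a scope one level below this one."""
        return Scope(level, name, self)

    def labels(self):
        """Collect the labels contributed by this scope and all its ancestors."""
        collected = self.parent.labels() if self.parent else {}
        label = LEVEL_LABELS.get(self.level)
        if label is not None:
            collected[label] = self.name
        return collected


runtime = Scope("runtime")
endpoint = (
    runtime.child("namespace", "example_ns")
    .child("component", "backend")
    .child("endpoint", "generate")
)
print(runtime.labels())   # no scope labels at the runtime level
print(endpoint.labels())  # all three labels at the endpoint level
```

In this model a metric created at a higher level simply carries fewer identifying labels, which is one way to read "appropriate level of granularity".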
Backend workers (python -m dynamo.vllm, python -m dynamo.sglang, etc.) expose dynamo_component_* metrics on port 8081 by default (configurable via DYN_SYSTEM_PORT).
Every component that uses the DistributedRuntime framework automatically exposes metrics with the dynamo_component_* prefix at the /metrics endpoint of the system status port (default: 8081, configurable via DYN_SYSTEM_PORT):
- dynamo_component_inflight_requests: Requests currently being processed (gauge)
- dynamo_component_request_bytes_total: Total bytes received in requests (counter)
- dynamo_component_request_duration_seconds: Request processing time (histogram)
- dynamo_component_requests_total: Total requests processed (counter)
- dynamo_component_response_bytes_total: Total bytes sent in responses (counter)
- dynamo_component_uptime_seconds: DistributedRuntime uptime (gauge)
Access backend component metrics:
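One way to do this from Python, as a minimal sketch using only the standard library. It assumes a worker is reachable on localhost and that the client process sees the same DYN_SYSTEM_PORT value as the worker; the function names are hypothetical.

```python
import os
import urllib.request


def metrics_url(host="localhost"):
    """Build the worker /metrics URL, honoring DYN_SYSTEM_PORT (default 8081)."""
    port = os.environ.get("DYN_SYSTEM_PORT", "8081")
    return f"http://{host}:{port}/metrics"


def filter_by_prefix(text, prefix="dynamo_component_"):
    """Keep only sample lines whose metric name starts with the prefix.

    HELP/TYPE comment lines start with '#', so they are dropped as well.
    """
    return [line for line in text.splitlines() if line.startswith(prefix)]


def fetch_component_metrics(host="localhost", timeout=5.0):
    """Fetch the endpoint over HTTP and return the dynamo_component_* lines."""
    with urllib.request.urlopen(metrics_url(host), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return filter_by_prefix(body)


# Usage (requires a running worker):
#     for line in fetch_component_metrics():
#         print(line)
```

Keeping the filtering separate from the HTTP call makes it possible to reuse filter_by_prefix on saved scrape output.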
KV router statistics are automatically exposed by LLM workers and KV router components on the backend system status port (default: 8081) with the dynamo_component_kvstats_* prefix. These metrics provide insight into GPU memory usage and cache efficiency:
- dynamo_component_kvstats_active_blocks: Number of active KV cache blocks currently in use (gauge)
- dynamo_component_kvstats_total_blocks: Total number of KV cache blocks available (gauge)
- dynamo_component_kvstats_gpu_cache_usage_percent: GPU cache usage as a fraction between 0.0 and 1.0 (gauge)
- dynamo_component_kvstats_gpu_prefix_cache_hit_rate: GPU prefix cache hit rate as a fraction between 0.0 and 1.0 (gauge)
These metrics are published by:
Some components expose additional metrics specific to their functionality:
- dynamo_preprocessor_*: Metrics specific to preprocessor components
The Dynamo HTTP Frontend (python -m dynamo.frontend) exposes dynamo_frontend_* metrics on port 8000 by default (configurable via --http-port or DYN_HTTP_PORT) at the /metrics endpoint. Most metrics include model labels containing the model name:
- dynamo_frontend_inflight_requests: Inflight requests (gauge)
- dynamo_frontend_queued_requests: Number of requests in HTTP processing queue (gauge)
- dynamo_frontend_disconnected_clients: Number of disconnected clients (gauge)
- dynamo_frontend_input_sequence_tokens: Input sequence length (histogram)
- dynamo_frontend_cached_tokens: Number of cached tokens (prefix cache hits) per request (histogram)
- dynamo_frontend_inter_token_latency_seconds: Inter-token latency (histogram)
- dynamo_frontend_output_sequence_tokens: Output sequence length (histogram)
- dynamo_frontend_output_tokens_total: Total number of output tokens generated (counter)
- dynamo_frontend_request_duration_seconds: LLM request duration (histogram)
- dynamo_frontend_requests_total: Total LLM requests (counter)
- dynamo_frontend_time_to_first_token_seconds: Time to first token (histogram)
- dynamo_frontend_model_migration_total: Total number of request migrations due to worker unavailability (counter, labels: model, migration_type)
Access frontend metrics:
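A minimal Python sketch of one way to do this and to summarize a histogram from the result. It relies on the standard Prometheus convention that a histogram named X is exported with X_sum and X_count series. The function names and the sample values are illustrative assumptions, and the client is assumed to see the same DYN_HTTP_PORT value as the frontend.

```python
import os
import urllib.request


def fetch_frontend_metrics(host="localhost", timeout=5.0):
    """Fetch the frontend /metrics text, honoring DYN_HTTP_PORT (default 8000)."""
    port = os.environ.get("DYN_HTTP_PORT", "8000")
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def histogram_mean(text, name):
    """Mean observed value of a histogram, aggregated over all label sets.

    Returns None when no observations have been recorded.
    Simplified: assumes sample lines carry no trailing timestamp.
    """
    total = 0.0
    count = 0.0
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]  # drop the label block, if any
        if metric == f"{name}_sum":
            total += float(value)
        elif metric == f"{name}_count":
            count += float(value)
    return total / count if count else None


# Hand-written sample; the model name and numbers are illustrative only.
SAMPLE = """\
dynamo_frontend_time_to_first_token_seconds_sum{model="example-model"} 1.5
dynamo_frontend_time_to_first_token_seconds_count{model="example-model"} 6
"""
print(histogram_mean(SAMPLE, "dynamo_frontend_time_to_first_token_seconds"))

# Usage against a running frontend:
#     text = fetch_frontend_metrics()
#     print(histogram_mean(text, "dynamo_frontend_time_to_first_token_seconds"))
```

The result is the mean over the lifetime of the process, not over a recent window; windowed averages and percentiles are better computed in Prometheus from the same series.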
Note: The dynamo_frontend_inflight_requests metric tracks requests from HTTP handler start until the complete response is finished, while dynamo_frontend_queued_requests tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
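The two definitions in this note can be expressed directly in code. The sketch below is a hypothetical model, not Dynamo's implementation: the RequestTimeline class and the timestamps are invented for illustration. It counts how many requests fall in each window at a given instant.

```python
from dataclasses import dataclass


@dataclass
class RequestTimeline:
    start: float        # HTTP handler starts
    first_token: float  # first token generation begins
    end: float          # complete response is finished


def gauges_at(requests, t):
    """Return (inflight, queued) at time t, following the note's definitions."""
    inflight = sum(1 for r in requests if r.start <= t < r.end)
    queued = sum(1 for r in requests if r.start <= t < r.first_token)
    return inflight, queued


# Illustrative timestamps in seconds.
requests = [
    RequestTimeline(start=0.0, first_token=0.5, end=2.0),
    RequestTimeline(start=0.2, first_token=1.5, end=3.0),
    RequestTimeline(start=1.0, first_token=2.5, end=4.0),
]
for t in (0.3, 1.2, 2.2):
    print(t, gauges_at(requests, t))
```

Because each request's queue window ends at or before its inflight window ends, the queued count can never exceed the inflight count at any instant.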
The frontend also exposes model configuration metrics (on port 8000 /metrics endpoint) with the dynamo_frontend_model_* prefix. These metrics are populated from the worker backend registration service when workers register with the system. All model configuration metrics include a model label.
Runtime Config Metrics (from ModelRuntimeConfig): These metrics come from the runtime configuration provided by worker backends during registration.
- dynamo_frontend_model_total_kv_blocks: Total KV blocks available for a worker serving the model (gauge)
- dynamo_frontend_model_max_num_seqs: Maximum number of sequences for a worker serving the model (gauge)
- dynamo_frontend_model_max_num_batched_tokens: Maximum number of batched tokens for a worker serving the model (gauge)
MDC Metrics (from ModelDeploymentCard): These metrics come from the Model Deployment Card information provided by worker backends during registration. Note that when multiple worker instances register with the same model name, only the first instance’s configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates.
- dynamo_frontend_model_context_length: Maximum context length for a worker serving the model (gauge)
- dynamo_frontend_model_kv_cache_block_size: KV cache block size for a worker serving the model (gauge)
- dynamo_frontend_model_migration_limit: Request migration limit for a worker serving the model (gauge)
This section explains the distinction between the two key metrics used to track request processing: dynamo_frontend_inflight_requests and dynamo_frontend_queued_requests.
Example Request Flow:
Timeline:
Concurrency Example: Suppose the backend allows 3 concurrent requests and there are 10 clients continuously hitting the frontend:
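A worked version of this scenario, using only the definitions given in the note on inflight and queued requests: a request that has produced its first token but has not finished is inflight without being queued, and the backend can hold at most 3 such requests at once. The sketch assumes every client always has exactly one outstanding request; the function name is hypothetical.

```python
def steady_state_gauges(clients, backend_concurrency):
    """Return (inflight, minimum queued) when every client has one open request.

    inflight - queued is the number of requests past their first token but
    not yet finished, which cannot exceed the backend's concurrency limit.
    """
    inflight = clients
    min_queued = max(0, clients - backend_concurrency)
    return inflight, min_queued


inflight, min_queued = steady_state_gauges(clients=10, backend_concurrency=3)
print(f"inflight={inflight}, queued>={min_queued}")
```

Under these assumptions, dynamo_frontend_inflight_requests stays at 10 while dynamo_frontend_queued_requests stays at 7 or above, because requests that are still in prefill also count as queued.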
Key Differences: