Copyright © 2026, NVIDIA Corporation.


Prometheus


Overview

When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo’s /metrics endpoint. The endpoint is served on the port set by DYN_SYSTEM_PORT (8081 in the example launch scripts). This lets you read both vLLM engine metrics (prefixed with vllm:) and Dynamo runtime metrics (prefixed with dynamo_*) from a single worker backend endpoint.

For the complete and authoritative list of all vLLM metrics, always refer to the official vLLM Metrics Design documentation.

For LMCache metrics and integration, see the LMCache Integration Guide.

For Dynamo runtime metrics, see the Dynamo Metrics Guide.

For visualization setup instructions, see the Prometheus and Grafana Setup Guide.

Environment Variables and Flags

| Variable | Description | Default | Example |
|---|---|---|---|
| DYN_SYSTEM_PORT | System metrics/health port. Required to expose the /metrics endpoint. | -1 (disabled) | 8081 |
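
As a rough illustration of how the variable controls the endpoint, the sketch below derives the scrape URL from DYN_SYSTEM_PORT. The helper name `metrics_url` and the `localhost` default are hypothetical, and treating every negative value (not only -1) as disabled is an assumption made for this example.

```python
import os


def metrics_url(env=None, host="localhost"):
    """Return the /metrics URL, or None when the system port is disabled.

    Hypothetical helper for illustration; not part of Dynamo.
    """
    env = os.environ if env is None else env
    # -1 is the documented default and means the endpoint is disabled.
    port = int(env.get("DYN_SYSTEM_PORT", "-1"))
    if port < 0:  # assumption: any negative value counts as disabled
        return None
    return f"http://{host}:{port}/metrics"


print(metrics_url({"DYN_SYSTEM_PORT": "8081"}))  # http://localhost:8081/metrics
print(metrics_url({}))                           # None
```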

Getting Started Quickly

The steps below run everything on a single machine.

Start Observability Stack

For visualizing metrics with Prometheus and Grafana, start the observability stack. See Observability Getting Started for instructions.

Launch Dynamo Components

The launch scripts in examples/backends/vllm/launch/ already enable metrics on port 8081 by default. For example:

$ cd $DYNAMO_HOME/examples/backends/vllm
$ bash launch/agg.sh

Once the deployment is running, send a request and check metrics:

$ curl -s localhost:8081/metrics | grep "^vllm:"
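
The same check can be scripted. The following is a minimal sketch, assuming the endpoint is reachable at localhost:8081 as in the launch scripts; `fetch_metrics` and `filter_metrics` are hypothetical helper names, and `dynamo_placeholder_metric` is a made-up name used only to show a non-matching line.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://localhost:8081/metrics", timeout=5.0):
    """Download the raw exposition text (requires a running deployment)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def filter_metrics(text, prefix):
    """Keep lines starting with the prefix, like grep "^prefix"."""
    return [line for line in text.splitlines() if line.startswith(prefix)]


# Offline demonstration on a small sample instead of a live endpoint.
sample = (
    "# HELP vllm:request_success_total Number of successfully finished requests.\n"
    'vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0\n'
    "dynamo_placeholder_metric 1.0\n"  # hypothetical name, illustration only
)
vllm_lines = filter_metrics(sample, "vllm:")
for line in vllm_lines:
    print(line)
```

Against a live deployment, `filter_metrics(fetch_metrics(), "vllm:")` would return the same lines the curl command prints. Comment lines starting with `#` are excluded in both cases because they do not begin with the prefix.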

Exposed Metrics

vLLM exposes metrics in Prometheus Exposition Format text at the /metrics HTTP endpoint. All vLLM engine metrics use the vllm: prefix and include labels (e.g., model_name, finished_reason, scheduling_event) to identify the source.

Example Prometheus Exposition Format text:

# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38

Note: The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual /metrics endpoint or refer to the official documentation for the current list.
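
To show how these samples can be consumed, here is a small sketch that parses the example text above and derives two summary values: the total number of successful requests and the mean time to first token (histogram `_sum` divided by `_count`). The parser is deliberately simplified and is an assumption of this example: it ignores optional timestamps and escaped quotes in label values, so it is not a full implementation of the exposition format.

```python
import re

SAMPLE = """\
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
"""

# name, optional {labels}, then the value; timestamps are not handled.
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def sum_by_name(samples, name):
    """Sum a metric across all of its label combinations."""
    return sum(value for n, _, value in samples if n == name)


samples = parse_samples(SAMPLE)
total_success = sum_by_name(samples, "vllm:request_success_total")
ttft_sum = sum_by_name(samples, "vllm:time_to_first_token_seconds_sum")
ttft_count = sum_by_name(samples, "vllm:time_to_first_token_seconds_count")
mean_ttft = ttft_sum / ttft_count

print(total_success)        # 165.0
print(f"{mean_ttft:.3f}")   # 0.542
```

Histogram buckets are cumulative, so the `le="0.005"` sample means that 5 requests had a time to first token of at most 5 ms.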

Metric Categories

vLLM provides metrics in the following categories (all prefixed with vllm:):

  • Request metrics - Request success, failure, and completion tracking
  • Performance metrics - Latency, throughput, and timing measurements
  • Resource usage - System resource consumption
  • Scheduler metrics - Scheduling and queue management
  • Disaggregation metrics - Metrics specific to disaggregated deployments (when enabled)

Note: Specific metrics are subject to change between vLLM versions. Always refer to the official documentation or inspect the /metrics endpoint for your vLLM version.

Available Metrics

The official vLLM documentation includes complete metric definitions with:

  • Detailed explanations and design rationale
  • Counter, Gauge, and Histogram metric types
  • Metric labels (e.g., model_name, finished_reason, scheduling_event)
  • Information about v1 metrics migration
  • Future work and deprecated metrics

For the complete and authoritative list of all vLLM metrics, see the official vLLM Metrics Design documentation.

LMCache Metrics

When LMCache is enabled, LMCache metrics (prefixed with lmcache:) are automatically exposed via Dynamo’s /metrics endpoint alongside vLLM and Dynamo metrics.

To try it out, use the LMCache launch script:

$ cd $DYNAMO_HOME/examples/backends/vllm
$ bash launch/agg_lmcache.sh

Send a request and view LMCache metrics:

$ curl -s localhost:8081/metrics | grep "^lmcache:"
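
A quick way to confirm that all three metric families share the endpoint is to count sample lines per prefix. The sketch below does this on a hand-written sample; `count_by_prefix` is a hypothetical helper, and the `lmcache:` and `dynamo_` metric names in the sample are placeholders, not real metric names.

```python
PREFIXES = ("vllm:", "lmcache:", "dynamo_")


def count_by_prefix(text, prefixes=PREFIXES):
    """Count sample lines per prefix, ignoring comments and blank lines."""
    counts = {prefix: 0 for prefix in prefixes}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        for prefix in prefixes:
            if line.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


# Placeholder metric names for lmcache: and dynamo_ (illustration only).
sample = (
    "# TYPE vllm:request_success_total counter\n"
    'vllm:request_success_total{finished_reason="stop"} 150.0\n'
    "lmcache:placeholder_metric_a 3.0\n"
    "lmcache:placeholder_metric_b 7.0\n"
    "dynamo_placeholder_metric 1.0\n"
)
counts = count_by_prefix(sample)
print(counts)  # {'vllm:': 1, 'lmcache:': 2, 'dynamo_': 1}
```

A count of zero for `lmcache:` on a real deployment would suggest that LMCache is not enabled or has not yet recorded any metrics.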

Troubleshooting

Troubleshooting for LMCache-related metrics and logs, including the "PrometheusLogger instance already created with different metadata" message and PROMETHEUS_MULTIPROC_DIR warnings, is documented in:

  • LMCache Integration Guide

For complete LMCache configuration and metric details, see:

  • LMCache Integration Guide - Setup and configuration
  • LMCache Observability Documentation - Complete metrics reference

Implementation Details

  • vLLM v1 uses multiprocess metrics collection via prometheus_client.multiprocess
  • PROMETHEUS_MULTIPROC_DIR (optional): By default, Dynamo manages this environment variable automatically, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, and the files are aggregated when /metrics is scraped. Set it explicitly only if you need full control over the metrics directory.
  • Dynamo uses MultiProcessCollector to aggregate metrics from all worker processes
  • Metrics are filtered by prefix before being exposed: vllm: always, and lmcache: when LMCache is enabled
  • The integration uses Dynamo’s register_engine_metrics_callback() function with the global REGISTRY
  • Metrics appear after vLLM engine initialization completes
  • vLLM v1 metrics differ from v0; see the official documentation for migration details
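
The per-process-file pattern described in the list above can be illustrated with a toy, stdlib-only model. This is not how prometheus_client stores data (it uses memory-mapped files and MultiProcessCollector); the JSON files, function names, and process IDs below are assumptions chosen only to show the idea of one file per worker, merged and prefix-filtered at scrape time.

```python
import json
import os
import tempfile


def write_worker_metrics(directory, pid, counters):
    """Simulate one worker writing its counters to its own file."""
    path = os.path.join(directory, f"counter_{pid}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(counters, f)


def aggregate(directory, prefixes=("vllm:", "lmcache:")):
    """Sum counters across all worker files, keeping only allowed prefixes."""
    totals = {}
    for filename in sorted(os.listdir(directory)):
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            for name, value in json.load(f).items():
                if name.startswith(prefixes):  # str.startswith accepts a tuple
                    totals[name] = totals.get(name, 0.0) + value
    return totals


# The temporary directory stands in for PROMETHEUS_MULTIPROC_DIR.
with tempfile.TemporaryDirectory() as multiproc_dir:
    write_worker_metrics(
        multiproc_dir, 101, {"vllm:request_success_total": 100.0, "other_metric": 1.0}
    )
    write_worker_metrics(multiproc_dir, 102, {"vllm:request_success_total": 65.0})
    totals = aggregate(multiproc_dir)

print(totals)  # {'vllm:request_success_total': 165.0}
```

Because each worker owns its file, writers never contend with one another, and the cost of merging is paid only when the endpoint is scraped.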

Related Documentation

vLLM Metrics

  • Official vLLM Metrics Design Documentation
  • vLLM Production Metrics User Guide
  • vLLM GitHub - Metrics Implementation

Dynamo Metrics

  • Dynamo Metrics Guide - Complete documentation on Dynamo runtime metrics
  • Prometheus and Grafana Setup - Visualization setup instructions
  • Dynamo runtime metrics (prefixed with dynamo_*) are available at the same /metrics endpoint alongside vLLM metrics
    • Implementation: lib/runtime/src/metrics.rs (Rust runtime metrics)
    • Metric names: lib/runtime/src/metrics/prometheus_names.rs (metric name constants)
    • Integration code: components/src/dynamo/common/utils/prometheus.py - Prometheus utilities and callback registration