vLLM Prometheus Metrics#

Overview#

When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo’s /metrics endpoint (served on the port set by DYN_SYSTEM_PORT; 8081 in the examples below). This allows you to access both vLLM engine metrics (prefixed with vllm:) and Dynamo runtime metrics (prefixed with dynamo_) from a single worker endpoint.
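
For example, once a worker is running with DYN_SYSTEM_PORT=8081 (as in the quickstart below), you can pull both metric families from the same endpoint:

# vLLM engine metrics and Dynamo runtime metrics from a single worker endpoint
curl -s localhost:8081/metrics | grep -E "^(vllm:|dynamo_)"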

For the complete and authoritative list of all vLLM metrics, always refer to the official vLLM Metrics Design documentation.

For LMCache metrics and integration, see the LMCache Integration Guide.

For Dynamo runtime metrics, see the Dynamo Metrics Guide.

For visualization setup instructions, see the Prometheus and Grafana Setup Guide.

Environment Variables and Flags#

| Variable/Flag | Description | Default | Example |
|---------------|-------------|---------|---------|
| DYN_SYSTEM_PORT | System metrics/health port. Required to expose the /metrics endpoint. | -1 (disabled) | 8081 |
| --connector | KV connector to use. Use lmcache to enable LMCache metrics. | nixl | --connector lmcache |

Getting Started Quickly#

This is a single-machine example; all components run on localhost.

Start Observability Stack#

For visualizing metrics with Prometheus and Grafana, start the observability stack. See Observability Getting Started for instructions.
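
If you prefer to point an existing Prometheus instance at the worker yourself instead of using the bundled stack, a minimal scrape configuration targeting the metrics port might look like the following. The job name and file path are illustrative assumptions, not part of the shipped stack:

# Write a minimal Prometheus scrape config targeting the worker's metrics endpoint
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: 'dynamo-vllm-worker'    # illustrative name
    static_configs:
      - targets: ['localhost:8081']   # DYN_SYSTEM_PORT; /metrics is the default path
EOF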

Launch Dynamo Components#

Launch a frontend and vLLM backend to test metrics:

# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend

# In a separate terminal, start a vLLM worker with the system metrics server enabled on port 8081
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name> \
   --enforce-eager --no-enable-prefix-caching --max-num-seqs 3

Wait for the vLLM worker to start, then send requests and check metrics:

# Send a request
curl -H 'Content-Type: application/json' \
-d '{
  "model": "<model_name>",
  "max_completion_tokens": 100,
  "messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions

# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^vllm:"

Exposed Metrics#

vLLM exposes metrics in Prometheus Exposition Format text at the /metrics HTTP endpoint. All vLLM engine metrics use the vllm: prefix and include labels (e.g., model_name, finished_reason, scheduling_event) to identify the source.

Example Prometheus Exposition Format text:

# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0

# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
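
Histogram metrics can be read directly from their _sum and _count series: in the example above, the mean time to first token is 89.38 / 165 ≈ 0.54 seconds.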

Note: The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual /metrics endpoint or refer to the official documentation for the current list.

Metric Categories#

vLLM provides metrics in the following categories (all prefixed with vllm:):

  • Request metrics - Request success, failure, and completion tracking

  • Performance metrics - Latency, throughput, and timing measurements

  • Resource usage - System resource consumption

  • Scheduler metrics - Scheduling and queue management

  • Disaggregation metrics - Metrics specific to disaggregated deployments (when enabled)

Note: Specific metrics are subject to change between vLLM versions. Always refer to the official documentation or inspect the /metrics endpoint for your vLLM version.
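
To see exactly which metric names your vLLM version exposes, you can list the HELP lines from the running worker (assuming DYN_SYSTEM_PORT=8081 as in the quickstart):

# List the vLLM metric names exposed by the running worker
curl -s localhost:8081/metrics | grep "^# HELP vllm:" | awk '{print $3}' | sort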

Available Metrics#

The official vLLM documentation includes complete metric definitions with:

  • Detailed explanations and design rationale

  • Counter, Gauge, and Histogram metric types

  • Metric labels (e.g., model_name, finished_reason, scheduling_event)

  • Information about v1 metrics migration

  • Future work and deprecated metrics

For the complete and authoritative list of all vLLM metrics, see the official vLLM Metrics Design documentation.

LMCache Metrics#

When LMCache is enabled with --connector lmcache and DYN_SYSTEM_PORT is set, LMCache metrics (prefixed with lmcache:) are automatically exposed via Dynamo’s /metrics endpoint alongside vLLM and Dynamo metrics.

Minimum Requirements#

To access LMCache metrics, both of these are required:

  1. --connector lmcache - Enables LMCache in vLLM

  2. DYN_SYSTEM_PORT=8081 - Enables Dynamo’s metrics HTTP endpoint

Example:

DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector lmcache

Viewing LMCache Metrics#

# View all LMCache metrics
curl -s localhost:8081/metrics | grep "^lmcache:"

For complete LMCache configuration and metric details, see the LMCache Integration Guide.

Implementation Details#

  • vLLM v1 uses multiprocess metrics collection via prometheus_client.multiprocess

  • PROMETHEUS_MULTIPROC_DIR (optional): By default, Dynamo manages this environment variable automatically, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, and they are aggregated when /metrics is scraped. You only need to set this explicitly when you require complete control over the metrics directory; a minimal example of doing so is shown after this list.

  • Dynamo uses MultiProcessCollector to aggregate metrics from all worker processes

  • Metrics are filtered by the vllm: prefix (and the lmcache: prefix when LMCache is enabled) before being exposed

  • The integration uses Dynamo’s register_engine_metrics_callback() function with the global REGISTRY

  • Metrics appear after vLLM engine initialization completes

  • vLLM v1 metrics are different from v0 - see the official documentation for migration details
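
As a minimal sketch of the optional override mentioned above, you can point the multiprocess directory at a path you control before launching the worker. The path below is an arbitrary example; prometheus_client expects the directory to exist and to be empty when the processes start:

# Optional: take explicit control of the multiprocess metrics directory
export PROMETHEUS_MULTIPROC_DIR=/tmp/dynamo-prom-metrics   # example path
rm -rf "$PROMETHEUS_MULTIPROC_DIR" && mkdir -p "$PROMETHEUS_MULTIPROC_DIR"
DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name>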