Observability (Local)

Monitor Dynamo deployments with metrics, logging, and tracing

Getting Started Quickly

This example gets the observability stack running on a single machine.

Prerequisites

Install these on your machine:

  • Docker
  • Docker Compose

Starting the Observability Stack

Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, Loki, an OpenTelemetry Collector, and various exporters for metrics, tracing, logging, and visualization.

From the Dynamo root directory:

# Start infrastructure (NATS, etcd)
docker compose -f deploy/docker-compose.yml up -d

# Start observability stack (Prometheus, Grafana, Tempo, Loki, OpenTelemetry Collector, DCGM GPU exporter, NATS exporter)
docker compose -f deploy/docker-observability.yml up -d

For detailed setup instructions and configuration, see Prometheus + Grafana Setup.
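
Once both Compose files are up, it can help to confirm that the containers are running and to know how to tear them down again. The sketch below wraps the two commands above in a small helper. The function name `dynamo_stack` and the `DRY_RUN` switch are hypothetical conveniences, not part of Dynamo; the helper relies only on the standard `docker compose` subcommands `up -d`, `ps`, and `down`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Dynamo): wraps the Compose commands from this guide.
# Run it from the Dynamo root directory so the relative paths resolve.
INFRA_FILE="deploy/docker-compose.yml"
OBS_FILE="deploy/docker-observability.yml"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

dynamo_stack() {
    case "$1" in
        up)
            run docker compose -f "$INFRA_FILE" up -d &&
            run docker compose -f "$OBS_FILE" up -d
            ;;
        status)
            run docker compose -f "$INFRA_FILE" ps &&
            run docker compose -f "$OBS_FILE" ps
            ;;
        down)
            # Stop the observability services first, then the infrastructure.
            run docker compose -f "$OBS_FILE" down &&
            run docker compose -f "$INFRA_FILE" down
            ;;
        *)
            echo "usage: dynamo_stack up|status|down" >&2
            return 2
            ;;
    esac
}
```

For example, `dynamo_stack status` lists the containers of both Compose projects, and `DRY_RUN=1 dynamo_stack down` prints the teardown commands without running them.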

Observability Documentation

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics | Available metrics reference | DYN_SYSTEM_PORT† |
| Operator Metrics (Kubernetes) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) |
| Health Checks | Component health monitoring and readiness probes | DYN_SYSTEM_PORT†, DYN_SYSTEM_STARTING_HEALTH_STATUS, DYN_SYSTEM_HEALTH_PATH, DYN_SYSTEM_LIVE_PATH, DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS |
| Tracing | Distributed tracing with OpenTelemetry and Tempo | DYN_LOGGING_JSONL†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT†, OTEL_SERVICE_NAME† |
| Logging | Structured logging and OTLP log export to Loki | DYN_LOGGING_JSONL†, DYN_LOG, DYN_LOG_USE_LOCAL_TZ, DYN_LOGGING_CONFIG_PATH, OTEL_SERVICE_NAME†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT†, OTEL_EXPORTER_OTLP_LOGS_ENDPOINT† |

Variables marked with † are shared across multiple observability systems.
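
As an illustration of how the shared variables fit together, the sketch below exports a set of them before launching a Dynamo component. The variable names come from the table above, and the collector port comes from the Topology section. The specific values (port 8081, `true` for the flags, the service name `dynamo-frontend`) are assumptions for illustration only, so check the Metrics, Tracing, and Logging guides for the accepted values.

```shell
#!/bin/sh
# Variable names are from the table above; every value here is an assumed example.
export DYN_SYSTEM_PORT=8081                 # assumed port for the metrics/health endpoints
export DYN_LOGGING_JSONL=true               # assumed value: emit structured JSONL logs
export OTEL_EXPORT_ENABLED=true             # assumed value: turn on OTLP export
export OTEL_SERVICE_NAME=dynamo-frontend    # hypothetical service name

# The OpenTelemetry Collector in the local stack listens on 4317 (gRPC).
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://localhost:4317
```

Because these are exported in the current shell, any component started from the same shell afterwards inherits them.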

Developer Guides

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics Developer Guide | Creating custom metrics in Rust and Python | DYN_SYSTEM_PORT† |

Kubernetes

For Kubernetes-specific setup and configuration, see docs/kubernetes/observability/.

Operator Metrics: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the Operator Metrics Guide.


Topology

The Docker Compose observability stack exposes the following services:

  • Prometheus on http://localhost:9090 - metrics collection and querying
  • Grafana on http://localhost:3000 - visualization dashboards (username: dynamo, password: dynamo)
  • Tempo on http://localhost:3200 - distributed tracing backend
  • Loki on http://localhost:3100 - log aggregation backend
  • OpenTelemetry Collector on http://localhost:4317 (gRPC) / http://localhost:4318 (HTTP) - receives OTLP signals and routes traces to Tempo and logs to Loki
  • DCGM Exporter on http://localhost:9401/metrics - GPU metrics
  • NATS Exporter on http://localhost:7777/metrics - NATS messaging metrics
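
A quick way to confirm that these services are reachable is to probe each one over HTTP. The sketch below is one possible check, not a script shipped with Dynamo: the function names are hypothetical, the ports are taken from the list above, and the readiness paths for Prometheus (`/-/ready`), Grafana (`/api/health`), Tempo (`/ready`), and Loki (`/ready`) are those projects' standard endpoints rather than anything stated in this guide. The OpenTelemetry Collector is left out because its OTLP ports accept telemetry rather than serving a status page.

```shell
#!/bin/sh
# One "name url" pair per line. Ports are from the list above; the readiness
# paths are the upstream projects' standard endpoints (an assumption here).
stack_endpoints() {
    cat <<'EOF'
prometheus http://localhost:9090/-/ready
grafana http://localhost:3000/api/health
tempo http://localhost:3200/ready
loki http://localhost:3100/ready
dcgm-exporter http://localhost:9401/metrics
nats-exporter http://localhost:7777/metrics
EOF
}

# Probe every endpoint; return non-zero if any of them fails to answer.
check_stack() {
    failed=0
    while read -r name url; do
        # -f makes curl fail on HTTP errors; --max-time bounds each probe.
        if curl -sf -o /dev/null --max-time 3 "$url"; then
            echo "ok    $name"
        else
            echo "FAIL  $name ($url)"
            failed=1
        fi
    done <<EOF
$(stack_endpoints)
EOF
    return "$failed"
}
```

Calling `check_stack` after the stack has started prints one line per service. Some services take a while to initialize, so a failure immediately after `up -d` does not necessarily indicate a misconfiguration.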

Service Relationship Diagram

The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, which is common on shared clusters such as those managed by SLURM.
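
To see the remapped port in action, the exporter can be scraped by hand. The following sketch fetches the metrics page from port 9401 and keeps only the samples for one metric. The function names are hypothetical, and `DCGM_FI_DEV_GPU_UTIL` is assumed to be present because it is part of dcgm-exporter's default metric set; this guide does not list the exported metrics.

```shell
#!/bin/sh
# Defaults to the Compose-mapped port from this guide; override to target another instance.
DCGM_EXPORTER_URL="${DCGM_EXPORTER_URL:-http://localhost:9401/metrics}"

# Keep only sample lines for one metric, dropping "# HELP" / "# TYPE" comments.
filter_metric() {
    grep "^$1[{ ]"
}

# Fetch the exporter output and print GPU utilization samples.
# DCGM_FI_DEV_GPU_UTIL is assumed to be enabled (dcgm-exporter default set).
show_gpu_util() {
    curl -sf --max-time 5 "$DCGM_EXPORTER_URL" | filter_metric DCGM_FI_DEV_GPU_UTIL
}
```

Running `show_gpu_util` against the local stack prints one line per GPU. Setting `DCGM_EXPORTER_URL=http://localhost:9400/metrics` points the same helper at a separately running exporter on the default port.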

Configuration Files

The two Compose files are invoked from deploy/ in the commands above; the remaining configuration files are located in the deploy/observability/ directory:

  • docker-compose.yml: Defines NATS and etcd services
  • docker-observability.yml: Defines Prometheus, Grafana, Tempo, Loki, the OpenTelemetry Collector, and exporters
  • prometheus.yml: Contains Prometheus scraping configuration
  • grafana-datasources.yml: Contains Grafana datasource configuration
  • otel-collector.yaml: OpenTelemetry Collector configuration (routes traces to Tempo, logs to Loki)
  • loki.yaml: Loki log aggregation configuration
  • loki-datasource.yml: Grafana Loki datasource with trace ID linking to Tempo
  • grafana_dashboards/dashboard-providers.yml: Contains Grafana dashboard provider configuration
  • grafana_dashboards/dynamo.json: A general Dynamo dashboard covering both software and hardware metrics
  • grafana_dashboards/dcgm-metrics.json: Contains Grafana dashboard configuration for DCGM GPU metrics
  • grafana_dashboards/kvbm.json: Contains Grafana dashboard configuration for KVBM metrics