Observability (Local)

Monitor Dynamo deployments with metrics, logging, and tracing

Getting Started Quickly

This example brings up the observability stack on a single machine.

Prerequisites

The commands below require Docker with Docker Compose installed on your machine.

Starting the Observability Stack

Dynamo provides a Docker Compose-based observability stack for metrics, tracing, logging, and visualization. It includes Prometheus, Grafana, Tempo, Loki, an OpenTelemetry Collector, and exporters for GPU and NATS metrics.

From the Dynamo root directory:

# Start infrastructure (NATS, etcd)
$ docker compose -f deploy/docker-compose.yml up -d

# Start observability stack (Prometheus, Grafana, Tempo, Loki, OpenTelemetry Collector, DCGM GPU exporter, NATS exporter)
$ docker compose -f deploy/docker-observability.yml up -d
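
Once both stacks are started, it can help to check their status or shut them down using the same Compose files. The following is a minimal sketch, assuming it is run from the Dynamo root directory; the wrapper names `infra_compose` and `obs_compose` are hypothetical helpers introduced here, not part of Dynamo.

```shell
# Sketch only: thin wrappers around the two Compose files used above.
# Run from the Dynamo root directory so the relative paths resolve.

# Forward any docker compose subcommand to the infrastructure stack.
infra_compose() {
  docker compose -f deploy/docker-compose.yml "$@"
}

# Forward any docker compose subcommand to the observability stack.
obs_compose() {
  docker compose -f deploy/docker-observability.yml "$@"
}
```

With these defined, `obs_compose ps` lists the observability containers and their state, and `obs_compose down` followed by `infra_compose down` stops both stacks.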

For detailed setup instructions and configuration, see Prometheus + Grafana Setup.

Observability Documentation

Guide	Description	Environment Variables to Control
Metrics	Available metrics reference	`DYN_SYSTEM_PORT`†
Operator Metrics (Kubernetes)	Operator controller and webhook metrics for Kubernetes	N/A (configured via Helm)
Health Checks	Component health monitoring and readiness probes	`DYN_SYSTEM_PORT`†, `DYN_SYSTEM_STARTING_HEALTH_STATUS`, `DYN_SYSTEM_HEALTH_PATH`, `DYN_SYSTEM_LIVE_PATH`, `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS`
Tracing	Distributed tracing with OpenTelemetry and Tempo	`DYN_LOGGING_JSONL`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_SERVICE_NAME`†
Logging	Structured logging and OTLP log export to Loki	`DYN_LOGGING_JSONL`†, `DYN_LOG`, `DYN_LOG_USE_LOCAL_TZ`, `DYN_LOGGING_CONFIG_PATH`, `OTEL_SERVICE_NAME`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT`†

Variables marked with † are shared across multiple observability systems.
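
As a rough illustration of how the shared variables fit together, the sketch below enables JSONL logging and OTLP export for a single component before it is launched. The variable names come from the table above. The values are assumptions for illustration only: `true` as the enabling value, the `dynamo-frontend` service name, port `8081`, and the use of the local collector's gRPC port (4317, listed under Topology) for both endpoints. Check the individual guides for the accepted values.

```shell
# Sketch only: variable names are from the table above, values are assumptions.

# Shared logging/tracing switches (assumed to accept "true").
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=true

# Hypothetical service name used to label traces and logs.
export OTEL_SERVICE_NAME=dynamo-frontend

# Assumed to point at the local OpenTelemetry Collector's OTLP gRPC port.
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://localhost:4317

# Hypothetical port for the component's system endpoints (metrics, health).
export DYN_SYSTEM_PORT=8081
```

Because these are plain environment variables, they only affect processes started from the same shell after the exports.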

Developer Guides

Guide	Description	Environment Variables to Control
Metrics Developer Guide	Creating custom metrics in Rust and Python	`DYN_SYSTEM_PORT`†

Kubernetes

For Kubernetes-specific setup and configuration, see docs/kubernetes/observability/.

Operator Metrics: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the Operator Metrics Guide.

Topology

The observability stack exposes the following services:

Prometheus on http://localhost:9090 - metrics collection and querying
Grafana on http://localhost:3000 - visualization dashboards (username: dynamo, password: dynamo)
Tempo on http://localhost:3200 - distributed tracing backend
Loki on http://localhost:3100 - log aggregation backend
OpenTelemetry Collector on http://localhost:4317 (gRPC) / http://localhost:4318 (HTTP) - receives OTLP signals and routes traces to Tempo and logs to Loki
DCGM Exporter on http://localhost:9401/metrics - GPU metrics
NATS Exporter on http://localhost:7777/metrics - NATS messaging metrics
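
A quick way to confirm that the services listed above are up is to probe each URL over HTTP. The sketch below is one possible approach, assuming `curl` is installed; the function names `list_endpoints` and `probe_endpoints` are hypothetical helpers, not Dynamo tooling. The OpenTelemetry Collector is left out because its ports expect OTLP traffic rather than plain HTTP requests.

```shell
# Sketch only: URLs come from the list above; probing them with curl is an
# illustrative choice, not an official Dynamo tool.

# Print "name url" pairs for the HTTP endpoints listed above.
list_endpoints() {
  cat <<'EOF'
prometheus http://localhost:9090
grafana http://localhost:3000
tempo http://localhost:3200
loki http://localhost:3100
dcgm-exporter http://localhost:9401/metrics
nats-exporter http://localhost:7777/metrics
EOF
}

# Probe each endpoint and report whether anything answered over HTTP.
# Any HTTP response (even a redirect or a 404) counts as "reachable";
# a refused connection or a timeout counts as "NOT reachable".
probe_endpoints() {
  list_endpoints | while read -r name url; do
    if curl --silent --output /dev/null --max-time 2 "$url"; then
      echo "$name reachable at $url"
    else
      echo "$name NOT reachable at $url"
    fi
  done
}
```

Calling `probe_endpoints` after the stack has started prints one line per service. A reachable port only shows that something is listening, so it does not replace the health checks described in the Health Checks guide.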

Service Relationship Diagram

The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, which is common on shared clusters such as those managed by SLURM.
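
To check that the exporter on port 9401 is the one serving GPU metrics, the scraped output can be inspected directly. The following sketch counts DCGM samples in Prometheus text-format input; `count_dcgm_samples` is a hypothetical helper, and the `DCGM_` metric-name prefix is an assumption about the exporter's output.

```shell
# Sketch only: counts metric samples whose names start with "DCGM_"
# (assumed prefix) in Prometheus text-format input read from stdin.
# "# HELP" and "# TYPE" comment lines are skipped because they start with "#".
count_dcgm_samples() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # function's exit status at 0 while still printing the count.
  grep -c '^DCGM_' || true
}
```

For example, `curl -s http://localhost:9401/metrics | count_dcgm_samples` prints the number of DCGM samples currently exposed; a result of 0 suggests the exporter is not running or no GPUs are visible to it.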

Configuration Files

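The stack uses the following configuration files. Most are located in the deploy/observability/ directory; the two Compose files are referenced as deploy/docker-compose.yml and deploy/docker-observability.yml in the commands above: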

docker-compose.yml: Defines NATS and etcd services
docker-observability.yml: Defines Prometheus, Grafana, Tempo, Loki, the OpenTelemetry Collector, and exporters
prometheus.yml: Contains Prometheus scraping configuration
grafana-datasources.yml: Contains Grafana datasource configuration
otel-collector.yaml: OpenTelemetry Collector configuration (routes traces to Tempo, logs to Loki)
loki.yaml: Loki log aggregation configuration
loki-datasource.yml: Grafana Loki datasource with trace ID linking to Tempo
grafana_dashboards/dashboard-providers.yml: Contains Grafana dashboard provider configuration
grafana_dashboards/dynamo.json: Contains a general Dynamo dashboard for both software and hardware metrics
grafana_dashboards/dcgm-metrics.json: Contains Grafana dashboard configuration for DCGM GPU metrics
grafana_dashboards/kvbm.json: Contains Grafana dashboard configuration for KVBM metrics
