Observability (Local)

Monitor Dynamo deployments with metrics, logging, and tracing

Required environment variables

Set these on every Dynamo process (frontend, router, workers) for metrics, traces, and logs to flow:

| Variable | Purpose | Required |
|---|---|---|
| DYN_SYSTEM_PORT=8081 | Unified system port (metrics + health). | Yes for metrics. |
| OTEL_EXPORT_ENABLED=true | Enable OpenTelemetry export. Without this, traces and logs never leave the process: Loki and Tempo will show nothing even if they are healthy. | Yes for traces/logs. |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | OTLP gRPC endpoint for traces (e.g. http://tempo:4317). Must be a gRPC listener: Dynamo's exporter does not speak OTLP/HTTP, even though the OTel Collector also listens on :4318. | Yes for traces. |
| OTEL_EXPORTER_OTLP_LOGS_ENDPOINT | OTLP gRPC endpoint for logs (e.g. http://loki-otlp:4317). Same gRPC-only constraint as the traces endpoint above. | Yes for logs. |
| DYN_LOGGING_JSONL=true | Structured JSON log output (recommended for Loki). | Optional. |

Source of truth: lib/runtime/src/logging.rs setup_logging().
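As a rough sketch, the variables in the table can be exported in the shell that launches each Dynamo process. The endpoint values are the examples from the table and assume the hostnames `tempo` and `loki-otlp` resolve from where the process runs (for instance, inside the Compose network). `warn_if_otlp_http` is a hypothetical helper written for this example, not part of Dynamo; it only flags endpoints that end in the collector's OTLP/HTTP port.

```shell
# Example values only, taken from the table above; adjust hosts to your setup.
export DYN_SYSTEM_PORT=8081
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://loki-otlp:4317
export DYN_LOGGING_JSONL=true   # optional, recommended for Loki

# Hypothetical helper (not part of Dynamo): return non-zero and warn when an
# endpoint points at port 4318, which is the collector's OTLP/HTTP listener.
warn_if_otlp_http() {
  case "$1" in
    *:4318|*:4318/*)
      echo "warning: $1 looks like an OTLP/HTTP endpoint; use the gRPC port" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# Both example endpoints use the gRPC port, so these calls succeed silently.
warn_if_otlp_http "$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
warn_if_otlp_http "$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT"
```

The check is deliberately simple: it inspects only the port suffix and cannot confirm that a gRPC listener is actually present at the address.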

Passing --enable-metrics on an individual backend exposes metrics for that backend only. The unified frontend metrics surface (scraped by Prometheus) requires DYN_SYSTEM_PORT to be set on the frontend process as well; setting it on workers alone is not enough.

Prometheus metric families in Dynamo are registered lazily: each label set is created the first time it fires, so a freshly started process shows empty metric families until the first relevant request. This is expected: empty metric families on an idle cluster do not mean scraping is broken.
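One way to tell an idle process from a broken scrape is to count the samples in the metrics output rather than checking only that the endpoint answers. The sketch below defines a hypothetical helper, `count_samples`, that counts sample lines in Prometheus text exposition format read from stdin. The demonstration input is made-up text with an illustrative metric name, not real Dynamo output.

```shell
# Hypothetical helper: count Prometheus samples on stdin, ignoring
# "# HELP" / "# TYPE" comment lines and blank lines.
count_samples() {
  awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Offline demonstration with made-up exposition text: two comment lines
# and one sample line, so the helper prints 1.
printf '%s\n' \
  '# HELP example_requests_total Illustrative counter' \
  '# TYPE example_requests_total counter' \
  'example_requests_total{model="demo"} 3' | count_samples
```

Against a live process, the same helper could be fed from something like `curl -s "http://localhost:${DYN_SYSTEM_PORT}/metrics" | count_samples`, assuming the system port serves Prometheus text at `/metrics` (the path is an assumption here). A reachable endpoint with few or no samples before the first request is consistent with lazy registration, whereas a connection error points at the port setting or the process itself.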

Getting Started Quickly

This example shows how to get started quickly on a single machine.

Prerequisites

Install these on your machine:

Starting the Observability Stack

Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, Loki, an OpenTelemetry Collector, and various exporters for metrics, tracing, logging, and visualization.

From the Dynamo root directory:

# Start infrastructure (NATS, etcd)
$ docker compose -f dev/docker-compose.yml up -d

# Start observability stack (Prometheus, Grafana, Tempo, Loki, OpenTelemetry Collector, DCGM GPU exporter, NATS exporter)
$ docker compose -f dev/docker-observability.yml up -d

For detailed setup instructions and configuration, see Prometheus + Grafana Setup.

Observability Documentation

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics | Available metrics reference | DYN_SYSTEM_PORT |
| Operator Metrics (Kubernetes) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) |
| Health Checks | Component health monitoring and readiness probes | DYN_SYSTEM_PORT†, DYN_SYSTEM_STARTING_HEALTH_STATUS, DYN_SYSTEM_HEALTH_PATH, DYN_SYSTEM_LIVE_PATH, DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS |
| Tracing | Distributed tracing with OpenTelemetry and Tempo | DYN_LOGGING_JSONL†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT†, OTEL_SERVICE_NAME† |
| Logging | Structured logging and OTLP log export to Loki | DYN_LOGGING_JSONL†, DYN_LOG, DYN_LOG_USE_LOCAL_TZ, DYN_LOGGING_CONFIG_PATH, OTEL_SERVICE_NAME†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT†, OTEL_EXPORTER_OTLP_LOGS_ENDPOINT |

Variables marked with † are shared across multiple observability systems.

Developer Guides

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics Developer Guide | Creating custom metrics in Rust and Python | DYN_SYSTEM_PORT |
| Local Resource Monitor | Per-process VRAM / PCIe / CPU exporter for engine-startup profiling (200 ms scrape, profile-gated) | N/A (host-side script) |

Kubernetes

For Kubernetes-specific setup and configuration, see docs/kubernetes/observability/.

Operator Metrics: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the Operator Metrics Guide.


Topology

The observability stack provides:

  • Prometheus on http://localhost:9090 - metrics collection and querying
  • Grafana on http://localhost:3000 - visualization dashboards (username: dynamo, password: dynamo)
  • Tempo on http://localhost:3200 - distributed tracing backend
  • Loki on http://localhost:3100 - log aggregation backend
  • OpenTelemetry Collector on http://localhost:4317 (gRPC) / http://localhost:4318 (HTTP) - receives OTLP signals and routes traces to Tempo and logs to Loki
  • DCGM Exporter on http://localhost:9401/metrics - GPU metrics
  • NATS Exporter on http://localhost:7777/metrics - NATS messaging metrics
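After the stack starts, a quick probe of each HTTP service can confirm that the ports above are answering. The following is a minimal sketch, not a script shipped with Dynamo: `stack_endpoints` and `check_stack` are hypothetical names, the ports come from the list above, and the readiness paths are each project's standard routes under an assumed default configuration. The OpenTelemetry Collector is left out because its listed ports are OTLP receivers rather than status pages.

```shell
# Readiness URLs for the HTTP services listed above (one "name url" per line).
# Paths assume each service runs with its default configuration.
stack_endpoints() {
  cat <<'EOF'
prometheus http://localhost:9090/-/ready
grafana http://localhost:3000/api/health
tempo http://localhost:3200/ready
loki http://localhost:3100/ready
dcgm-exporter http://localhost:9401/metrics
nats-exporter http://localhost:7777/metrics
EOF
}

# Probe every endpoint once and print "<name> up" or "<name> down".
check_stack() {
  stack_endpoints | while read -r name url; do
    if curl -fsS --max-time 2 -o /dev/null "$url" 2>/dev/null; then
      echo "$name up"
    else
      echo "$name down"
    fi
  done
}
```

Running `check_stack` shortly after `docker compose ... up -d` may report some services as down while they are still initializing; repeating the call after a few seconds is usually enough.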

Service Relationship Diagram

The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, a situation that is common on shared clusters such as those managed by SLURM.

Configuration Files

The following configuration files are located in the dev/observability/ directory: