Dynamo Observability#

Getting Started Quickly#

This example shows how to get observability running quickly on a single machine.

Prerequisites#

Install these on your machine:

Starting the Observability Stack#

Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, and various exporters for metrics, tracing, and visualization.

From the Dynamo root directory:

# Start infrastructure (NATS, etcd)
docker compose -f deploy/docker-compose.yml up -d

# Start observability stack (Prometheus, Grafana, Tempo, DCGM GPU exporter, NATS exporter)
docker compose -f deploy/docker-observability.yml up -d

For detailed setup instructions and configuration, see Prometheus + Grafana Setup.

Observability Documentation#

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics | Available metrics reference | DYN_SYSTEM_PORT† |
| Health Checks | Component health monitoring and readiness probes | DYN_SYSTEM_PORT†, DYN_SYSTEM_STARTING_HEALTH_STATUS, DYN_SYSTEM_HEALTH_PATH, DYN_SYSTEM_LIVE_PATH, DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS |
| Tracing | Distributed tracing with OpenTelemetry and Tempo | DYN_LOGGING_JSONL†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT†, OTEL_SERVICE_NAME† |
| Logging | Structured logging configuration | DYN_LOGGING_JSONL†, DYN_LOG, DYN_LOG_USE_LOCAL_TZ, DYN_LOGGING_CONFIG_PATH, OTEL_SERVICE_NAME†, OTEL_EXPORT_ENABLED†, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT† |

Variables marked with † are shared across multiple observability systems.
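As a rough illustration of how these variables fit together, the sketch below assembles an environment for a single Dynamo process with the system port, JSONL logging, and trace export turned on. The variable names come from the variables listed above. Everything else is an assumption: `build_observability_env` is a hypothetical helper that is not part of Dynamo, port 8081 is borrowed from the backend in the service relationship diagram, 4317 is the conventional OTLP gRPC port, and `"true"` as the enabling value has not been verified against Dynamo.

```python
import os


def build_observability_env(service_name, system_port=8081,
                            otlp_endpoint="http://localhost:4317",
                            base_env=None):
    """Return a copy of base_env with observability variables added.

    All values are illustrative assumptions; check the Metrics, Tracing,
    and Logging guides for the values Dynamo actually accepts.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "DYN_SYSTEM_PORT": str(system_port),   # metrics and health checks
        "DYN_LOGGING_JSONL": "true",           # assumed enabling value
        "OTEL_EXPORT_ENABLED": "true",         # assumed enabling value
        "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": otlp_endpoint,  # assumed endpoint
        "OTEL_SERVICE_NAME": service_name,
    })
    return env


# Start from an empty base so only the observability variables are printed.
env = build_observability_env("dynamo-backend", base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

To launch a component with these settings, the resulting dict can be passed as the `env` argument of `subprocess.Popen`, together with whatever command you normally use to start that component.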

Developer Guides#

| Guide | Description | Environment Variables to Control |
|---|---|---|
| Metrics Developer Guide | Creating custom metrics in Rust and Python | DYN_SYSTEM_PORT |

Kubernetes#

For Kubernetes-specific setup and configuration, see docs/kubernetes/observability/.


Topology#

The observability stack started with the Docker Compose commands above provides:

  • Prometheus on http://localhost:9090 - metrics collection and querying

  • Grafana on http://localhost:3000 - visualization dashboards (username: dynamo, password: dynamo)

  • Tempo on http://localhost:3200 - distributed tracing backend

  • DCGM Exporter on http://localhost:9401/metrics - GPU metrics

  • NATS Exporter on http://localhost:7777/metrics - NATS messaging metrics
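One way to confirm that the stack came up is to probe each of the URLs in the list above. The following is a minimal sketch using only the Python standard library; `is_reachable` and `check_stack` are hypothetical helper names, and "reachable" here only means that the port answered with some HTTP response, not that the service is fully healthy.

```python
import http.client
import urllib.error
import urllib.request

# Host-side URLs taken from the list above.
ENDPOINTS = {
    "prometheus": "http://localhost:9090",
    "grafana": "http://localhost:3000",
    "tempo": "http://localhost:3200",
    "dcgm-exporter": "http://localhost:9401/metrics",
    "nats-exporter": "http://localhost:7777/metrics",
}


def is_reachable(url, timeout=2.0):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, malformed reply, etc.
        return False


def check_stack(endpoints=ENDPOINTS, probe=is_reachable):
    """Map each service name to whether its endpoint responded."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_stack()` after both `docker compose ... up -d` commands have finished returns a dict such as `{"prometheus": True, ...}`. An entry that stays `False` usually points at a container that failed to start or a port that is already in use.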

Service Relationship Diagram#

graph TD
    BROWSER[Browser] -->|:3000| GRAFANA[Grafana :3000]
    subgraph DockerComposeNetwork [Network inside Docker Compose]
        NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
        PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
        PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
        PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
        PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
        DYNAMOFE --> DYNAMOBACKEND
        GRAFANA -->|:9090/query API| PROMETHEUS
    end

The dcgm-exporter service in the Docker Compose network listens on port 9401 rather than the default 9400. This avoids port conflicts with other dcgm-exporter instances that may already be running on the same host, which is common on clusters managed by SLURM.
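The scrape relationships in the diagram can be checked against a running Prometheus through its HTTP API, which lists scrape targets at `/api/v1/targets`. The sketch below groups the active targets by job and reports their health. `summarize_targets` and `fetch_targets` are hypothetical helper names, and the job names in the sample payload are made up for illustration; the real names depend on the Prometheus configuration shipped with the stack.

```python
import json
import urllib.request


def summarize_targets(payload):
    """Map each scrape job to the health values of its active targets."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        summary.setdefault(job, []).append(target["health"])
    return summary


def fetch_targets(prometheus_url="http://localhost:9090"):
    """Fetch the target list from the Prometheus HTTP API."""
    url = f"{prometheus_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the shape of a /api/v1/targets response.
# Job and instance names are hypothetical.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "dcgm-exporter", "instance": "dcgm-exporter:9401"},
             "health": "up"},
            {"labels": {"job": "dynamo-backend", "instance": "host:8081"},
             "health": "down"},
        ],
        "droppedTargets": [],
    },
}
print(summarize_targets(sample))
```

Against a live stack, `summarize_targets(fetch_targets())` gives the same kind of summary. A job reported as `down` typically means the corresponding Dynamo process or exporter is not running, or is not listening on the port shown in the diagram.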

Configuration Files#

The following configuration files are located in the deploy/observability/ directory: