# Telemetry and Observability

AI data center observability follows a similar model to traditional data
centers, but the scale and complexity grow meaningfully. Specific
challenges include:

* GPUs and accelerators generating high-volume telemetry.
* Diverse high-speed networks (Ethernet, InfiniBand, NVLink) requiring
  unified monitoring.
* Higher data throughput demanding efficient collection and storage.
* Large-scale training coordination requiring correlated signals across
  thousands of GPUs.

An AI data center therefore requires a unified observability solution
that provides visibility into operations, sustains high performance at
scale, and helps meet tenant SLAs.

![Observability Implementation Architecture](https://files.buildwithfern.com/nvidia-dsx.docs.buildwithfern.com/dsx/056004fdfb0043113b71c51dfd78e5db01b7d67883649722404f50218dd8b19e/_dot_dot_/docs/guides/ncp-software-reference-guide/assets/images/ncp-srg-observability-arch.png)

The architecture relies on three signal types: **logs** (event records
from applications, system services, and hardware), **metrics**
(measurements of system behavior such as latency, throughput, and GPU
utilization), and **traces** (end-to-end path of requests or operations
across distributed systems). These signals are collected across the data
center, normalized, and correlated through an OpenTelemetry (OTel)
pipeline. Correlation via timestamps, resource IDs, and trace IDs is
critical to relate signals to each other and accurately attribute them
to tenants or services. The architecture is designed to be
vendor-neutral, allowing partners to instrument their systems while
balancing operational speed with long-term storage costs.
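Correlation by shared identifiers can be sketched in a few lines. The record shapes and field names (`trace_id`, `tenant`, `kind`) below are illustrative assumptions, not an OpenTelemetry API:

```python
from collections import defaultdict

def correlate(signals):
    """Index mixed log/span records by trace ID so all signals for one
    request can be viewed together and attributed to a tenant."""
    by_trace = defaultdict(list)
    for rec in signals:
        by_trace[rec["trace_id"]].append(rec)
    return dict(by_trace)

# Hypothetical records sharing a trace ID across signal types.
logs = [{"trace_id": "t1", "kind": "log", "tenant": "acme", "msg": "OOM on gpu3"}]
spans = [{"trace_id": "t1", "kind": "span", "tenant": "acme", "op": "all_reduce"}]
grouped = correlate(logs + spans)
# grouped["t1"] now holds both records for the same request.
```

In practice, metrics often lack trace IDs, so they are joined instead on resource IDs and timestamps; the same grouping idea applies with a different key.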

* **Data Generation and Collection**: Data originates from three
  sources: applications (metrics and traces via OpenTelemetry SDK),
  infrastructure (system logs and GPU telemetry via DCGM Exporter), and
  network equipment (fabric health via gNMI/OpenConfig). An OTel
  Collector Agent runs on every node, performing local batching and
  enrichment before forwarding to the gateway via OTLP.
* **Ingestion and Processing**: The OTel Collector Gateway provides
  centralized processing: filtering, sampling, transformations, and
  fan-out routing to multiple backends. An event-streaming platform
  (Kafka) buffers data between the gateway and storage, preventing
  backend overload during traffic spikes from large GPU clusters.
* **Storage Backends**: Storage splits into hot and cold paths. The hot
  path uses specialized stores (Loki for logs, Tempo for traces,
  Prometheus for metrics) for real-time monitoring, alerting, and
  incident response, with data retained for one to two weeks. The cold
  path writes to a Telemetry Data Lake (Parquet on object storage) for
  long-term analytics, capacity planning, and historical investigations,
  with data retained for months to years.
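The per-node agent's batch-and-enrich step might look like the following minimal sketch. The `AgentBatcher` class and its `forward` callback are hypothetical stand-ins for an OTel Collector agent exporting batches over OTLP:

```python
import socket
import time

class AgentBatcher:
    """Hypothetical node-local agent: enrich records with node identity
    and an observation timestamp, batch them, and hand full batches to a
    forwarder (a stand-in for an OTLP exporter)."""

    def __init__(self, forward, batch_size=100, node=None):
        self.forward = forward          # callable receiving a list of records
        self.batch_size = batch_size
        self.node = node or socket.gethostname()
        self.buffer = []

    def enrich(self, record):
        record.setdefault("node", self.node)
        record.setdefault("observed_ts", time.time())
        return record

    def emit(self, record):
        self.buffer.append(self.enrich(record))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, even a partial batch (e.g. on shutdown).
        if self.buffer:
            self.forward(self.buffer)
            self.buffer = []
```

Local enrichment matters here: attributes such as the node name must be attached at the source, before batches from thousands of nodes are merged at the gateway.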
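One common form of gateway sampling is deterministic head sampling keyed on the trace ID, so every collector makes the same keep/drop decision for a given trace. This is a generic sketch of the idea, not the OTel Collector's actual sampler:

```python
import hashlib

def keep_trace(trace_id, sample_rate=0.1):
    """Map the trace ID to a stable value in [0, 1) and keep the trace
    when it falls below the sample rate. Hash-based bucketing makes the
    decision deterministic across independent collectors."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the trace ID, all spans of a trace are kept or dropped together, which preserves complete traces at reduced volume.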
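The hot/cold split reduces to a routing rule keyed on signal type. The backend names below mirror the stores described above, but the `route` function itself is an illustrative sketch, not a real gateway API:

```python
# Hot-path backend per signal type, as in the storage split above.
HOT_BACKENDS = {"log": "loki", "trace": "tempo", "metric": "prometheus"}

def route(record):
    """Return the destinations for one telemetry record: every record
    goes to the cold data lake (Parquet on object storage); recognized
    signal types additionally go to a short-retention hot store."""
    destinations = ["data_lake"]
    hot = HOT_BACKENDS.get(record["signal"])
    if hot is not None:
        destinations.append(hot)
    return destinations
```

Writing everything to the cold path keeps long-term analytics complete even when hot-path retention expires after one to two weeks.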