About GPU Telemetry

A monitoring stack usually consists of a collector, a time-series database to store metrics, and a visualization layer. A popular open-source stack pairs Prometheus with Grafana, which serves as the visualization tool for building rich dashboards. Prometheus also includes Alertmanager to create and manage alerts. Prometheus is typically deployed alongside kube-state-metrics, which exposes cluster-level metrics for Kubernetes API objects, and node_exporter, which exposes node-level metrics such as CPU utilization.

The Prometheus architecture is shown in the figure below:

https://boxboat.com/2019/08/08/monitoring-kubernetes-with-prometheus/prometheus-architecture.png

To gather GPU telemetry in Kubernetes, it is recommended to use DCGM Exporter. Built on DCGM, DCGM Exporter exposes GPU metrics in a format that Prometheus can scrape, and those metrics can then be visualized in Grafana. DCGM Exporter is architected to take advantage of the KubeletPodResources API. A ServiceMonitor is also included to expose endpoints.
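To make "a format that can be scraped by Prometheus" concrete, the following is a minimal sketch of reading a text-format scrape payload into structured samples. It is an illustration under stated assumptions, not DCGM Exporter's or Prometheus's own code. The metric names follow DCGM field naming, but the payload, its label set, the values, and the names `Sample`, `SAMPLE_PAYLOAD`, and `parse_metrics` are invented for this example. The parser is deliberately simplified: it assumes label values contain no `}` and discards optional timestamps.

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    """One metric sample: name, label set, and numeric value."""
    name: str
    labels: Dict[str, str]
    value: float


# Hypothetical scrape payload. Labels and values are made up for illustration;
# a real exporter typically attaches more labels than shown here.
SAMPLE_PAYLOAD = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="trainer-0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="trainer-1"} 12
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",pod="trainer-0"} 64
"""

# metric_name, optional {labels}, value, optional integer timestamp (ignored).
_LINE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# key="value" pairs inside the braces; allows backslash escapes in the value.
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse text-format metrics, skipping comments and malformed lines."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = _LINE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(value)))
    return samples


if __name__ == "__main__":
    # Print per-GPU utilization together with the pod label.
    for sample in parse_metrics(SAMPLE_PAYLOAD):
        if sample.name == "DCGM_FI_DEV_GPU_UTIL":
            print(sample.labels.get("gpu"), sample.labels.get("pod"), sample.value)
```

In a real deployment Prometheus performs this parsing itself on every scrape. The sketch only shows why per-GPU and per-pod labels on each sample make it straightforward to build Grafana panels that break GPU metrics down by workload.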