About GPU Telemetry
Monitoring stacks usually consist of a collector, a time-series database to store metrics, and a visualization layer. A popular open-source stack is Prometheus, used together with Grafana as the visualization tool for building rich dashboards. Prometheus also includes Alertmanager to create and manage alerts. Prometheus is typically deployed along with kube-state-metrics and node_exporter, which expose cluster-level metrics for Kubernetes API objects and node-level metrics such as CPU utilization, respectively.
The architecture of Prometheus is shown in the figure below:
To gather GPU telemetry in Kubernetes, it is recommended to use DCGM Exporter. DCGM Exporter, based on DCGM, exposes GPU metrics for Prometheus, and these metrics can be visualized using Grafana. DCGM Exporter is architected to take advantage of the KubeletPodResources API and exposes GPU metrics in a format that can be scraped by Prometheus. A ServiceMonitor is also included to expose endpoints.
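
To make the scrape format concrete, the sketch below parses a few lines of Prometheus text exposition output of the kind an exporter endpoint might return. The sample text, the label set (`gpu`, `pod`, `namespace`), and the helper names `parse_metrics` and `utilization_by_pod` are illustrative assumptions rather than a definitive description of DCGM Exporter's output; check the real endpoint for the exact metrics and labels in your deployment.

```python
import re

# Hypothetical sample of exporter output in the Prometheus text format.
# The metric values and labels here are made up for illustration.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="trainer-0",namespace="ml"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="",namespace=""} 0
'''

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


def utilization_by_pod(samples, metric='DCGM_FI_DEV_GPU_UTIL'):
    """Map pod name to a list of GPU utilization values, skipping GPUs with no pod."""
    result = {}
    for name, labels, value in samples:
        pod = labels.get('pod', '')
        if name == metric and pod:
            result.setdefault(pod, []).append(value)
    return result


if __name__ == '__main__':
    print(utilization_by_pod(parse_metrics(SAMPLE)))
```

In a real cluster, Prometheus performs this scraping and parsing itself; a script like this is mainly useful for spot-checking an endpoint by hand. The per-pod labels in the sample stand in for the pod attribution that the KubeletPodResources API makes possible.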