The GPU Health Monitor continuously checks the health of NVIDIA GPUs in your Kubernetes cluster using DCGM (Data Center GPU Manager). It detects GPU hardware issues such as ECC errors, thermal problems, and PCIe failures before they impact your workloads.
Think of it as a heart rate monitor for your GPUs: it constantly checks vitals and alerts you when something goes wrong.
GPU failures are expensive and can cause silent data corruption or job crashes:
Without GPU monitoring, failures often go unnoticed until jobs crash or produce incorrect results, wasting valuable compute time and requiring manual troubleshooting.
The GPU Health Monitor runs as a DaemonSet on GPU nodes:
DCGM provides comprehensive GPU health monitoring including memory errors, thermal violations, PCIe problems, and more. The monitor translates these DCGM health checks into standardized health events that NVSentinel can act upon.
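The translation step described above can be pictured with a small sketch. The `HealthEvent` fields, the `translate` helper, and the shape of the incident dictionary are all hypothetical and chosen only for illustration; they are not NVSentinel's actual event schema or DCGM's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthEvent:
    """Hypothetical standardized event; field names are illustrative only."""
    node: str
    check_name: str   # e.g. "thermal" or "pcie"
    entity_id: int    # GPU index on the node
    is_healthy: bool
    message: str


def translate(node: str, incident: dict) -> HealthEvent:
    """Convert one DCGM-style incident (assumed dict shape) into a HealthEvent."""
    return HealthEvent(
        node=node,
        check_name=incident["system"].lower(),
        entity_id=int(incident["gpu_id"]),
        # Assumption: any status other than "PASS" marks the check as unhealthy.
        is_healthy=incident["status"] == "PASS",
        message=incident.get("message", ""),
    )


event = translate(
    "gpu-node-1",
    {"system": "THERMAL", "gpu_id": 0, "status": "FAIL", "message": "temperature violation"},
)
print(event)
```

Keeping the event type small and immutable makes it easy for downstream components to compare, cache, and forward events without caring which DCGM check produced them.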
Configure the GPU Health Monitor through Helm values:
The monitor checks multiple GPU health aspects through DCGM. Below are some of the key health watches. The list is not exhaustive and may evolve over time as DCGM capabilities expand:
Memory Errors: Single-bit and double-bit ECC errors
Thermal Issues: Temperature violations and throttling events
PCIe Problems: PCIe replay errors and link issues
Power Issues: Power violations and power capping events
InfoROM Errors: GPU InfoROM corruption
NVLink Errors: NVLink connectivity and error detection
Note: The specific health watches available depend on your DCGM version and GPU model. NVIDIA regularly adds new health checks and monitoring capabilities to DCGM. Consult your DCGM documentation for the complete list of supported health watches for your environment.
Leverages NVIDIA’s Data Center GPU Manager for comprehensive GPU health monitoring with proven reliability.
Maintains an entity-level cache to track reported issues and avoid sending duplicate events within a single boot session.
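One way to picture this deduplication is a cache keyed by check name and entity, reset whenever the boot session changes. This is a minimal sketch under assumptions: the `EntityCache` class, its method names, and the use of a boot ID string to detect a new session are hypothetical, not the monitor's actual implementation.

```python
from typing import Dict, Tuple

# Key: (check name, entity id). Value: last status reported for that pair.
CacheKey = Tuple[str, int]


class EntityCache:
    """Hypothetical in-memory cache that is emptied whenever the boot ID changes."""

    def __init__(self, boot_id: str) -> None:
        self.boot_id = boot_id
        self._last_reported: Dict[CacheKey, str] = {}

    def should_send(self, boot_id: str, check: str, entity_id: int, status: str) -> bool:
        """Return True only when this status has not yet been reported for the entity."""
        if boot_id != self.boot_id:
            # A reboot starts a new session, so earlier reports no longer suppress events.
            self.boot_id = boot_id
            self._last_reported.clear()
        key = (check, entity_id)
        if self._last_reported.get(key) == status:
            return False  # duplicate within the same boot session
        self._last_reported[key] = status
        return True


cache = EntityCache(boot_id="boot-a")
first = cache.should_send("boot-a", "thermal", 0, "FAIL")
repeat = cache.should_send("boot-a", "thermal", 0, "FAIL")
after_reboot = cache.should_send("boot-b", "thermal", 0, "FAIL")
print(first, repeat, after_reboot)  # True False True
```

Storing the last status rather than a simple "seen" flag means a change from failing back to passing is still reported, while repeated identical results are suppressed.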
Maps DCGM health checks to categorized events (fatal, warning, info) for appropriate response levels.
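The categorization could look roughly like the sketch below. The numeric result levels are taken to match DCGM's `dcgmHealthWatchResults_t` (verify them against your DCGM version). The `Severity` enum, the `classify` function, and the choice to map PASS to info, WARN to warning, and FAIL to fatal are assumptions for illustration, not the monitor's confirmed mapping.

```python
from enum import Enum

# Result levels assumed to mirror DCGM's dcgmHealthWatchResults_t.
DCGM_HEALTH_RESULT_PASS = 0
DCGM_HEALTH_RESULT_WARN = 10
DCGM_HEALTH_RESULT_FAIL = 20


class Severity(Enum):
    """Hypothetical event categories named after the levels in the text."""
    INFO = "info"
    WARNING = "warning"
    FATAL = "fatal"


def classify(result: int) -> Severity:
    """Map a DCGM health result level to an event category (assumed mapping)."""
    if result >= DCGM_HEALTH_RESULT_FAIL:
        return Severity.FATAL
    if result >= DCGM_HEALTH_RESULT_WARN:
        return Severity.WARNING
    return Severity.INFO


for level in (DCGM_HEALTH_RESULT_PASS, DCGM_HEALTH_RESULT_WARN, DCGM_HEALTH_RESULT_FAIL):
    print(level, classify(level).value)
```

Using threshold comparisons instead of exact matches means an unrecognized level at or above the failure value is treated as fatal rather than silently ignored.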
Includes GPU serial numbers, UUIDs, and other metadata in health events for precise identification and tracking.
Detects and reports DCGM connectivity issues to ensure monitoring remains functional.
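A connectivity check of this kind can be approximated with a plain TCP probe against the DCGM host engine. This is only a sketch: the `dcgm_reachable` and `connectivity_status` helpers are hypothetical, port 5555 is assumed as the host engine's default, and the real monitor may detect connectivity loss through the DCGM client library instead.

```python
import socket


def dcgm_reachable(host: str = "localhost", port: int = 5555, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the assumed DCGM endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def connectivity_status(host: str = "localhost", port: int = 5555) -> dict:
    """Build a hypothetical status record describing DCGM connectivity."""
    healthy = dcgm_reachable(host, port)
    return {
        "check_name": "dcgm_connectivity",
        "is_healthy": healthy,
        "message": "" if healthy else f"cannot reach DCGM at {host}:{port}",
    }


print(connectivity_status())
```

Reporting the probe result as an ordinary health record lets a loss of monitoring be handled through the same path as a GPU fault, rather than going unnoticed.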