Validate the GPU status/health
Using cmsh, run the following command in device mode:
[bcm10-headnode1->device]% pexec -c dgx-h100 "nvsm show gpus | grep -e \"GPU.\" -e \"Health\""
[dgx-01] :
/systems/localhost/gpus/GPU0
Inventory_UUID = GPU-3a713db1-ff94-3f28-de11-2d6449b8d35f
Stats_UtilGPU = 0%
Status_Health = OK
/systems/localhost/gpus/GPU0/health
Health = OK
/systems/localhost/gpus/GPU1
Inventory_UUID = GPU-bc00e4ef-fb74-10c8-fa6a-a8c243304e14
Stats_UtilGPU = 0%
Status_Health = OK
/systems/localhost/gpus/GPU1/health
Health = OK
/systems/localhost/gpus/GPU2
Inventory_UUID = GPU-4f2e5158-c8be-b50e-2d99-e5380b4a8236
Stats_UtilGPU = 0%
Status_Health = OK
/systems/localhost/gpus/GPU2/health
Health = OK
/systems/localhost/gpus/GPU3
Inventory_UUID = GPU-0a8a7416-4c04-5038-19ca-345db5a1a0ad
Stats_UtilGPU = 0%
Status_Health = OK
/systems/localhost/gpus/GPU3/health
Health = OK
/systems/localhost/gpus/GPU4
Inventory_UUID = GPU-bbd31c7c-7845-d48a-df3b-1523adf86ee6
Stats_UtilGPU = 0%
Status_Health = OK
/systems/localhost/gpus/GPU4/health
Health = OK
/systems/localhost/gpus/GPU5
output omitted for brevity
Ensure that every GPU on every node reports Status_Health = OK and Health = OK.
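On a large cluster, reading this output by eye is error-prone, so the check can be scripted. The following is a minimal sketch, not part of the product tooling: it assumes the pexec output has been saved as text and follows the layout shown above (a `[node] :` header, a `/systems/localhost/gpus/GPUn` path line, then `Status_Health = ...` and `Health = ...` lines). The function name `find_unhealthy_gpus`, the `SAMPLE` text, and the non-OK value `Warning` are hypothetical and used only for illustration.

```python
import re

# Patterns assume the output layout shown in the section above.
NODE_RE = re.compile(r"^\[(?P<node>[^\]]+)\]\s*:?\s*$")
GPU_RE = re.compile(r"^/systems/localhost/gpus/(?P<gpu>GPU\d+)(?:/health)?$")
HEALTH_RE = re.compile(r"^(?P<field>Status_Health|Health)\s*=\s*(?P<value>\S+)$")


def find_unhealthy_gpus(output):
    """Return (node, gpu, field, value) for every health field that is not OK."""
    node = gpu = None
    problems = []
    for raw in output.splitlines():
        line = raw.strip()
        m = NODE_RE.match(line)
        if m:
            # A new node header resets the current GPU.
            node, gpu = m.group("node"), None
            continue
        m = GPU_RE.match(line)
        if m:
            gpu = m.group("gpu")
            continue
        m = HEALTH_RE.match(line)
        if m and m.group("value") != "OK":
            problems.append((node, gpu, m.group("field"), m.group("value")))
    return problems


# Hypothetical sample: "Warning" is a placeholder for any non-OK value.
SAMPLE = """\
[dgx-01] :
/systems/localhost/gpus/GPU0
Status_Health = OK
/systems/localhost/gpus/GPU0/health
Health = OK
/systems/localhost/gpus/GPU1
Status_Health = Warning
/systems/localhost/gpus/GPU1/health
Health = Warning
"""

for node, gpu, field, value in find_unhealthy_gpus(SAMPLE):
    print(f"{node} {gpu}: {field} = {value}")
```

An empty result means every health field that was parsed reads OK. It does not confirm that every expected GPU appeared in the output, so the GPU count per node should still be checked separately.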