Power Supply Replacement
This topic describes how to replace the power supplies (PSUs) of the NVIDIA DGX™ H100/H200 system.
Power Supply Replacement Overview
This is a high-level overview of the steps needed to replace a power supply.
Identify the failed power supply by its amber LED or by its power supply number.
Request a replacement from NVIDIA Enterprise Support.
Remove the locking power cord from the power supply.
Replace the power supply.
Install the locking power cord.
Confirm that both LEDs on the power supply light up green.
Make sure the BMC reports no power supply failures.
If requested, ship the failed unit back to NVIDIA Enterprise Support using the packaging provided.
Identifying the Failed Power Supply
You can identify a failed power supply using any of the following methods:
Visually inspect the LEDs on the power supplies from the rear of the system when the system is powered on.
Run the nvsm show psus command and view the command output.
Access the BMC web user interface and view the sensor data.
NVIDIA Enterprise Support might ask for this or similar information to confirm the power supply needs to be replaced.
The nvsm command output and the BMC web user interface identify each power supply as PSUx, where x is 0 to 5.
The following diagram shows the physical location of each PSU.
Viewing the Power Supply LEDs
Access the rear of the system and view the status LEDs while the system is powered on.
Both LEDs are solid green if the PSU is good. If either LED is not green, or if either LED blinks, contact NVIDIA Enterprise Support to troubleshoot the issue.
Running the Show PSUs Command
Run the following command to display information about the PSUs:
sudo nvsm show psus
The output shows information for each PSU. Look for any that do not report Status_Health=OK.
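The check above can be sketched as a small filter. This is a minimal sketch, not an NVIDIA-provided tool: the function name unhealthy_psus is hypothetical, and it assumes that the nvsm show psus output uses the same layout as the per-PSU example output later in this topic (a path line such as /chassis/localhost/power/PSU0 followed by "Status_Health = <value>").

```shell
# Hypothetical helper: print every PSU whose Status_Health is not OK.
# Reads nvsm-style text on stdin.
# Assumption: each PSU block starts with a /chassis/localhost/power/PSUx
# path line and contains a "Status_Health = <value>" line.
unhealthy_psus() {
    awk '
        # Remember the PSU name from the most recent path line.
        /^[ \t]*\/chassis\/localhost\/power\/PSU[0-9]+/ {
            n = split($1, parts, "/")
            psu = parts[n]
        }
        # Report any health value other than OK.
        $1 == "Status_Health" && $3 != "OK" { print psu ": " $3 }
    '
}

# Usage (on the DGX system):
#   sudo nvsm show psus | unhealthy_psus
```

If the filter prints nothing, every PSU block it recognized reported OK. It is still worth viewing the raw output once to confirm the layout matches the assumption.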
Viewing PSUs from the BMC Web User Interface
Access the BMC web user interface and select Sensors from the left-hand column.
Confirm PSU presence:
Confirm power output:
Confirm fan speeds:
Confirm the PSU temperature readings:
Run the ipmitool command to view information about the PSUs:
sudo ipmitool sdr | grep -i psu
Look for power supplies with no temperature reading or an output reading that is close to, or equal to, zero.
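As a rough aid for scanning that output, the following sketch flags PSU sensor rows with no reading or a reading of exactly zero. The function name suspect_psu_sensors is hypothetical, and the sketch assumes the usual three-column ipmitool sdr layout (name | reading | status). Readings that are merely close to zero are not flagged, because the source gives no threshold; review those by eye.

```shell
# Hypothetical helper: flag PSU sensor rows that show "no reading"
# or a numeric reading of exactly zero.
# Assumption: input is "ipmitool sdr" text with name | reading | status.
suspect_psu_sensors() {
    awk -F'|' '
        tolower($1) ~ /psu/ {
            name = $1
            reading = $2
            # Trim the padding around each column.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", reading)
            # The first word of the reading is the value, if numeric.
            split(reading, f, " ")
            if (reading == "no reading" || (f[1] ~ /^[0-9.]+$/ && f[1] + 0 == 0))
                print name ": " reading
        }
    '
}

# Usage (on the DGX system):
#   sudo ipmitool sdr | suspect_psu_sensors
```

Discrete sensors that report hexadecimal states such as 0x01 are skipped by the numeric test, so only analog readings are compared against zero.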
Determining the Manufacturer
Important
All PSUs in the system must be from the same manufacturer.
Run the following nvsm command to determine the PSU manufacturer:
sudo nvsm show /chassis/localhost/power/PSUx
Replace x in the preceding command with the PSU identifier.
Example Output
The following output is for PSU0 and shows that the manufacturer is Delta.
/chassis/localhost/power/PSU0
Properties:
    FirmwareVersion = 02.02.02.01.02.02
    LastPowerOutputWatts = 0
    Manufacturer = Delta
    MemberId = PSU0
    Model = ECD16020137
    Name = PSU0
    Oem_PSU_Error = Presence detected| Power Supply AC Lost| AC Lost or out-of-range
    PowerSupplyType = AC
    SerialNumber = DTHTCT2233078
    Status_Health = Critical
    Status_State = Present
Targets:
Verbs:
    cd
    show
Obtain the replacement PSU (of the same manufacturer) from NVIDIA Enterprise Support.
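Because all PSUs must come from the same manufacturer, it can help to collect the Manufacturer property from every PSU in one pass. The following is a minimal sketch under that goal; the function names psu_manufacturer and same_manufacturer are hypothetical, and the parsing assumes the "Manufacturer = <name>" line shown in the example output.

```shell
# Hypothetical helper: extract the Manufacturer value from the
# per-PSU nvsm output on stdin.
psu_manufacturer() {
    awk '$1 == "Manufacturer" && $2 == "=" {
        sub(/^[^=]*=[ \t]*/, "")   # drop everything up to the value
        print
        exit
    }'
}

# Hypothetical helper: read one manufacturer name per line on stdin and
# succeed only if no more than one distinct name appears.
same_manufacturer() {
    awk 'NF && !seen[$0]++ { n++ } END { exit (n > 1) }'
}

# Usage (on the DGX system), looping over PSU0 to PSU5:
#   for i in 0 1 2 3 4 5; do
#       sudo nvsm show /chassis/localhost/power/PSU$i | psu_manufacturer
#   done | same_manufacturer || echo "PSU manufacturers differ"
```

A PSU that reports no Manufacturer line contributes nothing to the comparison, so an absent unit does not cause a false mismatch.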
Preparing the Power Supply for Replacement
If the system is on, make sure at least four other power supplies are working by confirming that their IN and OUT LEDs are lit green.
Note
If insufficient PSUs are present and working, power off the system.
Unplug the power cord from the failed power supply. Refer to Locking Power Cords for more information.
After the new power supply arrives, identify which power supply in the system needs to be replaced. The system can operate at full capacity with four fully working power supplies. If the system is on, make sure that at least four power supplies are fully functional.
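The four-PSU requirement can also be checked from the command line as a complement to inspecting the LEDs. This is only a sketch: the function name enough_healthy_psus is hypothetical, and it assumes that the nvsm show psus output contains exactly one "Status_Health = <value>" line per PSU, in the layout of the example output earlier in this topic.

```shell
# Hypothetical helper: succeed only if at least four PSUs report
# Status_Health = OK in the nvsm-style text on stdin.
# Assumption: one Status_Health line per PSU.
enough_healthy_psus() {
    healthy=$(awk '$1 == "Status_Health" && $3 == "OK" { n++ } END { print n + 0 }')
    [ "$healthy" -ge 4 ]
}

# Usage (on the DGX system):
#   if sudo nvsm show psus | enough_healthy_psus; then
#       echo "At least four PSUs are healthy"
#   else
#       echo "Fewer than four healthy PSUs: power off before replacing"
#   fi
```

The LED check remains the documented method; treat this count as a secondary confirmation.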
Replacing the Power Supply
Remove the power supply by pressing the green tab to unlock the unit. Then pull on the black handle.
Caution
Once the power supply is out of the chassis, insert the new power supply within 30 seconds to avoid disrupting airflow in the system, especially if the system is running.
Replace the power supply with the new unit, making sure the green tab locks into place.
After inserting the new power supply, plug in and lock the power cord and confirm that both the IN and OUT LEDs light up green on the new power supply.
From the BMC web user interface, confirm the power supply sensors are OK.
Run the nvsm show health command and confirm the output does not report any errors.
After the replacement is complete, return the broken power supply to NVIDIA Enterprise Support.
Locking Power Cords
This section describes how to use the twist-locking power cords that ship with the system.
To insert the PDU side of the power cord, insert the cable into the plug. To remove it, press the clips on both sides at the same time to unlock the power cord and pull it out of the plug.
On the power supply side, first make sure the cable’s gray band or locking ring is set to unlock. Then insert the cable into the power supply plug and twist the locking ring to the locked position.
To remove the cable from the power supply, twist the locking ring to the unlocked position and pull the cable out of the plug.