Power Supply Replacement

This chapter describes how to replace one of the DGX A100 system power supplies (PSUs).

Power Supply Replacement Overview

This is a high-level overview of the steps needed to replace a power supply.
  1. Identify the failed power supply through the BMC and submit a service ticket.
  2. Obtain the replacement power supply from NVIDIA Enterprise Support.
  3. Identify the power supply using the diagram as a reference and the indicator LEDs.
  4. Remove the power cord from the power supply that will be replaced.
  5. Replace the failed power supply with the new power supply.
  6. Insert the power cord and make sure both LEDs (IN and OUT) light up green.
  7. Use the BMC to confirm that the power supply is working correctly.
  8. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided.

Identifying the Failed Power Supply

Identifying the Failed Power Supply from the Back

If physical access to the system is available, you can identify a failed PSU by inspecting the LEDs on the power supply while the system is powered on.

Both LEDs should be solid green. If either LED is not green or is blinking, contact NVIDIA Enterprise Support to troubleshoot the issue.

Identifying the Failed Power Supply from the Console

There are several ways to identify the failed PSU from the DGX A100 console.
  • Use the NVSM CLI as follows.
    $ sudo nvsm show psus

    The output shows information for each PSU. Look for any that do not report Status_Health=OK.

  • View the PSU status from the BMC.

    Click Sensor from the left side menu and inspect the PSU information from the Normal Sensors section.

  • Use ipmitool.
    $ sudo ipmitool sdr | grep -i psu

    Look for power supplies with no temperature reading or an output reading close to or equal to zero.
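
The ipmitool check can also be scripted. The following is a minimal sketch, not an NVIDIA-provided tool. It assumes the usual three-column ipmitool sdr layout (name | reading | status), and the function name suspect_psu_sensors is hypothetical. It prints PSU sensor lines that have no reading or a numeric reading of exactly zero; readings that are merely close to zero still need a manual look.

```shell
# Sketch only: reads `ipmitool sdr` output on stdin and prints PSU sensor
# lines whose reading is missing or exactly zero.
# Assumes the usual "name | reading | status" column layout.
suspect_psu_sensors() {
  awk -F'|' '
    tolower($1) ~ /psu/ {
      reading = $2
      gsub(/^[ \t]+|[ \t]+$/, "", reading)   # trim surrounding whitespace
      # Flag "no reading" and numeric zero; skip hex values such as 0x01.
      if (tolower(reading) ~ /^no reading/ || reading ~ /^0(\.0+)?( |$)/) {
        print
      }
    }'
}
```

One way to use it is sudo ipmitool sdr | suspect_psu_sensors. No output means no PSU sensor matched either condition.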

Both NVSM and the BMC identify each power supply as PSUx, where x is from 0 to 5. The following diagram shows the physical location of each PSU.

Determining the Manufacturer

Important: All PSUs in the system must be from the same manufacturer.

Run the following command to determine the PSU manufacturer.

$ sudo nvsm show /chassis/localhost/power/PSUX
In this command, X is the PSU identifier. The following example uses PSU0 and shows that the manufacturer is "Delta".
$ sudo nvsm show /chassis/localhost/power/PSU0

FirmwareVersion =
LastPowerOutputWatts = 312
Manufacturer = Delta
MemberId = PSU0
Model = ECD16010092
Name = PSU0
Oem_PSU_Error = <NOT_SET>
PowerSupplyType = AC
SerialNumber = DTHTCP200807M
Status_Health = OK
Status_State = Present

Obtain the replacement PSU (of the same manufacturer) from NVIDIA Enterprise Support.
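
To compare manufacturers across PSUs without reading each output by eye, the property lines can be filtered. This is a minimal sketch that assumes the "Key = Value" layout shown in the PSU0 example; the helper name nvsm_field is hypothetical and is not part of NVSM.

```shell
# Sketch only: prints the value of one "Key = Value" property from
# nvsm show output read on stdin. The key is passed as the first argument.
nvsm_field() {
  awk -v key="$1" '
    { line = $0; sub(/^[ \t]+/, "", line) }      # ignore leading indentation
    index(line, key " =") == 1 {                 # line starts with "<key> ="
      value = substr(line, length(key) + 3)      # text after "<key> ="
      sub(/^[ \t]+/, "", value)
      sub(/[ \t]+$/, "", value)
      print value
      exit
    }'
}
```

For example, sudo nvsm show /chassis/localhost/power/PSU0 | nvsm_field Manufacturer would print Delta for the output shown earlier. Running the same pipeline for PSU0 through PSU5 in a loop and piping the results through sort -u should yield a single line when all PSUs come from the same manufacturer.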

Replacing the Power Supply

  1. Be sure you have obtained the replacement PSU and that you have saved the packaging to use when sending back the failed PSU.
  2. Determine whether you need to shut down the system.
    • If at least three of the remaining PSUs are working and energized, then you do not need to shut down power to the DGX A100 system.
    • If fewer than three PSUs are working and energized, then shut down power to the DGX A100 system.
  3. Unlock the power cord and then unplug it from the PSU to be replaced. You may need to dislodge the power cord from the retaining clip.
  4. Remove the PSU.
    1. Push on the green tab to release the lock.

    2. Pull on the black handle to remove the PSU from the chassis.

  5. Install the new power supply.
    1. Insert the new power supply into the chassis and push it all the way in, making sure that the green locking mechanism engages.
    2. Plug in the power cord and lock it in place.
    3. If needed, power on the system.
  6. Confirm the installation.
    • View the PSU status on the BMC dashboard->Sensors page.
    • Run nvsm show health to confirm that all power supplies are healthy.
  7. Pack the old power supply and ship it back to NVIDIA Enterprise Support.
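
The shutdown decision in step 2 and the health check in step 6 both depend on how many PSUs report a healthy status. The following is a minimal sketch under stated assumptions: it assumes that nvsm show psus prints one Status_Health property line per PSU in the same "Key = Value" form as the single-PSU output, and the function names are hypothetical. It cannot tell whether a PSU is energized, so treat the result as a hint and not as a substitute for checking the LEDs.

```shell
# Sketch only: counts PSUs reporting "Status_Health = OK" in nvsm output
# read on stdin. Prints 0 when no line matches.
count_healthy_psus() {
  awk '/^[ \t]*Status_Health[ \t]*=[ \t]*OK[ \t]*$/ { n++ } END { print n + 0 }'
}

# Succeeds (exit status 0) when the given count is at least three,
# the threshold that step 2 uses for replacing a PSU without shutting down.
enough_psus_for_hot_swap() {
  [ "$1" -ge 3 ]
}
```

A possible use is healthy=$(sudo nvsm show psus | count_healthy_psus), followed by enough_psus_for_hot_swap "$healthy" to decide whether the system can stay powered on. After the replacement, the same count can be compared against the total number of installed PSUs.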