Power Supply Replacement

Power Supply Replacement Overview

This is a high-level overview of the procedure to replace a power supply on the NVIDIA DGX™ B300 system.

  1. Identify the failed power supply by its amber LED or by the power supply number.

  2. Request a replacement from NVIDIA Enterprise Support.

  3. Remove the locking power cord from the power supply.

  4. Replace the power supply.

  5. Install the locking power cord.

  6. Confirm that both LEDs light up green on the power supply.

  7. Ensure the BMC reports no power supply failures.

  8. If requested, send the failed unit to NVIDIA Enterprise Support using the packaging provided.

Identifying the Failed Power Supply

You can identify a failed power supply using any of the following methods:

  • Run the sudo nvsm show health command to identify the failed power supply.

  • Access the BMC Web User Interface and select Sensors from the left-side navigation bar.

  • From the console, run the ipmitool sdr | grep -i psu command.

    Note which power supply has no temperature reading or an irregular output reading (close to or equal to zero).

Contact NVIDIA Enterprise Support to request a replacement. The team might ask for this or similar information to confirm that the power supply needs to be replaced.

The nvsm command output and the BMC web user interface identify each power supply as PSUx, where x is 1 to 12. The following diagram shows the physical location of each PSU.

_images/dgx-b300-psu-id.png

Viewing the Power Supply LEDs

  • Access the rear of the system and view the status LEDs while the system is powered on.

    _images/dgx-b300-psu.png

    LED Color   Description     Status
    ---------   -------------   --------------
    Green       AC input OK     Blinking 4 Hz
    Green       DC output OK    Solid
    Green       No AC input     Off
    Amber       Bootloading     Blinking 4 Hz
    Amber       Fault           Solid
    Amber       Healthy         Off
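If LED observations are recorded in a script or a ticket, the table above can be expressed as a simple lookup. The following is a minimal sketch: the names `PSU_LED_MEANING` and `describe_led` are hypothetical and are not part of any NVIDIA tool.

```python
# Lookup transcribed from the LED table: (color, state) -> description.
PSU_LED_MEANING = {
    ("green", "blinking"): "AC input OK",
    ("green", "solid"): "DC output OK",
    ("green", "off"): "No AC input",
    ("amber", "blinking"): "Bootloading",
    ("amber", "solid"): "Fault",
    ("amber", "off"): "Healthy",
}


def describe_led(color: str, state: str) -> str:
    """Return the table's description for an observed LED color and state."""
    key = (color.strip().lower(), state.strip().lower())
    return PSU_LED_MEANING.get(key, "unknown combination")


if __name__ == "__main__":
    print(describe_led("Amber", "Solid"))
```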

Running the show psus Command

  • Run the following command to display information about the PSUs:

    sudo nvsm show psus
    

    The output shows information for each PSU. Look for any that does not report Status_Health=OK.
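The health check can also be scripted. The sketch below assumes that the output of `sudo nvsm show psus` consists of one `/chassis/localhost/power/PSUx` block per power supply with `Key = Value` property lines, as in the per-PSU output shown in Determining the Manufacturer. The sample text and the function names are illustrative assumptions, so verify them against the output on your system.

```python
from typing import Dict, List

# Illustrative text only: assumes one "/chassis/.../PSUx" block per supply
# followed by "Key = Value" property lines.
SAMPLE_OUTPUT = """\
/chassis/localhost/power/PSU0
Properties:
    Manufacturer = Delta
    Status_Health = Critical
    Status_State = Present
/chassis/localhost/power/PSU1
Properties:
    Manufacturer = Delta
    Status_Health = OK
    Status_State = Present
"""


def parse_psu_blocks(text: str) -> Dict[str, Dict[str, str]]:
    """Group "Key = Value" lines under the PSU path line that precedes them."""
    psus: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/chassis/"):
            # The last path component is the PSU name, for example "PSU0".
            current = psus.setdefault(line.rsplit("/", 1)[-1], {})
        elif current is not None and " = " in line:
            key, _, value = line.partition(" = ")
            current[key.strip()] = value.strip()
    return psus


def unhealthy_psus(text: str) -> List[str]:
    """Return the names of PSUs whose Status_Health is missing or not OK."""
    return sorted(
        name
        for name, props in parse_psu_blocks(text).items()
        if props.get("Status_Health") != "OK"
    )


if __name__ == "__main__":
    print(unhealthy_psus(SAMPLE_OUTPUT))
```

To use it on a live system, capture the command output first (for example, by redirecting it to a file) and pass the text to `unhealthy_psus`.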

Viewing PSUs from the BMC Web User Interface

  1. Access the BMC web user interface and select Sensors from the left-side navigation bar.

    • Confirm PSU presence:

      _images/b200-bmc-health-1.png
    • Confirm power output:

      _images/b200-bmc-health-2.png
    • Confirm the PSU temperature readings:

      _images/b200-bmc-health-3.png
  2. Run the ipmitool command to view information about the PSUs:

    sudo ipmitool sdr | grep -i psu
    

    Look for power supplies that have no temperature reading or an output reading that is close to or equal to zero.
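The same filter can be applied programmatically. This minimal sketch relies on the usual `name | reading | status` layout of `ipmitool sdr` lines. The sensor names in the sample, the 5 W "close to zero" threshold, and the function name are assumptions made for illustration. The sketch flags any PSU sensor with no reading, which is slightly broader than temperature sensors alone.

```python
from typing import List, Tuple

# Hypothetical sensor names; real names on the system may differ.
SAMPLE_SDR = """\
PSU1_TEMP        | 34 degrees C      | ok
PSU1_POUT        | 1450 Watts        | ok
PSU2_TEMP        | no reading        | ns
PSU2_POUT        | 0 Watts           | ok
"""


def suspect_psu_sensors(
    sdr_text: str, near_zero_watts: float = 5.0
) -> List[Tuple[str, str]]:
    """Return (sensor, reading) pairs that have no reading or near-zero watts."""
    suspects = []
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3 or "psu" not in fields[0].lower():
            continue
        name, reading, _status = fields
        if reading.lower() == "no reading":
            suspects.append((name, reading))
            continue
        parts = reading.split()
        if len(parts) >= 2 and parts[-1].lower() == "watts":
            try:
                watts = float(parts[0])
            except ValueError:
                continue  # Skip readings that are not numeric.
            if watts <= near_zero_watts:
                suspects.append((name, reading))
    return suspects


if __name__ == "__main__":
    for sensor, reading in suspect_psu_sensors(SAMPLE_SDR):
        print(sensor, reading)
```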

Determining the Manufacturer

Important

All PSUs in the system must be from the same manufacturer.

  • Run the following nvsm command to determine the PSU manufacturer:

    sudo nvsm show /chassis/localhost/power/PSUx
    

    Replace x in the preceding command with the PSU identifier.

    Example output:

    The following output is for PSU0 and shows that the manufacturer is Delta.

    /chassis/localhost/power/PSU0
    Properties:
        FirmwareVersion = 02.02.02.01.02.02
        LastPowerOutputWatts = 0
        Manufacturer = Delta
        MemberId = PSU0
        Model = ECD16020137
        Name = PSU0
        Oem_PSU_Error = Presence detected| Power Supply AC Lost| AC Lost or out-of-range
        PowerSupplyType = AC
        SerialNumber = DTHTCT2233078
        Status_Health = Critical
        Status_State = Present
    Targets:
    Verbs:
        cd
        show
    

Obtain the replacement PSU (of the same manufacturer) from NVIDIA Enterprise Support.
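Because all PSUs must come from the same manufacturer, it can help to compare the `Manufacturer` property across the per-PSU outputs before you order a replacement. The sketch below parses text in the format of the example output above. The second sample block, the vendor name `ExampleVendor`, and the function names are hypothetical.

```python
from typing import Iterable, Optional

# Excerpt of the example output shown above for PSU0.
PSU0_OUTPUT = """\
/chassis/localhost/power/PSU0
Properties:
    FirmwareVersion = 02.02.02.01.02.02
    Manufacturer = Delta
    MemberId = PSU0
    Status_Health = Critical
"""

# Hypothetical block with a made-up vendor, used only to show a mismatch.
OTHER_OUTPUT = """\
/chassis/localhost/power/PSU1
Properties:
    Manufacturer = ExampleVendor
    MemberId = PSU1
"""


def manufacturer_of(show_output: str) -> Optional[str]:
    """Return the Manufacturer property from one PSU's output, if present."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Manufacturer":
            return value.strip()
    return None


def manufacturers_match(outputs: Iterable[str]) -> bool:
    """True only if every output reports the same, non-missing manufacturer."""
    found = {manufacturer_of(output) for output in outputs}
    return len(found) == 1 and None not in found


if __name__ == "__main__":
    print(manufacturer_of(PSU0_OUTPUT))
    print(manufacturers_match([PSU0_OUTPUT, OTHER_OUTPUT]))
```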

Preparing the Power Supply for Replacement

  1. After the new power supply arrives, look at the system and identify which power supply needs to be replaced.

  2. If the system is on, ensure at least six power supplies are working by confirming the green LEDs are lit solid.

    The system can operate at full capacity with six fully functional power supplies.

    Note

    If insufficient power supplies are present and working, power off the system.

  3. Unplug the power cord from the failed power supply, following the instructions described in Locking Power Cords.

    Remove the locking power cord before you replace the power supply.
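The decision in step 2 can be written as a small check. This is a sketch only: the six-supply threshold comes from the procedure above, while the function name and the input format (PSU name mapped to whether its green LED is lit solid) are assumptions.

```python
from typing import Dict

# From the procedure: the system runs at full capacity with six working PSUs.
MIN_WORKING_PSUS = 6


def can_replace_while_powered_on(green_solid_by_psu: Dict[str, bool]) -> bool:
    """True if at least six PSUs show a solid green LED."""
    working = sum(1 for is_solid in green_solid_by_psu.values() if is_solid)
    return working >= MIN_WORKING_PSUS


if __name__ == "__main__":
    # Twelve PSUs with one failure: enough supplies remain for a hot swap.
    leds = {f"PSU{n}": True for n in range(1, 13)}
    leds["PSU4"] = False
    if can_replace_while_powered_on(leds):
        print("Enough working power supplies; the system can stay on.")
    else:
        print("Insufficient working power supplies; power off the system.")
```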

Replacing the Power Supply

  1. Press the tab to unlock the unit, and then pull the black handle to remove the power supply.

    Caution

    After you remove the power supply from the chassis, install the new power supply in less than 30 seconds to avoid disrupting airflow in the system, especially if the system is running.

    _images/dgx-b300-chassis-closed.png
  2. Insert the new power supply, making sure that the tab locks into place.

    _images/dgx-b300-power-supply.png
  3. Confirm that the LED lights up green on the new power supply.

  4. Ensure that the BMC Web UI reports no power supply failures.

  5. Run the sudo nvsm show health command and confirm that all power supplies are healthy.

  6. After the replacement is complete, return the failed power supply to NVIDIA Enterprise Support using the packaging provided.

Locking Power Cords

To use the twisting locking power cords that ship with the system:

  • On the power distribution unit (PDU) side

    1. To insert, push the cable into the PDU socket.

    2. To remove, press the clips on both sides simultaneously and pull the cord out of the socket.

  • On the power supply side

    1. Ensure the cable is unlocked:

      • To insert, push the cable into the socket.

      • To remove, pull the cable out of the socket.

    2. To unlock the power cord, twist the gray locking ring to the unlocked position.

      The indicator will show an unlocked padlock.

    3. To lock the power cord, twist the gray locking ring to the locked position.

      The indicator will show a closed padlock.

    _images/locking-pwr-cord.png