DGX Station A100 Service Manual

Documentation for administrators of the NVIDIA® DGX Station™ A100 system that explains how to service the DGX Station A100 system, including how to replace select components.

1. Introduction

This document contains instructions for replacing NVIDIA® DGX Station™ A100 system components. Be sure to familiarize yourself with the NVIDIA Terms & Conditions documents before attempting to perform any modification or repair to the DGX Station A100 system. These Terms & Conditions for the DGX Station A100 system can be found through the NVIDIA DGX Systems Support page.

Contact NVIDIA Enterprise Support to obtain an RMA number for any system or component that needs to be returned for repair or replacement. When replacing a component, use only the replacement supplied to you by NVIDIA.

1.1. Components

This service manual describes how to replace the following DGX Station A100 components:

  • System memory (DIMM)
  • Display GPU
  • U.2 cache drive
  • M.2 boot drive
  • TPM
  • Battery (CR2032)

1.2. Customer Support

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX Station A100 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX Station A100 system.

For details on how to obtain support, visit the NVIDIA Enterprise Support website (https://www.nvidia.com/en-us/support/enterprise/).

2. System Memory Replacement

This section provides information about how to replace the system memory (DIMM).

2.1. System Memory Replacement

This is a high-level overview of the process to replace the system memory (DIMM).

  1. Identify the failed DIMM.
  2. Contact NVIDIA Enterprise Support to obtain a replacement.
  3. Power off the system and turn off the power supply switch.
  4. Open the left cover (motherboard side).
  5. Remove the air baffle.
  6. Identify the failed DIMM on the motherboard.
  7. Replace the failed DIMM with the new one.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Test the memory and overall system health.

2.2. Identify the Failed DIMM

  1. Run the following command and identify the failed DIMM from its output:
    $ sudo nvsm show health
  2. Contact NVIDIA Enterprise Support, provide the requested output, and obtain a replacement DIMM.
  3. After the new DIMM arrives, power off the DGX Station A100 and switch off the power supply.
  4. Remove the left system cover by pressing on the button that releases it.





  5. Pull the cover off and set it aside.





  6. Remove the air baffle.





2.3. Locate and Replace the Failed DIMM

  1. Use this diagram, or the label located on the back of the system cover, to locate the DIMM that needs to be replaced.





  2. Press the ejector lever on the DIMM socket to push the failed DIMM out of the socket.

  3. Install the new DIMM and press down until the ejector lever returns to its locked position.

After you replace the failed DIMM, see Close the System and Check the Memory.

2.4. Close the System and Check the Memory

After replacing the failed DIMM, you need to close the DGX Station A100 and check the memory.

  1. Install the air baffle by inserting it in the holes on the right side and then allowing the magnets on the left side to secure it in place.
  2. Close the system cover by inserting the bottom of the cover into the chassis, and then rotating it until it locks.

  3. To confirm that the DGX Station A100 is healthy and that the memory is working properly, power on the system and run the following command:
    $ sudo nvsm show health
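When you want to compare the system state before and after a repair, or when NVIDIA Enterprise Support asks for the command output, it can help to save the health report to a file. The following is a minimal sketch and not part of the NVIDIA tooling: the function name `save_health_report`, the `HEALTH_CMD` override, and the log file naming are assumptions made for illustration. Only the `sudo nvsm show health` command comes from this manual.

```shell
#!/usr/bin/env bash
# Sketch: save the output of the health check to a timestamped log file.
# save_health_report and HEALTH_CMD are hypothetical names, not NVIDIA tools.

save_health_report() {
    local outdir="${1:-.}"
    # Command to run; defaults to the health check used in this manual.
    local cmd="${HEALTH_CMD:-sudo nvsm show health}"
    local logfile
    logfile="${outdir}/nvsm-health-$(date +%Y%m%d-%H%M%S).log"

    mkdir -p "$outdir" || return 1
    # Intentionally unquoted so that the command string splits into words.
    $cmd > "$logfile" 2>&1
    local status=$?
    # Print the path of the log file so that the caller can open it.
    echo "$logfile"
    return "$status"
}
```

For example, `report=$(save_health_report "$HOME/dgx-service-logs")` stores the report and `less "$report"` displays it. Running the function once before the repair and once after gives two files that can be compared with `diff`.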

3. Display GPU Replacement

This section provides information about how to replace the Display GPU.

3.1. Display GPU Replacement

This is a high-level overview of the process to replace the display GPU.

  1. Contact NVIDIA Enterprise Support to obtain a replacement GPU.
  2. Power off the DGX Station A100 and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Remove the air baffle.
  5. Use a Phillips #2 screwdriver to release the GPU from the PCIe slot.
  6. Replace the Display GPU.
  7. Use a Phillips #2 screwdriver to secure the GPU to the PCIe slot.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Test the display and overall system health.
After you review these steps, see Obtain a New Display GPU and Open the System.

3.2. Obtain a New Display GPU and Open the System

  1. Contact NVIDIA Enterprise Support and obtain a replacement GPU.
  2. Once the new Display GPU arrives, power off the system and switch off the power supply.
  3. Unplug all monitor cables from the Display GPU.
  4. Remove the left system cover by pressing on the button that releases it.





  5. Pull the cover off and set it aside.





  6. Remove the air baffle.





3.3. Remove the Display GPU

  1. Using a Phillips #2 screwdriver, carefully remove the screw that secures the Display GPU to the motherboard.
    Note: Be careful not to drop the screw as it comes off.
  2. Pull the Display GPU out of the PCIe slot.

3.4. Install the New Display GPU

  1. Install the Display GPU on PCIe slot 3.
  2. Install and tighten the screw that holds the card in place.

After you install the new Display GPU, see Close the System and Check the Display.

3.5. Close the System and Check the Display

  1. Install the air baffle by inserting it in the holes on the right side and allowing the magnets on the left side to secure it in place.
  2. Close the system cover by inserting the bottom of the cover into the chassis and rotating it until it locks.

  3. Plug the monitor cables back into the Display GPU.
  4. To confirm the system is healthy, power on the system and run the following command:
    $ sudo nvsm show health

4. U.2 Cache Drive Replacement

This section provides information about how to replace the U.2 cache drive.

4.1. U.2 Cache Drive Replacement

This is a high-level overview of the process to replace the U.2 cache drive.

  1. Contact NVIDIA Enterprise Support to obtain a replacement.
  2. Power off the system and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Replace the U.2 cache drive.
  5. Install the system cover.
  6. Power on the system.
  7. Initialize the new cache drive and test the overall system health.
After you review these steps, see Open the System.

4.2. Open the System

  1. Contact NVIDIA Enterprise Support and obtain a replacement NVMe.
  2. Once the new NVMe arrives, power off the system and switch off the power supply.
  3. Remove the left system cover by pressing on the button that releases it, as shown in the diagram below.





  4. Pull the cover off and set it aside.





Here is an example of an opened system:





4.3. Replace the NVMe Drive

  1. Press the button to release the lever.





  2. Pull the lever out to release the NVMe.





  3. Replace the old NVMe drive with the new one.





  4. With the lever open, insert the drive until it is completely in the slot, and then press the lever to lock the drive in place.





  5. Confirm the drive is flush with the drive cage.





After you replace the NVMe drive, see Close the System and Rebuild the Cache Drive.

4.4. Close the System and Rebuild the Cache Drive

  1. Close the system cover by inserting the bottom of the cover into the chassis and then rotating the cover until it locks.

  2. Power on the system and initialize the cache drive by running the following command:
    $ sudo configure_raid_array.py -c -f
  3. To confirm the system is healthy, run the following command:
    $ sudo nvsm show health
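Before you initialize the cache drive in step 2, it can be useful to confirm that the operating system detects the replacement drive. The sketch below is an illustration under stated assumptions and not an NVIDIA procedure: `list_nvme_disks` is a hypothetical helper that filters `lsblk` output for disks whose transport type is `nvme`. Device names, sizes, and the expected number of drives depend on your configuration.

```shell
# Hypothetical helper: keep only rows whose transport (second column) is nvme.
# Expects "NAME TRAN SIZE" rows on stdin, as printed by:
#   lsblk -d -n -o NAME,TRAN,SIZE
# Rows with an empty transport column (for example, loop devices) are skipped.
list_nvme_disks() {
    awk '$2 == "nvme" { print $1, $3 }'
}

# Usage on the system (read-only, changes nothing):
#   lsblk -d -n -o NAME,TRAN,SIZE | list_nvme_disks
```

If the new drive does not appear in the list, power off the system and check that the drive is fully seated and flush with the drive cage before you run the initialization command.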

5. M.2 Boot Drive Replacement

This section provides information about how you can replace the M.2 Boot Drive.

5.1. M.2 Boot Drive Replacement

This is a high-level overview of the process to replace the M.2 Boot Drive.

  1. Contact NVIDIA Enterprise Support to obtain a replacement M.2 NVMe.
  2. Power off the system and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Remove the air baffle.
  5. Use a Phillips #1 screwdriver to release the M.2 drive from the slot.
  6. Replace the M.2 drive.
  7. Use a Phillips #1 screwdriver to secure the M.2 drive to the slot.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Reinstall the DGX operating system. See Installing DGX OS in the NVIDIA DGX OS 5 User Guide for more information.
  12. Test the boot drive and overall system health.

5.2. Open the System

  1. Contact NVIDIA Enterprise Support and obtain a replacement M.2 NVMe.
  2. Power off the system and turn off the power supply switch.
  3. Remove the left system cover by pressing on the button that releases it.





  4. Pull the cover off and set it aside.









  5. Remove the air baffle.





5.4. Replace the M.2 Boot Drive

  1. Carefully lift the M.2 drive to a slight angle (about 10°), just enough to pull it out of the slot.
  2. Remove the M.2 drive.
  3. To insert the new drive, make sure the M.2 connector engages with the left M.2 slot on the motherboard. As you can see in the graphic, this is more visible from above the PCI cables.
  4. Rest the M.2 drive on the motherboard and align the heatsink with the screw holes.

5.6. Close the System and Reinstall the DGX OS

  1. Install the air baffle.
  2. Install the system cover.

  3. Power on the system.
  4. To reinstall the DGX OS, refer to Install DGX OS in the NVIDIA DGX OS 5 User Guide.
  5. To test the boot drive and overall system health, run the following command:
    $ sudo nvsm show health

6. TPM Replacement

This section provides information about how to replace the TPM.

6.1. TPM Replacement

This is a high-level overview of the process to replace the TPM.

  1. Contact NVIDIA Enterprise Support to obtain a replacement TPM.
  2. Power off the system and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Remove the air baffle.
  5. Replace the TPM.
  6. Install the air baffle.
  7. Install the system cover.
  8. Power on the system.
  9. Test the TPM and the overall system health.

6.2. Open the System

  1. Contact NVIDIA Enterprise Support and obtain a replacement TPM.
  2. Power off the system and turn off the power supply switch.
  3. Remove the left system cover by pressing on the button that releases it.





  4. Pull the cover off and set it aside.









  5. Remove the air baffle.





6.3. Replace the TPM

Pull the TPM straight out and install the new one.
Note: The connector is keyed, so the TPM only goes in one way.

6.4. Close the System and Test the Replacement

Here are the steps to close the system and test the replacement.

  1. Install the air baffle.
  2. Install the system cover.

  3. Power on the system.
  4. Follow the instructions in the DGX Station A100 User Guide to update the necessary information about the new TPM.
  5. To test overall system health, run the following command:
    $ sudo nvsm show health

7. Battery Replacement

This section provides information about replacing the battery.

7.1. Battery Replacement

This is a high-level overview of the process to replace the battery.

  1. Identify the failed battery.
  2. Obtain a replacement CR2032 battery.
  3. Power off the system and turn off the power supply switch.
  4. Open the left cover (motherboard side).
  5. Remove the air baffle.
  6. Use a thin tool to remove the old battery.
  7. Replace the battery.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Configure the clock and synchronize the BMC clock.
  12. Test the overall system health.

7.2. Identify a Failed Battery

When the battery fails, some of the following symptoms might occur:

  • An Invalid configuration message appears on your screen.
  • The Setup screen appears before the system boots.
  • A Press F1 to continue message appears on the console.
  • A clock error or clock message appears on your screen.
  • The system clock loses the time and the date.

Contact NVIDIA Enterprise Support to confirm that the battery is the correct component to replace. The CR2032 battery is not provided by NVIDIA, but it is a common coin-cell battery that is widely available from retailers. After you purchase the battery, see Replace the Battery.

7.3. Open the System

  1. Obtain a replacement CR2032 battery.
  2. Power off the system and turn off the power supply switch.
  3. Remove the left system cover by pressing on the button that releases it.





  4. Pull the cover off and set it aside.





  5. Remove the air baffle.





7.4. Replace the Battery

  1. Pull the battery out of its socket by rotating it.





  2. Replace the battery with the new CR2032 battery.





7.5. Close the System

  1. Install the air baffle.
  2. Install the system cover.

  3. Power on the system.

7.6. Reset the System Clock

Here are the steps to reset the system clock.

  1. Configure the clock and synchronize the BMC clock.
    1. To set the time and date on the system, complete one of the following tasks:
      • Use NTP
      • To set the date manually on the system, run the following command:
        $ sudo date [MMDDhhmm[[CC]YY][.ss]]
    2. Sync the date and time to the hardware real-time clock.
      $ sudo hwclock -w
    3. Reset the BMC.
      $ sudo ipmitool mc reset cold
    4. Confirm the time and date on the system are updated.
  2. Reprogram any other BIOS settings that might have been lost.
  3. To test overall system health, run the following command:
    $ sudo nvsm show health
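The `[MMDDhhmm[[CC]YY][.ss]]` argument used to set the date manually packs the month, day, hour, minute, an optional year, and optional seconds into one string. As a worked example, 15 March 2024 at 14:30:45 is written as `031514302024.45`. The sketch below assumes GNU `date` (it relies on the `-d` option); `to_date_arg` is a hypothetical helper name, and the timestamp is example data only.

```shell
# Hypothetical helper: convert a readable timestamp into the
# MMDDhhmmCCYY.ss form accepted by `date`. Assumes GNU date (-d option).
to_date_arg() {
    date -d "$1" +%m%d%H%M%Y.%S
}

to_date_arg "2024-03-15 14:30:45"   # prints 031514302024.45

# Usage on the system (this sets the clock, so check the value first):
#   sudo date "$(to_date_arg '2024-03-15 14:30:45')"
# After `sudo hwclock -w`, the two clocks can be compared with:
#   date; sudo hwclock -r
```

Building the argument from a readable timestamp reduces the chance of transposing the month, day, and time fields when you type the value by hand.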
