DGX Station A100 Service Manual

Documentation for administrators of the NVIDIA® DGX Station™ A100 system that explains how to service the DGX Station A100 system, including how to replace select components.

1. Introduction

This document contains instructions for replacing NVIDIA® DGX Station™ A100 system components. Be sure to familiarize yourself with the NVIDIA Terms & Conditions documents before attempting to perform any modification or repair to the DGX Station A100 system. These Terms & Conditions for the DGX Station A100 system can be found through the NVIDIA DGX Systems Support page.

Contact NVIDIA Enterprise Support to obtain an RMA number for any system or component that needs to be returned for repair or replacement. When replacing a component, use only the replacement supplied to you by NVIDIA.

1.1. Components

This service manual describes how to replace the following DGX Station A100 components:
  • System memory (DIMM)
  • Display GPU
  • U.2 cache drive
  • M.2 boot drive
  • TPM
  • Battery

1.2. Customer Support

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX Station A100 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX Station A100 system.

For details on how to obtain support, visit the NVIDIA Enterprise Support web site (https://www.nvidia.com/en-us/support/enterprise/).

2. System Memory Replacement

This section provides information about how to replace the system memory (DIMM).

2.1. System Memory Replacement

This is a high-level overview of the process to replace the system memory (DIMM).

  1. Identify the failed DIMM.
  2. Contact NVIDIA Enterprise Support to obtain a replacement.
  3. Power off the system and turn off the power supply switch.
  4. Open the left cover (motherboard side).
  5. Remove the air baffle.
  6. Identify the failed DIMM on the motherboard.
  7. Replace the DIMM with the new component.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Test the memory and overall system health.

2.2. Identify the Failed DIMM

Here are the steps to identify the failed DIMM.

  1. Run the following command and identify the failed DIMM from the output:
    $ sudo nvsm show health
  2. Contact NVIDIA Enterprise Support, provide the requested output, and obtain a replacement DIMM.
  3. After the new DIMM arrives, power off the DGX Station A100 and switch off the power supply.
  4. Remove the left system cover by pressing on the button that releases it.





  5. Pull the cover off and set aside.





  6. Remove the air baffle.
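
Steps 1 and 2 lend themselves to a small helper. The following Python sketch is not an NVIDIA tool. It assumes that Python 3 is available on the system, runs the health command from step 1, and saves the text so that it can be attached to a support case. The function names and the file name pattern are hypothetical.

```python
import subprocess
from datetime import datetime

# Command from step 1 of this procedure.
HEALTH_CMD = ["sudo", "nvsm", "show", "health"]


def report_name(when: datetime) -> str:
    """Build a file name such as nvsm-health-20210309-140530.txt (hypothetical pattern)."""
    return f"nvsm-health-{when:%Y%m%d-%H%M%S}.txt"


def collect_health_output(run=subprocess.run) -> str:
    """Run the health command and return standard output followed by standard error."""
    result = run(HEALTH_CMD, capture_output=True, text=True)
    return result.stdout + result.stderr


def save_health_report(when: datetime, run=subprocess.run) -> str:
    """Write the health output to a file in the current directory and return its name."""
    name = report_name(when)
    with open(name, "w", encoding="utf-8") as handle:
        handle.write(collect_health_output(run))
    return name
```

For example, calling `save_health_report(datetime.now())` on the system writes the report to the current working directory. The `run` parameter exists only so that the command runner can be swapped out when the helper is tried on a machine without NVSM.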





2.3. Locate and Replace the Failed DIMM

Here are the steps to locate and replace the failed DIMM.

  1. Use this diagram, or the label located on the back of the system cover, to locate the DIMM that needs to be replaced.





  2. Use the ejector lever on the DIMM to push it out of its socket.

  3. Install the new DIMM and press down until the ejector lever returns to its locked position.

After you replace the failed DIMM, see Close the System and Check the Memory.

2.4. Close the System and Check the Memory

After replacing the failed DIMM, you need to close the DGX Station A100 and check the memory.

  1. Install the air baffle by inserting it in the holes on the right side and then allowing the magnets on the left side to secure it in place.
  2. Close the system cover by inserting the bottom of the cover into the chassis, and then rotating it until it locks.

  3. To confirm that DGX Station A100 is healthy and that the memory is working properly, power on the system, and run the following command:
    $ sudo nvsm show health
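
As a complementary check that is not part of the NVIDIA procedure, the memory that the Linux kernel reports can be compared with the capacity you expect for your configuration. The sketch below assumes a standard `/proc/meminfo` file. The function names, the `expected_gib` parameter, and the 5% allowance for memory reserved by firmware and the kernel are hypothetical choices.

```python
import re
from pathlib import Path


def mem_total_kib(meminfo_text: str) -> int:
    """Return the MemTotal value, in KiB, from the text of /proc/meminfo."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))


def memory_looks_complete(total_kib: int, expected_gib: float, tolerance: float = 0.05) -> bool:
    """True if the reported memory is within the tolerance of the expected capacity."""
    reported_gib = total_kib / (1024 * 1024)
    return reported_gib >= expected_gib * (1.0 - tolerance)


def current_mem_total_kib() -> int:
    """Read MemTotal from the running system."""
    return mem_total_kib(Path("/proc/meminfo").read_text())
```

A result of `False` from `memory_looks_complete(current_mem_total_kib(), expected_gib)` suggests that a DIMM is not seated or not detected. Treat the `nvsm show health` output as the authoritative result.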

3. Display GPU Replacement

This section provides information about how to replace the Display GPU.

3.1. Display GPU Replacement

This is a high-level overview of the process to replace the display GPU.

  1. Contact NVIDIA Enterprise Support to obtain a replacement GPU.
  2. Power off the DGX Station A100 and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Remove the air baffle.
  5. Use a Phillips #2 screwdriver to release the GPU from the PCIe slot.
  6. Replace the Display GPU.
  7. Use a Phillips #2 screwdriver to secure the GPU to the PCIe slot.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Test the display and overall system health.
After you review these steps, see Obtain a New Display GPU and Open the System.

3.2. Obtain a New Display GPU and Open the System

Here are the steps that describe how to obtain a new Display GPU and open the DGX Station A100.
  1. Contact NVIDIA Enterprise Support and obtain a replacement GPU.
  2. Once the new Display GPU arrives, power off the system and switch off the power supply.
  3. Unplug all monitor cables from the Display GPU.
  4. Remove the left system cover by pressing on the button that releases it.





  5. Pull the cover off and set aside.





  6. Remove the air baffle.





3.3. Remove the Display GPU

Here are the steps that describe how to remove the Display GPU.

  1. Using a Phillips #2 screwdriver, carefully remove the screw that secures the Display GPU to the motherboard.
    Note: Be careful not to drop the screw as it comes off.
  2. Pull the Display GPU off the PCIe slot.

3.4. Install the New Display GPU

Here are the steps that describe how to install the new Display GPU.

  1. Install the Display GPU on PCIe slot 3.
  2. Install and tighten the screw that holds the card in place.

After you install the new Display GPU, see Close the System and Check the Display.

3.5. Close the System and Check the Display

Here are the steps to close the system.

  1. Install the air baffle by inserting it in the holes on the right side and allowing the magnets on the left side to secure it in place.
  2. Close the system cover by inserting the bottom of the cover into the chassis and rotating it until it locks.

  3. Plug the monitor cables back into the Display GPU.
  4. To confirm the system is healthy, power on the system and run the following command:
    $ sudo nvsm show health
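
In addition to the health check, you can confirm that the driver enumerates the new card. The following Python sketch parses the output of `nvidia-smi -L`, which prints one line per GPU in the form `GPU <index>: <name> (UUID: <uuid>)`. This manual does not state the product name that the Display GPU reports, so the caller supplies a name fragment. The function names are hypothetical.

```python
import re
import subprocess
from typing import List, Tuple

# One line of `nvidia-smi -L` output: GPU <index>: <name> (UUID: <uuid>)
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: ([^)]+)\)$")


def parse_gpu_list(listing: str) -> List[Tuple[int, str, str]]:
    """Return (index, name, uuid) for every GPU line in the listing."""
    gpus = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def has_gpu_named(listing: str, fragment: str) -> bool:
    """True if any listed GPU name contains the fragment, ignoring case."""
    return any(fragment.lower() in name.lower() for _, name, _ in parse_gpu_list(listing))


def list_gpus() -> str:
    """Return the raw text printed by `nvidia-smi -L` on the running system."""
    completed = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return completed.stdout
```

On the system, `has_gpu_named(list_gpus(), fragment)` returns `True` when a GPU whose name contains `fragment` is listed.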

4. U.2 Cache Drive Replacement

This section provides information about how to replace the U.2 cache drive.

4.1. U.2 Cache Drive Replacement

This is a high-level overview of the process to replace the U.2 cache drive.

  1. Contact NVIDIA Enterprise Support to obtain a replacement.
  2. Power off the system and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Replace the U.2 cache drive.
  5. Install the system cover.
  6. Power on the system.
  7. Initialize the new cache drive and test the overall system health.
After you review these steps, see Open the System.

4.2. Open the System

Here are the steps to open the system.

  1. Contact NVIDIA Enterprise Support and obtain a replacement NVMe.
  2. Once the new NVMe arrives, power off the system and switch off the power supply.
  3. Remove the left system cover by pressing on the button that releases it as shown in the diagram below.





  4. Pull the cover off and set aside.





Here is an example of an opened system:





4.3. Replace the NVMe Drive

Here are the steps to replace the NVMe drive.

  1. Press the button to release the lever.





  2. Pull the lever out to release the NVMe.





  3. Replace the old NVMe drive with the new NVMe drive.





  4. Insert the drive with the lever open until it is completely in the slot and press the lever to lock it in place.





  5. Confirm the drive is flush with the drive cage.





After you replace the NVMe drive, see Close the System and Rebuild the Cache Drive.

4.4. Close the System and Rebuild the Cache Drive

Here are the steps to close the system and rebuild the cache drive.

  1. Close the system cover by inserting the bottom of the cover into the chassis and then rotating the cover until it locks.

  2. Power on the system and initialize the cache drive by running the following command:
    $ sudo configure_raid_array.py -c -f
  3. To confirm the system is healthy, run the following command:
    $ sudo nvsm show health
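
After the cache drive is initialized, you may also want to confirm that the cache file system is mounted. The following Python sketch reads the Linux mount table in `/proc/mounts`. The `/raid` mount point used as the default is an assumption and is not stated in this manual, so pass the mount point that applies to your installation. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def mount_source(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the device mounted at mount_point, or None if nothing is mounted there."""
    source = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device, mount point, file system type, options, dump, pass.
        if len(fields) >= 2 and fields[1] == mount_point:
            source = fields[0]  # a later entry for the same mount point wins
    return source


def cache_mount_source(mount_point: str = "/raid") -> Optional[str]:
    """Look up the mount point in the running system's mount table."""
    return mount_source(Path("/proc/mounts").read_text(), mount_point)
```

If `cache_mount_source()` returns `None`, the cache is not mounted at the assumed location. Check the output of the initialization command before continuing.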

5. M.2 Boot Drive Replacement

This section provides information about how you can replace the M.2 Boot Drive.

5.1. M.2 Boot Drive Replacement

This is a high-level overview of the process to replace the M.2 Boot Drive.

  1. Contact NVIDIA Enterprise Support to obtain a replacement M.2 NVMe.
  2. Power off the system and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Remove the air baffle.
  5. Remove the cables from the PCIe bus extender cards.
  6. Use a Phillips #1 screwdriver to release the M.2 drive from the slot.
  7. Replace the M.2 drive.
  8. Use a Phillips #1 screwdriver to secure the M.2 drive to the slot.
  9. Attach the cables to the PCIe bus extender cards.
  10. Install the air baffle.
  11. Install the system cover.
  12. Power on the system.
  13. Reinstall the DGX operating system. See Installing DGX OS in the NVIDIA DGX OS 5 User Guide for more information.
  14. Test the boot drive and overall system health.
After you review these steps, see Open the System.

5.2. Open the System

Here are the steps to open the system.

  1. Contact NVIDIA Enterprise Support and obtain a replacement NVMe.
  2. Once the new NVMe arrives, power off the system and switch off the power supply.
  3. Remove the left system cover by pressing on the button that releases it as shown in the diagram below.





  4. Pull the cover off and set aside.





  5. Remove the air baffle.
Here is an example of an opened system:





5.3. Move Cables Out of the Way

Here are the steps to move the cables out of the way.

  1. Release the cables from the PCIe bus extender cards by pressing on the release mechanism and pulling the cable out of the socket.
  2. To provide access to the M.2 drive, complete the following steps to move the cables out of the way:
    1. Release the cable by pressing the connector release mechanism.
    2. Pull the cable away from the connector.

5.4. Replace the M.2 Boot Drive

Here are the steps to replace the M.2 Boot Drive.

  1. Use a Phillips #1 screwdriver to release the M.2 drive from the slot.
    Note: Be careful not to drop the screw as it comes off.
  2. Replace the M.2 drive.
  3. Use a Phillips #1 screwdriver to secure the M.2 drive to the slot.

5.5. Close the System and Reinstall the DGX OS

Here are the steps to close the system and reinstall the DGX OS.

  1. Install the air baffle.
  2. Install the system cover.

  3. Power on the system.
  4. To reinstall the DGX OS, refer to Install DGX OS in the NVIDIA DGX OS 5 User Guide.
  5. To test the boot drive and overall system health, run the following command:
    $ sudo nvsm show health
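
Once the operating system is running again, a quick way to see which NVMe drives the kernel detected is to list the block devices under `/sys/block`. The following Python sketch relies on the Linux naming scheme `nvme<controller>n<namespace>`. Which name belongs to the boot drive and which to the cache drive is not stated in this manual, so the sketch only lists them. The function names are hypothetical.

```python
import os
import re
from typing import Iterable, List

# Linux names NVMe namespaces nvme<controller>n<namespace>, for example nvme0n1.
NVME_NAME = re.compile(r"^nvme\d+n\d+$")


def nvme_namespaces(block_names: Iterable[str]) -> List[str]:
    """Keep only whole NVMe namespaces (no partitions), sorted by name."""
    return sorted(name for name in block_names if NVME_NAME.match(name))


def present_nvme_namespaces(sys_block: str = "/sys/block") -> List[str]:
    """List the NVMe namespaces that the running kernel exposes."""
    return nvme_namespaces(os.listdir(sys_block))
```

If the list returned by `present_nvme_namespaces()` is shorter than expected, a drive might not be fully seated or secured.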

6. TPM Replacement

This section provides information about how to replace the TPM.

6.1. TPM Replacement

This is a high-level overview of the process to replace the TPM.

  1. Contact NVIDIA Enterprise Support to obtain a replacement.
  2. Power off the system and turn off the power supply switch.
  3. Open the left cover (motherboard side).
  4. Remove the air baffle.
  5. Replace the TPM.
  6. Install the air baffle.
  7. Install the system cover.
  8. Power on the system.
  9. Test the TPM and the overall system health.
After you review these steps, see Open the System.

6.2. Open the System

Here are the steps to open the system.

  1. Contact NVIDIA Enterprise Support and obtain a replacement TPM.
  2. Power off the system and turn off the power supply switch.
  3. Remove the left system cover by pressing on the button that releases it as shown in the diagram below.





  4. Pull the cover off and set aside.





  5. Remove the air baffle.





6.3. Replace the TPM

Here is the process to replace the TPM.

Pull the TPM straight out and install the new one.
Note: The connector is keyed, so the TPM only goes in one way.

After you replace the TPM, see Close the System and Test the Replacement.

6.4. Close the System and Test the Replacement

Here are the steps to close the system and test the replacement.

  1. Install the air baffle.
  2. Install the system cover.

  3. Power on the system.
  4. Follow the instructions in the User Guide to update the necessary information on the new TPM.
  5. To test overall system health, run the following command:
    $ sudo nvsm show health
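
Before you update the TPM information, it can help to confirm that the operating system detects the new module. On Linux, the kernel registers each TPM under `/sys/class/tpm` with a name such as `tpm0`. The following Python sketch checks for such entries. It is an illustration with hypothetical function names and is not a replacement for the health check above.

```python
import os
import re
from typing import Iterable, List


def tpm_names(entries: Iterable[str]) -> List[str]:
    """Keep only entries named tpm<number>, sorted by name."""
    return sorted(name for name in entries if re.fullmatch(r"tpm\d+", name))


def registered_tpms(sys_class_tpm: str = "/sys/class/tpm") -> List[str]:
    """List the TPM devices that the running kernel has registered."""
    if not os.path.isdir(sys_class_tpm):
        return []
    return tpm_names(os.listdir(sys_class_tpm))
```

An empty list from `registered_tpms()` suggests that the module is not seated correctly or is disabled in the system firmware.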

7. Battery Replacement

This section provides information about replacing the battery.

7.1. Battery Replacement

This is a high-level overview of the process to replace the battery.

  1. Identify the failed battery.
  2. Obtain a CR2032 battery.
  3. Power off the system and turn off the power supply switch.
  4. Open the left cover (motherboard side).
  5. Remove the air baffle.
  6. Use a thin tool to aid in the removal of the old battery.
  7. Replace the battery.
  8. Install the air baffle.
  9. Install the system cover.
  10. Power on the system.
  11. Configure the clock and synchronize the BMC clock.
  12. Test the overall system health.

7.2. Identify a Failed Battery

Here is some information about how you can identify a failed battery.

When the battery fails, some of these symptoms may occur:
  • An “Invalid configuration” message appears on your screen.
  • The Setup screen appears before the system boots.
  • A “Press F1 to continue” message appears on the console.
  • A clock error or clock message appears on your screen.
  • The system clock loses time and date.

Call NVIDIA Enterprise Support to confirm that the battery is the correct component to replace. NVIDIA does not provide the CR2032 battery, but it is a standard coin cell that is widely available from retailers. After you purchase the battery, see Replace the Battery for the replacement steps.

After you identify the failed battery, see Open the System.

7.3. Open the System

Here are the steps to open the system.

  1. Obtain a replacement CR2032 battery.
  2. Power off the system and turn off the power supply switch.
  3. Remove the left system cover by pressing on the button that releases it as shown in the diagram below.





  4. Pull the cover off and set aside.





  5. Remove the air baffle.





7.4. Replace the Battery

Here are the steps to replace the battery.

  1. Pull the battery out of the socket by rotating it.
  2. Replace the battery with a new CR2032 unit.
After you replace the battery, see Close the System.

7.5. Close the System

Here are the steps to close the system.

  1. Install the air baffle.
  2. Install the system cover.

  3. Power on the system.

7.6. Reset the System Clock

Here are the steps to reset the system clock.

  1. Configure the clock and synchronize the BMC clock.
    1. Set the time and date on the system. Either use NTP, or set them manually by running the following command:
      $ sudo date [MMDDhhmm[[CC]YY][.ss]]
    2. Sync the date and time to the hardware real time clock.
      $ sudo hwclock -w
    3. Reset the BMC.
      $ sudo ipmitool mc reset cold
    4. Confirm the time and date on the system are updated.
  2. Reprogram any other BIOS settings that might have been lost.
  3. To test overall system health, run the following command.
    $ sudo nvsm show health
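
The `[MMDDhhmm[[CC]YY][.ss]]` argument in step 1 is easy to mistype. The following Python sketch builds that argument from a timestamp and returns the three commands from step 1 in the order given above. It only prepares the argument lists and does not run anything. The function names are hypothetical.

```python
from datetime import datetime
from typing import List


def date_argument(when: datetime) -> str:
    """Format a datetime as MMDDhhmmCCYY.ss, the form that the date command accepts."""
    return when.strftime("%m%d%H%M%Y.%S")


def clock_reset_commands(when: datetime) -> List[List[str]]:
    """Return the set-date, hardware-clock sync, and BMC reset commands, in order."""
    return [
        ["sudo", "date", date_argument(when)],
        ["sudo", "hwclock", "-w"],
        ["sudo", "ipmitool", "mc", "reset", "cold"],
    ]
```

For 9 March 2021 at 14:05:30, `date_argument` returns `030914052021.30`. Each argument list can be passed to `subprocess.run` once you have reviewed it.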

Notices

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, DGX A100, DGX Station, and DGX Station A100 are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.