Introduction

This document contains instructions for replacing NVIDIA DGX™ A100 system components. Before attempting any modification or repair to the DGX A100 system, familiarize yourself with the NVIDIA Terms & Conditions documents, which are available through the NVIDIA DGX Systems Support page.

Contact NVIDIA Enterprise Support to obtain an RMA number for any system or component that needs to be returned for repair or replacement. When replacing a component, use only the replacement supplied to you by NVIDIA.

Customer-replaceable Components

List of customer-replaceable components in the NVIDIA DGX A100.
You can obtain the following components from NVIDIA and replace them in your data center.
  • Fan Modules
  • Power Supplies
  • Cache (U.2) NVMe Drives
  • Boot (M.2) NVMe Drive
  • Boot (M.2) Riser Assembly
  • DIMMs
  • Network Card (vertical, single or dual port)
  • Network Card (horizontal, dual port)
  • Front Console Board
  • Battery
  • Trusted Platform Module (TPM)
  • Bezel
  • Rack Mount Kit

Contact NVIDIA Enterprise Support for replacement instructions and guidance for specific components if those instructions are not included in this document.

Customer Support

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance in installing or moving the DGX A100 system.

For details on how to obtain support, visit the NVIDIA Enterprise Support website (https://www.nvidia.com/en-us/support/enterprise/).

Running the Pre-flight Test

Instructions for running the DGX stress test.

NVIDIA recommends running the pre-flight stress test before putting a system into a production environment and after servicing it. You can choose which components to test (GPUs, CPU, memory, and storage) and how long the tests run.

To run the tests, use NVSM.

Syntax:

$ sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...] [DURATION] 

For help on running the test, issue the following.

~$ sudo nvsm stress-test --usage 

Recommended Command

The following command runs the test on all supported components (GPU, CPU, memory, and storage), and takes approximately 20 minutes.

~$ sudo nvsm stress-test --force
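
After servicing, it can be useful to keep a record of each pre-flight run. The following is a minimal sketch, not an NVIDIA-supplied tool: it wraps the recommended command in a small Bash function that shows the output, saves it to a timestamped log file, and passes the command's exit status through. The function name run_preflight and the variables PREFLIGHT_CMD and LOG_DIR are illustrative and are not part of NVSM. It is also an assumption that a nonzero exit status indicates a failed or aborted test.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the recommended pre-flight command.
# run_preflight, PREFLIGHT_CMD, and LOG_DIR are illustrative names, not NVSM features.

run_preflight() {
    local log_dir="${LOG_DIR:-.}"
    # Default to the recommended command from this document.
    local cmd="${PREFLIGHT_CMD:-sudo nvsm stress-test --force}"
    local log_file status

    mkdir -p "$log_dir" || return 1
    log_file="${log_dir}/preflight-$(date +%Y%m%d-%H%M%S).log"

    # Show output live and copy stdout and stderr to the log.
    # $cmd is deliberately unquoted so it splits into command and arguments.
    $cmd 2>&1 | tee "$log_file"
    # Exit status of the test command itself, not of tee.
    status="${PIPESTATUS[0]}"

    echo "Log saved to ${log_file}"
    # Assumption: nonzero means the test failed or did not complete.
    return "$status"
}
```

For example, after sourcing the function, `LOG_DIR=/var/tmp/preflight run_preflight` would run the recommended command and write the log under that directory (the path is only an example).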