Introduction

This document contains instructions for replacing NVIDIA DGX H100 system components. Be sure to familiarize yourself with the NVIDIA Terms and Conditions documents before attempting to perform any modification or repair to the DGX H100 system. These Terms and Conditions for the DGX H100 system can be found through the NVIDIA DGX Systems Support page.

Contact NVIDIA Enterprise Support to obtain an RMA number for any system or component that needs to be returned for repair or replacement. When replacing a component, use only the replacement supplied to you by NVIDIA.

Customer-replaceable Components

This section lists the customer-replaceable components in the NVIDIA DGX H100 system.

Customer-replaceable Units

You can obtain the following components for replacement in your data center.

  • Bezel

  • Locking power cords

  • Power supply

  • Fan module

  • Front Console Board

  • U.2 data drive

  • M.2 boot (OS) storage drive

  • Riser assembly with 2 M.2 drives

  • ConnectX-7 PCI card (Storage Network)

  • 50 Gb Ethernet NIC replacement

  • DIMMs

  • Rackmount kit

  • Trusted Platform Module

  • Battery

Contact NVIDIA Enterprise Support for replacement instructions and guidance for specific components if those instructions are not included in this document.

Customer Support

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX H100 system, and for assistance in installing or moving the system.

For details on how to obtain support, visit the NVIDIA Enterprise Support website (https://www.nvidia.com/en-us/support/enterprise/).

Running the Pre-flight Test

This section describes how to run the DGX pre-flight stress test.

NVIDIA recommends running the pre-flight stress test before putting a system into a production environment and after servicing it. You can choose which components to test (GPUs, CPU, memory, and storage) and specify the duration of the tests.

To run the tests, use NVIDIA System Management (NVSM).

Syntax:

sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...] [DURATION]

For help on running the test, run the following command.

sudo nvsm stress-test --usage

Recommended Command

The following command runs the test on all supported components (GPU, CPU, memory, and storage), and takes approximately 20 minutes.

sudo nvsm stress-test --force
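
The commands in this section can be assembled from the syntax line before they are run. The following is a minimal sketch, not part of NVSM: build_stress_cmd is a hypothetical helper name, and it only prints the command line so that it can be reviewed or pasted into a maintenance log. It passes its arguments through unchanged, so any test names or duration values must come from the output of the --usage option on your system.

```shell
#!/bin/sh
# Sketch only: build_stress_cmd is a hypothetical helper, not an NVSM command.
# It prints an "nvsm stress-test" command line without executing it.

build_stress_cmd() {
    # Fixed prefix taken from the syntax line in this section.
    cmd="sudo nvsm stress-test"
    # Append each option, test name, or duration exactly as given.
    for arg in "$@"; do
        cmd="$cmd $arg"
    done
    printf '%s\n' "$cmd"
}

build_stress_cmd --usage   # prints: sudo nvsm stress-test --usage
build_stress_cmd --force   # prints: sudo nvsm stress-test --force
```

Printing the command rather than running it keeps the sketch safe to try on any machine. On a DGX H100 system, copy the printed line into the shell to start the test.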