Quickstart and Basic Operation
This topic provides basic requirements and instructions for using the NVIDIA DGX™ H100/H200 Systems, including how to perform a preliminary health check and how to prepare for running containers. Refer to the DGX documentation for additional product documentation.
Installation and Configuration
Before you install DGX H100/H200, ensure you have given all relevant site information to your Installation Partner.
Important
Your DGX H100/H200 system must be installed by NVIDIA partner network personnel or NVIDIA field service engineers. Installation by anyone else voids your hardware warranty.
Registration
To obtain support for your DGX H100/H200, follow the instructions for registration in the Entitlement Certification email that was sent as part of the purchase.
Registration allows you to access the NVIDIA Enterprise Support Portal, obtain technical support, get software updates, and set up an NGC for DGX systems account. If you did not receive the information, open a case with the NVIDIA Enterprise Support Team at https://www.nvidia.com/en-us/support/enterprise/.
Refer to Customer Support for contact information.
Obtaining an NGC Account
NVIDIA NGC provides access to GPU-optimized software for deep learning, machine learning, and high-performance computing (HPC). An NGC account grants you access to these tools and gives you the ability to set up a private registry to manage your customized software.
If you are the organization administrator for your DGX system purchase, work with NVIDIA Enterprise Support to set up an NGC enterprise account. Refer to the NGC Private Registry User Guide for more information about getting an NGC enterprise account.
Turning DGX H100/H200 On and Off
DGX H100/H200 is a complex system, integrating a large number of cutting-edge components with specific startup and shutdown sequences. Observe the following startup and shutdown instructions.
Startup Considerations
To keep your DGX H100/H200 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components can complete their initialization.
Shutdown Considerations
When shutting down DGX H100/H200, always initiate the shutdown from the operating system, by momentarily pressing the power button, or by using Graceful Shutdown from the BMC. Wait until the system enters a powered-off state before performing any maintenance.
Warning
Risk of Danger - Removing power cables or using Power Distribution Units (PDUs) to shut off the system while the Operating System is running may cause damage to sensitive components in the DGX H100/H200 server.
Verifying Functionality - Quick Health Check
NVIDIA provides customers a diagnostics and management tool called NVIDIA System Management, or NVSM.
The nvsm command can be used to determine the system’s health, identify component issues and alerts, or run a stress test to make sure all components are in working order while under load. The use of Docker is key to getting the most performance out of the system since NVIDIA has optimized containers for all the major frameworks and workloads used on DGX systems.
The following are the steps for performing a health check on the DGX H100/H200 System and verifying the Docker and NVIDIA driver installation.
Establish an SSH connection to the DGX H100/H200 System.
Run a basic system check.
sudo nvsm show health
Verify that the output summary shows that all checks are Healthy and that the overall system status is Healthy.
Verify that Docker is installed by viewing the installed Docker version.
sudo docker --version
On success, the command returns the version as Docker version xx.yy.zz, where the actual version may differ depending on the specific release of the DGX OS Server software.
Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.
sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
The preceding command pulls the nvidia/cuda container image layer by layer, then runs the nvidia-smi command.
When complete, the output shows the NVIDIA Driver version and a description of each installed GPU.
For more information, refer to Containers For Deep Learning Frameworks User Guide.
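The three checks above can be chained in a small wrapper so that a failing step stops the sequence. The following is a minimal sketch, not an NVIDIA-provided script: the function name quick_health_check and its return codes are illustrative assumptions, and it assumes each command exits with a nonzero status when it fails. It does not parse the nvsm summary, so the Healthy status should still be confirmed by reading the output.

```shell
# Hypothetical helper: runs the quick health check commands in order.
# Return codes 1, 2, and 3 are illustrative, not defined by NVSM or Docker.
quick_health_check() {
    # Step 1: basic system check with NVSM.
    sudo nvsm show health || { echo "nvsm health check failed" >&2; return 1; }

    # Step 2: confirm that Docker is installed.
    sudo docker --version || { echo "docker is not available" >&2; return 2; }

    # Step 3: pull the CUDA base image and run nvidia-smi inside it.
    sudo docker run --gpus all --rm \
        nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi \
        || { echo "GPU container test failed" >&2; return 3; }

    echo "quick health check commands completed"
}
```

After sourcing the file that contains the function, call quick_health_check from an SSH session on the system and review the output of each step.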
Running the Pre-flight Test
Instructions for running the DGX stress test.
NVIDIA recommends running the pre-flight stress test before putting a system into a production environment or after servicing. You can specify running the test on the GPUs, CPU, memory, and storage, and also specify the duration of the tests.
To run the tests, use NVSM.
Syntax
sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...] [DURATION]
For help on running the test, issue the following.
sudo nvsm stress-test --usage
Recommended Command
The following command runs the test on all supported components (GPU, CPU, memory, and storage), and takes approximately 20 minutes.
sudo nvsm stress-test --force
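Because the recommended run takes roughly 20 minutes, it can help to keep a copy of the output for later review. The sketch below wraps the recommended command and saves its output to a timestamped file; the function name run_preflight, the log file naming, and the optional directory argument are assumptions made for illustration. It does not inspect the result of the test, so the log still needs to be read.

```shell
# Hypothetical helper: runs the recommended pre-flight command and keeps a log.
# Usage: run_preflight [LOG_DIR]   (LOG_DIR defaults to the current directory)
run_preflight() {
    log_dir="${1:-.}"
    log_file="$log_dir/preflight-$(date +%Y%m%d-%H%M%S).log"

    # Show the output on the terminal and write the same text to the log file.
    sudo nvsm stress-test --force 2>&1 | tee "$log_file"

    echo "log saved to $log_file"
}
```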
Running NGC Containers with GPU Support
To obtain the best performance when running NGC containers on DGX H100/H200 systems, the following methods of providing GPU support for Docker containers are available:
Native GPU support (included in Docker 20.10.18 and later)
NVIDIA Container Runtime for Docker (nvidia-docker2)
The method implemented in your system depends on the DGX OS version installed.
DGX OS Releases | Method Included |
---|---|
6.0 | |
Each method is invoked by using specific Docker commands, described as follows.
Using Native GPU Support
Use docker run --gpus to run GPU-enabled containers.
Example using all GPUs
sudo docker run --gpus all ...
Example using two GPUs
sudo docker run --gpus 2 ...
Examples using specific GPUs
sudo docker run --gpus '"device=1,2"' ...
sudo docker run --gpus '"device=UUID-ABCDEF,1"' ...
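The nested quoting in the specific-GPU examples exists because the shell removes the outer single quotes, so Docker receives the value with its inner double quotes intact. The sketch below builds that value from a list of device indices or UUIDs; gpus_arg is a hypothetical helper name, not part of Docker or DGX OS.

```shell
# Hypothetical helper: prints a value suitable for docker run --gpus.
# With no arguments it prints "all"; otherwise it joins the arguments with
# commas and wraps them in literal double quotes, as in the examples above.
gpus_arg() {
    if [ "$#" -eq 0 ]; then
        echo "all"
        return 0
    fi
    # Join the positional parameters with commas inside a subshell so the
    # caller's IFS is left untouched.
    devices="$(IFS=,; echo "$*")"
    printf '"device=%s"\n' "$devices"
}
```

For example, sudo docker run --gpus "$(gpus_arg 1 2)" ... passes the same value as the first specific-GPU example.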
Using the NVIDIA Container Runtime for Docker
If you need to use nvidia-docker2, install it using sudo apt install nvidia-docker2, then run:
sudo systemctl restart docker
The DGX OS also includes the NVIDIA Container Runtime for Docker (nvidia-docker2), which lets you run GPU-accelerated containers in one of the following ways:
Use docker run and specify runtime=nvidia.
docker run --runtime=nvidia ...
Use nvidia-docker run.
nvidia-docker run ...
The nvidia-docker2 package provides backward compatibility with the previous nvidia-docker package, so you can run GPU-accelerated containers using this command and the new runtime will be used.
Use docker run with nvidia as the default runtime.
You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.
"default-runtime": "nvidia",
Here is an example of how the added line appears in the JSON file. Do not remove any pre-existing content when making this change.
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "args": []
        }
    }
}
You can then use docker run to run GPU-accelerated containers.
docker run ...
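Before relying on the default runtime, it can be useful to confirm that the configuration file actually contains the entry. The following is a rough text check only, assuming the entry is written on a single line as in the example above; it does not validate the JSON syntax, and has_nvidia_default_runtime is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the file sets "default-runtime" to "nvidia".
# Usage: has_nvidia_default_runtime [FILE]  (defaults to /etc/docker/daemon.json)
has_nvidia_default_runtime() {
    config="${1:-/etc/docker/daemon.json}"
    # Allow optional whitespace around the colon; this is a text match,
    # not a JSON parse.
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config"
}
```

A typical use is has_nvidia_default_runtime && echo "nvidia is the default runtime". Changes to the file generally take effect only after the Docker service is restarted.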
Caution
If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so might result in the container being optimized only for the GPU architecture on which it was built. Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process.
For more information, refer to the NVIDIA DGX OS 6 User Guide.
Managing CPU Mitigations
DGX OS Server includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.
If your installation of DGX systems incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and thereby increase performance. This capability is available starting with DGX OS Server release 4.4.0.
Determining the CPU Mitigation State of the DGX System
If you do not know whether CPU mitigations are enabled or disabled, issue the following.
cat /sys/devices/system/cpu/vulnerabilities/*
CPU mitigations are enabled if the output consists of multiple lines prefixed with Mitigation:
Example
KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Mitigation: PTI
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Mitigation: Clear CPU buffers; SMT vulnerable
CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable
Example
KVM: Vulnerable
Mitigation: PTE Inversion; VMX: vulnerable
Vulnerable; SMT vulnerable
Vulnerable
Vulnerable
Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerable
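Counting the two kinds of lines gives a quicker read than scanning the output by eye. The sketch below is one way to do that, assuming the file layout shown above; mitigation_summary is a hypothetical helper name, and the optional directory argument exists only so the function can be tried against sample files. The match is case-sensitive, so a trailing note such as "SMT vulnerable" on a mitigated line is not counted as Vulnerable, and entries that read "Not affected" are counted in neither total.

```shell
# Hypothetical helper: counts mitigated and vulnerable entries.
# Usage: mitigation_summary [DIR]
# DIR defaults to /sys/devices/system/cpu/vulnerabilities.
mitigation_summary() {
    dir="${1:-/sys/devices/system/cpu/vulnerabilities}"

    # grep -c prints 0 and exits nonzero when nothing matches, so "|| true"
    # keeps the count without treating that case as an error.
    mitigated="$(cat "$dir"/* | grep -c 'Mitigation:' || true)"
    vulnerable="$(cat "$dir"/* | grep -c 'Vulnerable' || true)"

    echo "mitigated=$mitigated vulnerable=$vulnerable"
}
```

A result dominated by mitigated entries corresponds to the enabled state, and a result dominated by vulnerable entries corresponds to the disabled state.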
Disabling CPU Mitigations
Caution
Performing the following instructions will disable the CPU mitigations provided by the DGX OS Server software.
Install the nv-mitigations-off package.
sudo apt install nv-mitigations-off -y
Reboot the system.
Verify CPU mitigations are disabled.
cat /sys/devices/system/cpu/vulnerabilities/*
The output should include several Vulnerable lines. See Determining the CPU Mitigation State of the DGX System for example output.
Re-enabling CPU Mitigations
Remove the nv-mitigations-off package.
sudo apt purge nv-mitigations-off
Reboot the system.
Verify CPU mitigations are enabled.
cat /sys/devices/system/cpu/vulnerabilities/*
The output should include several Mitigation: lines. See Determining the CPU Mitigation State of the DGX System for example output.