Quickstart and Basic Operation

This chapter provides basic requirements and instructions for using the DGX H100 system, including how to perform a preliminary health check and how to prepare for running containers. Refer to the DGX documentation for additional product information.

Installation and Configuration

Before you install DGX H100, ensure you have given all relevant site information to your Installation Partner.

Important

Your DGX H100 system must be installed by NVIDIA partner network personnel or NVIDIA field service engineers. Installation by anyone else voids your hardware warranty.

Registration

To obtain support for your DGX H100, follow the instructions for registration in the Entitlement Certification email that was sent as part of the purchase.

Registration allows you to access the NVIDIA Enterprise Support Portal, obtain technical support, get software updates, and set up an NGC for DGX systems account. If you did not receive the information, open a case with the NVIDIA Enterprise Support Team at https://www.nvidia.com/en-us/support/enterprise/.

Refer to Customer Support for contact information.

Obtaining an NGC Account

NVIDIA NGC provides access to GPU-optimized software for deep learning, machine learning, and high-performance computing (HPC). An NGC account grants you access to these tools and gives you the ability to set up a private registry to manage your customized software.

If you are the organization administrator for your DGX system purchase, work with NVIDIA Enterprise Support to set up an NGC enterprise account. Refer to the NGC Private Registry User Guide for more information about getting an NGC enterprise account.

Turning DGX H100 On and Off

DGX H100 is a complex system, integrating a large number of cutting-edge components with specific startup and shutdown sequences. Observe the following startup and shutdown instructions.

Startup Considerations

To keep your DGX H100 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components can complete their initialization.

Shutdown Considerations

When shutting down DGX H100, always initiate the shutdown from the operating system, by momentarily pressing the power button, or by using Graceful Shutdown from the BMC. Wait until the system enters a powered-off state before performing any maintenance.

Warning

Risk of Danger - Removing power cables or using Power Distribution Units (PDUs) to shut off the system while the Operating System is running may cause damage to sensitive components in the DGX H100 server.

Verifying Functionality - Quick Health Check

NVIDIA provides a diagnostics and management tool called NVIDIA System Management, or NVSM. Use the nvsm command to determine the system's health, identify component issues and alerts, or run a stress test to confirm that all components work correctly under load. Docker is also central to getting the best performance from the system because NVIDIA provides optimized containers for all the major frameworks and workloads used on DGX systems.

Perform the following steps to check the health of the DGX H100 system and to verify the Docker and NVIDIA driver installation.

  1. Establish an SSH connection to the DGX H100 System.

  2. Run a basic system check.

    sudo nvsm show health
    
  3. Verify that the output summary shows that all checks are Healthy and that the overall system status is Healthy.

  4. Verify that Docker is installed by viewing the installed Docker version.

    sudo docker --version
    

    On success, the command returns the version as Docker version xx.yy.zz, where the actual version may differ depending on the specific release of the DGX OS Server software.

  5. Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.

    sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
    

    The preceding command pulls the nvidia/cuda container image layer by layer, then runs the nvidia-smi command.

    When complete, the output shows the NVIDIA Driver version and a description of each installed GPU.

For more information, refer to Containers For Deep Learning Frameworks User Guide.
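
Step 4 reports the version in the form Docker version xx.yy.zz, and native GPU support requires Docker 20.10.18 or later (see Running NGC Containers with GPU Support). The following is a minimal sketch, not part of the DGX OS tooling, of how a script might compare the two. The helper names docker_version_from and version_at_least are hypothetical, and the comparison assumes GNU sort with the -V option, which Ubuntu-based systems provide.

```shell
# Hypothetical helpers for illustration; they are not shipped with DGX OS.

# Extract "xx.yy.zz" from a line such as "Docker version 24.0.5, build ced0996".
docker_version_from() {
  printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when version $1 is greater than or equal to version $2.
# Assumes GNU sort, whose -V option orders version strings numerically.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage on the DGX system:
#   v=$(docker_version_from "$(sudo docker --version)")
#   version_at_least "$v" 20.10.18 && echo "docker run --gpus is supported"
```

Comparing with sort -V rather than as plain strings matters because a string comparison would rank 20.9.0 above 20.10.18.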

Running the Pre-flight Test

This section describes how to run the DGX stress test.

NVIDIA recommends running the pre-flight stress test before putting a system into a production environment or after servicing. You can choose which components to test (GPUs, CPU, memory, and storage) and specify the duration of the tests.

To run the tests, use NVSM.

Syntax

sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...] [DURATION]

For help on running the test, issue the following.

sudo nvsm stress-test --usage

Recommended Command

The following command runs the test on all supported components (GPU, CPU, memory, and storage), and takes approximately 20 minutes.

sudo nvsm stress-test --force

Running NGC Containers with GPU Support

To obtain the best performance when running NGC containers on DGX H100 systems, use one of the following methods of providing GPU support for Docker containers:

  • Native GPU support (included in Docker 20.10.18 and later)

  • NVIDIA Container Runtime for Docker (deprecated)

The method implemented in your system depends on the DGX OS version installed.

DGX OS Release   Method Included

6.0              • Native GPU support
                 • NVIDIA Container Runtime for Docker (deprecated - availability to be removed in a future DGX OS release)

Each method is invoked by using specific Docker commands, described as follows.

Using Native GPU Support

Use docker run --gpus to run GPU-enabled containers.

  • Example using all GPUs

    sudo docker run --gpus all ...
    
  • Example using two GPUs

    sudo docker run --gpus 2 ...
    
  • Examples using specific GPUs

    sudo docker run --gpus '"device=1,2"' ...
    sudo docker run --gpus '"device=UUID-ABCDEF,1"' ...
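
The specific-GPU examples above wrap the device list in double quotes inside single quotes, so the double quotes reach Docker as part of the value. A minimal sketch of building that value from a list of indices or UUIDs in a script follows. The function name gpus_device_arg is hypothetical and not part of Docker or DGX OS.

```shell
# Hypothetical helper for illustration.
# Join device identifiers (indices or UUIDs) with commas and wrap the result
# in literal double quotes, matching the value used in the examples above.
gpus_device_arg() {
  local IFS=,
  printf '"device=%s"\n' "$*"
}

# Usage on the DGX system (image name and command elided, as in the examples above):
#   sudo docker run --gpus "$(gpus_device_arg 1 2)" ...
```

Quoting the command substitution keeps the shell from splitting or stripping the value before Docker receives it.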
    

Using the NVIDIA Container Runtime for Docker

If you need to use nvidia-docker2, install it using sudo apt install nvidia-docker2, then run:

sudo systemctl restart docker

The DGX OS also includes the NVIDIA Container Runtime for Docker (nvidia-docker2), which lets you run GPU-accelerated containers in one of the following ways:

  • Use docker run and specify runtime=nvidia.

    docker run --runtime=nvidia ...
    
  • Use nvidia-docker run.

    nvidia-docker run ...
    

    The nvidia-docker2 package provides backward compatibility with the previous nvidia-docker package, so containers started with this command use the new runtime.

  • Use docker run with nvidia as the default runtime.

    You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.

    "default-runtime": "nvidia",
    

    Here is an example of how the added line appears in the JSON file. Do not remove any pre-existing content when making this change.

    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "args": []
        }
      }
    }
    

    You can then use docker run to run GPU-accelerated containers.

    docker run ...
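
Before relying on plain docker run for GPU workloads, it can help to confirm that the configuration file actually contains the setting. The following is a minimal sketch under the assumption that the entry is written on one line as shown above. The function name has_nvidia_default_runtime is hypothetical, and the pattern match is a simple text check, not a JSON parser.

```shell
# Hypothetical helper for illustration.
# Succeed if the given file (default: /etc/docker/daemon.json) contains
# "default-runtime": "nvidia". This is a text match, not a JSON validation.
has_nvidia_default_runtime() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' \
    "${1:-/etc/docker/daemon.json}"
}

# Usage on the DGX system:
#   has_nvidia_default_runtime && echo "nvidia is the configured default runtime"
```

Docker reads daemon.json when the daemon starts, so restart it with sudo systemctl restart docker after editing the file. The running daemon's view can then be checked with sudo docker info --format '{{.DefaultRuntime}}'.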
    

Caution

If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so might result in the container being optimized only for the GPU architecture on which it was built. Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process.

For more information, refer to the NVIDIA DGX OS 6 User Guide.

Managing CPU Mitigations

DGX OS Server includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.

If your installation of DGX systems incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and thereby increase performance. This capability is available starting with DGX OS Server release 4.4.0.

Determining the CPU Mitigation State of the DGX System

If you do not know whether CPU mitigations are enabled or disabled, issue the following.

cat /sys/devices/system/cpu/vulnerabilities/*

  • CPU mitigations are enabled if the output consists of multiple lines prefixed with Mitigation:.

    Example

    KVM: Mitigation: Split huge pages
    Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
    Mitigation: Clear CPU buffers; SMT vulnerable
    Mitigation: PTI
    Mitigation: Speculative Store Bypass disabled via prctl and seccomp
    Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
    Mitigation: Clear CPU buffers; SMT vulnerable

  • CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.

    Example

    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
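
When checking many nodes, reading this output by eye becomes tedious. The following is a minimal sketch of a classifier based on the two example outputs above. The function name mitigation_state and the "mixed" result are assumptions made for illustration, and the rule it applies (no Vulnerable lines means enabled, more Vulnerable than Mitigation: lines means disabled) is a heuristic, not an NVIDIA-defined test.

```shell
# Hypothetical helper for illustration.
# Read the concatenated vulnerabilities output on stdin and print
# "enabled", "disabled", or "mixed". Lines may carry a "KVM: " prefix,
# as in the examples above.
mitigation_state() {
  awk '
    /^(KVM: )?Mitigation:/ { m++ }
    /^(KVM: )?Vulnerable/  { v++ }
    END {
      if (m > 0 && v == 0) print "enabled"
      else if (v > m)      print "disabled"
      else                 print "mixed"
    }'
}

# Usage on the DGX system:
#   cat /sys/devices/system/cpu/vulnerabilities/* | mitigation_state
```

A "mixed" result means the counts do not clearly match either example, in which case inspect the raw output directly.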

Disabling CPU Mitigations

Caution

Performing the following instructions will disable the CPU mitigations provided by the DGX OS Server software.

  1. Install the nv-mitigations-off package.

    sudo apt install nv-mitigations-off -y
    
  2. Reboot the system.

  3. Verify CPU mitigations are disabled.

    cat /sys/devices/system/cpu/vulnerabilities/*
    

    The output should include several Vulnerable lines. See Determining the CPU Mitigation State of the DGX System for example output.

Re-enabling CPU Mitigations

  1. Remove the nv-mitigations-off package.

    sudo apt purge nv-mitigations-off
    
  2. Reboot the system.

  3. Verify CPU mitigations are enabled.

    cat /sys/devices/system/cpu/vulnerabilities/*
    

    The output should include several Mitigation: lines. See Determining the CPU Mitigation State of the DGX System for example output.