Quick Start and Basic Operation

This chapter provides basic requirements and instructions for using the DGX A100 system, including how to perform a preliminary health check and how to prepare for running containers. For additional product documentation, refer to the DGX documentation site.

Installation and Configuration

Before you install DGX A100, ensure you have given all relevant site information to your Installation Partner.

Important

Your DGX A100 System must be installed by NVIDIA partner network personnel or NVIDIA field service engineers. If not performed accordingly, your DGX A100 hardware warranty will be voided.

Registering Your DGX A100

To obtain support for your DGX A100, follow the instructions for registration in the Entitlement Certification email that was sent as part of the purchase.

Registration allows you to access the NVIDIA Enterprise Support Portal, obtain technical support, get software updates, and set up an NGC for DGX systems account.

If you did not receive the information, open a case with the NVIDIA Enterprise Support Team by going to the NVIDIA Enterprise Support Portal. The site provides ways of contacting the NVIDIA Enterprise Services team for support without requiring an NVIDIA Enterprise Support account. Also refer to Customer Support.

Obtaining an NGC Account

This section describes how to obtain an NGC account.

NVIDIA NGC provides simple access to GPU-optimized software for deep learning, machine learning, and high-performance computing (HPC). An NGC account grants you access to these tools and gives you the ability to set up a private registry to manage your customized software.

If you are the organization administrator for your DGX system purchase, work with NVIDIA Enterprise Support to set up an NGC enterprise account. Refer to the NGC Private Registry User Guide for more information about getting an NGC enterprise account.

Turning DGX A100 On and Off

DGX A100 is a complex system, integrating a large number of cutting-edge components with specific startup and shutdown sequences. Observe the following startup and shutdown instructions.

Startup Considerations

To keep your DGX A100 running smoothly, allow up to a minute of idle time after reaching the login prompt. This ensures that all components can complete their initialization.
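
If you prefer a programmatic check over a fixed wait, one option on systemd-based DGX OS releases is to ask systemd whether startup has finished; the command below blocks until boot completes and then prints the overall system state (for example, running):

$ systemctl is-system-running --wait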

Shutdown Considerations

When shutting down DGX A100, always initiate the shutdown from the operating system, by a momentary press of the power button, or by using Graceful Shutdown from the BMC, and wait until the system enters a powered-off state before performing any maintenance.
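
For reference, a graceful shutdown initiated from the operating system can be performed with a standard Linux command such as the following (a minimal example; your site procedures may require a different method):

$ sudo shutdown -h now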

Warning

Risk of Danger - Removing power cables or using Power Distribution Units (PDUs) to shut off the system while the Operating System is running may cause damage to sensitive components in the DGX A100 server.

Verifying Functionality - Quick Health Check

NVIDIA provides customers with a diagnostics and management tool called NVIDIA System Management, or NVSM. The nvsm command can be used to determine the system’s health, identify component issues and alerts, or run a stress test to verify that all components are in working order while under load. Docker is also key to getting the most performance out of the system, since NVIDIA provides optimized containers for all the major frameworks and workloads used on DGX systems.

The following are the steps for performing a health check on the DGX A100 System, and verifying the Docker and NVIDIA driver installation.

  1. Establish an SSH connection to the DGX A100 System.

  2. Run a basic system check.

    $ sudo nvsm show health
    
  3. Verify that the output summary shows that all checks are Healthy and that the overall system status is Healthy.

  4. Verify that Docker is installed by viewing the installed Docker version.

    $ sudo docker --version
    

    This should return the version as “Docker version 19.03.5-ce”, where the actual version may differ depending on the specific release of the DGX OS Server software.

  5. Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.

    $ sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
    

    Docker pulls the nvidia/cuda container image layer by layer, then runs nvidia-smi.

    When completed, the output should show the NVIDIA Driver version and a description of each installed GPU.

    See the NVIDIA Containers and Deep Learning Frameworks User Guide at https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html for additional instructions, including an example of logging into the NGC container registry and launching a deep learning container.
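
If you also want to pull containers that require authentication from the NGC container registry, log in to nvcr.io first. The following is a minimal sketch; it assumes you have already generated an NGC API key, and the literal string $oauthtoken is used as the username:

$ sudo docker login nvcr.io
Username: $oauthtoken
Password: <your NGC API key>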

Running the Pre-flight Test

NVIDIA recommends running the pre-flight stress test before putting a system into a production environment or after servicing. You can specify running the test on the GPUs, CPU, memory, and storage, and also specify the duration of the tests.

To run the tests, use NVSM.

Syntax

$ sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...] [DURATION]

Getting Help

For help on running the test, issue the following.

$ sudo nvsm stress-test --usage

Recommended Test to Run

The following command runs the test on all supported components (GPU, CPU, memory, and storage), and takes approximately 20 minutes.

$ sudo nvsm stress-test --force
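
If you are scripting the test, for example as part of a post-service checklist, the --no-prompt option listed in the syntax above suppresses interactive confirmation. The combination shown here is illustrative; run nvsm stress-test --usage to confirm the options supported by your NVSM version.

$ sudo nvsm stress-test --force --no-prompt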

Running NGC Containers with GPU Support

To obtain the best performance when running NGC containers on DGX A100 systems, the following methods of providing GPU support for Docker containers are available:

  • Native GPU support (included in Docker 19.03 and later)

  • NVIDIA Container Runtime for Docker (nvidia-docker2 package)

The method implemented in your system depends on the DGX OS version installed.

DGX OS 5.0 includes both methods:

  • Native GPU support

  • NVIDIA Container Runtime for Docker (deprecated - availability to be removed in a future DGX OS release)

Each method is invoked by using specific Docker commands, described as follows.

Using Native GPU Support

Native GPU support for containers is included in Docker 19.03 and later.

Use docker run --gpus to run GPU-enabled containers.

  • Example using all GPUs

    $ sudo docker run --gpus all ...
    
  • Example using two GPUs

    $ sudo docker run --gpus 2 ...
    
  • Examples using specific GPUs

    $ sudo docker run --gpus '"device=1,2"' ...
    $ sudo docker run --gpus '"device=UUID-ABCDEF,1"' ...
    

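As a concrete illustration of combining native GPU support with an NGC container, the following starts an interactive shell in the CUDA base image used earlier in this chapter, with all GPUs exposed to the container (the image tag is only an example; other tags are available in the NGC catalog):

$ sudo docker run --gpus all --rm -it nvcr.io/nvidia/cuda:11.0-base /bin/bash
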
Using the NVIDIA Container Runtime for Docker

Currently, the DGX OS also includes the NVIDIA Container Runtime for Docker (nvidia-docker2), which lets you run GPU-accelerated containers in one of the following ways.

  • Use docker run and specify runtime=nvidia.

    $ docker run --runtime=nvidia ...
    
  • Use nvidia-docker run.

    $ nvidia-docker run ...
    

    The nvidia-docker2 package provides backward compatibility with the previous nvidia-docker package, so you can run GPU-accelerated containers using this command, and the new runtime is used automatically.

  • Use docker run with nvidia as the default runtime.

You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.

"default-runtime": "nvidia",

Here is an example of how the added line appears in the JSON file. Do not remove any pre-existing content when making this change.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
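
After saving the change, restart the Docker daemon so that the new default runtime takes effect; on DGX OS this is typically done with systemctl:

$ sudo systemctl restart docker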

You can then use docker run to run GPU-accelerated containers.

$ docker run ...

Caution

If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so might result in the container being optimized only for the GPU architecture on which it was built.

Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process.
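
As an illustrative sketch only (the actual mechanism depends on your application's build system), a CUDA source file compiled inside the container can target the A100's compute capability 8.0 explicitly through nvcc code-generation flags; my_app.cu is a placeholder name here:

$ nvcc -gencode arch=compute_80,code=sm_80 -o my_app my_app.cu

Frameworks typically expose their own build-time settings for selecting GPU architectures; consult the application's build documentation.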

Managing CPU Mitigations

DGX OS Server includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.

If your installation of DGX systems incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and thereby increase performance.

Determining the CPU Mitigation State of the DGX System

If you do not know whether CPU mitigations are enabled or disabled, issue the following.

$ cat /sys/devices/system/cpu/vulnerabilities/*
  • CPU mitigations are enabled if the output consists of multiple lines prefixed with Mitigation:.

    Example

    KVM: Mitigation: Split huge pages
    Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
    Mitigation: Clear CPU buffers; SMT vulnerable
    Mitigation: PTI
    Mitigation: Speculative Store Bypass disabled via prctl and seccomp
    Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
    Mitigation: Clear CPU buffers; SMT vulnerable
    
  • CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.

    Example

    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
    
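In either case, if you want to see which vulnerability file each line of output comes from, you can print the file names alongside the values, for example:

$ grep . /sys/devices/system/cpu/vulnerabilities/*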

Disabling CPU Mitigations

Caution

Performing the following instructions will disable the CPU mitigations provided by the DGX OS Server software.

  1. Install the nv-mitigations-off package.

    $ sudo apt install nv-mitigations-off -y
    
  2. Reboot the system.

  3. Verify CPU mitigations are disabled.

    $ cat /sys/devices/system/cpu/vulnerabilities/*
    

    The output should include several Vulnerable lines. See Determining the CPU Mitigation State of the DGX System for example output.

Re-enabling CPU Mitigations

  1. Remove the nv-mitigations-off package.

    $ sudo apt purge nv-mitigations-off
    
  2. Reboot the system.

  3. Verify CPU mitigations are enabled.

    $ cat /sys/devices/system/cpu/vulnerabilities/*
    

    The output should include several Mitigation lines. See Determining the CPU Mitigation State of the DGX System for example output.