NVIDIA HGX A100 Software User Guide

This edition of the user guide describes how to get started with the NVIDIA® HGX A100.

Changelog

  • 10/23/2020: Initial Version

Introduction

NVIDIA HGX A100 combines NVIDIA A100 Tensor Core GPUs with next generation NVIDIA® NVLink® and NVSwitch™ high-speed interconnects to create the world’s most powerful servers. HGX A100 is available in single baseboards with four or eight A100 GPUs. The four-GPU configuration (HGX A100 4-GPU) is fully interconnected with NVIDIA NVLink, and the eight-GPU configuration (HGX A100 8-GPU) is interconnected with NVSwitch. Two NVIDIA HGX A100 8-GPU baseboards can also be combined using an NVSwitch interconnect to create a powerful 16-GPU single node.

More information is available on the product website.

This document provides an overview of the base software that NVIDIA provides to get started with using a system with NVIDIA HGX A100.

Software Configuration

The diagram below shows an architecture overview of the software components of the NVIDIA HGX A100. To ensure that you have a functional HGX A100 8-GPU system ready to run CUDA applications, the following software components should be installed, starting from the bottom of the software stack:
  1. NVIDIA datacenter drivers

  2. NVIDIA Fabric Manager (FM)

Note that the HGX A100 4-GPU system does not include NVSwitch, so FM is not a required component on this system configuration.

The following components are optional and not required for a fully functional HGX system. However, it is strongly recommended that they be installed:
  1. NVIDIA NVSwitch Configuration and Query Library (NSCQ)

  2. NVIDIA DCGM

  3. CUDA Toolkit

For convenience, NVIDIA provides packages on a network repository for installation using Linux package managers (apt/dnf/zypper) and uses package dependencies to install these software components in order.

Figure 1. NVIDIA GPU Management Software on HGX A100



NVIDIA Datacenter Drivers

NVIDIA releases drivers that are qualified for enterprise and datacenter GPUs. The documentation portal includes release notes, the software lifecycle (including active driver branches), and installation and user guides.

According to the software lifecycle, the minimum recommended driver for production use with NVIDIA HGX A100 is R450. Refer to the lifecycle for active and supported driver branches.

The following table lists the currently supported datacenter driver branches.

Table 1. Actively supported datacenter driver branches

                                 R418                          R440                          R450
Branch Designation               Long Term Service Branch      New Feature Branch            Long Term Service Branch
End of Life                      March 2022                    November 2020                 July 2023
Maximum CUDA Version Supported   CUDA 10.1 (supports CUDA      CUDA 10.2 (supports CUDA      CUDA 11.0
                                 10.2 and CUDA 11.0 through    11.0 through the CUDA
                                 the CUDA compatibility        compatibility platform)
                                 platform)
Architectures Supported          Turing and below              Turing and below              NVIDIA Ampere and below
Note: All driver branches not listed in the table above (e.g., R410, R396, R390) have reached end of life.

For A100 (NVIDIA Ampere architecture) based systems such as HGX A100, the R450 driver is a minimum requirement. Before setting up the HGX A100 system, ensure that you have completed the prerequisites: specifically, that you are running a supported Linux distribution and that the system has build tools (e.g., gcc and make) and kernel headers installed. More information is available here.

Note: These steps are for Ubuntu LTS distributions, such as 18.04 LTS and 20.04 LTS. The instructions can easily be adapted for RHEL.
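For example, on Ubuntu the build tools and kernel headers mentioned above can typically be installed as follows (standard Ubuntu package names; adjust for other distributions):


# build-essential provides gcc/make; the headers package must match the running kernel
sudo apt-get update
sudo apt-get install -y build-essential linux-headers-$(uname -r)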

To get started with installing drivers and the NVIDIA Fabric Manager (FM), first set up the CUDA network repository and the repository priority:


distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g') \
&& wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin \
&& sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600

            

Set up the GPG keys:


sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub \
&& echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list \
&& sudo apt-get update

            

Since the FM and NSCQ versions are locked to the driver version, NVIDIA provides a meta-package called cuda-drivers-fabricmanager-<branch-number> to ensure that FM and the driver are installed together using package dependencies.

Since this is an HGX A100 system, we will pick the R450 driver branch. The dependency tree for this package is shown below:

            
             ├─ cuda-drivers-fabricmanager-450
             │    ├─ cuda-drivers-450 (= 450.80.02-1)
             │    └─ nvidia-fabricmanager-450 (= 450.80.02-1) 
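A similar view of these dependencies can be obtained directly from APT (the output format differs from the tree above):


# Inspect the meta-package's direct dependencies
apt-cache depends cuda-drivers-fabricmanager-450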
            
            

The available package versions can be seen using apt-cache:


sudo apt-cache madison cuda-drivers-fabricmanager-450

cuda-drivers-fabricmanager-450 | 450.80.02-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-drivers-fabricmanager-450 | 450.51.06-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-drivers-fabricmanager-450 | 450.51.06-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages

            

Now install the drivers using the cuda-drivers-fabricmanager-450 meta-package:


sudo apt-get install -y cuda-drivers-fabricmanager-450

            

Once the driver installation is complete, you may need to reboot the system. When the system is back up, run the nvidia-smi command to observe all eight GPUs and six NVSwitches in the system:


nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                  Off |
| N/A   22C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                  Off |
| N/A   22C    P0    49W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                  Off |
| N/A   21C    P0    49W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                  Off |
| N/A   23C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                  Off |
| N/A   24C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                  Off |
| N/A   23C    P0    49W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                  Off |
| N/A   23C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                  Off |
| N/A   25C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
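In the output above, persistence mode (Persistence-M) is shown as On. If it reports Off on your system, persistence mode can be enabled so that the driver stays initialized even when no clients are attached; one simple way to do this (the nvidia-persistenced daemon is NVIDIA's recommended long-term mechanism) is:


# Enable persistence mode on all GPUs (requires root)
sudo nvidia-smi -pm 1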


            

NVIDIA Fabric Manager

Fabric Manager is an agent that configures the NVSwitches to form a single memory fabric among all participating GPUs and monitors NVLinks that support the memory fabric. For more information on using and configuring (including advanced options), refer to the Fabric Manager User Guide.

After installing the package in the previous section, check the version of FM installed:


/usr/bin/nv-fabricmanager --version

Fabric Manager version is : 450.80.02
        

The nvidia-fabricmanager package installs a systemd unit file for FM, and the service is typically enabled and started automatically after installation. Check its status:


sudo systemctl status nvidia-fabricmanager.service

● nvidia-fabricmanager.service - NVIDIA fabric manager service
   Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-10-12 11:23:25 PDT; 11min ago
  Process: 10981 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
 Main PID: 10992 (nv-fabricmanage)
    Tasks: 18 (limit: 39321)
   CGroup: /system.slice/nvidia-fabricmanager.service
           └─10992 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Oct 12 11:23:09 ubuntu1804 systemd[1]: Starting NVIDIA fabric manager service...
Oct 12 11:23:25 ubuntu1804 nv-fabricmanager[10992]: Successfully configured all the available GPUs and NVSwitches.
Oct 12 11:23:25 ubuntu1804 systemd[1]: Started NVIDIA fabric manager service.
        

If the service is not active, start it manually:


sudo systemctl start nvidia-fabricmanager.service
        

Ensure that the Fabric Manager log (at /var/log/fabricmanager.log) does not include any errors.
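The configuration file referenced in the systemd unit above and the log file can both be inspected directly; the grep below is a quick scan for problems, and no output means a clean log:


# View the configuration the service is running with (path taken from the unit file above)
cat /usr/share/nvidia/nvswitch/fabricmanager.cfg

# Scan the Fabric Manager log for warnings or errors
sudo grep -iE "error|warn" /var/log/fabricmanager.log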

Now review the topology with the nvidia-smi topo -m command to ensure that “NV12” appears between peer GPUs. This indicates that all 12 NVLinks are trained and available for full bidirectional bandwidth.


nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
mlx5_2  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
mlx5_3  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
mlx5_5  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
mlx5_6  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
mlx5_7  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
mlx5_8  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
mlx5_9  SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
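Per-link state can also be queried directly with the nvidia-smi nvlink subcommand; the -i flag restricts the output to a single GPU:


# Show NVLink status for GPU 0; omit -i 0 to list all GPUs
nvidia-smi nvlink --status -i 0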
        

NVSwitch Configuration and Query Library (NSCQ)

The NSCQ library currently provides topology information of the NVSwitches and GPUs to clients of the library such as DCGM.

Note that currently, DCGM is the only client of NSCQ. In the future, NSCQ will include a public API for gathering NVSwitch information. To allow clients such as DCGM to access NSCQ, the library must be installed on the system; note that in the near future, the library package will be installed as part of the driver, similar to FM.

Install the library using the libnvidia-nscq-450 package:


sudo apt-get install -y libnvidia-nscq-450
        

Once the package is installed, you should be able to verify the libraries in the standard installation path on your system:


ls -ol /usr/lib/x86_64-linux-gnu/libnvidia-nscq*

lrwxrwxrwx 1 root      24 Sep 29 13:01 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so -> libnvidia-nscq-dcgm.so.1
lrwxrwxrwx 1 root      26 Sep 29 13:01 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.1 -> libnvidia-nscq-dcgm.so.1.0
lrwxrwxrwx 1 root      32 Sep 29 13:01 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.1.0 -> libnvidia-nscq-dcgm.so.450.51.06
-rwxr-xr-x 1 root 1041416 Sep 29 12:56 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.450.51.06
lrwxrwxrwx 1 root      19 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so -> libnvidia-nscq.so.1
lrwxrwxrwx 1 root      21 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.1 -> libnvidia-nscq.so.1.0
lrwxrwxrwx 1 root      27 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.1.0 -> libnvidia-nscq.so.450.80.02
-rw-r--r-- 1 root 1041416 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.450.80.02
        

CUDA Toolkit

After installing the NVIDIA driver, Fabric Manager and NSCQ, you can proceed to install the CUDA Toolkit on the system to build CUDA applications. Note that if you are deploying CUDA applications only, then the CUDA Toolkit is not necessary as the CUDA application should include the dependencies it needs.

To install CUDA Toolkit, let’s use the cuda-toolkit-11-0 meta-package. For other meta-packages, review this table in the documentation. This meta-package installs only the CUDA Toolkit (and does not install the NVIDIA driver).

Check the meta-packages available using the following command:


sudo apt-cache pkgnames cuda-toolkit-11-

cuda-toolkit-11-0
cuda-toolkit-11-1
        

APT shows that two CUDA versions are available. Let’s choose CUDA 11.0 for the purposes of this document:


sudo apt-cache madison cuda-toolkit-11-0

cuda-toolkit-11-0 |   11.0.3-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 |   11.0.3-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 |   11.0.3-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
cuda-toolkit-11-0 |   11.0.2-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 |   11.0.2-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
cuda-toolkit-11-0 |   11.0.1-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 |   11.0.1-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
cuda-toolkit-11-0 |   11.0.0-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
        

You can now proceed to install the CUDA Toolkit:


sudo apt-get install -y cuda-toolkit-11-0
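After installation, the toolkit lives under a versioned directory (typically /usr/local/cuda-11.0, with /usr/local/cuda as a symlink). To use nvcc and the other bundled tools from a shell, the PATH and library search path are commonly set as follows (typical paths; adjust to match your installation):


# Make CUDA 11.0 tools and libraries visible to the current shell session
export PATH=/usr/local/cuda-11.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Running nvcc --version is a quick way to confirm that the toolkit is on the path.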
        

With CUDA installed, let’s build the p2pBandwidthLatencyTest sample included with the toolkit. Once the binary is available, we can run it to check the unidirectional and bidirectional peer-to-peer bandwidth.
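The sample sources ship with the toolkit; the paths below assume the default CUDA 11.0 package layout (samples under /usr/local/cuda-11.0/samples) and may differ on your system. One way to build just this sample in a writable location is:


# Copy the samples tree somewhere writable and build only the P2P test
cp -r /usr/local/cuda-11.0/samples ~/cuda-samples
cd ~/cuda-samples
make -C 1_Utilities/p2pBandwidthLatencyTest

The sample Makefiles typically also place the resulting binary in bin/x86_64/linux/release/ at the top of the samples tree, which is the relative path used in the run below.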


./bin/x86_64/linux/release/p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-SXM4-40GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-SXM4-40GB, pciBusID: f, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-SXM4-40GB, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-SXM4-40GB, pciBusID: 4e, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-SXM4-40GB, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-SXM4-40GB, pciBusID: 90, pciDeviceID: 0, pciDomainID:0
Device: 6, A100-SXM4-40GB, pciBusID: b7, pciDeviceID: 0, pciDomainID:0
Device: 7, A100-SXM4-40GB, pciBusID: bd, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     1     1     1     1     1     1     1
     1       1     1     1     1     1     1     1     1
     2       1     1     1     1     1     1     1     1
     3       1     1     1     1     1     1     1     1
     4       1     1     1     1     1     1     1     1
     5       1     1     1     1     1     1     1     1
     6       1     1     1     1     1     1     1     1
     7       1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1277.60  14.69  17.62  17.69  18.52  18.14  18.66  17.72
     1  14.96 1276.55  17.66  17.72  18.23  18.23  18.66  17.65
     2  17.58  17.59 1277.60  14.61  18.18  18.61  17.80  17.97
     3  17.53  17.73  14.68 1275.51  18.20  17.74  18.63  17.42
     4  17.79  17.62  17.84  17.93 1291.32  15.95  17.77  17.97
     5  17.46  17.85  17.79  17.93  16.51 1290.26  18.63  17.24
     6  17.35  17.78  17.65  17.81  18.51  18.85 1289.19  15.43
     7  17.44  17.82  17.87  18.11  17.49  18.74  15.90 1290.26
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1275.51 263.60 267.07 273.62 273.73 273.54 272.89 273.45
     1 263.77 1288.13 265.92 273.83 274.13 273.33 273.76 274.06
     2 263.38 265.44 1284.95 273.68 274.70 273.48 274.19 274.15
     3 265.34 266.87 266.85 1299.92 272.81 273.92 273.38 274.47
     4 266.56 266.66 268.40 275.25 1305.35 273.76 275.42 275.17
     5 266.49 266.64 266.40 275.85 273.77 1305.35 272.82 274.67
     6 265.32 267.78 269.32 266.30 274.32 275.12 1298.84 274.21
     7 267.13 266.07 269.01 266.72 275.14 274.46 274.83 1304.26
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1290.79  15.65  19.52  19.53  20.00  20.39  20.34  20.03
     1  15.97 1304.80  19.37  19.42  19.93  19.92  20.04  19.91
     2  19.09  19.21 1302.08  15.54  19.92  19.93  20.01  19.77
     3  19.17  19.28  15.65 1304.80  20.04  20.06  20.03  19.85
     4  19.48  19.63  19.71  19.85 1304.80  17.55  19.91  19.69
     5  19.45  19.65  19.76  19.94  18.19 1306.44  20.11  19.84
     6  19.49  19.73  19.73  19.95  19.59  20.09 1303.17  17.84
     7  19.29  19.48  19.56  19.60  19.88  19.62  18.31 1304.80
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7
     0 1289.72 411.91 414.96 413.53 415.59 417.36 417.43 415.88
     1 410.46 1290.26 411.43 410.63 411.75 412.08 412.41 411.97
     2 414.04 412.75 1288.66 413.25 415.37 416.25 414.85 415.15
     3 409.55 410.13 411.32 1287.60 412.18 412.30 412.62 411.75
     4 414.31 414.14 417.84 413.87 1304.26 436.86 436.86 437.60
     5 413.65 414.87 417.28 414.35 437.48 1310.27 517.97 518.83
     6 413.25 414.44 418.48 415.52 437.24 521.41 1301.54 519.75
     7 414.79 414.89 418.19 413.95 438.69 517.46 517.80 1301.54
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   3.09  24.91  25.81  25.55  24.70  24.73  24.68  24.75
     1  25.33   3.07  25.70  25.59  24.82  24.67  24.48  24.70
     2  25.60  25.69   3.17  25.60  24.95  24.64  24.86  24.63
     3  25.68  25.52  25.35   3.30  24.68  24.69  24.66  24.67
     4  25.58  25.27  25.58  25.59   2.91  24.60  24.60  24.59
     5  25.68  25.54  25.59  25.42  24.57   3.01  24.60  24.60
     6  25.68  25.59  25.60  25.59  24.59  24.56   2.47  24.65
     7  25.59  25.33  25.62  25.61  24.59  24.59  24.65   2.66

   CPU     0      1      2      3      4      5      6      7
     0   4.40  13.76  13.47  13.62  12.74  12.75  12.67  12.86
     1  13.81   4.82  13.60  13.53  12.57  12.70  12.89  13.02
     2  13.69  13.43   4.40  13.41  12.62  12.70  12.62  12.82
     3  13.66  13.36  13.76   4.42  12.87  12.64  12.59  12.56
     4  12.80  12.78  12.91  12.88   4.13  12.17  12.04  12.06
     5  12.93  12.78  12.86  12.86  12.18   4.15  12.14  12.01
     6  12.74  12.81  12.91  12.87  12.06  12.02   4.41  12.01
     7  12.90  12.83  12.99  13.07  11.97  12.16  12.20   4.12
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   3.19   3.63   3.57   3.55   3.60   3.56   3.56   3.58
     1   3.63   3.06   3.62   3.59   3.56   3.57   3.55   3.62
     2   3.62   3.58   3.16   3.56   3.57   3.65   3.56   3.63
     3   3.58   3.60   3.62   3.30   3.64   3.56   3.59   3.61
     4   3.49   3.46   3.53   3.46   2.93   3.47   3.53   3.53
     5   3.47   3.54   3.56   3.53   3.53   3.00   3.53   3.46
     6   2.91   2.96   2.92   2.93   2.94   2.98   2.46   2.98
     7   3.03   3.04   3.06   3.03   3.06   3.09   3.14   2.66

   CPU     0      1      2      3      4      5      6      7
     0   4.46   3.81   3.92   3.87   3.85   3.88   3.88   3.92
     1   3.96   4.49   3.93   3.93   3.96   3.97   3.91   3.87
     2   4.00   4.03   4.52   3.93   3.93   3.93   4.12   4.23
     3   4.01   3.95   4.11   4.50   4.23   4.20   3.95   3.93
     4   4.09   3.96   3.71   3.70   4.41   3.70   3.68   3.66
     5   3.76   3.69   3.72   3.71   3.72   4.23   3.68   3.65
     6   3.77   3.63   3.65   3.94   3.64   3.71   4.20   3.60
     7   3.94   4.00   3.69   3.70   3.72   3.74   3.73   4.21

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
        

The values above show GPU-to-GPU unidirectional transfer bandwidth ranging from 263 GB/s to 275 GB/s, and bidirectional bandwidth ranging from 413 GB/s to 521 GB/s. The diagonal entries show copy bandwidth within a single GPU of around 1,300 GB/s. The peer-to-peer numbers approach the theoretical 600 GB/s of bidirectional NVLink bandwidth (12 NVLinks at 25 GB/s per direction per link) that can be achieved between pairs of GPUs.

NVIDIA A100 supports PCIe Gen 4, and we can observe the bus bandwidth between the CPU and GPU using the bandwidthTest binary available in the CUDA installation directory (/usr/local/cuda/extras/demo_suite).
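Because the host-to-device numbers depend on the negotiated PCIe link, it can be useful to confirm the link generation and width first. The query below relies on the "GPU Link Info" section of the nvidia-smi -q output (section name taken from that output format; adjust the grep if it differs on your driver version):


# Show the negotiated PCIe generation and link width for GPU 0
nvidia-smi -q -i 0 | grep -A 6 "GPU Link Info"

The bandwidthTest run below assumes the working directory is /usr/local/cuda/extras/demo_suite.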

The test shows a bandwidth between device and host of around 23 GB/s:


./bandwidthTest

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: A100-SXM4-40GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     22865.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     22554.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1173168.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
        

NVIDIA DCGM

NVIDIA DCGM is a suite of tools for managing and monitoring datacenter GPUs in cluster environments. For more information, review the product page.

To install DCGM for your Linux distribution, download the installer package:


wget --no-check-certificate https://developer.download.nvidia.com/compute/redist/dcgm/2.0.13/DEBS/datacenter-gpu-manager_2.0.13_amd64.deb

--2020-10-12 12:12:21--  https://developer.download.nvidia.com/compute/redist/dcgm/2.0.13/DEBS/datacenter-gpu-manager_2.0.13_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 184133216 (176M) [application/x-deb]
Saving to: ‘datacenter-gpu-manager_2.0.13_amd64.deb’

datacenter-gpu-manager_2.0.13_amd64.deb            100%[==============================================================================================================>] 175.60M   105MB/s    in 1.7s

2020-10-12 12:12:22 (105 MB/s) - ‘datacenter-gpu-manager_2.0.13_amd64.deb’ saved [184133216/184133216]
        

Then install the package and check the version of the installed nv-hostengine agent:


sudo dpkg -i datacenter-gpu-manager_2.0.13_amd64.deb

(Reading database ... 174123 files and directories currently installed.)
Preparing to unpack datacenter-gpu-manager_2.0.13_amd64.deb ...
Unpacking datacenter-gpu-manager (1:2.0.13) over (1:2.0.10) ...
Setting up datacenter-gpu-manager (1:2.0.13) ...

$ nv-hostengine --version
Version : 2.0.13
Build ID : 18
Build Date : 2020-09-29
Build Type : Release
Commit ID : v2.0.12-6-gbf6e6238
Branch Name : rel_dcgm_2_0
CPU Arch : x86_64
Build Platform : Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64
        

Start the DCGM service with systemd, then check its status to ensure that the nv-hostengine agent has started successfully without errors:


sudo systemctl start dcgm.service
sudo systemctl status dcgm.service

● dcgm.service - DCGM service
   Loaded: loaded (/usr/lib/systemd/system/dcgm.service; disabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-10-12 12:18:57 PDT; 14s ago
 Main PID: 32847 (nv-hostengine)
    Tasks: 7 (limit: 39321)
   CGroup: /system.slice/dcgm.service
           └─32847 /usr/bin/nv-hostengine -n

Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started
        

Now check that DCGM can enumerate the topology of the system:



dcgmi discovery -l

8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1                |
+--------+----------------------------------------------------------------------+
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7                |
+--------+----------------------------------------------------------------------+
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6                |
+--------+----------------------------------------------------------------------+
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2                |
+--------+----------------------------------------------------------------------+
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8                |
+--------+----------------------------------------------------------------------+
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020                |
+--------+----------------------------------------------------------------------+
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134                |
+--------+----------------------------------------------------------------------+
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df                |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 11        |
| 10        |
| 13        |
| 9         |
| 12        |
| 8         |
+-----------+
        

Now check that DCGM can enumerate the NVLinks present in the system:



dcgmi nvlink -s

+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        U U U U U U U U U U U U
    gpuId 1:
        U U U U U U U U U U U U
    gpuId 2:
        U U U U U U U U U U U U
    gpuId 3:
        U U U U U U U U U U U U
    gpuId 4:
        U U U U U U U U U U U U
    gpuId 5:
        U U U U U U U U U U U U
    gpuId 6:
        U U U U U U U U U U U U
    gpuId 7:
        U U U U U U U U U U U U
NvSwitches:
    physicalId 11:
        X X X X X X X X U U U U X X X X X X X X X X X X U U U U U U U U U U U U
    physicalId 10:
        X X X X X X X X U U U U U U U U X X X X X X X X X X U U U U U U X X U U
    physicalId 13:
        X X X X X X X X X X U U U U U U X X X X X X X X U U U U U U U U X X U U
    physicalId 9:
        X X X X X X X X U U U U U U U U X X X X X X X X U U U U X X X X U U U U
    physicalId 12:
        X X X X X X X X X X U U U U U U X X X X X X X X U U U U U U U U X X U U
    physicalId 8:
        X X X X X X X X U U U U X X X X X X X X X X X X U U U U U U U U U U U U

Key: Up=U, Down=D, Disabled=X, Not Supported=_
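Beyond enumerating the topology, DCGM can run active health checks on the GPUs. A quick sanity check is the short run-level 1 diagnostic; longer run levels are described in the DCGM documentation:


# Run the short (level 1) DCGM diagnostic across all GPUs
dcgmi diag -r 1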
        

Supported Software Versions

The following software versions are supported for HGX A100:

Table 2. Supported software versions for HGX A100

Software          Version
R450              450.80.02
Fabric Manager    450.80.02
NSCQ              450.80.02
DCGM              2.0.13
CUDA Toolkit      11.0+

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA and the NVIDIA logo are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.