NVIDIA HGX A100 Software User Guide
This edition of the user guide describes how to get started with the NVIDIA® HGX A100.
Introduction
NVIDIA HGX A100 combines NVIDIA A100 Tensor Core GPUs with next-generation NVIDIA® NVLink® and NVSwitch™ high-speed interconnects to create the world’s most powerful servers. HGX A100 is available as a single baseboard with four or eight A100 GPUs. The four-GPU configuration (HGX A100 4-GPU) is fully interconnected with NVIDIA NVLink, and the eight-GPU configuration (HGX A100 8-GPU) is interconnected with NVSwitch. Two NVIDIA HGX A100 8-GPU baseboards can also be combined using an NVSwitch interconnect to create a powerful 16-GPU single node.
More information is available on the product website.
This document provides an overview of the base software that NVIDIA provides to get started with using a system with NVIDIA HGX A100.
Software Configuration
Note that the HGX A100 4-GPU configuration does not include NVSwitch, so NVIDIA Fabric Manager (FM) is not a required component on that system configuration.
For convenience, NVIDIA provides packages on a network repository for installation using Linux package managers (apt/dnf/zypper), and uses package dependencies to install these software components in the correct order.
NVIDIA Datacenter Drivers
NVIDIA releases drivers that are qualified for enterprise and datacenter GPUs. The documentation portal includes release notes, the software lifecycle (including active driver branches), and installation and user guides.
According to the software lifecycle, the minimum recommended driver branch for production use with NVIDIA HGX A100 is R450. Refer to the lifecycle for active and supported driver branches.
The following table lists the currently supported datacenter driver branches.
Driver Branch | R418 | R440 | R450 |
---|---|---|---|
Branch Designation | Long Term Service Branch | New Feature Branch | Long Term Service Branch |
End of Life | March 2022 | November 2020 | July 2023 |
Maximum CUDA Version Supported | CUDA 10.1. This driver branch supports CUDA 10.2 and CUDA 11.0 through the CUDA compatibility platform. | CUDA 10.2. This driver branch supports CUDA 11.0 through the CUDA compatibility platform. | CUDA 11.0 |
Architectures Supported | Turing and below | Turing and below | NVIDIA Ampere and below |
For A100-based (NVIDIA Ampere architecture) systems such as HGX A100, the R450 driver is the minimum requirement. Before setting up the HGX A100 system, ensure that you have completed the prerequisites: the system must be running a supported Linux distribution and must have build tools (e.g., gcc and make) and kernel headers installed. More information is available in the driver installation guide on the documentation portal.
To get started with installing drivers and the NVIDIA Fabric Manager (FM), first set up the CUDA network repository and the repository priority:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g') \
    && wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin \
    && sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
Set up the GPG keys, add the repository to the APT sources, and update the package index:
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub \
    && echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list \
    && sudo apt-get update
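The `distribution` variable used in the repository URLs above is built from `/etc/os-release`. As a minimal sketch, the same derivation can be wrapped in a function that reads an explicit file, which makes it easy to check which string the one-liner will produce on a given system. The function name `repo_distribution` is hypothetical; the sed expression is the one from the command above.

```shell
# Sketch: derive the repository "distribution" string the same way the
# one-liner above does, but from an explicit os-release file.
# Hypothetical helper name: repo_distribution.
repo_distribution() {
    # Source the file in a subshell so ID and VERSION_ID do not leak out.
    (
        . "${1:-/etc/os-release}"
        # Concatenate ID and VERSION_ID, then delete every dot:
        # "ubuntu" + "18.04" -> "ubuntu1804"
        echo "$ID$VERSION_ID" | sed -e 's/\.//g'
    )
}

# Usage on the target system:
#   distribution=$(repo_distribution)
#   echo "https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/"
```

On Ubuntu 18.04, whose os-release sets `ID=ubuntu` and `VERSION_ID="18.04"`, this yields `ubuntu1804`, which matches the repository URLs in the apt-cache output in this guide.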
Because FM and NSCQ are version-locked ("rev-locked") to the driver, they must be kept at the same version as the installed driver. NVIDIA therefore provides a meta-package, cuda-drivers-fabricmanager-<branch-number>, that uses package dependencies to ensure that FM and the driver are installed together.
Because this is an HGX A100 system, select the R450 driver branch. The dependency tree for this package is shown below:
├─ cuda-drivers-fabricmanager-450
│  ├─ cuda-drivers-450 (= 450.80.02-1)
│  └─ nvidia-fabricmanager-450 (= 450.80.02-1)
The available package versions can be seen using apt-cache:
sudo apt-cache madison cuda-drivers-fabricmanager-450
cuda-drivers-fabricmanager-450 | 450.80.02-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-drivers-fabricmanager-450 | 450.51.06-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-drivers-fabricmanager-450 | 450.51.06-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
Now install the driver and FM using the cuda-drivers-fabricmanager-450 meta-package:
sudo apt-get install -y cuda-drivers-fabricmanager-450
After the driver installation completes, you may need to reboot the system. Once the system is back up, run nvidia-smi to verify that all eight GPUs are visible. (nvidia-smi lists only the GPUs; the six NVSwitches are enumerated with DCGM later in this guide.)
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                  Off |
| N/A   22C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:0F:00.0 Off |                  Off |
| N/A   22C    P0    49W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:47:00.0 Off |                  Off |
| N/A   21C    P0    49W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:4E:00.0 Off |                  Off |
| N/A   23C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:87:00.0 Off |                  Off |
| N/A   24C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:90:00.0 Off |                  Off |
| N/A   23C    P0    49W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:B7:00.0 Off |                  Off |
| N/A   23C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:BD:00.0 Off |                  Off |
| N/A   25C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
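Instead of counting rows in the table by eye, the GPU count can be checked from nvidia-smi's CSV query mode (`nvidia-smi --query-gpu=index,name --format=csv,noheader` prints one line per GPU). The sketch below assumes that output is piped in; `count_gpus` and `check_gpu_count` are hypothetical helper names, and 8 is the expected count for HGX A100 8-GPU.

```shell
# Sketch: confirm that the expected number of GPUs is visible to the driver.
# Input (stdin): output of
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# Hypothetical helper names: count_gpus, check_gpu_count.
count_gpus() {
    # Each non-empty line describes one GPU.
    awk 'NF > 0 { n++ } END { print n + 0 }'
}

check_gpu_count() {
    expected=$1
    found=$(count_gpus)    # reads the function's stdin
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found GPUs visible"
    else
        echo "FAIL: expected $expected GPUs, found $found" >&2
        return 1
    fi
}

# Usage:
#   nvidia-smi --query-gpu=index,name --format=csv,noheader | check_gpu_count 8
```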
NVIDIA Fabric Manager
Fabric Manager is an agent that configures the NVSwitches to form a single memory fabric among all participating GPUs and monitors NVLinks that support the memory fabric. For more information on using and configuring (including advanced options), refer to the Fabric Manager User Guide.
After installing the package in the previous section, check the version of FM installed:
/usr/bin/nv-fabricmanager --version
Fabric Manager version is : 450.80.02
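Because FM is version-locked to the driver, a quick sanity check is to compare the version reported here with the driver version. This is a minimal sketch: it assumes the `Fabric Manager version is : <version>` output format shown above and a driver version obtained separately, for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The helper name `fm_matches_driver` is hypothetical.

```shell
# Sketch: check that Fabric Manager and the driver report the same version.
# $1: output of `nv-fabricmanager --version`
# $2: driver version string, e.g. 450.80.02
# Hypothetical helper name: fm_matches_driver.
fm_matches_driver() {
    # Take the text after the last colon and strip blanks:
    # "Fabric Manager version is : 450.80.02" -> "450.80.02"
    fm_version=$(printf '%s\n' "$1" | awk -F':' '{ gsub(/[ \t]+/, "", $NF); print $NF; exit }')
    if [ "$fm_version" = "$2" ]; then
        echo "OK: Fabric Manager $fm_version matches driver $2"
    else
        echo "MISMATCH: Fabric Manager $fm_version vs driver $2" >&2
        return 1
    fi
}

# Usage:
#   fm_matches_driver "$(/usr/bin/nv-fabricmanager --version)" \
#       "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```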
Start the FM service using the provided systemd service file:
sudo systemctl start nvidia-fabricmanager.service
Check the status of the service:
sudo systemctl status nvidia-fabricmanager.service
● nvidia-fabricmanager.service - NVIDIA fabric manager service
   Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-10-12 11:23:25 PDT; 11min ago
  Process: 10981 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
 Main PID: 10992 (nv-fabricmanage)
    Tasks: 18 (limit: 39321)
   CGroup: /system.slice/nvidia-fabricmanager.service
           └─10992 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Oct 12 11:23:09 ubuntu1804 systemd[1]: Starting NVIDIA fabric manager service...
Oct 12 11:23:25 ubuntu1804 nv-fabricmanager[10992]: Successfully configured all the available GPUs and NVSwitches.
Oct 12 11:23:25 ubuntu1804 systemd[1]: Started NVIDIA fabric manager service.
Ensure that the Fabric Manager log (/var/log/fabricmanager.log) does not contain any errors.
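A rough way to scan the log is sketched below. It simply searches for the word "error" case-insensitively. That is an assumption about how errors appear in the log, and benign lines that mention the word will also be reported, so review the printed lines rather than relying on the count alone. The helper name `fm_log_errors` is hypothetical, and reading the log may require root.

```shell
# Sketch: report any log lines that mention "error" (case-insensitive).
# $1: log path (defaults to the path given in this guide).
# Hypothetical helper name: fm_log_errors.
fm_log_errors() {
    log=${1:-/var/log/fabricmanager.log}
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # grep -c prints the number of matching lines; it exits 1 when there
    # are none, so fall back to 0 in that case.
    count=$(grep -ci 'error' "$log") || count=0
    if [ "$count" -eq 0 ]; then
        echo "OK: no error lines in $log"
    else
        echo "FOUND $count error line(s) in $log:" >&2
        grep -i 'error' "$log" >&2
        return 1
    fi
}

# Usage:
#   sudo sh -c '. ./fm_checks.sh; fm_log_errors'   # if saved to a file
```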
Next, review the topology with the nvidia-smi topo -m command and ensure that “NV12” appears between every pair of peer GPUs. NV12 indicates that the two GPUs are connected by a bonded set of 12 NVLinks, all trained and available for full bidirectional bandwidth.
nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity    NUMA Affinity
GPU0    X      NV12   NV12   NV12   NV12   NV12   NV12   NV12   PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    48-63,176-191   3
GPU1    NV12   X      NV12   NV12   NV12   NV12   NV12   NV12   PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    48-63,176-191   3
GPU2    NV12   NV12   X      NV12   NV12   NV12   NV12   NV12   SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    16-31,144-159   1
GPU3    NV12   NV12   NV12   X      NV12   NV12   NV12   NV12   SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    16-31,144-159   1
GPU4    NV12   NV12   NV12   NV12   X      NV12   NV12   NV12   SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    112-127,240-255 7
GPU5    NV12   NV12   NV12   NV12   NV12   X      NV12   NV12   SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    112-127,240-255 7
GPU6    NV12   NV12   NV12   NV12   NV12   NV12   X      NV12   SYS    SYS    SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    80-95,208-223   5
GPU7    NV12   NV12   NV12   NV12   NV12   NV12   NV12   X      SYS    SYS    SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    80-95,208-223   5
mlx5_0  PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    X      PXB    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS
mlx5_1  PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    PXB    X      SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS
mlx5_2  SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    X      PXB    SYS    SYS    SYS    SYS    SYS    SYS
mlx5_3  SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    PXB    X      SYS    SYS    SYS    SYS    SYS    SYS
mlx5_4  SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    X      PXB    SYS    SYS    SYS    SYS
mlx5_5  SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    PXB    X      SYS    SYS    SYS    SYS
mlx5_6  SYS    SYS    SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    X      PXB    SYS    SYS
mlx5_7  SYS    SYS    SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    PXB    X      SYS    SYS
mlx5_8  SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    X      PIX
mlx5_9  SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    PIX    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
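With eight GPUs, checking 56 NV12 cells by eye is error-prone. The sketch below reads `nvidia-smi topo -m` output on stdin and verifies that each GPU row shows `X` on the diagonal and the expected link label (NV12 by default) toward every other GPU. It assumes the layout shown above, where GPU rows start in the first column and list the GPU columns in order; `check_nvlink_topo` is a hypothetical helper name.

```shell
# Sketch: verify that every GPU pair in `nvidia-smi topo -m` output uses the
# expected NVLink bond. $1: expected label (default NV12). Input on stdin.
# Hypothetical helper name: check_nvlink_topo.
check_nvlink_topo() {
    want=${1:-NV12}
    awk -v want="$want" '
        BEGIN { n = 0 }
        # GPU rows start in column 1 ("GPU3  NV12 ..."); the header row is
        # indented, so it does not match. Store every field of each GPU row.
        /^GPU[0-9]+[ \t]/ {
            for (k = 1; k <= NF; k++) cell[n, k] = $k
            n++
        }
        END {
            if (n == 0) { print "no GPU rows found"; exit 1 }
            bad = 0
            for (i = 0; i < n; i++) {
                for (j = 0; j < n; j++) {
                    # Field 1 is the row label, so the column for GPU j is field j + 2.
                    expect = (i == j) ? "X" : want
                    got = cell[i, j + 2]
                    if (got != expect) {
                        printf "GPU%d -> GPU%d: %s (expected %s)\n", i, j, got, expect
                        bad++
                    }
                }
            }
            if (bad) exit 1
            printf "OK: all %d GPUs connected by %s\n", n, want
        }'
}

# Usage:
#   nvidia-smi topo -m | check_nvlink_topo NV12
```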
NVSwitch Configuration and Query Library (NSCQ)
The NSCQ library currently provides topology information of the NVSwitches and GPUs to clients of the library such as DCGM.
Currently, DCGM is the only client of NSCQ; in the future, NSCQ will include a public API for gathering NVSwitch information. For clients such as DCGM to access NSCQ, the library must be installed on the system. In the near future, the library package will be installed as part of the driver, similar to FM.
Set up the library using the libnvidia-nscq-450 package:
sudo apt-get install -y libnvidia-nscq-450
After the package is installed, verify that the libraries are present in the standard installation path on your system:
ls -ol /usr/lib/x86_64-linux-gnu/libnvidia-nscq*
lrwxrwxrwx 1 root      24 Sep 29 13:01 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so -> libnvidia-nscq-dcgm.so.1
lrwxrwxrwx 1 root      26 Sep 29 13:01 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.1 -> libnvidia-nscq-dcgm.so.1.0
lrwxrwxrwx 1 root      32 Sep 29 13:01 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.1.0 -> libnvidia-nscq-dcgm.so.450.51.06
-rwxr-xr-x 1 root 1041416 Sep 29 12:56 /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.450.51.06
lrwxrwxrwx 1 root      19 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so -> libnvidia-nscq.so.1
lrwxrwxrwx 1 root      21 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.1 -> libnvidia-nscq.so.1.0
lrwxrwxrwx 1 root      27 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.1.0 -> libnvidia-nscq.so.450.80.02
-rw-r--r-- 1 root 1041416 Sep 22 21:48 /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.450.80.02
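Since NSCQ is also version-locked to the driver, one way to confirm the installation is to look for the library file whose suffix matches the running driver version (here, `libnvidia-nscq.so.450.80.02`). This sketch assumes the Ubuntu x86_64 library path from the listing above; `nscq_matches_driver` is a hypothetical helper name.

```shell
# Sketch: confirm that an NSCQ library matching the driver version exists.
# $1: driver version (e.g. 450.80.02)
# $2: library directory (defaults to the Ubuntu x86_64 path used above)
# Hypothetical helper name: nscq_matches_driver.
nscq_matches_driver() {
    driver=$1
    libdir=${2:-/usr/lib/x86_64-linux-gnu}
    lib="$libdir/libnvidia-nscq.so.$driver"
    if [ -f "$lib" ]; then
        echo "OK: found $lib"
    else
        echo "MISSING: $lib (NSCQ must match driver $driver)" >&2
        return 1
    fi
}

# Usage:
#   nscq_matches_driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```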
CUDA Toolkit
After installing the NVIDIA driver, Fabric Manager, and NSCQ, you can install the CUDA Toolkit to build CUDA applications on the system. If you are only deploying CUDA applications, the CUDA Toolkit is not necessary, because each CUDA application should include the dependencies it needs.
To install the CUDA Toolkit, use the cuda-toolkit-11-0 meta-package. This meta-package installs only the CUDA Toolkit; it does not install the NVIDIA driver. For other meta-packages, see the meta-packages table in the CUDA Installation Guide for Linux.
List the available CUDA Toolkit meta-packages using the following command:
apt-cache pkgnames cuda-toolkit-11-
cuda-toolkit-11-0
cuda-toolkit-11-1
APT shows that two CUDA versions are available. For the purposes of this document, choose CUDA 11.0 and list its available package versions:
sudo apt-cache madison cuda-toolkit-11-0
cuda-toolkit-11-0 | 11.0.3-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 | 11.0.3-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 | 11.0.3-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
cuda-toolkit-11-0 | 11.0.2-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 | 11.0.2-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
cuda-toolkit-11-0 | 11.0.1-1 | http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
cuda-toolkit-11-0 | 11.0.1-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
cuda-toolkit-11-0 | 11.0.0-1 | http://international.download.nvidia.com/dgx/repos/bionic bionic-4.99/multiverse amd64 Packages
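If you need a specific toolkit version rather than the newest one, the version column of this output can be extracted and passed to apt-get using APT's `package=version` syntax. A minimal sketch, with the hypothetical helper name `madison_versions`, that prints each version once in the order APT lists them:

```shell
# Sketch: list the distinct versions in `apt-cache madison` output.
# Input (stdin): lines of the form "pkg | version | source".
# Hypothetical helper name: madison_versions.
madison_versions() {
    # Field 2 is the version; strip blanks and print each version only once,
    # preserving the order in which apt listed them.
    awk -F'|' '{ v = $2; gsub(/[ \t]/, "", v); if (v != "" && !seen[v]++) print v }'
}

# Usage:
#   apt-cache madison cuda-toolkit-11-0 | madison_versions
#   sudo apt-get install -y cuda-toolkit-11-0=11.0.3-1   # pin one version
```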
You can now install the CUDA Toolkit:
sudo apt-get install -y cuda-toolkit-11-0
After CUDA is installed, build the p2pBandwidthLatencyTest sample included with CUDA, and then run the resulting binary to measure unidirectional and bidirectional GPU-to-GPU bandwidth and latency:
./bin/x86_64/linux/release/p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, A100-SXM4-40GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, A100-SXM4-40GB, pciBusID: f, pciDeviceID: 0, pciDomainID:0
Device: 2, A100-SXM4-40GB, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 3, A100-SXM4-40GB, pciBusID: 4e, pciDeviceID: 0, pciDomainID:0
Device: 4, A100-SXM4-40GB, pciBusID: 87, pciDeviceID: 0, pciDomainID:0
Device: 5, A100-SXM4-40GB, pciBusID: 90, pciDeviceID: 0, pciDomainID:0
Device: 6, A100-SXM4-40GB, pciBusID: b7, pciDeviceID: 0, pciDomainID:0
Device: 7, A100-SXM4-40GB, pciBusID: bd, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D       0       1       2       3       4       5       6       7
     0       1       1       1       1       1       1       1       1
     1       1       1       1       1       1       1       1       1
     2       1       1       1       1       1       1       1       1
     3       1       1       1       1       1       1       1       1
     4       1       1       1       1       1       1       1       1
     5       1       1       1       1       1       1       1       1
     6       1       1       1       1       1       1       1       1
     7       1       1       1       1       1       1       1       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1277.60   14.69   17.62   17.69   18.52   18.14   18.66   17.72
     1   14.96 1276.55   17.66   17.72   18.23   18.23   18.66   17.65
     2   17.58   17.59 1277.60   14.61   18.18   18.61   17.80   17.97
     3   17.53   17.73   14.68 1275.51   18.20   17.74   18.63   17.42
     4   17.79   17.62   17.84   17.93 1291.32   15.95   17.77   17.97
     5   17.46   17.85   17.79   17.93   16.51 1290.26   18.63   17.24
     6   17.35   17.78   17.65   17.81   18.51   18.85 1289.19   15.43
     7   17.44   17.82   17.87   18.11   17.49   18.74   15.90 1290.26
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1275.51  263.60  267.07  273.62  273.73  273.54  272.89  273.45
     1  263.77 1288.13  265.92  273.83  274.13  273.33  273.76  274.06
     2  263.38  265.44 1284.95  273.68  274.70  273.48  274.19  274.15
     3  265.34  266.87  266.85 1299.92  272.81  273.92  273.38  274.47
     4  266.56  266.66  268.40  275.25 1305.35  273.76  275.42  275.17
     5  266.49  266.64  266.40  275.85  273.77 1305.35  272.82  274.67
     6  265.32  267.78  269.32  266.30  274.32  275.12 1298.84  274.21
     7  267.13  266.07  269.01  266.72  275.14  274.46  274.83 1304.26
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1290.79   15.65   19.52   19.53   20.00   20.39   20.34   20.03
     1   15.97 1304.80   19.37   19.42   19.93   19.92   20.04   19.91
     2   19.09   19.21 1302.08   15.54   19.92   19.93   20.01   19.77
     3   19.17   19.28   15.65 1304.80   20.04   20.06   20.03   19.85
     4   19.48   19.63   19.71   19.85 1304.80   17.55   19.91   19.69
     5   19.45   19.65   19.76   19.94   18.19 1306.44   20.11   19.84
     6   19.49   19.73   19.73   19.95   19.59   20.09 1303.17   17.84
     7   19.29   19.48   19.56   19.60   19.88   19.62   18.31 1304.80
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1289.72  411.91  414.96  413.53  415.59  417.36  417.43  415.88
     1  410.46 1290.26  411.43  410.63  411.75  412.08  412.41  411.97
     2  414.04  412.75 1288.66  413.25  415.37  416.25  414.85  415.15
     3  409.55  410.13  411.32 1287.60  412.18  412.30  412.62  411.75
     4  414.31  414.14  417.84  413.87 1304.26  436.86  436.86  437.60
     5  413.65  414.87  417.28  414.35  437.48 1310.27  517.97  518.83
     6  413.25  414.44  418.48  415.52  437.24  521.41 1301.54  519.75
     7  414.79  414.89  418.19  413.95  438.69  517.46  517.80 1301.54
P2P=Disabled Latency Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    3.09   24.91   25.81   25.55   24.70   24.73   24.68   24.75
     1   25.33    3.07   25.70   25.59   24.82   24.67   24.48   24.70
     2   25.60   25.69    3.17   25.60   24.95   24.64   24.86   24.63
     3   25.68   25.52   25.35    3.30   24.68   24.69   24.66   24.67
     4   25.58   25.27   25.58   25.59    2.91   24.60   24.60   24.59
     5   25.68   25.54   25.59   25.42   24.57    3.01   24.60   24.60
     6   25.68   25.59   25.60   25.59   24.59   24.56    2.47   24.65
     7   25.59   25.33   25.62   25.61   24.59   24.59   24.65    2.66

   CPU       0       1       2       3       4       5       6       7
     0    4.40   13.76   13.47   13.62   12.74   12.75   12.67   12.86
     1   13.81    4.82   13.60   13.53   12.57   12.70   12.89   13.02
     2   13.69   13.43    4.40   13.41   12.62   12.70   12.62   12.82
     3   13.66   13.36   13.76    4.42   12.87   12.64   12.59   12.56
     4   12.80   12.78   12.91   12.88    4.13   12.17   12.04   12.06
     5   12.93   12.78   12.86   12.86   12.18    4.15   12.14   12.01
     6   12.74   12.81   12.91   12.87   12.06   12.02    4.41   12.01
     7   12.90   12.83   12.99   13.07   11.97   12.16   12.20    4.12
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    3.19    3.63    3.57    3.55    3.60    3.56    3.56    3.58
     1    3.63    3.06    3.62    3.59    3.56    3.57    3.55    3.62
     2    3.62    3.58    3.16    3.56    3.57    3.65    3.56    3.63
     3    3.58    3.60    3.62    3.30    3.64    3.56    3.59    3.61
     4    3.49    3.46    3.53    3.46    2.93    3.47    3.53    3.53
     5    3.47    3.54    3.56    3.53    3.53    3.00    3.53    3.46
     6    2.91    2.96    2.92    2.93    2.94    2.98    2.46    2.98
     7    3.03    3.04    3.06    3.03    3.06    3.09    3.14    2.66

   CPU       0       1       2       3       4       5       6       7
     0    4.46    3.81    3.92    3.87    3.85    3.88    3.88    3.92
     1    3.96    4.49    3.93    3.93    3.96    3.97    3.91    3.87
     2    4.00    4.03    4.52    3.93    3.93    3.93    4.12    4.23
     3    4.01    3.95    4.11    4.50    4.23    4.20    3.95    3.93
     4    4.09    3.96    3.71    3.70    4.41    3.70    3.68    3.66
     5    3.76    3.69    3.72    3.71    3.72    4.23    3.68    3.65
     6    3.77    3.63    3.65    3.94    3.64    3.71    4.20    3.60
     7    3.94    4.00    3.69    3.70    3.72    3.74    3.73    4.21

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
The results show GPU-to-GPU unidirectional bandwidth of roughly 263GB/s to 276GB/s and bidirectional bandwidth of roughly 410GB/s to 521GB/s. The diagonal entries, which measure copies within the same GPU, show around 1,300GB/s. For comparison, the theoretical NVLink bandwidth between a pair of A100 GPUs is 300GB/s in each direction (600GB/s bidirectional), so the measured values reach roughly 88% to 92% of the unidirectional peak and 68% to 87% of the bidirectional peak.
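These ranges can be computed directly from the matrices instead of being read off by eye. The sketch below, with the hypothetical helper name `p2p_summary`, reads the numeric rows of one matrix on stdin, skips the diagonal (copies within a single GPU), and reports the off-diagonal minimum and maximum as a percentage of a peak that you supply. It assumes the matrix layout shown above, where each row starts with the source device index. For A100, 12 NVLinks at 25GB/s per direction give 300GB/s per direction, or 600GB/s bidirectional.

```shell
# Sketch: summarize one bandwidth matrix from p2pBandwidthLatencyTest.
# $1: theoretical peak in GB/s (e.g. 300 unidirectional, 600 bidirectional)
# Input (stdin): rows of a single matrix, "<row> <v0> <v1> ...".
# Hypothetical helper name: p2p_summary.
p2p_summary() {
    peak=$1
    awk -v peak="$peak" '
        # Header lines such as "D\D 0 1 2 ..." are skipped; data rows start
        # with an integer device index.
        $1 ~ /^[0-9]+$/ {
            row = $1 + 0
            for (k = 2; k <= NF; k++) {
                col = k - 2
                if (col == row) continue          # diagonal: copy within one GPU
                v = $k + 0
                if (!have || v < min) min = v
                if (!have || v > max) max = v
                have = 1
            }
        }
        END {
            if (!have) exit 1
            printf "min %.2f GB/s (%.0f%%), max %.2f GB/s (%.0f%%)\n", min, 100 * min / peak, max, 100 * max / peak
        }'
}
```

For example, paste the rows of the Unidirectional P2P=Enabled matrix and pass 300, or the rows of the Bidirectional P2P=Enabled matrix and pass 600.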
NVIDIA A100 supports PCIe Gen 4.0. To measure the transfer bandwidth between the CPU (host) and a GPU (device) over PCIe, run the bandwidthTest binary in the CUDA installation directory (/usr/local/cuda/extras/demo_suite).
The test reports roughly 22.5GB/s to 23GB/s in each direction between host and device:
./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: A100-SXM4-40GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     22865.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     22554.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1173168.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
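To put the roughly 23GB/s in context, the theoretical rate of a PCIe Gen 4.0 x16 link follows from its signaling rate: 16 GT/s per lane with 128b/130b encoding gives 16 × 16 × 128/130 / 8 ≈ 31.5GB/s per direction. The sketch below does this arithmetic with awk. Assumptions: the GPU sits on an x16 link, and bandwidthTest's MB/s is treated as 10^6 bytes per second. The helper names `pcie_gen4_x16_gbps` and `pcie_efficiency_pct` are hypothetical.

```shell
# Sketch: compare bandwidthTest results with the theoretical PCIe Gen4 x16 rate.
# Gen4 signals at 16 GT/s per lane with 128b/130b encoding; 16 lanes.
# Hypothetical helper names: pcie_gen4_x16_gbps, pcie_efficiency_pct.
pcie_gen4_x16_gbps() {
    # 16 lanes * 16 GT/s * (128/130 payload bits) / 8 bits per byte
    awk 'BEGIN { printf "%.2f\n", 16 * 16 * (128 / 130) / 8 }'
}

pcie_efficiency_pct() {
    # $1: measured bandwidth in MB/s as printed by bandwidthTest
    # (assumed here to mean 10^6 bytes per second).
    awk -v mbps="$1" 'BEGIN {
        peak = 16 * 16 * (128 / 130) / 8
        printf "%.1f\n", 100 * (mbps / 1000) / peak
    }'
}
```

With the results above, `pcie_efficiency_pct 22865.7` prints 72.6 and `pcie_efficiency_pct 22554.8` prints 71.6, that is, roughly 72% of the theoretical per-direction rate. Protocol overhead (packet headers and flow control) is one reason measured throughput stays below the raw link rate.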
NVIDIA DCGM
NVIDIA DCGM is a suite of tools for managing and monitoring datacenter GPUs in cluster environments. For more information, review the product page.
To install DCGM, download the installer package for your Linux distribution. For example, the Debian package for x86_64 systems:
wget https://developer.download.nvidia.com/compute/redist/dcgm/2.0.13/DEBS/datacenter-gpu-manager_2.0.13_amd64.deb
--2020-10-12 12:12:21--  https://developer.download.nvidia.com/compute/redist/dcgm/2.0.13/DEBS/datacenter-gpu-manager_2.0.13_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 184133216 (176M) [application/x-deb]
Saving to: ‘datacenter-gpu-manager_2.0.13_amd64.deb’

datacenter-gpu-manager_2.0.13_amd64.deb 100%[==============================================================================================================>] 175.60M  105MB/s    in 1.7s

2020-10-12 12:12:22 (105 MB/s) - ‘datacenter-gpu-manager_2.0.13_amd64.deb’ saved [184133216/184133216]
Then install the package and check the version of the DCGM host engine (nv-hostengine):
sudo dpkg -i datacenter-gpu-manager_2.0.13_amd64.deb
(Reading database ... 174123 files and directories currently installed.)
Preparing to unpack datacenter-gpu-manager_2.0.13_amd64.deb ...
Unpacking datacenter-gpu-manager (1:2.0.13) over (1:2.0.10) ...
Setting up datacenter-gpu-manager (1:2.0.13) ...
nv-hostengine --version
Version : 2.0.13
Build ID : 18
Build Date : 2020-09-29
Build Type : Release
Commit ID : v2.0.12-6-gbf6e6238
Branch Name : rel_dcgm_2_0
CPU Arch : x86_64
Build Platform : Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64
Start the DCGM service with systemd, then check its status to ensure that the nv-hostengine agent has started successfully without errors:
sudo systemctl start dcgm.service
sudo systemctl status dcgm.service
● dcgm.service - DCGM service
   Loaded: loaded (/usr/lib/systemd/system/dcgm.service; disabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-10-12 12:18:57 PDT; 14s ago
 Main PID: 32847 (nv-hostengine)
    Tasks: 7 (limit: 39321)
   CGroup: /system.slice/dcgm.service
           └─32847 /usr/bin/nv-hostengine -n

Oct 12 12:18:57 ubuntu1804 systemd[1]: Started DCGM service.
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: DCGM initialized
Oct 12 12:18:58 ubuntu1804 nv-hostengine[32847]: Host Engine Listener Started
Now check that DCGM can discover all of the GPUs and NVSwitches in the system:
dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-1d82f4df-3cf9-150d-088b-52f18f8654e1                |
+--------+----------------------------------------------------------------------+
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-94168100-c5d5-1c05-9005-26953dd598e7                |
+--------+----------------------------------------------------------------------+
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-9387e4b3-3640-0064-6b80-5ace1ee535f6                |
+--------+----------------------------------------------------------------------+
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-cefd0e59-c486-c12f-418c-84ccd7a12bb2                |
+--------+----------------------------------------------------------------------+
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-1501b26d-f3e4-8501-421d-5a444b17eda8                |
+--------+----------------------------------------------------------------------+
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-f4180a63-1978-6c56-9903-ca5aac8af020                |
+--------+----------------------------------------------------------------------+
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-8b354e3e-0145-6cfc-aec6-db2c28dae134                |
+--------+----------------------------------------------------------------------+
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-a16e3b98-8be2-6a0c-7fac-9cb024dbc2df                |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 11        |
| 10        |
| 13        |
| 9         |
| 12        |
| 8         |
+-----------+
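For scripted checks, the two summary lines in this output ("8 GPUs found." before the GPU table and "6 NvSwitches found." before the switch table) are enough to confirm the expected device counts. This sketch assumes those line formats as shown above; `check_discovery` is a hypothetical helper name, and its defaults (8 GPUs, 6 NVSwitches) match HGX A100 8-GPU.

```shell
# Sketch: check GPU and NVSwitch counts in `dcgmi discovery -l` output.
# $1: expected GPUs (default 8), $2: expected NVSwitches (default 6).
# Hypothetical helper name: check_discovery.
check_discovery() {
    want_gpus=${1:-8}
    want_switches=${2:-6}
    awk -v g="$want_gpus" -v s="$want_switches" '
        # The count is the first field of each summary line.
        /GPUs found\./       { gpus = $1 }
        /NvSwitches found\./ { switches = $1 }
        END {
            printf "GPUs: %d, NVSwitches: %d\n", gpus, switches
            if (gpus + 0 != g + 0 || switches + 0 != s + 0) exit 1
        }'
}

# Usage:
#   dcgmi discovery -l | check_discovery 8 6
```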
Now check that DCGM can enumerate the NVLinks present in the system:
dcgmi nvlink -s
+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0: U U U U U U U U U U U U
    gpuId 1: U U U U U U U U U U U U
    gpuId 2: U U U U U U U U U U U U
    gpuId 3: U U U U U U U U U U U U
    gpuId 4: U U U U U U U U U U U U
    gpuId 5: U U U U U U U U U U U U
    gpuId 6: U U U U U U U U U U U U
    gpuId 7: U U U U U U U U U U U U
NvSwitches:
    physicalId 11: X X X X X X X X U U U U X X X X X X X X X X X X U U U U U U U U U U U U
    physicalId 10: X X X X X X X X U U U U U U U U X X X X X X X X X X U U U U U U X X U U
    physicalId 13: X X X X X X X X X X U U U U U U X X X X X X X X U U U U U U U U X X U U
    physicalId 9: X X X X X X X X U U U U U U U U X X X X X X X X U U U U X X X X U U U U
    physicalId 12: X X X X X X X X X X U U U U U U X X X X X X X X U U U U U U U U X X U U
    physicalId 8: X X X X X X X X U U U U X X X X X X X X X X X X U U U U U U U U U U U U

Key: Up=U, Down=D, Disabled=X, Not Supported=_
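Every GPU should report 12 links in the Up (U) state. On the NVSwitches, ports that are not in use show X (Disabled) in the output above, so the sketch below checks only the GPU entries. It accepts the link states either on the same line as the `gpuId N:` label or on the lines that follow it; `check_gpu_nvlinks` is a hypothetical helper name.

```shell
# Sketch: check per-GPU NVLink states in `dcgmi nvlink -s` output.
# $1: expected links per GPU (default 12). Input on stdin.
# Hypothetical helper name: check_gpu_nvlinks.
check_gpu_nvlinks() {
    want=${1:-12}
    awk -v want="$want" '
        BEGIN { want = want + 0; n = 0; cur = "" }
        # "gpuId N:" starts a new GPU; its states may follow on the same
        # line (from field 3) or on the following line(s) (from field 1).
        $1 == "gpuId" { cur = $2; sub(/:$/, "", cur); ids[++n] = cur; start = 3 }
        # Switch section or legend: stop attributing states to a GPU.
        $1 == "NvSwitches:" || $1 == "physicalId" || $1 == "Key:" { cur = ""; next }
        cur != "" {
            for (k = start; k <= NF; k++) {
                if ($k ~ /^[UDX_]$/) { links[cur]++; if ($k == "U") up[cur]++ }
            }
            start = 1
        }
        END {
            if (n == 0) { print "no gpuId entries found"; exit 1 }
            bad = 0
            for (i = 1; i <= n; i++) {
                id = ids[i]
                if (up[id] + 0 != want || links[id] + 0 != want) {
                    printf "gpuId %s: %d of %d links up (expected %d)\n", id, up[id], links[id], want
                    bad++
                }
            }
            if (bad) exit 1
            printf "OK: %d GPUs, each with %d links up\n", n, want
        }'
}

# Usage:
#   dcgmi nvlink -s | check_gpu_nvlinks 12
```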
Supported Software Versions
The following software versions are supported for HGX A100:
Software | Version |
---|---|
NVIDIA Datacenter Driver (R450) | 450.80.02 |
Fabric Manager | 450.80.02 |
NSCQ | 450.80.02 |
DCGM | 2.0.13 |
CUDA Toolkit | 11.0+ |
Notices
Notice
THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.
THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.
NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.
Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.