Base OS - DGX OS 5

System Configuration

This section provides information about less common configuration options once a system has been installed.

Refer also to Appendix A: DGX OS Connectivity Requirements for a list of network ports used by various services.

This section provides information about how you can configure the network in your DGX system.

Configuring Network Proxies

If your network needs to use a proxy server, you need to set up configuration files to ensure the DGX system communicates through the proxy.

For the OS and Most Applications

Here is some information about configuring the network for the OS and other applications.

Edit the /etc/environment file and add the following proxy addresses to the file, below the PATH line.

http_proxy="http://<username>:<password>@<host>:<port>/"
ftp_proxy="ftp://<username>:<password>@<host>:<port>/"
https_proxy="https://<username>:<password>@<host>:<port>/"
no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com"
HTTP_PROXY="http://<username>:<password>@<host>:<port>/"
FTP_PROXY="ftp://<username>:<password>@<host>:<port>/"
HTTPS_PROXY="https://<username>:<password>@<host>:<port>/"
NO_PROXY="localhost,127.0.0.1,localaddress,.localdomain.com"

Where username and password are optional.

For example, for the HTTP proxy (both the uppercase and lowercase versions must be changed):

http_proxy="http://myproxy.server.com:8080/"
HTTP_PROXY="http://myproxy.server.com:8080/"
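Because every proxy variable has to appear in both lowercase and uppercase, it can help to generate the lines rather than type them. The following is a minimal sketch, not part of DGX OS: the function name proxy_env_lines is hypothetical, the host and port are the example values used above, and the no_proxy entries still need to be added by hand.

```shell
#!/bin/sh
# Sketch: print proxy lines suitable for /etc/environment.
# proxy_env_lines is a hypothetical helper; host and port are example values.
proxy_env_lines() {
    host=$1
    port=$2
    for scheme in http https ftp; do
        # Build the uppercase variable prefix (HTTP, HTTPS, FTP).
        upper=$(printf '%s' "$scheme" | tr '[:lower:]' '[:upper:]')
        # Emit the lowercase and uppercase variants together so neither is forgotten.
        printf '%s_proxy="%s://%s:%s/"\n' "$scheme" "$scheme" "$host" "$port"
        printf '%s_PROXY="%s://%s:%s/"\n' "$upper" "$scheme" "$host" "$port"
    done
}

proxy_env_lines myproxy.server.com 8080
```

Review the printed lines before you paste them below the PATH line in /etc/environment.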

For the apt Package Manager

Here is some information about configuring the network for the apt package manager.

Edit or create the /etc/apt/apt.conf.d/myproxy proxy configuration file and include the following lines:

Acquire::http::proxy "http://<username>:<password>@<host>:<port>/";
Acquire::ftp::proxy "ftp://<username>:<password>@<host>:<port>/";
Acquire::https::proxy "https://<username>:<password>@<host>:<port>/";

For example:

Acquire::http::proxy "http://myproxy.server.com:8080/";
Acquire::ftp::proxy "ftp://myproxy.server.com:8080/";
Acquire::https::proxy "https://myproxy.server.com:8080/";

To ensure that Docker can access the NGC container registry through a proxy, Docker uses environment variables.

For best practice recommendations on configuring proxy environment variables for Docker, refer to Control Docker with systemd.
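For Docker, the proxy variables are typically passed to the daemon through a systemd drop-in file, which is the approach that Control Docker with systemd describes. The sketch below only prints such a drop-in. The function name docker_proxy_dropin, the proxy URL, and the target file name http-proxy.conf are illustrative assumptions, so confirm the details against the Docker documentation before you use them.

```shell
#!/bin/sh
# Sketch: print a systemd drop-in that sets proxy variables for the Docker daemon.
# docker_proxy_dropin is a hypothetical helper; the URL and exclusions are examples.
docker_proxy_dropin() {
    proxy_url=$1
    no_proxy_list=$2
    printf '[Service]\n'
    printf 'Environment="HTTP_PROXY=%s"\n' "$proxy_url"
    printf 'Environment="HTTPS_PROXY=%s"\n' "$proxy_url"
    printf 'Environment="NO_PROXY=%s"\n' "$no_proxy_list"
}

docker_proxy_dropin "http://myproxy.server.com:8080/" "localhost,127.0.0.1"
```

One way to apply the result is to save it as a file such as /etc/systemd/system/docker.service.d/http-proxy.conf (the file name is an assumption), and then run sudo systemctl daemon-reload followed by sudo systemctl restart docker.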

Preparing the DGX System to be Used With Docker

Some initial setup of the DGX system is required to ensure that users have the required privileges to run Docker containers and to prevent IP address conflicts between Docker and the DGX system.

Enabling Users To Run Docker Containers

To prevent the docker daemon from running without protection against escalation of privileges, the Docker software requires sudo privileges to run containers. Meeting this requirement involves enabling users who will run Docker containers to run commands with sudo privileges.

You should ensure that only users whom you trust and who are aware of the potential risks to the DGX system of running commands with sudo privileges can run Docker containers.

Before you allow multiple users to run commands with sudo privileges, consult your IT department to determine whether you might be violating your organization’s security policies. For the security implications of enabling users to run Docker containers, see Docker daemon attack surface.

You can enable users to run the Docker containers in one of the following ways:

  • Add each user as an administrator user with sudo privileges.

  • Add each user as a standard user without sudo privileges and then add the user to the docker group.

Adding users to the docker group is inherently insecure because any user who can send commands to the Docker engine can escalate privileges and run root-user operations.

To add an existing user to the docker group, run this command:

$ sudo usermod -aG docker user-login-id

user-login-id

The user login ID of the existing user that you are adding to the docker group.
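After running usermod, you may want to confirm that the membership is in place. The following is a minimal sketch that relies only on the standard id utility; the function name user_in_group is hypothetical. Group changes normally apply to new login sessions, so a user who is already logged in may need to log out and back in first.

```shell
#!/bin/sh
# Sketch: check whether a user belongs to a group.
# user_in_group is a hypothetical helper name.
user_in_group() {
    # id -nG prints the names of all groups the user belongs to.
    for g in $(id -nG "$1" 2>/dev/null); do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

current_user=$(id -un)
if user_in_group "$current_user" docker; then
    echo "$current_user is a member of the docker group"
else
    echo "$current_user is not a member of the docker group"
fi
```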

Configuring Docker IP Addresses

To ensure that your DGX system can access the network interfaces for Docker containers, Docker should be configured to use a subnet distinct from other network resources used by the DGX system.

By default, Docker uses the 172.17.0.0/16 subnet. Consult your network administrator to find out which IP addresses are used by your network. If your network does not conflict with the default Docker IP address range, no changes are needed and you can skip this section. However, if your network uses the addresses in this range for the DGX system, you should change the default Docker network addresses.
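To judge whether an address on your network falls inside the default Docker range, you can compare the network parts of the addresses. The following is a minimal sketch that uses plain shell arithmetic; the function names ip_to_int and ip_in_cidr are hypothetical, the test addresses are examples, and the sketch assumes a shell with 64-bit arithmetic and octets written without leading zeros.

```shell
#!/bin/sh
# Sketch: test whether an IPv4 address lies inside a CIDR range.
# ip_to_int and ip_in_cidr are hypothetical helper names.
ip_to_int() {
    # Split the dotted-quad address on '.' into $1..$4.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

ip_in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build the netmask for the prefix length, limited to 32 bits.
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    [ $(( $addr & $mask )) -eq $(( $net & $mask )) ]
}

for candidate in 172.17.5.20 10.10.10.2; do
    if ip_in_cidr "$candidate" 172.17.0.0/16; then
        echo "$candidate is inside the default Docker range"
    else
        echo "$candidate is outside the default Docker range"
    fi
done
```

A single in-range address is enough to indicate a possible conflict, but your network administrator remains the authoritative source for which ranges are in use.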

You can change the default Docker network addresses by modifying the /etc/docker/daemon.json file or the /etc/systemd/system/docker.service.d/docker-override.conf file. These instructions provide an example of modifying the /etc/systemd/system/docker.service.d/docker-override.conf file to override the default Docker network addresses.

  1. Open the docker-override.conf file for editing.

    $ sudo vi /etc/systemd/system/docker.service.d/docker-override.conf

    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd:// -s overlay2
    LimitMEMLOCK=infinity
    LimitSTACK=67108864

  2. Add the --bip and --fixed-cidr options to the dockerd command line as shown below, setting the correct bridge IP address and IP address ranges for your network.

    Consult your IT administrator for the correct addresses.

    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --bip=192.168.127.1/24 --fixed-cidr=192.168.127.128/25
    LimitMEMLOCK=infinity
    LimitSTACK=67108864

  3. Save and close the /etc/systemd/system/docker.service.d/docker-override.conf file.

  4. Reload the systemctl daemon.

    $ sudo systemctl daemon-reload

  5. Restart Docker.

    $ sudo systemctl restart docker

Connectivity Requirements for NGC Containers

To run NVIDIA NGC containers from the NGC container registry, your network must be able to access the following URLs:

To verify the connection to nvcr.io, run:

$ wget https://nvcr.io/v2

You should see the connection being established, followed by a 401 error:

--2018-08-01 19:42:58-- https://nvcr.io/v2
Resolving nvcr.io (nvcr.io) --> 52.8.131.152, 52.9.8.8
Connecting to nvcr.io (nvcr.io)|52.8.131.152|:443. --> connected.
HTTP request sent, awaiting response. --> 401 Unauthorized
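The 401 response is the expected result here, because it shows that the registry was reached and is asking for credentials. The following is a minimal sketch of how that check could be scripted. The function name classify_registry_status is hypothetical, and the commented usage line assumes that curl is installed, which this guide does not state.

```shell
#!/bin/sh
# Sketch: interpret the HTTP status code returned by the registry endpoint.
# classify_registry_status is a hypothetical helper name.
classify_registry_status() {
    case "$1" in
        401) echo "reachable: the registry answered and requires authentication" ;;
        000) echo "unreachable: no HTTP response (check the proxy, DNS, and firewall)" ;;
        *)   echo "unexpected: HTTP status $1" ;;
    esac
}

# On a system with network access (assumes curl is installed):
#   classify_registry_status "$(curl -s -o /dev/null -w '%{http_code}' https://nvcr.io/v2)"
classify_registry_status 401
```

curl reports the status 000 when it receives no HTTP response at all, which is why that value is treated as unreachable.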

Configuring Static IP Addresses for the Network Ports

Here are the steps to configure static IP addresses for network ports.

During the initial boot setup process for your DGX system, one of the steps was to configure static IP addresses for a network interface. If you did not configure the addresses at that time, you can configure the static IP addresses from the Ubuntu command line by using the following instructions.

Note

If you are connecting to the DGX console remotely, connect by using the BMC remote console. If you connect using SSH, your connection will be lost when you complete the final step. Also, if you encounter issues with the configuration file, the BMC connection will help with troubleshooting.

If you cannot remotely access the DGX system, connect a display with a 1440x900 or lower resolution, and a keyboard directly to the DGX system.

  1. Determine the port designation that you want to configure, based on the physical Ethernet port that you have connected to your network. See Configuring Network Proxies for the port designation of the connection that you want to configure.

  2. Edit the network configuration YAML file.

    Note

    Ensure that your file is identical to the following sample, and use spaces, not tabs.

    $ sudo vi /etc/netplan/01-netcfg.yaml

    network:
      version: 2
      renderer: networkd
      ethernets:
        <port-designation>:
          dhcp4: no
          dhcp6: no
          addresses: [10.10.10.2/24]
          gateway4: 10.10.10.1
          nameservers:
            search: [<mydomain>, <other-domain>]
            addresses: [10.10.10.1, 1.1.1.1]

    Consult your network administrator for the appropriate values for your site, such as the network, gateway, and nameserver addresses, and use the port designation that you determined in step 1.

  3. After you complete your edits, press ESC to switch to the command mode.

  4. Save the file to the disk and exit the editor.

  5. Apply the changes.

    $ sudo netplan apply

Note

If you are not returned to the command line prompt, see Changes, errors, and bugs in the Ubuntu Server Guide for more information.
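Because the netplan file must use spaces and not tabs, a quick check before you run netplan apply can catch a common mistake. The following is a minimal sketch; the function name yaml_has_tabs is hypothetical, and the sample file is a throwaway created only for the demonstration.

```shell
#!/bin/sh
# Sketch: detect tab characters in a YAML file before applying it.
# yaml_has_tabs is a hypothetical helper name.
yaml_has_tabs() {
    tab=$(printf '\t')
    grep -q "$tab" "$1"
}

# Demonstration on a temporary file that contains one tab-indented line.
sample=$(mktemp)
printf 'network:\n  version: 2\n\trenderer: networkd\n' > "$sample"
if yaml_has_tabs "$sample"; then
    echo "tabs found: replace them with spaces before running netplan apply"
else
    echo "no tabs found"
fi
rm -f "$sample"
```

On a DGX system, you would pass the real file instead, for example yaml_has_tabs /etc/netplan/01-netcfg.yaml.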

DGX OS software includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.

If your DGX system installation incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and increase performance.

Determining the CPU Mitigation State of the DGX System

Here is information about how you can determine the CPU mitigation state of your DGX system.

If you do not know whether CPU mitigations are enabled or disabled, run the following command:

$ cat /sys/devices/system/cpu/vulnerabilities/*

CPU mitigations are enabled when the output consists of multiple lines prefixed with Mitigation:.

For example:

KVM: Mitigation: Split huge pages
Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
Mitigation: Clear CPU buffers; SMT vulnerable
Mitigation: PTI
Mitigation: Speculative Store Bypass disabled via prctl and seccomp
Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
Mitigation: Clear CPU buffers; SMT vulnerable

CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.

KVM: Vulnerable
Mitigation: PTE Inversion; VMX: vulnerable
Vulnerable; SMT vulnerable
Vulnerable
Vulnerable
Vulnerable: user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerable
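Reading the raw output can be error-prone because a mitigated entry may still contain the word vulnerable (for example, "SMT vulnerable"). The following is a minimal sketch that classifies each entry by how its line begins and prints a count. The function name summarize_mitigations is hypothetical, and the optional directory argument exists only so that the sketch can be tried on sample files.

```shell
#!/bin/sh
# Sketch: count mitigated and vulnerable entries.
# summarize_mitigations is a hypothetical helper name.
summarize_mitigations() {
    dir=${1:-/sys/devices/system/cpu/vulnerabilities}
    mitigated=0
    vulnerable=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Classify by the start of the line, ignoring an optional "KVM: " prefix.
        status=$(sed 's/^KVM: //' "$f")
        case "$status" in
            Mitigation*) mitigated=$((mitigated + 1)) ;;
            Vulnerable*) vulnerable=$((vulnerable + 1)) ;;
        esac
    done
    echo "mitigated=$mitigated vulnerable=$vulnerable"
}

summarize_mitigations
```

Entries that report neither state, such as processors that are not affected by a given issue, are not counted.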

Disabling CPU Mitigations

Here are the steps to disable CPU mitigations.

Caution: Performing the following instructions will disable the CPU mitigations provided by the DGX OS software.

  1. Install the nv-mitigations-off package.

    $ sudo apt install nv-mitigations-off -y

  2. Reboot the system.

  3. Verify that the CPU mitigations are disabled.

    $ cat /sys/devices/system/cpu/vulnerabilities/*

The output should include several Vulnerable lines. See Determining the CPU Mitigation State of the DGX System for example output.

Re-enable CPU Mitigations

Here are the steps to enable CPU mitigations again.

  1. Remove the nv-mitigations-off package.

    $ sudo apt purge nv-mitigations-off

  2. Reboot the system.

  3. Verify that the CPU mitigations are enabled.

    $ cat /sys/devices/system/cpu/vulnerabilities/*

The output should include several Mitigation lines. See Determining the CPU Mitigation State of the DGX System for example output.

This section provides information about managing the DGX Crash Dump feature. You can use the nvidia-kdump-config script that is included in DGX OS to manage this feature.

Using the Script

Here are commands that help you complete the necessary tasks with the script.

  • To enable only dmesg crash dumps, run:

    $ /usr/sbin/nvidia-kdump-config enable-dmesg-dump

    This option reserves memory for the crash kernel.

  • To enable both dmesg and vmcore crash dumps, run:

    $ /usr/sbin/nvidia-kdump-config enable-vmcore-dump

    This option reserves memory for the crash kernel.

  • To disable crash dumps, run:

    $ /usr/sbin/nvidia-kdump-config disable

This option disables the use of kdump and ensures that no memory is reserved for the crash kernel.
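On Linux, memory for a crash kernel is usually reserved through the crashkernel= kernel parameter. Whether nvidia-kdump-config uses exactly this mechanism is an assumption here, not something this guide states. Under that assumption, the following minimal sketch checks a kernel command line for the parameter; the function name crash_memory_reserved is hypothetical.

```shell
#!/bin/sh
# Sketch: check a kernel command line for a crashkernel= parameter.
# Assumption: crash kernel memory is reserved through crashkernel=, as is usual on Linux.
# crash_memory_reserved is a hypothetical helper name.
crash_memory_reserved() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -r /proc/cmdline ] && crash_memory_reserved "$(cat /proc/cmdline)"; then
    echo "crashkernel parameter present: memory appears to be reserved"
else
    echo "no crashkernel parameter found on the running kernel"
fi
```

Changes to the kernel command line normally take effect only after a reboot.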

You can connect to the system console by using serial over LAN (SOL).

Warning

This applies only to systems that have the BMC.

While dumping vmcore, the BMC screen console goes blank approximately 11 minutes after the crash dump is started. To view the console output during the crash dump, connect to serial over LAN as follows:

$ ipmitool -I lanplus -H <bmc-ip-address> -U <username> -P <password> sol activate

Here is some information about filesystem quotas.

When running NGC containers, you might need to limit the amount of disk space that is used on a filesystem to avoid filling up the partition. Refer to How to Set Filesystem Quotas on Ubuntu 18.04 for information about how to set filesystem quotas on Ubuntu 18.04 and later.

The DGX Station A100 comes equipped with four high performance NVIDIA A100 GPUs and one DGX Display GPU. The NVIDIA A100 GPU is used to run high performance and AI workloads, and the DGX Display card is used to drive a high-quality display on a monitor.

When running applications on this system, it is important to identify the best method to launch applications and workloads to make sure the high performance NVIDIA A100 GPUs are used. You can achieve this in one of the following ways:

When you log into the system and check which GPUs are available, you find the following:

$ nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
GPU 1: Graphics Device (UUID: GPU-0f2dff15-7c85-4320-da52-d3d54755d182)
GPU 2: Graphics Device (UUID: GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5)
GPU 3: DGX Display (UUID: GPU-91b9d8c8-e2b9-6264-99e0-b47351964c52)
GPU 4: Graphics Device (UUID: GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf)

A total of five GPUs are listed by nvidia-smi. This is because nvidia-smi is including the DGX Display GPU that is used to drive the monitor and high-quality graphics output.

When running an application or workload, the DGX Display GPU can get in the way because it does not have direct NVLink connectivity, sufficient memory, or the performance characteristics of the NVIDIA A100 GPUs that are installed on the system. As a result, you should ensure that the correct GPUs are being used.

Running with Docker Containers

On the DGX OS, because Docker has already been configured to identify the high performance NVIDIA A100 GPUs and assign the GPUs to the container, this method is the simplest.

A simple test is to run a small container with the --gpus all flag and run nvidia-smi inside the container. The output shows that only the high-performance GPUs are available to the container:

$ docker run --gpus all --rm -it ubuntu nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
GPU 1: Graphics Device (UUID: GPU-0f2dff15-7c85-4320-da52-d3d54755d182)
GPU 2: Graphics Device (UUID: GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5)
GPU 3: Graphics Device (UUID: GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf)

This step will also work when the --gpus n flag is used, where n can be 1, 2, 3, or 4. These values represent the number of GPUs that should be assigned to that container. For example:

$ docker run --gpus 2 --rm -it ubuntu nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
GPU 1: Graphics Device (UUID: GPU-0f2dff15-7c85-4320-da52-d3d54755d182)

In this example, Docker selected the first two GPUs to run the container, but if the device option is used, you can specify which GPUs to use:

$ docker run --gpus '"device=GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5,GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf"' --rm -it ubuntu nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5)
GPU 1: Graphics Device (UUID: GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf)

In this example, the two GPUs that were not used earlier are now assigned to run on the container.
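The nested quoting in the device option is easy to get wrong, because Docker needs to receive the inner double quotes so that the comma-separated list is treated as one value. The following is a minimal sketch that builds the argument from a list of UUIDs. The function name gpus_device_arg is hypothetical, and the UUIDs are the ones from the example above.

```shell
#!/bin/sh
# Sketch: build the value for docker run --gpus from a list of GPU UUIDs.
# gpus_device_arg is a hypothetical helper name.
gpus_device_arg() {
    list=""
    for uuid in "$@"; do
        # Join the UUIDs with commas, without a leading comma.
        list="${list:+$list,}$uuid"
    done
    # The literal double quotes are part of the value that Docker receives.
    printf '"device=%s"\n' "$list"
}

arg=$(gpus_device_arg \
    GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5 \
    GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf)

# Print the command instead of running it, so the sketch works without Docker.
echo docker run --gpus "'$arg'" --rm -it ubuntu nvidia-smi -L
```

To run the container directly from a script, pass the variable as a single quoted argument, for example docker run --gpus "$arg" --rm -it ubuntu nvidia-smi -L.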

Running on Bare Metal

To run applications by using the four high performance GPUs, the CUDA_VISIBLE_DEVICES variable must be specified before you run the application.

Note

This method does not use containers.

CUDA orders the GPUs by performance, so GPU 0 will be the highest performing GPU, and the last GPU will be the slowest GPU.

Warning

If the CUDA_DEVICE_ORDER variable is set to PCI_BUS_ID, this ordering will be overridden.

In the following example, a CUDA application that comes with CUDA samples is run. In the output, GPU 0 is the fastest in a DGX Station A100, and GPU 4 (DGX Display GPU) is the slowest:

$ sudo apt install cuda-samples-11-2

$ cd /usr/local/cuda-11.2/samples/1_Utilities/p2pBandwidthLatencyTest

$ sudo make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc -m64 --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o p2pBandwidthLatencyTest.o -c p2pBandwidthLatencyTest.cu
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o p2pBandwidthLatencyTest p2pBandwidthLatencyTest.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/x86_64/linux/release
cp p2pBandwidthLatencyTest ../../bin/x86_64/linux/release

$ cd /usr/local/cuda-11.2/samples/bin/x86_64/linux/release

$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Graphics Device, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, Graphics Device, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 2, Graphics Device, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device: 3, Graphics Device, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 4, DGX Display, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CANNOT Access Peer Device=4
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CANNOT Access Peer Device=4
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CANNOT Access Peer Device=4
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CANNOT Access Peer Device=4
Device=4 CANNOT Access Peer Device=0
Device=4 CANNOT Access Peer Device=1
Device=4 CANNOT Access Peer Device=2
Device=4 CANNOT Access Peer Device=3

Note

In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure. So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0     1     2     3     4
     0     1     1     1     1     0
     1     1     1     1     1     0
     2     1     1     1     1     0
     3     1     1     1     1     0
     4     0     0     0     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3        4
     0  1323.03    15.71    15.37    16.81    12.04
     1    16.38  1355.16    15.47    15.81    11.93
     2    16.25    15.85  1350.48    15.87    12.06
     3    16.14    15.71    16.80  1568.78    11.75
     4    12.61    12.47    12.68    12.55   140.26
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D        0        1        2        3        4
     0  1570.35    93.30    93.59    93.48    12.07
     1    93.26  1583.08    93.55    93.53    11.93
     2    93.44    93.58  1584.69    93.34    12.05
     3    93.51    93.55    93.39  1586.29    11.79
     4    12.68    12.54    12.75    12.51   140.26
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3        4
     0  1588.71    19.60    19.26    19.73    16.53
     1    19.59  1582.28    19.85    19.13    16.43
     2    19.53    19.39  1583.88    19.61    16.58
     3    19.51    19.11    19.58  1592.76    15.90
     4    16.36    16.31    16.39    15.80   139.42
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3        4
     0  1590.33   184.91   185.37   185.45    16.46
     1   185.04  1587.10   185.19   185.21    16.37
     2   185.15   185.54  1516.25   184.71    16.47
     3   185.55   185.32   184.86  1589.52    15.71
     4    16.26    16.28    16.16    15.69   139.43
P2P=Disabled Latency Matrix (us)
   GPU        0        1        2        3        4
     0     3.53    21.60    22.22    21.38    12.46
     1    21.61     2.62    21.55    21.65    12.34
     2    21.57    21.54     2.61    21.55    12.40
     3    21.57    21.54    21.58     2.51    13.00
     4    13.93    12.41    21.42    21.58     1.14

   CPU        0        1        2        3        4
     0     4.26    11.81    13.11    12.00    11.80
     1    11.98     4.11    11.85    12.19    11.89
     2    12.07    11.72     4.19    11.82    12.49
     3    12.14    11.51    11.85     4.13    12.04
     4    12.21    11.83    12.11    11.78     4.02
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU        0        1        2        3        4
     0     3.79     3.34     3.34     3.37    13.85
     1     2.53     2.62     2.54     2.52    12.36
     2     2.55     2.55     2.61     2.56    12.34
     3     2.58     2.51     2.51     2.53    14.39
     4    19.77    12.32    14.75    21.60     1.13

   CPU        0        1        2        3        4
     0     4.27     3.63     3.65     3.59    13.15
     1     3.62     4.22     3.61     3.62    11.96
     2     3.81     3.71     4.35     3.73    12.15
     3     3.64     3.61     3.61     4.22    12.06
     4    12.32    11.92    13.30    12.03     4.05

Note

The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The example above shows the peer-to-peer bandwidth and latency test across all five GPUs, including the DGX Display GPU. The application also shows that there is no peer-to-peer connectivity between any GPU and GPU 4. This indicates that GPU 4 should not be used for high-performance workloads.

Run the example one more time by using the CUDA_VISIBLE_DEVICES variable, which limits the number of GPUs that the application can see.

Note

All GPUs can communicate with all other peer devices.

$ CUDA_VISIBLE_DEVICES=0,1,2,3 ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Graphics Device, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, Graphics Device, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
Device: 2, Graphics Device, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device: 3, Graphics Device, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

Note

In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure. So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0     1     2     3
     0     1     1     1     1
     1     1     1     1     1
     2     1     1     1     1
     3     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3
     0  1324.15    15.54    15.62    15.47
     1    16.55  1353.99    15.52    16.23
     2    15.87    17.26  1408.93    15.91
     3    16.33    17.31    18.22  1564.06
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D        0        1        2        3
     0  1498.08    93.30    93.53    93.48
     1    93.32  1583.08    93.54    93.52
     2    93.55    93.60  1583.08    93.36
     3    93.49    93.55    93.28  1576.69
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3
     0  1583.08    19.92    20.47    19.97
     1    20.74  1586.29    20.06    20.22
     2    20.08    20.59  1590.33    20.01
     3    20.44    19.92    20.60  1589.52
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3
     0  1592.76   184.88   185.21   185.30
     1   184.99  1589.52   185.19   185.32
     2   185.28   185.30  1585.49   185.01
     3   185.45   185.39   184.84  1587.91
P2P=Disabled Latency Matrix (us)
   GPU        0        1        2        3
     0     2.38    21.56    21.61    21.56
     1    21.70     2.34    21.54    21.56
     2    21.55    21.56     2.41    21.06
     3    21.57    21.34    21.56     2.39

   CPU        0        1        2        3
     0     4.22    11.99    12.71    12.09
     1    11.86     4.09    12.00    11.71
     2    12.52    11.98     4.27    12.24
     3    12.22    11.75    12.19     4.25
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU        0        1        2        3
     0     2.32     2.57     2.55     2.59
     1     2.55     2.32     2.59     2.52
     2     2.59     2.56     2.41     2.59
     3     2.57     2.55     2.56     2.40

   CPU        0        1        2        3
     0     4.24     3.57     3.72     3.81
     1     3.68     4.26     3.75     3.63
     2     3.79     3.75     4.34     3.71
     3     3.72     3.64     3.66     4.32

Note

The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

For bare metal applications, the UUID can also be specified in the CUDA_VISIBLE_DEVICES variable as shown below:

$ CUDA_VISIBLE_DEVICES=GPU-0f2dff15-7c85-4320-da52-d3d54755d182,GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5 ./p2pBandwidthLatencyTest

The GPU specification is longer because of the nature of UUIDs, but this is the most precise way to pin specific GPUs to the application.
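Typing UUIDs by hand is tedious, so the list can be derived from the nvidia-smi -L output instead. The following is a minimal sketch that keeps every GPU except the one named DGX Display, as in the listing shown earlier. The function name compute_gpu_uuids is hypothetical, and the sample input is copied from that listing so that the sketch can be tried without GPUs.

```shell
#!/bin/sh
# Sketch: build a CUDA_VISIBLE_DEVICES value from nvidia-smi -L style output,
# skipping the DGX Display GPU. compute_gpu_uuids is a hypothetical helper name.
compute_gpu_uuids() {
    # Keep only top-level GPU lines, drop the display GPU, extract each UUID,
    # and join the UUIDs with commas.
    grep '^GPU ' | grep -v 'DGX Display' |
        sed 's/.*(UUID: \(GPU-[^)]*\)).*/\1/' | paste -s -d , -
}

# Sample input copied from the listing earlier in this section.
compute_gpu_uuids <<'EOF'
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
GPU 1: Graphics Device (UUID: GPU-0f2dff15-7c85-4320-da52-d3d54755d182)
GPU 2: Graphics Device (UUID: GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5)
GPU 3: DGX Display (UUID: GPU-91b9d8c8-e2b9-6264-99e0-b47351964c52)
GPU 4: Graphics Device (UUID: GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf)
EOF
```

On a DGX Station A100, the same function could be fed from the live command, for example CUDA_VISIBLE_DEVICES=$(nvidia-smi -L | compute_gpu_uuids) ./p2pBandwidthLatencyTest.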

Multi-Instance GPU (MIG) is available on NVIDIA A100 GPUs. If MIG is enabled on the GPUs, and if the GPUs have already been partitioned, then applications can be limited to run on these devices.

This works for both Docker containers and for bare metal by using the CUDA_VISIBLE_DEVICES variable, as shown in the examples below. For instructions on how to configure and use MIG, refer to the NVIDIA Multi-Instance GPU User Guide.

Identify the MIG instances that will be used. Here is the output from a system that has GPU 0 partitioned into seven MIG instances:

$ nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
  MIG 1g.10gb Device 0: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/7/0)
  MIG 1g.10gb Device 1: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/8/0)
  MIG 1g.10gb Device 2: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/9/0)
  MIG 1g.10gb Device 3: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/11/0)
  MIG 1g.10gb Device 4: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/12/0)
  MIG 1g.10gb Device 5: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/13/0)
  MIG 1g.10gb Device 6: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/14/0)
GPU 1: Graphics Device (UUID: GPU-0f2dff15-7c85-4320-da52-d3d54755d182)
GPU 2: Graphics Device (UUID: GPU-dc598de6-dd4d-2f43-549f-f7b4847865a5)
GPU 3: DGX Display (UUID: GPU-91b9d8c8-e2b9-6264-99e0-b47351964c52)
GPU 4: Graphics Device (UUID: GPU-e32263f2-ae07-f1db-37dc-17d1169b09bf)
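The MIG UUIDs in this output are long, so it can be convenient to extract them with a filter. The following is a minimal sketch; the function name mig_uuids is hypothetical, and the pattern assumes the UUID format shown in the output above.

```shell
#!/bin/sh
# Sketch: list the MIG instance UUIDs found in nvidia-smi -L style output.
# mig_uuids is a hypothetical helper name; the pattern assumes the
# "(UUID: MIG-...)" format shown in the listing above.
mig_uuids() {
    sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Sample input copied from the listing above (two MIG instances only).
mig_uuids <<'EOF'
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
  MIG 1g.10gb Device 0: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/7/0)
  MIG 1g.10gb Device 1: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/8/0)
GPU 1: Graphics Device (UUID: GPU-0f2dff15-7c85-4320-da52-d3d54755d182)
EOF
```

On a system with MIG enabled, nvidia-smi -L | mig_uuids would print one MIG UUID per line, ready to copy into a Docker device option or the CUDA_VISIBLE_DEVICES variable.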

In Docker, enter the MIG UUID from this output, in which GPU 0 and Device 0 have been selected.

If you are running on DGX Station A100, restart the nv-docker-gpus and docker system services any time MIG instances are created, destroyed or modified by running the following:

$ sudo systemctl restart nv-docker-gpus; sudo systemctl restart docker

nv-docker-gpus has to be restarted on DGX Station A100 because this service is used to mask the available GPUs that can be used by Docker. When the GPU architecture changes, the service needs to be refreshed.

$ docker run --gpus '"device=MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/7/0"' --rm -it ubuntu nvidia-smi -L
GPU 0: Graphics Device (UUID: GPU-269d95f8-328a-08a7-5985-ab09e6e2b751)
  MIG 1g.10gb Device 0: (UUID: MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/7/0)

On bare metal, specify the MIG instances:

Note

This application measures the communication across GPUs, so the bandwidth and latency values are not meaningful when only one MIG instance is visible.

The purpose of this example is to illustrate how to use specific GPUs with applications.

  1. Go to the following directory:

    $ cd /usr/local/cuda-11.2/samples/bin/x86_64/linux/release

  2. Run the p2pBandwidthLatencyTest application:

    $ CUDA_VISIBLE_DEVICES=MIG-GPU-269d95f8-328a-08a7-5985-ab09e6e2b751/7/0 ./p2pBandwidthLatencyTest
    [P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
    Device: 0, Graphics Device MIG 1g.10gb, pciBusID: 1, pciDeviceID: 0, pciDomainID:0

Note

In case a device doesn’t have P2P access to other one, it falls back to normal memcopy procedure. So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D     0
     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0
     0   176.20
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D        0
     0   187.87
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0
     0   190.77
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D        0
     0   190.53
P2P=Disabled Latency Matrix (us)
   GPU        0
     0     3.57

   CPU        0
     0     4.07
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU        0
     0     3.55

   CPU        0
     0     4.07

Note

The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

By default, the DGX system includes several drives in a RAID 0 configuration. These drives are intended for application caching, so you

Using Data Storage for NFS Caching

This section provides information about how you can use data storage for NFS caching.

The DGX systems use cachefilesd to manage NFS caching.

  • Ensure that you have an NFS server with one or more exports with data that will be accessed by the DGX system.

  • Ensure that there is network access between the DGX system and the NFS server.

Using cachefilesd

Here are the steps that describe how you can mount the NFS on the DGX system, and how you can cache the NFS by using the DGX SSDs for improved performance.

  1. Configure an NFS mount for the DGX system.

    1. Edit the filesystem tables configuration.

      $ sudo vi /etc/fstab

    2. Add a new line for the NFS mount by using the local /mnt mount point.

      <nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0

      Here, /mnt is used as an example mount point.

      • Contact your network administrator for the correct values for <nfs_server> and <export_path>.

      • The nfs arguments presented here are a list of recommended values based on typical use cases. However, fsc must always be included because that argument specifies using FS-Cache.

    3. Save the changes.

  2. Verify that the NFS server is reachable.

    $ ping <nfs_server>

    Use the server IP address or the server name that was provided by your network administrator.

  3. Mount the NFS export.

    $ sudo mount /mnt

    /mnt is an example mount point.

  4. Verify that caching is enabled.

    $ cat /proc/fs/nfsfs/volumes

  5. In the output, find FSC=yes.

    The NFS will be automatically mounted and cached on the DGX system in subsequent reboot cycles.
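Because caching depends on the fsc mount option, it can be worth checking the fstab entry before you mount. The following is a minimal sketch that looks up a mount point in an fstab-style file and tests for one option. The function name mount_has_option is hypothetical, and the sample entry uses made-up server and export names.

```shell
#!/bin/sh
# Sketch: check whether an fstab entry for a mount point includes an option.
# mount_has_option is a hypothetical helper name.
# Arguments: fstab-style file, mount point, option name.
mount_has_option() {
    awk -v mp="$2" -v opt="$3" '
        $1 !~ /^#/ && $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Demonstration on a temporary file; nfs.example.com:/export is a made-up export.
sample=$(mktemp)
echo 'nfs.example.com:/export /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0' > "$sample"
if mount_has_option "$sample" /mnt fsc; then
    echo "fsc is set for /mnt"
else
    echo "fsc is missing for /mnt"
fi
rm -f "$sample"
```

On a DGX system, the equivalent check would be mount_has_option /etc/fstab /mnt fsc.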

Disabling cachefilesd

Here is some information about how to disable cachefilesd.

If you do not want to use cachefilesd, disable it as follows:

  1. Stop the cachefilesd service:

    $ sudo systemctl stop cachefilesd

  2. Disable the cachefilesd service permanently:

    $ sudo systemctl disable cachefilesd

Changing the RAID Configuration for Data Drives

Here is information that describes how to change the RAID configuration for your data drives.

Warning

You must have a minimum of two drives to complete these tasks. If you have fewer than two drives, you cannot complete the tasks.

From the factory, the RAID level of the DGX RAID array is RAID 0. This level provides the maximum storage capacity, but it does not provide redundancy. If one SSD in the array fails, the data that is stored on the array is lost. If you are willing to accept reduced capacity in return for a level of protection against drive failure, you can change the level of the RAID array to RAID 5.

Note

If you change the RAID level from RAID 0 to RAID 5, the total storage capacity of the RAID array is reduced.

Before you change the RAID level of the DGX RAID array, back up the data on the array that you want to preserve. When you change the RAID level of the DGX RAID array, the data that is stored on the array is erased.
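To estimate how much capacity the change costs, you can use the standard RAID formulas: RAID 0 provides the sum of all drives, and RAID 5 gives up the equivalent of one drive to parity. The following is a minimal sketch. The function name raid_capacity_gb is hypothetical, and the drive count and size are made-up example values, not DGX specifications.

```shell
#!/bin/sh
# Sketch: estimate usable capacity for RAID 0 and RAID 5 arrays of equal drives.
# raid_capacity_gb is a hypothetical helper; the drive count and size are examples.
raid_capacity_gb() {
    level=$1
    drives=$2
    size_gb=$3
    case "$level" in
        raid0) echo $(( $drives * $size_gb )) ;;         # all drives hold data
        raid5) echo $(( ($drives - 1) * $size_gb )) ;;   # one drive's worth holds parity
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

echo "RAID 0: $(raid_capacity_gb raid0 4 3840) GB"
echo "RAID 5: $(raid_capacity_gb raid5 4 3840) GB"
```

With these example values, the RAID 5 array offers one drive less capacity than the RAID 0 array. Usable space on a real system is somewhat lower because of filesystem overhead.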

You can use the configure_raid_array.py custom script, which is installed on the system, to change the level of the RAID array without unmounting the RAID volume.

  • To change the RAID level to RAID 5, run the following command:

    $ sudo configure_raid_array.py -m raid5

    After you change the RAID level to RAID 5, the RAID array is rebuilt. Although a RAID array that is being rebuilt is online and ready to be used, a check on the health of the DGX system reports the status of the RAID volume as unhealthy. The time required to rebuild the RAID array depends on the workload on the system. For example, on an idle system, the rebuild might be completed in 30 minutes.

  • To change the RAID level to RAID 0, run the following command:

    $ sudo configure_raid_array.py -m raid0

To confirm that the RAID level was changed, run the lsblk command. The entry in the TYPE column for each drive in the RAID array indicates the RAID level of the array.

This section provides information about how to run NGC containers with your DGX system.

Obtaining an NGC Account

Here is some information about how you can obtain an NGC account.

NVIDIA NGC provides simple access to GPU-optimized software for deep learning, machine learning, and high-performance computing (HPC). An NGC account grants you access to these tools and gives you the ability to set up a private registry to manage your customized software.

If you are the organization administrator for your DGX system purchase, work with NVIDIA Enterprise Support to set up an NGC enterprise account. Refer to the NGC Private Registry User Guide for more information about getting an NGC enterprise account.

Running NGC Containers with GPU Support

To obtain the best performance when running NGC containers on DGX systems, you can use one of the following methods to provide GPU support for Docker containers:

  • Native GPU support (included in Docker 19.03 and later, which is installed on the system)

  • NVIDIA Container Runtime for Docker

This is in the nvidia-docker2 package.

The recommended method for DGX OS 5 is native GPU support. To run GPU-enabled containers, run docker run --gpus.

Here is an example that uses all GPUs:

$ docker run --gpus all …

Here is an example that uses 2 GPUs:

$ docker run --gpus 2 …

Here is an example that uses specific GPUs:

$ docker run --gpus '"device=1,2"' ...

$ docker run --gpus '"device=UUID-ABCDEF-

Refer to Running Containers for more information about running NGC containers on MIG devices.

© Copyright 2020-2023, NVIDIA. Last updated on Mar 24, 2023.