DGX-1 User Guide :: DGX Systems Documentation

Installing Docker and the Docker Engine Utility for NVIDIA GPUs on DGX OS Server Software 2.x or Earlier

To enable portability in Docker images that leverage GPUs, NVIDIA developed the Docker Engine Utility for NVIDIA GPUs (nvidia-docker), an open-source project that provides a command line tool to mount the user mode components of the NVIDIA driver and the GPUs into the Docker container at launch.

Starting with DGX OS Server software version 3.1.1, Docker and the nvidia-docker utility are part of the base software installation, so you do not need to perform the steps in this section. If your DGX-1 runs software version 2.x or earlier, follow these instructions to install Docker and nvidia-docker on the system.

To determine the DGX OS Server software version on your system, enter the following command.

$ grep VERSION /etc/dgx-release
DGX_SWBUILD_VERSION="2.0.4"
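
The version check above can be scripted. The following is a minimal sketch, assuming the `DGX_SWBUILD_VERSION="x.y.z"` format shown in the sample output; the function names `dgx_sw_version` and `needs_manual_docker_install` are hypothetical and not part of DGX OS.

```shell
# Hypothetical helpers; not shipped with DGX OS.

# Print the value of DGX_SWBUILD_VERSION from a release file
# (defaults to /etc/dgx-release when no file is given).
dgx_sw_version() {
    sed -n 's/^DGX_SWBUILD_VERSION="\([^"]*\)".*/\1/p' "${1:-/etc/dgx-release}"
}

# Succeed (exit status 0) when the major version is 2 or lower,
# that is, when Docker and nvidia-docker must be installed manually.
needs_manual_docker_install() {
    major=${1%%.*}
    [ "$major" -le 2 ] 2>/dev/null
}
```

For example, `needs_manual_docker_install "$(dgx_sw_version)" && echo "Install Docker manually"` prints the message only on systems running 2.x or earlier.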

Ensure your environment meets the prerequisites before installing Docker. For more information, see Getting Started with Docker.

Install Docker.

$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 \
    --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
$ echo deb https://apt.dockerproject.org/repo ubuntu-trusty main \
    | sudo tee /etc/apt/sources.list.d/docker.list
$ sudo apt-get update
$ sudo apt-get -y install docker-engine=1.12.6-0~ubuntu-trusty

Edit the /etc/default/docker file to use the Overlay2 storage driver.
1. Open the /etc/default/docker file for editing.
```
 $ sudo vi /etc/default/docker
```
2. Add the following line:
```
DOCKER_OPTS="--storage-driver=overlay2"
```
  If there is already a DOCKER_OPTS line, then add the parameters (text between the quote marks) to the DOCKER_OPTS environment variable.
3. Save and close the /etc/default/docker file when done.
4. Restart Docker with the new configuration.
```
$ sudo service docker restart
```
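
The rule for merging parameters into an existing DOCKER_OPTS line can be sketched as a small shell function. This is an illustration only: `add_docker_opt` is a hypothetical helper, and it assumes the file holds at most a single-line `DOCKER_OPTS="..."` entry with double quotes.

```shell
# Hypothetical helper: add one option to DOCKER_OPTS in a defaults file.
# Usage: add_docker_opt FILE OPTION
add_docker_opt() {
    file=$1
    opt=$2
    # Nothing to do when the option is already on the DOCKER_OPTS line.
    if grep '^DOCKER_OPTS="' "$file" 2>/dev/null | grep -qF -- "$opt"; then
        return 0
    fi
    if grep -q '^DOCKER_OPTS="' "$file" 2>/dev/null; then
        # Append inside the existing quotes. Using '|' as the sed delimiter
        # keeps values that contain '/' (such as 192.168.0.1/24) intact.
        sed "s|^DOCKER_OPTS=\"\(.*\)\"|DOCKER_OPTS=\"\1 ${opt}\"|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        # No DOCKER_OPTS line yet: add a new one at the end of the file.
        printf 'DOCKER_OPTS="%s"\n' "$opt" >> "$file"
    fi
}
```

Because the function writes to the file it is given, calling `add_docker_opt /etc/default/docker "--storage-driver=overlay2"` would need a root shell, followed by the Docker restart from step 4.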

Install the NVIDIA Docker Engine Utility for NVIDIA GPUs. The following example installs both nvidia-docker and the nvidia-docker-plugin.

$ wget -P /tmp \
    https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb

Configuring Docker IP Addresses

To ensure that the DGX-1 can access the network interfaces for Docker containers, Docker should be configured to use a subnet distinct from other network resources used by the DGX-1.

By default, Docker uses the 172.17.0.0/16 subnet. Consult your network administrator to find out which IP addresses are used by your network. If your network does not conflict with the default Docker IP address range, then no changes are needed and you can skip this section.

However, if your network uses the addresses within this range for the DGX-1, you should change the default Docker network addresses. The method for accomplishing this depends on the Base OS software version installed on the DGX-1.
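
To illustrate what "within this range" means, the sketch below tests whether an IPv4 address falls inside a subnet such as the Docker default 172.17.0.0/16. The function names `ip_to_int` and `in_subnet` are hypothetical, and the sketch assumes dotted-decimal addresses without leading zeros and a prefix length between 1 and 32.

```shell
# Hypothetical helpers for checking an address against a subnet.

# Convert a dotted-decimal IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Usage: in_subnet IP NETWORK PREFIX_LEN
# Succeeds (exit status 0) when IP lies inside NETWORK/PREFIX_LEN.
in_subnet() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

For example, `in_subnet 172.17.42.9 172.17.0.0 16 && echo "conflicts with the Docker default"` reports a conflict, while an address such as 172.18.0.1 does not match.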

If you don't know the Base OS software version installed on the DGX-1, then enter the following and inspect the VERSION entry.
```
$ grep VERSION /etc/dgx-release
DGX_SWBUILD_VERSION="3.1.1"
```
Follow the instructions in the section appropriate for the software version installed.
- Configuring Docker IP Addresses for DGX OS Server Software Version 2.x and Earlier
- Configuring Docker IP Addresses for DGX OS Server Software Version 3.1.1 and Later

Configuring Docker IP Addresses for DGX OS Server Software Version 2.x and Earlier

Open the /etc/default/docker file for editing.
```
 $ sudo vi /etc/default/docker
```
Modify the /etc/default/docker file, specifying the correct bridge IP address and IP address ranges for your network. Consult your IT administrator for the correct addresses.
For example, if your DNS server exists at IP address 10.10.254.254, and the 192.168.0.0/24 subnet is not otherwise needed by the DGX-1, you can add the following line to the /etc/default/docker file:
```
DOCKER_OPTS="--dns 10.10.254.254 --bip=192.168.0.1/24 --fixed-cidr=192.168.0.0/24"
```
If there is already a DOCKER_OPTS line, then add the parameters (text between the quote marks) to the DOCKER_OPTS environment variable.
Save and close the /etc/default/docker file when done.
Restart Docker with the new configuration.
```
$ sudo service docker restart
```

Configuring Docker IP Addresses for DGX OS Server Software Version 3.1.1 and Later

You can change the default Docker network addresses by modifying either the /etc/docker/daemon.json file or the /etc/systemd/system/docker.service.d/docker-override.conf file. These instructions provide an example of modifying the /etc/systemd/system/docker.service.d/docker-override.conf file to override the default Docker network addresses.
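
For the daemon.json alternative, which this guide does not walk through, the following sketch prints a configuration that sets the same two values through the `bip` and `fixed-cidr` keys of the Docker daemon configuration file. The helper name `print_daemon_json` is hypothetical. The sketch assumes that daemon.json has no other content; on a real system these keys would be merged into the existing file, and an option should not be set both in daemon.json and on the dockerd command line.

```shell
# Hypothetical helper: print a daemon.json that sets the bridge address.
# Usage: print_daemon_json BIP FIXED_CIDR
print_daemon_json() {
    cat <<EOF
{
    "bip": "$1",
    "fixed-cidr": "$2"
}
EOF
}
```

For example, `print_daemon_json 192.168.127.1/24 192.168.127.128/25` prints the JSON for the addresses used in the example that follows. Docker must be restarted for a daemon.json change to take effect.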

Open the docker-override.conf file for editing.

 $ sudo vi /etc/systemd/system/docker.service.d/docker-override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -s overlay2
LimitMEMLOCK=infinity
LimitSTACK=67108864

Append the --bip and --fixed-cidr options to the second ExecStart line as shown below, setting the correct bridge IP address and IP address ranges for your network. Consult your IT administrator for the correct addresses.
```
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --bip=192.168.127.1/24 \
      --fixed-cidr=192.168.127.128/25

LimitMEMLOCK=infinity
LimitSTACK=67108864 
```
Save and close the /etc/systemd/system/docker.service.d/docker-override.conf file when done.
Reload the systemctl daemon.
```
$ sudo systemctl daemon-reload
```
Restart Docker.
```
$ sudo systemctl restart docker
```

Letting Users Issue Docker Commands

To guard against privilege escalation through the Docker daemon, the NVIDIA Docker software requires sudo privileges to run containers.

You can grant the required privileges to users who will run containers on the DGX-1 in one of the following ways:

- Add each user as an administrator user with sudo privileges.
- Add each user as a standard user without sudo privileges and then add the user to the docker group.

This section provides instructions for adding users to the docker group.

WARNING: Only add users to the docker group if you would trust them with root privileges. These instructions make it more convenient for users to access Docker containers; however, membership in the docker group is equivalent to root access, because a user who can send commands to the Docker engine can escalate privileges and run root-level operations. This may violate your organization's security policies. See the Docker Daemon Attack Surface for information on how this can impact security in your system. Always consult your IT department to make sure the installation is in accordance with the security policies of your data center.

Note: The commands in this section require sudo access, and should be performed by a system administrator.

Checking if a User is in the Docker Group

To check whether a user is already part of the docker group, enter the following:

$ groups username

The output shows all the groups of which that user is a member. If docker is not listed, then add that user.

Creating a User

To create a new user in order to add them to the docker group, perform the following:

$ sudo adduser username

Follow the prompts for creating a password and other user configuration settings.

Adding a User to the Docker Group

For each user you want to add to the docker group, enter the following command:

$ sudo usermod -a -G docker username
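
The check and the add step can be combined so that usermod runs only when needed. This is a minimal sketch; `in_group_list` and `ensure_docker_group` are hypothetical helper names, and the parsing assumes the usual `groups` output format of `username : group1 group2`.

```shell
# Hypothetical helpers; not part of DGX OS or Docker.

# Usage: in_group_list GROUP "GROUP LIST"
# Succeeds when GROUP appears in a list such as the output of
# `groups USER` ("user : g1 g2") or `id -nG USER` ("g1 g2").
in_group_list() {
    case " ${2#*: } " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage: ensure_docker_group USER
# Adds USER to the docker group only when not already a member.
ensure_docker_group() {
    if in_group_list docker "$(groups "$1")"; then
        echo "$1 is already in the docker group"
    else
        sudo usermod -a -G docker "$1"
    fi
}
```

A new group membership takes effect the next time the user logs in.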

Enabling GPU Support for NGC Containers

To obtain the best performance when running NGC containers, three methods of providing GPU support for Docker containers have been developed:

- Native GPU support (included with Docker-ce 19.03 or later)
- NVIDIA Container Runtime for Docker (nvidia-docker2 package)
- Docker Engine Utility for NVIDIA GPUs (nvidia-docker package)

The method implemented in your system depends on the DGX OS version installed (for DGX systems), the specific NGC Cloud Image provided by a Cloud Service Provider, or the software that you have installed in preparation for running NGC containers on TITAN PCs, Quadro PCs, or vGPUs.

Refer to the following table to assist in determining which method is implemented in your system.

| DGX OS Release | Method Included |
| --- | --- |
| 4.2 | Native GPU support; NVIDIA Container Runtime for Docker |
| 4.1 | NVIDIA Container Runtime for Docker |
| 4.0 | NVIDIA Container Runtime for Docker |
| 3.1 (versions 3.1.8, 3.1.7, and 3.1.6) | Docker Engine Utility for NVIDIA GPUs, but can be updated to the NVIDIA Container Runtime for Docker |
| 3.1 (up to version 3.1.5) | Docker Engine Utility for NVIDIA GPUs |
| 2.1 | Docker Engine Utility for NVIDIA GPUs |
| 2.0 | Docker Engine Utility for NVIDIA GPUs |

Each method is invoked by using specific Docker commands, described as follows.

Using Native GPU support

Note: If Docker is updated to 19.03 on a system which also has nvidia-docker2 installed, then instructions for using the NVIDIA Container Runtime for Docker can still be used.

Use docker run --gpus to run GPU-enabled containers.

Example using all GPUs
```
$ docker run --gpus all ...
```
Example using two GPUs
```
$ docker run --gpus 2 ...
```

Examples using specific GPUs

$ docker run --gpus '"device=1,2"' ...
$ docker run --gpus '"device=UUID-ABCDEF,1"' ...
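
Docker parses the --gpus value as comma-separated fields, so a device list that contains commas generally has to reach Docker with literal double quotes around it, for example `'"device=1,2"'` when typed in the shell. The following sketch builds such a value from a list of device indexes or UUIDs; `gpus_device_arg` is a hypothetical helper name.

```shell
# Hypothetical helper: build a quoted device list for `docker run --gpus`.
# Usage: gpus_device_arg ID [ID...]
gpus_device_arg() {
    old_ifs=$IFS
    IFS=,
    list="$*"            # join all arguments with commas
    IFS=$old_ifs
    # The double quotes are part of the value passed to Docker.
    printf '"device=%s"\n' "$list"
}
```

It could then be used as `docker run --gpus "$(gpus_device_arg 1 2)" ...`.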

Using the NVIDIA Container Runtime for Docker

With the NVIDIA Container Runtime for Docker installed (nvidia-docker2), you can run GPU-accelerated containers in one of the following ways.

Use docker run and specify --runtime=nvidia.
```
$ docker run --runtime=nvidia ...
```
Use nvidia-docker run.
```
$ nvidia-docker run ...
```
The new package provides backward compatibility, so you can still run GPU-accelerated containers by using this command, and the new runtime will be used.
Use docker run with nvidia as the default runtime.
You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.
```
"default-runtime": "nvidia",
```
The following is an example of how the added line appears in the JSON file. Do not remove any pre-existing content when making this change.
```
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
You can then use docker run to run GPU-accelerated containers.
```
$ docker run ...
```
CAUTION:

If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so may result in the container being optimized only for the GPU architecture on which it was built. Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process for guidance.
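
Once Docker has been restarted with the modified daemon.json, one way to confirm which runtime is the default is to look for the `Default Runtime:` line in the output of `docker info`. The sketch below extracts that value from text on standard input; `default_runtime_from_info` is a hypothetical helper name.

```shell
# Hypothetical helper: read `docker info` output on standard input and
# print the value of its "Default Runtime:" line.
default_runtime_from_info() {
    sed -n 's/^[[:space:]]*Default Runtime:[[:space:]]*//p'
}
```

Running `docker info 2>/dev/null | default_runtime_from_info` should print `nvidia` when the default runtime has been changed as described above.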

Using the Docker Engine Utility for NVIDIA GPUs

With the Docker Engine Utility for NVIDIA GPUs installed (nvidia-docker), run GPU-enabled containers as follows.

$ nvidia-docker run ...
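
Combining the release table with the commands from the three subsections, a script could select the launch command from the DGX OS release. This is a sketch only: `gpu_run_command` is a hypothetical helper, it covers just the releases listed in the table, and it assumes a 3.1.x system has not been updated to the NVIDIA Container Runtime for Docker.

```shell
# Hypothetical helper: map a DGX OS release to the command used to start
# GPU-enabled containers, following the release table in this section.
# Usage: gpu_run_command RELEASE
gpu_run_command() {
    case "$1" in
        4.2|4.2.*)
            echo "docker run --gpus all" ;;          # native GPU support
        4.0|4.0.*|4.1|4.1.*)
            echo "docker run --runtime=nvidia" ;;    # nvidia-docker2
        2.0|2.0.*|2.1|2.1.*|3.1|3.1.*)
            echo "nvidia-docker run" ;;              # nvidia-docker utility
        *)
            echo "unknown DGX OS release: $1" >&2
            return 1 ;;
    esac
}
```

On release 4.2, `docker run --runtime=nvidia` remains a valid alternative because that release includes both methods.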

Configuring a System Proxy

If you will be using the DGX-1 in base OS mode, and your network requires use of a proxy, then edit the file /etc/apt/apt.conf.d/proxy.conf and make sure the following lines are present, using the parameters that apply to your network:

Acquire::http::proxy "http://<username>:<password>@<host>:<port>/";
Acquire::ftp::proxy "ftp://<username>:<password>@<host>:<port>/";
Acquire::https::proxy "https://<username>:<password>@<host>:<port>/";

Docker relies on environment variables to reach the DGX-1 Container Registry through the proxy. For best practice recommendations on configuring proxy environment variables for Docker, see https://docs.docker.com/engine/admin/systemd/#http-proxy.
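
On systemd-based releases, the approach described in the linked Docker documentation is a drop-in file for the Docker service that sets the proxy variables. The sketch below prints such a drop-in. The helper name `print_docker_proxy_dropin` is hypothetical, the same proxy URL is assumed for HTTP and HTTPS traffic, and the values in the usage example are placeholders.

```shell
# Hypothetical helper: print a systemd drop-in that sets proxy variables
# for the Docker service.
# Usage: print_docker_proxy_dropin PROXY_URL [NO_PROXY_LIST]
print_docker_proxy_dropin() {
    printf '[Service]\n'
    printf 'Environment="HTTP_PROXY=%s"\n' "$1"
    printf 'Environment="HTTPS_PROXY=%s"\n' "$1"
    # NO_PROXY is optional: hosts that should bypass the proxy.
    if [ -n "${2:-}" ]; then
        printf 'Environment="NO_PROXY=%s"\n' "$2"
    fi
}
```

The output of `print_docker_proxy_dropin "http://<host>:<port>/" "localhost,127.0.0.1"` could be saved as a file such as /etc/systemd/system/docker.service.d/http-proxy.conf, followed by `sudo systemctl daemon-reload` and `sudo systemctl restart docker`.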

Configuring NFS Mount and Cache

The DGX-1 includes four SSDs in a RAID 0 configuration. These SSDs are intended for application caching, so you must set up your own NFS drives for long term data storage.

Disabling cachefilesd

The DGX-1 system uses cachefilesd to manage caching of the NFS. If you do not want cachefilesd enabled, you can disable it as follows.

sudo systemctl stop cachefilesd
sudo systemctl disable cachefilesd

Using cachefilesd

The following instructions describe how to mount the NFS onto the DGX-1, and how to cache the NFS using the DGX-1 SSDs for improved performance.

Make sure that you have an NFS server with one or more exports with data to be accessed by the DGX-1, and that there is network access between the DGX-1 and the NFS server.

Configure an NFS mount for the DGX-1.
1. Edit the filesystem tables configuration.
```
sudo vi /etc/fstab
```
2. Add a new line for the NFS mount, using the local mount point of /mnt.
```
<nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
```
  - /mnt is used here as an example mount point.
  - Consult your Network Administrator for the correct values for <nfs_server> and <export_path>.
  - The nfs arguments presented here are a list of recommended values based on typical use cases. However, "fsc" must always be included as that argument specifies use of FS-Cache.
3. Save the changes.
Verify the NFS server is reachable.
```
ping <nfs_server>
```
Use the server IP address or the server name provided by your network administrator.
Mount the NFS export.
```
sudo mount /mnt
```
/mnt is the example mount point used in step 1.
Verify caching is enabled.
```
cat /proc/fs/nfsfs/volumes
```
Look for the text FSC=yes in the output. Upon rebooting, the NFS should be mounted and cached on the DGX-1.
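
Because the fsc option must always be present for caching to work, it can help to check the filesystem table before mounting. The following sketch lists the mount points of NFS entries that lack fsc; `nfs_mounts_without_fsc` is a hypothetical helper name, and the sketch assumes the standard whitespace-separated fstab layout (device, mount point, type, options).

```shell
# Hypothetical helper: print the mount point of every NFS entry in an
# fstab-format file whose options do not include "fsc".
# Usage: nfs_mounts_without_fsc FILE
nfs_mounts_without_fsc() {
    awk '
        /^[[:space:]]*#/ { next }            # skip comment lines
        NF >= 4 && $3 ~ /^nfs/ {             # nfs and nfs4 entries
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "fsc") found = 1
            if (!found) print $2
        }
    ' "$1"
}
```

Running `nfs_mounts_without_fsc /etc/fstab` prints nothing when every NFS entry includes fsc.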