Preparing for Using Docker Containers

This chapter presents an overview of the prerequisites for accessing NVIDIA Docker containers from the Docker command line for use on the NVIDIA® DGX-1™ in base OS mode. These containers include NVIDIA DGX-1 specific software to ensure the best performance for your applications. Using these containers as a basis for your applications should provide the best single-GPU performance and multi-GPU scaling.

Installing Docker and NVIDIA Docker on DGX OS Server Software 2.x or Earlier

To enable portability in Docker images that leverage GPUs, NVIDIA® developed nvidia-docker, an open-source project that provides a command line tool to mount the user mode components of the NVIDIA driver and the GPUs into the Docker container at launch.

In DGX OS Server software version 3.1.1 and later, Docker and nvidia-docker are part of the base software installation, so you do not need to perform the steps in this section. If your DGX-1 is installed with software version 2.x or earlier, follow these instructions to install Docker and nvidia-docker on the system.

To determine the DGX OS Server software version on your system, enter the following command.
$ grep VERSION /etc/dgx-release
DGX_SWBUILD_VERSION="3.1.1"
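The version check above can also be scripted. The following is a minimal sketch, not an NVIDIA-provided tool: the function names dgx_version and needs_manual_docker_install are hypothetical, and the sketch assumes that any release whose major version number is 2 or lower counts as "2.x or earlier".

```shell
#!/bin/sh
# Print the DGX_SWBUILD_VERSION value from a release file.
# Defaults to /etc/dgx-release, the file queried above.
dgx_version() {
    release_file="${1:-/etc/dgx-release}"
    grep '^DGX_SWBUILD_VERSION=' "$release_file" | cut -d= -f2 | tr -d '"'
}

# Succeed (exit status 0) when the major version is 2 or lower,
# which this sketch ASSUMES means the manual installation applies.
needs_manual_docker_install() {
    version="$1"
    major="${version%%.*}"
    [ "$major" -le 2 ]
}
```

For example, `needs_manual_docker_install "$(dgx_version)" && echo "Install Docker manually"` prints the message only on a 2.x or earlier system.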
Ensure your environment meets the prerequisites before installing Docker. For more information, see Getting Started with Docker.
  1. Install Docker.
    $ sudo apt-key adv --keyserver \
        hkp://p80.pool.sks-keyservers.net:80 --recv-keys \
        58118E89F3A912897C070ADBF76221572C52609D
    $ echo deb https://apt.dockerproject.org/repo ubuntu-trusty main \
        | sudo tee /etc/apt/sources.list.d/docker.list
    $ sudo apt-get update
    $ sudo apt-get -y install docker-engine=1.12.6-0~ubuntu-trusty
  2. Edit the /etc/default/docker file to use the Overlay2 storage driver.
    1. Open the /etc/default/docker file for editing.
       $ sudo vi /etc/default/docker
    2. Add the following line:
      DOCKER_OPTS="--storage-driver=overlay2"
      If there is already a DOCKER_OPTS line, then add the parameters (text between the quote marks) to the DOCKER_OPTS environment variable.
    3. Save and close the /etc/default/docker file when done.
    4. Restart Docker with the new configuration.
      $ sudo service docker restart
  3. Install NVIDIA Docker. The following example installs both nvidia-docker and the nvidia-docker-plugin.
    $ wget -P /tmp \
        https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
    $ sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb

Configuring Docker IP Addresses

To ensure that the DGX-1 can access the network interfaces for nvidia-docker containers, the nvidia-docker containers should be configured to use a subnet distinct from other network resources used by the DGX-1.

By default, Docker uses the 172.17.0.0/16 subnet. Consult your network administrator to find out which IP addresses are used by your network. If your network does not conflict with the default Docker IP address range, then no changes are needed and you can skip this section.

However, if your network uses addresses within this range for the DGX-1, you should change the default nvidia-docker network addresses. The method for accomplishing this depends on the Base OS software version installed on the DGX-1.

  1. If you don't know the Base OS software version installed on the DGX-1, then enter the following and inspect the DGX_SWBUILD_VERSION entry.
    $ cat /etc/dgx-release
    DGX_NAME="DGX Server"
    DGX_PRETTY_NAME="NVIDIA DGX Server"
    DGX_SWBUILD_DATE="2017-08-02"
    DGX_SWBUILD_VERSION="3.1.1"
    DGX_COMMIT_ID="0a0a8ec9e08836c5e99144dd19ae61690f2d9484"
    DGX_SERIAL_NUMBER=QTFCOU7080017
  2. Follow the instructions in the section appropriate for the software version installed.
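Before changing anything, it can help to confirm whether an address on your network actually falls inside Docker's default 172.17.0.0/16 range. The following is a minimal sketch in POSIX shell arithmetic. The function names ip_to_int and ip_in_cidr are hypothetical helpers, and the sketch assumes well-formed dotted-quad IPv4 input without leading zeros.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs="$IFS"; IFS=.
    set -- $1            # split the address into four octets
    IFS="$old_ifs"
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed when address $1 lies inside CIDR block $2,
# for example: ip_in_cidr 172.17.5.20 172.17.0.0/16
ip_in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits="${2#*/}"
    # Build the netmask from the prefix length.
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}
```

For example, `ip_in_cidr 172.17.5.20 172.17.0.0/16 && echo conflict` reports a conflict, while an address such as 10.10.254.254 does not match.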

Configuring Docker IP Addresses for DGX OS Server Software Version 2.x and Earlier

  1. Open the /etc/default/docker file for editing.
     $ sudo vi /etc/default/docker
  2. Modify the /etc/default/docker file, specifying the correct bridge IP address and IP address ranges for your network. Consult your IT administrator for the correct addresses.
    For example, if your DNS server exists at IP address 10.10.254.254, and the 192.168.0.0/24 subnet is not otherwise needed by the DGX-1, you can add the following line to the /etc/default/docker file:
    DOCKER_OPTS="--dns 10.10.254.254 --bip=192.168.0.1/24 --fixed-cidr=192.168.0.0/24"
    If there is already a DOCKER_OPTS line, then add the parameters (text between the quote marks) to the DOCKER_OPTS environment variable.
  3. Save and close the /etc/default/docker file when done.
  4. Restart Docker with the new configuration.
    $ sudo service docker restart
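The rule for an existing DOCKER_OPTS line (append the new parameters inside the quotes rather than adding a second line) can be sketched as a small helper. This is an illustration only: add_docker_opts is a hypothetical name, and the sketch assumes a single uncommented DOCKER_OPTS="..." line and options that contain no |, & or backslash characters.

```shell
#!/bin/sh
# Append options to DOCKER_OPTS in the given file.
# If an uncommented DOCKER_OPTS="..." line exists, the options are added
# inside its quotes; otherwise a new DOCKER_OPTS line is appended.
add_docker_opts() {
    file="$1"
    opts="$2"
    if grep -q '^DOCKER_OPTS="' "$file"; then
        # Insert the new options just before the closing quote.
        sed "s|^\(DOCKER_OPTS=\"[^\"]*\)\"|\1 $opts\"|" "$file" > "$file.new" &&
            cat "$file.new" > "$file" &&   # keep the original file's owner and mode
            rm -f "$file.new"
    else
        printf 'DOCKER_OPTS="%s"\n' "$opts" >> "$file"
    fi
}
```

Try it on a copy of /etc/default/docker first, for example `add_docker_opts ./docker.copy "--bip=192.168.0.1/24"`, and review the result before replacing the real file.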

Configuring Docker IP Addresses for DGX OS Server Software Version 3.1.1 and Later

You can change the default Docker network addresses by modifying either the /etc/docker/daemon.json file or the /etc/systemd/system/docker.service.d/docker-override.conf file. These instructions provide an example of modifying the /etc/systemd/system/docker.service.d/docker-override.conf file to override the default nvidia-docker network addresses.

  1. Open the docker-override.conf file for editing.
    $ sudo vi /etc/systemd/system/docker.service.d/docker-override.conf
    The file contains settings similar to the following:
    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --disable-legacy-registry=false
    LimitMEMLOCK=infinity
    LimitSTACK=67108864
  2. Add the --bip and --fixed-cidr options to the second ExecStart line, as shown below, setting the correct bridge IP address and IP address range for your network. Consult your IT administrator for the correct addresses.
    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --bip=192.168.127.1/24 --fixed-cidr=192.168.127.128/25 --disable-legacy-registry=false
    LimitMEMLOCK=infinity
    LimitSTACK=67108864
    Save and close the /etc/systemd/system/docker.service.d/docker-override.conf file when done.
  3. Reload the systemd manager configuration.
     $ sudo systemctl daemon-reload
  4. Restart Docker.
    $ sudo systemctl restart docker
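The other option mentioned at the start of this section, /etc/docker/daemon.json, is not shown in the steps above. The following is a rough sketch of what it could look like, using the daemon.json keys "bip" and "fixed-cidr", which correspond to the --bip and --fixed-cidr flags. The function name write_daemon_json is hypothetical, and the sketch overwrites the target file, so it assumes there are no existing settings that need to be preserved.

```shell
#!/bin/sh
# Write a minimal daemon.json to the given path.
# Arguments: target file, bridge IP with prefix, fixed CIDR range.
# NOTE: this overwrites the target; merge by hand if the file already exists.
write_daemon_json() {
    target="$1"
    bip="$2"
    fixed_cidr="$3"
    cat > "$target" <<EOF
{
  "bip": "$bip",
  "fixed-cidr": "$fixed_cidr"
}
EOF
}
```

For example, `write_daemon_json ./daemon.json 192.168.127.1/24 192.168.127.128/25` produces a file to review before copying it to /etc/docker/daemon.json. Use one method or the other: Docker does not start when the same option is set both in daemon.json and on the dockerd command line.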

Letting Users Issue Docker Commands

To protect against privilege escalation through the Docker daemon, the NVIDIA Docker software requires sudo privileges to run containers.

You can grant the required privileges to users who will run containers on the DGX-1 in one of the following ways:

  • Add each user as an administrator user with sudo privileges.

  • Add each user as a standard user without sudo privileges and then add the user to the docker group.

    This section provides instructions for adding users to the docker group.

WARNING: Only add users to the docker group whom you would trust with root privilege. These instructions make it more convenient for users to access Docker containers; however, the resulting docker group is equivalent to the root user, because once a user is able to send commands to the Docker engine, they are able to escalate privilege and run root level operations. This may violate your organization's security policies. See the Docker Daemon Attack Surface for information on how this can impact security in your system. Always consult your IT department to make sure the installation is in accordance with the security policies of your data center.
Note: The commands in this section require sudo access, and should be performed by a system administrator.

Checking if a User is in the Docker Group

To check whether a user is already part of the docker group, enter the following:

$ groups username 

The output shows all the groups of which that user is a member. If docker is not listed, then add that user.

Creating a User

To create a new user in order to add them to the docker group, perform the following:

  1. Add the user.
    $ sudo useradd username
  2. Set up the password.
    $ sudo passwd username
    Enter a password at the prompts:
    Enter new UNIX password:
    Retype new UNIX password:
    passwd: password updated successfully

Adding a User to the Docker Group

For each user you want to add to the docker group, enter the following command:

$ sudo usermod -a -G docker username
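The check-then-add sequence from this section can be combined into one helper so that usermod runs only when needed. This is a minimal sketch. The function names group_list_contains, user_in_group and ensure_docker_group are hypothetical, and the sketch assumes group names do not contain spaces.

```shell
#!/bin/sh
# Succeed when the space-separated group list $1 contains exactly $2.
group_list_contains() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Succeed when user $1 is a member of group $2 (uses `id -nG`).
user_in_group() {
    group_list_contains "$(id -nG "$1" 2>/dev/null)" "$2"
}

# Add user $1 to the docker group only if not already a member.
ensure_docker_group() {
    if user_in_group "$1" docker; then
        echo "$1 is already in the docker group"
    else
        sudo usermod -a -G docker "$1"
    fi
}
```

The exact match matters: a group named dockerusers does not count as docker. A newly added user must log out and log back in before the new group membership takes effect.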

Configuring a System Proxy

If you will be using the DGX-1 in base OS mode, and your network requires use of a proxy, then edit the file /etc/apt/apt.conf.d/proxy.conf and make sure the following lines are present, using the parameters that apply to your network:

Acquire::http::proxy "http://<username>:<password>@<host>:<port>/"; 
Acquire::ftp::proxy "ftp://<username>:<password>@<host>:<port>/"; 
Acquire::https::proxy "https://<username>:<password>@<host>:<port>/";

This is to ensure that Docker is able to access the DGX-1 Container Registry through the proxy. For best practice recommendations on configuring proxies for Docker, see https://docs.docker.com/engine/admin/systemd/#http-proxy.
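Because the same credentials, host and port repeat on all three lines, generating them from one set of parameters avoids copy errors. The following is a small sketch. The function name apt_proxy_conf is hypothetical, and the sketch assumes the values contain no characters that need URL escaping.

```shell
#!/bin/sh
# Print the three Acquire proxy lines for /etc/apt/apt.conf.d/proxy.conf.
# Arguments: username, password, proxy host, proxy port.
apt_proxy_conf() {
    user="$1"
    pass="$2"
    host="$3"
    port="$4"
    for scheme in http ftp https; do
        printf 'Acquire::%s::proxy "%s://%s:%s@%s:%s/";\n' \
            "$scheme" "$scheme" "$user" "$pass" "$host" "$port"
    done
}
```

For example, `apt_proxy_conf <username> <password> <host> <port> | sudo tee /etc/apt/apt.conf.d/proxy.conf` writes the file. Be aware that a password passed on the command line may be recorded in the shell history.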

Configuring NFS Mount and Cache

The DGX-1 includes four SSDs in a RAID 0 configuration. These SSDs are intended for application caching, so you must set up your own NFS storage for long-term data. The following instructions describe how to mount an NFS export on the DGX-1, and how to cache it using the DGX-1 SSDs for improved performance.

Make sure your DGX-1 is set up in Base OS mode, that you have an NFS server with one or more exports with data to be accessed by the DGX-1, and that there is network access between the DGX-1 and the NFS server.
Note: Skip this section if you are going to use the DGX-1 in cloud-managed mode. The DGX-1 Cloud Services software will set up the NFS cache for you as part of the cloud-managed mode configuration. Similarly, in cloud-managed mode, the person setting up the job will specify any NFS mount requirements for the job at that time.
  1. Check if the cache daemon is installed and configured.
    $ service cachefilesd status
    If the output indicates that cachefilesd is disabled, continue with the following steps. Otherwise, skip to step 7.
  2. Install the cache daemon.
    $ sudo apt-get install cachefilesd
  3. Edit the cache daemon startup file.
    $ sudo vi /etc/default/cachefilesd
    Uncomment the "RUN=yes" line in the startup file and then save the file.
  4. Configure the cache daemon for the DGX-1.
    1. Open the cache daemon configuration file.
      $ sudo vi /etc/cachefilesd.conf
    2. Edit the contents to match the following, then save the file.
      dir /raid
      tag dgx1cache
      brun 25%
      bcull 15%
      bstop 5%
      frun 10%
      fcull 7%
      fstop 3%
      These settings are optimized for Deep Learning workloads, and provide the best throughput for training from large datasets.
  5. Start the cache daemon.
    $ sudo service cachefilesd start
  6. Verify the cache daemon started properly.
    $ service cachefilesd status
    Expected output:
    Checking status of FilesCache daemon cachefilesd
  7. Configure an NFS mount for the DGX-1.
    1. Edit the filesystem tables configuration.
      $ sudo vi /etc/fstab
    2. Add a new line for the NFS mount, using the local mount point of /mnt.
      <nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
      • /mnt is used here as an example mount point.
      • Consult your Network Administrator for the correct values for <nfs_server> and <export_path>.
      • The nfs arguments presented here are a list of recommended values based on typical use cases. However, "fsc" must always be included as that argument specifies use of FS-Cache.
    3. Save the changes.
  8. Verify the NFS server is reachable.
    $ ping <nfs_server>
    Use the server IP address or the server name provided by your network administrator.
  9. Mount the NFS export.
    $ sudo mount /mnt
    /mnt is the example mount point used in step 7.
  10. Verify caching is enabled.
    $ cat /proc/fs/nfsfs/volumes
    Look for the text FSC=yes in the output. Upon rebooting, the NFS should be mounted and cached on the DGX-1.
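Since the "fsc" option is the one argument that must always be present, it can be worth checking /etc/fstab for NFS entries that lack it before mounting. The following is a minimal sketch using awk. The function name nfs_mounts_without_fsc is hypothetical, and the sketch assumes the standard whitespace-separated fstab layout (device, mount point, type, options).

```shell
#!/bin/sh
# Print the mount point of every NFS entry whose options do not include "fsc".
# Reads the given fstab-style file, or /etc/fstab by default.
nfs_mounts_without_fsc() {
    awk '$1 !~ /^#/ && $3 ~ /^nfs/ {
        n = split($4, opts, ",")
        found = 0
        for (i = 1; i <= n; i++) if (opts[i] == "fsc") found = 1
        if (!found) print $2
    }' "${1:-/etc/fstab}"
}
```

No output means every NFS entry in the file requests FS-Cache. Any mount point that is printed will be mounted without caching.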