Running A Container

Before you can run an NGC deep learning framework container, your Docker environment must support NVIDIA GPUs. To run a container, issue the appropriate command as explained in this chapter, specifying the registry, repository, and tags.

Enabling GPU Support for NGC Containers

To obtain the best performance when running NGC containers, NVIDIA has developed three methods of providing GPU support for Docker containers:
  • Native GPU support (included with Docker-ce 19.03 or later)
  • NVIDIA Container Runtime for Docker (nvidia-docker2 package)
  • Docker Engine Utility for NVIDIA GPUs (nvidia-docker package)
The method implemented in your system depends on the DGX OS version installed (for DGX systems), the specific NGC Cloud Image provided by a Cloud Service Provider, or the software that you have installed in preparation for running NGC containers on TITAN PCs, Quadro PCs, or vGPUs.

Refer to the following to determine which method is implemented in your system.

  • Native GPU support
    When used: Included with Docker-ce 19.03 or later.
    How to determine: Run docker version to check that the installed Docker version is 19.03 or later.
  • NVIDIA Container Runtime for Docker
    When used: When the nvidia-docker2 package is installed.
    How to determine: Run nvidia-docker version and check for NVIDIA Docker version 2.0 or later.
  • Docker Engine Utility for NVIDIA GPUs
    When used: When the nvidia-docker package is installed.
    How to determine: Run nvidia-docker version and check for NVIDIA Docker version 1.x.
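For example, to print just the installed Docker server version, you can use the version command's format template (plain docker version also works):

$ docker version --format '{{.Server.Version}}'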

Each method is invoked by using specific Docker commands, described as follows.

Using Native GPU Support

Note: If Docker is updated to 19.03 on a system that already has nvidia-docker or nvidia-docker2 installed, the corresponding methods can still be used.
  • To use the native support on a new installation of Docker, first enable the new GPU support in Docker.
    $ sudo apt-get install -y docker nvidia-container-toolkit 

    This step is not needed if you have updated Docker to 19.03 on a system with nvidia-docker2 installed. The native support will be enabled automatically.

  • Use docker run --gpus to run GPU-enabled containers.
    • Example using all GPUs
      $ docker run --gpus all ...
    • Example using two GPUs
      $ docker run --gpus 2 ...
    • Examples using specific GPUs
      $ docker run --gpus "device=1,2" ... 
      $ docker run --gpus "device=UUID-ABCDEF,1" ... 
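    Note: With some shells and Docker versions, a device list that contains a comma may need an extra level of quoting so that the comma is passed through to Docker rather than being interpreted as an option separator. If the quoted form above fails on your system, try:
      $ docker run --gpus '"device=1,2"' ...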

Using the NVIDIA Container Runtime for Docker

With the NVIDIA Container Runtime for Docker installed (nvidia-docker2), you can run GPU-accelerated containers in one of the following ways.
  • Use docker run and specify --runtime=nvidia.
    $ docker run --runtime=nvidia ...
  • Use nvidia-docker run.
    $ nvidia-docker run ...

    The new package provides backward compatibility, so you can still run GPU-accelerated containers by using this command, and the new runtime will be used.

  • Use docker run with nvidia as the default runtime.

    You can set nvidia as the default runtime, for example, by adding the following line to the /etc/docker/daemon.json configuration file as the first entry.

    "default-runtime": "nvidia",

    The following is an example of how the added line appears in the JSON file. Do not remove any pre-existing content when making this change.

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
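
    For the change to take effect, restart the Docker daemon. On systems managed by systemd (an assumption; use your distribution's service manager otherwise):

    $ sudo systemctl restart docker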

    You can then use docker run to run GPU-accelerated containers.

    $ docker run ...
    CAUTION:
    If you build Docker images while nvidia is set as the default runtime, make sure the build scripts executed by the Dockerfile specify the GPU architectures that the container will need. Failure to do so may result in the container being optimized only for the GPU architecture on which it was built. Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document. Consult the specific application build process for guidance.

Using the Docker Engine Utility for NVIDIA GPUs

With the Docker Engine Utility for NVIDIA GPUs installed (nvidia-docker), run GPU-enabled containers as follows.

$ nvidia-docker run ... 

Running NGC Containers

On a system with GPU support for NGC containers, the following occurs when running a container.

  • The Docker Engine loads the image into a container, which runs the software.
  • You define the runtime resources of the container by including additional flags and settings that are used with the command. These flags and settings are described in the following sections.
  • The GPUs are explicitly defined for the Docker container (defaults to all GPUs; with the nvidia-docker methods, the set of GPUs can be restricted using the NV_GPU environment variable).
Note: The base command docker run --gpus all assumes that your system has Docker-ce 19.03 or later installed. See the section Enabling GPU Support for NGC Containers for the command to use for earlier versions of Docker.
  1. As a user, run the container interactively.
    $ docker run --gpus all -it --rm -v local_dir:container_dir
              nvcr.io/nvidia/<repository>:<xx.xx>

    The following example runs the December 2016 release (16.12) of the NVCaffe container in interactive mode. The container is automatically removed when the user exits the container.

    $ docker run --gpus all --rm -ti nvcr.io/nvidia/caffe:16.12
    
    ===========
    == Caffe ==
    ===========
    
    NVIDIA Release 16.12 (build 6217)
    
    Container image Copyright (c) 2016, NVIDIA CORPORATION.  All rights reserved.
    Copyright (c) 2014, 2015, The Regents of the University of California (Regents)
    All rights reserved.
    
    Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
    NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
    root@df57eb8e0100:/workspace#
  2. From within the container, start the job that you want to run. The precise command to run depends on the deep learning framework in the container that you are running and the job that you want to run. For details see the /workspace/README.md file for the container.

    The following example runs the caffe time command on one GPU to measure the execution time of the deploy.prototxt model.

    # caffe time -model models/bvlc_alexnet/ -solver deploy.prototxt -gpu=0
  3. Optional: Run the December 2016 release (16.12) of the same NVCaffe container but in non-interactive mode.
    % docker run --gpus all --rm nvcr.io/nvidia/caffe:16.12 caffe time -model
          /workspace/models/bvlc_alexnet -solver /workspace/deploy.prototxt -gpu=0

Specifying A User

Unless otherwise specified, the user inside the container is the root user.

When running within the container, the root user can access files created on the host operating system or on network volumes. Some users find this unacceptable and will want to set the ID of the user in the container. For example, to set the user in the container to be the currently running user, issue the following:
% docker run --gpus all -ti --rm -u $(id -u):$(id -g) nvcr.io/nvidia/<repository>:<tag>
Typically, this results in warnings due to the fact that the specified user and group do not exist in the container. You might see a message similar to the following:
groups: cannot find name for group ID 1000
I have no name!@c177b61e5a93:/workspace$
The warning can usually be ignored.
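
A common variation combines the user flag with a mounted volume, so that files written to the volume are owned by your user on the host. The host path below is illustrative:
% docker run --gpus all -ti --rm -u $(id -u):$(id -g) -v $HOME/data:/data nvcr.io/nvidia/<repository>:<tag>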

Setting The Remove Flag

By default, Docker containers remain on the system after being run. Repeated pull or run operations use up more and more space on the local disk, even after exiting the container. Therefore, it is important to clean up containers after exiting them.
Note: Do not use the --rm flag if you have made changes to the container that you want to save, or if you want to access job logs after the run finishes.
To automatically remove a container when exiting, add the --rm flag to the run command.
% docker run --gpus all --rm nvcr.io/nvidia/<repository>:<tag>
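
If exited containers have already accumulated on the system, you can remove all of them at once with the prune command (it asks for confirmation before deleting anything):
% docker container prune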

Setting The Interactive Flag

By default, containers run in batch mode; that is, the container runs once and then exits without any user interaction. Containers can also be run in interactive mode, giving you a shell prompt inside the container.

To run in interactive mode, add the -ti flag to the run command.
% docker run --gpus all -ti --rm nvcr.io/nvidia/<repository>:<tag>

Setting The Volumes Flag

No data sets are included with the containers; therefore, if you want to use data sets, you need to mount volumes into the container from the host operating system. For more information, see Manage data in containers.

Typically, you would use either Docker volumes or host data volumes. The primary difference between host data volumes and Docker volumes is that Docker volumes are private to Docker and can only be shared amongst Docker containers. Docker volumes are not visible from the host operating system, and Docker manages the data storage. Host data volumes are any directory that is available from the host operating system. This can be your local disk or network volumes.

Example 1
Mount a directory /raid/imagedata on the host operating system as /images in the container.
% docker run --gpus all -ti --rm -v /raid/imagedata:/images
        nvcr.io/nvidia/<repository>:<tag>
Example 2
Mount a local Docker volume named data (it must be created if not already present) in the container as /imagedata.
% docker run --gpus all -ti --rm -v data:/imagedata nvcr.io/nvidia/<repository>:<tag>
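If the Docker volume does not exist yet, create it first:
% docker volume create data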

Setting The Mapping Ports Flag

Applications such as Deep Learning GPU Training System™ (DIGITS) open a port for communications. You can control whether that port is open only on the local system or is available to other computers on the network outside of the local system.

Using DIGITS as an example, in DIGITS 5.0 starting in container image 16.12, the DIGITS server is open on port 5000 by default. However, after the container is started, you may not easily know the IP address of that container. To reach the DIGITS server without knowing the container's IP address, you can choose one of the following ways:
  • Expose the port using the local system network stack (--net=host) where port 5000 of the container is made available as port 5000 of the local system.
or
  • Map the port (-p 8080:5000) where port 5000 of the container is made available as port 8080 of the local system.

In either case, users outside the local system have no visibility that DIGITS is running in a container. Without publishing the port, the port is still available from the host, but not from outside it.
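
For example, the following maps the DIGITS port as described above; the repository path and tag are placeholders to replace with an actual DIGITS image. The server is then reachable on port 8080 of the local system:
% docker run --gpus all -ti --rm -p 8080:5000 nvcr.io/nvidia/digits:<tag>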

Setting The Shared Memory Flag

Certain applications, such as PyTorch™ and the Microsoft® Cognitive Toolkit™, use shared memory buffers to communicate between processes. Shared memory can also be required by single-process applications, such as MXNet™ and TensorFlow™, which use the NVIDIA® Collective Communications Library™ (NCCL).

By default, Docker containers are allotted 64MB of shared memory. This can be insufficient, particularly when using all 8 GPUs. To increase the shared memory limit to a specified size, for example 1GB, include the --shm-size=1g flag in your docker run command.

Alternatively, you can specify the --ipc=host flag to re-use the host's shared memory space inside the container. Note that this latter approach has security implications: any data in shared memory buffers could be visible to other containers.
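
For example, to start a container with a 1GB shared memory limit:
% docker run --gpus all -ti --rm --shm-size=1g nvcr.io/nvidia/<repository>:<tag>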

Restricting Exposure Of GPUs

From inside the container, the scripts and software are written to take advantage of all available GPUs. To coordinate the usage of GPUs at a higher level, you can restrict the exposure of GPUs from the host to the container. For example, if you want only GPU 0 and GPU 1 to be seen in the container, you would issue the following:

Using native GPU support

$ docker run --gpus "device=0,1" ...

Using nvidia-docker2

$ NV_GPU=0,1 docker run --runtime=nvidia ...

Using nvidia-docker

$ NV_GPU=0,1 nvidia-docker run ...

In the nvidia-docker2 and nvidia-docker cases, NV_GPU is a temporary environment variable, set only for that command, that restricts which GPUs are used.

Specified GPUs are defined per container using the Docker device-mapping feature, which is currently based on Linux cgroups.
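
To verify which GPUs a restricted container actually sees, you can list them from inside the container; nvidia-smi is made available in GPU-enabled containers by the NVIDIA runtime, and the repository and tag below are placeholders:

$ docker run --gpus "device=0,1" --rm nvcr.io/nvidia/<repository>:<tag> nvidia-smi -L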

Container Lifetime

The state of an exited container is preserved indefinitely if you do not pass the --rm flag to the docker run command. You can list all of the saved exited containers and their size on the disk with the following command:
$ docker ps --all --size --filter Status=exited

The container size on the disk depends on the files created during the container execution, so exited containers may take up only a small amount of disk space.

You can permanently remove an exited container by issuing:
$ docker rm [CONTAINER ID]
By saving the state of containers after they have exited, you can still interact with them using the standard Docker commands. For example:
  • You can examine logs from a past execution by issuing the docker logs command.
    $ docker logs 9489d47a054e
  • You can extract files using the docker cp command.
    $ docker cp 9489d47a054e:/log.txt .
  • You can restart a stopped container using the docker restart command.
    $ docker restart <container name>
    For the NVCaffe™ container, issue this command:
    $ docker restart caffe
  • You can save your changes by creating a new image using the docker commit command. For more information, see Example 3: Customizing a Container using docker commit.
    Note: Use care when committing Docker container changes, as data files created during use of the container will be added to the resulting image. In particular, core dump files and logs can dramatically increase the size of the resulting image.
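    For example, a minimal sketch of the commit step, using the container ID from the examples above and an illustrative image name:
    $ docker commit 9489d47a054e my-registry/caffe:custom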