Abstract

This Docker And Container Best Practices Guide provides recommendations to help administrators and users work with Docker. This guide also highlights the best practices for using Docker with NVIDIA containers.

1. Docker Best Practices with NVIDIA Containers

The following sections highlight the best practices for using Docker with NVIDIA containers.
  1. Prerequisites. See Prerequisites.
  2. Log into Docker. See Logging into Docker.
  3. List the Docker images on the DGX-1, DGX Station, or the NVIDIA NGC Cloud Services. See Listing Docker Images.
  4. Pull a container. See Pulling a Container.
  5. Run the container. See Running a Container.
  6. Verify that the container is running properly. See Verifying.

1.1. Prerequisites

You can access NVIDIA’s GPU accelerated containers from all three products: the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. If you own a DGX-1 or DGX Station, then you should use the NVIDIA® DGX™ container registry at https://compute.nvidia.com. This is a web interface to the Docker registry, nvcr.io (NVIDIA DGX container registry). You can pull the containers from there, and you can also push containers there into your own account in the registry.

If you are accessing the NVIDIA containers from the NVIDIA® GPU Cloud™ (NGC) container registry via a cloud services provider such as Amazon Web Services (AWS), then you should use the NGC container registry at https://ngc.nvidia.com. This is also a web interface to the same Docker repository as for the DGX-1 and DGX Station. After you create an account, the commands to pull containers are the same as if you had a DGX-1 in your own data center. However, you currently cannot save any containers to the NGC container registry. Instead, you have to save the containers to your own Docker repository.
Note: The containers are exactly the same, whether you pull them from the NVIDIA DGX container registry or the NGC container registry.

For all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services, the location of the framework source is in /opt/<framework> in the container.

Before you can pull a container from the NGC container registry, you must have Docker and nvidia-docker installed as explained in the Preparing To Use NVIDIA Containers Getting Started Guide. You must also have access to, and be logged into, the NGC container registry as explained in the NGC Getting Started Guide.

1.1.1. Hello World For Containers

To make sure you have access to the NVIDIA containers, start with the proverbial “hello world” of Docker commands.

For the DGX-1 and DGX Station, just log into the system. For the NVIDIA NGC Cloud Services, consult the NGC Getting Started Guide for details about your specific cloud provider. In general, you will start a cloud instance with your cloud provider using the NVIDIA Volta Deep Learning Image. After the instance has booted, log into the instance.

Next, you can issue the docker --version command to list the version of Docker for all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. The output of this command tells you the version of Docker on the system (17.05-ce, build 89658be).
Figure 1. Listing of Docker version

At any time, if you are not sure about a Docker command, issue the $ docker --help command.
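
As a quick sanity check before going further, the following shell sketch reports whether the two commands this guide relies on, docker and nvidia-docker, are on the PATH. The helper names have_cmd and check_tools are illustrative only and are not part of Docker.

```shell
# Return success if the named command is found on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether each tool used in this guide is installed.
check_tools() {
  for tool in docker nvidia-docker; do
    if have_cmd "$tool"; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

check_tools
```

If either tool is reported as missing, install it as described in the guides referenced in the Prerequisites section before continuing.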

1.2. Logging Into Docker

If you have a DGX-1 or a DGX Station on premises, then the first time you log into the DGX-1 or DGX Station, you are required to set up access to the containers using https://compute.nvidia.com. This requires that the DGX-1 or DGX Station be connected to the Internet. For more information, see the DGX Container Registry User Guide.

In the case of NVIDIA NGC Cloud Services, where you are running nvidia-docker containers in the cloud, the first time you log in you are required to set up access to the NVIDIA NGC Cloud Services at https://ngc.nvidia.com. This requires that the cloud instance be connected to the Internet. For more information, see the Preparing To Use NVIDIA Containers Getting Started Guide and NGC Getting Started Guide.

1.3. Listing Docker Images

Typically, one of the first things you will want to do is get a list of all the Docker images that are currently available. When a Docker container is stored in a repository, it is said to be a container. When you pull the container from a repository to a system, such as the DGX-1, it is then said to be a Docker image. This means the image is local.

Issue the $ docker images command to list the images on the server. Your screen will look similar to the following:
Figure 2. Listing of Docker images
In this example, there are a few Docker containers that have been pulled down to this system. Each image is listed along with its tag and the corresponding Image ID. Two other columns list approximately when the container was created and the approximate size of the image in GB. These columns have been cropped to improve readability.
Note: The output from the command will vary. The above screen capture is just an example.

At any time, if you need help, issue the $ docker images --help command.
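
When only the image names matter, the table printed by docker images can be reduced to one repository:tag entry per line. The sketch below assumes the default column layout (one header line, then REPOSITORY and TAG as the first two columns); list_refs is a hypothetical helper name.

```shell
# Read `docker images` output on stdin and print one "repository:tag" per image.
# Assumes the default table layout: a header line, then REPOSITORY and TAG
# as the first two whitespace-separated columns.
list_refs() {
  awk 'NR > 1 { print $1 ":" $2 }'
}

# Usage (requires Docker):
#   docker images | list_refs
```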

1.4. Pulling A Container

A Docker container is composed of layers. The layers are combined to create the container. You can think of layers as intermediate images that add some capability to the overall container. If you make a change to a layer through a Dockerfile (see Building Containers), then Docker rebuilds that layer and all subsequent layers but not the layers that are not affected by the build. This reduces the time to create containers and also allows you to keep them modular.

Docker is also very good about keeping one copy of the layers on a system. This saves space and also greatly reduces the possibility of version skew so that layers that should be the same are not duplicated.

Pulling a container to the system makes the container an image. When the container is pulled to become an image, all of the layers are downloaded. Depending upon how many layers are in the container and how the system is connected to the Internet, it may take some time to download.
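
To inspect how a local image was assembled, the docker history command lists one entry per build step. The following sketch counts those entries; history_entries is an illustrative helper name, and note that some entries are metadata-only and add no data to the image.

```shell
# Print the number of history entries (build steps) recorded for a local image.
# `docker history -q` prints one line per entry.
history_entries() {
  docker history -q "$1" | grep -c .
}

# Usage (requires Docker and a previously pulled image):
#   history_entries nvcr.io/nvidia/tensorflow:17.06
```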

The $ docker pull nvcr.io/nvidia/tensorflow:17.06 command pulls the container from the NVIDIA repository to the local system where the command is run. At that point, it is a Docker image. The structure of the pull command is:
$ docker pull <repository>/<container>:<xx.xx>
Where:
  • <repository> is the path to where the container is stored (the Docker repo). In the following example, the repository is nvcr.io/nvidia (NVIDIA’s private repository).
  • <container> is the name of the container. In the following example we use tensorflow.
  • <xx.xx> is the specific version of the container. In the following example we use 17.06.
Below is a screen capture of a TensorFlow container being pulled using the following command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 3. Example of pulling TensorFlow 17.06
As you can tell, the container had already been pulled down on this particular system (some of the output from the command has been cut off). At this point the image is ready to be run.
Note: The 17.06 container is used here only as an example. The command is the same for other container versions; however, the exact output will differ.
In most cases, you will not find a container already downloaded to the system. Below is some sample output for the case when the container has to be pulled down from the registry, using the command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 4. Example of pulling TensorFlow 17.06 that had not already been loaded onto the server
Below is the output after the pull is finished, using the command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 5. Pulling of the container is complete
Note: The screen capture has been cropped in the interest of readability.
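
The pull command structure described in this section can be wrapped in a small shell helper. This is a sketch only; image_ref and pull_image are hypothetical names, and the pull itself requires that you are logged into the registry.

```shell
# Compose <repository>/<container>:<xx.xx> from its three parts.
image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Pull the image from the registry to the local system.
pull_image() {
  docker pull "$(image_ref "$1" "$2" "$3")"
}

# Usage (requires Docker and registry access):
#   pull_image nvcr.io/nvidia tensorflow 17.06
```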

1.5. Running A Container

After the nvidia-docker container is pulled down to the system, creating a Docker image, you can run or execute the image.
Important: Use the nvidia-docker command to ensure that the correct NVIDIA drivers and libraries are used. The next section discusses nvidia-docker.
A typical command to run the container is:
$ nvidia-docker run -it --rm -v local_dir:container_dir \
  nvcr.io/nvidia/<container>:<xx.xx>

Where:
  • -it means run interactively with a terminal (tty) attached
  • --rm means delete the container instance when finished
  • -v means mount a directory
  • local_dir is the directory or file from your host system (absolute path) that you want to access from inside your container. For example, the local_dir in the following path is /home/jsmith/data/mnist.
    -v /home/jsmith/data/mnist:/data/mnist 

    If you are inside the container and issue, for example, ls /data/mnist, you will see the same files as if you issued the ls /home/jsmith/data/mnist command from outside the container.

  • container_dir is the target directory when you are inside your container. For example, /data/mnist is the target directory in the example:
    -v /home/jsmith/data/mnist:/data/mnist
  • <container> is the name of the container.
  • <xx.xx> is the tag. For example, 17.06.
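
Putting these options together, a wrapper that mounts one host directory and starts an interactive container might look like the following sketch. run_with_data is a hypothetical helper; it rejects relative host paths because local_dir must be an absolute path.

```shell
# Start an interactive container with one host directory mounted.
#   $1 = local_dir (absolute path on the host)
#   $2 = container_dir (mount point inside the container)
#   $3 = image, for example nvcr.io/nvidia/tensorflow:17.06
run_with_data() {
  case "$1" in
    /*) ;;  # absolute path, accepted
    *)
      echo "local_dir must be an absolute path: $1" >&2
      return 1
      ;;
  esac
  nvidia-docker run -it --rm -v "$1:$2" "$3"
}

# Usage (requires nvidia-docker and a pulled image):
#   run_with_data /home/jsmith/data/mnist /data/mnist nvcr.io/nvidia/tensorflow:17.06
```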

1.6. Verifying

After a Docker image is running, you can verify by using the classic *nix option ps. For example, issue the $ docker ps -a command.
Figure 6. Verifying a Docker image is running
Without the -a option, only running instances are listed.
Important: It is best to include the -a option in case there are hung jobs running or other performance problems.
You can also stop a running container if you want. For example:
Figure 7. Stopping a container from running
Note: This screen capture has been cropped to improve readability.

Notice that you need the Container ID of the image you want to stop. This can be found using the $ docker ps -a command.
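
If you would rather not copy the Container ID by hand, docker ps can filter by the image a container was started from. The following sketch stops every running container created from a given image; stop_by_image is an illustrative name.

```shell
# Stop all running containers that were started from the given image.
stop_by_image() {
  ids=$(docker ps -q --filter "ancestor=$1")
  if [ -z "$ids" ]; then
    echo "no running containers for $1"
    return 0
  fi
  # $ids is intentionally unquoted so each ID becomes its own argument.
  docker stop $ids
}

# Usage (requires Docker):
#   stop_by_image nvcr.io/nvidia/tensorflow:17.06
```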

Another useful command or Docker option is to remove the image from the server. Removing or deleting the image saves space on the server. For example, issue the following command:
$ docker rmi nvcr.io/nvidia/tensorflow:17.06
Figure 8. Removing an image from the server
If you list the images, $ docker images, on the server, then you will see that the image is no longer there.
Figure 9. Confirming the image is removed from the server
Note: This screen capture has been cropped to improve readability.
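
The removal and the follow-up check can be combined. In this sketch, remove_image is a hypothetical helper that removes an image and then uses docker images -q, which prints only image IDs, to confirm that nothing is left under that name.

```shell
# Remove a local image and confirm that it is gone.
remove_image() {
  docker rmi "$1" || return 1
  if [ -z "$(docker images -q "$1")" ]; then
    echo "removed: $1"
  else
    echo "still present: $1" >&2
    return 1
  fi
}

# Usage (requires Docker):
#   remove_image nvcr.io/nvidia/tensorflow:17.06
```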

2. Docker Best Practices

You can run an nvidia-docker container on any platform that is Docker compatible, allowing you to move your application to wherever you need. The containers are platform-agnostic and, therefore, hardware-agnostic as well. However, to take full advantage of the performance of an NVIDIA GPU, specific kernel modules and user-level libraries are needed. This requirement introduces some complexity.

One approach to solving this complexity when using containers is to have the NVIDIA drivers installed in the container and have the character devices mapped corresponding to the NVIDIA GPUs such as /dev/nvidia0. For this to work, the drivers on the host (the system that is running the container), must match the version of the driver installed in the container. This approach drastically reduces the portability of the container.

2.1. nvidia-docker Containers Best Practices

To make things easier for Docker® containers that are built for GPUs, NVIDIA® has created nvidia-docker. It is an open-source project hosted on GitHub. It is basically a wrapper around the docker command that takes care of orchestrating the GPU containers that are needed for your container to run.
Important: It is highly recommended you use nvidia-docker when running a Docker container that uses GPUs.
Specifically, it provides two components for portable GPU-based containers.
  1. Driver-agnostic CUDA® images
  2. A Docker command-line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
nvidia-docker focuses solely on helping you run images that contain GPU-dependent applications. Otherwise, it passes the arguments to the regular Docker commands. A good introduction to nvidia-docker is here.
Important: Some things to always remember:
  • Use the nvidia-docker command when you are running and executing containers.
  • When building containers for NVIDIA GPUs, use the base containers in the repository. This will ensure the containers are compatible with nvidia-docker.
Let’s assume the TensorFlow 17.06 container has been pulled down to the system and is now an image that is ready to be run. The following command can be used to execute it.
$ nvidia-docker run --rm -ti nvcr.io/nvidia/tensorflow:17.06
Figure 10. Executing the run command
This takes you to a command prompt inside the container.
Remember: You are root inside the container.

The option --rm tells nvidia-docker to remove the container instance when the image is finished. If you make any changes to the image while it’s running, they will be lost.

The option -ti tells docker to run in interactive mode and associate a tty with the instance (basically, a shell).

Running the TensorFlow image didn’t really do anything; it just brought up a command line inside the image where you are root. Below is a better example where the CUDA container is pulled down and the image is executed along with a simple command. This view at least gives you some feedback.
Figure 11. Running an image to give you feedback
This Docker image actually executed a command, nvcc --version, which provides some output (for example, the version of the nvcc compiler). If you want to get a bash shell in the image, then you can run bash within the image.
Figure 12. Getting a bash shell in the image
Note: This screen capture has been cropped to improve readability.
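
The pattern shown in these two figures, running a single command in a container and then discarding the container, can be sketched as a small wrapper. run_once and the GPU_DOCKER variable are assumptions made for illustration; the image in the usage line is a placeholder because the exact CUDA container name and tag are not given here.

```shell
# Run one command in a fresh container, then remove the container (--rm).
#   $1 = image reference; remaining arguments = command to run inside it
# GPU_DOCKER is a hypothetical override; it defaults to nvidia-docker.
run_once() {
  image=$1
  shift
  "${GPU_DOCKER:-nvidia-docker}" run --rm "$image" "$@"
}

# Usage (requires nvidia-docker; <cuda-image> is a placeholder):
#   run_once <cuda-image> nvcc --version
```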

The frameworks that are part of the nvidia-docker repository, nvcr.io, have some specific options for achieving the best performance. This is true for all three systems, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. For more information, see Frameworks Best Practices.

In the section Using And Mounting File Systems, some options for mounting external file systems in the running image are explained.
Important: This allows you to keep data and code stored in one place on the system outside of the containers, while keeping the containers intact.
This allows the containers to stay generic so they don’t start proliferating when each user creates their own version of the container for their data and code.

2.2. docker exec

There are times when you will need to connect to a running container. You can use the docker exec command to connect to a running container to run commands. You can use the bash command to start an interactive command line terminal or bash shell. The format of the command is:
$ docker exec -it <CONTAINER_ID_OR_NAME> bash
As an example, suppose one starts a Deep Learning GPU Training System™ (DIGITS) container with the following command:
$ nvidia-docker run -d --name test-digits \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/digits:17.05
After the container is running, you can now connect to the container instance with the following command.
$ docker exec -it test-digits bash
Note: test-digits is the name of the container. If you don’t specifically name the container, you will have to use the container ID.
Important: Using docker exec one can execute a snippet of code, a script, or attach interactively to the container making the docker exec command very useful.

For detailed usage of the docker exec command, see docker exec.
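
Besides opening an interactive shell, docker exec can run a single command and return. The sketch below wraps both uses; exec_in and shell_in are hypothetical helper names, and the container must already be running.

```shell
# Run one command inside a running container and return its output.
exec_in() {
  name=$1
  shift
  docker exec "$name" "$@"
}

# Open an interactive bash shell inside a running container.
shell_in() {
  docker exec -it "$1" bash
}

# Usage (requires Docker and a running container named test-digits):
#   exec_in test-digits ls /
#   shell_in test-digits
```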

2.3. nvcr.io

Building deep learning frameworks can be quite a bit of work and can be very time consuming. Moreover, these frameworks are being updated weekly, if not daily. On top of this is the need to optimize and tune the frameworks for GPUs. NVIDIA has created a Docker repository, named nvcr.io, where deep learning frameworks are tuned, optimized, and containerized for your use.

NVIDIA creates an updated set of nvidia-docker containers for the frameworks monthly. Included in each container are the source (these are open-source frameworks), scripts for building the frameworks, Dockerfiles for creating containers based on these containers, markdown files that contain text about the specific container, and tools and scripts for pulling down data sets that can be used for testing or learning. Customers who purchase a DGX-1 or DGX Station have access to this repository for pushing containers (storing containers). When using NVIDIA NGC Cloud Services with a cloud provider, you currently cannot push or save a container to nvcr.io. Instead, you need to save them to a private Docker repository.

To get started with the DGX-1 or DGX Station, you need to create a system admin account for accessing nvcr.io. This account should be treated as an admin account so that users cannot access it. Once this account is created, the system admin can create accounts for projects that belong to the account. They can then give users access to these projects so that they can store or share any containers that they create.

When using the NVIDIA containers with a cloud provider, you are using the NGC container registry that is part of the NVIDIA NGC Cloud Services. It uses the exact same containers as those in nvcr.io.

2.4. Building Containers

You can build containers for the DGX systems, and you can even store them in the nvcr.io registry as a project within your account if you have a DGX-1 or DGX Station (that is, no one else can access the container unless you give them access). Currently, only the DGX-1 and DGX Station can store containers in nvcr.io. If you are running on NVIDIA NGC Cloud Services using a cloud provider, you can only pull containers from nvcr.io. You must save the containers to a private Docker repository (not nvcr.io).

This section of the document applies to Docker containers in general. You can use the general approach for your own Docker repository as well, but be cautious of the details.

Using a DGX-1 or DGX Station, you can either:
  1. Create your container from scratch
  2. Base your container on an existing Docker container
  3. Base your container on containers in nvcr.io.
Any one of the three approaches is valid and will work. However, since the goal is to run the containers on a system which has eight GPUs, it is reasonable to assume that your applications will use GPUs for computation. Moreover, the containers in nvcr.io are already tuned for the DGX systems and the GPU topology. All of them also include the needed GPU libraries, configuration files, and tools to rebuild the container.
Important: Based on these assumptions it is recommended that you start with a container from nvcr.io.

An existing container in nvcr.io should be used as a starting point. As an example, the TensorFlow 17.06 container will be used and Octave will be added to the container so that some post-processing of the results can be accomplished.

  1. Pull the container from the NGC container registry to the server. See Pulling A Container.
  2. On the server, create a subdirectory called mydocker.
    Note: This is an arbitrary directory name.
  3. Inside this directory, create a file called Dockerfile (capitalization is important). This is the default name that Docker looks for when creating a container. The Dockerfile should look similar to the following:
    Figure 13. Example of a Dockerfile
    There are three lines in the Dockerfile.
    • The first line in the Dockerfile tells Docker to start with the container nvcr.io/nvidia/tensorflow:17.06. This is the base container for the new container.
    • The second line in the Dockerfile performs a package update for the container. It doesn’t update any of the applications in the container but just updates the apt-get database. This is needed before we install new applications in the container.
    • The third and last line in the Dockerfile tells Docker to install the package octave into the container using apt-get.
    The Docker command to create the container is:
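    $ docker build -t nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave .
    Note: This command uses the default file Dockerfile for creating the container. The trailing period sets the build context to the current directory, where the Dockerfile is located.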
    In the following screen capture, the command starts with docker build. The -t option creates a tag for this new container. Notice that the tag specifies the project in the nvcr.io repository where the container is to be stored. As an example, the project nvidian_sas was used along with the repository nvcr.io. Projects can be created by your local administrator who controls access to nvcr.io, or they can give you permission to create them. This is where you can store your specific containers and even share them with your colleagues.
    Figure 14. Creating a container using the Dockerfile
    Note: This screen capture has been cropped to improve readability.

    In the brief output from the docker build … command seen above, each line in the Dockerfile is a Step. In the screen capture, you can see the first and second steps (commands). Docker echoes these commands to standard output (stdout) so you can watch what it is doing, or you can capture the output for documentation.

    After the image is built, remember that we haven’t stored the image in a repository yet; therefore, it’s a Docker image. Docker prints out the image ID to stdout at the very end. It also tells you if you have successfully created and tagged the image.

    If you don’t see Successfully ... at the end of the output, examine your Dockerfile for errors (perhaps try to simplify it) or try a very simple Dockerfile to ensure that Docker is working properly.

  4. Verify that Docker successfully created the image.
    $ docker images
    Figure 15. Verifying Docker created the image
    Note: The screen capture has been cropped to make it more readable.

    The very first entry is the new image (about 1 minute old).

  5. Push the image into the repository, creating a container.
    $ docker push <name of image>
    Figure 16. Example of the docker push command

    The above screen capture is after the docker push … command pushes the image to the repository creating a container. At this point, you should log into the NGC container registry at https://ngc.nvidia.com and look under your project to see if the container is there.

    If you don’t see the container in your project, make sure that the tag on the image matches the location in the repository. If, for some reason, the push fails, try it again in case there was a communication issue between your system and the container registry (nvcr.io).

    To make sure that the container is in the repository, we can pull it to the server and run it. As a test, first remove the image from the DGX Station using the command docker rmi …. Then pull the container down to the server using docker pull …. The image can be run using nvidia-docker as shown below.
    Figure 17. Example of using nvidia-docker to pull container
    Notice that the octave prompt came up so it is installed and functioning correctly within the limits of this testing.
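
The build and push steps above can be collected into one shell sketch. The Dockerfile contents are reconstructed from the three-line description in step 3, so the exact text in the original figure may differ; in particular, the -y flag, which lets apt-get install run without prompting, is an assumption. The helper names write_dockerfile and build_and_push are illustrative.

```shell
# Write the three-line Dockerfile described in step 3 into directory $1.
write_dockerfile() {
  mkdir -p "$1" || return 1
  cat > "$1/Dockerfile" <<'EOF'
FROM nvcr.io/nvidia/tensorflow:17.06
RUN apt-get update
RUN apt-get install -y octave
EOF
}

# Build the image from directory $1, tag it as $2, and push it.
build_and_push() {
  docker build -t "$2" "$1" || return 1
  docker push "$2"
}

# Usage (requires Docker and push access to the project):
#   write_dockerfile mydocker
#   build_and_push mydocker nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave
```

Passing the directory as the last argument to docker build makes it the build context, so the helper works from any current directory.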

2.5. Using And Mounting File Systems

One of the fundamental aspects of using Docker is mounting file systems inside the Docker container. These file systems can contain input data for the frameworks or even code to run in the container.

Docker containers have their own internal file system that is separate from file systems on the rest of the host.
Important: You can copy data into the container file system from outside if you want. However, it’s far easier to mount an outside file system into the container.
Mounting outside file systems is done using the nvidia-docker command using the -v option. For example, the following command mounts two file systems:
$ nvidia-docker run --rm -ti ... -v $HOME:$HOME \
  -v /datasets:/digits_data:ro \
  ...
Most of the command has been omitted except for the volumes. This command mounts the user’s home directory from the external file system to the home directory in the container (-v $HOME:$HOME). It also takes the /datasets directory from the host and mounts it on /digits_data inside the container (-v /datasets:/digits_data:ro).
Remember: The user has root privileges with Docker; therefore, you can mount almost anything from the host system to anywhere in the container.
For this particular command, the volume option takes the form of:
-v <External FS Path>:<Container FS Path>[:<options>] \

The first part of the option is the path for the external file system. To be sure this works correctly, it’s best to use the fully qualified path (FQP). This is also true for the mount point inside the container, <Container FS Path>.

After the last path, various options can be appended, separated from the path by a colon. In the above example, the second file system is mounted read-only (ro) inside the container. The various options for the volume option are discussed here.
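
As a sketch of how the pieces of a -v argument fit together, the helper below composes the value from the host path, the container path, and an optional option such as ro. mount_spec is a hypothetical name; the option is appended after a second colon, as in the /datasets example above.

```shell
# Compose the value for a -v option.
#   $1 = path on the host, $2 = path in the container,
#   $3 = optional mount option (for example, ro)
mount_spec() {
  if [ -n "${3:-}" ]; then
    printf '%s:%s:%s\n' "$1" "$2" "$3"
  else
    printf '%s:%s\n' "$1" "$2"
  fi
}

# Usage (requires nvidia-docker; <image> is a placeholder):
#   nvidia-docker run --rm -ti \
#     -v "$(mount_spec "$HOME" "$HOME")" \
#     -v "$(mount_spec /datasets /digits_data ro)" \
#     <image>
```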

The DGX™ systems (DGX-1 and DGX Station), and the nvidia-docker containers use the Overlay2 storage driver to mount external file systems onto the container file system. Overlay2 is a union-mount file system driver that allows you to combine multiple file systems so that all the content appears to be combined into a single file system. It creates a union of the file systems rather than an intersection.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station, GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.