Abstract

This Best Practices User Guide covers DGX-1, DGX Station, NVIDIA GPU Cloud, Containers and Frameworks. It provides recommendations to help administrators and users work with Docker, extend frameworks, and administer and manage DGX products.

1. About This Guide

This guide provides recommendations to help administrators and users work with Docker®, extend frameworks, and administer and manage the DGX-1™, DGX Station™, and NVIDIA® GPU Cloud™ (NGC) products. Although this entire guide consists of best practices, it also explains, whenever possible, the reasons behind those recommendations. The most effective recommendations are labeled as such:
Important:

This guide does not provide step-by-step instructions. For additional procedural instruction, see the Preparing To Use NVIDIA Containers Getting Started Guide and the NVIDIA Containers for Deep Learning Frameworks User Guide.

2. Introduction To nvidia-docker And Docker

The DGX-1, DGX Station, and the NVIDIA NGC Cloud Services are designed to run containers. Containers hold the application as well as any libraries or code that are needed to run the application. Containers are portable within an operating system family. For example, you can create a container using Red Hat Enterprise Linux and run it on an Ubuntu system, or vice versa. The only requirement is that each operating system has the container software installed so it can run containers.

Using containers allows you to create the software on whatever OS you are comfortable with and then run the application wherever you want. It also allows you to share the application with other users without having to rebuild the application on the OS they are using.

Containers are different from a virtual machine (VM) such as VMware. A VM has a complete operating system and possibly applications and data files. Containers do not contain a complete operating system. They only contain the software needed to run the application. The container relies on the host OS for things such as file system services, networking, and an OS kernel. The application in the container will always run the same anywhere, regardless of the OS/compute environment.

All three products, the DGX-1, the DGX Station, and the NVIDIA NGC Cloud Services, use Docker. Docker is one of the most popular container services available and is very commonly used by developers in the Artificial Intelligence (AI) space. There is a public Docker repository that holds pre-built Docker containers. These containers can be a simple base OS such as CentOS, or they may be a complete application such as TensorFlow™. You can use these Docker containers for running the applications that they contain. You can also use them as the basis for creating other containers, for example, by extending a container.

To enable portability in Docker images that leverage GPUs, NVIDIA developed the Docker® Engine Utility for NVIDIA® GPUs, also known as nvidia-docker. We will refer to it simply as nvidia-docker for the remainder of this guide.

With the three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services, NVIDIA provides access to Docker containers that have been specially built, tuned, and optimized for NVIDIA GPUs. This is done through NVIDIA’s private Docker repository, nvcr.io. Some of these containers are for deep learning frameworks and some contain the building blocks of GPU applications. They are there for your use, but are only licensed for use on these three systems: the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. You are not restricted to using only the nvidia-docker containers; you can also use public Docker containers or other Docker containers on these systems.

Containers are not difficult to use. There are just a few basic commands. It’s also not difficult to build a container, particularly if you are starting with an existing container and building upon it. If you are new to containers, especially Docker containers, the next section provides some best practices around Docker and its commands.

3. Docker Best Practices with NVIDIA Containers

The following sections highlight the best practices for using Docker with NVIDIA containers.
  1. Prerequisites. See Prerequisites.
  2. Log into Docker. See Logging into Docker.
  3. List the Docker images on the DGX-1, DGX Station, or the NVIDIA NGC Cloud Services. See Listing Docker Images.
  4. Pull a container. See Pulling a Container.
  5. Run the container. See Running a Container.
  6. Verify that the container is running properly. See Verifying.

3.1. Prerequisites

You can access NVIDIA’s GPU accelerated containers from all three products: the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. If you own a DGX-1 or DGX Station, you should use the NVIDIA® DGX™ container registry at https://compute.nvidia.com. This is a web interface to the NVIDIA DGX container registry, nvcr.io, which is a Docker registry. You can pull containers from there and you can also push containers into your own account in the registry.

If you are accessing the NVIDIA containers from the NVIDIA® GPU Cloud™ (NGC) container registry via a cloud services provider such as Amazon Web Services (AWS), then you should use the NGC container registry at https://ngc.nvidia.com. This is also a web interface to the same Docker registry as for the DGX-1 and DGX Station. After you create an account, the commands to pull containers are the same as if you had a DGX-1 in your own data center. However, currently, you cannot save any containers to the NGC container registry. Instead, you have to save them to your own Docker repository.
Note: The containers are exactly the same, whether you pull them from the NVIDIA DGX container registry or the NGC container registry.

For all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services, the location of the framework source is in /opt/<framework> in the container.
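For example, inside the TensorFlow container used later in this guide, the framework source would be found under /opt/tensorflow (shown for illustration only; the path follows the /opt/<framework> pattern described above):
$ ls /opt/tensorflow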

Before you can pull a container from the NGC container registry, you must have Docker and nvidia-docker installed, as explained in the Preparing To Use NVIDIA Containers Getting Started Guide. You must also have access to, and be logged into, the NGC container registry as explained in the NGC Getting Started Guide.

3.1.1. Hello World For Containers

To make sure you have access to the NVIDIA containers, start with the proverbial “hello world” of Docker commands.

For the DGX-1 and DGX Station, just log into the system. For the NVIDIA NGC Cloud Services consult the NGC Getting Started Guide for details about your specific cloud provider. In general, you will start a cloud instance with your cloud provider using the NVIDIA Volta Deep Learning Image. After the instance has booted, log into the instance.

Next, you can issue the docker --version command to list the version of Docker; this works on all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. The output of this command tells you the version of Docker on the system (17.05-ce, build 89658be).
Figure 1. Listing of Docker version
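For example, the command and its output look similar to the following (the exact version string depends on your installation):
$ docker --version
Docker version 17.05-ce, build 89658be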

At any time, if you are not sure about a Docker command, issue the $ docker --help command.

3.2. Logging Into Docker

If you have a DGX-1 or a DGX Station on premise, then the first time you log into the DGX-1 or DGX Station, you are required to set up access to the containers using https://compute.nvidia.com. This requires that the DGX-1 or DGX Station be connected to the Internet. For more information, see the DGX Container Registry User Guide.

In the case of NVIDIA NGC Cloud Services, where you are running nvidia-docker containers in the cloud, the first time you log in you are required to set up access to the NVIDIA NGC Cloud Services at https://ngc.nvidia.com. This requires that the cloud instance be connected to the Internet. For more information, see the Preparing To Use NVIDIA Containers Getting Started Guide and the NGC Getting Started Guide.

3.3. Listing Docker Images

Typically, one of the first things you will want to do is get a list of all the Docker images that are currently available locally. When a Docker container is stored in a repository, it is said to be a container. When you pull the container from a repository to a system, such as the DGX-1, it is then said to be a Docker image. This means the image is local.

Issue the $ docker images command to list the images on the server. Your screen will look similar to the following:
Figure 2. Listing of Docker images

In this example, there are a few Docker images that have been pulled down to this system. Each image is listed along with its tag and the corresponding image ID. There are two other columns that list when the image was created (approximately) and the approximate size of the image in GB. These columns have been cropped to improve readability.
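A sketch of what such a listing might look like (repository names are taken from this guide; the IDs, dates, and sizes are placeholders):
$ docker images
REPOSITORY                  TAG      IMAGE ID      CREATED      SIZE
nvcr.io/nvidia/tensorflow   17.06    <image id>    <created>    <size>
nvcr.io/nvidia/caffe        17.05    <image id>    <created>    <size>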

At any time, if you need help, issue the $ docker images --help command.

3.4. Pulling A Container

A Docker container is composed of layers. The layers are combined to create the container. You can think of layers as intermediate images that add some capability to the overall container. If you make a change to a layer through a Dockerfile (see Building Containers), then Docker rebuilds that layer and all subsequent layers, but not the layers that are not affected by the change. This reduces the time to create containers and also allows you to keep them modular.

Docker is also very good about keeping only one copy of each layer on a system. This saves space and also greatly reduces the possibility of version skew, because layers that should be the same are not duplicated.

Pulling a container to the system makes the container an image. When the container is pulled to become an image, all of the layers are downloaded. Depending upon how many layers are in the container and how the system is connected to the Internet, it may take some time to download.

The $ docker pull nvcr.io/nvidia/tensorflow:17.06 command pulls the container from the NVIDIA repository to the local system where the command is run. At that point, it is a Docker image. The structure of the pull command is:
$ docker pull <repository>/nvidia/<container>:<xx.xx>
Where:
  • <repository> is the path to where the container is stored (the Docker repo). In the following example, the repository is nvcr.io/nvidia (NVIDIA’s private repository).
  • <container> is the name of the container. In the following example we use tensorflow.
  • <xx.xx> is the specific version of the container. In the following example we use 17.06.
Below is an image when a TensorFlow container is pulled using the following command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 3. Example of pulling TensorFlow 17.06

As you can tell, the container had already been pulled down on this particular system (some of the output from the command has been cut off). At this point the image is ready to be run.

In most cases, you will not find a container already downloaded to the system. Below is some sample output for the case when the container has to be pulled down from the registry, using the command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 4. Example of pulling TensorFlow 17.06 that had not already been loaded onto the server
Below is the output after the pull is finished, using the command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 5. Pulling of the container is complete
Note: The screen capture has been cropped in the interest of readability.

3.5. Running A Container

After the nvidia-docker container is pulled down to the system, creating a Docker image, you can run or execute the image.
Important: Use the nvidia-docker command to ensure that the correct NVIDIA drivers and libraries are used. The next section discusses nvidia-docker.
A typical command to run the container is:
nvidia-docker run -it --rm -v local_dir:container_dir 
nvcr.io/nvidia/<container>:<xx.xx>

Where:
  • -it means run in interactive mode
  • --rm means delete the container when it exits
  • -v means mount a directory
  • local_dir is the directory or file from your host system (absolute path) that you want to access from inside your container. For example, the local_dir in the following path is /home/jsmith/data/mnist.
    -v /home/jsmith/data/mnist:/data/mnist 

    If you are inside the container, for example, ls /data/mnist, you will see the same files as if you issued the ls /home/jsmith/data/mnist command from outside the container.

  • container_dir is the target directory when you are inside your container. For example, /data/mnist is the target directory in the example:
    -v /home/jsmith/data/mnist:/data/mnist
  • <container> is the name of the container.
  • <xx.xx> is the tag. For example, 17.06.

3.6. Verifying

After a Docker image is running, you can verify that it is running by using the classic *nix-style ps option. For example, issue the $ docker ps -a command.
Figure 6. Verifying a Docker image is running
Without the -a option, only running instances are listed.
Important: It is best to include the -a option in case there are hung jobs running or other performance problems.
You can also stop a running container if you want. For example:
Figure 7. Stopping a container from running
Note: This screen capture has been cropped to improve readability.

Notice that you need the Container ID of the image you want to stop. This can be found using the $ docker ps -a command.
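For example, a minimal stop sequence looks like the following (the container ID is a placeholder taken from the ps output):
$ docker ps -a            # find the CONTAINER ID of the running container
$ docker stop <CONTAINER ID>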

Another useful Docker command is docker rmi, which removes an image from the server. Removing or deleting an image saves space on the server. For example, issue the following command:
$ docker rmi nvcr.io/nvidia/tensorflow:17.06
Figure 8. Removing an image from the server
If you list the images, $ docker images, on the server, then you will see that the image is no longer there.
Figure 9. Confirming the image is removed from the server
Note: This screen capture has been cropped to improve readability.

4. Docker Best Practices

You can run an nvidia-docker container on any platform that is Docker compatible, allowing you to move your application wherever you need it. The containers are platform-agnostic and, therefore, hardware-agnostic as well. However, NVIDIA GPUs introduce some complexity: to get the best performance and take full advantage of the GPU, specific kernel modules and user-level libraries are required.

One approach to solving this complexity when using containers is to install the NVIDIA drivers in the container and map in the character devices corresponding to the NVIDIA GPUs, such as /dev/nvidia0. For this to work, the driver on the host (the system that is running the container) must match the version of the driver installed in the container. This requirement drastically reduces the portability of the container.

4.1. nvidia-docker Containers Best Practices

To make things easier for Docker® containers that are built for GPUs, NVIDIA® has created nvidia-docker. It is an open-source project hosted on GitHub. It is basically a wrapper around the docker command that takes care of mounting the GPU devices and driver components that your container needs to run.
Important: It is highly recommended you use nvidia-docker when running a Docker container that uses GPUs.
Specifically, it provides two components for portable GPU-based containers.
  1. Driver-agnostic Compute Unified Device Architecture® (CUDA) images
  2. A Docker command-line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
The nvidia-docker wrapper focuses solely on helping you run images that contain GPU-dependent applications; otherwise, it passes the arguments through to the regular docker command. A good introduction to nvidia-docker is here.
Important: Some things to always remember:
  • Use the nvidia-docker command when you are running and executing containers.
  • When building containers for NVIDIA GPUs, use the base containers in the repository. This will ensure the containers are compatible with nvidia-docker.
Let’s assume the TensorFlow 17.06 container has been pulled down to the system and is now an image that is ready to be run. The following command can be used to execute it.
$ nvidia-docker run --rm -ti nvcr.io/nvidia/tensorflow:17.06
Figure 10. Executing the run command
This takes you to a command prompt inside the container.
Remember: You are root inside the container.

The option --rm tells nvidia-docker to remove the container instance when it exits. If you make any changes inside the container while it’s running, they will be lost.

The option -ti tells docker to run in interactive mode and associate a tty with the instance (basically, a shell).

Running the TensorFlow image didn’t really do anything; it just brought up a command line inside the image where you are root. Below is a better example where the CUDA container is pulled down and the image is executed along with a simple command. This view at least gives you some feedback.
Figure 11. Running an image to give you feedback
This docker image actually executed a command, nvcc --version, which provides some output (for example, the version of the nvcc compiler). If you want to get a bash shell in the image, then you can run bash within the image.
Figure 12. Getting a bash shell in the image
Note: This screen capture has been cropped to improve readability.
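A sketch of both invocations, assuming the CUDA container tag that is used later in this guide:
$ nvidia-docker run --rm nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 nvcc --version
$ nvidia-docker run --rm -ti nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 bash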

The frameworks that are part of the nvidia-docker repository, nvcr.io, have some specific options for achieving the best performance. This is true for all three systems, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. For more information, see Frameworks Best Practices.

In the section Using And Mounting File Systems, some options for mounting external file systems in the running image are explained.
Important: This allows you to keep data and code stored in one place on the system outside of the containers, while keeping the containers intact.
It also allows the containers to stay generic, so they don’t start proliferating as each user creates their own version of the container for their data and code.

4.2. docker exec

There are times when you will need to connect to a running container. You can use the docker exec command to connect to a running container to run commands. You can use the bash command to start an interactive command line terminal or bash shell. The format of the command is:
$ docker exec -ti <CONTAINER_ID_OR_NAME> bash
As an example, suppose one starts a Deep Learning GPU Training System™ (DIGITS) container with the following command:
$ nvidia-docker run -d --name test-digits \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/digits:17.05
After the container is running, you can now connect to the container instance with the following command.
$ docker exec -it test-digits bash
Note: test-digits is the name of the container. If you don’t specifically name the container, you will have to use the container ID.
Important: Using docker exec, you can execute a snippet of code, run a script, or attach interactively to the container, which makes the docker exec command very useful.

For detailed usage of the docker exec command, see docker exec.

4.3. nvcr.io

Building deep learning frameworks can be quite a bit of work and can be very time consuming. Moreover, these frameworks are being updated weekly, if not daily. On top of this, the frameworks need to be optimized and tuned for GPUs. NVIDIA has created a Docker repository, named nvcr.io, where deep learning frameworks are tuned, optimized, and containerized for your use.

NVIDIA creates an updated set of nvidia-docker containers for the frameworks monthly. Included in the container is source (these are open-source frameworks), scripts for building the frameworks, Dockerfiles for creating containers based on these containers, markdown files that contain text about the specific container, and tools and scripts for pulling down data sets that can be used for testing or learning. Customers who purchase a DGX-1 or DGX Station have access to this repository for pushing containers (storing containers). When using NVIDIA NGC Cloud Services with a cloud provider, currently you cannot push or save a container to nvcr.io. Instead, you need to save them to a private Docker repository.

To get started with the DGX-1 or DGX Station, you need to create a system admin account for accessing nvcr.io. This account should be treated as an admin account so that users cannot access it. Once this account is created, the system admin can create accounts for projects that belong to the account. They can then give users access to these projects so that they can store or share any containers that they create.

When using the NVIDIA containers with a cloud provider, you are using the NGC container registry that is part of the NVIDIA NGC Cloud Services. It uses the exact same containers as those in nvcr.io.

4.4. Building Containers

You can build containers for the DGX systems and you can even store them in the nvcr.io registry as a project within your account if you have a DGX-1 or DGX Station (that is, no one else can access the container unless you give them access). Currently, only the DGX-1 and DGX Station can store containers in nvcr.io. If you are running on NVIDIA NGC Cloud Services using a cloud provider, you can only pull containers from nvcr.io. You must save your containers to a private Docker repository (not nvcr.io).

This section of the document applies to Docker containers in general. You can use the general approach for your own Docker repository as well, but be cautious of the details.

Using a DGX-1 or DGX Station, you can either:
  1. Create your container from scratch
  2. Base your container on an existing Docker container
  3. Base your container on containers in nvcr.io.
Any one of the three approaches is valid and will work; however, the goal is to run the containers on a system with multiple GPUs (eight in the case of the DGX-1). The containers in nvcr.io are already tuned for the DGX systems and their GPU topology, and they include the needed GPU libraries, configuration files, and tools to rebuild the container.
Important: Based on these considerations, it is recommended that you start with a container from nvcr.io.

An existing container in nvcr.io should be used as a starting point. As an example, the TensorFlow 17.06 container will be used and Octave will be added to the container so that some post-processing of the results can be accomplished.

  1. Pull the container from the NGC container registry to the server. See Pulling A Container.
  2. On the server, create a subdirectory called mydocker.
    Note: This is an arbitrary directory name.
  3. Inside this directory, create a file called Dockerfile (capitalization is important). This is the default name that Docker looks for when creating a container. The Dockerfile should look similar to the following:
    Figure 13. Example of a Dockerfile
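    Reconstructed from the description below, the Dockerfile contents would look similar to the following (your package list may differ):
    FROM nvcr.io/nvidia/tensorflow:17.06
    RUN apt-get update
    RUN apt-get install -y octave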
    There are three lines in the Dockerfile.
    • The first line in the Dockerfile tells Docker to start with the container nvcr.io/nvidia/tensorflow:17.06. This is the base container for the new container.
    • The second line in the Dockerfile performs a package update for the container. It doesn’t update any of the applications in the container but just updates the apt-get database. This is needed before we install new applications in the container.
    • The third and last line in the Dockerfile tells Docker to install the package octave into the container using apt-get.
    The Docker command to create the container is:
    $ docker build -t nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave .
    Note: This command uses the default file Dockerfile for creating the container.
    In the following screen capture, the command starts with docker build. The -t option creates a tag for this new container. Notice that the tag specifies the project in the nvcr.io repository where the container is to be stored. As an example, the project nvidian_sas was used along with the repository nvcr.io. Projects can be created by your local administrator who controls access to nvcr.io, or they can give you permission to create them. This is where you can store your specific containers and even share them with your colleagues.
    Figure 14. Creating a container using the Dockerfile
    Note: This screen capture has been cropped to improve readability.

    In the brief output from the docker build … command seen above, each line in the Dockerfile is a Step. In the screen capture, you can see the first and second steps (commands). Docker echoes these commands to the standard out (stdout) so you can watch what it is doing or you can capture the output for documentation.

    After the image is built, remember that we haven’t stored the image in a repository yet, therefore, it’s a docker image. Docker prints out the image id to stdout at the very end. It also tells you if you have successfully created and tagged the image.

    If you don’t see Successfully ... at the end of the output, examine your Dockerfile for errors (perhaps try to simplify it) or try a very simple Dockerfile to ensure that Docker is working properly.

  4. Verify that Docker successfully created the image.
    $ docker images
    Figure 15. Verifying Docker created the image
    Note: The screen capture has been cropped to make it more readable.

    The very first entry is the new image (about 1 minute old).

  5. Push the image into the repository, creating a container.
    docker push <name of image>
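    For example, using the tag created in the previous step, the push command would be:
    $ docker push nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave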
    Figure 16. Example of the docker push command

    The above screen capture is after the docker push … command pushes the image to the repository creating a container. At this point, you should log into the NGC container registry at https://ngc.nvidia.com and look under your project to see if the container is there.

    If you don’t see the container in your project, make sure that the tag on the image matches the location in the repository. If, for some reason, the push fails, try it again in case there was a communication issue between your system and the container registry (nvcr.io).

    To make sure that the container is in the repository, we can pull it to the server and run it. As a test, first remove the image from the DGX Station using the command docker rmi …. Then pull the container down to the server using docker pull …. The image can be run using nvidia-docker as shown below.
    Figure 17. Example of using nvidia-docker to pull container
    Notice that the Octave prompt came up, so Octave is installed and functioning correctly, at least within the limits of this test.
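    A sketch of that round trip, using the example tag from this section (octave is launched inside the container to confirm the installation):
    $ docker rmi nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave
    $ docker pull nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave
    $ nvidia-docker run --rm -ti nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave octave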

4.5. Using And Mounting File Systems

One of the fundamental aspects of using Docker is mounting file systems inside the Docker container. These file systems can contain input data for the frameworks or even code to run in the container.

Docker containers have their own internal file system that is separate from file systems on the rest of the host.
Important: You can copy data into the container file system from outside if you want. However, it’s far easier to mount an outside file system into the container.
Mounting outside file systems is done using the nvidia-docker command using the -v option. For example, the following command mounts two file systems:
$ nvidia-docker run --rm -ti ... -v $HOME:$HOME \
  -v /datasets:/digits_data:ro \
  ...
Most of the command has been omitted except for the volume mounts. This command mounts the user’s home directory from the external file system to the home directory in the container (-v $HOME:$HOME). It also takes the /datasets directory from the host and mounts it on /digits_data inside the container (-v /datasets:/digits_data:ro).
Remember: The user has root privileges with Docker, therefore you can mount almost anything from the host system to anywhere in the container.
For this particular command, the volume command takes the form of:
-v <External FS Path>:<Container FS Path>:<options> \

The first part of the option is the path for the external file system. To be sure this works correctly, it’s best to use the fully qualified path (FQP). This is also true for the mount point inside the container <Container FS Path>.

After the container path, various options can be specified after a second colon. In the above example, the second file system is mounted read-only (ro) inside the container. The various options for the volume option are discussed here.

The DGX™ systems (DGX-1 and DGX Station) and the nvidia-docker containers use the Overlay2 storage driver for the container file systems. Overlay2 is a union-mount file system driver that allows you to combine multiple file systems so that all the content appears to be combined into a single file system. It creates a union of the file systems rather than an intersection.

5. Frameworks Best Practices

As part of the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services systems, NVIDIA makes available tuned, optimized, and ready to run nvidia-docker containers for the major deep learning frameworks. These containers are made available via the container registry, nvcr.io, so that you can use them directly or use them as a basis for creating your own containers.

This section presents tips for efficiently using these frameworks. This section does not explain how to use the frameworks for addressing your projects, rather, it presents best practices for starting them.

There are a few general best practices around the containers (the frameworks) in nvcr.io. As mentioned earlier, it’s possible to use one of the containers and build upon it. By doing this, you are, in a sense, fixing the new container to a specific framework and container version. This approach works well if you are creating a derivative of a framework or adding some capability that doesn’t exist in the framework or container.

Important: However, it is a best practice not to put datasets in a container. If possible also avoid storing business logic code in a container.
The reason is that storing datasets or business logic code within a container makes it difficult to generalize the use of the container. Instead, mount file systems into the container that contain only the desired datasets and the directories holding the business logic code to run. Decoupling the container from specific datasets and business logic enables you to easily change containers, such as the framework or the version of a container, without having to rebuild the container to hold the data or code.
Important: The main takeaway is to use volumes from outside the container for datasets and business logic code. Keep the container as generic as possible.
When applying this practice to deep learning workflows, the non-business logic code is the containerized framework, such as TensorFlow, and the business logic code would be a Python file defining a TensorFlow network along with code to read, process, and write data. The data is read from a readable mounted dataset and results are written to a writeable mounted volume (which could be the same location as the mounted readable dataset).
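For example, a TensorFlow training run that follows this practice might be launched as shown below; the dataset, results, and code paths, as well as the train.py script and its flags, are illustrative and should be adjusted to your system:
$ nvidia-docker run --rm -ti \
  -v /datasets/mydata:/data:ro \
  -v $HOME/results:/results \
  -v $HOME/myproject:/workspace/myproject \
  nvcr.io/nvidia/tensorflow:17.06 \
  python /workspace/myproject/train.py --datadir=/data --outdir=/results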

The subsequent sections briefly present some best practices around the major frameworks that are in containers on the container registry (nvcr.io). There is also a section that discusses how to use Keras, a very popular high-level abstraction of deep learning frameworks, with some of the containers.

5.1. NVCaffe

NVCaffe™ can run using the DIGITS application or directly via a command line interface. Also, a Python interface for NVCaffe called pycaffe is available.

When running NVCaffe via the command line or pycaffe, use the nvcr.io/nvidia/caffe:17.05 (or later) container. For an example, try the run_caffe_mnist.sh script (see the run_caffe_mnist.sh section), which uses the MNIST data and the LeNet network to perform training via the NVCaffe command line. In the script, the data path is set to /datasets/caffe_mnist. You can modify the path to your desired location. To run the script, use the following commands.
./run_caffe_mnist.sh
# or with multiple GPUs use -gpu flag: "-gpu=all" for all gpus or
#   comma list.
./run_caffe_mnist.sh -gpu=0,1

This script demonstrates how to orchestrate a container, pass external data to the container, and run NVCaffe training while storing the output in a working directory. Read through the run_caffe_mnist.sh script for more details. It is based on the MNIST training example.

The Python interface, pycaffe, is used via import caffe in a Python script. For examples of using pycaffe and the Python interface, refer to the test scripts.

Orchestrating a Python script with Docker containers is described in the run_tf_cifar10.sh section, which uses the run_tf_cifar10.sh script.

An interactive session with NVCaffe can be set up with the following lines in a script:
DATA=/datasets/caffe_mnist
CAFFEWORKDIR=$HOME/caffe_workdir
 
mkdir -p $DATA
mkdir -p $CAFFEWORKDIR/mnist
 
dname=${USER}_caffe
 
# Orchestrate Docker container with user's privileges
nvidia-docker run -d -t --name=$dname \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -e DATA=$DATA -v $DATA:$DATA \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -w $CAFFEWORKDIR nvcr.io/nvidia/caffe:17.05
 
# enter interactive session
docker exec -it $dname bash
 
# After exiting the interactive container session, stop and rm
#   container.
# docker stop $dname && docker rm $dname
In the script, the following line has options for Docker to enable proper NVIDIA® Collective Communications Library ™ (NCCL) operation for running NVCaffe with multiple GPUs.
 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864
You can use the NVCaffe command line or Python interface within the NVCaffe container. For example, using the command line would look similar to the following:
caffe device_query -gpu 0 # query GPU stats. Use "-gpu all" for all gpus
caffe train help # print out help/usage
Using a Python interface would look similar to the following:
# start python in container
>>> import caffe
>>> dir(caffe)
['AdaDeltaSolver', 'AdaGradSolver', 'AdamSolver', 'Classifier', 'Detector',
 'Layer', 'NesterovSolver', 'Net', 'NetSpec', 'RMSPropSolver', 'SGDSolver',
 'TEST', 'TRAIN', '__builtins__', '__doc__', '__name__',
 '__package__', '__path__', '__version__', '_caffe', 'classifier', 'detector',
 'get_solver', 'io', 'layer_type_list', 'layers', 'net_spec', 'params',
 'proto', 'pycaffe', 'set_device', 'set_mode_cpu', 'set_mode_gpu', 'to_proto']

For more information about NVCaffe, see NVCaffe documentation.

5.2. Caffe2

Caffe2™ is a deep learning framework enabling simple and flexible deep learning. Built on the original BVLC Caffe™ , Caffe2 is designed with expression, speed, and modularity in mind, allowing for a more flexible way to organize computation.

Caffe2 aims to provide an easy and straightforward way for you to experiment with deep learning by leveraging community contributions of new models and algorithms. Caffe2 comes with native Python and C++ APIs that work interchangeably so you can prototype quickly now, and easily optimize later. Caffe2 is fine tuned from the ground up to take full advantage of the latest NVIDIA Deep Learning SDK libraries, CUDA® Deep Neural Network library™ (cuDNN), CUDA® Basic Linear Algebra Subroutines library™ (cuBLAS) and NCCL, to deliver high-performance, multi-GPU acceleration for desktop, data centers, and embedded edge devices.

There is an informative introduction to Caffe2 that includes some comparative tests. NVIDIA provides a web page with the release notes for the Caffe2 version that is included. If you want to build Caffe2 yourself or if you want to see test results with Caffe2, you can find information on NVIDIA’s GPU Ready App page for Caffe2. There is also a lab for Caffe2 in the NVIDIA Deep Learning Institute.

5.3. Microsoft Cognitive Toolkit

The Microsoft® Cognitive Toolkit™, previously known as CNTK, allows users to easily realize and combine popular model types such as feed-forward DNNs, convolutional nets (CNNs), and recurrent networks (RNNs/LSTMs). Version 2.1 was released on 7/30/2017 and included support for cuDNN 6 and Keras.

NVIDIA includes a pre-built release of the Microsoft Cognitive Toolkit in the container registry (nvcr.io). You can find the release notes here. The NVIDIA Deep Learning Institute (DLI) also has a course that utilizes the Microsoft Cognitive Toolkit, although it may be referred to as CNTK.

5.4. DIGITS

DIGITS is a popular training workflow manager provided by NVIDIA. Using DIGITS, one can manage image data sets and training through an easy-to-use web interface for the NVCaffe, Torch™, and TensorFlow frameworks.

For more information, see NVIDIA DIGITS, DIGITS source and DIGITS documentation.

5.4.1. Setting Up DIGITS

The following directories, files and ports are useful in running the DIGITS container.
Table 1. Running DIGITS container details
Description | Value | Notes
DIGITS working directory | $HOME/digits_workdir | You must create this directory.
DIGITS job directory | $HOME/digits_workdir/jobs | You must create this directory.
DIGITS config file | $HOME/digits_workdir/digits_config_env.sh | Used to pass the job directory and log file.
DIGITS port | 5000 | Choose a unique port if multi-user.
Important: It is recommended to specify a list of environment variables in a single file that can be passed to the nvidia-docker run command via the --env-file option.
The digits_config_env.sh section contains a script that declares the location of the DIGITS job directory and log file. Using a configuration script like this is a common practice when running DIGITS. Below is an example of defining these two variables in a simple bash script.
# DIGITS Configuration File
DIGITS_JOB_DIR=$HOME/digits_workdir/jobs
DIGITS_LOGFILE_FILENAME=$HOME/digits_workdir/digits.log

For more information about configuring DIGITS, see Configuration.md.

5.4.2. Running DIGITS

To run DIGITS, refer to the example script in section run_digits.sh. However, if you want to run DIGITS from the command line, there is a sample nvidia-docker command that has most of the needed details to effectively run DIGITS.
Note: You will have to create the jobs directory if it doesn’t already exist.
$ mkdir -p $HOME/digits_workdir/jobs
 
$ NV_GPU=0,1 nvidia-docker run --rm -ti --name=${USER}_digits -p 5000:5000 \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  --env-file=${HOME}/digits_workdir/digits_config_env.sh \
  -v /datasets:/digits_data:ro \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/digits:17.05
This command has several options, some of which you might need and some of which you might not. In the table below is a list of the parameters and their descriptions.
Table 2. nvidia-docker run command options
Parameter | Description
NV_GPU | Optional environment variable specifying GPUs available to the container.
--name | Name to associate with the Docker container instance.
--rm | Tells Docker to remove the container instance when done.
-ti | Tells Docker to run in interactive mode and associate a tty with the instance.
-d | Tells Docker to run in daemon mode; no tty, run in background (not shown in the command and not recommended for running with DIGITS).
-p p1:p2 | Tells Docker to map host port p1 to container port p2 for external access. This is useful for pushing DIGITS output through a firewall.
-u id:gid | Tells Docker to run the container with user id and group id for file permissions.
-v d1:d2 | Tells Docker to map host directory d1 into the container at directory d2.
Important: This is a very useful option because it allows you to store the data outside of the container.
--env-file | Tells Docker which environment variables to set for the container.
--shm-size ... | This line is a temporary workaround for a DIGITS multi-GPU error you might encounter.
container | Tells Docker which container instance to run (for example, nvcr.io/nvidia/digits:17.05).
command | Optional command to run after the container is started. This option is not used in the example.
After DIGITS starts running, open a browser using the IP address and port of the system. For example, the URL would be http://dgxip:5000/. If the port is blocked and an SSH tunnel has been set up (see SSH Tunneling), then you can use the URL http://localhost:5000/.
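For reference, a sketch of such a tunnel created from your workstation, forwarding the DIGITS port used above (the user and host are placeholders):
$ ssh -L 5000:localhost:5000 <user>@<dgx host ip>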
In this example, the datasets are mounted to /digits_data (inside the container) via the option, -v /datasets:/digits_data:ro. Outside the container, the datasets reside in /datasets (this can be any path on the system). Inside the container the data is mapped to /digits_data. It is also mounted read-only (ro) with the option :ro.
Important: For both paths, it is highly recommended to use the fully qualified path name for outside the container and inside the container.

If you are looking for datasets for learning how to use the system and the containers, there are some standard datasets that can be downloaded via DIGITS.

Included in the DIGITS container is a Python script that can be used to download specific sample datasets. The tool is called digits.download_data. It can be used to download the MNIST data set, the CIFAR-10 dataset, and the CIFAR-100 dataset. You can also use this script in the command to run DIGITS so that it pulls down the sample dataset. Below is an example for the MNIST dataset.
$ nvidia-docker run --rm -ti \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  --env-file=${HOME}/digits_workdir/digits_config_env.sh \
  -v /datasets:/digits_data \
  --entrypoint=bash \
  nvcr.io/nvidia/digits:17.05 \
  -c 'python -m digits.download_data mnist /digits_data/digits_mnist'

In the download example above, the entry point to the container was overridden to run a bash command to download the dataset (the -c option). You should adjust the datasets paths as needed.

An example of running DIGITS on MNIST data can be found here.

More DIGITS examples can be found here.

5.5. Keras And Containerized Frameworks

Keras is a popular Python frontend for TensorFlow, Theano, and the Microsoft Cognitive Toolkit v2.x release. Keras implements a high-level neural network API for the frameworks listed. Keras is not included in the containers in nvcr.io because it is evolving so quickly. You can add it to any of the containers if you like, or you can start one of the nvcr.io containers and install Keras during the launch process. This section also provides some scripts for using Keras in a virtual Python environment.

Before jumping into Keras and best practices around how to use it, it is helpful to familiarize yourself with virtualenv and virtualenvwrapper.

When you run Keras, you have to specify the desired framework backend. This can be done using either the $HOME/.keras/keras.json file or by an environment variable KERAS_BACKEND=<backend> where the backend choices are: theano, tensorflow, or cntk. The ability to choose a framework with minimal changes to the Python code makes Keras very popular.
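A minimal sketch of both approaches (the script name is a placeholder, and keras.json may contain additional fields that are omitted here):
# per-run selection via an environment variable
$ KERAS_BACKEND=theano python my_keras_script.py

# persistent selection via the configuration file
$ cat $HOME/.keras/keras.json
{
    "backend": "tensorflow"
}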

There are several ways to configure Keras to work with containerized frameworks.
Important: The most reliable approach is to create a container with Keras or install Keras within a container.
Setting up a container with Keras might be preferable for deployed containerized services.
Important: Another approach that works well in development environments is to set up a virtual Python environment with Keras.
This virtual environment can then be mapped into the container and the Keras code can run against the desired framework backend.

The advantage of decoupling Python environments from the containerized frameworks is that given M containers and N environments instead of having to create M * N containers, one can just create M + N configurations. The configuration then is the launcher or orchestration script that starts the desired container and activates the Keras Python environment within that container. The disadvantage with such an approach is that one cannot guarantee the compatibility of the virtual Python environment and the framework backend without testing. If the environment is incompatible then one would need to re-create the virtual Python environment from within the container to make it compatible.

5.5.1. Adding Keras To Containers

If you choose, you can add Keras to an existing container. Like the frameworks, Keras changes fairly rapidly so you will have to watch for changes in Keras.

There are two good choices for installing Keras into an existing container. Before proceeding with either approach, ensure you are familiar with the Docker section of this document to understand how to build on existing containers.

The first approach is to use the OS version of Python to install Keras using the Python tool pip.
# sudo pip install keras

Ensure you check the version of Keras that has been installed. This may be an older version that better matches the system OS version, but it may not be the version you want or need. If that is the case, the next paragraph describes how to install Keras from source code.

The second approach is to build Keras from source. It is recommended that you download one of the releases rather than download from the master branch. A simple step-by-step process is to:
  1. Download a release in .tar.gz format (you can always use .zip if you want).
  2. Start up a container with either TensorFlow, Microsoft Cognitive Toolkit v2.x, or Theano.
  3. Mount your home directory as a volume in the container (see Using And Mounting File Systems).
  4. Navigate into the container and open a shell prompt.
  5. Uncompress and untar the Keras release (or unzip the .zip file).
  6. Change into the directory and install Keras:
    # cd keras
    # sudo python setup.py install
If you want to use Keras as part of a virtual Python environment, the next section will explain how you can achieve that.

5.5.2. Creating Keras Virtual Python Environment

Before jumping into Keras in a virtual Python environment, it’s always a good idea to review the installation dependencies of Keras. The dependencies are common to data science Python environments: NumPy, SciPy, YAML, and h5py. Keras can also use cuDNN, but this is already included in the framework containers.

You will be presented with several scripts for running Keras in a virtual Python environment. These scripts are included in the document and provide a better user experience than having to do things by hand.

The script in the venvfns.sh section is a master script. It needs to be put in a directory on the system that is accessible by all users. For example, it could be placed in /usr/share/virtualenvwrapper/. An administrator needs to put this script in the desired location since it has to be in a directory that every user can access.

The script in the setup_keras.sh section creates a py-keras virtual Python environment in the ~/.virtualenvs directory (in the user’s home directory). Each user can run the script as:
$./setup_keras.sh
In this script, you launch the nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 container as the local user with your home directory mounted into the container. The salient parts of the script are below:
dname=${USER}_keras
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
Important: When creating the Keras files, ensure you have the correct privileges set when using the -u or --user options. The -d and -t options daemonize the container process. This way the container runs in the background as a daemon service and one can execute code against it.
You can use docker exec to execute a snippet of code, a script, or attach interactively to the container. Below is the portion of the script that sets up a Keras virtual Python environment.
docker exec -it $dname \
  bash -c 'source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
  mkvirtualenv py-keras
  pip install --upgrade pip
  pip install keras --no-deps
  pip install PyYaml
  # pip install -r /pathto/requirements.txt
  pip install numpy
  pip install scipy
  pip install ipython'
If the list of Python packages is extensive, you can write a requirements.txt file listing those packages and install via:
pip install -r /pathto/requirements.txt --no-deps
Note: This particular line is in the previous command, however, it has been commented out because it was not needed.
The --no-deps option specifies that dependencies of packages should not be installed. It is used here because by default installing Keras will also install Theano or TensorFlow.
Important: On a system where you don’t want to install non-optimized frameworks such as Theano and TensorFlow, the --no-deps option prevents this from happening.
Notice the line in the script that begins with bash -c …. This points to the script previously mentioned (venvfns.sh) that needs to be put in a common location on the system. If some time later, more packages are needed, one can relaunch the container and add those new packages as above or interactively. The code snippet below illustrates how to do so interactively.
dname=${USER}_keras
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
 
sleep 2  # wait for above container to come up
 
docker exec -it $dname bash
You can now log into the interactive session where you activated the virtual Python environment and install what is needed. The example below installs h5py which is used by Keras for saving models in HDF5 format.
source ~/.virtualenvs/py-keras/bin/activate
pip install h5py
deactivate
exit

If the installation fails because some underlying library is missing, one can attach to the container as root and install the missing library.

The next example illustrates installing the python-dev package which will install Python.h if it is missing.
$ docker exec -it -u root $dname \
  bash -c 'apt-get update &&  apt-get install -y python-dev # anything else...'
The container can be stopped or removed when you are done using the following command.
$ docker stop $dname && docker rm $dname

5.5.3. Using Keras Virtual Python Environment With Containerized Frameworks

The following examples assume that a py-keras venv (Python virtual environment) has been created per the instructions in the previous section. All of the scripts for this section can be found in the Scripts section.

In the section run_kerastf_mnist.sh, the script demonstrates how the Keras venv is enabled and is then used to run the Keras MNIST code mnist_cnn.py with the default backend TensorFlow. Standard Keras examples can be found here.

Compare the run_kerastf_mnist.sh script to the run_kerasth_mnist.sh (in section run_kerasth_mnist.sh) that uses Theano. There are primarily two differences:
  1. The backend container nvcr.io/nvidia/theano:17.05 is used instead of nvcr.io/nvidia/tensorflow:17.05.
  2. In the code launching section of the script, specify KERAS_BACKEND=theano. You can run these scripts as:
    $./run_kerasth_mnist.sh  # Ctrl^C to stop running
    $./run_kerastf_mnist.sh
    
In section run_kerastf_cifar10.sh, the script has been modified to accept parameters and demonstrates how one would specify an external data directory for the CIFAR-10 data. In section cifar10_cnn_filesystem.py, the script has been modified from the original cifar10_cnn.py. The command line example to run this code on a system is the following:
$./run_kerastf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
The above assumes the storage is mounted on a system at /datasets/cifar.
Important: The key takeaway is that running some code within a container involves setting up a launcher script.
These scripts can be generalized and parameterized for convenience and it is up to the end user or developer to write these scripts for their custom application or their custom workflow.
For example:
  1. The parameters in the example script were joined to a temporary variable via the following:
    function join { local IFS="$1"; shift; echo "$*"; }
    script_args=$(join : "$@")
    
  2. The parameters were passed to the container via the option:
    -e script_args="$script_args"
  3. Within the container, these parameters are split and passed through to the computation code by the line:
    python $cifarcode ${script_args//:/ }
  4. The external system NFS/storage was passed as read-only to the container via the following option to the launcher script:
    -v /datasets/cifar:/datasets/cifar:ro
    and by
    --datadir=/datasets/cifar

The script in the run_kerastf_cifar10.sh section can be improved by parsing parameters to generalize the launcher logic and avoid duplication. There are several ways to parse parameters in bash, such as getopts or a custom parser. One can also write a non-bash launcher using Python, Perl, or something else.

The final script, in the run_keras_script.sh section, implements a high-level parameterized bash launcher. The following examples illustrate how to use it to run the previous MNIST and CIFAR-10 examples.
# running Tensorflow MNIST
./run_keras_script.sh \
  --container=nvcr.io/nvidia/tensorflow:17.05 \
  --script=examples/keras/mnist_cnn.py
 
# running Theano MNIST
./run_keras_script.sh \
  --container=nvcr.io/nvidia/theano:17.05 --backend=theano \
  --script=examples/keras/mnist_cnn.py
 
# running Tensorflow Cifar10
./run_keras_script.sh \
  --container=nvcr.io/nvidia/tensorflow:17.05 --backend=tensorflow \
  --datamnt=/datasets/cifar \
  --script=examples/keras/cifar10_cnn_filesystem.py \
  --epochs=3 --datadir=/datasets/cifar
 
# running Theano Cifar10
./run_keras_script.sh \
  --container=nvcr.io/nvidia/theano:17.05 --backend=theano \
  --datamnt=/datasets/cifar \
  --script=examples/keras/cifar10_cnn_filesystem.py \
  --epochs=3 --datadir=/datasets/cifar
Important: If the code is producing output that needs to be written to a filesystem and persisted after the container stops, that logic needs to be added.
The examples above show containers where their home directory is mounted and is "writeable". This ensures that the code can write the results somewhere within the user’s home path. The filesystem paths need to be mounted into the container and specified or passed to the computational code.
These examples serve to illustrate how one goes about orchestrating computational code via Keras or even non-Keras.
Important: In practice, it is often convenient to launch containers interactively, attach to them interactively, and run code interactively.
During these interactive sessions, it is easier to debug and develop code (and to later automate those steps via helper scripts). An interactive session might look like the following sequence of commands typed manually into the terminal:
# in bash terminal
dname=mykerastf
workdir=$HOME  # example working directory; set this to your preferred path
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -v /datasets/cifar:/datasets/cifar:ro -w $workdir \
  nvcr.io/nvidia/tensorflow:17.05
 
docker exec -it $dname bash
# now interactively in the container.
source ~/.virtualenvs/py-keras/bin/activate
source ~/venvfns.sh
enablevenvglobalsitepackages
./run_kerastf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
# change some parameters or code in cifar10_cnn_filesystem.py and run again
./run_kerastf_cifar10.sh --aug --epochs=2 --datadir=/datasets/cifar
disablevenvglobalsitepackages
exit # exit interactive session in container
 
docker stop $dname && docker rm $dname # stop and remove container

5.5.4. Working With Containerized VNC Desktop Environment

The need for a containerized desktop varies depending on the data center setup. If your system sits behind a login node or a head node of an on-premise cluster, the data center will typically provide a VNC login node or run X Windows on the login node to facilitate running visual tools such as text editors or an IDE (integrated development environment).

For a cloud based system (NGC), there may already be firewalls and security rules available. In this case, you may want to ensure that the proper ports are open for VNC or something similar.

If the system serves as the primary resource for both development and computing, then it is possible to set up a desktop-like environment on it via a containerized desktop. The instructions and Dockerfile for this can be found here. Notice that these instructions are primarily for the DGX-1 but should also work for the DGX Station.

You can download the latest release of the container to the system. The next step is to modify the Dockerfile by changing the FROM field to be:
FROM nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
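
The commands below are a rough sketch of building and running the modified container; the image tag and container name are illustrative, and the port mappings match the VNC and noVNC ports listed at the end of this section:
# Build the modified desktop image (tag is illustrative)
docker build -t dgx-desktop .

# Run it, exposing the VNC (5901) and noVNC (6901) ports
nvidia-docker run -d --name=${USER}_desktop -p 5901:5901 -p 6901:6901 dgx-desktop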

This container is not officially supported by the DGX product team; in other words, it is not available on nvcr.io. It is provided as an example of how to set up a desktop-like environment on a system for convenient development with Eclipse, Sublime Text, Visual Studio Code (which is very like Sublime Text but free), or any other GUI-driven tool.

An example script, build_run_dgxdesk.sh, is available on the GitHub site to build and run a containerized desktop, as shown in the Scripts section. Other systems such as the DGX Station and NGC would follow a similar process.

To connect to the system, you can download a VNC client for your system from RealVNC, or use a web browser.
=> connect via VNC viewer hostip:5901, default password: vncpassword
=> connect via noVNC HTML5 client: http://hostip:6901/?password=vncpassword

5.6. MXNet

MXNet™ is part of the Apache Incubator project. The MXNet library is portable and can scale to multiple GPUs and multiple machines. MXNet is supported by major public cloud providers, including Amazon Web Services (AWS) and Microsoft Azure; Amazon has chosen MXNet as its deep learning framework of choice at AWS. It supports multiple languages (C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl, Wolfram Language).

NVIDIA includes a release of MXNet as well. You can read the release notes here. NVIDIA also has a page in the GPU Ready Apps catalog for MXNet that explains how you can build it outside of the container registry (nvcr.io). It also presents some test results for MXNet.

To get started with MXNet, the NVIDIA Deep Learning Institute (DLI) offers courses that utilize MXNet.

5.7. PyTorch

PyTorch™ is designed to be deeply integrated with Python. It is used naturally as you would use NumPy, SciPy and scikit-learn, or any other Python extension. You can even write the neural network layers in Python using libraries such as Cython and Numba. Acceleration libraries such as NVIDIA cuDNN and NCCL along with Intel MKL are included to maximize performance.

NVIDIA has a release of PyTorch as well. You can read the release notes here. There is also a good blog that discusses recursive neural networks using PyTorch.

5.8. TensorFlow

An efficient way to run TensorFlow on the GPU system involves setting up a launcher script to run the code using a TensorFlow Docker container. For an example of how to run CIFAR-10 on multiple GPUs on a system using cifar10_multi_gpu_train.py, see TensorFlow models.

If you prefer to use a script for running TensorFlow, see run_tf_cifar10.sh. It is a bash script that you can run on a system. It assumes you have pulled the Docker container from the nvcr.io repository to the system. It also assumes you have the CIFAR-10 data stored in /datasets/cifar on the system and are mapping it to /datasets/cifar in the container. You can also pass arguments to the script such as the following:
$ ./run_tf_cifar10.sh --data_dir=/datasets/cifar --num_gpus=8

The details of the run_tf_cifar10.sh script parameterization are explained in the Keras section of this document (see Keras And Containerized Frameworks). You can modify the /datasets/cifar path in the script for the site-specific location of the CIFAR data. If the CIFAR-10 dataset for TensorFlow is not available, then run the example with a writable volume -v /datasets/cifar:/datasets/cifar (without ro) and the data will be downloaded on the first run.

If you want to parallelize the CIFAR-10 training, basic data-parallelization for TensorFlow via Keras can be done as well. Refer to the example cifar10_cnn_mgpu.py on GitHub.

Orchestrating a Python script with Docker containers is described in the run_tf_cifar10.sh section.

5.9. Theano

Theano is an open source project primarily developed by a machine learning group at the Université de Montréal. It is focused on Python and is primarily a Python library or module. It has its own Python frontend, and Keras can also be used as a frontend. Interestingly, Theano combines aspects of a Computer Algebra System (CAS) with aspects of an optimizing compiler. It can generate customized C code for the parts of the problem being solved, which is very useful for repetitive computations. Moreover, it can still provide symbolic features, such as automatic differentiation, for expressions that may be evaluated only once, to improve performance.

NVIDIA includes a release of Theano as well. You can read the release notes here. To get started with Theano, the NVIDIA Deep Learning Institute (DLI) provides online courses that utilize Theano.

5.10. Torch

Torch is an open-source deep learning framework that uses Lua as a scripting language. It can also be used with DIGITS.

NVIDIA includes a release of Torch as well. You can read the release notes here. To get started with Torch, the NVIDIA Deep Learning Institute (DLI) provides online courses that utilize Torch.

If you want to build Torch from scratch or if you are interested in test results with Torch, you can find more information on the GPU Ready App site for Torch.

6. DGX-1 Best Practices

NVIDIA has created the DGX-1 as an appliance to make administration and operation as simple as possible. However, like any computational resource it still requires administration. This section discusses some of the best practices around configuring and administering a single DGX-1 or several DGX-1 appliances.

There is also some discussion about how to plan for external storage, networking, and other configuration aspects for the DGX-1.

6.1. Storage

In order for deep learning to be effective and to take full advantage of the DGX-1, the various aspects of the DGX-1 have to be balanced. This includes storage and IO. This is particularly important for feeding data to the GPUs to keep them busy and dramatically reduce run times for models. This section presents some best practices for storage within, and outside of, the DGX-1. It also discusses storage considerations as the number of DGX-1 units is scaled out.

6.1.1. Internal Storage

The first storage consideration is storage within the DGX-1 itself. For the best possible performance, an NFS read cache has been included in the DGX-1 appliance using the Linux cacheFS capability. It uses four SSDs in a RAID-0 group. The drives are connected to a dedicated hardware RAID controller.

Deep learning I/O patterns typically consist of multiple iterations of reading the training data. The first pass through the data is sometimes referred to as the cold start. Subsequent passes through the data can avoid rereading the data from the filesystem if adequate local caching is provided on the node. If you can estimate the maximum size of your data, you can architect your system to provide enough cache so that the data only needs to be read once during any training job. A set of very fast SSD disks can provide an inexpensive and scalable way of providing adequate caching for your applications.

The purpose of this cache is for storing training and validation data for reading by the frameworks. During the first epoch of training a framework, the training data is read and used to start training the model. The NFS cache is a read cache so that all of the data that is read for the first epoch is cached on the RAID-0 group. Subsequent reads of the data are done from the NFS cache and not the central repository that was used in the first epoch. As a result, the IO is much faster after the first epoch.

The benefit of adequate caching is that your external filesystem does not have to provide maximum performance during a cold start (the first epoch), since this first pass through the data is only a small part of the overall training time. For example, typical training sessions can iterate over the data 100 times. If we assume read access is 5x slower during the first cold-start iteration than during the remaining cached iterations, then the total training run time increases as follows:
  • 5x slower shared storage on the 1st iteration + 99 locally cached iterations = (5 + 99) / 100 of the fully cached time
    • approximately a 4% increase in runtime over 100 iterations

Even if your external file system cannot sustain peak training IO performance, it has only a small impact on overall training time. This should be considered when creating your storage system to allow you to develop the most cost-effective storage systems for your workloads.

By default, the DGX-1 comes with four SSD devices connected to the RAID controller.
CAUTION:
There are additional open drive slots in the DGX-1, but you cannot put additional drives into the system without voiding your warranty.

6.1.2. External Storage

As an organization scales out their GPU enabled data center, there are many shared storage technologies which pair well with GPU applications. Since the performance of a GPU enabled server is so much greater than a traditional CPU server, special care needs to be taken to ensure the performance of your storage system is not a bottleneck to your workflow.

Different data types require different considerations for efficient access from filesystems. For example:
  • Running parallel HPC applications may require the storage technology to support multiple processes accessing the same files simultaneously.
  • To support accelerated analytics, storage technologies often need to support many threads with quick access to small pieces of data.
  • For vision based deep learning, accessing images or video used in classification, object detection or segmentation may require high streaming bandwidth, fast random access, or fast memory mapped (mmap()) performance.
  • For other deep learning techniques, such as recurrent networks, working with text or speech can require any combination of fast bandwidth with random and small files.

HPC workloads typically drive high simultaneous multi-system write performance and benefit greatly from traditional scalable parallel file system solutions. Size HPC storage and network performance to meet the increased dense compute needs of GPU servers. It is not uncommon to see per-node performance increases of between 10x and 40x for a 4-GPU system vs a CPU system for many HPC applications.

Data Analytics workloads, similar to HPC, drive high simultaneous access, but are more read-focused than HPC. Again, it is important to size Data Analytics storage to match the dense compute performance of GPU servers. As you adopt accelerated analytics technologies such as GPU-enabled in-memory databases, make sure that you can populate the database from your data warehousing solution quickly to minimize startup time when you change database schemas. This may require a network with 10 GbE or greater performance. To support clients at this rate, you may have to revisit your data warehouse architecture to identify and eliminate bottlenecks.

Deep learning is a fast evolving computational paradigm, and it is important to know your requirements in the near and long term to properly architect a storage system. The ImageNet database is often used as a reference when benchmarking deep learning frameworks and networks. The resolution of the images in ImageNet is 256x256. It is more common to find images at 1080p or 4k. Images in 1080p resolution are 30 times larger than those in ImageNet. Images in 4k resolution are 4 times larger than that (120x the size of ImageNet images). Uncompressed images are 5-10 times larger than compressed images. If your data cannot be compressed for some reason, for example if you are using a custom image format, the bandwidth requirements increase dramatically.

For AI-Driven Storage, it is suggested that you make use of deep learning framework features that build databases and archives versus accessing small files directly; reading and writing many small files will reduce performance on the network and local file systems. Storing files in formats such as HDF5, LMDB or LevelDB can reduce metadata access to the filesystem helping performance. However, these formats can lead to their own challenges with additional memory overhead or requiring support for fast mmap() performance. All this means that you should plan to be able to read data at 150-200 MB/s per GPU for files at 1080p resolution. Consider more if you are working with 4k or uncompressed files.
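As a rough sizing check using the numbers above, a single DGX-1 with 8 GPUs each reading 1080p data at 150-200 MB/s needs roughly 8 x 150-200 MB/s, or about 1.2-1.6 GB/s, of sustained read bandwidth per node once the dataset no longer fits in the local cache.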

6.1.2.1. NFS Storage

NFS can provide a good starting point for AI workloads on small GPU server configurations with properly sized storage and network bandwidth. NFS-based solutions can scale well for larger deployments, but be aware of possible single-node and aggregate bandwidth requirements and make sure your vendor of choice can meet them. As you scale your data center to need more than 10 GB/s, or your data center grows to hundreds or thousands of nodes, other technologies may be more efficient and scale better.

Generally, it is a good idea to start with NFS using one or more of the 10 Gb/s Ethernet connections on the DGX-1. After this is configured, it is recommended that you run your applications and check if IO performance is a bottleneck. Typically, NFS over 10Gb/s Ethernet provides up to 1.25 GB/s of IO throughput for large block sizes. If, in your testing, you see NFS performance that is significantly lower than this, check the network between the NFS server and the DGX-1 to make sure there are no bottlenecks (for example, a 1 GigE network connection somewhere, a misconfigured NFS server, or a smaller MTU somewhere in the network).

There are a number of online articles, such as this one, that list some suggestions for tuning NFS performance on both the client and the server. For example:
  • Increasing Read, Write buffer sizes
  • TCP optimizations including larger buffer sizes
  • Increasing the MTU size to 9000
  • Sync vs. Async
  • NFS Server options
  • Increasing the number of NFS server daemons
  • Increasing the amount of NFS server memory
Linux is very flexible and by default most distributions are conservative about their choice of IO buffer sizes since the amount of memory on the client system is unknown. A quick example is increasing the size of the read buffers on the DGX-1 (the NFS client). This can be achieved with the following system parameters:
  • net.core.rmem_max=67108864
  • net.core.rmem_default=67108864
  • net.core.optmem_max=67108864

The values after the variable are example values (they are in bytes). You can change these values on the NFS client and the NFS server, and then run experiments to determine if the IO performance improves.

The previous examples are for the kernel read buffer values. You can also do the same thing for the write buffers, where you use wmem instead of rmem.

You can also tune the TCP parameters in the NFS client to make them larger. For example, you could change the net.ipv4.tcp_rmem="4096 87380 33554432" system parameter.

This changes the TCP buffer size, for ipv4, to 4,096 bytes as a minimum, 87,380 bytes as the default, and 33,554,432 bytes as the maximum.
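
As a sketch, the parameters discussed above can be applied at runtime with sysctl and persisted in /etc/sysctl.conf; the values are the example values from this section and should be validated with your own IO tests:
# Apply example buffer sizes on the NFS client (values are starting points to test)
sudo sysctl -w net.core.rmem_max=67108864
sudo sysctl -w net.core.rmem_default=67108864
sudo sysctl -w net.core.wmem_max=67108864
sudo sysctl -w net.core.wmem_default=67108864
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sudo sysctl -w net.ipv4.tcp_wmem="4096 87380 33554432"

# Persist the settings across reboots
cat <<'EOF' | sudo tee -a /etc/sysctl.conf
net.core.rmem_max=67108864
net.core.rmem_default=67108864
net.core.wmem_max=67108864
net.core.wmem_default=67108864
net.ipv4.tcp_rmem=4096 87380 33554432
net.ipv4.tcp_wmem=4096 87380 33554432
EOF
sudo sysctl -p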

If you can control the NFS server, one suggestion is to increase the number of NFS daemons on the server. By default, NFS only starts with eight nfsd processes (eight threads), which, given that CPUs today have very large core counts, is not really enough.

You can find the number of NFS daemons in two ways. The first is to look at the process table and count the number of NFS processes via the $ ps -aux | grep nfs command.

The second way is to look at the NFS config file (for example, /etc/sysconfig/nfs) for an entry that says RPCNFSDCOUNT. This tells you the number of NFS daemons for the server.

If the NFS server has a large number of cores and a fair amount of memory, you can increase RPCNFSDCOUNT. There are cases where good performance has been achieved using 256 on an NFS server with 16 cores and 128GB of memory.

You should also increase RPCNFSDCOUNT when you have a large number of NFS clients performing I/O at the same time. For this situation, it is recommended that you should also increase the amount of memory on the NFS server to a larger number, such as 128 or 256GB. Don't forget that if you change the value of RPCNFSDCOUNT, you will have to restart NFS for the change to take effect.
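
The following sketch shows one way to check and raise the thread count; the configuration file location varies by distribution (for example, /etc/default/nfs-kernel-server on Ubuntu rather than /etc/sysconfig/nfs), so verify the path and service name on your NFS server:
# On the NFS server: count the running nfsd threads
ps aux | grep -c '[n]fsd'

# Check the configured thread count (path varies by distribution)
grep RPCNFSDCOUNT /etc/sysconfig/nfs

# After raising RPCNFSDCOUNT, restart NFS for the change to take effect, e.g.
sudo systemctl restart nfs-server            # RHEL/CentOS style
sudo systemctl restart nfs-kernel-server     # Ubuntu/Debian style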

One way to determine whether more NFS threads helps performance is to check the data in the /proc/net/rpc/nfs entry for the load on the NFS daemons. The output line that starts with th lists the number of threads, and the last 10 numbers are a histogram of the number of seconds the first 10% of threads were busy, the second 10%, and so on.

Ideally, you want the last two numbers to be zero or close to zero, indicating that the threads are busy and you are not "wasting" any threads. If the last two numbers are fairly high, you should add NFS daemons, because the NFS server has become the bottleneck. If the last two, three, or four numbers are zero, then some threads are probably not being used.
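
For example, the thread histogram can be inspected directly on the NFS server with a single command:
# Show the nfsd thread-usage line on the NFS server
grep ^th /proc/net/rpc/nfs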

One other option, while a little more complex, can prove to be useful if the IO pattern becomes more write intensive. If you are not getting the IO performance you need, change the mount behavior on the NFS clients from “sync” to “async”.
CAUTION:
By default, NFS file systems are mounted as “sync” which means the NFS client is told the data is on the NFS server after it has actually been written to the storage indicating the data is safe. Some systems will respond that the data is safe if it has made it to the write buffer on the NFS server and not the actual storage.

Switching from “sync” to “async” means that the NFS server responds to the NFS client that the data has been received when the data is in the NFS buffers on the server (in other words, in memory). The data hasn’t actually been written to the storage yet, it’s still in memory. Typically, writing to the storage is much slower than writing to memory, so write performance with “async” is much faster than with “sync”. However, if, for some reason, the NFS server goes down before the data in memory is written to the storage, then the data is lost.

If you try using “async” on the NFS client (in other words, the DGX-1), ensure that the data on the NFS server is replicated somewhere else so that if the server goes down, there is always a copy of the original data. The reason is if the NFS clients are using “async” and the NFS server goes down, data that is in memory on the NFS server will be lost and cannot be recovered.

NFS “async” mode is very useful for write IO, both streaming (sequential) and random IO. It is also very useful for “scratch” file systems where data is stored temporarily (in other words, not permanent storage or storage that is not replicated or backed up).
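
As a sketch (the server name, export path, and mount point are placeholders), an NFS scratch filesystem could be mounted with "async" on the DGX-1 as follows; only do this for data that is replicated or expendable:
sudo mkdir -p /mnt/scratch
sudo mount -t nfs -o rw,async nfs-server:/export/scratch /mnt/scratch

# Or as an /etc/fstab entry:
# nfs-server:/export/scratch  /mnt/scratch  nfs  rw,async  0  0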

If you find that the IO performance is not what you expected and your applications are spending a great deal of time waiting for data, then you can also connect NFS to the DGX-1 over InfiniBand using IPoIB (IP over IB). This is part of the DGX-1 software stack and can be easily configured. The main point is that the NFS server should be InfiniBand attached as well as the NFS clients. This can greatly improve IO performance.

6.1.2.2. Parallel File Systems

Other network file systems that require the installation of additional software or modification of the kernel itself are not supported by NVIDIA. This includes file systems such as Lustre, BeeGFS, General Parallel File System (formerly known as GPFS), and Gluster among others. These file systems can improve the aggregate IO performance as well as the reliability (fault tolerance).
CAUTION:
If you require technical support from NVIDIA for your DGX-1, it is possible, although unlikely, that NVIDIA would ask you to uninstall the parallel file system and revert the kernel back to a baseline kernel, to help debug the problem.

6.1.2.3. Scaling Out Recommendations

Based on the general IO patterns of deep learning frameworks (see External Storage), below are suggestions for storage needs based on the use case. These are suggestions only and are to be viewed as general guidelines.
Table 3. Scaling out suggestions and guidelines
Use Case | Adequate Read Cache? | Network Type | Recommended Network File System Options
Data Analytics | NA | 10 GbE | Object storage, NFS, or other system with good multi-threaded read and small-file performance
HPC | NA | 10/40/100 GbE, InfiniBand | NFS or HPC-targeted filesystem with support for large numbers of clients and fast single-node performance
DL, 256x256 images | Yes | 10 GbE | NFS or storage with good small-file support
DL, 1080p images | Yes | 10/40 GbE, InfiniBand | High-end NFS, HPC filesystem, or storage with fast streaming performance
DL, 4k images | Yes | 40 GbE, InfiniBand | HPC filesystem, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node
DL, uncompressed images | Yes | InfiniBand, 40/100 GbE | HPC filesystem, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node
DL, datasets that are not cached | No | InfiniBand, 10/40/100 GbE | Same as above; aggregate storage performance must scale to serve all applications simultaneously

As always, it is best to understand your own applications’ requirements to architect the optimal storage system.

Lastly, this discussion has focused only on performance needs. Reliability, resiliency and manageability are as important as the performance characteristics. When choosing between different solutions that meet your performance needs, make sure that you have considered all aspects of running a storage system and the needs of your organization to select the solution that will provide the maximum overall value.

6.2. Authenticating Users

To make the DGX useful, users need to be added to the system in some fashion so they can be authenticated to use the system. Generally, this is referred to as user authentication. There are several different ways this can be accomplished, however, each method has its own pros and cons.

6.2.1. Local

The first way is to create users directly on the DGX-1 server using the useradd command. Let's assume you want to add a user named dgxuser. You would first add the user with the following command:
$ sudo useradd -m -s /bin/bash dgxuser
Where -s refers to the default shell for the user and -m creates the user’s home directory. After creating the user you need to add them to the docker group on the DGX.
$ sudo usermod -aG docker dgxuser

This adds the user dgxuser to the group docker which is required for running Docker containers on the DGX.

Using local authentication on the DGX-1 is simple but not without its issues. First, there have been occasions when an OS upgrade on the DGX-1 requires reformatting all the drives in the appliance. If this happens, you first must make sure all user data is copied somewhere off the DGX-1 before the upgrade. Second, you will have to recreate the users, add them to the docker group, and copy their home data back to the DGX-1. This adds work and time to upgrading the system.
Important: Moreover, there is no RAID-1 on the OS drive, so if it fails, you will lose all the users and everything in the home directories. It is highly recommended that you back up the pertinent files on the DGX-1 as well as /home for the users.

6.2.2. NIS or NIS+

Another authentication option is to use NIS or NIS+. In this case, the DGX-1 would be a client in the NIS/NIS+ configuration. As with local authentication, discussed previously, there is the possibility that the OS drive in the DGX-1 could be overwritten during an upgrade (not all upgrades reformat the drives, but it’s possible). This means that the administrator may have to reinstall the NIS configuration on the DGX-1.

Also, remember that the DGX-1 has a single OS drive. If this drive fails, the administrator will have to re-configure the NIS/NIS+ configuration, therefore, backups are encouraged.
Note: In the unlikely event that technical support for the DGX-1 is needed, the NVIDIA engineers may require the administrator to disconnect the system from the NIS/NIS+ server.

6.2.3. LDAP

A third option for authentication is LDAP (Lightweight Directory Access Protocol). It has become very popular in the clustering world, particularly for Linux. You can configure LDAP on the DGX-1 for user information and authentication from an LDAP server. However, as with NIS, there are possible repercussions.
CAUTION:
  • The first is that the OS drive is a single drive. If the drive fails, you will have to rebuild the LDAP configuration (backups are highly recommended).
  • The second is that, as previously mentioned, in the unlikely event of needing tech support, you may be asked to disconnect the DGX-1 from the LDAP server so that the system can be triaged.

6.2.4. Active Directory

One other option for user authentication is connecting the DGX-1 to an Active Directory (AD) server. This may require the system administrator to install some extra tools on the DGX-1. The same two cautions mentioned previously apply here: the single OS drive may be reformatted during an upgrade, or it may fail (again, backups are highly recommended). It also means that in the unlikely case of needing to involve NVIDIA technical support, you may be asked to take the system off the AD network and remove any added software (this is unlikely but possible).

6.3. Managing Resources

One of the common questions from DGX-1 customers is how can they effectively share the DGX-1 between users without any inadvertent problems or data exchange. The generic phrase for this is resource management, the tools are called resource managers. They can also be called schedulers or job schedulers. These terms are oftentimes used interchangeably.

You can view everything on the DGX as a resource. This includes memory, CPUs, GPUs, and even storage. Users submit a request to the resource manager with their requirements and the resource manager assigns the resources to the user if they are available and not being used. Otherwise, the resource manager puts the request in a queue to wait for the resources to become available. When the resources are available, the resource manager assigns the resources to the user request.

Resource management, which allows users to effectively share a centralized resource (in this case, the DGX-1 appliance), has been around a long time. There are many open-source solutions, mostly from the HPC world, such as PBS Pro, Torque, SLURM, Openlava, SGE, HTCondor, and Mesos. There are also commercial resource management tools such as UGE and IBM Spectrum LSF.

For more information about getting started, see Job scheduler.

If you haven’t used job scheduling before you should perform some simple experiments first to understand how it works. For example, take a single server and install the resource manager. Then try running some simple jobs using the cores on the server.

6.3.1. SLURM Example

As an example, consider SLURM installed and configured on a DGX-1. The first step is to plan how you want to use the DGX-1. The first, and by far the easiest, configuration is to assume that a user gets exclusive access to the entire node. In this case, the user gets the entire DGX-1: access to all 8 GPUs and all CPU cores. No other users can use those resources while the first user is using them.

The second way is to make the GPUs a consumable resource. The user then asks for the number of GPUs they need, ranging from 1 to 8.

There are two public git repositories containing information on SLURM and GPUs that can help you get started with scheduling jobs.
Note: You may have to configure SLURM to match your specifications.

At a high level, there are two basic options for configuring SLURM with GPUs and DGX-1 systems. The first is to use what is called exclusive mode access, and the second allows each GPU to be scheduled independently of the others.

6.3.1.1. Simple GPU Scheduling With Exclusive Node Access

If you're not interested in allowing multiple jobs per compute node, you may not need to make SLURM aware of the GPUs in the system, and the configuration can be greatly simplified.

One way of scheduling GPUs without making use of GRES (Generic Resource Scheduling) is to create partitions or queues for logical groups of GPUs. For example, grouping nodes with P100 GPUs into a P100 partition would result in something like the following:
$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
p100     up   infinite         4/9/3/16  node[212-213,215-218,220-229]
The corresponding partition configuration via the SLURM configuration file, slurm.conf, would be something like the following:
NodeName=node[212-213,215-218,220-229]
PartitionName=p100 Default=NO DefaultTime=01:00:00 State=UP Nodes=node[212-213,215-218,220-229]

If a user requests a node from the p100 partition, then they would have access to all of the resources in that node, and other users would not. This is what is called exclusive access.

This approach can be advantageous if you are concerned that sharing resources might result in performance issues on the node or if you are concerned about overloading the node resources. For example, in the case of a DGX-1, if you think multiple users might overwhelm the 8TB NFS read cache, then you might want to consider using exclusive mode. Or if you are concerned that users may use all of the physical memory, causing page swapping with a corresponding reduction in performance, then exclusive mode might be useful.
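
With this kind of partition configuration, requesting a whole node is straightforward; the following commands use standard SLURM flags, and the batch job script name is a placeholder:
# Request an entire node from the p100 partition and run the CIFAR-10 launcher on it
srun -p p100 -N 1 --exclusive ./run_tf_cifar10.sh --data_dir=/datasets/cifar --num_gpus=8

# Or submit the same work as a batch job
sbatch -p p100 -N 1 --exclusive my_training_job.sh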

6.3.1.2. Scheduling Resources At The Per GPU Level

A second option for using SLURM is to treat the GPUs like a consumable resource and allow users to request them in integer units (i.e. 1, 2, 3, etc.). SLURM can be made aware of GPUs as a consumable resource to allow jobs to request any number of GPUs. This feature requires job accounting to be enabled first; for more information, see Accounting and Resource Limits. A very quick overview is below.

The SLURM configuration file, slurm.conf, needs parameters set to enable cgroups for resource management and GPU resource scheduling. An example is the following:
# General
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Scheduling
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Logging and Accounting
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres                # show detailed information in Slurm logs about GPU binding and affinity
JobAcctGatherType=jobacct_gather/cgroup
The partition information in slurm.conf defines the available GPUs for each resource. Here is an example:
# Partitions
GresTypes=gpu
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 MaxNodes=2 State=UP DefMemPerCPU=3000
The way that resource management is enforced is through cgroups. The cgroups configuration requires a separate configuration file, cgroup.conf, such as the following:
CgroupAutomount=yes 
CgroupReleaseAgentDir="/etc/slurm/cgroup" 

ConstrainCores=yes 
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
Scheduling GPU resources requires a configuration file to define the available GPUs and their CPU affinity. An example configuration file, gres.conf, is below:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
Running a job that utilizes GPU resources requires using the --gres flag with the srun command. For example, to run a job requiring a single GPU, the following srun command can be used:
$ srun --gres=gpu:1 nvidia-smi
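
A batch submission works the same way. The following is a minimal sketch of a job script requesting two GPUs against the example compute partition above; the job name, script contents, and file name are placeholders:
#!/bin/bash
# file: gpu_job.sbatch -- illustrative batch script requesting 2 GPUs
#SBATCH --job-name=cifar10
#SBATCH --partition=compute
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=10
#SBATCH --time=04:00:00

srun nvidia-smi
# srun python cifar10_cnn_filesystem.py --epochs=3 --datadir=/datasets/cifar

Submit it with: sbatch gpu_job.sbatch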

You also may want to restrict memory usage on shared nodes so that a user doesn’t cause swapping with other user or system processes. A convenient way to do this is with memory cgroups.

Using memory cgroups to restrict jobs to their allocated memory resources requires setting kernel parameters. On Ubuntu systems, this is configurable via the file /etc/default/grub.
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
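
After editing GRUB_CMDLINE_LINUX in /etc/default/grub as shown above, regenerate the GRUB configuration and reboot for the kernel parameters to take effect:
sudo update-grub
sudo reboot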

6.4. Monitoring

Being able to monitor your systems is the first step in being able to manage them. NVIDIA provides some very useful command line tools that can be used specifically for monitoring the GPUs.

NVIDIA DCGM (Data Center GPU Manager) simplifies GPU administration in the data center. It improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. It can perform the following tasks with very low overhead on the appliance.
  • Active health monitoring
  • Diagnostics
  • System validation
  • Policies
  • Power and clock management
  • Group configuration and accounting

For more information about DCGM, there is a very good introductory blog post. The DCGM toolkit comes with a user guide that explains how to use the command line tool dcgmi as well as an API guide. In addition to the command line tool, DCGM also comes with headers and libraries for writing your own tools in Python or C.

Rather than treat each GPU as a separate resource, DCGM allows you to group them and then apply policies or tuning options to the group. This also includes being able to run diagnostics on the group.

There are several best practices for using DCGM with the DGX-1 appliance. The first is that the command line tool can run diagnostics on the GPUs. You could create a simple cron job on the DGX-1 to check the GPUs and store the results either into a simple flat file or into a simple database.

There are three levels of diagnostics that can be run, starting with level 1, which runs in just a few seconds. The final level, level 3, takes about 4 minutes to run. An example of the output from running a level 3 diagnostic is below.
Figure 18. Levels of diagnostics
It is fairly easy to parse this output looking for Error in the output. You can easily send an email or raise some other alert if an Error is discovered.
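The following is a minimal sketch of such a cron-driven check; the log path and alert address are placeholders, and it assumes dcgmi and a command-line mail client are installed on the system:
#!/bin/bash
# file: gpu_health_check.sh -- illustrative nightly GPU health check run from cron
OUT=$(mktemp)
dcgmi diag -r 1 > "$OUT" 2>&1

# Keep a history of diagnostic results
cat "$OUT" >> /var/log/dcgm_diag.log

# Raise an alert if the diagnostic reported an error
if grep -qi "error" "$OUT"; then
  mail -s "GPU diagnostic error on $(hostname)" admin@example.com < "$OUT"
fi
rm -f "$OUT"

A crontab entry such as 0 2 * * * /usr/local/bin/gpu_health_check.sh would run this check nightly.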
A second best practice for utilizing DCGM is if you have a resource manager (in other words, a job scheduler) installed. Before the user’s job is run, the resource manager can usually perform what is termed a prologue. That is, any system calls before the user’s job is executed. This is a good place to run a quick diagnostic and also use DCGM to start gathering statistics on the job. Below is an example of statistics gathering:
Figure 19. Statistics gathering

When the user’s job is complete, the resource manager can run something called an epilogue. This is a place where the system can run some system calls for doing such things as cleaning up the environment or summarizing the results of the run including the GPU stats as from the above command. Consult the user’s guide to learn more about stats with DCGM.
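
As a sketch of the prologue/epilogue flow described above, assuming a SLURM-style scheduler (the GPU group ID, log path, and exact dcgmi stats flags should be verified against the DCGM user guide for your version):
# --- prologue: before the user's job starts ---
dcgmi diag -r 1                          # quick health check
dcgmi stats -g 0 --enable                # enable statistics recording for GPU group 0
dcgmi stats -g 0 -s "$SLURM_JOB_ID"      # start per-job statistics collection

# --- epilogue: after the user's job completes ---
dcgmi stats -x "$SLURM_JOB_ID"                                 # stop collection
dcgmi stats -j "$SLURM_JOB_ID" >> /var/log/dcgm_job_stats.log  # record the job summary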

If you create a set of prologue and epilogue scripts that run diagnostics you might want to consider storing the results in a flat file or a simple database. This allows you to keep a history of the diagnostics of the GPUs so you can pinpoint any issues (if there are any).

A third way to effectively use DCGM is to combine it with a parallel shell tool such as pdsh. With a parallel shell you can run the same command across all of the nodes in a cluster or a specific subset of nodes. You can use it to run dcgmi to run diagnostics across several DGX-1 appliances or a combination of DGX-1 appliances and non-GPU enabled systems. You can easily capture this output and store it in a flat file or a database. Then you can parse the output and create warnings or emails based on the output.
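
For example, a level 1 diagnostic could be run across several appliances at once (the hostnames and log path are placeholders):
pdsh -w dgx[01-04] "dcgmi diag -r 1" | tee /var/log/dcgm_diag_cluster.log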

Having all of this diagnostic output is also an excellent source of information for creating reports regarding topics such as utilization.

6.5. Networking

Networking DGX-1 appliances is an important topic because of the need to provide data to the GPUs for processing. GPUs are remarkably faster than CPUs for many tasks, particularly deep learning. Therefore, the network principles used for connecting CPU servers may not be sufficient for DGX-1 appliances. This is particularly important as the number of DGX-1 appliances grows over time.

To understand best practices for networking the DGX-1 and for planning for future growth, it is best to start with a brief review of the DGX-1 appliance itself. Recall that the DGX-1 comes with four EDR InfiniBand cards (100 Gb/s each) and two 10Gb/s Ethernet cards (copper). These networking interfaces can be used for connecting the DGX-1 to the network for both communications and storage.
Figure 20. Networking interfaces

Notice that each pair of GPUs is connected to a single PCIe switch that is on the system board. The switch also connects to an InfiniBand (IB) network card. To reduce latency and improve throughput, network traffic from these two GPUs should go to the associated IB card. This is why there are four IB cards in the DGX-1 appliance.

6.5.1. InfiniBand Networking

If you want to use the InfiniBand (IB) network to connect DGX appliances, theoretically, you only have to use one of the IB cards. However, this will push data traffic over the QPI link between the CPUs, which is a very slow link for GPU traffic (i.e. it becomes a bottleneck). A better solution would be to use two IB cards, one connected to each CPU. This could be IB0 and IB2, or IB1 and IB3, or IB0 and IB3, or IB1 and IB2. This would greatly reduce the traffic that has to traverse the QPI link. The best performance is always going to be using all four of the IB links to an IB switch.

The best approach is to connect all four IB cards to an IB fabric. This results in the best performance (full bisectional bandwidth and lowest latency) if you are using multiple DGX appliances for training.

Typically, the smallest IB switch comes with 36 ports. This means a single IB switch could accommodate nine (9) DGX-1 appliances using all four IB cards. This allows 400 Gb/s of bandwidth from each DGX-1 to the switch.

If your applications do not need the bandwidth between DGX-1 appliances, you can use two IB connections per DGX-1 as mentioned previously. This allows you to connect up to 18 DGX-1 appliances to a single 36-port IB switch.
Note: It is not recommended to use only a single IB card, but if for some reason that is the configuration, then you can connect up to 36 DGX-1 appliances to a single switch.

For larger numbers of DGX-1 appliances, you will likely have to use two levels of switching. The classic HPC configuration is to use 36-port IB switches for the first level (sometimes called leaf switches) and connect them to a single large core switch, which is sometimes called a director-class switch. The largest director-class InfiniBand switch has 648 ports. You can use more than one core switch, but the configuration will get rather complex. If this is something you are considering, please contact your NVIDIA sales team for a discussion.

For two tiers of switching, if all four IB cards per DGX-1 appliance are used to connect to a 36-port switch, and there is no over-subscription, the largest number of DGX-1 appliances per leaf switch is 4. This is 4 ports from each DGX-1 into the switch for a total of 16. Then, there are 16 uplinks from the leaf switch to the core switch (the director-class switch). A total of 40 36-port leaf switches can be connected to the 648-port core switch (648/16). This results in 160 DGX-1 appliances being connected with full bi-sectional bandwidth.

You can also use what is termed over-subscription in designing the IB network. Over-subscription means that the bandwidth from an uplink is less than the bandwidth coming into the unit (in other words, poorer bandwidth performance). If we use 2:1 over-subscription from the DGX-1 appliances to the first level of switches (36-port leaf switches), then each DGX-1 appliance is only using two IB cards to connect to the switches. This results in less bandwidth than if we used all four cards and also higher latency.

If we keep the network bandwidth from the leaf switches to the core director switch at 1:1 (in other words, no over-subscription, full bi-sectional bandwidth), then we can put nine (9) DGX-1 appliances into a single leaf switch (a total of 18 ports into the leaf switch from the DGX appliances and 18 uplink ports to the core switch). The result is that a total of 36 leaf switches can be connected to the core switch. This allows a grand total of 324 DGX-1 appliances to be connected together.

You can tailor the IB network even further by using over-subscription from the leaf switches to the core switch. This can be done using four IB connections to a leaf switch from each DGX appliance and then doing 2:1 over-subscription to the core switch or even using two IB connections to the leaf switches and then 2:1 over-subscription to the core switch. These designs are left up to the user to determine but if this is something you want to consider, please contact your NVIDIA sales team for a discussion.

Another important aspect of InfiniBand networking is the Subnet Manager (SM). The SM simply manages the IB network. There is one SM that manages the IB fabric at any one time, but you can have other SMs running and ready to take over if the first SM crashes. Choosing how many SMs to run and where to run them can have a major impact on the design of the cluster.

The first decision to make is where you want to run the SMs. They can be run on the IB switches if you desire. This is called a hardware SM because it runs on the switch hardware. The advantage of this is that you do not need any additional servers to run the SM. Running the SM on a node is called a software SM. A disadvantage of running a hardware SM is that if the IB traffic is large, the SM could have a difficult time keeping up. For lots of IB traffic and for larger networks, it is a best practice to use a software SM on a dedicated server.

The second decision to make is how many SMs you want to run. At a minimum, you will have to run one SM. The least expensive solution is to run a single hardware SM. This will work fine for small clusters of DGX-1 appliances (perhaps 2-4). As the number of units grows, you will want to consider running two SMs at the same time to get HA (High Availability) capability. The reason you want HA is that more users are on the cluster, and having it go down has a larger impact than it would on just a small number of appliances.

As the number of appliances grows, consider running the SMs on dedicated servers (software SM). You will also want to run at least two SMs for the cluster. Ideally, this means two dedicated servers for the SMs, but there may be a better solution that solves some other problems as well: a master node.

6.5.2. Ethernet Networking

Each DGX-1 system comes with two 10Gb/s NICs. These can be used to connect the systems to the local network for a variety of functions such as logins and storage traffic. As a starting point, it is recommended to push NFS traffic over these NICs to the DGX-1. You should monitor the impact of IO on the performance of your models in this configuration.

If you need to go to more than one level of Ethernet switching to connect all of the DGX-1 units and the storage, be careful of how you configure the network. More than likely, you will have to enable the spanning tree protocol to prevent loops in the network. The spanning tree protocol can impact network performance, therefore, you could see a decrease in application performance.

The InfiniBand NICs that come with the DGX-1 can also be used as Ethernet NICs running TCP. The ports on the cards are QSFP28 so you can plug them into a compatible Ethernet network or a compatible InfiniBand network. You will have to add some software to the appliance and change the networking but you can use the NICs as 100GigE Ethernet cards.

For more information, see Switch Infiniband and Ethernet in DGX-1.

6.5.3. Bonded NICs

The DGX-1 provides two 10GbE ports. Out of the factory these two ports are not bonded, but they can be bonded if desired. In particular, VLAN-tagged, bonded NICs across the two 10 GbE ports can be configured.

Before bonding the NICs together, ensure you are familiar with the following:
  • Ensure your network team is involved because you will need to choose a bonding mode for the NICs.
  • Ensure you have a working network connection to pull down the VLAN packages. To do so, first set up a basic, single-NIC network (no VLAN/bonding) connection and download the appropriate packages. Then, reconfigure the switch for LACP/VLANs.
Tip: Since the networking goes up and down throughout this process, it's easier to work from a remote console.
The process below walks through the steps of an example for bonding the two NICs together.
  1. Edit the /etc/network/interfaces file to setup an interface on a standard network so that we can access required packages.
    auto em1
    iface em1 inet static
        address 10.253.0.50
        netmask 255.255.255.0
        network 10.253.0.0
        gateway 10.253.0.1
        dns-nameservers 8.8.8.8
  2. Bring up the updated interface.
    sudo ifdown em1 && sudo ifup em1
  3. Pull down the required bonding and VLAN packages.
    sudo apt-get install vlan
    sudo apt-get install ifenslave
  4. Shut down the networking.
    sudo stop networking
  5. Add the following lines to /etc/modules to load appropriate drivers.
    sudo echo "8021q" >> /etc/modules
    sudo echo "bonding" >> /etc/modules
  6. Load the drivers.
    sudo modprobe 8021q
    sudo modprobe bonding
  7. Reconfigure your /etc/network/interfaces file. There are some configuration parameters that will be customer network dependent and you will want to work with one of your network engineers.
    The following example creates a bonded network over em1/em2 with IP 172.16.1.11 and VLAN ID 430. You specify the VLAN ID in the NIC name (bond0.###). Also notice that this example uses bond-mode 4 (802.3ad/LACP); which mode you use depends on your situation and your switch configuration.
    auto lo
    iface lo inet loopback
    
    
    # The following 3 sections create the bond (bond0) and associated network ports (em1, em2)
    auto bond0
    iface bond0 inet manual
    bond-mode 4
    bond-miimon 100
    bond-slaves em1 em2
     
    auto em1
    iface em1 inet manual
    bond-master bond0
    bond-primary em1
     
    auto em2
    iface em2 inet manual
    bond-master bond0
    
    
    # This section creates a VLAN on top of the bond.  The naming format is device.vlan_id
    auto bond0.430
    iface bond0.430 inet static
    address 172.16.1.11
    netmask 255.255.255.0
    gateway 172.16.1.254
    dns-nameservers 172.16.1.254
    dns-search company.net
    vlan-raw-device bond0
  8. Restart the networking.
    sudo start networking
  9. Bring up the bonded interfaces.
    ifup bond0
  10. Engage your network engineers to re-configure LACP and VLANs on switch.
  11. Test the configuration.

6.6. SSH Tunneling

Some environments are not configured for, or limit access to (by firewall or otherwise), compute nodes within an intranet. When running a container with a service or application exposed on a port, such as DIGITS, remote access to that port on the DGX-1 must be enabled. The following steps use PuTTY to create an SSH tunnel from a remote system into the DGX-1. If you are using a command-line SSH utility, you can set up tunneling via the -L option (see the example after the table below).
Note: A PuTTY SSH tunnel session must be up, logged in, and running for the tunnel to function. SSH tunnels are commonly used for the following applications (with listed port numbers).
Table 4.
Application | Port | Notes
DIGITS | 5000 | If there are multiple users, each selects their own port
VNC Viewer | 5901, 6901 | 5901 for the VNC app, 6901 for the web app
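If you are using a command-line SSH client instead of PuTTY, the same DIGITS tunnel can be created with the -L option; the user and host names below are placeholders:
ssh -L 5000:localhost:5000 user@dgx-hostname
# then browse to http://localhost:5000 on the local machine to reach DIGITS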
To create an SSH Tunnel session with PuTTY, perform the following steps:
  1. Run the PuTTY application.
  2. In the Host Name field, enter the host name you want to connect to.
  3. In the Saved Sessions section, enter a name to save the session under and click Save.
  4. Click Category > Connection, click + next to SSH to expand the section.
  5. Click Tunnels for Tunnel configuration.
  6. Add the DIGITS port for forwarding.
    1. In the Source Port section, enter 5000, which is the port you need to forward for DIGITS.
  7. In the Destination section, enter localhost:5000 for the local port that you will connect to.
  8. Click Add to save the added Tunnel.
  9. In the Category section, click Session.
  10. In the Saved Sessions section, click the name you previously created, then click Save to save the added Tunnels.
To use PuTTY with tunnels, perform the following steps:
  1. Run the PuTTY application.
  2. In the Saved Sessions section, select the Save Session that you created.
  3. Click Load.
  4. Click Open to start session and login. The SSH tunnel is created and you can connect to a remote system via tunnel. As an example, for DIGITS, you can start a web browser and connect to http://localhost:5000.

6.7. Master Node

A master node, also sometimes called a head node, is a very useful server within a cluster. Typically, it runs the cluster management software, the resource manager, and any monitoring tools that are used. For smaller clusters, it is also used as a login node for users to create and submit jobs.

For clusters of any size that include the DGX-1, a master node can be very helpful. It allows the DGX-1 to focus solely on computing rather than any interactive logins or post-processing that users may be doing. As the number of nodes in a cluster increases, it is recommended to use a master node.

It is recommended to size the master node for things such as:
  • Interactive user logins
  • Resource management (running a job scheduler)
  • Graphical pre-processing and post-processing
    • Consider a GPU in the master node for visualization
  • Cluster monitoring
  • Cluster management

Since the master node becomes an important part of the operation of the cluster, consider using RAID-1 for the OS drive in the master node as well as redundant power supplies. This can help improve the uptime of the master node.

For smaller clusters, you can also use the master node as an NFS server by adding storage and more memory to the master node and NFS export the storage to the cluster clients. For larger clusters, it is recommended to have dedicated storage, either NFS or a parallel file system.

For InfiniBand networks, the master node can also be used for running the software SM. If you want some HA for the SM, run the primary SM on the master node and use an SM on the IB switch as a secondary SM (hardware SM).

As the cluster grows, it is recommended to consider splitting the login and data processing functions from the master node to one or more dedicated login nodes. This is also true as the number of users grows. You can run the primary SM on the master node and other SMs on the login nodes. You could even use the hardware SMs on the switches as backups.

7. Scripts

7.1. DIGITS

7.1.1. run_digits.sh

#!/bin/bash
# file: run_digits.sh
 
mkdir -p $HOME/digits_workdir/jobs
 
cat <<EOF > $HOME/digits_workdir/digits_config_env.sh
# DIGITS Configuration File
DIGITS_JOB_DIR=$HOME/digits_workdir/jobs
DIGITS_LOGFILE_FILENAME=$HOME/digits_workdir/digits.log
EOF
 
nvidia-docker run --rm -ti --name=${USER}_digits -p 5000:5000 \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  --env-file=${HOME}/digits_workdir/digits_config_env.sh \
  -v /datasets:/digits_data:ro \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/digits:17.05

7.1.2. digits_config_env.sh

# DIGITS Configuration File
DIGITS_JOB_DIR=$HOME/digits_workdir/jobs
DIGITS_LOGFILE_FILENAME=$HOME/digits_workdir/digits.log

7.2. NVCaffe

7.2.1. run_caffe_mnist.sh

#!/bin/bash
# file: run_caffe_mnist.sh
 
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
function join { local IFS="$1"; shift; echo "$*"; }
 
# arguments to passthrough to caffe such as "-gpu all" or "-gpu 0,1"
script_args="$(join : $@)"
 
DATA=/datasets/caffe_mnist
CAFFEWORKDIR=$HOME/caffe_workdir
 
mkdir -p $DATA
mkdir -p $CAFFEWORKDIR/mnist
 
# Backend storage for Caffe data.
BACKEND="lmdb"
 
dname=${USER}_caffe
 
# Orchestrate Docker container with user's privileges
nvidia-docker run -d -t --name=$dname \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -e DATA=$DATA -v $DATA:$DATA \
  -e BACKEND=$BACKEND -e script_args="$script_args" \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -w $CAFFEWORKDIR nvcr.io/nvidia/caffe:17.05
 
sleep 1 # wait for container to come up
 
# download and convert data into lmdb format.
docker exec -it $dname bash -c '
  pushd $DATA
 
  for fname in train-images-idx3-ubyte train-labels-idx1-ubyte \
  	t10k-images-idx3-ubyte t10k-labels-idx1-ubyte ; do
	if [ ! -e ${DATA}/$fname ]; then
    	wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
    	gunzip ${fname}.gz
	fi
  done
 
  popd
 
  TRAINDIR=$DATA/mnist_train_${BACKEND}
  if [ ! -d "$TRAINDIR" ]; then
	convert_mnist_data \
  	$DATA/train-images-idx3-ubyte $DATA/train-labels-idx1-ubyte \
  	$TRAINDIR --backend=${BACKEND}
  fi
 
  TESTDIR=$DATA/mnist_test_${BACKEND}
  if [ ! -d "$TESTDIR" ]; then
	convert_mnist_data \
  	$DATA/t10k-images-idx3-ubyte $DATA/t10k-labels-idx1-ubyte \
  	$TESTDIR --backend=${BACKEND}
  fi
  '
 
# =============================================================================
# SETUP CAFFE NETWORK TO TRAIN/TEST/SOLVER
# =============================================================================
cat <<EOF > $CAFFEWORKDIR/mnist/lenet_train_test.prototxt
name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
	phase: TRAIN
  }
  transform_param {
	scale: 0.00390625
  }
  data_param {
	source: "$DATA/mnist_train_lmdb"
	batch_size: 64
	backend: LMDB
  }
}
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
	phase: TEST
  }
  transform_param {
	scale: 0.00390625
  }
  data_param {
	source: "$DATA/mnist_test_lmdb"
	batch_size: 100
	backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
	lr_mult: 1
  }
  param {
	lr_mult: 2
  }
  convolution_param {
	num_output: 20
	kernel_size: 5
	stride: 1
	weight_filler {
  	type: "xavier"
	}
	bias_filler {
  	type: "constant"
	}
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
	pool: MAX
	kernel_size: 2
	stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
	lr_mult: 1
  }
  param {
	lr_mult: 2
  }
  convolution_param {
	num_output: 50
	kernel_size: 5
	stride: 1
	weight_filler {
  	type: "xavier"
	}
	bias_filler {
  	type: "constant"
	}
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
	pool: MAX
	kernel_size: 2
	stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
	lr_mult: 1
  }
  param {
	lr_mult: 2
  }
  inner_product_param {
	num_output: 500
	weight_filler {
  	type: "xavier"
	}
	bias_filler {
  	type: "constant"
	}
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
	lr_mult: 1
  }
  param {
	lr_mult: 2
  }
  inner_product_param {
	num_output: 10
	weight_filler {
  	type: "xavier"
	}
	bias_filler {
  	type: "constant"
	}
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
	phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
EOF
 
 
cat <<EOF > $CAFFEWORKDIR/mnist/lenet_solver.prototxt
# The train/test net protocol buffer definition
net: "mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
EOF

# RUN TRAINING WITH CAFFE ---------------------------------------------------
docker exec -it $dname bash -c '
  # workdir is CAFFEWORKDIR when container was started.
  caffe train --solver=mnist/lenet_solver.prototxt ${script_args//:/ }
  '
 
docker stop $dname && docker rm $dname
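
Any extra arguments given to the wrapper script are forwarded to caffe train through the script_args variable (joined with ":" when the container is started and split back into spaces inside the container). A minimal invocation sketch follows; the filename run_caffe_mnist.sh and the --gpu flag are illustrative assumptions, not taken from this guide.

# hypothetical usage; "run_caffe_mnist.sh" is an assumed name for the script above
chmod +x run_caffe_mnist.sh
./run_caffe_mnist.sh --gpu=all   # extra args are appended to "caffe train" inside the container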

7.3. TensorFlow

7.3.1. run_tf_cifar10.sh

#!/bin/bash
# file: run_tf_cifar10.sh
 
# run example:
# 	./run_tf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
# Get usage help via:
# 	./run_tf_cifar10.sh --help 2>/dev/null
 
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
# specify the work directory for the container to run scripts or work from.
workdir=$_basedir
cifarcode=${_basedir}/examples/tensorflow/cifar/cifar10_multi_gpu_train.py
# cifarcode=${_basedir}/examples/tensorflow/cifar/cifar10_train.py
 
function join { local IFS="$1"; shift; echo "$*"; }
 
script_args=$(join : "$@")
 
dname=${USER}_tf
 
nvidia-docker run --name=$dname -d -t \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -v /datasets/cifar:/datasets/cifar:ro -w $workdir \
  -e cifarcode=$cifarcode -e script_args="$script_args" \
  nvcr.io/nvidia/tensorflow:17.05
 
sleep 1 # wait for container to come up
 
docker exec -it $dname bash -c 'python $cifarcode ${script_args//:/ }'
 
docker stop $dname && docker rm $dname

7.4. Keras

7.4.1. venvfns.sh

#!/bin/bash
# file: venvfns.sh
# functions for virtualenv
 
[[ "${BASH_SOURCE[0]}" == "${0}" ]] && \
  echo Should be run as : source "${0}" && exit 1
 
enablevenvglobalsitepackages() {
    if ! [ -z ${VIRTUAL_ENV+x} ]; then
        _libpypath=$(dirname $(python -c \
  "from distutils.sysconfig import get_python_lib; print(get_python_lib())"))
        if ! [[ "${_libpypath}" == *"$VIRTUAL_ENV"* ]]; then
            return # VIRTUAL_ENV path not in the right place
        fi
        no_global_site_packages_file=${_libpypath}/no-global-site-packages.txt
        if [ -f $no_global_site_packages_file ]; then
            rm $no_global_site_packages_file
            echo "Enabled global site-packages"
        else
            echo "Global site-packages already enabled"
        fi
    fi
}
 
disablevenvglobalsitepackages() {
    if ! [ -z ${VIRTUAL_ENV+x} ]; then
        _libpypath=$(dirname $(python -c \
  "from distutils.sysconfig import get_python_lib; print(get_python_lib())"))
        if ! [[ "${_libpypath}" == *"$VIRTUAL_ENV"* ]]; then
            return # VIRTUAL_ENV path not in the right place
        fi
        no_global_site_packages_file=${_libpypath}/no-global-site-packages.txt
        if ! [ -f $no_global_site_packages_file ]; then
            touch $no_global_site_packages_file
            echo "Disabled global site-packages"
        else
            echo "Global site-packages were already disabled"
        fi
    fi
}
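
The run scripts in the following sections source this file from the home directory and wrap the Python call between enablevenvglobalsitepackages and disablevenvglobalsitepackages. A minimal sketch of the same pattern used interactively, assuming venvfns.sh has been copied to $HOME and the py-keras virtualenv created in the next section exists:

# interactive sketch; assumes venvfns.sh is in $HOME and the py-keras virtualenv exists
source ~/.virtualenvs/py-keras/bin/activate   # activate the virtualenv
source ~/venvfns.sh                           # load the helper functions
enablevenvglobalsitepackages                  # temporarily expose the container's site-packages
python -c 'import keras; print(keras.__version__)'
disablevenvglobalsitepackages                 # restore virtualenv isolation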

7.4.2. setup_keras.sh

#!/bin/bash
# file: setup_keras.sh
 
dname=${USER}_keras
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04

docker exec -it -u root $dname \
  bash -c 'apt-get update && apt-get install -y virtualenv virtualenvwrapper'
 
docker exec -it $dname \
  bash -c 'source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
  mkvirtualenv py-keras
  pip install --upgrade pip
  pip install keras --no-deps
  pip install PyYaml
  pip install numpy
  pip install scipy
  pip install ipython'
 
docker stop $dname && docker rm $dname
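
The run scripts that follow source ~/venvfns.sh from the home directory, which is mounted into the container. If the helper file is not already there, copy it into $HOME; a one-line sketch, assuming venvfns.sh sits next to setup_keras.sh:

# assumes venvfns.sh is in the current directory; the run scripts expect it at ~/venvfns.sh
cp venvfns.sh $HOME/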

7.4.3. run_kerastf_mnist.sh

#!/bin/bash
# file: run_kerastf_mnist.sh
 
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
# specify the work directory for the container to run scripts or work from.
workdir=$_basedir
mnistcode=${_basedir}/examples/keras/mnist_cnn.py
 
dname=${USER}_keras
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -w $workdir -e mnistcode=$mnistcode \
  nvcr.io/nvidia/tensorflow:17.05
 
sleep 1 # wait for container to come up
 
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	enablevenvglobalsitepackages
	python $mnistcode
	disablevenvglobalsitepackages'
 
docker stop $dname && docker rm $dname

7.4.4. run_kerasth_mnist.sh

#!/bin/bash
# file: run_kerasth_mnist.sh
 
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
# specify the work directory for the container to run scripts or work from.
workdir=$_basedir
mnistcode=${_basedir}/examples/keras/mnist_cnn.py
 
dname=${USER}_keras
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -w $workdir -e mnistcode=$mnistcode \
  nvcr.io/nvidia/theano:17.05
 
sleep 1 # wait for container to come up
 
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	enablevenvglobalsitepackages
	KERAS_BACKEND=theano python $mnistcode
	disablevenvglobalsitepackages'
 
docker stop $dname && docker rm $dname

7.4.5. run_kerastf_cifar10.sh

#!/bin/bash
# file: run_kerastf_cifar10.sh
 
# run example:
# 	./run_kerastf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
# Get usage help via:
# 	./run_kerastf_cifar10.sh --help 2>/dev/null
 
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
# specify the work directory for the container to run scripts or work from.
workdir=$_basedir
cifarcode=${_basedir}/examples/keras/cifar10_cnn_filesystem.py
 
function join { local IFS="$1"; shift; echo "$*"; }
 
script_args=$(join : "$@")
 
dname=${USER}_keras
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -v /datasets/cifar:/datasets/cifar:ro -w $workdir \
  -e cifarcode=$cifarcode -e script_args="$script_args" \
  nvcr.io/nvidia/tensorflow:17.05
 
sleep 1 # wait for container to come up
 
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	enablevenvglobalsitepackages
	python $cifarcode ${script_args//:/ }
	disablevenvglobalsitepackages'
 
docker stop $dname && docker rm $dname

7.4.6. run_keras_script.sh

#!/bin/bash
# file: run_keras_script.sh
 
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
# specify the work directory for the container to run scripts or work from.
workdir=$_basedir
 
function join { local IFS="$1"; shift; echo "$*"; }
 
container="nvcr.io/nvidia/tensorflow:17.05"
backend="tensorflow"
script=''
datamnt=''
 
usage() {
cat <<EOF
Usage: $0 [-h|--help] [--container=container] [--script=script]
	[--<remain_args>]
 
	Sets up a Keras environment. The Keras environment is set up in a
	virtualenv and mapped into the Docker container with the chosen
	--backend, then runs the specified --script.
 
	--container - Specify the desired container. Use the "=" sign.
    	Default: ${container}
 
	--backend - Specify the backend for Keras: tensorflow or theano.
    	Default: ${backend}
 
	--script - Specify a script. Specify scripts with full or relative
    	paths (relative to current working directory). Ex.:
            --script=examples/keras/cifar10_cnn_filesystem.py
 
	--datamnt - Data directory to mount into the container.
 
	--<remain_args> - Additional args to pass through to the script.
 
	-h|--help - Displays this help.
 
EOF
}
 
remain_args=()
 
while getopts ":h-" arg; do
	case "${arg}" in
	h ) usage
    	exit 2
    	;;
	- ) [ $OPTIND -ge 1 ] && optind=$(expr $OPTIND - 1 ) || optind=$OPTIND
    	eval _OPTION="\$$optind"
    	OPTARG=$(echo $_OPTION | cut -d'=' -f2)
    	OPTION=$(echo $_OPTION | cut -d'=' -f1)
    	case $OPTION in
    	--container ) larguments=yes; container="$OPTARG"  ;;
    	--script ) larguments=yes; script="$OPTARG"  ;;
    	--backend ) larguments=yes; backend="$OPTARG"  ;;
    	--datamnt ) larguments=yes; datamnt="$OPTARG"  ;;
    	--help ) usage; exit 2 ;;
    	--* ) remain_args+=($_OPTION) ;;
	    esac
   	OPTIND=1
   	shift
  	;;
	esac
done
 
script_args="$(join : ${remain_args[@]})"
 
dname=${USER}_keras
 
# formulate -v option for docker if datamnt is not empty.
mntdata=$([[ ! -z "${datamnt// }" ]] && echo "-v ${datamnt}:${datamnt}:ro" )
 
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  $mntdata -w $workdir \
  -e backend=$backend -e script=$script -e script_args="$script_args" \
  $container
 
sleep 1 # wait for container to come up
 
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	enablevenvglobalsitepackages
	KERAS_BACKEND=$backend python $script ${script_args//:/ }
	disablevenvglobalsitepackages'
 
docker stop $dname && docker rm $dname
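
An example invocation that exercises the options documented in the usage text; the container tag, paths, and the pass-through --epochs/--datadir arguments are illustrative:

./run_keras_script.sh \
    --container=nvcr.io/nvidia/tensorflow:17.05 \
    --backend=tensorflow \
    --script=examples/keras/cifar10_cnn_filesystem.py \
    --datamnt=/datasets/cifar \
    --epochs=3 --datadir=/datasets/cifar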

7.4.7. cifar10_cnn_filesystem.py

#!/usr/bin/env python
# file: cifar10_cnn_filesystem.py
'''
Train a simple deep CNN on the CIFAR10 small images dataset.
'''
 
from __future__ import print_function
import sys
import os
 
from argparse import (ArgumentParser, SUPPRESS)
from textwrap import dedent
 
import numpy as np
 
# from keras.utils.data_utils import get_file
from keras.utils import to_categorical
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
import keras.layers as KL
from keras import backend as KB
 
from keras.optimizers import RMSprop
 
 
def parser_(desc):
    parser = ArgumentParser(description=dedent(desc))
 
    parser.add_argument('--epochs', type=int, default=200,
                        help='Number of epochs to run training for.')
 
    parser.add_argument('--aug', action='store_true', default=False,
                        help='Perform data augmentation on cifar10 set.\n')
 
    # parser.add_argument('--datadir', default='/mnt/datasets')
    parser.add_argument('--datadir', default=SUPPRESS,
                        help='Data directory with Cifar10 dataset.')
 
    args = parser.parse_args()
 
    return args
 
 
def make_model(inshape, num_classes):
    model = Sequential()
    model.add(KL.InputLayer(input_shape=inshape[1:]))
    model.add(KL.Conv2D(32, (3, 3), padding='same'))
    model.add(KL.Activation('relu'))
    model.add(KL.Conv2D(32, (3, 3)))
    model.add(KL.Activation('relu'))
    model.add(KL.MaxPooling2D(pool_size=(2, 2)))
    model.add(KL.Dropout(0.25))
 
    model.add(KL.Conv2D(64, (3, 3), padding='same'))
    model.add(KL.Activation('relu'))
    model.add(KL.Conv2D(64, (3, 3)))
    model.add(KL.Activation('relu'))
    model.add(KL.MaxPooling2D(pool_size=(2, 2)))
    model.add(KL.Dropout(0.25))
 
    model.add(KL.Flatten())
    model.add(KL.Dense(512))
    model.add(KL.Activation('relu'))
    model.add(KL.Dropout(0.5))
    model.add(KL.Dense(num_classes))
    model.add(KL.Activation('softmax'))
 
    return model
 
 
def cifar10_load_data(path):
    """Loads CIFAR10 dataset from a local directory instead of downloading it.
 
    # Returns
        Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
    """
    dirname = 'cifar-10-batches-py'
    # origin = 'http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
    # path = get_file(dirname, origin=origin, untar=True)
    path_ = os.path.join(path, dirname)
 
    num_train_samples = 50000
 
    x_train = np.zeros((num_train_samples, 3, 32, 32), dtype='uint8')
    y_train = np.zeros((num_train_samples,), dtype='uint8')
 
    for i in range(1, 6):
        fpath = os.path.join(path_, 'data_batch_' + str(i))
        data, labels = cifar10.load_batch(fpath)
        x_train[(i - 1) * 10000: i * 10000, :, :, :] = data
        y_train[(i - 1) * 10000: i * 10000] = labels
 
    fpath = os.path.join(path_, 'test_batch')
    x_test, y_test = cifar10.load_batch(fpath)
 
    y_train = np.reshape(y_train, (len(y_train), 1))
    y_test = np.reshape(y_test, (len(y_test), 1))
 
    if KB.image_data_format() == 'channels_last':
        x_train = x_train.transpose(0, 2, 3, 1)
        x_test = x_test.transpose(0, 2, 3, 1)
 
    return (x_train, y_train), (x_test, y_test)
 
 
def main(argv=None):
    '''
    '''
    main.__doc__ = __doc__
    argv = sys.argv if argv is None else sys.argv.extend(argv)
    desc = main.__doc__
    # CLI parser
    args = parser_(desc)
 
    batch_size = 32
    num_classes = 10
    epochs = args.epochs
    data_augmentation = args.aug
 
    datadir = getattr(args, 'datadir', None)
 
    # The data, shuffled and split between train and test sets:
    (x_train, y_train), (x_test, y_test) = cifar10_load_data(datadir) \
        if datadir is not None else cifar10.load_data()
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')
 
    # Convert class vectors to binary class matrices.
    y_train = to_categorical(y_train, num_classes)
    y_test = to_categorical(y_test, num_classes)
 
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
 
    callbacks = None
 
    print(x_train.shape, 'train shape')
    model = make_model(x_train.shape, num_classes)
 
    print(model.summary())
 
    # initiate RMSprop optimizer
    opt = RMSprop(lr=0.0001, decay=1e-6)
 
    # Let's train the model using RMSprop
    model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])
 
    nsamples = x_train.shape[0]
    steps_per_epoch = nsamples // batch_size
 
    if not data_augmentation:
        print('Not using data augmentation.')
        model.fit(x_train, y_train,
                  batch_size=batch_size,
                  epochs=epochs,
                  validation_data=(x_test, y_test),
                  shuffle=True,
                  callbacks=callbacks)
 
    else:
        print('Using real-time data augmentation.')
        # This will do preprocessing and realtime data augmentation:
        datagen = ImageDataGenerator(
            # set input mean to 0 over the dataset
            featurewise_center=False,
            samplewise_center=False,  # set each sample mean to 0
            # divide inputs by std of the dataset
            featurewise_std_normalization=False,
            # divide each input by its std
            samplewise_std_normalization=False,
            zca_whitening=False,  # apply ZCA whitening
            # randomly rotate images in the range (degrees, 0 to 180)
            rotation_range=0,
            # randomly shift images horizontally (fraction of total width)
            width_shift_range=0.1,
            # randomly shift images vertically (fraction of total height)
            height_shift_range=0.1,
            horizontal_flip=True,  # randomly flip images
            vertical_flip=False)  # randomly flip images
 
        # Compute quantities required for feature-wise normalization
        # (std, mean, and principal components if ZCA whitening is applied).
        datagen.fit(x_train)
 
        # Fit the model on the batches generated by datagen.flow().
        model.fit_generator(datagen.flow(x_train, y_train,
                                         batch_size=batch_size),
                            steps_per_epoch=steps_per_epoch,
                            epochs=epochs,
                            validation_data=(x_test, y_test),
                            callbacks=callbacks)
 
 
if __name__ == '__main__':
    main()
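
The script can also be run directly, outside the wrapper scripts, from an environment where Keras and a backend are available; the path and option values below are illustrative:

# direct invocation sketch; --datadir points at a directory containing cifar-10-batches-py
python examples/keras/cifar10_cnn_filesystem.py --epochs=3 --aug --datadir=/datasets/cifar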

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station, GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.