This Best Practices User Guide covers DGX-1, DGX Station, NVIDIA GPU Cloud, Containers and Frameworks. It provides recommendations to help administrators and users work with Docker, extend frameworks, and administer and manage DGX products.

1. About This Guide

This guide provides recommendations to help administrators and users work with Docker®, extend frameworks, and administer and manage the DGX-1™ , DGX Station™ , and NVIDIA® GPU Cloud™ (NGC) products. Although this entire guide consists of best practices, it also explains, wherever possible, the reasons behind those recommendations. The most effective recommendations are labeled as such.

This guide does not provide step-by-step instructions. For additional procedural instruction, see the Preparing To Use NVIDIA Containers Getting Started Guide and the NVIDIA Containers for Deep Learning Frameworks User Guide.

2. Introduction To nvidia-docker And Docker

The DGX-1, DGX Station, and the NVIDIA NGC Cloud Services are designed to run containers. Containers hold the application as well as any libraries or code that are needed to run the application. Containers are portable within an operating system family. For example, you can create a container using Red Hat Enterprise Linux and run it on an Ubuntu system, or vice versa. The only requirement is that each operating system has the container software installed so it can run containers.

Using containers allows you to create the software on whatever OS you are comfortable with and then run the application wherever you want. It also allows you to share the application with other users without having to rebuild the application on the OS they are using.

Containers are different from virtual machines (VMs) such as those run under VMware. A VM has a complete operating system and possibly applications and data files. Containers do not contain a complete operating system; they contain only the software needed to run the application, and rely on the host OS for things such as file system services, networking, and an OS kernel. As a result, the application in the container will always run the same anywhere, regardless of the OS/compute environment.

All three products, the DGX-1, the DGX Station, and the NVIDIA NGC Cloud Services, use Docker. Docker is one of the most popular container services available and is very commonly used by developers in the Artificial Intelligence (AI) space. There is a public Docker repository that holds pre-built Docker containers. These containers can be a simple base OS such as CentOS, or they may be a complete application such as TensorFlow™ . You can use these Docker containers to run the applications that they contain, or you can use them as the basis for creating other containers, for example by extending a container.

To enable portability in Docker images that leverage GPUs, NVIDIA developed the Docker® Engine Utility for NVIDIA® GPUs, also known as nvidia-docker. We will refer to the Docker® Engine Utility for NVIDIA® GPUs simply as nvidia-docker for the remainder of this guide.

With the three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services, NVIDIA provides access to Docker containers that have been especially built, tuned, and optimized for NVIDIA GPUs. This is done through NVIDIA’s private Docker repository, nvcr.io. Some of these containers are for deep learning frameworks and some contain the building blocks of GPU applications. They are there for your use, but are only licensed for use on these three systems, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. You are not restricted to using only the nvidia-docker containers; you can use public Docker containers or other Docker containers on these systems as well.

Containers are not difficult to use. There are just a few basic commands. It’s also not difficult to build a container, particularly if you are starting with an existing container and building upon it. If you are new to containers, especially Docker containers, the next section provides some best practices around Docker and its commands.

3. Docker Best Practices with NVIDIA Containers

The following sections highlight the best practices to using Docker with NVIDIA containers.
  1. Prerequisites. See Prerequisites.
  2. Log into Docker. See Logging into Docker.
  3. List the Docker images on the DGX-1, DGX Station, or the NVIDIA NGC Cloud Services. See Listing Docker Images.
  4. Pull a container. See Pulling a Container.
  5. Run the container. See Running a Container.
  6. Verify that the container is running properly. See Verifying.

3.1. Prerequisites

You can access NVIDIA’s GPU accelerated containers from all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. If you own a DGX-1 or DGX Station, then you should use the NVIDIA® DGX™ container registry at https://compute.nvidia.com. This is a web interface to the Docker repository nvcr.io (the NVIDIA DGX container registry). You can pull containers from there, and you can also push containers into your own account in the registry.

If you are accessing the NVIDIA containers from the NVIDIA® GPU Cloud™ (NGC) container registry via a cloud services provider such as Amazon Web Services (AWS), then you should use the NGC container registry at https://ngc.nvidia.com. This is also a web interface to the same Docker repository as for the DGX-1 and DGX Station. After you create an account, the commands to pull containers are the same as if you had a DGX-1 in your own data center. However, currently, you cannot save any containers to the NGC container registry. Instead, you have to save the containers to your own Docker repository.
Note: The containers are exactly the same, whether you pull them from the NVIDIA DGX container registry or the NGC container registry.

For all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services, the location of the framework source is in /opt/<framework> in the container.

Before you can pull a container from the NGC container registry, you must have Docker and nvidia-docker installed, as explained in the Preparing To Use NVIDIA Containers Getting Started Guide. You must also have access to, and be logged into, the NGC container registry, as explained in the NGC Getting Started Guide.

3.1.1. Hello World For Containers

To make sure you have access to the NVIDIA containers, start with the proverbial “hello world” of Docker commands.

For the DGX-1 and DGX Station, just log into the system. For the NVIDIA NGC Cloud Services consult the NGC Getting Started Guide for details about your specific cloud provider. In general, you will start a cloud instance with your cloud provider using the NVIDIA Volta Deep Learning Image. After the instance has booted, log into the instance.

Next, you can issue the docker --version command to list the version of Docker; this works the same on all three products, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. The output of this command tells you the version of Docker on the system (for example, 17.05-ce, build 89658be).
Figure 1. Listing of Docker version

At any time, if you are not sure about a Docker command, issue the $ docker --help command.

3.2. Logging Into Docker

If you have a DGX-1 or a DGX Station on premises, then the first time you log into the DGX-1 or DGX Station, you are required to set up access to the containers using https://compute.nvidia.com. This requires that the DGX-1 or DGX Station be connected to the Internet. For more information, see the DGX Container Registry User Guide.

In the case of NVIDIA NGC Cloud Services, where you are running nvidia-docker containers in the cloud, the first time you log in you are required to set up access to the NVIDIA NGC Cloud Services at https://ngc.nvidia.com. This requires that the cloud instance be connected to the Internet. For more information, see the Preparing To Use NVIDIA Containers Getting Started Guide and NGC Getting Started Guide.

3.3. Listing Docker Images

Typically, one of the first things you will want to do is get a list of all the Docker images that are currently available. When Docker containers are stored in a repository, they are said to be containers. When you pull a container from a repository to a system, such as the DGX-1, it is then said to be a Docker image. This means the image is local.

Issue the $ docker images command to list the images on the server. Your screen will look similar to the following:
Figure 2. Listing of Docker images
In this example, there are a few Docker containers that have been pulled down to this system. Each image is listed along with its tag and the corresponding image ID. There are two other columns that list when the container was created (approximately) and the approximate size of the image in GB. These columns have been cropped to improve readability.
Note: The output from the command will vary. The above screen capture is just an example.

At any time, if you need help, issue the $ docker images --help command.

3.4. Pulling A Container

A Docker container is composed of layers. The layers are combined to create the container. You can think of layers as intermediate images that add some capability to the overall container. If you make a change to a layer through a Dockerfile (see Building Containers), then Docker rebuilds that layer and all subsequent layers, but not the layers that are not affected by the change. This reduces the time to create containers and also allows you to keep them modular.

Docker is also very good about keeping only one copy of each layer on a system. This saves space and also greatly reduces the possibility of version skew, so that layers that should be the same are not duplicated.

Pulling a container to the system makes the container an image. When the container is pulled to become an image, all of the layers are downloaded. Depending upon how many layers are in the container and how the system is connected to the Internet, it may take some time to download.

The $ docker pull nvcr.io/nvidia/tensorflow:17.06 command pulls the container from the NVIDIA repository to the local system where the command is run. At that point, it is a Docker image. The structure of the pull command is:
$ docker pull <repository>/nvidia/<container>:<xx.xx>
  • <repository> is the path to where the container is stored (the Docker repo). In the following example, the repository is nvcr.io/nvidia (NVIDIA’s private repository).
  • <container> is the name of the container. In the following example we use tensorflow.
  • <xx.xx> is the specific version of the container. In the following example we use 17.06.
Below is an image when a TensorFlow container is pulled using the following command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 3. Example of pulling TensorFlow 17.06
As you can tell, the container had already been pulled down on this particular system (some of the output from the command has been cut off). At this point the image is ready to be run.
Note: The example uses the 17.06 container as an example. The command is the same for other container versions, however, the exact output will differ.
In most cases, you will not find a container already downloaded to the system. Below is some sample output for the case when the container has to be pulled down from the registry, using the command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 4. Example of pulling TensorFlow 17.06 that had not already been loaded onto the server
Below is the output after the pull is finished, using the command:
$ docker pull nvcr.io/nvidia/tensorflow:17.06
Figure 5. Pulling of the container is complete
Note: The screen capture has been cropped in the interest of readability.

3.5. Running A Container

After the nvidia-docker container is pulled down to the system, creating a Docker image, you can run or execute the image.
Important: Use the nvidia-docker command to ensure that the correct NVIDIA drivers and libraries are used. The next section discusses nvidia-docker.
A typical command to run the container is:
nvidia-docker run -it --rm -v local_dir:container_dir nvcr.io/nvidia/<container>:<xx.xx>

  • -it means run in interactive mode
  • --rm means delete the container when it exits
  • -v means mount a directory
  • local_dir is the directory or file from your host system (absolute path) that you want to access from inside your container. For example, the local_dir in the following path is /home/jsmith/data/mnist.
    -v /home/jsmith/data/mnist:/data/mnist 

    If, from inside the container, you issue the ls /data/mnist command, you will see the same files as if you issued the ls /home/jsmith/data/mnist command from outside the container.

  • container_dir is the target directory when you are inside your container. For example, /data/mnist is the target directory in the example:
    -v /home/jsmith/data/mnist:/data/mnist
  • <container> is the name of the container.
  • <xx.xx> is the tag. For example, 17.06.
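Putting these pieces together, the fully assembled command looks like the following sketch. The mount paths and the 17.06 tag are illustrative values taken from the examples above; the script only prints the command (via echo), so it can be inspected without Docker installed.

```shell
#!/bin/sh
# Sketch: assemble the full run command from the pieces described above.
LOCAL_DIR=/home/jsmith/data/mnist      # host directory (absolute path)
CONTAINER_DIR=/data/mnist              # target directory inside the container
IMAGE=nvcr.io/nvidia/tensorflow:17.06  # <repository>/nvidia/<container>:<xx.xx>

# Print the assembled command; remove the echo to actually run it.
echo "nvidia-docker run -it --rm -v ${LOCAL_DIR}:${CONTAINER_DIR} ${IMAGE}"
```

Replacing `echo` with nothing runs the container interactively with the host directory mounted inside it.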

3.6. Verifying

After a Docker container is running, you can verify it by using the classic *nix option ps. For example, issue the $ docker ps -a command.
Figure 6. Verifying a Docker image is running
Without the -a option, only running instances are listed.
Important: It is best to include the -a option in case there are hung jobs running or other performance problems.
You can also stop a running container if you want. For example:
Figure 7. Stopping a container from running
Note: This screen capture has been cropped to improve readability.

Notice that you need the container ID of the container you want to stop. This can be found using the $ docker ps -a command.

Another useful Docker command removes an image from the server. Removing or deleting the image saves space on the server. For example, issue the following command:
$ docker rmi nvcr.io/nvidia/tensorflow:17.06
Figure 8. Removing an image from the server
If you list the images, $ docker images, on the server, then you will see that the image is no longer there.
Figure 9. Confirming the image is removed from the server
Note: This screen capture has been cropped to improve readability.

4. Docker Best Practices

You can run an nvidia-docker container on any platform that is Docker compatible, allowing you to move your application wherever you need it. The containers are platform-agnostic, and therefore, hardware-agnostic as well. However, NVIDIA GPUs introduce some complexity because they require kernel modules and user-level libraries to operate; to get the best performance and to take full advantage of the tremendous performance of an NVIDIA GPU, those specific components must be available to the container.

One approach to solving this complexity when using containers is to have the NVIDIA drivers installed in the container and have the character devices mapped corresponding to the NVIDIA GPUs such as /dev/nvidia0. For this to work, the drivers on the host (the system that is running the container), must match the version of the driver installed in the container. This approach drastically reduces the portability of the container.

4.1. nvidia-docker Containers Best Practices

To make things easier for Docker® containers that are built for GPUs, NVIDIA® has created nvidia-docker, an open-source project hosted on GitHub. It is basically a wrapper around the docker command that takes care of mounting the GPU components that are needed for your container to run.
Important: It is highly recommended you use nvidia-docker when running a Docker container that uses GPUs.
Specifically, it provides two components for portable GPU-based containers.
  1. Driver-agnostic Compute Unified Device Architecture® (CUDA) images
  2. A Docker command-line wrapper that mounts the user mode components of the driver and the GPUs (character devices) into the container at launch.
The nvidia-docker wrapper focuses solely on helping you run images that contain GPU-dependent applications; otherwise, it passes the arguments through to the regular docker commands. A good introduction to nvidia-docker is available on its GitHub project page.
Important: Some things to always remember:
  • Use the nvidia-docker command when you are running and executing containers.
  • When building containers for NVIDIA GPUs, use the base containers in the repository. This will ensure the containers are compatible with nvidia-docker.
Let’s assume the TensorFlow 17.06 container has been pulled down to the system and is now an image that is ready to be run. The following command can be used to execute it.
$ nvidia-docker run --rm -ti nvcr.io/nvidia/tensorflow:17.06
Figure 10. Executing the run command
This takes you to a command prompt inside the container.
Remember: You are root inside the container.

The option --rm tells nvidia-docker to remove the container instance when the container exits. If you make any changes to the image while it’s running, they will be lost.

The option -ti tells docker to run in interactive mode and associate a tty with the instance (basically, a shell).

Running the TensorFlow image didn’t really do anything; it just brought up a command line inside the image where you are root. Below is a better example where the CUDA container is pulled down and the image is executed along with a simple command. This view at least gives you some feedback.
Figure 11. Running an image to give you feedback
This Docker image actually executed a command, nvcc --version, which provides some output (for example, the version of the nvcc compiler). If you want to get a bash shell in the image, then you can run bash within the image.
Figure 12. Getting a bash shell in the image
Note: This screen capture has been cropped to improve readability.

The frameworks that are part of the nvidia-docker repository, nvcr.io, have some specific options for achieving the best performance. This is true for all three systems, the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services. For more information, see Frameworks Best Practices.

In the section Using And Mounting File Systems, some options for mounting external file systems in the running image are explained.
Important: This allows you to keep data and code stored in one place on the system outside of the containers, while keeping the containers intact.
This also allows the containers to stay generic, so they don’t start proliferating as each user creates their own version of the container for their data and code.

4.2. docker exec

There are times when you will need to connect to a running container. You can use the docker exec command to connect to a running container to run commands. You can use the bash command to start an interactive command line terminal or bash shell. The format of the command is:
$ docker exec -ti <CONTAINER_ID_OR_NAME> bash
As an example, suppose one starts a Deep Learning GPU Training System™ (DIGITS) container with the following command:
$ nvidia-docker run -d --name test-digits \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/digits:<xx.xx>
After the container is running, you can now connect to the container instance with the following command.
$ docker exec -it test-digits bash
Note: test-digits is the name of the container. If you don’t specifically name the container, you will have to use the container ID.
Important: Using docker exec, one can execute a snippet of code or a script, or attach interactively to the container, making the docker exec command very useful.

For detailed usage of the docker exec command, see docker exec.

4.3. nvcr.io

Building deep learning frameworks can be quite a bit of work and can be very time consuming. Moreover, these frameworks are being updated weekly, if not daily. On top of this is the need to optimize and tune the frameworks for GPUs. NVIDIA has created a Docker repository, named nvcr.io, where deep learning frameworks are tuned, optimized, and containerized for your use.

NVIDIA creates an updated set of nvidia-docker containers for the frameworks monthly. Included in each container are the source code (these are open-source frameworks), scripts for building the frameworks, Dockerfiles for creating containers based on these containers, markdown files that contain text about the specific container, and tools and scripts for pulling down data sets that can be used for testing or learning. Customers who purchase a DGX-1 or DGX Station have access to this repository for pushing containers (storing containers). When using NVIDIA NGC Cloud Services with a cloud provider, you currently cannot push or save a container to nvcr.io. Instead, you need to save them to a private Docker repository.

To get started with the DGX-1 or DGX Station, you need to create a system admin account for accessing nvcr.io. This account should be treated as an admin account so that users cannot access it. Once this account is created, the system admin can create accounts for projects that belong to the account. They can then give users access to these projects so that they can store or share any containers that they create.

When using the NVIDIA containers with a cloud provider, you are using the NGC container registry that is part of the NVIDIA NGC Cloud Services. It uses the exact same containers as those in nvcr.io.

4.4. Building Containers

You can build containers for the DGX systems, and if you have a DGX-1 or DGX Station, you can even store them in the nvcr.io registry as a project within your account (that is, no one else can access the container unless you give them access). Currently, only the DGX-1 and DGX Station can store containers in nvcr.io. If you are running on NVIDIA NGC Cloud Services using a cloud provider, you can only pull containers from nvcr.io; you must save the containers to a private Docker repository (not nvcr.io).

This section of the document applies to Docker containers in general. You can use the general approach for your own Docker repository as well, but be cautious of the details.

Using a DGX-1 or DGX Station, you can do one of the following:
  1. Create your container from scratch
  2. Base your container on an existing Docker container
  3. Base your container on containers in nvcr.io.
Any one of the three approaches is valid and will work. However, the goal is to run the containers on a system that has eight GPUs, and the containers in nvcr.io are already tuned for the DGX systems and the GPU topology. All of them also include the needed GPU libraries, configuration files, and tools to rebuild the container.
Important: Based on these assumptions it is recommended that you start with a container from nvcr.io.

An existing container in nvcr.io should be used as a starting point. As an example, the TensorFlow 17.06 container will be used and Octave will be added to the container so that some post-processing of the results can be accomplished.

  1. Pull the container from the NGC container registry to the server. See Pulling A Container.
  2. On the server, create a subdirectory called mydocker.
    Note: This is an arbitrary directory name.
  3. Inside this directory, create a file called Dockerfile (capitalization is important). This is the default name that Docker looks for when creating a container. The Dockerfile should look similar to the following:
    Figure 13. Example of a Dockerfile
    There are three lines in the Dockerfile.
    • The first line in the Dockerfile tells Docker to start with the container nvcr.io/nvidia/tensorflow:17.06. This is the base container for the new container.
    • The second line in the Dockerfile performs a package update for the container. It doesn’t update any of the applications in the container but just updates the apt-get database. This is needed before we install new applications in the container.
    • The third and last line in the Dockerfile tells Docker to install the package octave into the container using apt-get.
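    Based on the three lines described above, the Dockerfile is essentially the following sketch (the -y flag is an assumption added so that apt-get installs without prompting):

```dockerfile
# Base the new container on the NVIDIA TensorFlow 17.06 container
FROM nvcr.io/nvidia/tensorflow:17.06

# Refresh the apt-get package database (does not upgrade installed applications)
RUN apt-get update

# Install the octave package into the container
RUN apt-get install -y octave
```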
    The Docker command to create the container is:
    $ docker build -t nvcr.io/nvidian_sas/tensorflow_octave:17.06_with_octave .
    Note: This command uses the default file Dockerfile for creating the container.
    In the following screen capture, the command starts with docker build. The -t option creates a tag for this new container. Notice that the tag specifies the project in the nvcr.io repository where the container is to be stored. As an example, the project nvidian_sas was used along with the repository nvcr.io. Projects can be created by your local administrator who controls access to nvcr.io, or they can give you permission to create them. This is where you can store your specific containers and even share them with your colleagues.
    Figure 14. Creating a container using the Dockerfile
    Note: This screen capture has been cropped to improve readability.

    In the brief output from the docker build … command seen above, each line in the Dockerfile is a Step. In the screen capture, you can see the first and second steps (commands). Docker echoes these commands to standard output (stdout) so you can watch what it is doing, or you can capture the output for documentation.

    After the image is built, remember that we haven’t stored the image in a repository yet; therefore, it’s a Docker image. Docker prints the image ID to stdout at the very end. It also tells you if you have successfully created and tagged the image.

    If you don’t see Successfully ... at the end of the output, examine your Dockerfile for errors (perhaps try to simplify it) or try a very simple Dockerfile to ensure that Docker is working properly.

  4. Verify that Docker successfully created the image.
    $ docker images
    Figure 15. Verifying Docker created the image
    Note: The screen capture has been cropped to make it more readable.

    The very first entry is the new image (about 1 minute old).

  5. Push the image into the repository, creating a container.
    docker push <name of image>
    Figure 16. Example of the docker push command

    The above screen capture is after the docker push … command pushes the image to the repository creating a container. At this point, you should log into the NGC container registry at https://ngc.nvidia.com and look under your project to see if the container is there.

    If you don’t see the container in your project, make sure that the tag on the image matches the location in the repository. If, for some reason, the push fails, try it again in case there was a communication issue between your system and the container registry (nvcr.io).

    To make sure that the container is in the repository, we can pull it to the server and run it. As a test, first remove the image from the DGX Station using the command docker rmi …. Then pull the container down to the server using docker pull …. The image can be run using nvidia-docker as shown below.
    Figure 17. Example of using nvidia-docker to pull container
    Notice that the Octave prompt came up, so Octave is installed and functioning correctly, within the limits of this testing.

4.5. Using And Mounting File Systems

One of the fundamental aspects of using Docker is mounting file systems inside the Docker container. These file systems can contain input data for the frameworks or even code to run in the container.

Docker containers have their own internal file system that is separate from file systems on the rest of the host.
Important: You can copy data into the container file system from outside if you want. However, it’s far easier to mount an outside file system into the container.
Mounting outside file systems is done with the nvidia-docker command using the -v option. For example, the following command mounts two file systems:
$ nvidia-docker run --rm -ti ... -v $HOME:$HOME \
  -v /datasets:/digits_data:ro \
Most of the command has been omitted except for the volume arguments. This command mounts the user’s home directory from the external file system to the home directory in the container (-v $HOME:$HOME). It also takes the /datasets directory from the host and mounts it on /digits_data inside the container (-v /datasets:/digits_data:ro).
Remember: The user has root privileges with Docker; therefore, you can mount almost anything from the host system to anywhere in the container.
For this particular command, the volume command takes the form of:
-v <External FS Path>:<Container FS Path>[:<options>] \

The first part of the option is the path for the external file system. To be sure this works correctly, it’s best to use the fully qualified path (FQP). This is also true for the mount point inside the container <Container FS Path>.

Options, if any, follow the container path after a third colon. In the above example, the second file system is mounted read-only (ro) inside the container. The various options for the -v option are described in the Docker documentation.

The DGX™ systems (DGX-1 and DGX Station) and the nvidia-docker containers use the Overlay2 storage driver for the container file systems. Overlay2 is a union-mount file system driver that allows you to combine multiple file systems so that all the content appears to be combined into a single file system. It creates a union of the file systems rather than an intersection. External file systems mounted with the -v option are bind-mounted into the container on top of this file system.

5. Frameworks Best Practices

As part of the DGX-1, DGX Station, and the NVIDIA NGC Cloud Services systems, NVIDIA makes available tuned, optimized, and ready to run nvidia-docker containers for the major deep learning frameworks. These containers are made available via the container registry, nvcr.io, so that you can use them directly or use them as a basis for creating your own containers.

This section presents tips for efficiently using these frameworks. This section does not explain how to use the frameworks for addressing your projects, rather, it presents best practices for starting them.

There are a few general best practices around the containers (the frameworks) in nvcr.io. As mentioned earlier, it’s possible to use one of the containers and build upon it. By doing this, you are, in a sense, fixing the new container to a specific framework and container version. This approach works well if you are creating a derivative of a framework or adding some capability that doesn’t exist in the framework or container.

Important: However, it is a best practice not to put datasets in a container. If possible also avoid storing business logic code in a container.
The reason is that storing datasets and/or business logic code within a container makes it difficult to generalize the usage of the container. Instead, one can mount file systems into the container that contain just the desired data sets and the directories of business logic code to run. Decoupling the container from specific datasets and business logic enables one to easily change containers, such as the framework or the version of a container, without having to rebuild the container to hold the data or code.
Important: The main takeaway is to use volumes from outside the container for datasets and business logic code. Keep the container as generic as possible.
When applying this practice to deep learning workflows, the non-business-logic code is the containerized framework, such as TensorFlow, and the business-logic code would be a Python file defining a TensorFlow network along with code to read, process, and write data. The data is read from some readable mounted dataset and written to some writable mounted volume (which could be the same location as the mounted readable dataset).
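As a concrete sketch of this separation, the following assembles a run command that mounts a read-only dataset, a writable results directory, and a directory of business-logic code. Every host path and mount point here is a hypothetical example, and the command is only printed (via echo), not executed.

```shell
#!/bin/sh
# All host paths below are hypothetical examples of the decoupled layout
# described above; substitute your own locations.
DATASET=/datasets/mnist        # read-only input data on the host
RESULTS=/home/jsmith/results   # writable output directory on the host
CODE=/home/jsmith/src          # business-logic code (e.g., a TensorFlow script)

# Print the assembled command; remove the echo to actually run it.
echo "nvidia-docker run --rm -ti \
  -v ${DATASET}:/data/mnist:ro \
  -v ${RESULTS}:/results \
  -v ${CODE}:/workspace/src \
  nvcr.io/nvidia/tensorflow:17.06"
```

Because the container only sees mounted volumes, you can switch to a different framework or container version without rebuilding anything to hold the data or code.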

The subsequent sections briefly present some best practices around the major frameworks that are in containers on the container registry (nvcr.io). There is also a section that discusses how to use Keras, a very popular high-level abstraction of deep learning frameworks, with some of the containers.

5.1. NVCaffe

NVCaffe™ can run using the DIGITS application or directly via a command line interface. Also, a Python interface for NVCaffe called pycaffe is available.

When running NVCaffe via the command line or pycaffe, use the nvcr.io/nvidia/caffe:17.05 or later container. The script run_caffe_mnist.sh, in the section of the same name, provides an example of using the MNIST data and the LeNet network to perform training via the NVCaffe command line. In the script, the data path is set to /datasets/caffe_mnist. You can modify the path to your desired location. To run it, use the following commands.
# or with multiple GPUs use -gpu flag: "-gpu=all" for all gpus or
#   comma list.
./run_caffe_mnist.sh -gpu=0,1

This script demonstrates how to orchestrate a container, pass external data to the container, and run NVCaffe training while storing the output in a working directory. Read through the run_caffe_mnist.sh script for more details. It is based on the MNIST training example.

The Python interface, pycaffe, is used via import caffe in a Python script. For examples of using pycaffe and the Python interface, refer to the test scripts.

Orchestrating a Python script with Docker containers is described in section run_tf_cifar10.sh, using the run_tf_cifar10.sh script.

An interactive session with NVCaffe can be set up with the following lines in a script:
mkdir -p $DATA
mkdir -p $CAFFEWORKDIR/mnist
# Orchestrate Docker container with user's privileges
nvidia-docker run -d -t --name=$dname \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -e DATA=$DATA -v $DATA:$DATA \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -w $CAFFEWORKDIR nvcr.io/nvidia/caffe:17.05
# enter interactive session
docker exec -it $dname bash
# After exiting the interactive container session, stop and rm
#   container.
# docker stop $dname && docker rm $dname
In the script, the following line has options for Docker to enable proper NVIDIA® Collective Communications Library ™ (NCCL) operation for running NVCaffe with multiple GPUs.
 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864
You can use the NVCaffe command line or Python interface within the NVCaffe container. For example, using the command line would look similar to the following:
caffe device_query -gpu 0 # query GPU stats. Use "-gpu all" for all gpus
caffe train help # print out help/usage
Using a Python interface would look similar to the following:
# start python in container
>>> import caffe
>>> dir(caffe)
['AdaDeltaSolver', 'AdaGradSolver', 'AdamSolver', 'Classifier', 'Detector',
 'Layer', 'NesterovSolver', 'Net', 'NetSpec', 'RMSPropSolver', 'SGDSolver',
 'TEST', 'TRAIN', '__builtins__', '__doc__', '__file__', '__name__',
 '__package__', '__path__', '__version__', '_caffe', 'classifier', 'detector',
 'get_solver', 'io', 'layer_type_list', 'layers', 'net_spec', 'params',
 'proto', 'pycaffe', 'set_device', 'set_mode_cpu', 'set_mode_gpu', 'to_proto']

For more information about NVCaffe, see NVCaffe documentation.

5.2. Caffe2

Caffe2™ is a deep learning framework enabling simple and flexible deep learning. Built on the original BVLC Caffe™ , Caffe2 is designed with expression, speed, and modularity in mind, allowing for a more flexible way to organize computation.

Caffe2 aims to provide an easy and straightforward way for you to experiment with deep learning by leveraging community contributions of new models and algorithms. Caffe2 comes with native Python and C++ APIs that work interchangeably so you can prototype quickly now, and easily optimize later. Caffe2 is fine-tuned from the ground up to take full advantage of the latest NVIDIA Deep Learning SDK libraries, CUDA® Deep Neural Network library™ (cuDNN), CUDA® Basic Linear Algebra Subroutines library™ (cuBLAS), and NCCL, to deliver high-performance, multi-GPU acceleration for desktops, data centers, and embedded edge devices.

There is an informative introduction to Caffe2 that includes some comparative tests. NVIDIA provides a web page with the release notes for the Caffe2 version that is included. If you want to build Caffe2 yourself, or if you want to see test results with Caffe2, you can find information on NVIDIA's GPU Ready App page for Caffe2. There is also a lab for Caffe2 in the NVIDIA Deep Learning Institute.

5.3. Microsoft Cognitive Toolkit

The Microsoft® Cognitive Toolkit™, previously known as CNTK, allows users to easily realize and combine popular model types such as feed-forward DNNs, convolutional nets (CNNs), and recurrent networks (RNNs/LSTMs). Version 2.1 was released on 7/30/2017 and included support for cuDNN 6 and Keras.

NVIDIA includes a pre-built release of the Microsoft Cognitive Toolkit in the container registry (nvcr.io). You can find the release notes here. The NVIDIA Deep Learning Institute (DLI) also has a course that utilizes the Microsoft Cognitive Toolkit, although it may be referred to as CNTK.


5.4. DIGITS

DIGITS is a popular training workflow manager provided by NVIDIA. Using DIGITS, one can manage image data sets and training through an easy-to-use web interface for the NVCaffe, Torch™, and TensorFlow frameworks.

For more information, see NVIDIA DIGITS, DIGITS source and DIGITS documentation.

5.4.1. Setting Up DIGITS

The following directories, files, and ports are useful when running the DIGITS container.
Table 1. Running DIGITS container details
Description Value Notes
DIGITS working directory $HOME/digits_workdir You must create this directory.
DIGITS job directory $HOME/digits_workdir/jobs You must create this directory.
DIGITS config file $HOME/digits_workdir/digits_config_env.sh Used to pass job directory and log file.
DIGITS port 5000 Choose a unique port if multi-user.
Important: It is recommended to specify a list of environment variables in a single file that can be passed to the nvidia-docker run command via the --env-file option.
The script in section digits_config_env.sh declares the location of the DIGITS job directory and log file; it is commonly used when running DIGITS. Below is an example of defining these two variables in a simple bash script.
# DIGITS Configuration File

For more information about configuring DIGITS, see Configuration.md.
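A minimal sketch of such a configuration file follows. The variable names are assumptions based on the DIGITS configuration documentation referenced above, and the paths are placeholders; note that Docker's --env-file option does not perform shell expansion, so literal paths are safer than $HOME here:

```shell
# DIGITS Configuration File (sketch; variable names assumed from the
# DIGITS configuration documentation, paths are placeholders).
DIGITS_JOB_DIR=/home/youruser/digits_workdir/jobs
DIGITS_LOGFILE_FILENAME=/home/youruser/digits_workdir/digits.log
```

When passed with --env-file, each KEY=VALUE line becomes an environment variable inside the DIGITS container.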

5.4.2. Running DIGITS

To run DIGITS, refer to the example script in section run_digits.sh. However, if you want to run DIGITS from the command line, below is a sample nvidia-docker command that has most of the details needed to effectively run DIGITS.
Note: You will have to create the jobs directory if it doesn’t already exist.
$ mkdir -p $HOME/digits_workdir/jobs
$ NV_GPU=0,1 nvidia-docker run --rm -ti --name=${USER}_digits -p 5000:5000 \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  --env-file=${HOME}/digits_workdir/digits_config_env.sh \
  -v /datasets:/digits_data:ro \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/digits:17.05
This command has several options, though you may not need all of them. The table below lists the parameters and their descriptions.
Table 2. nvidia-docker run command options
Parameter Description
NV_GPU Optional environment variable specifying GPUs available to the container.
--name Name to associate with the Docker container instance.
--rm Tells Docker to remove the container instance when done.
-ti Tells Docker to run in interactive mode and associate tty with the instance.
-d Tells Docker to run in daemon mode; no tty, run in background (not shown in the command and not recommended for running with DIGITS).
-p p1:p2 Tells Docker to map host port p1 to container port p2 for external access. This is useful for pushing DIGITS output through a firewall.
-u id:gid Tells Docker to run the container with user id and group id for file permissions.
-v d1:d2 Tells Docker to map host directory d1 into the container at directory d2.
Important: This is a very useful option because it allows you to store the data outside of the container.
--env-file Tells Docker which environment variables to set for the container.
--shm-size ... This line is a temporary workaround for a DIGITS multi-GPU error you might encounter.
container Tells Docker which container instance to run (for example, nvcr.io/nvidia/digits:17.05).
command Optional command to run after the container is started. This option is not used in the example.
After DIGITS starts running, open a browser using the IP address and port of the system. For example, the URL would be http://dgxip:5000/. If the port is blocked and an SSH tunnel has been set up (see SSH Tunneling), then you can use the URL http://localhost:5000/.
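For example, such a tunnel from a local workstation might be constructed as follows (the user name and host name are placeholders):

```shell
# Placeholders: substitute your own user name and DGX host name/IP.
DGX_HOST=dgxip
DIGITS_PORT=5000

# -L forwards local port 5000 to port 5000 on the remote system.
TUNNEL_CMD="ssh -L ${DIGITS_PORT}:localhost:${DIGITS_PORT} user@${DGX_HOST}"
echo "$TUNNEL_CMD"

# Run the command above, then browse to http://localhost:5000/ as if
# DIGITS were running locally.
```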
In this example, the datasets reside in /datasets outside the container (this can be any path on the system) and are mapped to /digits_data inside the container via the option -v /datasets:/digits_data:ro. The :ro option mounts the data read-only.
Important: For both paths, it is highly recommended to use fully qualified path names outside and inside the container.

If you are looking for datasets for learning how to use the system and the containers, there are some standard datasets that can be downloaded via DIGITS.

Included in the DIGITS container is a Python script that can be used to download specific sample datasets. The tool is called digits.download_data. It can be used to download the MNIST data set, the CIFAR-10 dataset, and the CIFAR-100 dataset. You can also use this script in the command to run DIGITS so that it pulls down the sample dataset. Below is an example for the MNIST dataset.
$ nvidia-docker run --rm -ti \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  --env-file=${HOME}/digits_workdir/digits_config_env.sh \
  -v /datasets:/digits_data \
  --entrypoint=bash \
  nvcr.io/nvidia/digits:17.05 \
  -c 'python -m digits.download_data mnist /digits_data/digits_mnist'

In the download example above, the entry point to the container was overridden to run a bash command that downloads the dataset (the -c option). You should adjust the dataset paths as needed.

An example of running DIGITS on MNIST data can be found here.

More DIGITS examples can be found here.

5.5. Keras And Containerized Frameworks

Keras is a popular Python frontend for TensorFlow, Theano, and the Microsoft Cognitive Toolkit (v2.x). Keras implements a high-level neural network API for these frameworks. Keras is not included in the containers in nvcr.io because it is evolving so quickly. You can add it to any of the containers if you like, and there are ways to start one of the nvcr.io containers and install Keras during the launch process. This section also provides some scripts for using Keras in a virtual Python environment.

Before jumping into Keras and best practices around how to use it, familiarize yourself with virtualenv and virtualenvwrapper.

When you run Keras, you have to specify the desired framework backend. This can be done using either the $HOME/.keras/keras.json file or the environment variable KERAS_BACKEND=<backend>, where the backend choices are theano, tensorflow, or cntk. The ability to choose a framework with minimal changes to the Python code makes Keras very popular.
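Both mechanisms can be exercised from a shell, as sketched below; KERAS_BACKEND takes precedence over the JSON file for a given run, and the keras.json fields shown are illustrative:

```shell
# Option 1: select the backend for a single run via the environment.
# KERAS_BACKEND overrides whatever keras.json says.
export KERAS_BACKEND=theano

# Option 2: persist the choice in $HOME/.keras/keras.json
# (fields shown are illustrative).
mkdir -p $HOME/.keras
cat > $HOME/.keras/keras.json <<'EOF'
{
    "backend": "tensorflow",
    "floatx": "float32",
    "epsilon": 1e-07
}
EOF
```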

There are several ways to configure Keras to work with containerized frameworks.
Important: The most reliable approach is to create a container with Keras or install Keras within a container.
Setting up a container with Keras might be preferable for deployed containerized services.
Important: Another approach that works well in development environments is to set up a virtual Python environment with Keras.
This virtual environment can then be mapped into the container and the Keras code can run against the desired framework backend.

The advantage of decoupling Python environments from the containerized frameworks is that, given M containers and N environments, one needs only M + N configurations rather than M * N containers. The configuration then is the launcher or orchestration script that starts the desired container and activates the Keras Python environment within that container. The disadvantage of this approach is that one cannot guarantee the compatibility of the virtual Python environment and the framework backend without testing. If the environment is incompatible, one would need to re-create the virtual Python environment from within the container to make it compatible.

5.5.1. Adding Keras To Containers

If you choose, you can add Keras to an existing container. Like the frameworks, Keras changes fairly rapidly, so you will have to watch for changes in Keras.

There are two good choices for installing Keras into an existing container. Before proceeding with either approach, ensure you are familiar with the Docker Best Practices section of this document to understand how to build on existing containers.

The first approach is to use the OS version of Python to install Keras using the Python tool pip.
# sudo pip install keras

Ensure you check the version of Keras that has been installed. It may be an older version chosen to better match the system OS, and it may not be the version you want or need. If that is the case, the next paragraph describes how to install Keras from source code.

The second approach is to build Keras from source. It is recommended that you download one of the releases rather than download from the master branch. A simple step-by-step process is to:
  1. Download a release in .tar.gz format (you can always use .zip if you want).
  2. Start up a container with either TensorFlow, Microsoft Cognitive Toolkit v2.x, or Theano.
  3. Mount your home directory as a volume in the container (see Using And Mounting File Systems).
  4. Navigate into the container and open a shell prompt.
  5. Uncompress and untar the Keras release (or unzip the .zip file).
  6. Change (cd) into the directory and install:
    # cd keras
    # sudo python setup.py install
If you want to use Keras as part of a virtual Python environment, the next section will explain how you can achieve that.

5.5.2. Creating Keras Virtual Python Environment

Before jumping into Keras in a virtual Python environment, it's always a good idea to review the installation dependencies of Keras. The dependencies are common for data science Python environments: NumPy, SciPy, YAML, and h5py. Keras can also use cuDNN, but this is already included in the framework containers.

You will be presented with several scripts for running Keras in a virtual Python environment. These scripts are included in the document and provide a better user experience than doing everything by hand.

The script in section venvfns.sh is a master script. It must be placed in a directory on the system that is accessible to all users, for example, /usr/share/virtualenvwrapper/. An administrator needs to put this script in the desired location.

The script in section setup_keras.sh creates a py-keras virtual Python environment in the ~/.virtualenvs directory in the user's home directory. Each user can run the script themselves.
In this script, you launch the nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04 container as the local user, with your home directory mounted into the container. The salient parts of the script are below:
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
Important: When creating the Keras files, ensure you have the correct privileges set when using the -u or --user options. The -d and -t options daemonize the container process. This way the container runs in the background as a daemon service and one can execute code against it.
You can use docker exec to execute a snippet of code, a script, or attach interactively to the container. Below is the portion of the script that sets up a Keras virtual Python environment.
docker exec -it $dname \
  bash -c 'source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
  mkvirtualenv py-keras
  pip install --upgrade pip
  pip install keras --no-deps
  pip install PyYaml
  # pip install -r /pathto/requirements.txt
  pip install numpy
  pip install scipy
  pip install ipython'
If the list of Python packages is extensive, you can write a requirements.txt file listing those packages and install via:
pip install -r /pathto/requirements.txt --no-deps
Note: This line appears in the previous command, but it has been commented out because it was not needed.
The --no-deps option specifies that dependencies of packages should not be installed. It is used here because by default installing Keras will also install Theano or TensorFlow.
Important: On a system where you don’t want to install non-optimized frameworks such as Theano and TensorFlow, the --no-deps option prevents this from happening.
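For example, a hypothetical requirements.txt for the packages installed above might be written and used like this (the package list is illustrative):

```shell
# Capture the extra packages in a requirements.txt file
# (package names are illustrative):
cat > requirements.txt <<'EOF'
PyYAML
numpy
scipy
ipython
EOF

# Install them without pulling in framework dependencies such as
# Theano or TensorFlow (commented out here; run it inside the venv):
# pip install -r requirements.txt --no-deps
```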
Notice the line in the script that begins with bash -c; it sources virtualenvwrapper.sh from the common system location discussed previously (where venvfns.sh also needs to be placed). If, some time later, more packages are needed, one can relaunch the container and add those new packages as above or interactively. The code snippet below illustrates how to do so interactively.
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04
sleep 2  # wait for above container to come up
docker exec -it $dname bash
You can now log into the interactive session, activate the virtual Python environment, and install what is needed. The example below installs h5py, which is used by Keras for saving models in HDF5 format.
source ~/.virtualenvs/py-keras/bin/activate
pip install h5py

If the installation fails because some underlying library is missing, one can attach to the container as root and install the missing library.

The next example illustrates installing the python-dev package, which provides Python.h if it is missing.
$ docker exec -it -u root $dname \
  bash -c 'apt-get update &&  apt-get install -y python-dev # anything else...'
When you are done, the container can be stopped and removed using the following command.
$ docker stop $dname && docker rm $dname

5.5.3. Using Keras Virtual Python Environment With Containerized Frameworks

The following examples assume that a py-keras venv (Python virtual environment) has been created per the instructions in the previous section. All of the scripts for this section can be found in the Scripts section.

In the section run_kerastf_mnist.sh, the script demonstrates how the Keras venv is enabled and is then used to run the Keras MNIST code mnist_cnn.py with the default backend TensorFlow. Standard Keras examples can be found here.

Compare the run_kerastf_mnist.sh script to the run_kerasth_mnist.sh (in section run_kerasth_mnist.sh) that uses Theano. There are primarily two differences:
  1. The backend container nvcr.io/nvidia/theano:17.05 is used instead of nvcr.io/nvidia/tensorflow:17.05.
  2. In the code launching section of the script, specify KERAS_BACKEND=theano. You can run these scripts as:
    $./run_kerasth_mnist.sh  # Ctrl^C to stop running
In section run_kerastf_cifar10.sh, the script has been modified to accept parameters and demonstrates how one would specify an external data directory for the CIFAR-10 data. In section cifar10_cnn_filesystem.py, the script has been modified from the original cifar10_cnn.py. The command line example to run this code on a system is the following:
$./run_kerastf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
The above assumes the storage is mounted on a system at /datasets/cifar.
Important: The key takeaway is that running some code within a container involves setting up a launcher script.
These scripts can be generalized and parameterized for convenience and it is up to the end user or developer to write these scripts for their custom application or their custom workflow.
For example:
  1. The parameters in the example script were joined to a temporary variable via the following:
    function join { local IFS="$1"; shift; echo "$*"; }
    script_args=$(join : "$@")
  2. The parameters were passed to the container via the option:
    -e script_args="$script_args"
  3. Within the container, these parameters are split and passed through to the computation code by the line:
    python $cifarcode ${script_args//:/ }
  4. The external system NFS/storage was passed read-only to the container via the following option to the launcher script:
    -v /datasets/cifar:/datasets/cifar:ro

The run_kerastf_cifar10.sh script can be improved by parsing parameters to generalize the launcher logic and avoid duplication. There are several ways to parse parameters in bash, such as via getopts or a custom parser. One can also write a non-bash launcher in Python, Perl, or something else.

The final script, in section run_keras_script.sh, implements a high-level parameterized bash launcher. The following examples illustrate how to use it to run the previous MNIST and CIFAR examples.
# running TensorFlow MNIST
./run_keras_script.sh \
  --container=nvcr.io/nvidia/tensorflow:17.05 \
  --script=examples/keras/mnist_cnn.py
# running Theano MNIST
./run_keras_script.sh \
  --container=nvcr.io/nvidia/theano:17.05 --backend=theano \
  --script=examples/keras/mnist_cnn.py
# running TensorFlow Cifar10
./run_keras_script.sh \
  --container=nvcr.io/nvidia/tensorflow:17.05 --backend=tensorflow \
  --datamnt=/datasets/cifar \
  --script=examples/keras/cifar10_cnn_filesystem.py \
  --epochs=3 --datadir=/datasets/cifar
# running Theano Cifar10
./run_keras_script.sh \
  --container=nvcr.io/nvidia/theano:17.05 --backend=theano \
  --datamnt=/datasets/cifar \
  --script=examples/keras/cifar10_cnn_filesystem.py \
  --epochs=3 --datadir=/datasets/cifar
Important: If the code produces output that needs to be written to a filesystem and persisted after the container stops, that logic needs to be added.
In the examples above, the user's home directory is mounted into the container and is writeable. This ensures that the code can write its results somewhere within the user's home path. The filesystem paths need to be mounted into the container and specified or passed to the computational code.
These examples illustrate how one goes about orchestrating computational code, whether it uses Keras or not.
Important: In practice, it is often convenient to launch containers interactively, attach to them interactively, and run code interactively.
During these interactive sessions, it is easier to debug and develop code, and to automate with helper scripts. An interactive session might look like the following sequence of commands typed manually into the terminal:
# in bash terminal
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -v /datasets/cifar:/datasets/cifar:ro -w $workdir \
docker exec -it $dname bash
# now interactively in the container.
source ~/.virtualenvs/py-keras/bin/activate
source ~/venvfns.sh
./run_kerastf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
# change some parameters or code in cifar10_cnn_filesystem.py and run again
./run_kerastf_cifar10.sh --aug --epochs=2 --datadir=/datasets/cifar
exit # exit interactive session in container
docker stop $dname && docker rm $dname # stop and remove container

5.5.4. Working With Containerized VNC Desktop Environment

The need for a containerized desktop varies depending on the data center setup. If your system sits behind a login node or head node of an on-premises system, the data center will typically provide a VNC login node or run X Windows on the login node to facilitate running visual tools, such as text editors or an IDE (integrated development environment).

For a cloud-based system (NGC), firewalls and security rules may already be in place. In this case, you may want to ensure that the proper ports are open for VNC or something similar.

If the system serves as the primary resource for both development and computing, then it is possible to set up a desktop-like environment on it via a containerized desktop. The instructions and Dockerfile for this can be found here. Note that these instructions are primarily for the DGX-1 but should also work for the DGX Station.

You can download the latest release of the container to the system. The next step is to modify the Dockerfile by changing the FROM field to be:
FROM nvcr.io/nvidia/cuda:8.0-cudnn6-devel-ubuntu16.04

This container is not officially supported by the DGX product team; in other words, it is not available on nvcr.io. It is provided as an example of how to set up a desktop-like environment on a system for convenient development with Eclipse or Sublime Text (suggestion: try Visual Studio Code, which is much like Sublime Text but free) or any other GUI-driven tool.

An example script, build_run_dgxdesk.sh, is available on the GitHub site to build and run a containerized desktop, as shown in the Scripts section. Other systems, such as the DGX Station and NGC, would follow a similar process.

To connect to the system, you can download a VNC client for your system from RealVNC, or use a web browser.
=> connect via VNC viewer hostip:5901, default password: vncpassword
=> connect via noVNC HTML5 client: http://hostip:6901/?password=vncpassword

5.6. MXNet

MXNet™ is part of the Apache Incubator project. The MXNet library is portable and can scale to multiple GPUs and multiple machines. MXNet is supported by major public cloud providers, including Amazon Web Services (AWS) and Azure; Amazon has chosen MXNet as its deep learning framework of choice at AWS. It supports multiple languages (C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl, and the Wolfram Language).

NVIDIA includes a release of MXNet as well. You can read the release notes here. NVIDIA also has a page in the GPU Ready Apps catalog for MXNet that explains how you can build it outside of the container registry (nvcr.io). It also presents some test results for MXNet.

To get started with MXNet, the NVIDIA Deep Learning Institute (DLI) has some courses that utilize MXNet.

5.7. PyTorch

PyTorch™ is designed to be deeply integrated with Python. It is used naturally as you would use NumPy, SciPy and scikit-learn, or any other Python extension. You can even write the neural network layers in Python using libraries such as Cython and Numba. Acceleration libraries such as NVIDIA cuDNN and NCCL along with Intel MKL are included to maximize performance.

NVIDIA has a release of PyTorch as well. You can read the release notes here. There is also a good blog that discusses recursive neural networks using PyTorch.

5.8. TensorFlow

An efficient way to run TensorFlow on the GPU system involves setting up a launcher script to run the code using a TensorFlow Docker container. For an example of how to run CIFAR-10 on multiple GPUs using cifar10_multi_gpu_train.py, see TensorFlow models.

If you prefer to use a script for running TensorFlow, see run_tf_cifar10.sh. It is a bash script that you can run on a system. It assumes you have pulled the Docker container from the nvcr.io repository to the system. It also assumes you have the CIFAR-10 data stored in /datasets/cifar on the system and are mapping it to /datasets/cifar in the container. You can also pass arguments to the script such as the following:
$./run_tf_cifar10.sh --data_dir=/datasets/cifar --num_gpus=8

The details of the run_tf_cifar10.sh script parameterization are explained in the Keras section of this document (see Keras And Containerized Frameworks). You can modify the /datasets/cifar path in the script for the site-specific location of the CIFAR data. If the CIFAR-10 dataset for TensorFlow is not available, then run the example with a writeable volume, -v /datasets/cifar:/datasets/cifar (without ro), and the data will be downloaded on the first run.

If you want to parallelize the CIFAR-10 training, basic data-parallelization for TensorFlow via Keras can be done as well. Refer to the example cifar10_cnn_mgpu.py on GitHub.


5.9. Theano

Theano is an open source project primarily developed by a machine learning group at the Université de Montréal. It is focused on Python and is primarily a Python library or module. It has its own Python frontend, and Keras can also be used as a frontend. Interestingly, Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can generate customized C code for the expressions being evaluated, which is very useful for repetitive computations. Moreover, it can still provide symbolic features, such as automatic differentiation, for expressions that may be evaluated only once, to improve performance.

NVIDIA includes a release of Theano as well. You can read the release notes here. To get started with Theano, the NVIDIA Deep Learning Institute (DLI) provides online courses that utilize Theano.

5.10. Torch

Torch is an open-source deep learning framework that uses Lua as a scripting language. It can also be used with DIGITS.

NVIDIA includes a release of Torch as well. You can read the release notes here. To get started with Torch, the NVIDIA Deep Learning Institute (DLI) provides online courses that utilize Torch.

If you want to build Torch from scratch or if you are interested in test results with Torch, you can find more information on the GPU Ready App site for Torch.

6. DGX-1 Best Practices

NVIDIA has created the DGX-1 as an appliance to make administration and operation as simple as possible. However, like any computational resource it still requires administration. This section discusses some of the best practices around configuring and administering a single DGX-1 or several DGX-1 appliances.

There is also some discussion about how to plan for external storage, networking, and other configuration aspects for the DGX-1.

6.1. Storage

In order for deep learning to be effective and to take full advantage of the DGX-1, the various aspects of the DGX-1 have to be balanced. This includes storage and I/O, which are particularly important for feeding data to the GPUs to keep them busy and to dramatically reduce run times for models. This section presents some best practices for storage within, and outside of, the DGX-1. It also discusses storage considerations as the number of DGX-1 units is scaled out.

6.1.1. Internal Storage

The first storage consideration is storage within the DGX-1 itself. For the best possible performance, an NFS read cache has been included in the DGX-1 appliance using the Linux cacheFS capability. It uses four SSDs in a RAID-0 group. The drives are connected to a dedicated hardware RAID controller.

Deep learning I/O patterns typically consist of multiple iterations of reading the training data. The first pass through the data is sometimes referred to as the cold start. Subsequent passes through the data can avoid rereading the data from the filesystem if adequate local caching is provided on the node. If you can estimate the maximum size of your data, you can architect your system to provide enough cache so that the data only needs to be read once during any training job. A set of very fast SSD disks can provide an inexpensive and scalable way of providing adequate caching for your applications.

The purpose of this cache is for storing training and validation data for reading by the frameworks. During the first epoch of training a framework, the training data is read and used to start training the model. The NFS cache is a read cache so that all of the data that is read for the first epoch is cached on the RAID-0 group. Subsequent reads of the data are done from the NFS cache and not the central repository that was used in the first epoch. As a result, the IO is much faster after the first epoch.

The benefit of adequate caching is that your external filesystem does not have to provide maximum performance during a cold start (the first epoch), since this first pass through the data is only a small part of the overall training time. For example, typical training sessions can iterate over the data 100 times. If we assume read access during the cold-start iteration is 5x slower than the remaining cached iterations, then the total run time of training increases by the following amount.
  • 1 iteration at 5x cost + 99 local cached storage iterations = 104 time units instead of 100
  • roughly a 4% increase in runtime over 100 iterations

Even if your external file system cannot sustain peak training IO performance, it has only a small impact on overall training time. Consider this when designing your storage system so that you can build the most cost-effective solution for your workloads.

By default, the DGX-1 comes with four SSD devices connected to the RAID controller.
Additional drive slots are open in the DGX-1, but you cannot put additional drives into the system without voiding your warranty.

6.1.2. External Storage

As an organization scales out their GPU enabled data center, there are many shared storage technologies which pair well with GPU applications. Since the performance of a GPU enabled server is so much greater than a traditional CPU server, special care needs to be taken to ensure the performance of your storage system is not a bottleneck to your workflow.

Different data types require different considerations for efficient access from filesystems. For example:
  • Running parallel HPC applications may require the storage technology to support multiple processes accessing the same files simultaneously.
  • To support accelerated analytics, storage technologies often need to support many threads with quick access to small pieces of data.
  • For vision based deep learning, accessing images or video used in classification, object detection or segmentation may require high streaming bandwidth, fast random access, or fast memory mapped (mmap()) performance.
  • For other deep learning techniques, such as recurrent networks, working with text or speech can require any combination of high bandwidth, random access, and small-file performance.

HPC workloads typically drive high simultaneous multi-system write performance and benefit greatly from traditional scalable parallel file system solutions. Size HPC storage and network performance to meet the increased dense compute needs of GPU servers. It is not uncommon to see per-node performance increases of 10-40x for a 4 GPU system vs a CPU system for many HPC applications.

Data Analytics workloads, similar to HPC, drive high simultaneous access, though they are more read-focused than HPC. Again, it is important to size Data Analytics storage to match the dense compute performance of GPU servers. As you adopt accelerated analytics technologies such as GPU-enabled in-memory databases, make sure that you can populate the database from your data warehousing solution quickly to minimize startup time when you change database schemas. This may require a network of 10 GbE or faster. To support clients at this rate, you may have to revisit your data warehouse architecture to identify and eliminate bottlenecks.

Deep learning is a fast evolving computational paradigm and it is important to know what your requirements are in the near and long term to properly architect a storage system. The ImageNet database is often used as a reference when benchmarking deep learning frameworks and networks. The resolution of the images in ImageNet is 256x256. In practice, it is more common to find images at 1080p or 4k. Images in 1080p resolution are 30 times larger than those in ImageNet. Images in 4k resolution are 4 times larger than that (120x the size of ImageNet images). Uncompressed images are 5-10 times larger than compressed images. If your data cannot be compressed for some reason, for example if you are using a custom image format, the bandwidth requirements increase dramatically.

For AI-driven storage, it is suggested that you make use of deep learning framework features that build databases and archives rather than accessing small files directly; reading and writing many small files reduces performance on the network and local file systems. Storing files in formats such as HDF5, LMDB, or LevelDB can reduce metadata access to the filesystem, which helps performance. However, these formats can lead to their own challenges with additional memory overhead or requiring support for fast mmap() performance. All this means that you should plan to be able to read data at 150-200 MB/s per GPU for files at 1080p resolution. Consider more if you are working with 4k or uncompressed files.

NFS Storage

NFS can provide a good starting point for AI workloads on small GPU server configurations with properly sized storage and network bandwidth. NFS-based solutions can scale well for larger deployments, but be aware of possible single-node and aggregate bandwidth requirements and make sure your vendor of choice can meet them. As you scale your data center to need more than 10 GB/s, or your data center grows to hundreds or thousands of nodes, other technologies may be more efficient and scale better.

Generally, it is a good idea to start with NFS using one or more of the 10 Gb/s Ethernet connections on the DGX-1. After this is configured, it is recommended that you run your applications and check if IO performance is a bottleneck. Typically, NFS over 10Gb/s Ethernet provides up to 1.25 GB/s of IO throughput for large block sizes. If, in your testing, you see NFS performance that is significantly lower than this, check the network between the NFS server and the DGX-1 to make sure there are no bottlenecks (for example, a 1 GigE network connection somewhere, a misconfigured NFS server, or a smaller MTU somewhere in the network).
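A quick way to sanity-check NFS throughput is a simple sequential write/read test with dd. The sketch below assumes a hypothetical NFS_MOUNT variable pointing at your NFS mount; it defaults to /tmp so the commands run locally as a baseline.

```shell
# Rough sequential throughput check. NFS_MOUNT is an assumed variable:
# point it at your NFS mount; it defaults to /tmp for a local baseline.
TESTFILE="${NFS_MOUNT:-/tmp}/throughput_test.bin"

# Write a test file and flush it to storage. For a meaningful NFS test,
# increase count so the file is larger than local RAM (this uses 256 MiB).
dd if=/dev/zero of="$TESTFILE" bs=1M count=256 conv=fsync

# Read it back sequentially; dd prints the achieved throughput when done.
dd if="$TESTFILE" of=/dev/null bs=1M

rm -f "$TESTFILE"
```

If the reported read rate is far below the roughly 1.25 GB/s that a 10 Gb/s link can sustain, investigate the network path as described above.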

There are a number of online articles that list suggestions for tuning NFS performance on both the client and the server. For example:
  • Increasing Read, Write buffer sizes
  • TCP optimizations including larger buffer sizes
  • Increasing the MTU size to 9000
  • Sync vs. Async
  • NFS Server options
  • Increasing the number of NFS server daemons
  • Increasing the amount of NFS server memory
Linux is very flexible and by default most distributions are conservative about their choice of IO buffer sizes since the amount of memory on the client system is unknown. A quick example is increasing the size of the read buffers on the DGX-1 (the NFS client). This can be achieved with the following system parameters:
  • net.core.rmem_max=67108864
  • net.core.rmem_default=67108864
  • net.core.optmem_max=67108864

The values after the variable are example values (they are in bytes). You can change these values on the NFS client and the NFS server, and then run experiments to determine if the IO performance improves.

The previous examples are for the kernel read buffer values. You can also do the same thing for the write buffers, where you use wmem instead of rmem.
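As a sketch, the read and write buffer settings could be made persistent by adding them to /etc/sysctl.conf; the values are the example values above, in bytes, and should be tuned for your systems.

```shell
# /etc/sysctl.conf -- example kernel buffer sizes (bytes); tune per system
net.core.rmem_max=67108864
net.core.rmem_default=67108864
net.core.optmem_max=67108864
net.core.wmem_max=67108864
net.core.wmem_default=67108864
```

Apply the settings with sudo sysctl -p on both the NFS client and the NFS server, then rerun your IO experiments to see whether performance improves.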

You can also tune the TCP parameters in the NFS client to make them larger. For example, you could change the net.ipv4.tcp_rmem="4096 87380 33554432" system parameter.

This changes the TCP buffer size, for ipv4, to 4,096 bytes as a minimum, 87,380 bytes as the default, and 33,554,432 bytes as the maximum.

If you can control the NFS server, one suggestion is to increase the number of NFS daemons on the server. By default, NFS starts with only eight nfsd processes (eight threads), which, given that CPUs today have very large core counts, is not really enough.

You can find the number of NFS daemons in two ways. The first is to look at the process table and count the number of NFS processes via the $ ps aux | grep nfsd command.

The second way is to look at the NFS config file (for example, /etc/sysconfig/nfs) for an entry that says RPCNFSDCOUNT. This tells you the number of NFS daemons for the server.

If the NFS server has a large number of cores and a fair amount of memory, you can increase RPCNFSDCOUNT. There are cases where good performance has been achieved using 256 on an NFS server with 16 cores and 128GB of memory.

You should also increase RPCNFSDCOUNT when you have a large number of NFS clients performing I/O at the same time. For this situation, it is recommended that you should also increase the amount of memory on the NFS server to a larger number, such as 128 or 256GB. Don't forget that if you change the value of RPCNFSDCOUNT, you will have to restart NFS for the change to take effect.

One way to determine whether more NFS threads would help performance is to check the data in the /proc/net/rpc/nfs entry for the load on the NFS daemons. The output line that starts with th lists the number of threads, and the last 10 numbers are a histogram of the number of seconds the first 10% of threads were busy, the second 10%, and so on.

Ideally, you want the last two numbers to be zero or close to zero, indicating that all of the threads are rarely busy at the same time and you are not starving for threads. If the last two numbers are fairly high, you should add NFS daemons, because the NFS server has become the bottleneck. If the last two, three, or four numbers are zero, then some threads are probably not being used.
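The th line is easy to summarize with a short shell one-liner. The sample line below is illustrative output, not from a real server; on a live NFS server you would read it with grep ^th /proc/net/rpc/nfs.

```shell
# Sample "th" line: thread count, total busy count, then a 10-bucket
# histogram of busy time in seconds (illustrative values only).
TH_LINE="th 8 1042 12.5 3.1 0.8 0.2 0.1 0.0 0.0 0.0 0.0 0.0"

# Print the thread count and the last two histogram buckets; values near
# zero in the last buckets mean all threads are rarely busy at once.
echo "$TH_LINE" | awk '{print "threads="$2, "last_two="$(NF-1)","$NF}'
```

For the sample line this prints threads=8 last_two=0.0,0.0, suggesting that eight threads would be sufficient for that load.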

One other option, while a little more complex, can prove useful if the IO pattern becomes more write intensive: if you are not getting the IO performance you need, change the mount behavior on the NFS clients from “sync” to “async”.
By default, NFS file systems are mounted “sync”, which means the NFS client is told the data is on the NFS server only after it has actually been written to the storage, indicating the data is safe. Some systems will respond that the data is safe once it has made it to the write buffer on the NFS server rather than the actual storage.

Switching from “sync” to “async” means that the NFS server responds to the NFS client that the data has been received when the data is in the NFS buffers on the server (in other words, in memory). The data hasn’t actually been written to the storage yet, it’s still in memory. Typically, writing to the storage is much slower than writing to memory, so write performance with “async” is much faster than with “sync”. However, if, for some reason, the NFS server goes down before the data in memory is written to the storage, then the data is lost.

If you try using “async” on the NFS client (in other words, the DGX-1), ensure that the data on the NFS server is replicated somewhere else so that if the server goes down, there is always a copy of the original data. The reason is that if the NFS clients are using “async” and the NFS server goes down, data that is in memory on the NFS server will be lost and cannot be recovered.

NFS “async” mode is very useful for write IO, both streaming (sequential) and random IO. It is also very useful for “scratch” file systems where data is stored temporarily (in other words, not permanent storage or storage that is not replicated or backed up).
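As a sketch, an “async” NFS mount could be expressed in /etc/fstab as follows; the server name, export path, and mount point are placeholders, and the large rsize/wsize values are example tuning choices, not requirements.

```shell
# /etc/fstab -- example NFS entry; hostname and paths are placeholders
nfs-server:/export/scratch  /mnt/scratch  nfs  rw,async,rsize=1048576,wsize=1048576  0 0
```

Remount (or reboot) after editing, and remember the data-loss caveat: buffered writes that have not reached storage are lost if the server fails.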

If you find that the IO performance is not what you expected and your applications are spending a great deal of time waiting for data, then you can also connect NFS to the DGX-1 over InfiniBand using IPoIB (IP over IB). This is part of the DGX-1 software stack and can be easily configured. The main point is that the NFS server should be InfiniBand attached as well as the NFS clients. This can greatly improve IO performance.

Parallel File Systems

Other network file systems that require the installation of additional software or modification of the kernel itself are not supported by NVIDIA. This includes file systems such as Lustre, BeeGFS, General Parallel File System (formerly known as GPFS), and Gluster among others. These file systems can improve the aggregate IO performance as well as the reliability (fault tolerance).
If you require technical support from NVIDIA for your DGX-1, it is possible, although unlikely, that NVIDIA would ask you to uninstall the parallel file system and revert the kernel back to a baseline kernel, to help debug the problem.

Scaling Out Recommendations

Based on the general IO patterns of deep learning frameworks (see External Storage), below are suggestions for storage needs based on the use case. These are suggestions only and are to be viewed as general guidelines.
Table 3. Scaling out suggestions and guidelines
Use Case | Adequate Read Cache? | Network Type | Recommended Network File System Options
Data Analytics | NA | 10 GbE | Object storage, NFS, or other system with good multithreaded read and small-file performance
HPC | NA | 10/40/100 GbE, InfiniBand | NFS or HPC-targeted filesystem with support for large numbers of clients and fast single-node performance
DL, 256x256 images | Yes | 10 GbE | NFS or storage with good small-file support
DL, 1080p images | Yes | 10/40 GbE, InfiniBand | High-end NFS, HPC filesystem, or storage with fast streaming performance
DL, 4k images | Yes | 40 GbE, InfiniBand | HPC filesystem, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node
DL, uncompressed images | Yes | InfiniBand, 40/100 GbE | HPC filesystem, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node
DL, datasets that are not cached | No | InfiniBand, 10/40/100 GbE | Same as above; aggregate storage performance must scale to serve all applications simultaneously

As always, it is best to understand your own applications’ requirements to architect the optimal storage system.

Lastly, this discussion has focused only on performance needs. Reliability, resiliency and manageability are as important as the performance characteristics. When choosing between different solutions that meet your performance needs, make sure that you have considered all aspects of running a storage system and the needs of your organization to select the solution that will provide the maximum overall value.

6.2. Authenticating Users

To make the DGX useful, users need to be added to the system in some fashion so they can be authenticated to use it. Generally, this is referred to as user authentication. There are several ways this can be accomplished; each method has its own pros and cons.

6.2.1. Local

The first way is to create users directly on the DGX-1 server using the useradd command. Let’s assume you want to add a user dgxuser. You would first add the user via the following command.
$ sudo useradd -m -s /bin/bash dgxuser
Where -s refers to the default shell for the user and -m creates the user’s home directory. After creating the user you need to add them to the docker group on the DGX.
$ sudo usermod -aG docker dgxuser

This adds the user dgxuser to the group docker which is required for running Docker containers on the DGX.

Using local authentication on the DGX is simple but not without its issues. First, there have been occasions when an OS upgrade on the DGX requires reformatting all the drives in the appliance. If this happens, you must first make sure all user data is copied somewhere off the DGX-1 before the upgrade. Second, you will have to recreate the users, add them to the docker group, and copy their home data back to the DGX-1. This adds work and time to upgrading the system.
Important: Moreover, there is no RAID-1 on the OS drive so if it fails, you will lose all the users and everything in the home directories. It is highly recommended that you backup the pertinent files on the DGX-1 as well as /home for the users.

6.2.2. NIS or NIS+

Another authentication option is to use NIS or NIS+. In this case, the DGX-1 would be a client in the NIS/NIS+ configuration. As with using local authentication as previously discussed, there is the possibility that the OS drive in the DGX-1 could be overwritten during an upgrade (not all upgrades reformat the drives, but it’s possible). This means that the administrator may have to reinstall the NIS configuration on the DGX-1.

Also, remember that the DGX-1 has a single OS drive. If this drive fails, the administrator will have to re-configure the NIS/NIS+ configuration, therefore, backups are encouraged.
Note: It is possible that if, in the unlikely event that technical support for the DGX-1 is needed, the NVIDIA engineers may require the administrator to disconnect from the NIS/NIS+ server.

6.2.3. LDAP

A third option for authentication is LDAP (Lightweight Directory Access Protocol). It has become very popular in the clustering world, particularly for Linux. You can configure LDAP on the DGX-1 for user information and authentication from an LDAP server. However, as with NIS, there are possible repercussions.
  • The first is that the OS drive is a single drive. If the drive fails, you will have to rebuild the LDAP configuration (backups are highly recommended).
  • The second is that, as previously mentioned, if, in the unlikely event of needing tech support, you may be asked to disconnect the DGX-1 from the LDAP server so that the system can be triaged.

6.2.4. Active Directory

One other option for user authentication is connecting the DGX-1 to an Active Directory (AD) server. This may require the system administrator to install some extra tools on the DGX-1. The two cautions mentioned previously also apply to this approach: the single OS drive may be reformatted during an upgrade, or it may fail (again, backups are highly recommended). It also means that in the unlikely case of needing to involve NVIDIA technical support, you may be asked to take the system off the AD network and remove any added software (this is unlikely but possible).

6.3. Monitoring

Being able to monitor your systems is the first step in being able to manage them. NVIDIA provides some very useful command line tools that can be used specifically for monitoring the GPUs.

6.3.1. DCGM

NVIDIA Data Center GPU Manager™ (DCGM) simplifies GPU administration in the data center. It improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. It can perform the following tasks with very low overhead on the appliance.
  • Active health monitoring
  • Diagnostics
  • System validation
  • Policies
  • Power and clock management
  • Group configuration and accounting

The DCGM Toolkit comes with a User Guide that explains how to use the command-line tool called dcgmi, as well as an API Guide. In addition to the command-line tool, DCGM also comes with headers and libraries for writing your own tools in Python or C.

Rather than treat each GPU as a separate resource, DCGM allows you to group them and then apply policies or tuning options to the group. This also includes being able to run diagnostics on the group.

There are several best practices for using DCGM with the DGX-1 appliance. The first is that the command line tool can run diagnostics on the GPUs. You could create a simple cron job on the DGX-1 to check the GPUs and store the results either into a simple flat file or into a simple database.

There are three levels of diagnostics that can be run starting with level 1.
  • Level 1 runs in just a few seconds.
  • Level 3 takes about 4 minutes to run. An example of the output from running a level 3 diagnostic is below.
    Figure 18. Levels of diagnostics

It is fairly easy to parse this output looking for Error in the output. You can easily send an email or raise some other alert if an Error is discovered.
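For example, a cron entry such as the following could run a nightly level-1 diagnostic and collect any errors; the schedule and log paths are assumptions to adapt to your site.

```shell
# /etc/cron.d/dcgm-diag -- run a level-1 DCGM diagnostic nightly at 02:00,
# append the output to a log, then record any lines containing "Error"
0 2 * * * root dcgmi diag -r 1 >> /var/log/dcgm-diag.log 2>&1
5 2 * * * root grep -i error /var/log/dcgm-diag.log >> /var/log/dcgm-errors.log 2>&1
```

A follow-on script could email the error log to administrators or feed it into your monitoring system.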

A second best practice for utilizing DCGM is if you have a resource manager (in other words, a job scheduler) installed. Before the user’s job is run, the resource manager can usually perform what is termed a prologue. That is, any system calls before the user’s job is executed. This is a good place to run a quick diagnostic and also use DCGM to start gathering statistics on the job. Below is an example of statistics gathering:
Figure 19. Statistics gathering

When the user’s job is complete, the resource manager can run something called an epilogue. This is a place where the system can run some system calls for doing such things as cleaning up the environment or summarizing the results of the run including the GPU stats as from the above command. Consult the user’s guide to learn more about stats with DCGM.

If you create a set of prologue and epilogue scripts that run diagnostics you might want to consider storing the results in a flat file or a simple database. This allows you to keep a history of the diagnostics of the GPUs so you can pinpoint any issues (if there are any).
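A minimal prologue/epilogue pair might look like the following sketch. It assumes DCGM is running, that the job identifier variable (here $SLURM_JOB_ID) is supplied by your resource manager, that GPU group 1 exists, and that the dcgmi stats flags match your installed DCGM version; consult the DCGM User Guide before relying on it.

```shell
# prologue.sh -- runs before the user's job (sketch; variables are assumptions)
dcgmi diag -r 1                       # quick level-1 health check
dcgmi stats -g 1 --enable             # enable stats collection on GPU group 1
dcgmi stats -g 1 -s "$SLURM_JOB_ID"   # start recording stats for this job

# epilogue.sh -- runs after the user's job completes
dcgmi stats -x "$SLURM_JOB_ID"                               # stop recording
dcgmi stats -v -j "$SLURM_JOB_ID" >> /var/log/dcgm-jobs.log  # summarize to a log
```

Appending each job's summary to a log (or a simple database) builds the diagnostic history described above.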

A third way to effectively use DCGM is to combine it with a parallel shell tool such as pdsh. With a parallel shell you can run the same command across all of the nodes in a cluster or a specific subset of nodes. You can use it to run dcgmi to run diagnostics across several DGX-1 appliances or a combination of DGX-1 appliances and non-GPU enabled systems. You can easily capture this output and store it in a flat file or a database. Then you can parse the output and create warnings or emails based on the output.

Having all of this diagnostic output is also an excellent source of information for creating reports regarding topics such as utilization.

For more information about DCGM, see NVIDIA Data Center GPU Manager Simplifies Cluster Administration.

6.3.2. Using ctop For Monitoring

Containers can make monitoring a little more challenging than the classic system monitoring. One of the classic tools used by system administrators is top. By default, top displays the load on the system as well as the ordered list of processes on the system.

There is a top-like tool for Docker containers and runC named ctop. It lists real-time metrics for multiple containers, is easy to install, and continuously updates the resource usage of the running containers.
Attention: ctop runs on a single DGX-1 only. Most likely you will have to log in to the specific node and run ctop. A best practice is to use tmux and create a pane running ctop for each DGX-1 if the number of systems is fairly small (roughly fewer than 10).

6.3.3. Monitoring A Specific DGX Using nvidia-smi

As previously discussed, DCGM is a great tool for monitoring GPUs across multiple nodes. Sometimes, a system administrator may want to monitor a specific DGX in real time. An easy way to do this is to log in to the DGX and run nvidia-smi in conjunction with the watch command.

For example, you could run the command watch -n 1 nvidia-smi that runs the nvidia-smi command every second (-n 1 means to run the command with 1 second intervals). You could also add the -d option to watch so that it highlights changes or differences since the last time it was run. This allows you to easily see what has changed.

Just like ctop, you can use nvidia-smi and watch in a pane in a tmux terminal to keep an eye on a relatively small number of DGX servers.

6.4. Managing Resources

One of the common questions from DGX-1 customers is how they can effectively share the DGX-1 between users without any inadvertent problems or data exchange. The generic phrase for this is resource management; the tools are called resource managers. They can also be called schedulers or job schedulers. These terms are often used interchangeably.

You can view everything on the DGX as a resource. This includes memory, CPUs, GPUs, and even storage. Users submit a request to the resource manager with their requirements and the resource manager assigns the resources to the user if they are available and not being used. Otherwise, the resource manager puts the request in a queue to wait for the resources to become available. When the resources are available, the resource manager assigns the resources to the user request.

Resource management so that users can effectively share a centralized resource (in this case, the DGX-1 appliance) has been around a long time. There are many open-source solutions, mostly from the HPC world, such as PBS Pro, Torque, SLURM, Openlava, SGE, HTCondor, and Mesos. There are also commercial resource management tools such as UGE and IBM Spectrum LSF.

For more information about getting started, see Job scheduler.

If you haven’t used job scheduling before you should perform some simple experiments first to understand how it works. For example, take a single server and install the resource manager. Then try running some simple jobs using the cores on the server.

6.4.1. SLURM Example

As an example, assume SLURM is installed and configured on a DGX-1 or DGX Station. The first step is to plan how you want to use the DGX system. The first, and by far the easiest, configuration is to assume that a user gets exclusive access to the entire node. In this case the user gets the entire DGX: access to all 8 GPUs and all cores. No other users can use the resources while the first user is using them.

The second way is to make the GPUs a consumable resource. The user then asks for the number of GPUs they need, ranging from 1 to 8.

There are two public git repositories containing information on SLURM and GPUs that can help you get started with scheduling jobs.
Note: You may have to configure SLURM to match your specifications.

At a high level, there are two basic options for configuring SLURM with GPUs and DGX systems. The first is to use what is called exclusive mode access, and the second allows each GPU to be scheduled independently of the others.

Simple GPU Scheduling With Exclusive Node Access

If you're not interested in allowing multiple jobs per compute node, you may not need to make SLURM aware of the GPUs in the system, and the configuration can be greatly simplified.

One way of scheduling GPUs without making use of GRES (Generic Resource Scheduling) is to create partitions or queues for logical groups of GPUs. For example, grouping nodes with P100 GPUs into a P100 partition would result in something like the following:
$ sinfo -s
p100     up   infinite         4/9/3/16  node[212-213,215-218,220-229]
The corresponding partition configuration via the SLURM configuration file, slurm.conf, would be something like the following:
PartitionName=p100 Default=NO DefaultTime=01:00:00 State=UP Nodes=node[212-213,215-218,220-229]

If a user requests a node from the p100 partition, then they would have access to all of the resources in that node, and other users would not. This is what is called exclusive access.

This approach can be advantageous if you are concerned that sharing resources might result in performance issues on the node or if you are concerned about overloading the node resources. For example, in the case of a DGX-1, if you think multiple users might overwhelm the 8TB NFS read cache, then you might want to consider using exclusive mode. Or, if you are concerned that the users may use all of the physical memory, causing page swapping with a corresponding reduction in performance, then exclusive mode might be useful.

Scheduling Resources At The Per GPU Level

A second option for using SLURM is to treat the GPUs as a consumable resource and allow users to request them in integer units (1, 2, 3, and so on). SLURM can be made aware of GPUs as a consumable resource to allow jobs to request any number of GPUs. This feature requires job accounting to be enabled first; for more information, see Accounting and Resource Limits. A very quick overview is below.

The SLURM configuration file, slurm.conf, needs parameters set to enable cgroups for resource management and GPU resource scheduling. An example is the following:
# General

# Scheduling

# Logging and Accounting
DebugFlags=CPU_Bind,gres                # show detailed information in Slurm logs about GPU binding and affinity
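The excerpt above is abbreviated. A fuller sketch of the relevant slurm.conf entries is below; the parameter names come from standard SLURM documentation, but the specific values are assumptions to adapt to your site.

```shell
# General -- use cgroups for process tracking and task containment
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Scheduling -- treat cores/memory as consumable and define gpu as a GRES type
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu

# Logging and Accounting
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageType=accounting_storage/slurmdbd
DebugFlags=CPU_Bind,gres   # show detailed GPU binding and affinity in the logs
```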
The partition information in slurm.conf defines the available GPUs for each resource. Here is an example:
# Partitions
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 MaxNodes=2 State=UP DefMemPerCPU=3000
The way that resource management is enforced is through cgroups. The cgroups configuration requires a separate configuration file, cgroup.conf, such as the following:
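A minimal cgroup.conf sketch is shown below; the option names come from the SLURM cgroup.conf documentation, and the device file path is an assumption to adjust for your installation.

```shell
# cgroup.conf -- constrain jobs to their allocated cores, memory, and devices
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes          # required so jobs only see their allocated GPUs
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```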

To schedule GPU resources requires a configuration file to define the available GPUs and their CPU affinity. An example configuration file, gres.conf, is below:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
To run a job utilizing GPU resources requires using the --gres flag with the srun command. For example, to run a job requiring a single GPU the following srun command can be used.
$ srun --gres=gpu:1 nvidia-smi

You also may want to restrict memory usage on shared nodes so that one user does not cause swapping that affects other users or system processes. A convenient way to do this is with memory cgroups.

Using memory cgroups to restrict jobs to their allocated memory resources requires setting kernel boot parameters. On Ubuntu systems, this is configurable via the file /etc/default/grub.
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
After editing this file, run sudo update-grub and reboot for the change to take effect.

6.5. Networking

Networking DGX-1 appliances is an important topic because of the need to provide data to the GPUs for processing. GPUs are remarkably faster than CPUs for many tasks, particularly deep learning. Therefore, the network principles used for connecting CPU servers may not be sufficient for DGX-1 appliances. This is particularly important as the number of DGX-1 appliances grows over time.

To understand best practices for networking the DGX-1 and for planning for future growth, it is best to start with a brief review of the DGX-1 appliance itself. Recall that the DGX-1 comes with four EDR InfiniBand cards (100 Gb/s each) and two 10Gb/s Ethernet cards (copper). These networking interfaces can be used for connecting the DGX-1 to the network for both communications and storage.
Figure 20. Networking interfaces

Notice that each pair of GPUs is connected to a single PCIe switch that is on the system board. The switch also connects to an InfiniBand (IB) network card. To reduce latency and improve throughput, any network traffic from these two GPUs should go to the associated IB card. This is why there are four IB cards in the DGX-1 appliance.

6.5.1. InfiniBand Networking

If you want to use the InfiniBand (IB) network to connect DGX appliances, theoretically, you only have to use one of the IB cards. However, this pushes data traffic over the QPI link between the CPUs, which is a very slow link for GPU traffic (in other words, it becomes a bottleneck). A better solution is to use two IB cards, one connected to each CPU: IB0 and IB2, IB1 and IB3, IB0 and IB3, or IB1 and IB2. This greatly reduces the traffic that has to traverse the QPI link. The best performance always comes from using all four of the IB links to an IB switch.

The best approach is to use IB links to connect all four IB cards to an IB fabric. This results in the best performance (full bisectional bandwidth and lowest latency) if you are using multiple DGX appliances for training.

Typically, the smallest IB switch comes with 36-ports. This means a single IB switch could accommodate nine (9) DGX-1 appliances using all four IB cards. This allows 400 Gb/s of bandwidth from the DGX-1 to the switch.

If your applications do not need the bandwidth between DGX-1 appliances, you can use two IB connections per DGX-1 as mentioned previously. This allows you to connect up to 18 DGX-1 appliances to a single 36-port IB switch.
Note: It is not recommended to use only a single IB card, but if for some reason that is the configuration, then you can connect up to 36 DGX-1 appliances to a single switch.

For larger numbers of DGX-1 appliances, you will likely have to use two levels of switching. The classic HPC configuration is to use 36-port IB switches for the first level (sometimes called leaf switches) and connect them to a single large core switch, which is sometimes called a director class switch. The largest director class InfiniBand switch has 648 ports. You can use more than one core switch but the configuration will get rather complex. If this is something you are considering, please contact your NVIDIA sales team for a discussion.

For two tiers of switching, if all four IB cards per DGX-1 appliance are used to connect to a 36-port switch, and there is no over-subscription, the largest number of DGX-1 appliances per leaf switch is 4. This is 4 ports from each DGX-1 into the switch for a total of 16, leaving 16 uplinks from the leaf switch to the core switch (the director class switch). A total of 40x 36-port leaf switches can be connected to the 648-port core switch (648/16, rounded down). This results in 160 DGX-1 appliances being connected with full bi-sectional bandwidth.
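The sizing arithmetic above can be checked with a small calculation. The following is a hypothetical helper using the port counts from this section (36-port leaf switches, a 648-port core switch, and four IB links per appliance):

```shell
# Two-tier fat-tree sizing with no over-subscription: half of each leaf's
# ports face the DGX-1 appliances, half are uplinks to the core switch.
leaf_ports=36        # 36-port leaf switch
core_ports=648       # 648-port director-class core switch
links_per_dgx=4      # IB connections per DGX-1 appliance

dgx_per_leaf=$(( (leaf_ports / 2) / links_per_dgx ))   # appliances per leaf
uplinks_per_leaf=$(( dgx_per_leaf * links_per_dgx ))   # uplinks to core
leaves=$(( core_ports / uplinks_per_leaf ))            # leaf switches per core
total_dgx=$(( leaves * dgx_per_leaf ))                 # total appliances

echo "${dgx_per_leaf} DGX-1 per leaf, ${leaves} leaves, ${total_dgx} DGX-1 total"
```

Setting links_per_dgx to 2 reproduces the 2:1 over-subscription case described below (9 appliances per leaf, 36 leaves, 324 appliances total).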

You can also use what is termed over-subscription in designing the IB network. Over-subscription means that the bandwidth from an uplink is less than the bandwidth coming into the unit (in other words, poorer bandwidth performance). If we use 2:1 over-subscription from the DGX-1 appliances to the first level of switches (36-port leaf switches), then each DGX-1 appliance is only using two IB cards to connect to the switches. This results in less bandwidth than if we used all four cards and also higher latency.

If we keep the network bandwidth from the leaf switches to the core directory switch as 1:1 (in other words, no over-subscription, full bi-sectional bandwidth), then we can put nine (9) DGX-1 appliances into a single leaf switch (a total of 18 ports into the leaf switch from the DGX appliances and 18 uplink ports to the core switch). The result is that a total of 36 leaf switches can be connected to the core switch. This allows a grand total of 324 DGX-1 appliances to be connected together.

You can tailor the IB network even further by using over-subscription from the leaf switches to the core switch. This can be done using four IB connections to a leaf switch from each DGX appliance and then doing 2:1 over-subscription to the core switch or even using two IB connections to the leaf switches and then 2:1 over-subscription to the core switch. These designs are left up to the user to determine but if this is something you want to consider, please contact your NVIDIA sales team for a discussion.

Another important aspect of InfiniBand networking is the Subnet Manager (SM). The SM manages the IB network. Only one SM is active on the fabric at any one time, but you can have other SMs running and ready to take over if the active SM crashes. Choosing how many SMs to run and where to run them can have a major impact on the design of the cluster.

The first decision to make is where you want to run the SMs. They can be run on the IB switches if you desire. This is called a hardware SM since it runs on the switch hardware. The advantage of this is that you do not need any additional servers to run the SM. Running the SM on a node is called a software SM. A disadvantage of a hardware SM is that if there is a large amount of IB traffic, the switch's embedded processor can struggle to keep up. For heavy IB traffic and for larger networks, it is a best practice to use a software SM on a dedicated server.

The second decision to make is how many SMs you want to run. At a minimum, you will have to run one SM. The least expensive solution is to run a single hardware SM. This will work fine for small clusters of DGX-1 appliances (perhaps 2-4). As the number of units grows, you will want to consider running two SMs at the same time to get HA (High Availability) capability. The reason you want HA is that more users are on the cluster, and having it go down has a larger impact than with just a small number of appliances.

As the number of appliances grows, consider running the SMs on dedicated servers (software SMs). You will also want to run at least two SMs for the cluster. Ideally, this means two dedicated servers for the SMs, but there may be a better solution that solves some other problems as well: a master node.
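As a sketch of what setting up a software SM on a dedicated Ubuntu server might look like (the package and service names here are assumptions; they vary by distribution and Mellanox OFED version):

```shell
# Install the opensm subnet manager and the IB diagnostic tools.
sudo apt-get install opensm infiniband-diags

# Start opensm now and have it come up at boot.
sudo systemctl enable opensm
sudo systemctl start opensm

# Query the fabric to verify which SM is currently the master.
sudo sminfo
```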

6.5.2. Ethernet Networking

Each DGX-1 system comes with two 10Gb/s NICs. These can be used to connect the systems to the local network for a variety of functions such as logins and storage traffic. As a starting point, it is recommended to push NFS traffic over these NICs to the DGX-1. You should monitor the impact of IO on the performance of your models in this configuration.

If you need to go to more than one level of Ethernet switching to connect all of the DGX-1 units and the storage, be careful of how you configure the network. More than likely, you will have to enable the spanning tree protocol to prevent loops in the network. The spanning tree protocol can impact network performance, therefore, you could see a decrease in application performance.

The InfiniBand NICs that come with the DGX-1 can also be used as Ethernet NICs running TCP. The ports on the cards are QSFP28 so you can plug them into a compatible Ethernet network or a compatible InfiniBand network. You will have to add some software to the appliance and change the networking but you can use the NICs as 100GigE Ethernet cards.

For more information, see Switch Infiniband and Ethernet in DGX-1.

6.5.3. Bonded NICs

The DGX-1 provides two 10GbE ports. Out of the factory, these two ports are not bonded, but they can be bonded if desired. In particular, you can configure VLAN-tagged, bonded NICs across the two 10GbE ports.

Before bonding the NICs together, ensure you are familiar with the following:
  • Ensure your network team is involved because you will need to choose a bonding mode for the NICs.
  • Ensure you have a working network connection to pull down the VLAN packages. To do so, first setup a basic, single NIC network (no VLAN/bonding) connection and download the appropriate packages. Then, reconfigure the switch for LACP/VLANs.
Tip: Since the networking goes up and down throughout this process, it's easier to work from a remote console.
The process below walks through the steps of an example for bonding the two NICs together.
  1. Edit the /etc/network/interfaces file to set up an interface on a standard network so that you can access the required packages. The address values below are placeholders for your own network settings.
    auto em1
    iface em1 inet static
        address <your-ip-address>
        netmask <your-netmask>
        gateway <your-gateway>
  2. Bring up the updated interface.
    sudo ifdown em1 && sudo ifup em1
  3. Pull down the required bonding and VLAN packages.
    sudo apt-get install vlan
    sudo apt-get install ifenslave
  4. Shut down the networking.
    sudo stop networking
  5. Add the following lines to /etc/modules to load appropriate drivers.
    echo "8021q" | sudo tee -a /etc/modules
    echo "bonding" | sudo tee -a /etc/modules
  6. Load the drivers.
    sudo modprobe 8021q
    sudo modprobe bonding
  7. Reconfigure your /etc/network/interfaces file. There are some configuration parameters that will be customer network dependent and you will want to work with one of your network engineers.
    The following example creates a bonded network over em1/em2 with a static IP address and VLAN ID 430. You specify the VLAN ID in the NIC name (bond0.###). Also notice that this example uses bond-mode 4 (802.3ad/LACP); which mode you use depends on your situation and your network.
    auto lo
    iface lo inet loopback
    # The following 3 sections create the bond (bond0) and associated network ports (em1, em2)
    auto bond0
    iface bond0 inet manual
    bond-mode 4
    bond-miimon 100
    bond-slaves em1 em2
    auto em1
    iface em1 inet manual
    bond-master bond0
    bond-primary em1
    auto em2
    iface em2 inet manual
    bond-master bond0
    # This section creates a VLAN on top of the bond.  The naming format is device.vlan_id
    auto bond0.430
    iface bond0.430 inet static
    address <your-ip-address>
    netmask <your-netmask>
    dns-search company.net
    vlan-raw-device bond0
  8. Restart the networking.
    sudo start networking
  9. Bring up the bonded interfaces.
    sudo ifup bond0
  10. Engage your network engineers to re-configure LACP and VLANs on switch.
  11. Test the configuration.

6.6. SSH Tunneling

Some environments are not configured, or limit access (by firewall or otherwise), to compute nodes within an intranet. When running a container with a service or application exposed on a port, such as DIGITS, remote access to that port on the DGX-1 must be enabled from the remote system. The following steps use PuTTY to create an SSH tunnel from a remote system into the DGX-1. If you are using a command-line SSH utility, you can set up tunneling via the -L option.
Note: A PuTTY SSH tunnel session must be up, logged in, and running for the tunnel to function. SSH tunnels are commonly used for the following applications (with listed port numbers).
Table 4.
Application   Port         Notes
DIGITS        5000         If multiple users, each selects own port
VNC Viewer    5901, 6901   5901 for VNC app, 6901 for web app
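If you are using a command-line SSH client instead of PuTTY, the equivalent tunnel can be created with the -L option. The user and host names below are placeholders; substitute your own:

```shell
# Forward local port 5000 to the DIGITS port on the DGX-1.
# -N skips opening a remote shell, so the session only carries the tunnel.
ssh -N -L 5000:localhost:5000 user@dgx-hostname

# While the session is up, browse to http://localhost:5000 locally.
```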
To create an SSH Tunnel session with PuTTY, perform the following steps:
  1. Run the PuTTY application.
  2. In the Host Name field, enter the host name you want to connect to.
  3. In the Saved Sessions section, enter a name to save the session under and click Save.
  4. Click Category > Connection, click + next to SSH to expand the section.
  5. Click Tunnels for Tunnel configuration.
  6. Add the DIGITS port for forwarding.
    1. In the Source Port field, enter 5000, which is the port you need to forward for DIGITS.
    2. In the Destination field, enter localhost:5000 for the local port that you will connect to.
  7. Click Add to save the added tunnel.
  8. In the Category section, click Session.
  9. In the Saved Sessions section, click the name you previously created, then click Save to save the added tunnels.
To use PuTTY with tunnels, perform the following steps:
  1. Run the PuTTY application.
  2. In the Saved Sessions section, select the Save Session that you created.
  3. Click Load.
  4. Click Open to start the session and log in. The SSH tunnel is created and you can connect to the remote system via the tunnel. For example, for DIGITS, you can start a web browser and connect to http://localhost:5000.

6.7. Master Node

A master node, also sometimes called a head node, is a very useful server within a cluster. Typically, it runs the cluster management software, the resource manager, and any monitoring tools that are used. For smaller clusters, it is also used as a login node for users to create and submit jobs.

For clusters of any size that include the DGX-1, a master node can be very helpful. It allows the DGX-1 to focus solely on computing rather than any interactive logins or post-processing that users may be doing. As the number of nodes in a cluster increases, it is recommended to use a master node.

It is recommended to size the master node for things such as:
  • Interactive user logins
  • Resource management (running a job scheduler)
  • Graphical pre-processing and post-processing
    • Consider a GPU in the master node for visualization
  • Cluster monitoring
  • Cluster management

Since the master node becomes an important part of the operation of the cluster, consider using RAID-1 for the OS drive in the master node as well as redundant power supplies. This can help improve the uptime of the master node.

For smaller clusters, you can also use the master node as an NFS server by adding storage and more memory to the master node and NFS export the storage to the cluster clients. For larger clusters, it is recommended to have dedicated storage, either NFS or a parallel file system.

For InfiniBand networks, the master node can also be used for running the software SM. If you want some HA for the SM, run the primary SM on the master node and use an SM on the IB switch as a secondary SM (hardware SM).

As the cluster grows, it is recommended to consider splitting the login and data processing functions from the master node to one or more dedicated login nodes. This is also true as the number of users grows. You can run the primary SM on the master node and other SMs on the login nodes. You could even use the hardware SMs on the switches as backups.

7. NVIDIA NGC Cloud Services Best Practices For AWS

The NVIDIA® GPU Cloud™ (NGC) runs on associated cloud providers such as Amazon Web Services (AWS). This section provides some tips and best practices for using NVIDIA NGC Cloud Services.

The following tips and best practices are from NVIDIA and should not be taken as best practices from AWS. It’s best to consult with AWS before implementing any of these best practices. For specific AWS documentation, see the Amazon Web Services web page.

7.1. Users And Authentication

The first step in using NGC is to follow the instructions provided in the NGC Getting Started Guide. The EC2 key pairs you create are tied to a specific region; therefore, if you are going to change regions, be sure you use the correct key for that region. A good practice is to include the region in the name of the key file.

Next, spend some time getting to know AWS IAM (Identity and Access Management). At a high level, IAM allows you to securely create, manage, and control user (individual) and group access to your AWS account. It is very flexible and provides a rich set of tools and policies for managing your account.

AWS provides some best practices around IAM that you should read immediately after creating your AWS account. There are some very important points in regard to IAM. The first thing you should be aware of is that when you create your account on AWS, you are essentially creating a root account. If someone gains access to your root credentials, they can do anything they want to your account including locking you out and running up a large bill. Therefore, you should immediately lock away your root account access keys.

After you've secured your root credentials, create an individual IAM user. This is very similar to creating a user on a *nix system. It allows you to create a unique set of security credentials which can be applied to a group of users or to individual users.

You should also assign a user to a group. The groups can have pre-assigned permissions to resources - much like giving permissions to *nix users. This allows you to control access to resources. AWS has some pre-defined groups that you can use. For more information about pre-defined groups on AWS, see Creating Your First IAM Admin User and Group. For IAM best practices, see the AWS Identity And Access Management User Guide.

7.1.1. User Credentials In Other Regions

The credentials that you created are only good for the region where you created them. If you created them in us-east-1 (Virginia), then you can’t use them for the region in Japan. If you want to only use the region where you created your credentials, then no action is needed. However, if you want the option to run in different regions, then you have two choices:
  • Option 1: create credentials in every region where you plan to run, or
  • Option 2: copy your credentials from your initial region to all other regions.

Option 1 isn’t difficult but it can be tedious depending upon how many regions you might use. To keep track of the different keys, you should include the region name in the key name.

Option 2 isn’t too difficult thanks to a quick and simple bash script:
myKEYNAME="bb-key"
myKEYFILE="${HOME}/.ssh/id_rsa.pub"
if [ ! -f "${myKEYFILE}" ]; then
  echo "I can't find that file: ${myKEYFILE}"
  exit 2
fi
myKEY=$(cat "${myKEYFILE}")
for region in $( aws --output text ec2 describe-regions | cut -s -f3 | sort ); do
        echo "importing ${myKEYNAME} into region ${region}"
        aws --region "${region}" ec2 import-key-pair --key-name "${myKEYNAME}" --public-key-material "${myKEY}"
done

In this script, the key name for your first region is bb-key and is assigned to myKEYNAME. The file that contains the public key is ~/.ssh/id_rsa.pub, assigned to myKEYFILE. After defining those two variables, you can run the script and it will import that key to all other AWS regions.
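As a hypothetical follow-up check (assuming the same key name as above), you can confirm that the key pair now exists in every region:

```shell
# List the imported key pair in each region; a missing key produces an error.
myKEYNAME="bb-key"
for region in $( aws --output text ec2 describe-regions | cut -s -f3 | sort ); do
        echo -n "${region}: "
        aws --region "${region}" ec2 describe-key-pairs --key-names "${myKEYNAME}" \
                --query 'KeyPairs[0].KeyName' --output text
done
```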

7.2. Data Transfer Best Practices

One of the fundamental questions users have around best practices for AWS is uploading and downloading data from AWS. This can be a very complicated question and it’s best to engage with AWS to discuss the various options. For more information about uploading, downloading, and managing objects, see the Amazon Simple Storage Service Console User Guide.

In the meantime, to help you get started, the following sections offer ideas for how to upload data to AWS.

7.2.1. Upload Directly To EC2 Instance

When you first begin to use AWS, you may have some data on your laptop, workstation, or company system, and want to upload it to an EC2 instance that is running. This means that you want to directly upload data to the compute instance you started. A quick and easy way to do this is to use scp to copy the data from your local system to the running instance. You'll need the IP address or name of the instance as well as your AWS key. An example command using scp is the following:
$ cd data
$ scp -i my-key-pair.pem -r * ubuntu@public-dns-name:/home/ubuntu

In this example, the training data is located in a subdirectory called data on your system. You cd into that directory and then recursively upload all the data in that directory to the EC2 instance that was started with the NVIDIA Volta Deep Learning AMI. You will need to use your AWS keys to upload to the instance. The -r option means recursive, so everything in the data directory, including subdirectories, is copied to the AWS instance.

Finally, you need to specify the user on the instance (ubuntu), the machine name (public-dns-name), and the full path where the data is to be uploaded (/home/ubuntu, which is the default home directory for the ubuntu user).

There are a few key points in using scp. The first is that you need to have the SSH port (port 22) open on your AWS instance and your local system. This is done via security groups.
Note: There are other ways to open and block ports in AWS, however, they are not covered in this guide.

The second thing to note is that scp is single-threaded. That is, a single thread on your system is doing the data transfer. This may not be enough to saturate the NIC (Network Interface Card) on your system. In that case, you might want to break up the data into chunks and upload them to the instance. You can upload them serially (one after the other), or you can upload them in groups (in essence, in parallel).

There are a couple of options you can use for uploading the data that might help. The first one is using tar to create a tar file of a directory and all subdirectories. You can then upload that tar file to the running AWS EC2 instance.

Another option is to compress the tar file using one of many compression utilities (for example, gzip, bzip2, xz, lzma, or 7zip). There are also parallel versions of compression tools such as pigz (parallel gzip), lzip (uses lzlib), pbzip2 (parallel bzip2), pxz (parallel xz), or lrzip (parallel lzma utility).
Note: You can use your favorite compression tool in combination with tar via the following option:
$ tar --use-compress-program=… -cf file.tar <files-to-archive>
The combination allows you to specify the path to the compression utility you want to use with the --use-compress-program option.
After tarring and compressing the data, upload the file to the instance using scp. Then, ssh into the instance and uncompress and untar the file before running your framework.
Note: Compressing or creating a tar file does not encrypt the data. Encryption is not covered in this guide; however, scp encrypts the file during the transfer unless you have specifically told it not to.
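Putting these pieces together, a hypothetical end-to-end upload might look like the following. The instance name and key file are placeholders, and pigz is assumed to be installed (falling back to gzip if it is not):

```shell
# Pick a parallel compressor if available, otherwise fall back to gzip.
COMPRESS=$(command -v pigz || command -v gzip)

# Tar and compress the data directory in one step.
tar --use-compress-program="${COMPRESS}" -cf data.tar.gz data/

# Upload the single compressed file, then unpack it on the instance.
scp -i my-key-pair.pem data.tar.gz ubuntu@public-dns-name:/home/ubuntu
ssh -i my-key-pair.pem ubuntu@public-dns-name \
        "tar -xzf /home/ubuntu/data.tar.gz -C /home/ubuntu"
```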

Another utility that might increase the upload speed is bbcp. It is a point-to-point network file copy application that can use multiple threads to increase the upload speed.

As explained, there are many options for uploading data directly to an AWS EC2 instance. There are also some things working against you to reduce the upload speed. One big impediment to improving upload speeds is your connection to the Internet and the network between you and the AWS instance.

If you have a 100Mbps connection to the Internet or are connecting from home using a cable or phone modem, then your upload speeds might be limited compared to a 1 Gbps connection (or faster). The best advice is to test data transfer speeds using a variety of file sizes and number of files. You don’t have to do an exhaustive search but running some tests should help you get a feel for data upload speeds.
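A simple way to run such tests is to generate files of known sizes from /dev/urandom (as the examples later in this section do) and time the transfers. The scp target below is a placeholder:

```shell
# Create test files of a few sizes; status=none quiets dd's progress output.
for size_mb in 1 10 100; do
    dd if=/dev/urandom of=test_${size_mb}MB.bin bs=1M count=${size_mb} status=none
done
ls -lh test_*.bin

# Then time an upload of each set, for example:
#   time scp -i my-key-pair.pem test_*.bin ubuntu@public-dns-name:/home/ubuntu
```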

Another aspect you have to consider is the packet size on your network. The network inside your company or inside your home may be using jumbo frames which set the frame size to 9,000 (MTU of 9,000). This is great for closed networks because the frame size can be controlled so that you get jumbo frames from one system to the next. However, as soon as those data packets hit the Internet, they drop to the normal frame size of 1,500. This means you have to send many more packets to upload the data. This causes more CPU usage on both sides of the data transfer.

Jumbo frames also reduce the percentage of the packet that is devoted to overhead (not data). Jumbo frames are therefore more efficient when sending data from system to system. But as soon as the data hits the Internet, the percentage devoted to overhead increases and you end up having to send more packets to transfer the data.

7.2.2. Upload Data To S3

Another option is to upload the data to an AWS S3 bucket. S3 is an object store that basically has unlimited capacity. It is a very resilient and durable storage system so it’s not necessary to store your data in multiple locations. However, S3 is not POSIX compliant so you can’t use applications that read and write directly to S3 without rewriting the IO portions of your code.

S3 is a solution for storing your input and output data for your applications because it’s so reliable and durable. To use the data, you copy it from S3 to the instances you are using and copy data from the instance to S3 for longer-term storage. This allows you to shut down your instances and only pay for the data stored in S3.

Fundamentally, S3 is an object store (not POSIX compliant), that can scale to extremely large sizes and is very durable and resilient. S3 does not understand the concept of directories or folders, meaning the storage is flat. However, you still use folders and directories to create a hierarchy. These directories just become part of the name of the object in S3. Applications that understand how to read the object names can present you a view of the objects that includes directories or folders.

There are multiple ways to copy data into S3 before you start up your instances. AWS makes a set of CLI (Command Line Interface) tools available that can do the data transfer for you. The basic command is simple. Here is an example:
$ aws s3 cp <local-file> s3://mybucket/<location>

This command copies a local file on your laptop or server to your S3 bucket. In the command, s3://mybucket/<location> is the location in your S3 bucket. This command doesn’t use any directories or folders, instead, it puts everything into the root of your S3 bucket.

A slightly more complex command might look like the following:
$ aws s3 cp /work/TEST s3://data-compression/example2 --recursive --include "*"

This copies an entire directory on your host system (such as your laptop) to a directory on S3 with the name data-compression/example2. It copies the entire contents of the local directory because of the --recursive flag and the --include "*" option. The command will create subdirectories on S3 as needed. Remember, subdirectories don't really exist; they are part of the object name on S3.

S3 has the concept of a multi-part upload. This was designed for uploading large files to S3 so that if a network error is encountered you don’t have to start the upload all over again. It breaks the object into parts and uploads these parts, sometimes in parallel, to S3 and re-assembles them into the object once all of the uploads are done.

Each part is a contiguous portion of the object’s data. If you want to do multi-part upload manually, then you can control how the object parts are uploaded to S3. They can be in any order and can even be done in parallel. After all of the parts are uploaded, you then have to assemble them into the final object. The general rule of thumb is that when the object is greater than 100MB, using multi-part upload is a good idea. For objects larger than 5GB, multi-part upload is mandatory.

While multi-part upload was designed to upload large files, it also helps improve your throughput since you can upload the parts in parallel. One of the nice features of the AWS CLI tools is that all aws s3 cp commands use multi-part automatically. This includes aws s3 mv and aws s3 sync. You don’t have to do anything manually. Consequently, any uploads using this tool can be very fast.
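The CLI's multi-part behavior can also be tuned through its s3 configuration settings. The values below are only illustrative starting points, not recommendations:

```shell
# Increase the number of concurrent transfer threads and adjust part sizes.
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 64MB
```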

Another option is to use open-source tools for uploading data to S3 in parallel. The concept is not to use multi-part upload but to upload whole objects in parallel to improve performance. One tool that is worth examining is s3-parallel-put.

You can also use tar and a compression tool to collect many objects and compress them before uploading to S3. This can result in fast performance because the number of files has been reduced and the amount of data to be transferred is reduced. However, S3 isn’t a POSIX compliant file system so you cannot uncompress nor untar the data within S3 itself. You would need to copy the data to a POSIX file system first and then perform the actions. Alternatively, you could use AWS Lambda to perform these operations, but that is outside the scope of this document.

For a video tutorial about S3, see AWS S3 Tutorial For Beginners - Amazon Simple Storage Service.

S3 Data Upload Examples

To understand how the various upload options impact performance, let's look at three examples. All three examples test uploading data from an EC2 instance to an S3 bucket. A d2.8xlarge instance was used because it has a large amount of memory (244GB). The instance has a 10GbE connection along with 36 vCPUs (18 physical cores with Hyper-Threading).

All data is created using /dev/urandom. Each example has a varying number of files and file sizes.

Example 1: Testing s3-parallel-put For Uploading

This example is fairly simple. It follows an astronomy pattern for the sake of discussion. It has two file sizes, 500MB and 5GB. For every three 500MB files, there are two 5GB files. All of the files were created in a single directory, with a total of 50 files consuming 115GB. In total there are 20x 5GB files and 30x 500MB files.

This test used the s3-parallel-put tool to upload all of the files. The wall time was recorded when the process started and when it ended giving an elapsed time for the upload. The number of simultaneous uploads was varied from 1 to 32 which indicated how many files were being uploaded at the same time. The data was then normalized by the run time for uploading one file at a time.

The results are presented in the chart below along with the theoretical speedup (perfect scaling).
Figure 21. Using the s3-parallel-put tool to upload

Notice that the scaling is fairly good up to about 8 processes. After that, the results from using the tool fall behind the theoretical scaling; from 24 to 32 processes there is basically no improvement in upload time.

Example 2: Testing s3-parallel-put, AWS CLI, And Tar For Uploading

This example uploads a large number of smaller files and uploads them from the instance to an S3 bucket. For this test, the following file distribution was used:
  • 500,000 2KB files
  • 9,500,000 100KB files
All of the files were evenly split across 10 directories.
The tests uploaded the files individually, but creating a compressed tar file and uploading it to S3 was also tested. The specific tests were:
  1. Upload using s3-parallel-put
  2. Upload using AWS CLI tools
  3. Tar all of the files first, then use AWS CLI tools to upload the tar file (no data compression)
The tests were run with the wall clock time recorded at the start and at the end. The results are shown below.
Note: The y-axis has been normalized to an arbitrary value but the larger the value, the longer it takes to upload the data.
Figure 22. Comparing the s3-parallel-put tool, AWS CLI, and tar to upload
From the chart you can see that the AWS CLI tool is about 3x faster than s3-parallel-put. However, the fastest upload time is when all of the files were first tarred together and then uploaded. That is about 33% faster than not tarring the files.
Note: The actual upload of the tar file alone takes about ¼ of the time of uploading all of the files individually.
Remember that instead of having individual files in S3 (individual objects), you have one large object, which is a tar file.

Example 3: Testing The AWS CLI For Uploading

This example goes back to the first example, increases the number of files in the same proportion, and adds a very small file (less than 1.9KB). There are 40x 5GB files and 60x 500MB files, for a total of 100 files. Two files were added to the data set to force the uploads to contend with one very small file (1.9KB) and one large file (50GB). This is a grand total of 102 files.

The AWS CLI tools were tested. While using the CLI tool, a few combinations of using the tool along with tar and various compression tools were also used.
  1. Tar the files into a single .tar file, upload with the CLI
  2. Tar the files with compression in a single step, upload with the CLI
  3. Tar the files into a single .tar file, compress it as a separate step, upload with the CLI
  4. Tar the files with parallel compression (pigz) in a single step, upload with the CLI
  5. Tar the files into a single .tar file, parallel compress as a separate step, upload with the CLI
The time to complete the tar and to complete the data compression are included in the overall time.
Note: The y-axis has been normalized to an arbitrary value but the larger the value, the longer it takes to upload the data.
Figure 23. Using the AWS CLI tool to upload
From the testing results, the following observations were made:
  • The CLI tool alone is the fastest
  • Using serial compression tools such as bzip2 with tar greatly increases the total upload time (fourth bar from the right).
  • Tarring all of the files together while using pigz (parallel gzip), results in the second fastest upload time (second bar from the right). Just remember that the files are now in one large, compressed file on S3.
  • Using separate tasks for tar and then compression slows down the overall upload time
  • pigz appears to be about 6 times faster than gzip on this EC2 instance

7.2.3. S3 Object Keys

Since S3 does not understand the concept of directories or folders, the full path becomes part of the object name. In essence, S3 acts like a key-value store, so that each key points to its associated object (S3 is more sophisticated than a key-value store, but at a fundamental level it acts simply like a key-value store). An object key might be something like the following (a hypothetical example):
project1/assets/js/jtables.js
The key has several directories (folders) before the actual file name, which is jtables.js.

Keys are unique within a bucket. They do not contain meta-data beyond the key name. Meta-data is kept in a different object that is associated with the object. While patterns in keys will not necessarily improve upload performance, they can improve subsequent read/write performance.

An S3 bucket begins with a single index to all objects. Depending upon what your data looks like, this can become a performance bottleneck. The reason is that all queries go through the partition associated with the index regardless of the key name. Ideally, it would be good to spread your objects across multiple partitions (multiple indices) to improve performance. This means that more storage servers are used, which brings more CPU, memory, and network resources to bear.

The partitions are based on the object key (plus the bucket key and a version number that might be associated with the object). If the first few characters of the object key are all the same, then S3 will assign the objects to the same partition, resulting in only one server servicing any data requests.

S3 will try to spread the object keys across multiple partitions as best it can, satisfying a number of constraints while also trying to increase the number of partitions. However, as you are uploading data, S3 will not be able to create partitions on the fly, so you are likely to be using one partition. You can contact AWS prior to your data upload to discuss “pre-warming” partitions for a bucket. Over time, as you use the data, more partitions are added as needed. The exact rules that determine how and when partitions are created (or not) are proprietary to AWS. However, if the object keys are very similar, particularly in the first few characters, the objects will all land on a single partition. The best way around this is to add some randomness to the first few characters of an object key.

S3 Object Key Example

To better understand how you might introduce some randomness into key names, let’s look at a simple example. Below is a table of objects in a bucket; it includes the bucket key and the object key for each object. When storing data, a common pattern is to use the date as the first part of the name. This carries over to the object keys as shown below.
Figure 24. Objects in a bucket

Notice that the first 5 characters of the object key are the same for each file, since the “year” was used first. There is not much variation in the year, especially if you are working with recent data from the last year or two.

One option to improve randomness for the first few characters is to reverse the date on the object keys.
Figure 25. Objects in a bucket - reverse date

This results in more randomness in the object keys. There are now 31 possible values for the first two characters (01 to 31). This gives S3 more leeway in creating partitions for the data.

One problem with introducing more randomness into the object key is that you might have to change your applications to read the date backwards and convert it. But if you are doing a great deal of data processing, the code change may be worth the improved S3 performance.
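For illustration, a reversed-date key can be derived with a few lines of shell. The key below is a made-up example in the date-first pattern:

```shell
# Original date-first key: little variation in the leading characters.
key="2017-03-17/customers/photos/portrait.jpg"

datepart="${key%%/*}"          # "2017-03-17"
rest="${key#*/}"               # "customers/photos/portrait.jpg"
IFS=- read -r y m d <<< "$datepart"

# Reversed-date key: the day now leads, giving 31 possible prefixes.
newkey="${d}-${m}-${y}/${rest}"
echo "$newkey"                 # 17-03-2017/customers/photos/portrait.jpg
```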

Another way to introduce randomness into the object key is to make the first four characters of each object key the first four characters of the object’s md5sum, as below.
Figure 26. Objects in a bucket - first four characters from md5sum

This introduces a great deal of randomness in the first four characters while still allowing you to keep the classic format for the date. But again, you may have to modify your application to drop the first 5 characters from the key.
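A sketch of the md5sum-prefix scheme follows; the file name and date are made-up, and `md5sum` and `cut` are standard Linux tools:

```shell
# Write a small file standing in for the object to upload.
f=/tmp/portrait.jpg
printf 'example object bytes' > "$f"

# Take the first four hex characters of the object's md5sum as the prefix.
prefix=$(md5sum "$f" | cut -c1-4)

# Keep the classic date format after the random-looking prefix.
key="${prefix}-2017-03-17/customers/photos/portrait.jpg"
echo "$key"
```

An application reading the object back would strip the first 5 characters (the prefix plus the dash) to recover the date-first name.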

7.3. Storage

AWS has many storage options, including network file storage, object storage, network block storage, and even local storage in the instance, which is referred to as ephemeral storage. To plan your use of storage within AWS, it’s best to discuss the options with your AWS contact. The following sections discuss your storage options for artificial intelligence, deep learning, and machine learning.

When you use the NVIDIA Volta Deep Learning AMI, by default you have a single AWS EBS (Elastic Block Storage) volume that is formatted with a file system and mounted as / on the instance. The EBS volume is a general purpose (gp2) type. While not entirely accurate, you can think of an EBS volume as an iSCSI volume that you might mount on a system from a SAN. However, EBS volumes have some features that you might not get from a SAN, such as very high durability and availability, encryption (at rest and in transit), snapshots, and elastic scaling.

Notes about your options:
  • Using encryption at rest and in transit will impact throughput. It’s up to you to make the trade-off between performance and security.
  • You can resize EBS volumes on the fly. However, this doesn’t resize the file system that is using the volume. Therefore, you will need to know how to grow the file system on the resized volume:
    • The current size limit on EBS volumes is 16TB. For anything greater than that, you either need to use EFS or use a second volume with a RAID level.
  • Snapshots are a great way to save current data for the future.
  • There are performance limits on an EBS volume, including throughput and IOPS. Remember that, in general, deep learning IO is usually fairly IOPS-heavy (read IOPS in particular).
  • Some instances are EBS optimized, meaning they have a much better connection to EBS volumes for improved performance.
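The file system resize step in the notes above typically looks like the following on the instance. This is a commented sketch; the device names and file system type are assumptions, and the commands require root on a real instance:

```shell
# After resizing the EBS volume via the AWS console or CLI:
# sudo growpart /dev/xvda 1     # extend partition 1 to fill the larger volume
# sudo resize2fs /dev/xvda1     # grow an ext4 file system online
# For an XFS file system, grow it via its mount point instead:
# sudo xfs_growfs /
```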

Before training your model, you may have to upload your data from a local host to the running instance. Before uploading, ensure you have enough EBS capacity to store everything. You might estimate the size on your local host first, before starting the instance, and then increase the EBS volume size so that it’s larger than the data set. Another option is to upload your data to Amazon’s S3 object storage. Your applications won’t be able to read or write directly to S3, but you can copy the data from S3 to your NVIDIA Volta Deep Learning instance, train the model, and upload any results back to S3. This keeps all of the data within AWS; keeping the instance and your S3 bucket in the same region also helps data transfer throughput. You can upload your data file by file to S3, or create a tar file (or compressed tar file) and upload everything at once. Your S3 data can also be encrypted.

7.3.1. Network Storage

In the previous section, the simple option of using a single EBS volume with the NVIDIA Volta Deep Learning AMI was discussed. In this section, other options are presented, such as using EFS or using multiple EBS volumes in a RAID group.

Elastic File System (EFS)

The AWS Elastic File System (EFS) can be thought of as “NFS as a service”. It allows you to create an NFS service with very high durability and availability that you can mount on instances in a specific region across multiple Availability Zones (in other words, EFS is a regional service). The amount of storage space in EFS is elastic, so it can grow into the Petabyte range. It uses NFSv4.1 (NFSv3 is not supported) to improve security. As you add data to EFS, its performance increases. It also allows you to encrypt the data in the file system for more security.

Perhaps the best feature of EFS is that it is fully managed. You don’t have to create an instance to act as an NFS server and allocate storage to attach to that server. Instead, you create an EFS file system and a mount point for the clients. As you add data, EFS automatically increases the storage as needed; in other words, you don’t have to add storage and extend the file system.
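From a client, mounting the EFS file system is a standard NFSv4.1 mount. In this commented sketch, the file system ID, region, and mount path are placeholders:

```shell
# sudo mkdir -p /mnt/efs
# sudo mount -t nfs4 -o nfsvers=4.1 \
#     fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
```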

For a brand new EFS file system, the performance is likely to be fairly low. The AWS documentation indicates that for every TB of data, you get about 50 MB/s of guaranteed throughput. For NGC, EFS is a great AWS product for easily creating a very durable NFS storage system, but the performance may be low until the file system contains a large amount of data (multiple TBs).
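Using the figure above (roughly 50 MB/s of baseline throughput per TB stored, an approximation of the AWS documentation), the expected baseline scales linearly with the amount of data stored:

```shell
# Approximate baseline throughput for a given amount of data stored in EFS.
stored_tb=4
baseline_mbs=$((stored_tb * 50))
echo "${stored_tb} TB stored -> ~${baseline_mbs} MB/s baseline throughput"
```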

One important thing to remember is that your throughput performance will be governed by the network of your instance type. If your instance type has a 10 Gb/s network, that will cap your NFS performance.

EBS Volumes In RAID-0

As mentioned previously, the NVIDIA Volta Deep Learning AMI comes with a single EBS volume. Currently, EBS volumes are limited to 16TB. To get more than 16TB, you will have to take two or more EBS volumes and combine them with Linux Software RAID (mdadm).

Linux Software RAID (mdadm) allows you to create all kinds of RAID levels. EBS volumes are already durable and available, which means that RAID levels that provide resiliency in the event of a block device failure, such as RAID-5, are not necessary. Therefore, it’s recommended to use RAID-0.

You can combine a fairly large number of EBS volumes into a RAID group, which allows you to create, for example, a 160TB RAID group for an instance. However, this should be done for capacity reasons only: adding EBS volumes doesn’t improve the IO performance of single-threaded applications, and single-threaded IO is very common in deep learning applications.
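Striping EBS volumes into RAID-0 with mdadm looks roughly like the following. This is a commented sketch; the device names, volume count, file system, and mount point are assumptions, and the commands require root on a real instance:

```shell
# Stripe two attached EBS volumes into a single RAID-0 device:
# sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
# sudo mkfs.ext4 /dev/md0
# sudo mkdir -p /data && sudo mount /dev/md0 /data
```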

8. Scripts


8.1. DIGITS

8.1.1. run_digits.sh

# file: run_digits.sh
mkdir -p $HOME/digits_workdir/jobs
cat <<EOF > $HOME/digits_workdir/digits_config_env.sh
# DIGITS Configuration File
EOF
nvidia-docker run --rm -ti --name=${USER}_digits -p 5000:5000 \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  --env-file=${HOME}/digits_workdir/digits_config_env.sh \
  -v /datasets:/digits_data:ro \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/digits:17.05  # image tag assumed; the original was elided

8.1.2. digits_config_env.sh

# DIGITS Configuration File

8.2. NVCaffe

8.2.1. run_caffe_mnist.sh

# file: run_caffe_mnist.sh
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
function join { local IFS="$1"; shift; echo "$*"; }
# arguments to passthrough to caffe such as "-gpu all" or "-gpu 0,1"
script_args="$(join : $@)"
# The variable values below are assumed; the original definitions were elided.
dname=${USER}_caffe
CAFFEWORKDIR=$HOME/caffe_workdir
DATA=$CAFFEWORKDIR/mnist/data
mkdir -p $DATA
mkdir -p $CAFFEWORKDIR/mnist
# Backend storage for Caffe data.
BACKEND="lmdb"
# Orchestrate Docker container with user's privileges
nvidia-docker run -d -t --name=$dname \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -e DATA=$DATA -v $DATA:$DATA \
  -e BACKEND=$BACKEND -e script_args="$script_args" \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -w $CAFFEWORKDIR nvcr.io/nvidia/caffe:17.05
sleep 1 # wait for container to come up
# download and convert data into lmdb format.
docker exec -it $dname bash -c '
  pushd $DATA
  for fname in train-images-idx3-ubyte train-labels-idx1-ubyte \
      t10k-images-idx3-ubyte t10k-labels-idx1-ubyte ; do
    if [ ! -e ${DATA}/$fname ]; then
      wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
      gunzip ${fname}.gz
    fi
  done
  popd
  # Directory names assumed; the original definitions were elided.
  TRAINDIR=$DATA/mnist_train_${BACKEND}
  TESTDIR=$DATA/mnist_test_${BACKEND}
  if [ ! -d "$TRAINDIR" ]; then
    convert_mnist_data \
      $DATA/train-images-idx3-ubyte $DATA/train-labels-idx1-ubyte \
      $TRAINDIR --backend=${BACKEND}
  fi
  if [ ! -d "$TESTDIR" ]; then
    convert_mnist_data \
      $DATA/t10k-images-idx3-ubyte $DATA/t10k-labels-idx1-ubyte \
      $TESTDIR --backend=${BACKEND}
  fi
'
# =============================================================================
# =============================================================================
cat <<EOF > $CAFFEWORKDIR/mnist/lenet_train_test.prototxt
name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "$DATA/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "$DATA/mnist_test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
EOF
cat <<EOF > $CAFFEWORKDIR/mnist/lenet_solver.prototxt
# The train/test net protocol buffer definition
net: "mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
EOF

# RUN TRAINING WITH CAFFE ---------------------------------------------------
docker exec -it $dname bash -c '
  # workdir is CAFFEWORKDIR when container was started.
  caffe train --solver=mnist/lenet_solver.prototxt ${script_args//:/ }'
docker stop $dname && docker rm $dname


8.3. TensorFlow

8.3.1. run_tf_cifar10.sh

# file: run_tf_cifar10.sh
# run example:
# 	./run_tf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
# Get usage help via:
# 	./run_tf_cifar10.sh --help 2>/dev/null
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# specify workdirectory for the container to run scripts or work from.
# The variable values below are assumed; the original definitions were elided.
dname=${USER}_tf
workdir=$_basedir
cifarcode=${_basedir}/examples/tensorflow/cifar/cifar10_train.py
function join { local IFS="$1"; shift; echo "$*"; }
script_args=$(join : "$@")
nvidia-docker run --name=$dname -d -t \
  --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -v /datasets/cifar:/datasets/cifar:ro -w $workdir \
  -e cifarcode=$cifarcode -e script_args="$script_args" \
  nvcr.io/nvidia/tensorflow:17.05  # image tag assumed; the original was elided
sleep 1 # wait for container to come up
docker exec -it $dname bash -c 'python $cifarcode ${script_args//:/ }'
docker stop $dname && docker rm $dname

8.4. Keras

8.4.1. venvfns.sh

# file: venvfns.sh
# functions for virtualenv
[[ "${BASH_SOURCE[0]}" == "${0}" ]] && \
  echo Should be run as : source "${0}" && exit 1
enablevenvglobalsitepackages() {
	if ! [ -z ${VIRTUAL_ENV+x} ]; then
    	_libpypath=$(dirname $(python -c \
  "from distutils.sysconfig import get_python_lib; print(get_python_lib())"))
   	if ! [[ "${_libpypath}" == *"$VIRTUAL_ENV"* ]]; then
      	return # VIRTUAL_ENV path not in the right place
   	fi
   	# File name follows the virtualenv convention; original definition elided.
   	no_global_site_packages_file=${_libpypath}/no-global-site-packages.txt
   	if [ -f $no_global_site_packages_file ]; then
       	rm $no_global_site_packages_file;
       	echo "Enabled global site-packages"
   	else
       	echo "Global site-packages already enabled"
   	fi
	fi
}

disablevenvglobalsitepackages() {
	if ! [ -z ${VIRTUAL_ENV+x} ]; then
    	_libpypath=$(dirname $(python -c \
  "from distutils.sysconfig import get_python_lib; print(get_python_lib())"))
   	if ! [[ "${_libpypath}" == *"$VIRTUAL_ENV"* ]]; then
      	return # VIRTUAL_ENV path not in the right place
   	fi
   	no_global_site_packages_file=${_libpypath}/no-global-site-packages.txt
   	if ! [ -f $no_global_site_packages_file ]; then
       	touch $no_global_site_packages_file
       	echo "Disabled global site-packages"
   	else
       	echo "Global site-packages were already disabled"
   	fi
	fi
}

8.4.2. setup_keras.sh

# file: setup_keras.sh
# The container name and image are assumed; the originals were elided.
dname=${USER}_keras
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  nvcr.io/nvidia/tensorflow:17.05  # image tag assumed

docker exec -it -u root $dname \
  bash -c 'apt-get update && apt-get install -y virtualenv virtualenvwrapper'
docker exec -it $dname \
  bash -c 'source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
  mkvirtualenv py-keras
  pip install --upgrade pip
  pip install keras --no-deps
  pip install PyYaml
  pip install numpy
  pip install scipy
  pip install ipython'
docker stop $dname && docker rm $dname

8.4.3. run_kerastf_mnist.sh

# file: run_kerastf_mnist.sh
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# specify workdirectory for the container to run scripts or work from.
# The variable values below are assumed; the original definitions were elided.
dname=${USER}_keras
workdir=$_basedir
mnistcode=${_basedir}/examples/keras/mnist_cnn.py
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -w $workdir -e mnistcode=$mnistcode \
  nvcr.io/nvidia/tensorflow:17.05  # image tag assumed; the original was elided
sleep 1 # wait for container to come up
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	python $mnistcode'
docker stop $dname && docker rm $dname

8.4.4. run_kerasth_mnist.sh

# file: run_kerasth_mnist.sh
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# specify workdirectory for the container to run scripts or work from.
# The variable values below are assumed; the original definitions were elided.
dname=${USER}_keras
workdir=$_basedir
mnistcode=${_basedir}/examples/keras/mnist_cnn.py
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -w $workdir -e mnistcode=$mnistcode \
  nvcr.io/nvidia/tensorflow:17.05  # image tag assumed; the original was elided
sleep 1 # wait for container to come up
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	KERAS_BACKEND=theano python $mnistcode'
docker stop $dname && docker rm $dname

8.4.5. run_kerastf_cifar10.sh

# file: run_kerastf_cifar10.sh
# run example:
# 	./run_kerastf_cifar10.sh --epochs=3 --datadir=/datasets/cifar
# Get usage help via:
# 	./run_kerastf_cifar10.sh --help 2>/dev/null
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# specify workdirectory for the container to run scripts or work from.
# The variable values below are assumed; the original definitions were elided.
dname=${USER}_keras
workdir=$_basedir
cifarcode=${_basedir}/examples/keras/cifar10_cnn_filesystem.py
function join { local IFS="$1"; shift; echo "$*"; }
script_args=$(join : "$@")
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  -v /datasets/cifar:/datasets/cifar:ro -w $workdir \
  -e cifarcode=$cifarcode -e script_args="$script_args" \
  nvcr.io/nvidia/tensorflow:17.05  # image tag assumed; the original was elided
sleep 1 # wait for container to come up
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	python $cifarcode ${script_args//:/ }'
docker stop $dname && docker rm $dname

8.4.6. run_keras_script.sh

# file: run_keras_script.sh
_basedir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# specify workdirectory for the container to run scripts or work from.
# The default values below are assumed; the original definitions were elided.
workdir=$_basedir
dname=${USER}_keras
container=nvcr.io/nvidia/tensorflow:17.05
backend=tensorflow
script=''
datamnt=''
remain_args=()
function join { local IFS="$1"; shift; echo "$*"; }
usage() {
cat <<EOF
Usage: $0 [-h|--help] [--container=container] [--script=script]
	Sets up a keras environment. The keras environment is setup in a
	virtualenv and mapped into the docker container with a chosen
	--backend. Then runs the specified --script.
	--container - Specify desired container. Use "=" equal sign.
    	Default: ${container}
	--backend - Specify the backend for Keras: tensorflow or theano.
    	Default: ${backend}
	--script - Specify a script. Specify scripts with full or relative
    	paths (relative to current working directory). Ex.:
	--datamnt - Data directory to mount into the container.
	--<remain_args> - Additional args to pass through to the script.
	-h|--help - Displays this help.
EOF
}
while getopts ":h-" arg; do
	case "${arg}" in
	h ) usage
    	exit 2
    	;;
	- ) [ $OPTIND -ge 1 ] && optind=$(expr $OPTIND - 1 ) || optind=$OPTIND
    	eval _OPTION="\$$optind"
    	OPTARG=$(echo $_OPTION | cut -d'=' -f2)
    	OPTION=$(echo $_OPTION | cut -d'=' -f1)
    	case $OPTION in
    	--container ) larguments=yes; container="$OPTARG"  ;;
    	--script ) larguments=yes; script="$OPTARG"  ;;
    	--backend ) larguments=yes; backend="$OPTARG"  ;;
    	--datamnt ) larguments=yes; datamnt="$OPTARG"  ;;
    	--help ) usage; exit 2 ;;
    	--* ) remain_args+=($_OPTION) ;;
    	esac
    	;;
	esac
done
script_args="$(join : ${remain_args[@]})"
# formulate -v option for docker if datamnt is not empty.
mntdata=$([[ ! -z "${datamnt// }" ]] && echo "-v ${datamnt}:${datamnt}:ro" )
nvidia-docker run --name=$dname -d -t \
  -u $(id -u):$(id -g) -e HOME=$HOME -e USER=$USER -v $HOME:$HOME \
  $mntdata -w $workdir \
  -e backend=$backend -e script=$script -e script_args="$script_args" \
  $container
sleep 1 # wait for container to come up
docker exec -it $dname \
	bash -c 'source ~/.virtualenvs/py-keras/bin/activate
	source ~/venvfns.sh
	KERAS_BACKEND=$backend python $script ${script_args//:/ }'
docker stop $dname && docker rm $dname

8.4.7. cifar10_cnn_filesystem.py

#!/usr/bin/env python
# file: cifar10_cnn_filesystem.py
"""Train a simple deep CNN on the CIFAR10 small images dataset."""
from __future__ import print_function
import sys
import os
from argparse import (ArgumentParser, SUPPRESS)
from textwrap import dedent
import numpy as np
# from keras.utils.data_utils import get_file
from keras.utils import to_categorical
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
import keras.layers as KL
from keras import backend as KB
from keras.optimizers import RMSprop
def parser_(desc):
	parser = ArgumentParser(description=dedent(desc))
	parser.add_argument('--epochs', type=int, default=200,
                    	help='Number of epochs to run training for.')
	parser.add_argument('--aug', action='store_true', default=False,
                    	help='Perform data augmentation on cifar10 set.\n')
	# parser.add_argument('--datadir', default='/mnt/datasets')
	parser.add_argument('--datadir', default=SUPPRESS,
                    	help='Data directory with Cifar10 dataset.')
	args = parser.parse_args()
	return args
def make_model(inshape, num_classes):
	model = Sequential()
	model.add(KL.Conv2D(32, (3, 3), padding='same',
                    	input_shape=inshape[1:]))
	model.add(KL.Activation('relu'))
	model.add(KL.Conv2D(32, (3, 3)))
	model.add(KL.Activation('relu'))
	model.add(KL.MaxPooling2D(pool_size=(2, 2)))
	model.add(KL.Conv2D(64, (3, 3), padding='same'))
	model.add(KL.Activation('relu'))
	model.add(KL.Conv2D(64, (3, 3)))
	model.add(KL.Activation('relu'))
	model.add(KL.MaxPooling2D(pool_size=(2, 2)))
	# Classifier head restored from the standard Keras cifar10_cnn example;
	# the original lines were elided.
	model.add(KL.Flatten())
	model.add(KL.Dense(512, activation='relu'))
	model.add(KL.Dropout(0.5))
	model.add(KL.Dense(num_classes, activation='softmax'))
	return model
def cifar10_load_data(path):
	"""Loads CIFAR10 dataset.
	# Returns
    	Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
	dirname = 'cifar-10-batches-py'
	# origin = 'http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
	# path = get_file(dirname, origin=origin, untar=True)
	path_ = os.path.join(path, dirname)
	num_train_samples = 50000
	x_train = np.zeros((num_train_samples, 3, 32, 32), dtype='uint8')
	y_train = np.zeros((num_train_samples,), dtype='uint8')
	for i in range(1, 6):
    	fpath = os.path.join(path_, 'data_batch_' + str(i))
    	data, labels = cifar10.load_batch(fpath)
    	x_train[(i - 1) * 10000: i * 10000, :, :, :] = data
    	y_train[(i - 1) * 10000: i * 10000] = labels
	fpath = os.path.join(path_, 'test_batch')
	x_test, y_test = cifar10.load_batch(fpath)
	y_train = np.reshape(y_train, (len(y_train), 1))
	y_test = np.reshape(y_test, (len(y_test), 1))
	if KB.image_data_format() == 'channels_last':
    	x_train = x_train.transpose(0, 2, 3, 1)
    	x_test = x_test.transpose(0, 2, 3, 1)
	return (x_train, y_train), (x_test, y_test)
def main(argv=None):
	main.__doc__ = __doc__
	argv = sys.argv if argv is None else argv
	desc = main.__doc__
	# CLI parser
	args = parser_(desc)
	batch_size = 32
	num_classes = 10
	epochs = args.epochs
	data_augmentation = args.aug
	datadir = getattr(args, 'datadir', None)
	# The data, shuffled and split between train and test sets:
	(x_train, y_train), (x_test, y_test) = cifar10_load_data(datadir) \
    	if datadir is not None else cifar10.load_data()
	print(x_train.shape[0], 'train samples')
	print(x_test.shape[0], 'test samples')
	# Convert class vectors to binary class matrices.
	y_train = to_categorical(y_train, num_classes)
	y_test = to_categorical(y_test, num_classes)
	x_train = x_train.astype('float32')
	x_test = x_test.astype('float32')
	x_train /= 255
	x_test /= 255
	callbacks = None
	print(x_train.shape, 'train shape')
	model = make_model(x_train.shape, num_classes)
	# initiate RMSprop optimizer
	opt = RMSprop(lr=0.0001, decay=1e-6)
	# Let's train the model using RMSprop
	model.compile(loss='categorical_crossentropy',
              	optimizer=opt,
              	metrics=['accuracy'])
	nsamples = x_train.shape[0]
	steps_per_epoch = nsamples // batch_size
	if not data_augmentation:
    	print('Not using data augmentation.')
    	model.fit(x_train, y_train,
              	batch_size=batch_size,
              	epochs=epochs,
              	validation_data=(x_test, y_test),
              	shuffle=True,
              	callbacks=callbacks)
	else:
    	print('Using real-time data augmentation.')
    	# This will do preprocessing and realtime data augmentation:
    	datagen = ImageDataGenerator(
        	featurewise_center=False,  # set input mean to 0 over the dataset
        	samplewise_center=False,  # set each sample mean to 0
        	featurewise_std_normalization=False,  # divide inputs by std of the dataset
        	samplewise_std_normalization=False,  # divide each input by its std
        	zca_whitening=False,  # apply ZCA whitening
        	rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        	width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        	height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        	horizontal_flip=True,  # randomly flip images
        	vertical_flip=False)  # randomly flip images
    	# Compute quantities required for feature-wise normalization
    	# (std, mean, and principal components if ZCA whitening is applied).
    	datagen.fit(x_train)
    	# Fit the model on the batches generated by datagen.flow().
    	model.fit_generator(datagen.flow(x_train, y_train,
                                     	batch_size=batch_size),
                        	steps_per_epoch=steps_per_epoch,
                        	epochs=epochs,
                        	validation_data=(x_test, y_test),
                        	callbacks=callbacks)
if __name__ == '__main__':
	main()





NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.


NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station, GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.