Abstract

The DIGITS User Guide provides a detailed overview of installing and running DIGITS. This guide also provides examples of using DIGITS with the Caffe and Torch deep learning frameworks.

1. Overview of DIGITS

The NVIDIA® Deep Learning GPU Training System (DIGITS™) puts the power of deep learning into the hands of engineers and data scientists.

DIGITS is not a framework. Rather, it is a wrapper for Caffe and Torch that provides a graphical web interface to those frameworks, so that you do not have to work with them directly on the command line.

DIGITS can be used to rapidly train highly accurate deep neural networks (DNNs) for image classification, segmentation and object detection tasks. DIGITS simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment. DIGITS is completely interactive so that data scientists can focus on designing and training networks rather than programming and debugging.

1.1. Contents of the DIGITS Application

The container image available on DGX is pre-built and installed into the /usr/local/python/ directory.

DIGITS also includes the NVIDIA Caffe and Torch deep learning frameworks.

2. Downloading DIGITS

DIGITS is available through multiple channels such as:
  • a GitHub download
  • an Amazon Machine Image
  • a Debian package
The following instructions are specific to obtaining DIGITS within DGX.

You can pull (download) DIGITS that is already built, tested, tuned, and ready to run.

DIGITS is available for download from the DGX™ Container Registry, along with a number of other containers that NVIDIA provides. If your organization has provided you with access to any custom containers, you can download them as well.

Before pulling DIGITS, ensure that the following prerequisites are met:
  • You have read access to the registry space that contains the application.
  • You are logged into DGX™ Container Registry as explained in the Quick Start Guide.
  • You are a member of the docker group, which enables you to use docker commands.
Tip: To browse the available containers in the DGX™ Container Registry, use a web browser to log in to your NVIDIA® DGX™ Cloud Services account on the DGX Cloud Services website.
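Logging in to the registry is typically a one-time docker login step. The sketch below shows the shape of the command without executing it; it assumes you have an NVIDIA API key at hand, and the username for nvcr.io is the literal string $oauthtoken (not your account name):

```shell
# Sketch of logging in to the registry before pulling.
# The username for nvcr.io is the literal string "$oauthtoken";
# docker prompts for your API key as the password.
REGISTRY=nvcr.io
USERNAME='$oauthtoken'

# The real (interactive) command would be:
#   docker login -u "$USERNAME" "$REGISTRY"
echo "docker login -u $USERNAME $REGISTRY"
```

See the Quick Start Guide for the authoritative login procedure for your installation.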

For step-by-step instructions on how to pull a container or application, see the Quick Start Guide. In general, use the docker pull command to pull images from the NVIDIA DGX Container Registry listed at https://compute.nvidia.com (nvcr.io).

After pulling DIGITS, you can run jobs in the container to run neural networks, deploy deep learning models, and perform AI analytics.

3. Running DIGITS

Before running the application, use the docker pull command to ensure an up-to-date image is installed. Once the pull is complete, you can run the application.

  1. Copy the command for the applicable release of the container that you want. For example:
    docker pull nvcr.io/nvidia/digits:17.04
  2. Open a command prompt and paste the pull command. The pulling of the container image begins. Ensure the pull completes successfully before proceeding to the next step.
  3. Run the application.
    1. To run the server as a daemon and expose port 5000 in the container to port 8888 on your host:
      nvidia-docker run --name digits -d -p 8888:5000 \
        nvcr.io/nvidia/digits

      Note: DIGITS™ 5.0 uses port 5000 by default.
    2. To mount one local directory containing your data (read-only), and another for writing your DIGITS jobs:
      nvidia-docker run --name digits -d -p 8888:5000 \
        -v /home/username/data:/data:ro \
        -v /home/username/digits-jobs:/workspace/jobs \
        nvcr.io/nvidia/digits
      Note: In order to share data between ranks, NVIDIA® Collective Communications Library (NCCL™) may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system’s limits on these resources may need to be increased accordingly. Refer to your system’s documentation for details.
      In particular, Docker containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by adding:
      --shm-size=1g --ulimit memlock=-1
      to the nvidia-docker run command line.
  4. See /workspace/README.md inside the container for information on customizing your DIGITS application.
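Putting the pieces of step 3 together, a single launch command can mount both directories and raise the shared/pinned memory limits at the same time. This is a sketch; the host paths are hypothetical placeholders from the examples above and should be replaced with your own, and the command is shown (echoed) rather than executed:

```shell
# Hypothetical host paths; substitute your own data and job directories.
DATA_DIR=/home/username/data
JOBS_DIR=/home/username/digits-jobs

# Combined launch command: daemon mode, port mapping, volume mounts,
# and the shared/pinned memory settings recommended for NCCL.
CMD="nvidia-docker run --name digits -d -p 8888:5000 \
  --shm-size=1g --ulimit memlock=-1 \
  -v ${DATA_DIR}:/data:ro -v ${JOBS_DIR}:/workspace/jobs \
  nvcr.io/nvidia/digits"
echo "$CMD"
```

Running the echoed command on a DGX host starts a single DIGITS container that serves the web interface on host port 8888.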

4. Deep Learning Frameworks for DIGITS

The DIGITS application in the NVIDIA Docker repository, nvcr.io, includes not only DIGITS itself but also the Caffe and Torch frameworks. You can read the details in the container release notes at http://docs.nvidia.com/deeplearning/dgx/index.html. For example, the 17.04 release of DIGITS includes the 17.04 releases of Caffe and Torch.

DIGITS is a training platform that can be used with NVIDIA Caffe and Torch deep learning frameworks. Using either of these frameworks, DIGITS will train your deep learning models on your dataset.

The following sections include examples using DIGITS with a Caffe and Torch backend.

4.1. Caffe for DIGITS

4.1.1. Example 1: MNIST

  1. The first step in training a model with DIGITS and Caffe on a DGX-1 is to pull the DIGITS application from the nvcr.io registry (be sure you are logged into the DGX-1).
    $ docker pull nvcr.io/nvidia/digits:17.04
  2. After the application has been pulled, you can start DIGITS on the DGX-1. Because DIGITS is a web-based frontend for Caffe and Torch, we will run the DIGITS application in a non-interactive way using the following command.
    $ nvidia-docker run -d --name digits-17.04 -p 8888:5000
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864
    nvcr.io/nvidia/digits:17.04
    There are a number of options in this command.
    • The first option -d tells nvidia-docker to run the application in detached (“daemon”) mode.
    • The --name option names the running application (we will need this later).
    • The two --ulimit options and the --shm-size option increase the amount of shared memory available to Caffe, which shares data across GPUs using shared memory.
    • The -p 8888:5000 option maps the DIGITS port 5000 to port 8888 (you will see how this is used below).
    After you run this command you need to find the IP address of the DIGITS node. This can be found by running the command ifconfig as shown below.
    $ ifconfig
    docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 192.168.99.1  netmask 255.255.255.0  broadcast 0.0.0.0     
         inet6 fe80::42:5cff:fefb:1c30  prefixlen 64  scopeid 0x20<link>     
         ether 02:42:5c:fb:1c:30  txqueuelen 0  (Ethernet)     
         RX packets 22649  bytes 5171804 (4.9 MiB)     
         RX errors 0  dropped 0  overruns 0  frame 0     
         TX packets 29088  bytes 123439479 (117.7 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500     
         inet 10.31.229.99  netmask 255.255.255.128  broadcast 10.31.229.127     
         inet6 fe80::56ab:3aff:fed6:614f  prefixlen 64  scopeid 0x20<link>     
         ether 54:ab:3a:d6:61:4f  txqueuelen 1000  (Ethernet)     
         RX packets 8116350  bytes 11069954019 (10.3 GiB)     
         RX errors 0  dropped 9  overruns 0  frame 0     
         TX packets 1504305  bytes 162349141 (154.8 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    ...

    In this case, we want the Ethernet IP address since that is the address of the web server for DIGITS (10.31.229.99 in this example, from the enp1s0f0 interface). Your IP address will be different.
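Rather than reading the address out of the ifconfig output by eye, you can extract it with awk. The sketch below runs against a captured sample line from the enp1s0f0 block above; in practice you would pipe the live command (for example, ifconfig enp1s0f0 or ip -4 addr show) instead, and the interface name will vary between systems:

```shell
# Sample line captured from the enp1s0f0 block of the ifconfig output above.
# Live usage would be something like:
#   ifconfig enp1s0f0 | awk '/inet /{print $2; exit}'
sample='     inet 10.31.229.99  netmask 255.255.255.128  broadcast 10.31.229.127'

# Take the second field of the first "inet " line: the IPv4 address.
ip_addr=$(printf '%s\n' "$sample" | awk '/inet /{print $2; exit}')
echo "$ip_addr"
```

Be sure to pick the host's Ethernet interface rather than docker0, since the docker0 address is not reachable from other machines.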

  3. We now need to download the MNIST data set into the application. DIGITS includes a simple script for doing this. As a check, first run the following command to make sure the application is running.
    $ docker ps -a
    CONTAINER ID    IMAGE                       ...  NAMES
    c930962b9636    nvcr.io/nvidia/digits:17.04 ...  digits-17.04

    The application is running and has the name that we gave it (digits-17.04).

    Next you need to “shell” into the running application from another terminal on the DGX-1.
    $ docker exec -it digits-17.04 bash
    root@XXXXXXXXXXXX:/workspace#
    We want to put the data into the directory /data/mnist. There is a simple Python script in the application that will do this for us. It downloads the data in the correct format as well.
    # python -m digits.download_data mnist /data/mnist
    Downloading url=http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
    Uncompressing file=train-images-idx3-ubyte.gz ...
    Uncompressing file=train-labels-idx1-ubyte.gz ...
    Uncompressing file=t10k-images-idx3-ubyte.gz ...
    Uncompressing file=t10k-labels-idx1-ubyte.gz ...
    Reading labels from /data/mnist/train-labels.bin ...
    Reading images from /data/mnist/train-images.bin ...
    Reading labels from /data/mnist/test-labels.bin ...
    Reading images from /data/mnist/test-images.bin ...
    Dataset directory is created successfully at '/data/mnist'
    Done after 13.4188599586 seconds.
    
  4. You can now open a web browser to the IP address from the previous step. Be sure to use port 8888 since we mapped the DIGITS port from 5000 to port 8888. For this example, the URL would be the following.
    10.31.229.99:8888
    On the home page of DIGITS, in the top right corner it says that there are 8 of 8 GPUs available on this DGX-1.
    Figure 1. DIGITS home page
  5. Load a dataset. We are going to use the MNIST dataset as an example since it comes with the application.
    1. Click the Datasets tab.
    2. Click the Images drop down menu and select Classification. If DIGITS asks for a user name, you can enter anything you want. The New Image Classification Dataset window displays. After filling in the fields, your screen should look like the following.
      Figure 2. New Image Classification Dataset
    3. Provide values for the Image Type and the Image size as shown in the above image.
    4. Give your dataset a name in the Dataset Name field. You can name the dataset anything you like. In this case the name is just “mnist”.
    5. Click Create. This tells DIGITS to tell Caffe to load the datasets. After the datasets are loaded, your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 3. MNIST top level
      Figure 4. MNIST lower level
      Note: There are two sections that allow you to “explore” the db (database): Create DB (train) holds the training data and Create DB (val) holds the validation data. In either of these displays, you can click Explore the db.
  6. Train a model. We are going to use Yann LeCun’s LeNet model as an example since it comes with the application.
    1. Define the model. Click DIGITS in the upper left corner to be taken back to the home page.
    2. Click the Models tab.
    3. Click the Images drop down menu and select Classification. The New Image Classification Model window displays.
    4. Provide values for the Select Dataset and the training parameter fields.
    5. In the Standard Networks tab, click Caffe and select the LeNet radio button.
      Note: DIGITS allows you to use previous networks, pre-trained networks, and custom networks if you want.
    6. Click Create. The training of the LeNet model starts.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 5. New Image Classification Model top level
      Figure 6. New Image Classification Model lower level
      During the training, DIGITS displays the history of the training parameters, specifically, the loss function for the training data, the accuracy from the validation data set, and the loss function for the validation data. After the training completes (all 30 epochs are trained), your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 7. Image Classification Model top level
      Figure 8. Image Classification Model lower level
  7. Optional: You can test some images (inference) against the trained model by scrolling to the bottom of the web page. For illustrative purposes, a single image from the test data set is used as input, although you can upload your own image or supply a list of “test” images instead. The screen below runs inference against a test image called /data/mnist/test/5/06206.png. Also, select the Statistics and Visualizations checkbox so that you can see all of the details from the network as well as the network prediction.

    Figure 9. Trained Models
    Note: You can select a model from any of the epochs if you want. To do so, click the Select Model drop down arrow and select a different epoch.
  8. Click Classify One. This opens another browser tab and displays predictions. The screen below shows the output for the test image, which is the number “5”.

    Figure 10. Classify One Image
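The same classification can also be driven without the browser: DIGITS exposes a REST-style API alongside the web interface. The sketch below is hedged, not authoritative; the endpoint path may vary between DIGITS releases (check the API documentation for your version), the host/port is taken from the earlier steps, and the model job ID is a hypothetical placeholder that you would read off your model page in the DIGITS UI. The command is echoed rather than executed:

```shell
# Hypothetical values: host:port from the earlier steps, and a placeholder
# job ID (find the real one on your trained model's page in the DIGITS UI).
DIGITS_HOST=10.31.229.99:8888
JOB_ID=20170404-000000-abcd

# REST-style classification request against the trained model. The endpoint
# path is an assumption; verify it against your DIGITS release.
CMD="curl http://${DIGITS_HOST}/models/images/classification/classify_one.json \
  -XPOST -F job_id=${JOB_ID} -F image_file=@/data/mnist/test/5/06206.png"
echo "$CMD"
```

If the endpoint exists in your release, the response is a JSON document containing the per-class predictions, equivalent to what the Classify One page displays.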

4.2. Torch for DIGITS

4.2.1. Example 1: MNIST

  1. The first step in training a model with DIGITS and Torch on a DGX-1 is to pull the DIGITS application from the nvcr.io registry (be sure you are logged into the DGX-1).
    $ docker pull nvcr.io/nvidia/digits:17.04
  2. After the application has been pulled, you can start DIGITS on the DGX-1. Because DIGITS is a web-based frontend for Caffe and Torch, we will run the DIGITS application in a non-interactive way using the following command.
    $ nvidia-docker run -d --name digits-17.04 -p 8888:5000
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864
    nvcr.io/nvidia/digits:17.04
    There are a number of options in this command.
    • The first option -d tells nvidia-docker to run the application in detached (“daemon”) mode.
    • The --name option names the running application (we will need this later).
    • The two --ulimit options and the --shm-size option increase the amount of shared memory available to Torch, which shares data across GPUs using shared memory.
    • The -p 8888:5000 option maps the DIGITS port 5000 to port 8888 (you will see how this is used below).
    After you run this command you need to find the IP address of the DIGITS node. This can be found by running the command ifconfig as shown below.
    $ ifconfig
    docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 192.168.99.1  netmask 255.255.255.0  broadcast 0.0.0.0     
         inet6 fe80::42:5cff:fefb:1c30  prefixlen 64  scopeid 0x20<link>     
         ether 02:42:5c:fb:1c:30  txqueuelen 0  (Ethernet)     
         RX packets 22649  bytes 5171804 (4.9 MiB)     
         RX errors 0  dropped 0  overruns 0  frame 0     
         TX packets 29088  bytes 123439479 (117.7 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500     
         inet 10.31.229.99  netmask 255.255.255.128  broadcast 10.31.229.127     
         inet6 fe80::56ab:3aff:fed6:614f  prefixlen 64  scopeid 0x20<link>     
         ether 54:ab:3a:d6:61:4f  txqueuelen 1000  (Ethernet)     
         RX packets 8116350  bytes 11069954019 (10.3 GiB)     
         RX errors 0  dropped 9  overruns 0  frame 0     
         TX packets 1504305  bytes 162349141 (154.8 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    ...

    In this case, we want the Ethernet IP address since that is the address of the web server for DIGITS (10.31.229.99 in this example, from the enp1s0f0 interface). Your IP address will be different.

  3. We now need to download the MNIST data set into the application. DIGITS includes a simple script for doing this. As a check, first run the following command to make sure the application is running.
    $ docker ps -a
    CONTAINER ID    IMAGE                       ...  NAMES
    c930962b9636    nvcr.io/nvidia/digits:17.04 ...  digits-17.04

    The application is running and has the name that we gave it (digits-17.04).

    Next you need to “shell” into the running application from another terminal on the DGX-1.
    $ docker exec -it digits-17.04 bash
    root@XXXXXXXXXXXX:/workspace#
    We want to put the data into the directory /data/mnist. There is a simple Python script in the application that will do this for us. It downloads the data in the correct format as well.
    # python -m digits.download_data mnist /data/mnist
    Downloading url=http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
    Uncompressing file=train-images-idx3-ubyte.gz ...
    Uncompressing file=train-labels-idx1-ubyte.gz ...
    Uncompressing file=t10k-images-idx3-ubyte.gz ...
    Uncompressing file=t10k-labels-idx1-ubyte.gz ...
    Reading labels from /data/mnist/train-labels.bin ...
    Reading images from /data/mnist/train-images.bin ...
    Reading labels from /data/mnist/test-labels.bin ...
    Reading images from /data/mnist/test-images.bin ...
    Dataset directory is created successfully at '/data/mnist'
    Done after 13.4188599586 seconds.
    
  4. You can now open a web browser to the IP address from the previous step. Be sure to use port 8888 since we mapped the DIGITS port from 5000 to port 8888. For this example, the URL would be the following.
    10.31.229.99:8888
    On the home page of DIGITS, in the top right corner it says that there are 8 of 8 GPUs available on this DGX-1.
    Figure 11. DIGITS home page
  5. Load a dataset. We are going to use the MNIST dataset as an example since it comes with the application.
    1. Click the Datasets tab.
    2. Click the Images drop down menu and select Classification. If DIGITS asks for a user name, you can enter anything you want. The New Image Classification Dataset window displays. After filling in the fields, your screen should look like the following.
      Figure 12. New Image Classification Dataset
    3. Provide values for the Image Type and the Image size as shown in the above image.
    4. Give your dataset a name in the Dataset Name field. You can name the dataset anything you like. In this case the name is just “mnist”.
    5. Click Create. This tells DIGITS to tell Torch to load the datasets. After the datasets are loaded, your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 13. MNIST top level
      Figure 14. MNIST lower level
      Note: There are two sections that allow you to “explore” the db (database): Create DB (train) holds the training data and Create DB (val) holds the validation data. In either of these displays, you can click Explore the db.
  6. Train a model. We are going to use Yann LeCun’s LeNet model as an example since it comes with the application.
    1. Define the model. Click DIGITS in the upper left corner to be taken back to the home page.
    2. Click the Models tab.
    3. Click the Images drop down menu and select Classification. The New Image Classification Model window displays.
    4. Provide values for the Select Dataset and the training parameter fields.
    5. In the Standard Networks tab, click Torch and select the LeNet radio button.
      Note: DIGITS allows you to use previous networks, pre-trained networks, and custom networks if you want.
    6. Click Create. The training of the LeNet model starts.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 15. New Image Classification Model top level
      Figure 16. New Image Classification Model lower level
      During the training, DIGITS displays the history of the training parameters, specifically, the loss function for the training data, the accuracy from the validation data set, and the loss function for the validation data. After the training completes (all 30 epochs are trained), your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 17. Image Classification Model top level
      Figure 18. Image Classification Model lower level
  7. Optional: You can test some images (inference) against the trained model by scrolling to the bottom of the web page. For illustrative purposes, a single image from the test data set is used as input, although you can upload your own image or supply a list of “test” images instead. The screen below runs inference against a test image called /data/mnist/test/5/06206.png. Also, select the Statistics and Visualizations checkbox so that you can see all of the details from the network as well as the network prediction.

    Figure 19. Trained Models
    Note: You can select a model from any of the epochs if you want. To do so, click the Select Model drop down arrow and select a different epoch.
  8. Click Classify One. This opens another browser tab and displays predictions. The screen below shows the output for the test image, which is the number “5”.

    Figure 20. Classify One Image

5. Troubleshooting

5.1. Support

For the latest Release Notes, see the DIGITS Release Notes Documentation website.

For more information about DIGITS, see:
Note: There may be slight variations between the NVIDIA-docker images and this image.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, Jetson, Kepler, NVIDIA Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.