Abstract

The DIGITS User Guide provides a detailed overview of installing and running DIGITS. This guide also provides examples of using DIGITS with the Caffe, Torch, and TensorFlow deep learning frameworks.

1. Overview of DIGITS

The NVIDIA® Deep Learning GPU Training System (DIGITS™) puts the power of deep learning into the hands of engineers and data scientists.

DIGITS is not a framework. It is a wrapper for Caffe, Torch, and TensorFlow that provides a graphical web interface to those frameworks, so you can work with them without dealing with the command line directly.

DIGITS can be used to rapidly train highly accurate deep neural networks (DNNs) for image classification, segmentation, object detection tasks, and more. DIGITS simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment. DIGITS is completely interactive so that data scientists can focus on designing and training networks rather than programming and debugging.

1.1. Contents of the DIGITS Application

The DIGITS container image available on DGX is pre-built, with DIGITS installed in the /usr/local/python/ directory.

DIGITS also includes the NVIDIA Caffe, Torch, and TensorFlow deep learning frameworks.

2. Downloading DIGITS

DIGITS is available through multiple channels such as:
  • a GitHub download
  • an Amazon Machine Image
The following instructions are specific to obtaining DIGITS within DGX.

You can pull (download) DIGITS that is already built, tested, tuned, and ready to run.

DIGITS is available for download from the DGX™ Container Registry, where NVIDIA provides a number of pre-built containers. If your organization has provided you with access to any custom containers, you can download them as well.

Before pulling DIGITS, ensure that the following prerequisites are met:
  • You have read access to the registry space that contains the application.
  • You are logged into DGX™ Container Registry as explained in the Quick Start Guide.
  • You are a member of the docker group, which enables you to use docker commands.
Tip: To browse the available containers in the DGX™ Container Registry, use a web browser to log in to your NVIDIA® DGX™ Cloud Services account on the DGX Cloud Services website.

For step-by-step instructions on how to pull a container or application, see the Quick Start Guide. In general, use the docker pull command to pull images from the NVIDIA DGX Container Registry listed at https://compute.nvidia.com (nvcr.io).

After pulling DIGITS, you can run jobs in the container to run neural networks, deploy deep learning models, and perform AI analytics.
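A quick sketch of the overall pull workflow is shown below; the registry login procedure is covered in the Quick Start Guide, and the release tag is one of those listed in Table 1.

  $ docker login nvcr.io                      # credentials are described in the Quick Start Guide
  $ docker pull nvcr.io/nvidia/digits:17.07   # substitute the release tag you want
  $ docker images | grep digits               # confirm the image is now available locally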

3. Running DIGITS

There are two ways you can run DIGITS:
  1. Running DIGITS in DGX-1
  2. Running DIGITS from Developer Zone

3.1. Running DIGITS in DGX-1

Before running the application, use the docker pull command to ensure an up-to-date image is installed. Once the pull is complete, you can run the application.

  1. Copy the command for the applicable release of the container that you want.
    Table 1. docker pull commands for DIGITS
    Release   docker pull command
    17.07     docker pull nvcr.io/nvidia/digits:17.07
    17.06     docker pull nvcr.io/nvidia/digits:17.06
    17.05     docker pull nvcr.io/nvidia/digits:17.05
    17.04     docker pull nvcr.io/nvidia/digits:17.04
    17.03     docker pull nvcr.io/nvidia/digits:17.03
    17.02     docker pull nvcr.io/nvidia/digits:17.02
    17.01     docker pull nvcr.io/nvidia/digits:17.01
    16.12     docker pull nvcr.io/nvidia/digits:16.12
  2. Open a command prompt and paste the pull command. The pulling of the container image begins. Ensure the pull completes successfully before proceeding to the next step.
  3. Run the application.
    1. To run the server as a daemon and expose port 5000 in the container to port 8888 on your host:
      nvidia-docker run --name digits -d -p 8888:5000 \
        nvcr.io/nvidia/digits

      Note: DIGITS™ 5.0 and earlier use port 5000 by default.
    2. To mount one local directory containing your data (read-only), and another for writing your DIGITS jobs:
      nvidia-docker run --name digits -d -p 8888:5000 \
        -v /home/username/data:/data:ro \
        -v /home/username/digits-jobs:/workspace/jobs \
        nvcr.io/nvidia/digits
      Note: In order to share data between ranks, NVIDIA® Collective Communications Library (NCCL™) may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system’s limits on these resources may need to be increased accordingly. Refer to your system’s documentation for details.
      In particular, Docker containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by adding the following options to the nvidia-docker run command line:
      --shm-size=1g --ulimit memlock=-1
      A consolidated command combining all of these options is sketched below.
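      For reference, a consolidated launch command combining the daemon, port-mapping, volume, and memory options above might look like the following; a sketch using the example directories shown earlier (adjust the paths for your system):
      nvidia-docker run --name digits -d -p 8888:5000 \
        --shm-size=1g --ulimit memlock=-1 \
        -v /home/username/data:/data:ro \
        -v /home/username/digits-jobs:/workspace/jobs \
        nvcr.io/nvidia/digits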
  4. See /workspace/README.md inside the container for information on customizing your DIGITS application.

3.2. Running DIGITS from Developer Zone

For more information about downloading, running, and using DIGITS, see: NVIDIA DIGITS: Interactive Deep Learning GPU Training System.

4. Deep Learning Frameworks for DIGITS

The DIGITS application in the NVIDIA Docker repository, nvcr.io, includes not only DIGITS but also the Caffe, Torch, and TensorFlow deep learning frameworks. You can read the details in the container release notes at http://docs.nvidia.com/deeplearning/dgx/index.html. For example, the 17.08 release of DIGITS includes the 17.08 releases of Caffe, Torch, and TensorFlow.

DIGITS is a training platform that can be used with NVIDIA Caffe, Torch, and TensorFlow deep learning frameworks. Using any of these frameworks, DIGITS will train your deep learning models on your dataset.

The following sections include examples using DIGITS with a Caffe, Torch, or TensorFlow backend.
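If you want to confirm which framework build a given DIGITS image contains before starting the server, you can run a one-off command in the container. A minimal sketch, assuming pycaffe exposes a version string (caffe.__version__) in your image:

  $ nvidia-docker run --rm nvcr.io/nvidia/digits:17.04 \
      python -c "import caffe; print(caffe.__version__)"   # prints the Caffe version bundled in the image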

4.1. Caffe for DIGITS

4.1.1. Example 1: MNIST

The MNIST dataset comes with the DIGITS application.

  1. The first step in training a model with DIGITS and Caffe on a DGX-1 is to pull the DIGITS application from the nvcr.io registry (be sure you are logged into the DGX-1).
    $ docker pull nvcr.io/nvidia/digits:17.04
  2. After the application has been pulled, you can start DIGITS on the DGX-1. Because DIGITS is a web-based frontend for Caffe, Torch, and TensorFlow, we will run the DIGITS application in a non-interactive way using the following command.
    $ nvidia-docker run -d --name digits-17.04 -p 8888:5000 \
        --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
        nvcr.io/nvidia/digits:17.04
    There are a number of options in this command.
    • The first option -d tells nvidia-docker to run the application in “daemon” mode.
    • The --name option names the running application (we will need this later).
    • The two --ulimit options and the --shm-size option increase the amount of memory available to Caffe, since it shares data across GPUs using shared memory.
    • The -p 8888:5000 option maps the DIGITS port 5000 to port 8888 (you will see how this is used below).
    After you run this command you need to find the IP address of the DIGITS node. This can be found by running the command ifconfig as shown below.
    $ ifconfig
    docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 192.168.99.1  netmask 255.255.255.0  broadcast 0.0.0.0     
         inet6 fe80::42:5cff:fefb:1c30  prefixlen 64  scopeid 0x20<link>     
         ether 02:42:5c:fb:1c:30  txqueuelen 0  (Ethernet)     
         RX packets 22649  bytes 5171804 (4.9 MiB)     
         RX errors 0  dropped 0  overruns 0  frame 0     
         TX packets 29088  bytes 123439479 (117.7 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500     
         inet 10.31.229.99  netmask 255.255.255.128  broadcast 10.31.229.127     
         inet6 fe80::56ab:3aff:fed6:614f  prefixlen 64  scopeid 0x20<link>     
         ether 54:ab:3a:d6:61:4f  txqueuelen 1000  (Ethernet)     
         RX packets 8116350  bytes 11069954019 (10.3 GiB)     
         RX errors 0  dropped 9  overruns 0  frame 0     
         TX packets 1504305  bytes 162349141 (154.8 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    ...

    In this case, we want the Ethernet IP address since that is the address of the web server for DIGITS (10.31.229.99 for this example). Your IP address will be different.
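    If ifconfig is not installed (it is deprecated on some Linux distributions), the ip tool reports the same information. For example, for the Ethernet interface shown above:
    $ ip -4 addr show enp1s0f0 | grep inet    # substitute your own interface name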

  3. We now need to download the MNIST data set into the application. The DIGITS application has a simple script for downloading the data set into the application. As a check, run the following command to make sure the application is running.
    $ docker ps -a
    CONTAINER ID    IMAGE                       ...  NAMES
    c930962b9636    nvcr.io/nvidia/digits:17.04 ...  digits-17.04

    The application is running and has the name that we gave it (digits-17.04).

    Next you need to “shell” into the running application from another terminal on the DGX-1.
    $ docker exec -it digits-17.04 bash
    root@XXXXXXXXXXXX:/workspace#
    We want to put the data into the directory /data/mnist. There is a simple Python script in the application that will do this for us. It downloads the data in the correct format as well.
    # python -m digits.download_data mnist /data/mnist
    Downloading url=http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
    Uncompressing file=train-images-idx3-ubyte.gz ...
    Uncompressing file=train-labels-idx1-ubyte.gz ...
    Uncompressing file=t10k-images-idx3-ubyte.gz ...
    Uncompressing file=t10k-labels-idx1-ubyte.gz ...
    Reading labels from /data/mnist/train-labels.bin ...
    Reading images from /data/mnist/train-images.bin ...
    Reading labels from /data/mnist/test-labels.bin ...
    Reading images from /data/mnist/test-images.bin ...
    Dataset directory is created successfully at '/data/mnist'
    Done after 13.4188599586 seconds.
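    Optionally, before leaving the shell, confirm that the dataset directory was created:
    # ls /data/mnist    # should list the train and test data referenced in the output above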
    
  4. You can now open a web browser to the IP address from the previous step. Be sure to use port 8888 since we mapped the DIGITS port from 5000 to port 8888. For this example, the URL would be the following.
    10.31.229.99:8888
    On the home page of DIGITS, in the top right corner it says that there are 8 of 8 GPUs available on this DGX-1.
    Figure 1. DIGITS home page
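    Before switching to the browser, you can also verify from a shell that the server is answering. This check assumes the JSON index endpoint of the DIGITS REST API is available in your release:
    $ curl http://10.31.229.99:8888/index.json    # returns a JSON summary of datasets and models if the server is up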
  5. Load a dataset. We are going to use the MNIST dataset as an example since it comes with the application.
    1. Click the Datasets tab.
    2. Click the Images drop down menu and select Classification. If DIGITS asks for a user name, you can enter anything you want. The New Image Classification Dataset window displays. After filling in the fields, your screen should look like the following.
      Figure 2. New Image Classification Dataset
    3. Provide values for the Image Type and the Image size as shown in the above image.
    4. Give your dataset a name in the Dataset Name field. You can name the dataset anything you like. In this case the name is just “mnist”.
    5. Click Create. This tells DIGITS to tell Caffe to load the datasets. After the datasets are loaded, your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 3. MNIST top level
      Figure 4. MNIST lower level
      Note: There are two sections that allow you to “explore” the db (database). Create DB (train) is for training data and Create DB (val) is for validation data. In either of these displays, you can click Explore the db.
  6. Train a model. We are going to use Yann LeCun’s LeNet model as an example since it comes with the application.
    1. Define the model. Click DIGITS in the upper left corner to be taken back to the home page.
    2. Click the Models tab.
    3. Click the Images drop down menu and select Classification. The New Image Classification Model window displays.
    4. Provide values for the Select Dataset and the training parameter fields.
    5. In the Standard Networks tab, click Caffe and select the LeNet radio button.
      Note: DIGITS allows you to use previous networks, pre-trained networks, and custom networks if you want.
    6. Click Create. The training of the LeNet model starts.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 5. New Image Classification Model top level
      Figure 6. New Image Classification Model lower level
      During the training, DIGITS displays the history of the training parameters, specifically, the loss function for the training data, the accuracy from the validation data set, and the loss function for the validation data. After the training completes (all 30 epochs are trained), your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 7. Image Classification Model top level
      Figure 8. Image Classification Model lower level
  7. Optional: You can test some images (inference) against the trained model by scrolling to the bottom of the web page. For illustrative purposes, a single image is input from the test data set. You can always upload an image if you like. You can also input a list of “test” images if you want. The screen below does inference against a test image called /data/mnist/test/5/06206.png. Also, select the Statistics and Visualizations checkbox to ensure that you can see all of the details from the network as well as the network prediction.

    Figure 9. Trained Models
    Note: You can select a model from any of the epochs if you want. To do so, click the Select Model drop down arrow and select a different epoch.
  8. Click Classify One. This opens another browser tab and displays predictions. The screen below is the output for the test image that is the number “5”.

    Figure 10. Classify One Image
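    The same single-image inference can also be scripted against the DIGITS REST API instead of the browser. A sketch, assuming the classify_one.json endpoint of your DIGITS release; the job ID shown is hypothetical (copy the real model job ID from your model page):
    $ curl http://10.31.229.99:8888/models/images/classification/classify_one.json \
        -XPOST \
        -F job_id=20170405-123456-abcd \
        -F image_file=@06206.png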

4.1.2. Example 2: Siamese Network

  1. In order to train a siamese network, you must first have the MNIST dataset. To create the MNIST dataset, see Example 1: MNIST.
  2. Remember the Job Directory path, since this is needed in this task.
    Figure 11. Job directory
  1. Run the Python script available at: GitHub: mnist_siamese_train_test.prototxt. The script requires the following parameters; a sample invocation follows the parameter list.
    create_db.py <where to save results> <the job directory> -c <how many samples>
    Where:
    • <where to save results> is the directory path where you want to save your output.
    • <the job directory> is the name of the directory that you took note of in the prerequisites.
    • <how many samples> is where you define the number of samples. Set this number to 100000.
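    For illustration, a sample invocation run inside the container might look like the following; /data/siamese is an example output directory and the job directory path is hypothetical (substitute the Job Directory you noted in the prerequisites):
    # python create_db.py /data/siamese /workspace/jobs/20170405-123456-abcd -c 100000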
  2. Create the siamese dataset.
    1. On the Home page, click New Dataset > Images > Other.
      Figure 12. New dataset
    2. Provide the directory paths to the following fields:
      Note: The directory path should be the same location that was specified in <where to save results>.
      • The train image database
      • The train label database
      • The validation image database
      • The validation label database
      • The train image mean file (train_mean.binaryproto)
      Figure 13. New image dataset
  3. Click New Model > Images > Other to create the model. In this example, we will use Caffe to train our siamese network.
  4. Train the model.
    1. Click the Custom Network tab and select Caffe.
    2. Copy and paste the following network definition: https://github.com/ethantang95/DIGITS/blob/master/examples/siamese/mnist_siamese_train_test.prototext
    3. Ensure the Base Learning Rate is set to 0.01, keep the default settings for the other fields, and click Train.
      Figure 14. New image model
      Figure 15. Training on Caffe
      After the model is trained, the graph output should look similar to the following:
      Figure 16. Caffe graph output
  5. Test an image by uploading one from the same directory location that you specified in the <where to save results> path.
    1. Select the Show visualization and statistics check box. In order to ensure that the network was trained correctly and everything worked, scroll down and look at the inference results.
      Figure 17. Verify
    2. Scroll down to see the inference highlighting the numbers that were seen inside the given image.
      Figure 18. Inference result

4.2. Torch for DIGITS

4.2.1. Example 1: MNIST

The MNIST dataset comes with the DIGITS application.

  1. The first step in training a model with DIGITS and Torch on a DGX-1 is to pull the DIGITS application from the nvcr.io registry (be sure you are logged into the DGX-1).
    $ docker pull nvcr.io/nvidia/digits:17.04
  2. After the application has been pulled, you can start DIGITS on the DGX-1. Because DIGITS is a web-based frontend for Caffe, Torch, and TensorFlow, we will run the DIGITS application in a non-interactive way using the following command.
    $ nvidia-docker run -d --name digits-17.04 -p 8888:5000 \
        --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
        nvcr.io/nvidia/digits:17.04
    There are a number of options in this command.
    • The first option -d tells nvidia-docker to run the application in “daemon” mode.
    • The --name option names the running application (we will need this later).
    • The two --ulimit options and the --shm-size option increase the amount of memory available to Torch, since it shares data across GPUs using shared memory.
    • The -p 8888:5000 option maps the DIGITS port 5000 to port 8888 (you will see how this is used below).
    After you run this command you need to find the IP address of the DIGITS node. This can be found by running the command ifconfig as shown below.
    $ ifconfig
    docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 192.168.99.1  netmask 255.255.255.0  broadcast 0.0.0.0     
         inet6 fe80::42:5cff:fefb:1c30  prefixlen 64  scopeid 0x20<link>     
         ether 02:42:5c:fb:1c:30  txqueuelen 0  (Ethernet)     
         RX packets 22649  bytes 5171804 (4.9 MiB)     
         RX errors 0  dropped 0  overruns 0  frame 0     
         TX packets 29088  bytes 123439479 (117.7 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500     
         inet 10.31.229.99  netmask 255.255.255.128  broadcast 10.31.229.127     
         inet6 fe80::56ab:3aff:fed6:614f  prefixlen 64  scopeid 0x20<link>     
         ether 54:ab:3a:d6:61:4f  txqueuelen 1000  (Ethernet)     
         RX packets 8116350  bytes 11069954019 (10.3 GiB)     
         RX errors 0  dropped 9  overruns 0  frame 0     
         TX packets 1504305  bytes 162349141 (154.8 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    ...

    In this case, we want the Ethernet IP address since that is the address of the web server for DIGITS (10.31.229.99 for this example). Your IP address will be different.

  3. We now need to download the MNIST data set into the application. The DIGITS application has a simple script for downloading the data set into the application. As a check, run the following command to make sure the application is running.
    $ docker ps -a
    CONTAINER ID    IMAGE                       ...  NAMES
    c930962b9636    nvcr.io/nvidia/digits:17.04 ...  digits-17.04

    The application is running and has the name that we gave it (digits-17.04).

    Next you need to “shell” into the running application from another terminal on the DGX-1.
    $ docker exec -it digits-17.04 bash
    root@XXXXXXXXXXXX:/workspace#
    We want to put the data into the directory /data/mnist. There is a simple Python script in the application that will do this for us. It downloads the data in the correct format as well.
    # python -m digits.download_data mnist /data/mnist
    Downloading url=http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
    Uncompressing file=train-images-idx3-ubyte.gz ...
    Uncompressing file=train-labels-idx1-ubyte.gz ...
    Uncompressing file=t10k-images-idx3-ubyte.gz ...
    Uncompressing file=t10k-labels-idx1-ubyte.gz ...
    Reading labels from /data/mnist/train-labels.bin ...
    Reading images from /data/mnist/train-images.bin ...
    Reading labels from /data/mnist/test-labels.bin ...
    Reading images from /data/mnist/test-images.bin ...
    Dataset directory is created successfully at '/data/mnist'
    Done after 13.4188599586 seconds.
    
  4. You can now open a web browser to the IP address from the previous step. Be sure to use port 8888 since we mapped the DIGITS port from 5000 to port 8888. For this example, the URL would be the following.
    10.31.229.99:8888
    On the home page of DIGITS, in the top right corner it says that there are 8 of 8 GPUs available on this DGX-1.
    Figure 19. DIGITS home page
  5. Load a dataset. We are going to use the MNIST dataset as an example since it comes with the application.
    1. Click the Datasets tab.
    2. Click the Images drop down menu and select Classification. If DIGITS asks for a user name, you can enter anything you want. The New Image Classification Dataset window displays. After filling in the fields, your screen should look like the following.
      Figure 20. New Image Classification Dataset
    3. Provide values for the Image Type and the Image size as shown in the above image.
    4. Give your dataset a name in the Dataset Name field. You can name the dataset anything you like. In this case the name is just “mnist”.
    5. Click Create. This tells DIGITS to tell Torch to load the datasets. After the datasets are loaded, your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 21. MNIST top level
      Figure 22. MNIST lower level
      Note: There are two sections that allow you to “explore” the db (database). Create DB (train) is for training data and Create DB (val) is for validation data. In either of these displays, you can click Explore the db.
  6. Train a model. We are going to use Yann LeCun’s LeNet model as an example since it comes with the application.
    1. Define the model. Click DIGITS in the upper left corner to be taken back to the home page.
    2. Click the Models tab.
    3. Click the Images drop down menu and select Classification. The New Image Classification Model window displays.
    4. Provide values for the Select Dataset and the training parameter fields.
    5. In the Standard Networks tab, click Torch and select the LeNet radio button.
      Note: DIGITS allows you to use previous networks, pre-trained networks, and custom networks if you want.
    6. Click Create. The training of the LeNet model starts.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 23. New Image Classification Model top level
      Figure 24. New Image Classification Model lower level
      During the training, DIGITS displays the history of the training parameters, specifically, the loss function for the training data, the accuracy from the validation data set, and the loss function for the validation data. After the training completes (all 30 epochs are trained), your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 25. Image Classification Model top level
      Figure 26. Image Classification Model lower level
  7. Optional: You can test some images (inference) against the trained model by scrolling to the bottom of the web page. For illustrative purposes, a single image is input from the test data set. You can always upload an image if you like. You can also input a list of “test” images if you want. The screen below does inference against a test image called /data/mnist/test/5/06206.png. Also, select the Statistics and Visualizations checkbox to ensure that you can see all of the details from the network as well as the network prediction.

    Figure 27. Trained Models
    Note: You can select a model from any of the epochs if you want. To do so, click the Select Model drop down arrow and select a different epoch.
  8. Click Classify One. This opens another browser tab and displays predictions. The screen below is the output for the test image that is the number “5”.

    Figure 28. Classify One Image
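When you are finished with the application, you can stop and remove the running container with standard Docker commands, using the name assigned earlier:

  $ docker stop digits-17.04
  $ docker rm digits-17.04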

4.2.2. Example 2: Siamese Network

  1. In order to train a siamese network, you must first have the MNIST dataset. To create the MNIST dataset, see Example 1: MNIST.
  2. Remember the Job Directory path, since this is needed in this task.
    Figure 29. Job directory
  1. Run the Python script available at: GitHub: mnist_siamese.lua. The script requires the following parameters:
    create_db.py <where to save results> <the job directory> -c <how many samples>
    Where:
    • <where to save results> is the directory path where you want to save your output.
    • <the job directory> is the name of the directory that you took note of in the prerequisites.
    • <how many samples> is where you define the number of samples. Set this number to 100000.
  2. Create the siamese dataset.
    1. On the Home page, click New Dataset > Images > Other.
      Figure 30. New dataset
    2. Provide the directory paths to the following fields:
      Note: The directory path should be the same location that was specified in <where to save results>.
      • The train image database
      • The train label database
      • The validation image database
      • The validation label database
      • The train image mean file (train_mean.binaryproto)
      Figure 31. New image dataset
  3. Click New Model > Images > Other to create the model. In this example, we will use Torch to train our siamese network.
  4. Train the model.
    1. Click the Custom Network tab and select Torch.
    2. Copy and paste the following network definition: https://github.com/ethantang95/DIGITS/blob/master/examples/siamese/mnist_siamese_train_test.prototext
      Note: This model can only use a single GPU to train.
    3. Ensure the Base Learning Rate is set to 0.01, keep the default settings for the other fields, and click Train.
      Figure 32. New image model
      Figure 33. Training on Torch
      After the model is trained, the graph output should look similar to the following:
      Figure 34. Torch graph output
  5. Test an image by uploading one from the same directory location that you specified in the <where to save results> path.
    1. Select the Show visualization and statistics check box. In order to ensure that the network was trained correctly and everything worked, scroll down and look at the inference results.
      Figure 35. Verify
    2. Scroll down to see the inference highlighting the numbers that were seen inside the given image.
      Figure 36. Inference result

4.3. TensorFlow for DIGITS

TensorFlow for DIGITS works with DIGITS v6.0.
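If you are not sure which DIGITS version an image contains, you can query it with a one-off container run. A minimal sketch, assuming the digits Python package exposes __version__ as it does in the DIGITS GitHub sources:

  $ nvidia-docker run --rm nvcr.io/nvidia/digits \
      python -c "import digits; print(digits.__version__)"   # assumes digits.__version__ exists in this image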

4.3.1. Example 1: MNIST

  1. The first step in training a model with DIGITS and TensorFlow on a DGX-1 is to pull the DIGITS application from the nvcr.io registry (be sure you are logged into the DGX-1).
    $ docker pull nvcr.io/nvidia/digits:17.04
  2. After the application has been pulled, you can start DIGITS on the DGX-1. Because DIGITS is a web-based frontend for Caffe, Torch, and TensorFlow, we will run the DIGITS application in a non-interactive way using the following command.
    $ nvidia-docker run -d --name digits-17.04 -p 8888:5000 \
        --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
        nvcr.io/nvidia/digits:17.04
    There are a number of options in this command.
    • The first option -d tells nvidia-docker to run the application in “daemon” mode.
    • The --name option names the running application (we will need this later).
    • The two --ulimit options and the --shm-size option increase the amount of memory available to TensorFlow, since it shares data across GPUs using shared memory.
    • The -p 8888:5000 option maps the DIGITS port 5000 to port 8888 (you will see how this is used below).
    After you run this command you need to find the IP address of the DIGITS node. This can be found by running the command ifconfig as shown below.
    $ ifconfig
    docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 192.168.99.1  netmask 255.255.255.0  broadcast 0.0.0.0     
         inet6 fe80::42:5cff:fefb:1c30  prefixlen 64  scopeid 0x20<link>     
         ether 02:42:5c:fb:1c:30  txqueuelen 0  (Ethernet)     
         RX packets 22649  bytes 5171804 (4.9 MiB)     
         RX errors 0  dropped 0  overruns 0  frame 0     
         TX packets 29088  bytes 123439479 (117.7 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500     
         inet 10.31.229.99  netmask 255.255.255.128  broadcast 10.31.229.127     
         inet6 fe80::56ab:3aff:fed6:614f  prefixlen 64  scopeid 0x20<link>     
         ether 54:ab:3a:d6:61:4f  txqueuelen 1000  (Ethernet)     
         RX packets 8116350  bytes 11069954019 (10.3 GiB)     
         RX errors 0  dropped 9  overruns 0  frame 0     
         TX packets 1504305  bytes 162349141 (154.8 MiB)     
         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    ...

    In this case, we want the Ethernet IP address since that is the address of the web server for DIGITS (10.31.229.99 for this example). Your IP address will be different.
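    If the DIGITS home page fails to load in a later step, the server log inside the container is the first place to look:
    $ docker logs -f digits-17.04    # follow the DIGITS server log; press Ctrl+C to stop following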

  3. We now need to download the MNIST data set into the application. The DIGITS application has a simple script for downloading the data set into the application. As a check, run the following command to make sure the application is running.
    $ docker ps -a
    CONTAINER ID    IMAGE                       ...  NAMES
    c930962b9636    nvcr.io/nvidia/digits:17.04 ...  digits-17.04

    The application is running and has the name that we gave it (digits-17.04).

    Next you need to “shell” into the running application from another terminal on the DGX-1.
    $ docker exec -it digits-17.04 bash
    root@XXXXXXXXXXXX:/workspace#
    We want to put the data into the directory /data/mnist. There is a simple Python script in the application that will do this for us. It downloads the data in the correct format as well.
    # python -m digits.download_data mnist /data/mnist
    Downloading url=http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
    Downloading url=http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
    Uncompressing file=train-images-idx3-ubyte.gz ...
    Uncompressing file=train-labels-idx1-ubyte.gz ...
    Uncompressing file=t10k-images-idx3-ubyte.gz ...
    Uncompressing file=t10k-labels-idx1-ubyte.gz ...
    Reading labels from /data/mnist/train-labels.bin ...
    Reading images from /data/mnist/train-images.bin ...
    Reading labels from /data/mnist/test-labels.bin ...
    Reading images from /data/mnist/test-images.bin ...
    Dataset directory is created successfully at '/data/mnist'
    Done after 13.4188599586 seconds.
    
  4. You can now open a web browser to the IP address from the previous step. Be sure to use port 8888 since we mapped the DIGITS port from 5000 to port 8888. For this example, the URL would be the following.
    10.31.229.99:8888
    On the home page of DIGITS, in the top right corner it says that there are 8 of 8 GPUs available on this DGX-1.
    Figure 37. DIGITS home page
  5. Load a dataset. We are going to use the MNIST dataset as an example since it comes with the application.
    1. Click the Datasets tab.
    2. Click the Images drop down menu and select Classification. If DIGITS asks for a user name, you can enter anything you want. The New Image Classification Dataset window displays. After filling in the fields, your screen should look like the following.
      Figure 38. New Image Classification Dataset
    3. Provide values for the Image Type and the Image size as shown in the above image.
    4. Give your dataset a name in the Dataset Name field. You can name the dataset anything you like. In this case the name is just “mnist”.
    5. Click Create. This tells DIGITS to tell TensorFlow to load the datasets. After the datasets are loaded, your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 39. MNIST top level
      Figure 40. MNIST lower level
      Note: There are two sections that allow you to “explore” the db (database). Create DB (train) is for training data and Create DB (val) is for validation data. In either of these displays, you can click Explore the db.
  6. Train a model. We are going to use Yann LeCun’s LeNet model as an example since it comes with the application.
    1. Define the model. Click DIGITS in the upper left corner to be taken back to the home page.
    2. Click the Models tab.
    3. Click the Images drop down menu and select Classification. The New Image Classification Model window displays.
    4. Provide values for the Select Dataset and the training parameter fields.
    5. In the Standard Networks tab, click TensorFlow and select the LeNet radio button.
      Note: DIGITS allows you to use previous networks, pre-trained networks, and custom networks if you want.
    6. Click Create. The training of the LeNet model starts.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 41. New Image Classification Model top level
      During the training, DIGITS displays the history of the training parameters, specifically, the loss function for the training data, the accuracy from the validation data set, and the loss function for the validation data. After the training completes (all 30 epochs are trained), your screen should look similar to the following.
      Note: This screen capture has been truncated because the web page is very long.
      Figure 42. Image Classification Model
  7. Optional: You can test some images (inference) against the trained model by scrolling to the bottom of the web page. For illustrative purposes, a single image is input from the test data set. You can always upload an image if you like. You can also input a list of “test” images if you want. The screen below does inference against a test image called /data/mnist/test/5/06206.png. Also, select the Statistics and Visualizations checkbox to ensure that you can see all of the details from the network as well as the network prediction.

    Figure 43. Trained Models
    Note: You can select a model from any of the epochs if you want. To do so, click the Select Model drop down arrow and select a different epoch.
  8. Click Classify One. This opens another browser tab and displays predictions. The screen below is the output for the test image that is the number “5”.

    Figure 44. Classify One Image

4.3.2. Example 2: Siamese Network

  1. In order to train a siamese network, you must first have the MNIST dataset. To create the MNIST dataset, see Example 1: MNIST.
  2. Remember the Job Directory path, since this is needed in this task.
    Figure 45. Job directory
  1. Run the Python script available at: GitHub: siamese-TF.py. The script requires the following parameters:
    create_db.py <where to save results> <the job directory> -c <how many samples>
    Where:
    • <where to save results> is the directory path where you want to save your output.
    • <the job directory> is the name of the directory that you took note of in the prerequisites.
    • <how many samples> is where you define the number of samples. Set this number to 100000.
  2. Create the siamese dataset.
    1. On the Home page, click New Dataset > Images > Other.
      Figure 46. New dataset
    2. Provide the directory paths to the following fields:
      Note: The directory path should be the same location that was specified in <where to save results>.
      • The train image database
      • The train label database
      • The validation image database
      • The validation label database
      • The train image mean file (train_mean.binaryproto)
      Figure 47. New image dataset
  3. Click New Model > Images > Other to create the model. In this example, we will use TensorFlow to train our siamese network.
  4. Train the model.
    1. Click the Custom Network tab and select TensorFlow.
    2. Copy and paste the following network definition: https://github.com/ethantang95/DIGITS/blob/master/examples/siamese/mnist_siamese_train_test.prototext
    3. Ensure the Base Learning Rate is set to 0.01, keep the default settings for the other fields, and click Train.
      Figure 48. New image model
      Figure 49. Custom model
      Figure 50. Training on TensorFlow
      After the model is trained, the graph output should look similar to the following:
      Figure 51. TensorFlow graph output
  5. Test an image by uploading one from the same directory location that you specified in the <where to save results> path.
    1. Select the Show visualization and statistics check box. In order to ensure that the network was trained correctly and everything worked, there are two things you need to confirm are included within the results.
      Figure 52. Verify
      1. Near the top, there is an activation result which highlights one of the numbers that exists in the image. In this example, you will see that the number 1 is highlighted.
        Figure 53. Example output
      2. Scroll down to see the inference highlighting the numbers that were seen inside the given image.
        Figure 54. Example output

5. Troubleshooting

5.1. Support

For the latest Release Notes, see the DIGITS Release Notes Documentation website.

For more information about DIGITS, see NVIDIA DIGITS: Interactive Deep Learning GPU Training System.
Note: There may be slight variations between the NVIDIA-docker images and this image.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, Jetson, Kepler, NVIDIA Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.