Running NVCaffe

Before running the container, use the docker pull command to ensure an up-to-date image is installed. Once the pull is complete, you can run the container image with nvidia-docker. Use nvidia-docker rather than plain docker because it ensures that drivers that match the host are used and configured for the container. Without nvidia-docker, you are likely to get an error when trying to run the container.

  1. Issue the command for the applicable release of the container that you want. The following command assumes you want to pull the latest container.
    docker pull
  2. Open a command prompt and paste the pull command. The pulling of the container image begins. Ensure the pull completes successfully before proceeding to the next step.
  3. Run the container image. To run the container, choose interactive mode or non-interactive mode.
    1. Interactive mode: Open a command prompt and issue:
      nvidia-docker run -it --rm -v local_dir:container_dir <image>:<xx.xx>-py2
    2. Non-interactive mode: Open a command prompt and issue:
      nvidia-docker run --rm -v local_dir:container_dir <image>:<xx.xx>-py2 caffe train …

      • -it means run in interactive mode
      • --rm means delete the container when it exits
      • -v means mount a directory
      • local_dir is the directory or file from your host system (absolute path) that you want to access from inside your container. For example, the local_dir in the following path is /home/jsmith/data/mnist.
        -v /home/jsmith/data/mnist:/data/mnist

        If you run, for example, ls /data/mnist inside the container, you will see the same files as if you issued the ls /home/jsmith/data/mnist command from outside the container.

      • container_dir is the target directory when you are inside your container. For example, /data/mnist is the target directory in the example:
        -v /home/jsmith/data/mnist:/data/mnist
      • <xx.xx> is the container version. For example, 19.01.

    You might want to pull in data and model descriptions from locations outside the container for use by NVCaffe or save results to locations outside the container. To accomplish this, the easiest method is to mount one or more host directories as Docker® data volumes.

    You have pulled the latest files and run the container image.
    Note: In order to share data between ranks, NVIDIA® Collective Communications Library™ (NCCL) may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system’s limits on these resources may need to be increased accordingly. Refer to your system’s documentation for details.
    In particular, Docker containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by adding the following options to the nvidia-docker run command line:
    --shm-size=1g --ulimit memlock=-1
  4. See /workspace/ inside the container for information on customizing your NVCaffe image. For more information about Caffe, including tutorials, documentation, and examples, see the Caffe website. NVCaffe typically uses the same input formats and configuration parameters as Caffe, so community-authored materials and pre-trained models for Caffe can usually also be applied to NVCaffe.
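
Putting the run step and the NCCL note together, the following is a minimal dry-run sketch that assembles an nvidia-docker run command line and prints it instead of executing it. The function name build_run_cmd and the IMAGE variable are hypothetical helpers introduced here for illustration; IMAGE stands for whatever image name and tag you pulled. The caffe train arguments are borrowed from the cluster example in the next section, and the sketch does no shell quoting, so it assumes paths without spaces.

```shell
#!/bin/sh
# Dry-run sketch: print an nvidia-docker run command line, do not execute it.
# build_run_cmd and IMAGE are illustrative names, not part of NVCaffe or Docker.

# Usage: build_run_cmd IMAGE HOST_DIR CONTAINER_DIR [COMMAND...]
build_run_cmd() {
  local image="$1"
  local host_dir="$2"
  local container_dir="$3"
  shift 3

  # The host side of -v must be an absolute path.
  case "$host_dir" in
    /*) ;;
    *) echo "host directory must be an absolute path: $host_dir" >&2
       return 1 ;;
  esac

  local cmd="nvidia-docker run"
  # With no command to run, start an interactive session.
  if [ "$#" -eq 0 ]; then
    cmd="$cmd -it"
  fi
  # Remove the container on exit and raise the limits that NCCL may need.
  cmd="$cmd --rm --shm-size=1g --ulimit memlock=-1"
  # Mount the host directory at the chosen path inside the container.
  cmd="$cmd -v ${host_dir}:${container_dir} ${image}"
  # Append the command to run inside the container, if one was given.
  if [ "$#" -gt 0 ]; then
    cmd="$cmd $*"
  fi
  printf '%s\n' "$cmd"
}

# Placeholder: replace with the image name and tag you pulled.
IMAGE="${IMAGE:-<image>:<xx.xx>-py2}"

# Interactive mode.
build_run_cmd "$IMAGE" /home/jsmith/data/mnist /data/mnist
# Non-interactive mode.
build_run_cmd "$IMAGE" /home/jsmith/data/mnist /data/mnist caffe train --solver solver.prototxt
```

Once the printed line looks right, paste it into a command prompt to start the container.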

Running An NVCaffe Container On A Cluster

NVCaffe supports training on multiple nodes using the OpenMPI version 2.0 protocol. However, you cannot specify the number of threads per process because NVCaffe has its own thread manager (currently, it runs one worker thread per GPU). For example:
dgx job submit --name jobname --volume <src>:<dst> --tasks 48 
--clusterid <id> --gpu 8 --cpu 64 --mem 480 --image <tag> --nc "mpirun 
-bind-to none -np 48 -pernode --tag-output caffe train --solver 
solver.prototxt --gpu all >> /logs/caffe.log 2>&1"
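
Because the mpirun command above passes --tag-output and appends everything to /logs/caffe.log, output from all ranks ends up interleaved in one file. The following is a small sketch for pulling out the lines written by a single rank. It assumes the prefix format that Open MPI documents for --tag-output, [jobid,rank]<stream>:, and the function name rank_lines is a hypothetical helper, not part of NVCaffe or Open MPI.

```shell
#!/bin/sh
# Sketch: print only the lines that one MPI rank wrote to a --tag-output log.
# rank_lines is an illustrative helper name, not an NVCaffe or Open MPI tool.

# Usage: rank_lines RANK LOGFILE
rank_lines() {
  local rank="$1"
  local logfile="$2"
  # Assumed prefix: "[jobid,rank]<stdout>:" or "[jobid,rank]<stderr>:".
  # Matching the closing bracket keeps rank 1 from also matching rank 10.
  grep -E "^\[[0-9]+,${rank}\]<std[a-z]+>:" "$logfile"
}
```

For example, rank_lines 0 /logs/caffe.log prints only what rank 0 wrote. If no line matches, grep prints nothing and returns a non-zero status.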