ResNet-N with TensorFlow and DALI
=================================

This demo implements residual network (ResNet) models and uses DALI for the
data augmentation pipeline described in `the original paper`_.

Common utilities for defining the network and performing basic training are
located in the nvutils directory. Use of nvutils is demonstrated in the model
scripts.

For parallelization, we use the Horovod distribution framework, which works in
concert with MPI. To train ResNet-50 (``--layers=50``) using 8 V100 GPUs, for
example on DGX-1, use the following command (``--dali_cpu`` tells the script
to use the CPU backend for DALI):

::

   $ mpiexec --allow-run-as-root --bind-to socket -np 8 python resnet.py \
       --layers=50 \
       --data_dir=/data/imagenet \
       --data_idx_dir=/data/imagenet-idx \
       --precision=fp16 \
       --log_dir=/output/resnet50 \
       --dali_cpu

Here we have assumed that ImageNet is stored in TFRecord format in the
directory '/data/imagenet'. After training completes, evaluation is performed
using the validation dataset.

Some common training parameters can be tweaked from the command line. Others
must be configured within the network scripts themselves.

The original scripts are modified from the ``nvidia-examples`` scripts in the
`NGC TensorFlow Container`_.

Requirements
~~~~~~~~~~~~

TensorFlow
^^^^^^^^^^

::

   pip install tensorflow-gpu==1.10.0

OpenMPI
^^^^^^^

::

   wget -q -O - https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz | tar -xz
   cd openmpi-3.0.0
   ./configure --enable-orterun-prefix-by-default --with-cuda --prefix=/usr/local/mpi --disable-getpwuid
   make -j"$(nproc)" install
   cd .. && rm -rf openmpi-3.0.0
   echo "/usr/local/mpi/lib" >> /etc/ld.so.conf.d/openmpi.conf && ldconfig
   export PATH=/usr/local/mpi/bin:$PATH

The following works around a segfault in OpenMPI 3.0 when it is run within a
single node without ssh being installed.
::

   /bin/echo -e '#!/bin/bash'\
   '\ncat <> /usr/local/mpi/bin/rsh_warn.sh && \
   chmod +x /usr/local/mpi/bin/rsh_warn.sh && \
   echo "plm_rsh_agent = /usr/local/mpi/bin/rsh_warn.sh" >> /usr/local/mpi/etc/openmpi-mca-params.conf

Horovod
^^^^^^^

::

   export HOROVOD_GPU_ALLREDUCE=NCCL
   export HOROVOD_NCCL_INCLUDE=/usr/include
   export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu
   export HOROVOD_NCCL_LINK=SHARED
   export HOROVOD_WITHOUT_PYTORCH=1
   pip install horovod==0.15.1

.. _the original paper: https://arxiv.org/pdf/1512.03385.pdf
.. _NGC TensorFlow Container: https://www.nvidia.com/en-us/gpu-cloud/deep-learning-containers/
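
Once the requirements listed above are installed, a quick check along the
following lines may help confirm that they are visible from the current
environment before launching a multi-GPU run. This is only a minimal sketch:
the helper name ``check_requirements`` is hypothetical and not part of this
demo or of nvutils, and the check only verifies that the ``tensorflow`` and
``horovod`` modules can be located and that ``mpiexec`` is on ``PATH``. It
does not verify versions, CUDA, or NCCL support.

```python
import importlib.util
import shutil


def check_requirements(modules=("tensorflow", "horovod"),
                       executables=("mpiexec",)):
    """Return a dict mapping each requirement name to True if it was found.

    Hypothetical helper for illustration; not part of the demo scripts.
    """
    found = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        found[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        found[exe] = shutil.which(exe) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If ``mpiexec`` is reported as missing after building OpenMPI as described
above, one likely cause is that ``/usr/local/mpi/bin`` has not been added to
``PATH`` in the current shell.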