ResNet-N with TensorFlow and DALI

This demo implements a residual network model and uses DALI for the data augmentation pipeline from the original paper.

It implements the ResNet50 v1.5 CNN model and demonstrates efficient single-node training on multi-GPU systems. The scripts can be used for benchmarking, or as a starting point for implementing and training your own network.
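
For orientation, below is a minimal sketch of a DALI training pipeline for ImageNet-style TFRecords. It is not the pipeline the demo ships (that one lives in nvutils); the batch size, crop size, and normalization constants here are illustrative assumptions.

import nvidia.dali.fn as fn
import nvidia.dali.types as types
import nvidia.dali.tfrecord as tfrec
from nvidia.dali import pipeline_def

# Illustrative sketch only; the demo's real pipeline is defined in nvutils.
@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def train_pipeline(tfrecord_files, index_files, use_gpu=True):
    inputs = fn.readers.tfrecord(
        path=tfrecord_files,
        index_path=index_files,
        features={
            "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
            "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
        },
        random_shuffle=True,
        name="Reader")
    # "mixed" decodes JPEGs on the GPU; "cpu" keeps the whole pipeline on the CPU.
    images = fn.decoders.image(inputs["image/encoded"],
                               device="mixed" if use_gpu else "cpu")
    # Random resized crop plus horizontal flip and per-channel normalization.
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT16,
        output_layout="NHWC",
        mean=[123.68, 116.78, 103.94],
        std=[58.393, 57.12, 57.375],
        mirror=fn.random.coin_flip())
    return images, inputs["image/class/label"]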

Common utilities for defining CNN networks and performing basic training are located in the nvutils directory inside docs/examples/use_cases/tensorflow/resnet-n. The utilities are written in TensorFlow 2.0. Use of nvutils is demonstrated in the model scripts (resnet.py and resnet_ctl.py). The scripts support both Keras Fit/Compile and Custom Training Loop (CTL) modes with Horovod.
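
As a rough illustration of the CTL mode, a Horovod training step in TensorFlow 2 typically looks like the sketch below; the actual loop is implemented in nvutils, and all names here are illustrative rather than taken from the scripts.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

@tf.function
def train_step(model, optimizer, loss_fn, images, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(labels, logits)
    # Average gradients across all Horovod ranks before applying them.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Make all ranks start from the same weights and optimizer state.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss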

To use the DALI pipeline for data loading and preprocessing, pass --dali_mode=GPU or --dali_mode=CPU.

Training in Keras Fit/Compile mode

For the full training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet.py --num_iter=90 --iter_unit=epoch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 \
  --export_dir=/tmp --dali_mode="GPU"

For the benchmark training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet.py --num_iter=400 --iter_unit=batch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 --dali_mode="GPU"

Predicting in Keras Fit/Compile mode

For prediction with the previously saved model in /tmp:

python resnet.py --predict --export_dir=/tmp --dali_mode="GPU"

Training in CTL (Custom Training Loop) mode

For the full training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet_ctl.py --num_iter=90 --iter_unit=epoch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 \
  --export_dir=/tmp --dali_mode="GPU"

For the benchmark training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet_ctl.py --num_iter=400 --iter_unit=batch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 --dali_mode="GPU"

Predicting in CTL (Custom Training Loop) mode

For prediction with the previously saved model in /tmp:

python resnet_ctl.py --predict --export_dir=/tmp --dali_mode="GPU"

Other useful options

To use TensorBoard (note: /tmp/some_dir must be created by the user):

--tensorboard_dir=/tmp/some_dir

To export the saved model at the end of training (note: /tmp/some_dir must be created by the user):

--export_dir=/tmp/some_dir
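
The exported model can also be loaded outside these scripts, for example (a minimal sketch, assuming the SavedModel was written into the directory given above):

import tensorflow as tf

# Load the SavedModel produced with --export_dir (path assumed from above).
model = tf.saved_model.load("/tmp/some_dir")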

To store checkpoints at the end of every epoch (note: /tmp/some_dir must be created by the user):

--log_dir=/tmp/some_dir

To enable XLA:

--use_xla

Requirements

TensorFlow

pip install tensorflow-gpu==2.4.1

OpenMPI

wget -q -O - https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz | tar -xz
cd openmpi-3.0.0
./configure --enable-orterun-prefix-by-default --with-cuda --prefix=/usr/local/mpi --disable-getpwuid
make -j"$(nproc)" install
cd .. && rm -rf openmpi-3.0.0
echo "/usr/local/mpi/lib" >> /etc/ld.so.conf.d/openmpi.conf && ldconfig
export PATH=/usr/local/mpi/bin:$PATH

The following works around a segfault in OpenMPI 3.0 when it is run on a single node without ssh installed.

/bin/echo -e '#!/bin/bash'\
'\ncat <<EOF'\
'\n======================================================================'\
'\nTo run a multi-node job, install an ssh client and clear plm_rsh_agent'\
'\nin '/usr/local/mpi/etc/openmpi-mca-params.conf'.'\
'\n======================================================================'\
'\nEOF'\
'\nexit 1' >> /usr/local/mpi/bin/rsh_warn.sh && \
    chmod +x /usr/local/mpi/bin/rsh_warn.sh && \
    echo "plm_rsh_agent = /usr/local/mpi/bin/rsh_warn.sh" >> /usr/local/mpi/etc/openmpi-mca-params.conf

Horovod

export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_NCCL_INCLUDE=/usr/include
export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu
export HOROVOD_NCCL_LINK=SHARED
export HOROVOD_WITHOUT_PYTORCH=1
pip install horovod==0.21.0
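
After installation, a quick sanity check that Horovod initializes and that each rank pins its own GPU might look like this (the file name is illustrative):

# check_horovod.py -- run with: mpiexec --allow-run-as-root -np 8 python check_horovod.py
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each Horovod rank to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
print("rank %d of %d, local GPU %d" % (hvd.rank(), hvd.size(), hvd.local_rank()))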