ResNet-N with TensorFlow and DALI

This demo implements a residual network model and uses DALI for the data augmentation pipeline from the original paper.

It implements the ResNet50 v1.5 CNN model and demonstrates efficient single-node training on multi-GPU systems. The scripts can be used for benchmarking, or as a starting point for implementing and training your own network.
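
For orientation, below is a minimal sketch of a DALI training pipeline for ImageNet-style TFRecords. It is not the pipeline the demo ships (that one lives in nvutils); the batch size, crop size, and normalization constants here are illustrative assumptions.

import nvidia.dali.fn as fn
import nvidia.dali.types as types
import nvidia.dali.tfrecord as tfrec
from nvidia.dali import pipeline_def

# Illustrative sketch only; the demo's real pipeline is defined in nvutils.
@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def train_pipeline(tfrecord_files, index_files, use_gpu=True):
    inputs = fn.readers.tfrecord(
        path=tfrecord_files,
        index_path=index_files,
        features={
            "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
            "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
        },
        random_shuffle=True,
        name="Reader")
    # "mixed" decodes JPEGs on the GPU; "cpu" keeps the whole pipeline on the CPU.
    images = fn.decoders.image(inputs["image/encoded"],
                               device="mixed" if use_gpu else "cpu")
    # Random resized crop plus horizontal flip and per-channel normalization.
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT16,
        output_layout="NHWC",
        mean=[123.68, 116.78, 103.94],
        std=[58.393, 57.12, 57.375],
        mirror=fn.random.coin_flip())
    return images, inputs["image/class/label"]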

Common utilities for defining CNN networks and performing basic training are located in the nvutils directory inside docs/examples/use_cases/tensorflow/resnet-n. The utilities are written in TensorFlow 2.0. Use of nvutils is demonstrated in the model scripts (resnet.py and resnet_ctl.py). The scripts support both Keras Fit/Compile and Custom Training Loop (CTL) modes with Horovod.
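
As a rough illustration of the CTL mode, a Horovod training step in TensorFlow 2 typically looks like the sketch below; the actual loop is implemented in nvutils, and all names here are illustrative rather than taken from the scripts.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

@tf.function
def train_step(model, optimizer, loss_fn, images, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(labels, logits)
    # Average gradients across all Horovod ranks before applying them.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Make all ranks start from the same weights and optimizer state.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss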

To use the DALI pipeline for data loading and preprocessing, pass --dali_mode=GPU or --dali_mode=CPU.

Training in Keras Fit/Compile mode

For the full training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet.py --num_iter=90 --iter_unit=epoch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 \
  --export_dir=/tmp --dali_mode="GPU"

For the benchmark training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet.py --num_iter=400 --iter_unit=batch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 --dali_mode="GPU"

Predicting in Keras Fit/Compile mode

For prediction with the previously saved model in /tmp:

python resnet.py --predict --export_dir=/tmp --dali_mode="GPU"

Training in CTL (Custom Training Loop) mode

For the full training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet_ctl.py --num_iter=90 --iter_unit=epoch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 \
  --export_dir=/tmp --dali_mode="GPU"

For the benchmark training on 8 GPUs:

mpiexec --allow-run-as-root --bind-to socket -np 8 \
  python resnet_ctl.py --num_iter=400 --iter_unit=batch \
  --data_dir=/data/imagenet/train-val-tfrecord-480/ \
  --precision=fp16 --display_every=100 --dali_mode="GPU"

Predicting in CTL (Custom Training Loop) mode

For prediction with the previously saved model in /tmp:

python resnet_ctl.py --predict --export_dir=/tmp --dali_mode="GPU"

Other useful options

To use TensorBoard (note: /tmp/some_dir must be created by the user):

--tensorboard_dir=/tmp/some_dir

To export the saved model at the end of training (note: /tmp/some_dir must be created by the user):

--export_dir=/tmp/some_dir
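
The exported model can also be loaded outside these scripts, for example (a minimal sketch, assuming the SavedModel was written into the directory given above):

import tensorflow as tf

# Load the SavedModel produced with --export_dir (path assumed from above).
model = tf.saved_model.load("/tmp/some_dir")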

To store checkpoints at the end of every epoch (note: /tmp/some_dir must be created by the user):

--log_dir=/tmp/some_dir

To enable XLA:

--use_xla

Requirements

TensorFlow

pip install tensorflow-gpu==2.4.1

OpenMPI

wget -q -O - https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz | tar -xz
cd openmpi-3.0.0
./configure --enable-orterun-prefix-by-default --with-cuda --prefix=/usr/local/mpi --disable-getpwuid
make -j"$(nproc)" install
cd .. && rm -rf openmpi-3.0.0
echo "/usr/local/mpi/lib" >> /etc/ld.so.conf.d/openmpi.conf && ldconfig
export PATH=/usr/local/mpi/bin:$PATH

The following works around a segfault in OpenMPI 3.0 when it is run on a single node without ssh installed.

/bin/echo -e '#!/bin/bash'\
'\ncat <<EOF'\
'\n======================================================================'\
'\nTo run a multi-node job, install an ssh client and clear plm_rsh_agent'\
'\nin '/usr/local/mpi/etc/openmpi-mca-params.conf'.'\
'\n======================================================================'\
'\nEOF'\
'\nexit 1' >> /usr/local/mpi/bin/rsh_warn.sh && \
    chmod +x /usr/local/mpi/bin/rsh_warn.sh && \
    echo "plm_rsh_agent = /usr/local/mpi/bin/rsh_warn.sh" >> /usr/local/mpi/etc/openmpi-mca-params.conf

Horovod

export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_NCCL_INCLUDE=/usr/include
export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu
export HOROVOD_NCCL_LINK=SHARED
export HOROVOD_WITHOUT_PYTORCH=1
pip install horovod==0.21.0
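
After installation, a quick sanity check that Horovod initializes and that each rank pins its own GPU might look like this (the file name is illustrative):

# check_horovod.py -- run with: mpiexec --allow-run-as-root -np 8 python check_horovod.py
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each Horovod rank to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
print("rank %d of %d, local GPU %d" % (hvd.rank(), hvd.size(), hvd.local_rank()))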