ResNet-N with TensorFlow and DALI¶

This demo implements residual networks model and use DALI for the data augmentation pipeline from the original paper.

It implements the ResNet50 v1.5 CNN model and demonstrates efficient single-node training on multi-GPU systems. They can be used for benchmarking, or as a starting point for implementing and training your own network.

Common utilities for defining CNN networks and performing basic training are located in the nvutils directory. The utilities are written in Tensorflow 2.0. Use of nvutils is demonstrated in the model script (i.e. resnet.py). The scripts support both Keras Fit/Compile and Custom Training Loop (CTL) modes with Horovod.

To use DALI pipeline for data loading and preprocessing ` --dali_mode=GPU #or --dali_mode=CPU `

## Training in Keras Fit/Compile mode For the full training on 8 GPUs: ``` mpiexec –allow-run-as-root –bind-to socket -np 8

python resnet.py –num_iter=90 –iter_unit=epoch –data_dir=/data/imagenet/train-val-tfrecord-480/ –precision=fp16 –display_every=100 –export_dir=/tmp –dali_mode=”GPU”

```

For the benchmark training on 8 GPUs: ``` mpiexec –allow-run-as-root –bind-to socket -np 8

python resnet.py –num_iter=400 –iter_unit=batch –data_dir=/data/imagenet/train-val-tfrecord-480/ –precision=fp16 –display_every=100 –dali_mode=”GPU”

```

## Predicting in Keras Fit/Compile mode For predicting with previously saved mode in /tmp: ` python resnet.py --predict --export_dir=/tmp --dali_mode="GPU" `

## Training in CTL (Custom Training Loop) mode For the full training on 8 GPUs: ``` mpiexec –allow-run-as-root –bind-to socket -np 8

python resnet_ctl.py –num_iter=90 –iter_unit=epoch –data_dir=/data/imagenet/train-val-tfrecord-480/ –precision=fp16 –display_every=100 –export_dir=/tmp –dali_mode=”GPU”

```

For the benchmark training on 8 GPUs: ``` mpiexec –allow-run-as-root –bind-to socket -np 8

python resnet_ctl.py –num_iter=400 –iter_unit=batch –data_dir=/data/imagenet/train-val-tfrecord-480/ –precision=fp16 –display_every=100 –dali_mode=”GPU”

```

## Predicting in CTL (Custom Training Loop) mode For predicting with previously saved mode in /tmp: ` python resnet_ctl.py --predict --export_dir=/tmp --dali_mode="GPU" `

## Other useful options To use tensorboard (Note, /tmp/some_dir needs to be created by users): ` --tensorboard_dir=/tmp/some_dir `

To export saved model at the end of training (Note, /tmp/some_dir needs to be created by users): ` --export_dir=/tmp/some_dir `

To store checkpoints at the end of every epoch (Note, /tmp/some_dir needs to be created by users): ` --log_dir=/tmp/some_dir `

To enable XLA ` --use_xla `

Requirements¶

TensorFlow¶

pip install tensorflow-gpu==2.4.1

OpenMPI¶

wget -q -O - https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz | tar -xz
cd openmpi-3.0.0
./configure --enable-orterun-prefix-by-default --with-cuda --prefix=/usr/local/mpi --disable-getpwuid
make -j"$(nproc)" install
cd .. && rm -rf openmpi-3.0.0
echo "/usr/local/mpi/lib" >> /etc/ld.so.conf.d/openmpi.conf && ldconfig
export PATH=/usr/local/mpi/bin:$PATH

The following works around a segfault in OpenMPI 3.0 when run within a single node without ssh being installed.

/bin/echo -e '#!/bin/bash'\
'\ncat <<EOF'\
'\n======================================================================'\
'\nTo run a multi-node job, install an ssh client and clear plm_rsh_agent'\
'\nin '/usr/local/mpi/etc/openmpi-mca-params.conf'.'\
'\n======================================================================'\
'\nEOF'\
'\nexit 1' >> /usr/local/mpi/bin/rsh_warn.sh && \
    chmod +x /usr/local/mpi/bin/rsh_warn.sh && \
    echo "plm_rsh_agent = /usr/local/mpi/bin/rsh_warn.sh" >> /usr/local/mpi/etc/openmpi-mca-params.conf

Horovod¶

export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_NCCL_INCLUDE=/usr/include
export HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu
export HOROVOD_NCCL_LINK=SHARED
export HOROVOD_WITHOUT_PYTORCH=1
pip install horovod==0.21.0