Training with multiple GPUs
Clara’s multi-GPU training is based on Horovod (https://github.com/horovod/horovod). It works as follows:
To train with N GPUs, N processes running exactly the same code are used. Each process is pinned to a GPU.
The optimizer is wrapped in a distributed optimizer (by calling a Horovod function).
Horovod synchronizes the gradients across all processes at each training step. For this to work, all processes must run an identical number of training steps.
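The steps above can be sketched with Horovod's TensorFlow API. This is a minimal illustration, not Clara's actual code: the function name build_distributed_optimizer, the choice of Adam, and the base_lr default are assumptions made for the example. The imports are placed inside the function so that the file can be loaded on a machine without TensorFlow or Horovod installed.

```python
def build_distributed_optimizer(base_lr=1e-4):
    """Per-process Horovod setup (hypothetical helper, TensorFlow 1.x-style API)."""
    # Local imports: only needed when the function is actually called.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Each of the N processes runs this same code and calls init() once.
    hvd.init()

    # Pin this process to one GPU, chosen by its local rank on the node.
    config = tf.compat.v1.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Wrap a regular optimizer so that gradients are combined across
    # all processes at every training step. Adam is an assumption here.
    optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=base_lr)
    optimizer = hvd.DistributedOptimizer(optimizer)
    return optimizer, config
```

The returned config would be passed to the TensorFlow session, and the script would be launched once per GPU, for example through mpirun.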
Clara uses two datasets: a training dataset for minimizing the loss, and a validation dataset for evaluating the model in order to select the best one. In multi-GPU training, both datasets are sharded such that each process only takes a portion of the load.
Training dataset sharding. The training dataset is divided into N shards, one per GPU. This is the main reason for reduced total training time - the number of training steps for each process/GPU is only 1/N of the total, where N is the number of GPUs. Since Horovod synchronizes the training processes at each step, the sharding algorithm makes sure that all shards have the same size: if the dataset size is not divisible by N, it adds the first element of the dataset to the short shards. At the beginning of each epoch, the contents of the shards are reshuffled globally, so that over time each process gets to see the whole training dataset.
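The sharding rule described above can be illustrated with a short pure-Python sketch. It is not Clara's implementation: the function name shard_training_data, the round-robin split, and the use of a shared per-epoch seed for the global shuffle are assumptions made for the example.

```python
import random


def shard_training_data(items, num_shards, epoch_seed):
    """Split items into num_shards equal-size shards after a global shuffle."""
    if not items:
        return [[] for _ in range(num_shards)]
    order = list(items)
    # Assumption: every process uses the same seed in a given epoch, so all
    # of them compute the same permutation and the shards do not overlap.
    random.Random(epoch_seed).shuffle(order)
    # Deal the shuffled items round-robin, one shard per process.
    shards = [order[rank::num_shards] for rank in range(num_shards)]
    # Pad short shards with the first element of the dataset so that every
    # process runs the same number of training steps.
    target = max(len(shard) for shard in shards)
    for shard in shards:
        while len(shard) < target:
            shard.append(items[0])
    return shards


if __name__ == "__main__":
    # 10 samples on 4 GPUs: two shards are one short and get padded.
    for rank, shard in enumerate(shard_training_data(list(range(10)), 4, epoch_seed=0)):
        print(rank, shard)
```

Passing a different epoch_seed at each epoch changes which samples land in which shard, which is how each process eventually sees the whole dataset.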
Validation dataset sharding. The same algorithm is applied to the validation dataset, except that the shards do not need to be of equal size.
When computing validation metrics, the results from the individual processes are aggregated using MPI’s gather function.
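As an illustration of why the aggregation has to account for unequal shards, the sketch below simulates the gather step in plain Python. The helper names and the (sum, count) representation of a per-process result are assumptions made for the example, not Clara's actual metric code; the list named gathered stands in for what an MPI gather would deliver to the root process.

```python
def shard_validation_data(items, num_shards):
    """Split items into num_shards shards; sizes may differ by one."""
    return [items[rank::num_shards] for rank in range(num_shards)]


def local_result(scores):
    """Hypothetical per-process result: (sum of sample scores, sample count)."""
    return (sum(scores), len(scores))


def aggregate(gathered):
    """Combine the per-process results collected on the root process."""
    total = sum(partial_sum for partial_sum, _ in gathered)
    count = sum(n for _, n in gathered)
    return total / count if count else 0.0


if __name__ == "__main__":
    scores = [0.5, 0.75, 1.0, 0.25, 0.5]  # made-up per-sample metric values
    shards = shard_validation_data(scores, 2)
    gathered = [local_result(shard) for shard in shards]  # simulated gather
    print(aggregate(gathered))
```

Carrying the sample count alongside each partial sum keeps the aggregated mean equal to the mean over the whole validation set, which a plain average of per-process means would not guarantee when the shards differ in size.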
It can be difficult to set up the training parameters properly with multi-GPU training.
Batch Size - The batch size is constrained by GPU memory. If your GPUs don’t have the same amount of available memory, you have to choose a batch size that fits on every GPU.
Learning Rate - The learning rate is closely related to the number of GPUs and the batch size. According to the Horovod project, as a rule of thumb, you should scale the learning rate up with the number of GPUs. For example, if your LR for a single GPU is 0.0001, you could start with an LR of 0.0002 when training with 2 GPUs. Finding the best LR may require some experimentation.
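The rule of thumb can be written as a one-line calculation. The function name scaled_learning_rate is hypothetical, and the result is only a starting point for experimentation, not a guaranteed optimum.

```python
def scaled_learning_rate(single_gpu_lr, num_gpus):
    """Linear scaling rule of thumb: multiply the single-GPU LR by the GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return single_gpu_lr * num_gpus


if __name__ == "__main__":
    # Starting points for 1, 2 and 4 GPUs, given a single-GPU LR of 0.0001.
    for n in (1, 2, 4):
        print(n, scaled_learning_rate(0.0001, n))
```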
You can create your own train_ngpu.sh based on train_2gpu.sh. Make sure you adjust the learning rate accordingly.