Multi-node BERT User Guide

This document describes the cluster architecture that can be used to run BERT training jobs.

1. Introduction to Multi-Node Training Needs

Deep Learning models continue to grow larger and more complex while datasets are ever expanding. These two factors, along with an increased need for reduced time-to-market, improved accuracy for a better user experience, and the desire for more research iterations for better outcomes, have driven the requirement for large GPU compute clusters.

This document outlines the cluster architecture that was used at NVIDIA to run recent record-setting BERT training jobs. BERT is currently one of the most popular natural language processing architectures because it can be pre-trained on unlabeled datasets while producing state-of-the-art accuracy, and many other models, outlined below, follow a similar network design. On a single DGX-2 node with 16 NVIDIA V100 GPUs, the BERT-Large model of 330M parameters can be trained in about 3 days. On a 16 DGX-2 node cluster, BERT-Large can be trained in less than 4 hours. On a 64 DGX-2 node cluster utilizing the technologies listed in this document, the training time is reduced to just 67 minutes.

Not only is this cluster setup efficient for BERT, it is also likely applicable to many other Transformer-based architectures, such as Transformer-XL, GPT-2, and Megatron. Along with training on more data at once, a large optimized cluster allows data scientists to take advantage of model parallelism to train larger and more accurate models.

Through hours of performance tuning and the development of new open source technologies, NVIDIA was able to take full advantage of a cluster of DGX servers to achieve state-of-the-art results. Herein, we share our expertise to help you deploy a similar cluster in your own environment. We provide a technical deployment guide and example multi-node run commands, and give you an idea of what performance results to expect.

2. Summary of Cluster Technology

Slurm

The first decision to be made when deploying a cluster is which scheduler you will use. Job schedulers are important to maximize GPU utilization and make it easy for data scientists to deploy and scale workloads.

Those with an HPC background have likely used a tool called Slurm. It is an open source, highly scalable cluster management and job scheduling system - and it is just as effective for running Deep Learning jobs as it is for running other scientific computational workloads.

Leveraging Slurm allows data scientists to queue up their workloads, run multiple experiments in parallel, and get the highest utilization out of their compute systems.

Beyond high utilization, NVIDIA has recently released several new open source projects that facilitate the use of Slurm while boosting performance. Compared to a stand-alone Slurm installation, integrating Enroot, PMIx, and Pyxis provides users with improved security, better user management, better performance, and simplified configuration, setup, and job deployment.

Users may already be familiar with PMIx, but Enroot and Pyxis are brand-new technologies that were developed and used by NVIDIA during recent benchmarking efforts.

Enroot

Enroot is a tool to turn traditional container images into unprivileged sandboxes. It is a lightweight, performant container runtime with built-in NVIDIA GPU support.

Beyond adding essentially no performance overhead, one benefit Enroot brings is that, without any additional work for admins, users enter their containers as unprivileged users. This means data scientists can spend less time fussing with file permissions or user rights and more time inspecting their results, with no risk of accidentally running as root and deleting a file system.
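
As a rough illustration (assuming Enroot is installed and your NGC credentials are configured; the generated .sqsh filename may differ depending on the image), importing and starting a container as an unprivileged user can look like this:

# Import a Docker image from the NGC registry into a squashfs file
enroot import docker://nvcr.io#nvidia/pytorch:19.09-py3
# Create an unprivileged container root filesystem from the imported image
enroot create --name pytorch nvidia+pytorch+19.09-py3.sqsh
# Start the container as the current, non-root user
enroot start pytorch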

Pyxis

Pyxis is a plugin that integrates Enroot with Slurm. It adds command-line arguments to srun that allow a user to specify container images and configurations. Pyxis then uses Enroot features to launch those containers on one or more nodes.

Pyxis comes with support for PMIx, a library that enables tight integration between MPI applications and Slurm, including from inside containers. The PMIx integration is the main component that gives this stack a performance increase over a standard Slurm installation.

The value of Pyxis can be seen in the examples later in this guide, where a complex multi-line command is replaced with one or two easy-to-read lines of execution. Along with the performance boost Pyxis brings from PMIx integration, this usability factor reduces deployment bugs and makes jobs easier to understand.
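
As a minimal sketch (assuming Pyxis and Enroot are installed on the cluster; the image reference below uses Enroot's registry#image syntax and the payload command is illustrative), a containerized job can be launched directly from srun:

# Launch eight tasks on one node, each inside the NGC PyTorch container
srun -N1 --ntasks-per-node=8 \
     --container-image=nvcr.io#nvidia/pytorch:19.09-py3 \
     nvidia-smi -L

The --container-image argument is added by Pyxis; Enroot pulls and unpacks the image on the node before the command runs.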

DGX POD and DGX SuperPOD

NVIDIA DGX POD provides a reference architecture: a blueprint for building scalable GPU compute clusters. A DGX POD encapsulates a rack of DGX servers, storage, networking, DGX POD management software, and NVIDIA-optimized deep learning models - in this case, BERT.

The DGX SuperPOD is a large-scale implementation of the DGX POD reference architecture, consisting of 96 DGX-2H servers with a total of 1,536 V100 GPUs. The 96 DGX-2H node cluster utilizing the technologies listed in this document is capable of reducing BERT-Large pre-training time to just 53 minutes.

3. Requirements and Deployment

Requirements

In order to deploy this cluster you will need the following:
  • Cluster of DGX-1 servers
  • High-speed persistent storage
  • Non-blocking InfiniBand switches

Deployment

After the initial configuration and deployment, data scientists will be able to log in, monitor running jobs, view the existing queue and available nodes, and deploy their batch or interactive jobs. For a complete technical deployment guide refer to the High-Performance Multi-Node Cluster Deployment Guide.
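
As a brief illustration (partition, node, and script names are site-specific; run.sub here refers to the batch script used in the examples later in this guide), day-to-day interaction with the cluster uses standard Slurm commands:

# View available nodes and partitions
sinfo
# View the current job queue
squeue
# Submit a batch job script
sbatch run.sub
# Launch an interactive session on one node
srun -N1 --pty bash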

4. BERT Multi-node Model Implementation

In order to take the most advantage of a high-performance GPU compute cluster, such as the DGX POD, NVIDIA has developed PyTorch and TensorFlow implementations of BERT optimized for NVIDIA Tensor Core GPUs and multi-node training.

Some of the prominent features of this implementation include the following (the note after this list shows how several of them surface as command-line flags):
  • Automatic Mixed Precision (AMP) training support: mixed precision is the use of both float16 and float32 data types when training a model. Performing arithmetic operations in float16 takes advantage of the performance gains of specialized processing units such as Tensor Cores.
  • LAMB (Layerwise Adaptive Moments based optimizer), a large-batch optimization technique that helps accelerate training of deep neural networks using large minibatches, allowing global batch sizes of up to 65,536. We implemented a fused CUDA kernel for LAMB to improve performance.
  • Fused CUDA kernels for better performance of the LayerNorm operation.
  • PyTorch JIT: since version 1.0, PyTorch has included a just-in-time (JIT) compiler, a way to create serializable and optimizable models from PyTorch code. PyTorch can compile JIT-able modules rather than running them in the interpreter, allowing for various optimizations and improving performance during both training and inference. The NVIDIA PyTorch BERT implementation makes use of PyTorch JIT for the GeLU layer.
  • Overlapping of the all-reduce operation with backpropagation to hide communication cost.
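
Several of these features surface directly as command-line flags in the example pretraining commands shown later in this guide. The mapping below is a rough, illustrative summary rather than a complete description of the script's options:

# Selected run_pretraining.py flags from the example commands in this guide,
# mapped to the features listed above:
#   --fp16                               Automatic Mixed Precision training
#   --train_batch_size / --gradient_accumulation_steps
#                                        large-batch training with the LAMB optimizer
#   --allreduce_post_accumulation        perform the gradient all-reduce only after
#   --allreduce_post_accumulation_fp16   local accumulation, and do it in fp16, to
#                                        reduce communication cost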

5. Examples

Now that we understand the use cases for a high-performance cluster and we know the steps to deploy all the components, let’s take a look at the code required to run workloads. The examples below use BERT, but the same commands could be modified for any model training. Note that this assumes we already have our training data downloaded and accessible on our network storage device.

Running the same BERT workload with Pyxis and Enroot configured gives us a simplified command.

Note: Because we are using Enroot, the simplified command will download the Docker image and convert it to the Enroot format automatically; it does not require any complicated configuration.

In order to launch a training job, we need to specify a few things: the number of nodes, the number of tasks (one per GPU) per node, the container image we want to use, the partition we want to use, and the command we want to run. The command for training on a single DGX-1 looks like this.

BATCHSIZE=8192 LR=6e-3 GRADIENT_STEPS=512 PHASE=1 sbatch -N1 --ntasks-per-node=8 run.sub
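
A plausible reading of this command (run.sub's internals are not shown here, so treat the details as assumptions): BATCHSIZE, LR, and GRADIENT_STEPS set the per-GPU batch size, learning rate, and gradient accumulation steps passed through to run_pretraining.py, while PHASE selects between the two standard BERT pre-training phases (phase 1 trains on shorter sequences, such as the max_seq_length=128 used in the full command later in this section; phase 2 trains on longer sequences).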

By contrast, the commands that would have been required to execute the same workload without Pyxis are shown below. Keep in mind that, in addition to this more complicated command, there is also an additional shim layer that needs to be configured in order for Docker to properly launch containers on Slurm nodes. This configuration file can be dozens of lines and is not necessary when using Enroot:

# Allocate an interactive shell on a compute node
srun -N1 --pty bash
# Authenticate with the NGC registry and pull the training container
docker login nvcr.io
docker pull nvcr.io/nvidia/pytorch:19.09-py3
# Launch the container manually with the required mounts, host namespaces, and limits
nvidia-docker run -v <volume mounts> -it --rm --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --security-opt seccomp=unconfined nvcr.io/nvidia/pytorch:19.09-py3 bash
# Inside the container, start pre-training
python -u /workspace/bert/run_pretraining.py \
 --seed=42 \
 --do_train \
 --config_file=/workspace/bert/bert_config.json \
 --output_dir=/results \
 --fp16 \
 --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
 --gradient_accumulation_steps=2

Back to Pyxis: running the same workload on a DGX-2 can be done by changing the number of tasks per node to match its 16 GPUs and adjusting the batch size and gradient accumulation steps accordingly.

BATCHSIZE=4096 LR=6e-3 GRADIENT_STEPS=64 PHASE=1 sbatch -N1 --ntasks-per-node=16 run.sub

If we are running on multiple nodes, the command is very similar. Across 4 DGX-1s, we specify 4 nodes with 8 tasks per node and modify the batch size and gradient accumulation steps accordingly; for 4 DGX-2s, we specify 4 nodes with 16 tasks per node and scale the batch size and gradient accumulation steps to match. The commands look as follows (a note on how the batch sizes scale follows the commands):

# 4 DGX-1s 
BATCHSIZE=2048 LR=6e-3 GRADIENT_STEPS=128 PHASE=1 sbatch -N4 --ntasks-per-node=8 run.sub 
BATCHSIZE=1024 LR=4e-3 GRADIENT_STEPS=256 PHASE=2 sbatch -N4 --ntasks-per-node=8 run.sub 
 
# 4 DGX-2s 
BATCHSIZE=1024 LR=6e-3 GRADIENT_STEPS=16 PHASE=1 sbatch -N4 --ntasks-per-node=16 run.sub 
BATCHSIZE=512 LR=4e-3 GRADIENT_STEPS=64 PHASE=2 sbatch -N4 --ntasks-per-node=16 run.sub 
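
Assuming BATCHSIZE corresponds to the per-GPU train_batch_size consumed by run_pretraining.py (an assumption based on the example values, not on the contents of run.sub), these settings keep the global batch size constant as the cluster grows. For phase 1: 8192 x 8 GPUs = 4096 x 16 = 2048 x 32 = 1024 x 64 = 65,536 sequences per optimizer step, matching the maximum LAMB batch size noted earlier; for phase 2 the corresponding global batch size is 32,768. The GRADIENT_STEPS values are likewise reduced as nodes are added.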

If you would like to inspect the training container without directly launching a training job, you can start an interactive session instead.
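
A minimal sketch of such a session with Pyxis (assuming the same container image as above; the image reference uses Enroot's registry#image syntax):

srun -N1 --ntasks-per-node=1 --container-image=nvcr.io#nvidia/pytorch:19.09-py3 --pty bash

After entering the container, you can then manually run the training script: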

python -m torch.distributed.launch --nproc_per_node=8 /workspace/bert/run_pretraining.py \
    --seed=42 \
    --train_batch_size=8192 \
    --learning_rate=6e-3 \
    --input_dir=/workspace/datadir \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=7038 \
    --num_steps_per_checkpoint=2500 \
    --warmup_proportion=0.2843 \
    --do_train \
    --config_file=/workspace/bert/bert_config.json \
    --output_dir=/results \
    --fp16 \
    --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
    --gradient_accumulation_steps=512

If you are curious to learn more about Enroot, the GitHub page has some usage examples you can use to learn the tool.

For a full list of Pyxis configurations, see the Pyxis guide.

6. Performance Expectations

When running these benchmarks on a cluster of DGX nodes, you should expect time-to-convergence to decrease nearly linearly as you add nodes.

As can be seen in the plot below, training time continues to drop nearly linearly, both within a single node and as you go from 1 to 8 nodes, and that trend continues as you scale up to 96 nodes.

Figure: BERT-Large Training Times on GPUs

The following table shows results from actual pre-training runs on the Wikipedia and BookCorpus datasets.

Time      System         Number of Nodes    Number of V100 GPUs
53 min    DGX SuperPOD   92 x DGX-2H        1,472
67 min    DGX SuperPOD   64 x DGX-2H        1,024
236 min   DGX SuperPOD   16 x DGX-2H        256
Note: For the latest benchmarks, refer to BERT hosted on NGC.
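
As a rough check on this scaling from the table above: going from 16 nodes (236 minutes) to 64 nodes (67 minutes) is a speedup of about 3.5x on 4x the GPUs, or roughly 88% scaling efficiency.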

Slurm and PMIx are able to take advantage of tools such as NCCL and MPI, along with hardware and software components such as NVLink, InfiniBand, and RDMA, to achieve this near-linear performance at large scale. One takeaway here is that the performance gains come not only from the hardware, but also from the software running on top of it. With software improvements coming out every few weeks, models will continue to train faster on the same hardware.

7. Summary

As showcased here, with a few simple steps we were able to take a cluster of DGX-1 systems, deploy Slurm packages to them, and replicate state-of-the-art BERT training benchmarks across the cluster.

This setup will enable you to train some of the biggest deep learning models, and other large models, at the fastest speed possible.

With the use of NVLink, InfiniBand, and RDMA, the GPUs across the cluster are able to communicate seamlessly with each other, creating a highly performant compute mesh. This gives performance that continues to scale in a nearly linear fashion as you add more GPUs and more nodes. We saw this scaling hold from 1 node up to 96 nodes without any drop-off, and we expect the pattern to continue if the cluster were grown further.

Performance is a key part of a compute cluster, but the human aspect remains just as important. We hope that with the continual release of new technologies such as Enroot and Pyxis, the deployment and configuration time for system admins drops, along with the time-to-adoption for data scientists learning a new compute platform.

For more details about NVIDIA’s BERT training benchmarks and our approaches to model parallelism, check out this post.

8. FAQ & Troubleshooting

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries.

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries.

Other company and product names may be trademarks of the respective companies with which they are associated.