Parallel Processing using Multi-GPU Configurations

Introduction

This section shows how to improve the performance of Modulus runs using advanced features such as JIT compilation, the TF32 math mode, and multi-GPU execution. It also presents studies of how Modulus scales across multiple GPUs.

Running jobs using TF32 math mode

TensorFloat-32 (TF32) is a math mode available on NVIDIA A100 GPUs for handling the matrix math and tensor operations used during the training of a neural network.

On A100 GPUs, TF32 is “ON” by default, so you do not need to modify the regular scripts to use it. For the FPGA problem, TF32 gives up to a 1.8x speedup over FP32 on A100 GPUs. This achieves the same results with much shorter training times (Fig. 35), without a change in accuracy or loss convergence (Table 2 and Fig. 36).
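To give a feel for the precision involved, the sketch below emulates the TF32 number format in plain Python. TF32 keeps the 8-bit exponent of FP32 but only 10 of its 23 mantissa bits. The helper name to_tf32 is hypothetical, and rounding to nearest is an assumption made for illustration; this shows the number format only, not how Modulus or the GPU performs the computation.

```python
import struct


def to_tf32(x: float) -> float:
    """Round x to FP32, then keep only the top 10 mantissa bits (TF32-like)."""
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 stores 23 mantissa bits and TF32 keeps 10, so the low 13 bits go.
    # Adding half of the dropped range first rounds to nearest (assumption).
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 29.24
rounded = to_tf32(value)
relative_error = abs(rounded - value) / value
print(rounded, relative_error)
```

The relative error of a single rounded value stays below 2^-10, which is small compared with the differences between solvers reported in Table 2.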

Fig. 35 Achieved speedup using the TF32 compute mode on an A100 GPU for the FPGA example

Table 2 Comparison of results with and without TF32 math mode

Case Description                              | \(P_{drop}\) (Pa)
Modulus: Fully Connected Networks with FP32   | 29.24
Modulus: Fully Connected Networks with TF32   | 29.13
OpenFOAM Solver                               | 28.03
Commercial Solver                             | 28.38

Fig. 36 Loss convergence plot for FPGA simulation with TF32 feature

Running jobs using Just-In-Time (JIT) compilation

JIT compilation is a feature in which elements of the computational graph are compiled from native PyTorch to the TorchScript backend. This enables optimizations such as avoiding Python’s Global Interpreter Lock (GIL), as well as compute optimizations including dead code elimination, common subexpression elimination, and pointwise kernel fusion.

The PINNs used in Modulus have many peculiarities, including a large number of pointwise operations. Such operations are computationally inexpensive but put heavy pressure on the memory subsystem of a GPU. JIT allows for kernel fusion, so that many of these operations are computed together in a single kernel, thereby reducing the number of memory transfers between GPU memory and the compute units.
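The effect of pointwise fusion can be illustrated with a plain Python analogy. This is not TorchScript and not Modulus code; the function names and the particular chain of operations are made up for the illustration. The unfused version makes three full passes over the data and materializes two intermediate arrays, while the fused version reads and writes each element once.

```python
import math


def unfused(xs):
    # Three separate "kernels": each one reads a full array from memory
    # and writes a full array back.
    scaled = [x * 2.0 for x in xs]
    shifted = [v + 1.0 for v in scaled]
    return [math.tanh(v) for v in shifted]


def fused(xs):
    # One "kernel": the same three pointwise operations are applied while
    # the element is held locally, so there are no intermediate arrays.
    return [math.tanh(x * 2.0 + 1.0) for x in xs]


points = [0.0, 0.25, 0.5, 1.0]
print(unfused(points) == fused(points))
```

Both versions compute the same values; only the number of trips through memory differs, which is what fusion targets on a GPU.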

JIT is enabled by default in Modulus through the jit option in the config file. You can disable JIT by adding jit: false to the config file or by passing jit=False on the command line.

Running jobs using multiple GPUs

To boost performance and to run larger problems, Modulus supports multi-GPU and multi-node scaling. Multiple processes, each targeting a single GPU, perform independent forward and backward passes and then aggregate the gradients collectively before updating the model weights. Fig. 37 shows the scaling performance of Modulus on the laminar FPGA test problem (the script is at examples/fpga/laminar/fpga_flow.py) on up to 1024 A100 GPUs across 128 nodes. The scaling efficiency from 16 to 1024 GPUs is almost 85%.
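As a worked reading of the efficiency figure, the sketch below uses one common definition of scaling efficiency: measured speedup divided by ideal speedup. Both the definition and the throughput numbers are assumptions for illustration, not measurements from the study; the value 54.4 is chosen only so that the arithmetic lands on 85%.

```python
def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal_speedup = gpus / base_gpus
    measured_speedup = throughput / base_throughput
    return measured_speedup / ideal_speedup


# Going from 16 to 1024 GPUs is a 64x increase in GPU count.
# Hypothetical normalized throughputs: 1.0 at 16 GPUs, 54.4 at 1024 GPUs.
efficiency = scaling_efficiency(16, 1.0, 1024, 54.4)
print(f"{efficiency:.0%}")
```

Under this definition, an efficiency of about 85% over a 64x increase in GPU count corresponds to roughly a 54x increase in throughput.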

This data-parallel approach to multi-GPU training keeps the number of points sampled per GPU constant while increasing the total effective batch size. Adding GPUs therefore increases the total number of points sampled, which makes it possible to handle much larger problems.
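The gradient aggregation described above can be sketched without any GPU. In this toy example each "rank" holds its own batch of points, computes a local gradient for a one-parameter model, and the gradients are averaged in place of a collective all-reduce. The model, the data, and the function names are hypothetical and exist only for the illustration.

```python
def local_gradient(w, points):
    # Gradient of mean((w * x - y) ** 2) with respect to w on one rank's batch.
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)


def all_reduce_mean(values):
    # Stand-in for a collective all-reduce that averages across ranks.
    return sum(values) / len(values)


per_rank_points = [
    [(1.0, 2.0), (2.0, 4.0)],  # batch on rank 0
    [(3.0, 6.0), (4.0, 8.0)],  # batch on rank 1
]
w = 0.5

local_grads = [local_gradient(w, pts) for pts in per_rank_points]
averaged_grad = all_reduce_mean(local_grads)

# With equal batch sizes per rank, the averaged gradient matches the gradient
# computed on all points at once, so the effective batch size is the sum.
all_points = [p for pts in per_rank_points for p in pts]
effective_batch_size = len(all_points)
print(averaged_grad == local_gradient(w, all_points), effective_batch_size)
```

Doubling the number of ranks in this sketch doubles effective_batch_size while the work per rank stays the same, which is the property the paragraph above relies on.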

To run a Modulus solution using multiple GPUs on a single compute node, first list the available GPUs using

nvidia-smi

Once you know how many GPUs are available, run the job using mpirun -np #GPUs. The command below runs the job on 2 GPUs.

mpirun -np 2 python fpga_flow.py
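The two steps above can also be scripted. The sketch below counts GPUs from the text printed by nvidia-smi -L (which prints one line per GPU) and builds the matching mpirun command. The helper names are hypothetical, the sample listing is illustrative rather than captured from a real machine, and the exact line format is an assumption.

```python
import subprocess


def list_gpus() -> str:
    # Ask the driver for one line per GPU; requires nvidia-smi on the PATH.
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return result.stdout


def count_gpus(listing: str) -> int:
    # Assumes each GPU is reported on its own line starting with "GPU ".
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def build_mpirun_command(num_gpus: int, script: str) -> list:
    # One process per GPU, mirroring "mpirun -np #GPUs python <script>".
    return ["mpirun", "-np", str(num_gpus), "python", script]


# Illustrative listing for a two-GPU node (not real output).
sample_listing = (
    "GPU 0: NVIDIA A100 (UUID: GPU-hypothetical-0)\n"
    "GPU 1: NVIDIA A100 (UUID: GPU-hypothetical-1)\n"
)
command = build_mpirun_command(count_gpus(sample_listing), "fpga_flow.py")
print(" ".join(command))
```

On a machine with GPUs, passing list_gpus() instead of sample_listing would size the job to the hardware that is actually present.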

Modulus also supports running a problem on multiple nodes under a SLURM scheduler. Launch the job using srun with the appropriate flags, and Modulus will set up the multi-node distributed process group. The command below launches a 2-node job with 8 GPUs per node (16 GPUs in total):

srun -n 16 --ntasks-per-node 8 --mpi=none python fpga_flow.py

Modulus also supports running on other clusters that do not have a SLURM scheduler as long as the following environment variables are set for each process:

  • MASTER_ADDR: IP address of the node with rank 0

  • MASTER_PORT: port that can be used for the different processes to communicate on

  • RANK: rank of that process

  • WORLD_SIZE: total number of participating processes

  • LOCAL_RANK (optional): rank of the process on its node

For more information, see the “Environment variable initialization” section of the PyTorch distributed documentation.
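When setting these variables by hand, it can help to validate them before starting a run. The sketch below reads the variables listed above from an environment mapping and checks that they are consistent. The function name read_dist_env, the returned dictionary layout, and the example address and port are hypothetical; this is not how Modulus itself parses the environment.

```python
import os

REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def read_dist_env(env=None):
    """Collect and sanity-check the distributed settings of one process."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError("RANK must satisfy 0 <= RANK < WORLD_SIZE")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": rank,
        "world_size": world_size,
        # LOCAL_RANK is optional; None means it was not provided.
        "local_rank": int(env["LOCAL_RANK"]) if "LOCAL_RANK" in env else None,
    }


# Hypothetical settings for one process of a 16-process job.
example_env = {
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "9",
    "WORLD_SIZE": "16",
    "LOCAL_RANK": "1",
}
settings = read_dist_env(example_env)
print(settings)
```

Calling read_dist_env() with no argument reads from the real process environment, so the same check can run at the top of a launch script.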

Fig. 37 Multi-node scaling efficiency for the FPGA example