Parallel Processing using Multi-GPU Configurations
This section shows how to improve the performance of Modulus runs using advanced features such as JIT compilation, TF32, and multi-GPU execution. It also presents studies that show how Modulus scales across multiple GPUs.
Running jobs using TF32 math mode
TensorFloat-32 (TF32) is a math mode available on NVIDIA A100 GPUs for handling the matrix math and tensor operations used during the training of a neural network.
On A100 GPUs, TF32 is on by default, so you do not need to modify your regular scripts to use it. With TF32, you can obtain up to a 1.8x speed-up over FP32 on A100 GPUs for the FPGA problem. This achieves the same results with dramatically reduced training times (Fig. 35) and no change in accuracy or loss convergence (Table 2 and Fig. 36).
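To give a feel for why TF32 can leave accuracy largely intact, the sketch below emulates its reduced input precision in plain Python. It assumes the commonly documented TF32 layout (the 8-bit exponent of FP32 with a 10-bit mantissa) and a simple round-to-nearest-in-magnitude rule. `round_to_tf32` and `relative_error` are hypothetical helpers, not part of Modulus, and actual hardware rounding and accumulation behavior may differ.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Emulate TF32 input precision for a finite value (illustrative only)."""
    # Reinterpret the value as the 32 raw bits of an IEEE-754 FP32 number.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 stores 23 mantissa bits; TF32 is assumed to keep 10, so the
    # 13 lowest bits are dropped. Adding half of the dropped range first
    # rounds to the nearest representable magnitude instead of truncating.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by the emulated TF32 rounding."""
    return abs(round_to_tf32(x) - x) / abs(x)


if __name__ == "__main__":
    for value in (0.1, 3.14159, 1234.5678):
        print(value, round_to_tf32(value), relative_error(value))
```

Under these assumptions the relative rounding error of a single input stays below about 2^-11, which is small compared with the tolerances typically involved in neural network training.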
Modulus: Fully Connected Networks with FP32
Modulus: Fully Connected Networks with TF32
Running jobs using Just-In-Time (JIT) compilation
JIT compilation is a feature where elements of the computational graph are compiled from native PyTorch to the TorchScript backend. This enables optimizations such as avoiding Python’s Global Interpreter Lock (GIL), as well as compute optimizations including dead code elimination, common subexpression elimination, and pointwise kernel fusion.
PINNs used in Modulus have several peculiarities, including the presence of many pointwise operations. Such operations, while computationally inexpensive, put heavy pressure on the memory subsystem of a GPU. JIT allows for kernel fusion, so that many of these operations can be computed in a single kernel, thereby reducing the number of memory transfers from GPU memory to the compute units.
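The memory-traffic argument can be illustrated without a GPU. The sketch below is a plain-Python analogy, not TorchScript: it evaluates tanh(a*x + b) once as three separate pointwise passes and once as a single fused pass, counting element reads and writes as a rough stand-in for memory transfers. The function names and the counting model are assumptions made purely for illustration.

```python
import math


def unfused(xs, a, b):
    """Three pointwise passes; each one reads and writes a full array."""
    traffic = 0
    scaled = [a * x for x in xs]            # pass 1: read xs, write scaled
    traffic += 2 * len(xs)
    shifted = [s + b for s in scaled]       # pass 2: read scaled, write shifted
    traffic += 2 * len(xs)
    out = [math.tanh(s) for s in shifted]   # pass 3: read shifted, write out
    traffic += 2 * len(xs)
    return out, traffic


def fused(xs, a, b):
    """One pass; intermediates live only in local temporaries."""
    out = [math.tanh(a * x + b) for x in xs]
    return out, 2 * len(xs)                 # read xs once, write out once


if __name__ == "__main__":
    data = [0.1 * i for i in range(1000)]
    _, slow = unfused(data, 2.0, 0.5)
    _, fast = fused(data, 2.0, 0.5)
    print("element transfers:", slow, "unfused vs", fast, "fused")
```

In this toy model the fused version moves a third of the data while producing the same values, which is the kind of saving kernel fusion targets for chains of pointwise operations.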
JIT is enabled by default in Modulus through the jit option in the config file. You can optionally disable JIT by adding a jit: false option in the config file or by adding a jit=False command line option.
Running jobs using multiple GPUs
To boost performance and to run larger problems, Modulus supports multi-GPU and multi-node scaling. This allows multiple processes, each targeting a single GPU, to perform independent forward and backward passes and aggregate the gradients collectively before updating the model weights. Fig. 37 shows the scaling performance of Modulus on the laminar FPGA test problem (the script can be found at examples/fpga/laminar/fpga_flow.py) up to 1024 A100 GPUs on 128 nodes. The scaling efficiency from 16 to 1024 GPUs is almost 85%.
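The text does not state how the efficiency figure is computed. One common weak-scaling definition, assumed here, compares the measured throughput gain with the ideal gain implied by the GPU count. The helper name and the throughput numbers in the usage example are hypothetical and are not measurements from Modulus.

```python
def weak_scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Ratio of measured speed-up to ideal speed-up (assumed definition)."""
    ideal_speedup = gpus / base_gpus
    measured_speedup = throughput / base_throughput
    return measured_speedup / ideal_speedup


if __name__ == "__main__":
    # Made-up throughputs (e.g. points processed per second), chosen only
    # so that the arithmetic is easy to follow.
    eff = weak_scaling_efficiency(16, 100.0, 1024, 5440.0)
    print(f"efficiency: {eff:.2%}")
```

With this definition, an efficiency of 1.0 means throughput grows in exact proportion to the number of GPUs.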
This data-parallel approach to multi-GPU training keeps the number of points sampled per GPU constant while increasing the total effective batch size. Increasing the number of GPUs therefore increases the total number of points sampled, which allows much larger problems to be handled.
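The gradient aggregation described above can be sketched with a toy model in plain Python. This is a conceptual illustration only, not how Modulus or PyTorch implement it: each shard stands in for the points sampled on one GPU, and a plain average stands in for the collective all-reduce. All names are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, shards):
    """Each shard plays one GPU: local gradient, then an averaged reduce."""
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    return sum(local_grads) / len(local_grads)


def effective_batch_size(points_per_gpu, num_gpus):
    """Points per GPU stay fixed, so the total grows with the GPU count."""
    return points_per_gpu * num_gpus


if __name__ == "__main__":
    shards = [([1.0, 2.0], [2.0, 4.1]), ([3.0, 4.0], [5.9, 8.2])]
    all_x = [x for xs, _ in shards for x in xs]
    all_y = [y for _, ys in shards for y in ys]
    print(data_parallel_grad(0.5, shards), grad_mse(0.5, all_x, all_y))
```

When every shard holds the same number of points, the averaged gradient matches the gradient computed on the combined batch, which is why adding GPUs behaves like training with a larger batch.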
To run a Modulus solution using multiple GPUs on a single compute node, first find out which GPUs are available.
Once you have found the available GPUs, you can run the job using mpirun -np #GPUs. The command below shows how to run the job using 2 GPUs:
mpirun -np 2 python fpga_flow.py
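The two steps above can be sketched in Python. The GPU query used here, `nvidia-smi -L`, is an assumption: it is a common way to list NVIDIA GPUs, but the text does not name the tool it has in mind. `count_gpus` and `mpirun_command` are hypothetical helper names.

```python
import shutil
import subprocess


def count_gpus() -> int:
    """Count GPUs listed by `nvidia-smi -L`; return 0 if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True
    )
    if result.returncode != 0:
        return 0
    # Assumes each physical GPU is reported on a line starting with "GPU ".
    return sum(1 for line in result.stdout.splitlines()
               if line.startswith("GPU "))


def mpirun_command(num_gpus: int, script: str) -> list:
    """Build the single-node launch command with one process per GPU."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    return ["mpirun", "-np", str(num_gpus), "python", script]


if __name__ == "__main__":
    gpus = count_gpus()
    if gpus:
        print(" ".join(mpirun_command(gpus, "fpga_flow.py")))
    else:
        print("no GPUs detected")
```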
Modulus also supports running a problem on multiple nodes using a SLURM scheduler. Simply launch the job using srun with the appropriate flags, and Modulus will set up the multi-node distributed process group. The command below shows how to launch a 2-node job with 8 GPUs per node (16 GPUs in total):
srun -n 16 --ntasks-per-node 8 --mpi=none python fpga_flow.py
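To make the arithmetic behind the flags concrete, the sketch below maps each of the 16 tasks to a node index and a local rank on that node. It assumes tasks are placed on nodes in contiguous blocks, which may not match every cluster configuration; `task_layout` is a hypothetical helper and not part of Modulus or SLURM.

```python
def task_layout(total_tasks: int, tasks_per_node: int):
    """Return (rank, node, local_rank) tuples, assuming block placement."""
    if total_tasks % tasks_per_node != 0:
        raise ValueError("total_tasks must be a multiple of tasks_per_node")
    layout = []
    for rank in range(total_tasks):
        # Integer division picks the node; the remainder is the local rank.
        node, local_rank = divmod(rank, tasks_per_node)
        layout.append((rank, node, local_rank))
    return layout


if __name__ == "__main__":
    for rank, node, local_rank in task_layout(16, 8):
        print(f"rank {rank:2d} -> node {node}, local rank {local_rank}")
```

With 16 tasks and 8 tasks per node, ranks 0 to 7 land on the first node and ranks 8 to 15 on the second, so each process can target one GPU through its local rank.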
Modulus also supports running on clusters that do not have a SLURM scheduler, as long as the following environment variables are set for each process:
MASTER_ADDR: IP address of the node with rank 0
MASTER_PORT: port that can be used for the different processes to communicate on
RANK: rank of that process
WORLD_SIZE: total number of participating processes
LOCAL_RANK (optional): rank of the process on its node
For more information, see Environment variable initialization
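As a minimal sketch of what a launcher script might check before starting each process, the code below reads and validates the variables listed above. `DistEnv` and `read_dist_env` are hypothetical names, and the validation rules are assumptions rather than Modulus behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class DistEnv:
    master_addr: str
    master_port: int
    rank: int
    world_size: int
    local_rank: Optional[int]  # optional, so it may be absent


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Collect the distributed settings, failing early if one is missing."""
    required = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError("RANK must satisfy 0 <= RANK < WORLD_SIZE")
    local = env.get("LOCAL_RANK")
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=rank,
        world_size=world_size,
        local_rank=int(local) if local is not None else None,
    )
```

For example, calling `read_dist_env({"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "3", "WORLD_SIZE": "16"})` returns a `DistEnv` with `local_rank` left as `None`; the address and port here are placeholder values.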