Performance

A collection of various methods for accelerating Modulus are presented below. The figures below show a summary of performance improvements using various Modulus features over different releases.

Speed-up across different Modulus releases on V100 GPUs

Fig. 51 Speed-up across different Modulus releases on V100 GPUs. (MFD: Meshless Finite Derivatives)

Speed-up across different Modulus releases on A100 GPUs

Fig. 52 Speed-up across different Modulus releases on A100 GPUs. (MFD: Meshless Finite Derivatives)

Note

The higher vRAM in A100 GPUs means that we can use twice the batch size/GPU compared to the V100 runs. For comparison purposes, the total batch size is held constant, hence the A100 plots use 2 A100 GPUs.

Note

These figures are only for summary purposes and the runs were performed on the flow part of the example presented in Industrial Heat Sink. For more details on performance gains due to individual features, please refer to the subsequent sections.

Running jobs using TF32 math mode

TensorFloat-32 (TF32) is a new math mode available on NVIDIA A100 GPUs for handing matrix math and tensor operations used during the training of a neural network.

On A100 GPUs, the TF32 feature is “ON” by default and you do not need to make any modifications to the regular scripts to use this feature. With this feature, you can obtain up to 1.8x speed-up over FP32 on A100 GPUs for the FPGA problem. This allows us to achieve same results with dramatically reduced training times (Fig. 53) without change in accuracy and loss convergence (Table 2 and Fig. 54).

Speed-up using TF32 on an A100 GPU.

Fig. 53 Achieved speed-up using the TF32 compute mode on an A100 GPU for the FPGA example

Table 2 Comparison of results with and without TF32 math mode

Case Description

\(P_{drop}\) \((Pa)\)

Modulus: Fully Connected Networks with FP32

29.24

Modulus: Fully Connected Networks with TF32

29.13

OpenFOAM Solver

28.03

Commercial Solver

28.38

Loss convergence plot for FPGA simulation with TF32 feature

Fig. 54 Loss convergence plot for FPGA simulation with TF32 feature

Running jobs using Just-In-Time (JIT) compilation

JIT compilation is a feature where elements of the computational graph can be compiled from native PyTorch to the TorchScript backend. This allows for optimizations like avoiding python’s Global Interpreter Lock (GIL) as well as compute optimizations including dead code elimination, common substring elimination and pointwise kernel fusion.

PINNs used in Modulus have many peculiarities including the presence of many pointwise operations. Such operations, while being computationally inexpensive, put a large pressure on the memory subsystem of a GPU. JIT allows for kernel fusion, so that many of these operations can be computed simultaneously in a single kernel and thereby reducing the number of memory transfers from GPU memory to the compute units.

JIT is enabled by default in Modulus through the jit option in the config file. You can optionally disable JIT by adding a jit: false option in the config file or add a jit=False command line option.

CUDA Graphs

Modulus supports CUDA Graph optimization which can accelerate problems that are launch latency bottlenecked and improve parallel performance. Due to the strong scaling of GPU hardware, some machine learning problems can struggle keeping the GPU saturated resulting in work submission latency. This also impacts scalability due to work getting delayed from these bottlenecks. CUDA Graphs provides a solution to this problem by allowing the CPU to submit a sequence of jobs to the GPU rather than individually. For problems that are not matrix multiplied bound on the GPU, this can produce a notable speed up. Regardless of performance gains, it is recommended to use CUDA Graphs when possible, particularly when using multi-GPU and multi-node training. For additional details on CUDA Graphs in PyTorch, the reader is refered to the PyTorch Blog.

There are three steps to using CUDA Graphs:

  1. Warm-up phase where training is executed normally.

  2. Recording phase during which the forward and backward kernels during one training iteration are recorded into a graph.

  3. Replay of the recorded graph which is used for the rest of training.

Modulus supports this PyTorch utility and is turned on by default. CUDA Graphs can be enabled using Hydra. It is suggested to use at least 20 warm-up steps, which is the default. After 20 training iterations, Modulus will then attempt to record a CUDA Graph and if successful it will replay it for the remainder of training.

cuda_graphs: True
cuda_graph_warmup: 20

Warning

CUDA Graphs is presently a beta feature in PyTorch and may change in the future. This feature requires newer NCCL versions and host GPU drivers (R465 or greater). If errors are occurring please verify your drivers are up to date.

Warning

CUDA Graphs do not work for all user guide examples when using multiple GPUs. Some examples requires find_unused_parameters when using DDP, which is not compatible with CUDA Graphs.

Note

NVTX markers do not work inside of CUDA Graphs, thus we suggest shutting this feature off when profiling the code.

Meshless Finite Derivatives

Meshless finite derivatives is an alternative approach for calculating derivatives for physics-informed learning. Rather than relying on automatic differentiation to compute analytical gradients, meshless finite derivatives queries stencil points on the fly to approximate the gradients using finite difference. With autodiff, multiple automatic differentiation calls are needed to calculate the higher-order derivatives as well as the backward pass for optimization. The trouble is that computational complexity exponentially increases for every additional autodiff pass needed, which can significantly slow training. Meshless finite derivatives replaces the need for autodiff with additional forward passes. Since the finite difference stencil points are queried on demand, no grid discretion is needed preserving mesh free training.

For many problems, the additional computation needed for the foward passes in meshless finite derivatives is far less than the autodiff equivalent. This approach can potentially yield anywhere from a \(2-4\) times speed-up over the autodiff approach with comparable accuracy.

To use meshless finite derivatives, one just needs to define a MeshlessFiniteDerivative node and add it to a constraint that will require gradient quantities. Modulus will prioritize the use of meshless finite derivatives over autodiff when provided. When creating a MeshlessFiniteDerivative node, the derivatives that will be needed must be explicitly defined. This can be done though just a list, or accessing needed derivatives from other nodes. Additionally, this node requires a node that has the inputs consist of the independent variables and output being the quantities derivatives are needed for. For example, the derivative \(\partial f / \partial x\) with require a node with input variables that contain \(x\) and outputs \(f\). Switching to meshless finite derivatives is straight forward for most problems. As an example, for LDC the following code snippet turns on meshless finite derivative providing a \(3\) times speed-up:

from modulus.eq.derivatives import MeshlessFiniteDerivative

# Make list of nodes to unroll graph on
ns = NavierStokes(nu=0.01, rho=1.0, dim=2, time=False)
flow_net = instantiate_arch(
    input_keys=[Key("x"), Key("y")],
    output_keys=[Key("u"), Key("v"), Key("p")],
    cfg=cfg.arch.fully_connected
)
flow_net_node = flow_net.make_node(name="flow_network", jit=cfg.jit)
# Define derivatives needed to be calculated
# Requirements for 2D N-S
derivatives_strs = set(["u__x", "v__x", "p__x", "v__x__x", "u__x__x", "u__y", "v__y", \
    "p__y", "u__y__y", "v__y__y"])
derivatives = Key.convert_list(derivatives_strs)
# Or get the derivatives from the N-S node itself
derivatives = []
for node in ns.make_nodes():
    for key in node.derivatives:
        derivatives.append(Key(key.name, size=key.size, derivatives=key.derivatives))

# Create MFD node
mfd_node = MeshlessFiniteDerivative.make_node(
    node_model=flow_net_node,
    derivatives=derivatives,
    dx=0.001,
    max_batch_size=4*cfg.batch_size.Interior,
)
# Add to node list
nodes = ns.make_nodes() + [flow_net_node, mfd_node]

Warning

Meshless Finite Derivatives is a development from the Modulus team and is presently in beta. Use at your own discretion; stability and convergence is not garanteed. API subject to change in future versions.

Present Pitfalls

  • Setting the dx parameter is a very critical part of meshless finite derivatives. While classical numerical methods offer clear guidance on this topic, these do not directly apply here due additional stability constraints placed by the backwards pass and optimization. For most problems in our user guide a dx close to 0.001 works well and yields good convergence, lower will likely lead to instability during training with a float32 precision model. Additional details, tools and guidance on the specification of dx will be forthcoming in the near future.

  • Meshless finite derivatives can increase the noise during training compared to automatic differentiation due its approximate nature. Thus this feature is currently not suggested for problems that are exhibit unstable training characteristics for automatic differentiation.

  • Meshless finite derivatives can converge to the wrong solution and accuracy is highly dependent on the dx used.

  • Performance gains are problem specific and is based on the derivatives needed. Presently the best way to further increase the performance of meshless finite derivatives, users should increase max_batch_size when creating the meshless finite derivative node.

  • Modulus will add automatic differentiation nodes if all required derivatives are not specified to the meshless finite derivative.

Running jobs using multiple GPUs

To boost performance and to run larger problems, Modulus supports multi-GPU and multi-node scaling. This allows for multiple processes, each targeting a single GPU, to perform independent forward and backward passes and aggregate the gradients collectively before updating the model weights. The Fig. 55 shows the scaling performance of Modulus on the laminar FPGA test problem (script can be found at examples/fpga/laminar/fpga_flow.py) up to 1024 A100 GPUs on 128 nodes. The scaling efficiency from 16 to 1024 GPUs is almost 85%.

This data parallel fashion of multi-GPU training keeps the number of points sampled per GPU constant while increasing the total effective batch size. You can use this to your advantage to increase the number of points sampled by increasing the number of GPUs allowing you to handle much larger problems.

To run a Modulus solution using multiple GPUs on a single compute node, one can first find out the available GPUs using

nvidia-smi

Once you have found out the available GPUs, you can run the job using mpirun -np #GPUs. Below command shows how to run the job using 2 GPUs.

mpirun -np 2 python fpga_flow.py

Modulus supports running a problem on multiple nodes as well using a SLURM scheduler. Simply launch a job using srun and the appropriate flags and Modulus will set up the multi-node distributed process group. The command below shows how to launch a 2 node job with 8 GPUs per node (16 GPUs in total):

srun -n 16 --ntasks-per-node 8 --mpi=none python fpga_flow.py

Modulus also supports running on other clusters that do not have a SLURM scheduler as long as the following environment variables are set for each process:

  • MASTER_ADDR: IP address of the node with rank 0

  • MASTER_PORT: port that can be used for the different processes to communicate on

  • RANK: rank of that process

  • WORLD_SIZE: total number of participating processes

  • LOCAL_RANK (optional): rank of the process on it’s node

For more information, see Environment variable initialization

FPGA scaling

Fig. 55 Multi-node scaling efficiency for the FPGA example