This section shows how to increase the performance of Modulus runs using advanced features such as JIT compilation, TF32, and multi-GPU training. It also presents studies that show how Modulus scales across multiple GPUs.
TensorFloat-32 (TF32) is a new math mode available on NVIDIA A100 GPUs for handling the matrix math and tensor operations used during the training of a neural network.
On A100 GPUs, the TF32 feature is “ON” by default, and you do not need to modify your regular scripts to use it. With this feature, you can obtain up to a 1.8x speed-up over FP32 on A100 GPUs for the FPGA problem. This gives the same results with reduced training times (Fig. 35) and no change in accuracy or loss convergence (Table 2 and Fig. 36).
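For intuition about the precision involved: TF32 keeps the 8-bit exponent of FP32 but stores a 10-bit mantissa instead of FP32's 23 bits. The sketch below emulates that reduced precision in plain Python by clearing the low 13 mantissa bits of an FP32 bit pattern. The helper name `to_tf32` is hypothetical, and truncation is used for simplicity; the rounding applied by the hardware may differ.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 input precision by truncating an FP32 value to 10 mantissa bits.

    Assumption: plain truncation; real hardware may round differently.
    """
    # Reinterpret the value as a 32-bit IEEE 754 pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 has 23 mantissa bits; TF32 keeps the top 10, so clear the low 13.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 29.24  # same magnitude as the pressure drops in Table 2
truncated = to_tf32(value)
relative_error = abs(truncated - value) / value
print(f"FP32-like input: {value}, TF32-like input: {truncated}, rel. error: {relative_error:.2e}")
```

With a 10-bit mantissa, the relative error of a single truncated input stays below 2^-10, which gives a rough sense of the precision traded for the speed-up.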

Fig. 35 Achieved speed-up using the TF32 compute mode on an A100 GPU for the FPGA example
| Case Description | \(P_{drop}\) \((Pa)\) |
| --- | --- |
| Modulus: Fully Connected Networks with FP32 | 29.24 |
| Modulus: Fully Connected Networks with TF32 | 29.13 |
| OpenFOAM Solver | 28.03 |
| Commercial Solver | 28.38 |

Fig. 36 Loss convergence plot for the FPGA simulation with the TF32 feature
JIT compilation is a feature where elements of the computational graph can be compiled from native PyTorch to the TorchScript backend. This allows for optimizations like avoiding Python’s Global Interpreter Lock (GIL) as well as compute optimizations including dead code elimination, common subexpression elimination, and pointwise kernel fusion.
PINNs used in Modulus have many peculiarities, including the presence of many pointwise operations. Such operations, while computationally inexpensive, put heavy pressure on the memory subsystem of a GPU. JIT allows for kernel fusion, so that many of these operations can be computed in a single kernel, thereby reducing the number of memory transfers from GPU memory to the compute units.
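The effect of fusing pointwise operations can be illustrated with a toy model in plain Python. This is only a conceptual sketch, not TorchScript and not how Modulus implements fusion: each list comprehension stands in for one GPU kernel that reads a full array from memory and writes a full array back, and the function names are hypothetical.

```python
import math


def unfused(xs):
    """Three pointwise ops as three separate passes, each with a full intermediate array."""
    passes = 0
    scaled = [x * 2.0 for x in xs]                 # pass 1: read xs, write scaled
    passes += 1
    activated = [math.tanh(v) for v in scaled]     # pass 2: read scaled, write activated
    passes += 1
    shifted = [v + 1.0 for v in activated]         # pass 3: read activated, write shifted
    passes += 1
    return shifted, passes


def fused(xs):
    """The same three ops applied per element in a single pass, with no intermediates."""
    return [math.tanh(x * 2.0) + 1.0 for x in xs], 1


xs = [0.1 * i for i in range(10)]
out_unfused, n_unfused = unfused(xs)
out_fused, n_fused = fused(xs)
print(f"passes over memory: unfused={n_unfused}, fused={n_fused}")
```

Both versions perform the same arithmetic per element; only the number of trips through memory changes, which is where memory-bound pointwise workloads gain from fusion.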
JIT is enabled by default in Modulus through the jit option in the config file. You can optionally disable JIT by adding a jit: false option in the config file or by adding a jit=False command line option.
To boost performance and to run larger problems, Modulus supports multi-GPU and multi-node scaling. This allows multiple processes, each targeting a single GPU, to perform independent forward and backward passes and aggregate the gradients collectively before updating the model weights. Fig. 37 shows the scaling performance of Modulus on the laminar FPGA test problem (the script can be found at examples/fpga/laminar/fpga_flow.py) on up to 1024 A100 GPUs across 128 nodes. The scaling efficiency from 16 to 1024 GPUs is almost 85%.
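The gradient aggregation step described above can be sketched in plain Python by simulating two processes in a single script. This is a minimal illustration under stated assumptions, not the Modulus implementation: the model (y = w * x), the data, the learning rate, and the function names are all hypothetical, and `all_reduce_mean` merely stands in for a collective averaging operation.

```python
def grad_mse(w, points):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)


def all_reduce_mean(values):
    """Stand-in for a collective all-reduce that averages one value per process."""
    return sum(values) / len(values)


# Hypothetical data: 8 points split evenly across 2 simulated processes (one per GPU).
points = [(float(i), 3.0 * i) for i in range(1, 9)]
shards = [points[0:4], points[4:8]]

w = 1.0
# Each process runs its own forward and backward pass on its own shard.
local_grads = [grad_mse(w, shard) for shard in shards]
# Gradients are aggregated before any process updates its weights.
global_grad = all_reduce_mean(local_grads)
w_new = w - 0.01 * global_grad

print(f"local gradients: {local_grads}, aggregated gradient: {global_grad}, updated w: {w_new}")
```

Because the shards have equal size, the averaged gradient equals the gradient computed over all points at once, so every process applies the same update and the model copies stay in sync.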
This data-parallel approach to multi-GPU training keeps the number of points sampled per GPU constant while increasing the total effective batch size. You can use this to your advantage: increasing the number of GPUs increases the total number of points sampled, which makes it possible to handle much larger problems.
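The arithmetic behind this can be sketched as follows. The per-GPU point count and the iteration times are hypothetical placeholders, not measurements, and the efficiency formula is one common definition for runs with constant work per GPU; the source does not state how its scaling figure was computed.

```python
def effective_batch_size(points_per_gpu: int, num_gpus: int) -> int:
    """Total points sampled per iteration when each GPU samples the same number."""
    return points_per_gpu * num_gpus


def weak_scaling_efficiency(time_base: float, time_scaled: float) -> float:
    """Efficiency when work per GPU is constant: ideal iteration time does not grow."""
    return time_base / time_scaled


POINTS_PER_GPU = 1000  # hypothetical per-GPU batch size

for gpus in (16, 128, 1024):
    print(f"{gpus:5d} GPUs -> effective batch size {effective_batch_size(POINTS_PER_GPU, gpus)}")

# Hypothetical iteration times (seconds), chosen only to illustrate the formula.
efficiency = weak_scaling_efficiency(time_base=0.85, time_scaled=1.0)
print(f"weak-scaling efficiency: {efficiency:.0%}")
```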
To run a Modulus solution using multiple GPUs on a single compute node, first list the available GPUs using
nvidia-smi
Once you have identified the available GPUs, you can run the job using mpirun -np #GPUs. The command below shows how to run the job using 2 GPUs.
mpirun -np 2 python fpga_flow.py
Modulus also supports running a problem on multiple nodes using a SLURM scheduler. Simply launch a job using srun and the appropriate flags, and Modulus will set up the multi-node distributed process group. The command below shows how to launch a 2-node job with 8 GPUs per node (16 GPUs in total):
srun -n 16 --ntasks-per-node 8 --mpi=none python fpga_flow.py
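To make the task layout of such a launch concrete, the sketch below enumerates which node and which local slot each of the 16 tasks would occupy. It assumes tasks are placed in consecutive blocks of 8 per node; the actual placement depends on the scheduler configuration, and the function name `task_layout` is hypothetical.

```python
def task_layout(total_tasks: int, tasks_per_node: int):
    """Return (global_rank, node_index, local_rank) for each task.

    Assumption: consecutive ranks fill one node before moving to the next.
    """
    return [
        (rank, rank // tasks_per_node, rank % tasks_per_node)
        for rank in range(total_tasks)
    ]


layout = task_layout(total_tasks=16, tasks_per_node=8)
for rank, node, local_rank in layout:
    # Under this assumption, the local rank doubles as the GPU index on the node.
    print(f"global rank {rank:2d} -> node {node}, local rank / GPU {local_rank}")
```

Each task drives exactly one GPU, so the number of tasks passed to -n is the number of nodes multiplied by the number of tasks per node.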
Modulus also supports running on other clusters that do not have a SLURM scheduler as long as the following environment variables are set for each process:
MASTER_ADDR: IP address of the node with rank 0
MASTER_PORT: port that can be used for the different processes to communicate on
RANK: rank of that process
WORLD_SIZE: total number of participating processes
LOCAL_RANK (optional): rank of the process on its node
For more information, see Environment variable initialization
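When preparing a launcher for a cluster without SLURM, it can help to validate these variables before starting training. The following is a minimal sketch using only the Python standard library; the `DistEnv` class and `read_dist_env` function are hypothetical helpers rather than part of Modulus, and the address and port in the example are placeholder values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class DistEnv:
    master_addr: str
    master_port: int
    rank: int
    world_size: int
    local_rank: Optional[int]  # None when LOCAL_RANK is not set


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read and sanity-check the distributed-launch variables for one process."""
    required = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))

    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK={rank} is outside the range [0, {world_size})")

    local_rank = env.get("LOCAL_RANK")
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=rank,
        world_size=world_size,
        local_rank=None if local_rank is None else int(local_rank),
    )


# Placeholder values for one process of a 16-process job.
example = read_dist_env(
    {"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "29500", "RANK": "3", "WORLD_SIZE": "16"}
)
print(example)
```

Passing a plain dictionary, as in the example, makes the check easy to try without modifying the real environment; calling read_dist_env() with no arguments reads from the current process environment instead.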

Fig. 37 Multi-node scaling efficiency for the FPGA example