Performance
In the Profiling Applications in PhysicsNeMo tutorial, you saw an end-to-end application speedup achieved by applying performance analysis tools to an AI application. In this tutorial, you will dive deeper into a variety of subjects, focusing on the techniques and tools you need to make your scientific AI applications faster.
One major challenge in performance optimization is understanding where to start. Always remember Amdahl’s Law. As cited on Wikipedia, reproduced here for convenience:
Amdahl’s Law - Key Performance Principle
“The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used”.
– Reddy, Martin (2011). API Design for C++. Burlington, Massachusetts: Morgan Kaufmann Publishers. p. 210. doi:10.1016/C2010-0-65832-9. ISBN 978-0-12-385003-4. LCCN 2010039601. OCLC 666246330.
As it applies to the performance of AI applications, Amdahl's Law reminds you to view them as end-to-end, CPU+GPU applications with many computational subsystems. Doubling the performance of your AI model won't matter much if your application spends 90% of its time blocked on I/O and data preprocessing. While not a comprehensive list, the critical components of an AI pipeline include:
Model Performance: Once inputs are loaded onto the GPU, the AI model itself has a number of tools for improving computational performance. You'll explore some of these below, but torch.compile (tutorial), CUDA Graphs (blog post), mixed precision (examples), multi-device parallelism with NCCL (docs), and specialized kernels from the NVIDIA ecosystem such as cuML (docs), cuGraph (docs), and Warp (docs) are all powerful tools to improve model performance. A minimal torch.compile and mixed-precision sketch follows this list.

Data Loading: In small-scale AI applications, a dataset might be loaded into CPU RAM and streamed to the GPU in batches as needed. For scientific AI, however, datasets are often measured in terabytes, and data loading can become a serious bottleneck for application performance in both training and inference. Several libraries exist (HDF5 (docs), Zarr (docs), and higher-level tools) that can load data faster than pure NumPy. But interactions with storage systems, CPU cores, CPU-GPU transfers, and other hardware components can quickly complicate data loading, often with unexpected performance degradations. Some cases can be handled out of the box with tools like NVIDIA DALI, while other applications take more effort. A data-loading sketch also follows this list.

Data Preprocessing: Once data has been loaded from file, preprocessing steps are often required before the data can flow to the AI model. These range from deterministic transformations (padding data for your model, normalization, and others) to stochastic, run-time transformations (subsampling large data, augmenting data with noise, random cropping, mirroring, etc.). If done on the CPU, preprocessing can limit application performance by starving the GPU of work.
Scaling Performance: Applications that run well on a single GPU may struggle when deployed to a multi-GPU or multi-node system. There is great documentation on model scale-up (see PyTorch DDP (tutorial), for example). Fewer tools and tutorials exist, however, to help you manage parallel I/O, checkpoint writing and restoring, or aggregating and tracking metrics efficiently. A minimal DDP sketch follows this list.
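To make the model-performance tools above concrete, here is a minimal sketch, assuming PyTorch 2.x and an Ampere-or-newer GPU (for bfloat16 support), that wraps a toy placeholder model in torch.compile and runs training steps under mixed precision. The model, tensor shapes, and hyperparameters are all stand-ins; substitute your own PhysicsNeMo model.

```python
import torch
from torch import nn

# Toy stand-in model and data; substitute your own PhysicsNeMo model here.
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Compile once up front; the first iterations pay a one-time compilation cost.
model = torch.compile(model)

x = torch.randn(64, 256, device="cuda")
y = torch.randn(64, 256, device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in bfloat16 where it is safe to do so.
    # (bf16 keeps fp32's dynamic range, so no gradient scaler is needed.)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```

On GPUs without bfloat16 support, the usual substitute is fp16 autocast paired with a gradient scaler (torch.cuda.amp.GradScaler) to keep small gradient values from underflowing.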
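For the Data Loading and Data Preprocessing items, the sketch below streams batches out of an HDF5 file with h5py while keeping normalization on the GPU rather than the CPU. The file name train.h5, the dataset name fields, and the normalization statistics are hypothetical placeholders.

```python
import h5py
import torch
from torch.utils.data import DataLoader, Dataset


class HDF5Dataset(Dataset):
    """Reads individual samples lazily instead of loading the file into RAM."""

    def __init__(self, path: str, dataset_name: str = "fields"):
        self.path = path
        self.dataset_name = dataset_name
        self._file = None  # opened lazily, once per DataLoader worker process
        with h5py.File(path, "r") as f:
            self._len = f[dataset_name].shape[0]

    def __len__(self) -> int:
        return self._len

    def __getitem__(self, idx: int) -> torch.Tensor:
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        # Only the requested slice is read from disk.
        return torch.from_numpy(self._file[self.dataset_name][idx])


loader = DataLoader(
    HDF5Dataset("train.h5"),  # hypothetical file and dataset names
    batch_size=16,
    num_workers=4,    # overlap disk reads with GPU compute
    pin_memory=True,  # enables faster, asynchronous host-to-device copies
)

mean, std = 0.0, 1.0  # placeholder normalization statistics
for batch in loader:
    batch = batch.cuda(non_blocking=True)
    batch = (batch - mean) / std  # normalize on the GPU, not the CPU
```

For heavier pipelines (decoding, augmentation, and so on), NVIDIA DALI can move much of this preprocessing work onto the GPU itself.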
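Finally, for the Scaling Performance item, here is a minimal DistributedDataParallel sketch. It assumes a torchrun launch, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process; the model and data are again placeholders.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(256, 256).cuda()  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(32, 256, device="cuda")
    y = torch.randn(32, 256, device="cuda")

    for _ in range(10):
        optimizer.zero_grad(set_to_none=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launch with, for example, torchrun --nproc_per_node=4 train_ddp.py, where train_ddp.py is a hypothetical file containing this script.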
When it comes time to evaluate your model for performance optimization, keep these ideas in mind. The sub-sections below can help you improve performance in specific areas, but where you spend your optimization effort should be guided empirically by application performance and bottlenecks. In other words: profile your whole application, often with multiple tools, to get a full picture of its bottlenecks. And after you've made improvements, reprofile before deciding what to optimize next.
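As a concrete starting point, here is a minimal sketch of capturing a few training steps with torch.profiler. The model and training step are placeholders for your own pipeline, and the dedicated Profiling guide below covers these tools in much more depth.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

# Placeholder model and training step; substitute your own pipeline.
model = nn.Linear(256, 256).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)


def train_step(batch: torch.Tensor) -> None:
    optimizer.zero_grad(set_to_none=True)
    loss = model(batch).square().mean()
    loss.backward()
    optimizer.step()


# Capture both CPU-side and GPU-side activity for a handful of steps.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(10):
        train_step(torch.randn(64, 256, device="cuda"))

# Rank operators by total GPU time to see where the time actually goes.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```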
Performance Topics
Click below to learn more about performance-related topics and how to apply them in scientific AI workloads with PhysicsNeMo:
Profiling: Start here to get an overview of profiling tools and how to use them for your model.
Torch Compile Support: Learn how to integrate other kernels effectively into your models and use torch.compile to enable end-to-end model compilation for maximum performance.
Note
These performance guides are works in progress. Look for expanded and updated content in our next release!
Is there a performance critical component of PhysicsNeMo or scientific AI workloads that doesn’t get enough attention? Let us know on GitHub!