Achieving High Performance#

Below we present some general advice that may help you achieve high performance using nvCOMPDx.

General Advice#

  • Make sure to partition data into chunks so that the compression/decompression task can run sufficiently in parallel on your device.

  • If possible, batch enough chunks so that the resulting grid of CUDA thread blocks fills the GPU for peak performance.

  • The provided ANS algorithm offers block-level strong scaling, so try increasing the block size when executing it through the block-level API.

  • Try (de)compressing multiple chunks in a single thread block through the warp-level API.

  • The best parameters for compute-bound and memory-bound kernels might not be identical.

  • Merge adjacent memory-bound kernels (pre- and post-processing) with a (de)compressor kernel to save global memory round trips (see the fused-kernel sketch after this list).
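
Below is a minimal sketch of such a fusion: decompression and a simple post-processing step (widening bytes to floats) share one kernel, so the decompressed bytes never round-trip through global memory. The decompress_chunk() device function is a hypothetical stand-in for a block-level decompressor; this stub merely copies bytes so the sketch compiles.

    __device__ void decompress_chunk(const unsigned char* comp_chunk,
                                     size_t comp_size,
                                     unsigned char* out_smem,
                                     size_t* out_size)
    {
        // Hypothetical placeholder: a real block-level decompressor would run here.
        // This stub just copies the bytes so that the sketch compiles.
        for (size_t i = threadIdx.x; i < comp_size; i += blockDim.x) {
            out_smem[i] = comp_chunk[i];
        }
        if (threadIdx.x == 0) { *out_size = comp_size; }
    }

    __global__ void decompress_and_postprocess(const unsigned char* const* comp_chunks,
                                               const size_t* comp_sizes,
                                               float* const* out_chunks)
    {
        extern __shared__ unsigned char smem[];   // holds one decompressed chunk
        __shared__ size_t decomp_size;

        // One chunk per thread block, decompressed into shared memory.
        decompress_chunk(comp_chunks[blockIdx.x], comp_sizes[blockIdx.x],
                         smem, &decomp_size);
        __syncthreads();

        // Fused post-processing: widen bytes to floats and write the final
        // result to global memory exactly once.
        float* out = out_chunks[blockIdx.x];
        for (size_t i = threadIdx.x; i < decomp_size; i += blockDim.x) {
            out[i] = static_cast<float>(smem[i]);
        }
    }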

Memory Management#

  • Avoid unnecessary reading and writing of data from/to global memory.

  • Ensure that reads from and writes to global memory, including those staging data through shared memory, are coalesced.

  • Use __shared__ memory or extra registers to store temporary data.

  • Use vectorized load/store instructions when possible (e.g., float4, int4); see the vectorized staging sketch after this list.

  • Utilize memory pools to reduce allocation overhead for frequently allocated/deallocated memory (see the stream-ordered allocator sketch after this list).
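
A minimal sketch of coalesced, vectorized staging through shared memory, assuming each chunk is a multiple of sizeof(float4) and 16-byte aligned; consecutive threads touch consecutive float4 elements, so the accesses coalesce.

    __global__ void stage_and_copy(const float4* __restrict__ in,
                                   float4* __restrict__ out,
                                   int vecs_per_chunk)
    {
        extern __shared__ float4 smem[];
        const int base = blockIdx.x * vecs_per_chunk;

        // Vectorized, coalesced loads: consecutive threads read consecutive float4s.
        for (int i = threadIdx.x; i < vecs_per_chunk; i += blockDim.x) {
            smem[i] = in[base + i];
        }
        __syncthreads();

        // ... per-block processing on smem would go here ...

        // Vectorized, coalesced stores back to global memory.
        for (int i = threadIdx.x; i < vecs_per_chunk; i += blockDim.x) {
            out[base + i] = smem[i];
        }
    }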
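
And a minimal sketch of reducing allocation overhead with the stream-ordered allocator: raising the default pool's release threshold keeps freed scratch memory cached in the pool, so repeated per-batch allocations do not hit the OS allocator each time. The scratch size and threshold below are illustrative.

    #include <cuda_runtime.h>
    #include <cstdint>

    void run_batches(cudaStream_t stream, size_t scratch_bytes, int num_batches)
    {
        cudaMemPool_t pool;
        cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);

        // Keep up to ~256 MiB cached in the pool between allocations
        // (illustrative value).
        uint64_t threshold = 256ull << 20;
        cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

        for (int b = 0; b < num_batches; ++b) {
            void* scratch = nullptr;
            cudaMallocAsync(&scratch, scratch_bytes, stream);

            // ... enqueue (de)compression work that uses `scratch` on `stream` ...

            cudaFreeAsync(scratch, stream);   // returned to the pool, not the OS
        }
        cudaStreamSynchronize(stream);
    }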

Stream Management#

  • Use multiple CUDA streams to overlap computation and data transfers (the sketch after this list combines streams, events, and priorities).

  • Consider using CUDA events to synchronize operations between streams when necessary.

  • Implement asynchronous memory operations where possible to hide transfer latency.

  • Use stream priorities to ensure critical operations get higher priority.
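
A minimal sketch combining the points above: a low-priority copy stream overlaps host-to-device transfers with kernel work on a high-priority compute stream, and an event orders each chunk's copy before its kernel. The process_chunk kernel is a hypothetical stand-in for a (de)compression launch, and the host buffer is assumed to be pinned so the copies are truly asynchronous.

    #include <cuda_runtime.h>

    __global__ void process_chunk(const unsigned char* in, unsigned char* out, size_t n)
    {
        // Hypothetical stand-in for a (de)compression kernel.
        for (size_t i = threadIdx.x; i < n; i += blockDim.x) { out[i] = in[i]; }
    }

    void pipeline(const unsigned char* h_in,       // pinned host buffer
                  unsigned char* d_in, unsigned char* d_out,
                  size_t chunk_bytes, int num_chunks)
    {
        int least, greatest;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreateWithPriority(&copy_stream, cudaStreamNonBlocking, least);
        cudaStreamCreateWithPriority(&compute_stream, cudaStreamNonBlocking, greatest);

        cudaEvent_t copied;
        cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);

        for (int c = 0; c < num_chunks; ++c) {
            unsigned char* d_chunk = d_in + c * chunk_bytes;
            cudaMemcpyAsync(d_chunk, h_in + c * chunk_bytes, chunk_bytes,
                            cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(copied, copy_stream);

            // The compute stream waits only for this chunk's copy, not the
            // whole copy stream, so later copies overlap with this kernel.
            cudaStreamWaitEvent(compute_stream, copied, 0);
            process_chunk<<<1, 256, 0, compute_stream>>>(
                d_chunk, d_out + c * chunk_bytes, chunk_bytes);
        }
        cudaStreamSynchronize(compute_stream);
    }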

Advanced#

  • For compression/decompression tasks that do not fill the device entirely, consider making the problem more fine-grained, i.e., use smaller chunks.

  • Use the NVIDIA Nsight Compute CUDA Occupancy Calculator [7] to understand bottlenecks in your fused kernel (e.g., register pressure, shared memory pressure, etc.).

  • Use the cudaOccupancyMaxActiveBlocksPerMultiprocessor [9] function to determine optimal launch parameters, i.e., avoid multiple waves of computation if possible (see the occupancy sketch after this list).

  • Add __launch_bounds__ to your CUDA kernel signature if the thread block size is known at compile time to enable more fine-grained compiler optimizations (a sketch combining this with __restrict__ follows this list).

  • Add the __restrict__ keyword to your non-aliasing pointer arguments in the fused CUDA kernel signature.

  • Check whether CUDA Programmatic Dependent Launch [8] (PDL) is applicable in your application (a hedged sketch follows this list).

  • Compile your CUDA kernel(s) with the nvcc argument --ptxas-options=-v and eliminate spilled loads and stores.

  • Profile your application using Nsight Systems [10] to identify system-level bottlenecks.
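
A minimal sketch of sizing a launch to a single wave via the occupancy API; fused_kernel is a hypothetical placeholder for your own fused (de)compression kernel.

    #include <cuda_runtime.h>

    __global__ void fused_kernel() { /* hypothetical fused (de)compression kernel */ }

    void launch_single_wave(int block_size, size_t dynamic_smem)
    {
        int device = 0, num_sms = 0, blocks_per_sm = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, fused_kernel,
                                                      block_size, dynamic_smem);

        // Launching at most one full wave avoids a partially filled tail wave.
        int grid_size = blocks_per_sm * num_sms;
        fused_kernel<<<grid_size, block_size, dynamic_smem>>>();
    }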
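
A minimal sketch combining __launch_bounds__ with a compile-time block size and __restrict__ on non-aliasing pointer arguments; the kernel body is an illustrative stand-in for real post-processing.

    constexpr int kBlockSize = 256;   // thread block size known at compile time

    __global__ void __launch_bounds__(kBlockSize)
    postprocess(const float* __restrict__ in, float* __restrict__ out, int n)
    {
        // __restrict__ tells the compiler that in/out do not alias;
        // __launch_bounds__ lets it budget registers for exactly kBlockSize
        // threads per block.
        int i = blockIdx.x * kBlockSize + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * 0.5f;   // stand-in for real post-processing
        }
    }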
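
Finally, a hedged sketch of Programmatic Dependent Launch, assuming CUDA 11.8+ and a Hopper-or-newer GPU: the consumer is launched with programmatic stream serialization and calls cudaGridDependencySynchronize() before reading the producer's output, which lets its prologue overlap with the tail of the producer. Both kernels here are illustrative placeholders.

    #include <cuda_runtime.h>

    __global__ void producer(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { data[i] = 1.0f; }
        // Optionally signal early that dependent launches may begin.
        cudaTriggerProgrammaticLaunchCompletion();
    }

    __global__ void consumer(float* data, int n)
    {
        // Prologue work independent of the producer could run here.
        cudaGridDependencySynchronize();   // wait for the producer's output
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { data[i] += 1.0f; }
    }

    void launch_with_pdl(float* d_data, int n, cudaStream_t stream)
    {
        const int block = 256, grid = (n + block - 1) / block;
        producer<<<grid, block, 0, stream>>>(d_data, n);

        cudaLaunchAttribute attr[1];
        attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
        attr[0].val.programmaticStreamSerializationAllowed = 1;

        cudaLaunchConfig_t config = {};
        config.gridDim  = grid;
        config.blockDim = block;
        config.stream   = stream;
        config.attrs    = attr;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, consumer, d_data, n);
    }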

Warning

Due to internal optimizations, it is not guaranteed that repeated execution of the exact same compressor description results in bit-identical results. However, repeated executions of the exact same decompressor always produce bit-identical results.

Further Reading#

References#