Achieving High Performance#

Below we present some general advice that may help you achieve high performance using nvCOMPDx.

General Advice#

  • Make sure to partition data into chunks so that the compression/decompression task can run sufficiently in parallel on your device.

  • If possible, batch enough chunks so that the resulting grid of CUDA thread blocks fills the GPU for peak performance.

  • The provided ANS algorithm offers block-level strong scaling, so try increasing the block size when executing it through the block-level API.

  • Try (de)compressing multiple chunks in a single thread block through the warp-level API.

  • The best parameters for compute-bound and memory-bound kernels might not be identical.

  • Merge adjacent memory-bound kernels (pre- and post-processing) with a (de)compressor kernel to save global memory round trips (see the fused-kernel sketch after this list).
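
Below is a minimal sketch of such a fusion: decompression and a simple post-processing step (widening bytes to floats) share one kernel, so the decompressed bytes never round-trip through global memory. The decompress_chunk() device function is a hypothetical stand-in for a block-level decompressor; this stub merely copies bytes so the sketch compiles.

    __device__ void decompress_chunk(const unsigned char* comp_chunk,
                                     size_t comp_size,
                                     unsigned char* out_smem,
                                     size_t* out_size)
    {
        // Hypothetical placeholder: a real block-level decompressor would run here.
        // This stub just copies the bytes so that the sketch compiles.
        for (size_t i = threadIdx.x; i < comp_size; i += blockDim.x) {
            out_smem[i] = comp_chunk[i];
        }
        if (threadIdx.x == 0) { *out_size = comp_size; }
    }

    __global__ void decompress_and_postprocess(const unsigned char* const* comp_chunks,
                                               const size_t* comp_sizes,
                                               float* const* out_chunks)
    {
        extern __shared__ unsigned char smem[];   // holds one decompressed chunk
        __shared__ size_t decomp_size;

        // One chunk per thread block, decompressed into shared memory.
        decompress_chunk(comp_chunks[blockIdx.x], comp_sizes[blockIdx.x],
                         smem, &decomp_size);
        __syncthreads();

        // Fused post-processing: widen bytes to floats and write the final
        // result to global memory exactly once.
        float* out = out_chunks[blockIdx.x];
        for (size_t i = threadIdx.x; i < decomp_size; i += blockDim.x) {
            out[i] = static_cast<float>(smem[i]);
        }
    }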

Memory Management#

  • Avoid unnecessary reading and writing of data from/to global memory.

  • Ensure that reads from and writes to global memory, including those staging data through shared memory, are coalesced.

  • Use __shared__ memory or extra registers to store temporary data.

  • Use vectorized load/store instructions when possible (e.g., float4, int4); see the vectorized staging sketch after this list.

  • Utilize memory pools to reduce allocation overhead for frequently allocated/deallocated memory (see the stream-ordered allocator sketch after this list).
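
A minimal sketch of coalesced, vectorized staging through shared memory, assuming each chunk is a multiple of sizeof(float4) and 16-byte aligned; consecutive threads touch consecutive float4 elements, so the accesses coalesce.

    __global__ void stage_and_copy(const float4* __restrict__ in,
                                   float4* __restrict__ out,
                                   int vecs_per_chunk)
    {
        extern __shared__ float4 smem[];
        const int base = blockIdx.x * vecs_per_chunk;

        // Vectorized, coalesced loads: consecutive threads read consecutive float4s.
        for (int i = threadIdx.x; i < vecs_per_chunk; i += blockDim.x) {
            smem[i] = in[base + i];
        }
        __syncthreads();

        // ... per-block processing on smem would go here ...

        // Vectorized, coalesced stores back to global memory.
        for (int i = threadIdx.x; i < vecs_per_chunk; i += blockDim.x) {
            out[base + i] = smem[i];
        }
    }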
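
And a minimal sketch of reducing allocation overhead with the stream-ordered allocator: raising the default pool's release threshold keeps freed scratch memory cached in the pool, so repeated per-batch allocations do not hit the OS allocator each time. The scratch size and threshold below are illustrative.

    #include <cuda_runtime.h>
    #include <cstdint>

    void run_batches(cudaStream_t stream, size_t scratch_bytes, int num_batches)
    {
        cudaMemPool_t pool;
        cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);

        // Keep up to ~256 MiB cached in the pool between allocations
        // (illustrative value).
        uint64_t threshold = 256ull << 20;
        cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

        for (int b = 0; b < num_batches; ++b) {
            void* scratch = nullptr;
            cudaMallocAsync(&scratch, scratch_bytes, stream);

            // ... enqueue (de)compression work that uses `scratch` on `stream` ...

            cudaFreeAsync(scratch, stream);   // returned to the pool, not the OS
        }
        cudaStreamSynchronize(stream);
    }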

Stream Management#

  • Use multiple CUDA streams to overlap computation and data transfers (the sketch after this list combines streams, events, and priorities).

  • Consider using CUDA events to synchronize operations between streams when necessary.

  • Implement asynchronous memory operations where possible to hide transfer latency.

  • Use stream priorities to ensure critical operations get higher priority.
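
A minimal sketch combining the points above: a low-priority copy stream overlaps host-to-device transfers with kernel work on a high-priority compute stream, and an event orders each chunk's copy before its kernel. The process_chunk kernel is a hypothetical stand-in for a (de)compression launch, and the host buffer is assumed to be pinned so the copies are truly asynchronous.

    #include <cuda_runtime.h>

    __global__ void process_chunk(const unsigned char* in, unsigned char* out, size_t n)
    {
        // Hypothetical stand-in for a (de)compression kernel.
        for (size_t i = threadIdx.x; i < n; i += blockDim.x) { out[i] = in[i]; }
    }

    void pipeline(const unsigned char* h_in,       // pinned host buffer
                  unsigned char* d_in, unsigned char* d_out,
                  size_t chunk_bytes, int num_chunks)
    {
        int least, greatest;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreateWithPriority(&copy_stream, cudaStreamNonBlocking, least);
        cudaStreamCreateWithPriority(&compute_stream, cudaStreamNonBlocking, greatest);

        cudaEvent_t copied;
        cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);

        for (int c = 0; c < num_chunks; ++c) {
            unsigned char* d_chunk = d_in + c * chunk_bytes;
            cudaMemcpyAsync(d_chunk, h_in + c * chunk_bytes, chunk_bytes,
                            cudaMemcpyHostToDevice, copy_stream);
            cudaEventRecord(copied, copy_stream);

            // The compute stream waits only for this chunk's copy, not the
            // whole copy stream, so later copies overlap with this kernel.
            cudaStreamWaitEvent(compute_stream, copied, 0);
            process_chunk<<<1, 256, 0, compute_stream>>>(
                d_chunk, d_out + c * chunk_bytes, chunk_bytes);
        }
        cudaStreamSynchronize(compute_stream);
    }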

Advanced#

  • For compression/decompression tasks that do not fill the device entirely, consider making the problem more fine-grained, i.e., use smaller chunks.

  • Use the NVIDIA Nsight Compute CUDA Occupancy Calculator [7] to understand bottlenecks in your fused kernel (e.g., register pressure, shared memory pressure, etc.).

  • Use the cudaOccupancyMaxActiveBlocksPerMultiprocessor [9] function to determine optimal launch parameters, i.e., avoid multiple waves of computation if possible (see the occupancy sketch after this list).

  • Add __launch_bounds__ to your CUDA kernel signature if the thread block size is known at compile time to enable more fine-grained compiler optimizations (a sketch combining this with __restrict__ follows this list).

  • Add the __restrict__ keyword to your non-aliasing pointer arguments in the fused CUDA kernel signature.

  • Check whether CUDA Programmatic Dependent Launch [8] (PDL) is applicable in your application (a hedged sketch follows this list).

  • Compile your CUDA kernel(s) with the nvcc argument --ptxas-options=-v and eliminate spilled loads and stores.

  • Profile your application using Nsight Systems [10] to identify system-level bottlenecks.
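
A minimal sketch of sizing a launch to a single wave via the occupancy API; fused_kernel is a hypothetical placeholder for your own fused (de)compression kernel.

    #include <cuda_runtime.h>

    __global__ void fused_kernel() { /* hypothetical fused (de)compression kernel */ }

    void launch_single_wave(int block_size, size_t dynamic_smem)
    {
        int device = 0, num_sms = 0, blocks_per_sm = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, fused_kernel,
                                                      block_size, dynamic_smem);

        // Launching at most one full wave avoids a partially filled tail wave.
        int grid_size = blocks_per_sm * num_sms;
        fused_kernel<<<grid_size, block_size, dynamic_smem>>>();
    }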
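
A minimal sketch combining __launch_bounds__ with a compile-time block size and __restrict__ on non-aliasing pointer arguments; the kernel body is an illustrative stand-in for real post-processing.

    constexpr int kBlockSize = 256;   // thread block size known at compile time

    __global__ void __launch_bounds__(kBlockSize)
    postprocess(const float* __restrict__ in, float* __restrict__ out, int n)
    {
        // __restrict__ tells the compiler that in/out do not alias;
        // __launch_bounds__ lets it budget registers for exactly kBlockSize
        // threads per block.
        int i = blockIdx.x * kBlockSize + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * 0.5f;   // stand-in for real post-processing
        }
    }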
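
Finally, a hedged sketch of Programmatic Dependent Launch, assuming CUDA 11.8+ and a Hopper-or-newer GPU: the consumer is launched with programmatic stream serialization and calls cudaGridDependencySynchronize() before reading the producer's output, which lets its prologue overlap with the tail of the producer. Both kernels here are illustrative placeholders.

    #include <cuda_runtime.h>

    __global__ void producer(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { data[i] = 1.0f; }
        // Optionally signal early that dependent launches may begin.
        cudaTriggerProgrammaticLaunchCompletion();
    }

    __global__ void consumer(float* data, int n)
    {
        // Prologue work independent of the producer could run here.
        cudaGridDependencySynchronize();   // wait for the producer's output
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { data[i] += 1.0f; }
    }

    void launch_with_pdl(float* d_data, int n, cudaStream_t stream)
    {
        const int block = 256, grid = (n + block - 1) / block;
        producer<<<grid, block, 0, stream>>>(d_data, n);

        cudaLaunchAttribute attr[1];
        attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
        attr[0].val.programmaticStreamSerializationAllowed = 1;

        cudaLaunchConfig_t config = {};
        config.gridDim  = grid;
        config.blockDim = block;
        config.stream   = stream;
        config.attrs    = attr;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, consumer, d_data, n);
    }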

Warning

Due to internal optimizations, it is not guaranteed that repeated execution of the exact same compressor description results in bit-identical results. However, repeated executions of the exact same decompressor always produce bit-identical results.

Further Reading#

References#