Decompression Engine#

The NVIDIA Blackwell architecture introduces a dedicated hardware Decompression Engine (DE) capable of decompressing data up to 600 GB/s. It offloads the Central Processing Unit (CPU) from decompression tasks, improving performance for data-intensive workloads such as AI training, HPC simulations, and analytics pipelines.

The DE is a fixed-function hardware block designed to accelerate Snappy, LZ4, Deflate, and GZip decompression. By offloading this task, the DE frees up Streaming Multiprocessor (SM) resources for computations, reducing cycles spent on data movement.

Integrated with the copy engine, the DE eliminates the need for host-to-device transfers followed by software decompression. Instead, compressed data can be directly sent via Peripheral Component Interconnect Express (PCIe) or Chip-to-Chip (C2C) and decompressed in transit, reducing I/O bottlenecks.

Beyond throughput, the DE enables concurrent data movement and computation. Multi-stream workloads can decompress data in parallel with SM kernel execution, ensuring full GPU utilization. This allows data-intensive applications like LLM training, genomics analysis, or HPC simulations to keep pace with Blackwell GPU bandwidth without I/O stalls.

A recently published technical blog describing DE , and showcasing performance can be found here.

Below we list some frequently asked questions (FAQs) about the usage of DE.

What do I need to change in my existing code to be able to use DE with nvCOMP?#

nvCOMP offers users flexibility in decompression backend selection (default, DE, or CUDA) through the backend parameter within nvcompBatched<Format>DecompressOpts_t in the nvcompBatched<Format>DecompressAsync API.

In the default setting, nvCOMP prioritizes the DE for decompression. If the DE is unavailable or if the provided pointers are non-compliant with the DE requirements, it automatically and safely falls back to the CUDA implementation. It is also necessary for the user to provide a valid pointer for the decompressed size (device_uncompressed_chunk_bytes), failing which the decompression will fall back to the CUDA implementation.

Memory allocations for compressed chunks must also be DE compatible. Please refer to this FAQ item for further details on DE compatible memory requirements.

I use cudaMalloc to allocate device memory, will it work with DE?#

Yes, device-to-device decompression with nvCOMP can utilize standard cudaMalloc (CUDA Runtime API) allocations. If you are a user of the CUDA Driver API, analogously, cuMemAlloc allocations also work. For host-to-device or host-to-host decompression, specialized memory allocation is required for optimal performance using hardware decompression:

Use cudaMallocFromPoolAsync() with the cudaMemPoolCreateUsageHwDecompress flag.
Alternatively, use cuMemCreate() with the CU_MEM_CREATE_USAGE_HW_DECOMPRESS flag.

If the buffer requirements for hardware decompression are not met, nvCOMP will automatically fall back to executing the decompression on the SM. To check if a memory buffer is DE compatiable, you can execute the following code

#include <cuda.h>
#define CU_CHECK(c) do { \
    CUresult err = (c); \
    if (err != CUDA_SUCCESS) { \
        const char* err_str; \
        cuGetErrorString(err, &err_str); \
        throw std::runtime_error(std::string("CUDA Driver API error: '") + std::to_string(err) + "': " + err_str); \
    } \
} while(0)

template<typename T>
bool is_ptr_de_decompress_capable(T* ptr) {
    bool test;
    try {
        CU_CHECK(cuPointerGetAttribute(&test, CU_POINTER_ATTRIBUTE_IS_HW_DECOMPRESS_CAPABLE, (CUdeviceptr)ptr));
    } catch(std::exception& e) {
        std::cerr << e.what() << std::endl;
        test = false;
    }
    return test;
}

Does DE work with host memory?#

Yes, DE supports host memory, provided it has been allocated using cudaMallocFromPoolAsync or cuMemCreate with the necessary flags. Refer to this FAQ item for more details.

What compression formats are supported by DE?#

As of the latest CUDA version, decompression for the following compression formats is supported on DE

Snappy
Deflate (along with GZip)
LZ4

I have a blackwell GPU. Do I have DE?#

DE is currently supported on the B200, B300, GB200, and GB300 GPUs. nvCOMP is designed to leverage DE automatically for decompression. If the user selects the default decompression backend within the nvcompBatched<Format>DecompressOpts_t parameter when calling the nvcompBatched<Format>DecompressAsync API, nvCOMP will detect the compatible GPU and launch the decompression on DE.

Is there an ideal chunk size to get best performance out of the DE?#

Selecting the optimal chunk size is crucial for performance and is dependent on the data being processed. A chunk size that is too small leads to excessive driver overhead, while a size that is too large can result in performance degradation due to “tail effects.” The best performance is achieved by finding a balance between these two extremes.

Currently, the maximum allowed chunk size on a B200 GPU is 4 MiB, although this limit may be subject to change in future generation GPUs. This maximum size can be programmatically determined using the cuDeviceGetAttribute driver function, specifically with the CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH attribute.

My chunks sizes in the same batch are heavily varied. How can I get better decompression throughput on the DE?#

To enhance performance and achieve better load balancing on the DE, nvCOMP offers an option to sort the compressed data chunks. This feature is enabled by setting the sort_before_hw_decompress field to 1 within the nvcompBatched<Format>DecompressOpts_t structure when invoking the decompression API.

What are the likely causes of an error after invoking the nvCOMP decompression API with DE enabled?#

Errors can originate from the DE driver API launch itself or during the decompression process. Please verify the following:

Compression Standard Compliance: Ensure the compressed buffer adheres to the compression standard, as DE is fault-intolerant.
DE Compliance: The compressed buffer, uncompressed buffer, and the uncompressed buffer chunk size host/device memory must be DE compliant (See FAQ item).
Maximum chunk size: Ensure that the maximum chunk size is within the limits supported by the GPU (See this FAQ item).
Providing a buffer with correct compressed data, but incorrect compressed size (e.g., larger size, due to the buffer holding the data is larger) is also considered non-compliant.

For detailed error information, check the nvCOMP error logs. These logs record errors and API calls and can be enabled by setting the NVCOMP_LOG_LEVEL environment variable (See nvCOMP logging for more details).