Compressing a Buffer Using nvCOMPDx#
In this introduction, we will use the nvCOMPDx library to compress a buffer using the LZ4 algorithm. Throughout this guide, we’ll use the terms device and GPU interchangeably to refer to an NVIDIA GPU.
The provided code snippets are simplified excerpts of the Introduction Example. Refer to the Examples
section for further nvCOMPDx samples.
Note
The example presented here is deliberately simple to demonstrate the basic usage of nvCOMPDx and is not optimized for performance. For performance optimization, see the Achieving High Performance page.
Defining a Function Descriptor#
NVIDIA MathDx libraries follow the convention of using a list of operators
to create a complete description of the problem to solve.
Therefore, the first step is to define a function descriptor by adding together nvCOMPDx operators that capture the properties of the function.
The correctness of this description is evaluated at compile time.
A well-defined nvCOMPDx routine description must include the following components:
Algorithm “direction”, i.e., whether we are compressing or decompressing.
A selected compression or decompression algorithm.
The natural data type of the uncompressed data.
For compression: Maximum uncompressed chunk size.
To get a descriptor for an LZ4 compression routine operating on 8-bit data with a maximum uncompressed chunk size of 64KiB, we write the following lines:
#include <nvcompdx.hpp>
using namespace nvcompdx;
using COMP = decltype(Algorithm<algorithm::lz4>() +
                      DataType<datatype::uint8>() +
                      Direction<direction::compress>() +
                      MaxUncompChunkSize<65536>());
To encode the operation properties, nvCOMPDx provides the operators Algorithm, DataType, Direction, and MaxUncompChunkSize, which can be combined using ordinary addition (+).
To obtain a fully usable operation that executes the function in CUDA kernels, at least two additional pieces of information are needed:
First, the SM operator indicates the targeted CUDA architecture on which we want to run the function. In this example, we are targeting A100 GPUs (SM<800>()).
Second, either the Warp or the Block operator indicates whether the routine will be performed cooperatively by a CUDA warp or by an entire thread block, respectively. In the latter case, the block shape must be further configured through the BlockDim or BlockWarp operator; a block-level variant is sketched after the warp-level descriptor below.
At this point, nvCOMPDx performs additional verifications to ensure that the provided description is valid and that it can be executed on the requested architecture. Issues related to unsupported data types, unsupported maximum uncompressed chunk sizes, or excessive shared memory requirements will result in a compilation error. The compatibility matrices are presented under the operator descriptions here and here.
#include <nvcompdx.hpp>
using namespace nvcompdx;
using COMP = decltype(Algorithm<algorithm::lz4>() +
                      DataType<datatype::uint8>() +
                      Direction<direction::compress>() +
                      MaxUncompChunkSize<65536>() +
                      Warp() +
                      SM<800>());
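For comparison, a block-level variant of the same descriptor might look as follows. This is only a sketch: the exact BlockDim template arguments used here (a one-dimensional block of 128 threads) are an assumption for illustration; consult the Execution Operators page for the supported forms.
// Sketch only: a block-level variant of the descriptor above.
// The BlockDim<128, 1, 1> arguments are assumed for illustration;
// see the Execution Operators page for the supported forms.
using COMP_BLOCK = decltype(Algorithm<algorithm::lz4>() +
                            DataType<datatype::uint8>() +
                            Direction<direction::compress>() +
                            MaxUncompChunkSize<65536>() +
                            Block() +
                            BlockDim<128, 1, 1>() +
                            SM<800>());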
Executing the Function#
The class COMP, which describes the function, can be instantiated into an object.
Creating the object has no computational cost, and the object should be treated as a handle.
The function descriptor object provides execution functions that can perform the requested function.
__global__ void comp_warp_kernel(const void * const uncomp_chunk,
                                 const size_t uncomp_chunk_size,
                                 void * comp_chunk,
                                 size_t * comp_chunk_size,
                                 uint8_t * tmp_buffer) {
  auto compressor = COMP();

  // Dynamic shared memory scratch buffer, aligned as required by the compressor
  constexpr size_t shmem_alignment = compressor.shmem_alignment();
  extern __shared__ __align__(shmem_alignment) uint8_t shared_comp_scratch_buffer[];

  compressor.execute(
    uncomp_chunk,
    comp_chunk,
    uncomp_chunk_size,
    comp_chunk_size,
    shared_comp_scratch_buffer,
    tmp_buffer);

  // At this point, the compressed chunk resides in the buffer pointed to by
  // `comp_chunk`. The uncompressed input data `uncomp_chunk` is
  // left untouched.
}
Besides forwarding its parameters to the execution function of the instantiated COMP object, the above kernel also sets aside an appropriately sized and aligned shared-memory scratch buffer for the compressor. The alignment can be queried and specified within the kernel through Other Methods. The size of the dynamic shared memory buffer must be specified at kernel launch time as part of the kernel’s launch configuration.
Launching the Kernel#
To prepare for launching the above comp_warp_kernel, we need to ensure that the input data chunk has been transferred to GPU-accessible memory, and allocate an output-chunk buffer, a buffer for the output chunk size, and lastly a temporary-memory buffer to be used by the compression algorithm. We also need to determine the appropriate kernel launch configuration, i.e., the grid and block sizes along with the amount of dynamic shared memory used per block.
The alignments of the input, output, and temporary memory buffers, the size of the temporary memory buffer and the worst-case size of the output buffer, as well as the amount of dynamic shared memory used per block are encoded in the routine descriptor (see Other Methods). In this example, we can query these parameters through member functions of COMP.
The grid and block sizes depend on the number of chunks and the presence of either a Warp or a Block operator summand in the descriptor type. If Warp was specified, the block size must be a multiple of the warp size (32 threads), and the grid size must be such that the total number of warps executed by the grid is greater than or equal to the number of chunks. If Block was specified, there is more flexibility in the grid and block shapes. See Execution Operators for more details.
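As a concrete illustration of the Warp-level rule above, a launch configuration for a hypothetical multi-chunk case could be derived as follows; num_chunks and warps_per_block are names assumed only for this sketch.
// Sketch only: a Warp-level launch configuration for a hypothetical
// multi-chunk case. Each warp processes one chunk, so the grid must
// provide at least `num_chunks` warps in total.
constexpr unsigned int warps_per_block = 4;            // assumed for this sketch
const unsigned int block_size = warps_per_block * 32;  // a multiple of the warp size
const unsigned int block_count = static_cast<unsigned int>(
    (num_chunks + warps_per_block - 1) / warps_per_block); // ceil(num_chunks / warps_per_block)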
Other sizes, alignments, and values are determined by the input data and the number of chunks in a straightforward manner. The entire procedure is shown in the snippet below. For simplicity, the input data is compressed as a single chunk.
int lz4_gpu_comp_introduction(const std::vector<char>& data,
                              std::vector<char>& compressed)
{
  size_t total_bytes = data.size();

  // We are going to compress the input data as is,
  // in a single chunk
  size_t num_chunks = 1;

  // Allocate buffer for the input (uncompressed) data
  // Note: with cudaMalloc() the input alignment is implicitly met
  void* d_input_data;
  cudaMalloc(&d_input_data, total_bytes);
  cudaMemcpy(d_input_data, data.data(), total_bytes, cudaMemcpyHostToDevice);

  // Allocate buffer for the input/output sizes
  size_t* d_output_size;
  cudaMalloc(&d_output_size, sizeof(size_t));

  // Allocate temporary scratch space
  // Note: with cudaMalloc() the global temporary memory alignment is implicitly met
  uint8_t* d_comp_temp;
  cudaMalloc(&d_comp_temp, COMP().tmp_size_total(num_chunks));

  // Calculate the maximum compressed size, i.e., the worst-case
  // size of the output buffer
  const size_t max_comp_chunk_size = COMP().max_comp_chunk_size();

  // Allocate buffer for the output (compressed) data
  // Note: with cudaMalloc() the output alignment is implicitly met
  void* d_output_data;
  cudaMalloc(&d_output_data, max_comp_chunk_size);

  // Compression parameters
  // We are compressing 1 chunk per thread block
  const unsigned int block_size = 32; // 1 warp
  const unsigned int block_count = static_cast<unsigned int>(num_chunks); // 1 chunk
  const auto comp_shared_memory = COMP().shmem_size_group();

  comp_warp_kernel<<<block_count, block_size, comp_shared_memory>>>(
    d_input_data,
    total_bytes,
    d_output_data,
    d_output_size,
    d_comp_temp
  );

  // Copy back the resulting compressed size
  size_t comp_bytes;
  cudaMemcpy(&comp_bytes, d_output_size, sizeof(size_t), cudaMemcpyDeviceToHost);

  // Copy data back to host for write out
  compressed.resize(comp_bytes);
  cudaMemcpy(compressed.data(), d_output_data, comp_bytes, cudaMemcpyDeviceToHost);

  cudaFree(d_input_data);
  cudaFree(d_output_size);
  cudaFree(d_comp_temp);
  cudaFree(d_output_data);

  return 0;
}
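A minimal host-side usage of the function above could look like the sketch below; the sample input and the <cstdio> reporting are assumptions made purely for illustration.
// Sketch only: compress a small in-memory buffer with the function
// defined above and report the resulting sizes.
#include <cstdio>
#include <vector>

int main() {
  std::vector<char> input(1 << 16, 'a'); // 64 KiB of repetitive sample data
  std::vector<char> output;
  if (lz4_gpu_comp_introduction(input, output) != 0) {
    return 1;
  }
  std::printf("Compressed %zu bytes into %zu bytes\n", input.size(), output.size());
  return 0;
}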
Decompression#
For completeness, this section outlines a decompression operation that could be used to decompress the buffer produced in the previous section. The procedure is largely analogous to that for compression, with a few differences:
The routine descriptor must contain a Direction<direction::decompress> operator summand rather than a Direction<direction::compress> one.
If the decompressed chunk sizes are not known in advance, they need to be computed through a separate invocation of the decompression routine with a null pointer passed for the output buffer, as sketched after this list.
In identifier names, “input” now refers to the compressed data and “output” refers to the uncompressed data.
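A minimal sketch of such a size-only invocation is shown below, following the null-output-pointer convention described above; the variable names mirror those of the decompression kernel further down.
// Sketch only: querying the decompressed chunk size without producing
// output data, by passing a null output pointer as described above.
auto decompressor = DECOMP();
decompressor.execute(
  comp_chunk,
  nullptr,                      // no output buffer: only the size is computed
  comp_chunk_size,
  uncomp_chunk_size,            // receives the decompressed chunk size
  shared_decomp_scratch_buffer,
  tmp_buffer);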
The full decompression procedure, with the maximum decompressed chunk size known in advance, is illustrated in the snippet below.
#include <nvcompdx.hpp>
using namespace nvcompdx;
// For simplicity, let's assume the maximum uncompressed
// chunk size is known prior to decompression.
constexpr size_t max_uncomp_chunk_size = 65536;
using DECOMP = decltype(Algorithm<algorithm::lz4>() +
                        DataType<datatype::uint8>() +
                        Direction<direction::decompress>() +
                        Warp() +
                        SM<800>());
__global__ void decomp_warp_kernel(const void * const comp_chunk,
                                   const size_t comp_chunk_size,
                                   void * uncomp_chunk,
                                   size_t * uncomp_chunk_size,
                                   uint8_t * tmp_buffer) {
  auto decompressor = DECOMP();

  // Dynamic shared memory scratch buffer, aligned as required by the decompressor
  constexpr size_t shmem_alignment = decompressor.shmem_alignment();
  extern __shared__ __align__(shmem_alignment) uint8_t shared_decomp_scratch_buffer[];

  decompressor.execute(
    comp_chunk,
    uncomp_chunk,
    comp_chunk_size,
    uncomp_chunk_size,
    shared_decomp_scratch_buffer,
    tmp_buffer);

  // At this point, the uncompressed chunk resides in the buffer pointed to by
  // `uncomp_chunk`. The compressed input data `comp_chunk` is
  // left untouched.
}
int lz4_gpu_decomp_introduction(const std::vector<char>& compressed,
                                std::vector<char>& uncompressed)
{
  size_t total_bytes = compressed.size();

  // The input data consists of a single compressed chunk
  size_t num_chunks = 1;

  // Allocate buffer for the input (compressed) data
  // Note: with cudaMalloc() the input alignment is implicitly met
  void* d_input_data;
  cudaMalloc(&d_input_data, total_bytes);
  cudaMemcpy(d_input_data, compressed.data(), total_bytes, cudaMemcpyHostToDevice);

  // Allocate buffer for the input/output sizes
  size_t* d_output_size;
  cudaMalloc(&d_output_size, sizeof(size_t));

  // Allocate temporary scratch space
  // Note: with cudaMalloc() the global temporary memory alignment is implicitly met
  uint8_t* d_decomp_temp;
  cudaMalloc(&d_decomp_temp, DECOMP().tmp_size_total(num_chunks));

  // Allocate buffer for the output (uncompressed) data
  // The size of the output buffer is known in advance in this example
  // Note: with cudaMalloc() the output alignment is implicitly met
  void* d_output_data;
  cudaMalloc(&d_output_data, max_uncomp_chunk_size);

  // Decompression parameters
  // We are decompressing 1 chunk per thread block
  const unsigned int block_size = 32; // 1 warp
  const unsigned int block_count = static_cast<unsigned int>(num_chunks); // 1 chunk
  const auto decomp_shared_memory = DECOMP().shmem_size_group();

  decomp_warp_kernel<<<block_count, block_size, decomp_shared_memory>>>(
    d_input_data,
    total_bytes,
    d_output_data,
    d_output_size,
    d_decomp_temp
  );

  // Copy back the resulting uncompressed size
  size_t uncomp_chunk_size;
  cudaMemcpy(&uncomp_chunk_size,
             d_output_size,
             sizeof(size_t),
             cudaMemcpyDeviceToHost);

  // Copy data back to host for write out
  uncompressed.resize(uncomp_chunk_size);
  cudaMemcpy(uncompressed.data(),
             d_output_data,
             uncomp_chunk_size,
             cudaMemcpyDeviceToHost);

  cudaFree(d_input_data);
  cudaFree(d_output_size);
  cudaFree(d_decomp_temp);
  cudaFree(d_output_data);

  return 0;
}
Note
The full Introduction Example does not include a decompression part, but many other examples do.
Compilation#
For instructions on how to compile programs with nvCOMPDx, see Installation Guide.