Low-level C Quick Start Guide#

Some applications need to compress or decompress many small inputs, so nvCOMP provides an additional API to do this efficiently. These API calls combine all compression/decompression work into a single execution, greatly improving performance compared with processing each input individually. The API relies on the user to split the data into chunks, and to manage metadata such as compressed and uncompressed chunk sizes. For best performance, chunks should be roughly equal in size, to achieve good load balancing and extract sufficient parallelism. Even when there are multiple inputs to compress, it may therefore still be best to break each one into smaller chunks.
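The chunking itself is plain host-side bookkeeping. The helper below is not part of nvCOMP, just a minimal sketch of how per-chunk sizes can be derived for a fixed chunk size, with the last chunk holding the remainder:

```c
#include <stddef.h>

/* Split total_bytes into chunks of at most chunk_size bytes each; the last
 * chunk holds the remainder. Fills sizes[] and returns the number of chunks,
 * or 0 if more than max_chunks would be needed. */
size_t split_into_chunks(size_t total_bytes, size_t chunk_size,
                         size_t* sizes, size_t max_chunks)
{
    size_t num_chunks = (total_bytes + chunk_size - 1) / chunk_size;
    if (num_chunks > max_chunks)
        return 0;
    for (size_t i = 0; i < num_chunks; ++i) {
        /* All chunks are chunk_size bytes except possibly the last one. */
        size_t remaining = total_bytes - i * chunk_size;
        sizes[i] = remaining < chunk_size ? remaining : chunk_size;
    }
    return num_chunks;
}
```

For example, splitting a 1,000,000-byte input with a 64 KiB chunk size yields 16 chunks: fifteen of 65,536 bytes and a final chunk of 16,960 bytes.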

The low-level batched C API provides a set of functions to do batched decompression and compression.

In the following API description, replace <compression_method> with the desired compression algorithm, which can be one of:
  • ans

  • bitcomp

  • cascaded

  • deflate

  • gdeflate

  • gzip (only for decompression)

  • lz4

  • snappy

  • zstd

For example, for LZ4, nvcompBatched<compression_method>CompressAsync becomes nvcompBatchedLZ4CompressAsync and nvcompBatched<compression_method>DecompressAsync becomes nvcompBatchedLZ4DecompressAsync.

Some compressors have alignment requirements (up to 8 bytes) on the input, output, and/or scratch buffers that the user provides. See the documentation in the appropriate header in include/ for details on the alignment requirements of any particular API.

Compression API#

To do batched compression, a temporary workspace must be allocated in device memory. The size of this workspace is computed using:

/**
* @brief Get the amount of temporary memory required on the GPU for compression.
*
* @param[in] num_chunks The number of chunks of memory in the batch.
* @param[in] max_uncompressed_chunk_bytes The maximum size of a chunk in the
* batch.
* @param[in] format_opts Compression options.
* @param[out] temp_bytes The amount of GPU memory that will be temporarily
* required during compression.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>CompressGetTempSize(
    size_t num_chunks,
    size_t max_uncompressed_chunk_bytes,
    nvcompBatched<compression_method>Opts_t format_opts,
    size_t * temp_bytes);

Then compression is done using:

/**
* @brief Perform batched asynchronous compression.
*
* The caller is responsible for passing device_compressed_chunk_bytes of size
* sufficient to hold the compressed data.
*
* @param[in] device_uncompressed_chunk_ptrs Array with size \p num_chunks of pointers
* to the uncompressed data chunks. Both the pointers and the uncompressed data
* should reside in device-accessible memory.
* @param[in] device_uncompressed_chunk_bytes Array with size \p num_chunks of
* sizes of the uncompressed chunks in bytes.
* The sizes should reside in device-accessible memory.
* @param[in] max_uncompressed_chunk_bytes The size of the largest uncompressed chunk.
* @param[in] num_chunks Number of chunks of data to compress.
* @param[in] device_temp_ptr The temporary GPU workspace, could be NULL in case
* temporary memory is not needed.
* @param[in] temp_bytes The size of the temporary GPU memory pointed to by
* `device_temp_ptr`.
* @param[out] device_compressed_chunk_ptrs Array with size \p num_chunks of pointers
* to the output compressed buffers. Both the pointers and the compressed
* buffers should reside in device-accessible memory. Each compressed buffer
* should be preallocated with the size given by
* `nvcompBatched<compression_method>CompressGetMaxOutputChunkSize`.
* @param[out] device_compressed_chunk_bytes Array with size \p num_chunks,
* to be filled with the compressed sizes of each chunk.
* The buffer should be preallocated in device-accessible memory.
* @param[in] format_opts Compression options.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successfully launched, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>CompressAsync(
    const void* const* device_uncompressed_chunk_ptrs,
    const size_t* device_uncompressed_chunk_bytes,
    size_t max_uncompressed_chunk_bytes,
    size_t num_chunks,
    void* device_temp_ptr,
    size_t temp_bytes,
    void* const* device_compressed_chunk_ptrs,
    size_t* device_compressed_chunk_bytes,
    nvcompBatched<compression_method>Opts_t format_opts,
    cudaStream_t stream);
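Putting the two calls together, a compression pass might look like the following sketch (shown for LZ4 with default options; allocation of the chunk-pointer and chunk-size arrays, and all error checking, are omitted for brevity):

```c
// Sketch, not a complete program: assumes num_chunks,
// max_uncompressed_chunk_bytes, the device chunk-pointer/size arrays, and
// a CUDA stream have already been set up as described above.
size_t temp_bytes = 0;
nvcompBatchedLZ4CompressGetTempSize(
    num_chunks, max_uncompressed_chunk_bytes,
    nvcompBatchedLZ4DefaultOpts, &temp_bytes);

void* device_temp_ptr;
cudaMalloc(&device_temp_ptr, temp_bytes);

// Each output buffer must be able to hold the worst-case compressed size.
size_t max_compressed_bytes = 0;
nvcompBatchedLZ4CompressGetMaxOutputChunkSize(
    max_uncompressed_chunk_bytes, nvcompBatchedLZ4DefaultOpts,
    &max_compressed_bytes);
// ... allocate num_chunks output buffers of max_compressed_bytes each and
// fill device_compressed_chunk_ptrs with their addresses ...

nvcompBatchedLZ4CompressAsync(
    device_uncompressed_chunk_ptrs,
    device_uncompressed_chunk_bytes,
    max_uncompressed_chunk_bytes,
    num_chunks,
    device_temp_ptr,
    temp_bytes,
    device_compressed_chunk_ptrs,
    device_compressed_chunk_bytes,   // filled with actual compressed sizes
    nvcompBatchedLZ4DefaultOpts,
    stream);
cudaStreamSynchronize(stream);
```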

Decompression API#

Decompression also requires a temporary workspace. This is computed using:

/**
* @brief Get the amount of temporary memory required on the GPU for decompression.
*
* @param[in] num_chunks Number of chunks of data to be decompressed.
* @param[in] max_uncompressed_chunk_bytes The size of the largest chunk in bytes
* when uncompressed.
* @param[out] temp_bytes The amount of GPU memory that will be temporarily required
* during decompression.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>DecompressGetTempSize(
    size_t num_chunks,
    size_t max_uncompressed_chunk_bytes,
    size_t * temp_bytes);

During decompression, device memory buffers large enough to hold the decompression results must be provided. Three workflows are supported:

  1. Uncompressed size for each buffer is known exactly (e.g. Apache Parquet, apache/parquet-format)

  2. Only maximum uncompressed size across all buffers is known (e.g. Apache ORC)

  3. No information about the uncompressed sizes is provided (e.g. Apache Avro)

For case 3), nvCOMP provides an API for pre-processing the compressed file to determine the proper sizes for the decompressed output buffers. This API is as follows:

/**
* @brief Asynchronously compute the number of bytes of uncompressed data for
* each compressed chunk.
*
* This is needed when we do not know the expected output size.
* NOTE: If the stream is corrupt, the sizes will be garbage.
*
* @param[in] device_compressed_chunk_ptrs Array with size \p num_chunks of
* pointers in device-accessible memory to compressed buffers.
* @param[in] device_compressed_chunk_bytes Array with size \p num_chunks of sizes
* of the compressed buffers in bytes. The sizes should reside in device-accessible memory.
* @param[out] device_uncompressed_chunk_bytes Array with size \p num_chunks
* to be filled with the sizes, in bytes, of each uncompressed data chunk.
* This argument needs to be preallocated in device-accessible memory.
* @param[in] num_chunks Number of data chunks to compute sizes of.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successful, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>GetDecompressSizeAsync(
    const void* const* device_compressed_chunk_ptrs,
    const size_t* device_compressed_chunk_bytes,
    size_t* device_uncompressed_chunk_bytes,
    size_t num_chunks,
    cudaStream_t stream);
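A minimal sketch of this pre-pass (shown for LZ4; error checking omitted). The computed sizes typically need to be copied back to the host so that the output buffers can be allocated:

```c
// Sketch: workflow 3) -- query the uncompressed sizes before allocating
// output buffers. Assumes the compressed chunk arrays and stream exist.
size_t* device_uncompressed_chunk_bytes;
cudaMalloc((void**)&device_uncompressed_chunk_bytes,
           num_chunks * sizeof(size_t));

nvcompBatchedLZ4GetDecompressSizeAsync(
    device_compressed_chunk_ptrs,
    device_compressed_chunk_bytes,
    device_uncompressed_chunk_bytes,
    num_chunks,
    stream);

// Copy the sizes to the host so output buffers can be allocated.
size_t* host_sizes = (size_t*)malloc(num_chunks * sizeof(size_t));
cudaMemcpyAsync(host_sizes, device_uncompressed_chunk_bytes,
                num_chunks * sizeof(size_t),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
// ... cudaMalloc one output buffer of host_sizes[i] bytes per chunk ...
```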

With the decompressed sizes known, we can now use the decompression API:

/**
* @brief Perform batched asynchronous decompression.
*
* @param[in] device_compressed_chunk_ptrs Array with size \p num_chunks of pointers
* in device-accessible memory to compressed buffers. Each compressed buffer
* should reside in device-accessible memory.
* @param[in] device_compressed_chunk_bytes Array with size \p num_chunks of sizes of
* the compressed buffers in bytes. The sizes should reside in device-accessible memory.
* @param[in] device_uncompressed_buffer_bytes Array with size \p num_chunks of sizes,
* in bytes, of the output buffers to be filled with uncompressed data for each chunk.
* The sizes should reside in device-accessible memory. If a
* size is not large enough to hold all decompressed data, the decompressor
* will set the status in \p device_statuses corresponding to the
* overflow chunk to `nvcompErrorCannotDecompress`.
* @param[out] device_uncompressed_chunk_bytes Array with size \p num_chunks to
* be filled with the actual number of bytes decompressed for every chunk.
* This argument needs to be preallocated, but can be nullptr if desired,
* in which case the actual sizes are not reported.
* @param[in] num_chunks Number of chunks of data to decompress.
* @param[in] device_temp_ptr The temporary GPU space.
* @param[in] temp_bytes The size of the temporary GPU space.
* @param[out] device_uncompressed_chunk_ptrs Array with size \p num_chunks of
* pointers in device-accessible memory to decompressed data. Each uncompressed
* buffer needs to be preallocated in device-accessible memory and have the size
* specified by the corresponding entry in device_uncompressed_buffer_bytes.
* @param[out] device_statuses Array with size \p num_chunks of statuses in
* device-accessible memory. This argument needs to be preallocated. For each
* chunk, if the decompression is successful, the status will be set to
* `nvcompSuccess`. If the decompression is not successful, for example due to
* corrupted input or out-of-bounds errors, the status will be set to
* `nvcompErrorCannotDecompress`.
* Can be nullptr if desired, in which case error status is not reported.
* @param[in] stream The CUDA stream to operate on.
*
* @return nvcompSuccess if successfully launched, and an error code otherwise.
*/
nvcompStatus_t nvcompBatched<compression_method>DecompressAsync(
    const void* const* device_compressed_chunk_ptrs,
    const size_t* device_compressed_chunk_bytes,
    const size_t* device_uncompressed_buffer_bytes,
    size_t* device_uncompressed_chunk_bytes,
    size_t num_chunks,
    void* const device_temp_ptr,
    size_t temp_bytes,
    void* const* device_uncompressed_chunk_ptrs,
    nvcompStatus_t* device_statuses,
    cudaStream_t stream);
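A sketch of the decompression call for LZ4, assuming the output buffers and the device arrays above have already been prepared (error checking omitted):

```c
// Sketch, not a complete program: decompress a batch with per-chunk
// status reporting. Assumes the compressed chunk arrays, the output
// buffers/sizes, num_chunks, and the stream have been set up as above.
size_t temp_bytes = 0;
nvcompBatchedLZ4DecompressGetTempSize(
    num_chunks, max_uncompressed_chunk_bytes, &temp_bytes);

void* device_temp_ptr;
cudaMalloc(&device_temp_ptr, temp_bytes);

nvcompStatus_t* device_statuses;
cudaMalloc((void**)&device_statuses, num_chunks * sizeof(nvcompStatus_t));

nvcompBatchedLZ4DecompressAsync(
    device_compressed_chunk_ptrs,
    device_compressed_chunk_bytes,
    device_uncompressed_buffer_bytes,  // capacities of the output buffers
    device_uncompressed_chunk_bytes,   // actual sizes out; may be nullptr
    num_chunks,
    device_temp_ptr,
    temp_bytes,
    device_uncompressed_chunk_ptrs,
    device_statuses,                   // per-chunk status; may be nullptr
    stream);
cudaStreamSynchronize(stream);
// Check each entry of device_statuses for nvcompSuccess before using
// the corresponding output buffer.
```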

Note that device_uncompressed_chunk_bytes and device_statuses can both be specified as nullptr for LZ4, Snappy, and GDeflate. If they are nullptr, these values will not be computed. In particular, if device_statuses is nullptr, then out-of-bounds (OOB) error checking is disabled, which can lead to a significant increase in decompression throughput.

Batched Compression / Decompression Example - LZ4#

For an example of batched compression and decompression using LZ4, see examples/low_level_quickstart_example.cpp.