Execution Method#

The execution method is used to run the compressor and decompressor functionality as defined by the user with nvCOMPDx operators.

nvCOMPDx supports execution at the CUDA warp (warp execution) and thread block (block execution) levels. The API for executing these functionalities is identical.

Warp execution#

The warp execution method is available if the descriptor has been constructed with the Warp operator and its "Is it a complete execution description?" trait is true. All threads within a warp must collectively participate in the warp execution method, and no more than one warp of threads may call execute(...) to process any particular chunk of data. Otherwise, the behavior is undefined.
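To make the participation rule concrete, the following sketch assigns one chunk per warp, so that exactly one warp calls execute(...) for each chunk. COMP stands for a warp-operator compressor descriptor whose execute(...) method is described later in this section; the kernel name, the chunk arrays, and per_warp_shmem (the size of each warp's slice of the shared scratch buffer) are illustrative assumptions, and scratch-buffer sizing and alignment are omitted.

// Illustrative sketch: one warp per chunk. COMP is assumed to be a
// warp-operator compressor descriptor; all other identifiers are
// hypothetical.
__global__ void warp_compress_kernel(const void* const* input_chunks,
                                     void* const*       output_chunks,
                                     const size_t*      input_chunk_sizes,
                                     size_t*            output_chunk_sizes,
                                     uint8_t*           global_scratch,
                                     size_t             per_warp_shmem, // assumed per-warp scratch size
                                     size_t             num_chunks)
{
    extern __shared__ uint8_t shared_scratch[];

    // Every thread of the warp participates in execute(...), and no
    // other warp touches this warp's chunk.
    const unsigned warp_in_block = threadIdx.x / 32;
    const size_t   chunk_id      = blockIdx.x * (blockDim.x / 32) + warp_in_block;
    if (chunk_id >= num_chunks) return;

    COMP::execute(input_chunks[chunk_id],
                  output_chunks[chunk_id],
                  input_chunk_sizes[chunk_id],
                  &output_chunk_sizes[chunk_id],
                  shared_scratch + warp_in_block * per_warp_shmem,
                  global_scratch); // per-warp partitioning of the global scratch is also omitted here
}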

Block execution#

The block execution method is available if the descriptor has been constructed with the Block operator and its "Is it a complete execution description?" trait is true. All threads within a block must collectively participate in the block execution method by calling execute(...). Otherwise, the behavior is undefined.

The Execute Method#

execute(...) runs either compression or decompression, as defined by the COMP and DECOMP descriptors respectively. The function arguments are identical regardless of the descriptor's direction.

For compression, input and output refer to the uncompressed input and the compressed output; for decompression, they refer to the compressed input and the decompressed output.

// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);

// Compresses a contiguous buffer of data (one chunk)
//
// - input_chunk[in]        The to-be-compressed chunk
// - output_chunk[out]      The resulting compressed chunk
// - input_chunk_size[in]   The size of the to-be-compressed chunk in bytes
// - output_chunk_size[out] The size of the resulting compressed chunk in bytes
// - shared_mem_buffer[in]  The shared memory scratch buffer to be used internally
// - global_mem_buffer[in]  The global memory scratch buffer to be used internally
__device__ void COMP::execute(const void*  input_chunk,
                              void*        output_chunk,
                              const size_t input_chunk_size,
                              size_t*      output_chunk_size,
                              uint8_t*     shared_mem_buffer,
                              uint8_t*     global_mem_buffer);

// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);

// Decompresses a contiguous buffer of data (one chunk)
//
// - input_chunk[in]        The to-be-decompressed chunk
// - output_chunk[out]      The resulting decompressed chunk
// - input_chunk_size[in]   The size of the compressed chunk in bytes
// - output_chunk_size[out] The size of the resulting decompressed chunk in bytes
// - shared_mem_buffer[in]  The shared memory scratch buffer to be used internally
// - global_mem_buffer[in]  The global memory scratch buffer to be used internally
__device__ void DECOMP::execute(const void*  input_chunk,
                                void*        output_chunk,
                                const size_t input_chunk_size,
                                size_t*      output_chunk_size,
                                uint8_t*     shared_mem_buffer,
                                uint8_t*     global_mem_buffer);

If the output data (output_chunk) is used within the same kernel after the execute(...) call, one must issue a __syncwarp() (for warp-level execution) or __syncthreads() (for block-level execution) call to ensure the data is visible to all participating threads.
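The following sketch illustrates this synchronization rule for block-level execution. COMP is assumed to be a block-operator compressor descriptor; the kernel name and buffer setup are illustrative assumptions.

// Illustrative sketch: block-level compression followed by in-kernel
// use of the output. COMP is an assumed block-operator descriptor.
__global__ void compress_then_use(const void* input_chunk,
                                  void*       output_chunk,
                                  size_t      input_chunk_size,
                                  uint8_t*    global_mem_buffer)
{
    extern __shared__ uint8_t shared_mem_buffer[];
    __shared__ size_t comp_size; // compressed size lands in shared memory

    COMP::execute(input_chunk, output_chunk, input_chunk_size,
                  &comp_size, shared_mem_buffer, global_mem_buffer);

    // Make both output_chunk and comp_size visible to all threads of
    // the block before reading them back.
    __syncthreads();

    const uint8_t* compressed = static_cast<const uint8_t*>(output_chunk);
    // ... every thread may now read compressed[0 .. comp_size) ...
}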

Note that the availability of output_chunk_size is regulated by special rules, depending on the destination (registers, shared memory, or global memory):

  • If the size retrieval occurs in registers (i.e., a thread-local variable), only thread 0 receives the resulting output chunk size. In the case of warp-level execution, thread 0 is the thread with %laneid = 0, while for block-level execution it is the thread with (threadIdx.x, threadIdx.y, threadIdx.z) = (0, 0, 0).

  • If the size retrieval occurs in shared memory, after calling into execute(...), one needs to perform a __syncwarp() or __syncthreads() call to make sure that the data is visible to all threads within the warp or thread block.

  • If the size retrieval occurs in global memory, after calling into execute(...), one needs to perform a __syncwarp() or __syncthreads() call if this information is used later on in the same kernel.
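The register case above deserves special care, because only thread 0 holds a valid value. A common pattern, sketched below under the assumption of a block-operator compressor descriptor COMP (the kernel name and buffers are illustrative), is to broadcast the size through shared memory:

// Illustrative sketch: size retrieved into a register, then broadcast.
__global__ void compress_size_in_register(const void* input_chunk,
                                          void*       output_chunk,
                                          size_t      input_chunk_size,
                                          uint8_t*    global_mem_buffer)
{
    extern __shared__ uint8_t shared_mem_buffer[];
    __shared__ size_t broadcast_size;

    size_t local_size = 0; // thread-local (register) destination
    COMP::execute(input_chunk, output_chunk, input_chunk_size,
                  &local_size, shared_mem_buffer, global_mem_buffer);

    // Only thread (0, 0, 0) holds the real size; publish it to the block.
    if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
        broadcast_size = local_size;
    }
    __syncthreads();

    // broadcast_size is now valid in every thread of the block.
}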

Note

nvCOMPDx does not stipulate a strict location for the input and output chunks, and their sizes. As long as the size and alignment requirements are met, the compression and decompression tasks are performed correctly.

Uncompressed Size Retrieval#

In the case of decompression (i.e., with a DECOMP descriptor), the execute(...) method can also be used to determine the uncompressed chunk size without retrieving the uncompressed data itself.

This is particularly useful when the user wants to decompress a chunk but does not yet know how large a buffer to allocate for the uncompressed output.

If execute(...) is called with nullptr in place of output_chunk, the method only retrieves the uncompressed chunk size.

// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);

// Only retrieve the uncompressed chunk size
DECOMP::execute(input_chunk,
                nullptr,
                input_chunk_size,
                output_chunk_size,
                shared_mem_buffer,
                global_mem_buffer);

The availability of the uncompressed chunk size (output_chunk_size) is governed by the same set of rules as specified in The Execute Method.

Note

The complexity of retrieving a chunk’s original (uncompressed) size varies from algorithm to algorithm. Performance-wise, it is best if the size-retrieval happens during the actual decompression.

That is, one gets the best performance if there is only a single call to execute(...) in the decompression path.
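When the output buffer size is genuinely unknown, the size-query call can still be combined with a host-side allocation into a two-phase pattern. The sketch below assumes a block-operator DECOMP descriptor; the kernel name and buffers are illustrative, and the host-side steps are summarized in comments.

// Illustrative sketch: phase 1 of a two-phase decompression, querying
// only the uncompressed size (output_chunk == nullptr).
__global__ void query_size_kernel(const void* input_chunk,
                                  size_t      input_chunk_size,
                                  size_t*     output_chunk_size,
                                  uint8_t*    global_mem_buffer)
{
    extern __shared__ uint8_t shared_mem_buffer[];
    DECOMP::execute(input_chunk,
                    nullptr,            // size query only
                    input_chunk_size,
                    output_chunk_size,
                    shared_mem_buffer,
                    global_mem_buffer);
}

// Host side (sketch): launch query_size_kernel, copy the size back,
// allocate the output buffer, then launch a second kernel whose
// execute(...) call passes the real output pointer. Prefer a single
// execute(...) call whenever the output size is already known.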