Other Methods#

Apart from the main execute(...) method, nvCOMPDx requires several other methods to set up the compression and decompression on the device. These methods are used to calculate the maximum compressed chunk size, the scratch memory requirements, and the alignment requirements for the input, output, shared, and global scratch buffers.

Maximum Compressed Chunk Size#

For the compression operation, nvCOMPDx provides a method that gives an upper bound for the attainable compressed chunk size (output_chunk_size within execute(...)). The provided maximum may be useful for allocating an output buffer in global memory, or reserving this space in shared memory either statically or dynamically within a thread block. This maximum is calculated at compilation time using a COMP descriptor.

Note, that there is no method that provides an estimate for the inverse operation, i.e., the maximum attainable decompressed chunk size.

Warning

Performing compression or decompression into a buffer that is smaller than required is considered undefined behavior.

// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> +
                      nvcompdx::MaxUncompChunkSize<uncomp_chunk_size> + ...);

// Calculate the maximum compressed size of a single chunk in bytes
__device__ __host__ constexpr size_t COMP::max_comp_chunk_size();

Scratch Memory Requirement Methods#

nvCOMPDx, depending on the selected algorithm, the direction of operation, etc., may require a scratch memory buffer either in shared memory, global memory, or both. The required buffer sizes are calculated during compilation time using the COMP or DECOMP descriptors.

Note, that the methods shmem_size_group and tmp_size_group return buffer sizes required for a single instance of execution level (warp or block), which may differ from the total allocation requirements if multiple chunks are processed simultaneously.

If Warp Operator is used, and multiple chunks are calculated simultaneously in one thread block, one must multiply the output of shmem_size_group by the number of chunks processed in parallel in a thread block. One can also retrieve the shared memory scratch buffer offset using shmem_size_group by multiplying the output with the current warp_id. Analogously, one can retrieve the global memory scratch buffer offset using tmp_size_group by multiplying the output with the global chunk_id.

Note

The values returned by shmem_size_group and tmp_size_group already take into consideration the alignment requirements of scratch buffers for consecutive chunks. This means that users don’t need to factor in shmem_alignment or tmp_alignment during the offset calculation process. For more information, please check the alignment requirements below.

// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);

// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);

// Calculate the amount of shared memory scratch space required for the selected
// execution level, for a single instance (one warp, or one block) in bytes.
__device__ __host__ constexpr size_t COMP::shmem_size_group();
// or
__device__ __host__ constexpr size_t DECOMP::shmem_size_group();

// Calculate the amount of global memory scratch space required for the selected
// execution level, for a single instance (one warp or one block) in bytes.
__device__ __host__ constexpr size_t COMP::tmp_size_group();
// or
__device__ __host__ constexpr size_t DECOMP::tmp_size_group();

The methods shmem_size_group and tmp_size_group retrieve buffer sizes for a single instance of execution level (either one warp or one block). To retrieve the total amount of global scratch buffer required, one must use the tmp_size_total method. Prior to the kernel invocation, the total number of chunks to be processed must be known.

// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);

// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);

// Calculate the total amount of global memory scratch space required
// for the whole kernel execution in bytes.
//
// - num_chunks[in] The total number of chunks processed by the kernel
__host__ constexpr size_t COMP::tmp_size_total(size_t num_chunks);
// or
__host__ constexpr size_t DECOMP::tmp_size_total(size_t num_chunks);

Warning

The scratch buffer allocation requirements may change between nvCOMPDx releases, therefore, please always make sure to query the scratch buffer requirements and allocate the scratch buffers accordingly. Using scratch buffers smaller than their required size is considered undefined behavior.

Alignment Methods#

nvCOMPDx requires its buffers to be aligned to specific byte boundaries. The alignment requirements may be different for different algorithms, and may also depend on the selected execution level (warp or block) and selected data type. The alignments required for the input, output, shared, and global scratch buffers are always going to be a power of 2.

Note

Global memory allocations directly provided by cudaMalloc(...), cudaMallocPitch(...), etc., return buffers with 256-byte alignment, and therefore these allocations automatically satisfy the input, output, and global scratch buffer alignment requirements.

// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);

// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);

// Retrieve the required alignment for the shared memory scratch buffer.
__device__ __host__ constexpr size_t COMP::shmem_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::shmem_alignment();

// Retrieve the required alignment for the global memory scratch buffer.
__device__ __host__ constexpr size_t COMP::tmp_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::tmp_alignment();

// Retrieve the required alignment for the input data. Depending on the
// direction, this can mean either the uncompressed/raw buffer (compressor) or
// the compressed buffer (decompressor).
__device__ __host__ constexpr size_t COMP::input_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::input_alignment();

// Retrieve the required alignment for the output data. Depending on the
// direction, this can mean either the compressed buffer (compressor) or
// the decompressed buffer (decompressor).
__device__ __host__ constexpr size_t COMP::output_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::output_alignment();

Warning

The alignment requirements may change between nvCOMPDx releases, therefore, please always make sure to query the alignment requirements before invoking nvCOMPDx methods. Violating alignment requirements for any of the required buffers is considered undefined behavior.