Other Methods#
Apart from the main execute(...)
method, nvCOMPDx requires several other methods to set up the compression and decompression on the device.
These methods are used to calculate the maximum compressed chunk size, the scratch memory requirements, and the alignment requirements for the input, output, shared, and global scratch buffers.
Maximum Compressed Chunk Size#
For the compression operation, nvCOMPDx provides a method that gives an upper bound for the attainable compressed chunk size (output_chunk_size
within execute(...)
). The provided maximum may be useful for allocating an output buffer in global memory, or reserving this space in shared memory either statically or dynamically within a thread block. This maximum is calculated at compilation time using a COMP
descriptor.
Note, that there is no method that provides an estimate for the inverse operation, i.e., the maximum attainable decompressed chunk size.
Warning
Performing compression or decompression into a buffer that is smaller than required is considered undefined behavior.
// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> +
nvcompdx::MaxUncompChunkSize<uncomp_chunk_size> + ...);
// Calculate the maximum compressed size of a single chunk in bytes
__device__ __host__ constexpr size_t COMP::max_comp_chunk_size();
Scratch Memory Requirement Methods#
nvCOMPDx, depending on the selected algorithm, the direction of operation, etc., may require a scratch memory buffer either in shared memory, global memory, or both. The required buffer sizes are calculated during compilation time using the COMP
or DECOMP
descriptors.
Note, that the methods shmem_size_group
and tmp_size_group
return buffer sizes required for a single instance of execution level (warp or block), which may differ from the total allocation requirements if multiple chunks are processed simultaneously.
If Warp Operator is used, and multiple chunks are calculated simultaneously in one thread block, one must multiply the output of shmem_size_group
by the number of chunks processed in parallel in a thread block. One can also retrieve the shared memory scratch buffer offset using shmem_size_group
by multiplying the output with the current warp_id
. Analogously, one can retrieve the global memory scratch buffer offset using tmp_size_group
by multiplying the output with the global chunk_id
.
Note
The values returned by shmem_size_group
and tmp_size_group
already take into consideration the alignment requirements of scratch buffers for consecutive chunks. This means that users don’t need to factor in shmem_alignment
or tmp_alignment
during the offset calculation process. For more information, please check the alignment requirements below.
// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);
// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);
// Calculate the amount of shared memory scratch space required for the selected
// execution level, for a single instance (one warp, or one block) in bytes.
__device__ __host__ constexpr size_t COMP::shmem_size_group();
// or
__device__ __host__ constexpr size_t DECOMP::shmem_size_group();
// Calculate the amount of global memory scratch space required for the selected
// execution level, for a single instance (one warp or one block) in bytes.
__device__ __host__ constexpr size_t COMP::tmp_size_group();
// or
__device__ __host__ constexpr size_t DECOMP::tmp_size_group();
The methods shmem_size_group
and tmp_size_group
retrieve buffer sizes for a single instance of execution level (either one warp or one block). To retrieve the total amount of global scratch buffer required, one must use the tmp_size_total
method.
Prior to the kernel invocation, the total number of chunks to be processed must be known.
// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);
// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);
// Calculate the total amount of global memory scratch space required
// for the whole kernel execution in bytes.
//
// - num_chunks[in] The total number of chunks processed by the kernel
__host__ constexpr size_t COMP::tmp_size_total(size_t num_chunks);
// or
__host__ constexpr size_t DECOMP::tmp_size_total(size_t num_chunks);
Warning
The scratch buffer allocation requirements may change between nvCOMPDx releases, therefore, please always make sure to query the scratch buffer requirements and allocate the scratch buffers accordingly. Using scratch buffers smaller than their required size is considered undefined behavior.
Alignment Methods#
nvCOMPDx requires its buffers to be aligned to specific byte boundaries. The alignment requirements may be different for different algorithms, and may also depend on the selected execution level (warp or block) and selected data type. The alignments required for the input, output, shared, and global scratch buffers are always going to be a power of 2.
Note
Global memory allocations directly provided by cudaMalloc(...)
, cudaMallocPitch(...)
, etc., return buffers with 256-byte alignment, and therefore these allocations automatically satisfy the input, output, and global scratch buffer alignment requirements.
// Compression description
using COMP = decltype(nvcompdx::Direction<nvcompdx::direction::compress> + ...);
// Decompression description
using DECOMP = decltype(nvcompdx::Direction<nvcompdx::direction::decompress> + ...);
// Retrieve the required alignment for the shared memory scratch buffer.
__device__ __host__ constexpr size_t COMP::shmem_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::shmem_alignment();
// Retrieve the required alignment for the global memory scratch buffer.
__device__ __host__ constexpr size_t COMP::tmp_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::tmp_alignment();
// Retrieve the required alignment for the input data. Depending on the
// direction, this can mean either the uncompressed/raw buffer (compressor) or
// the compressed buffer (decompressor).
__device__ __host__ constexpr size_t COMP::input_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::input_alignment();
// Retrieve the required alignment for the output data. Depending on the
// direction, this can mean either the compressed buffer (compressor) or
// the decompressed buffer (decompressor).
__device__ __host__ constexpr size_t COMP::output_alignment();
// or
__device__ __host__ constexpr size_t DECOMP::output_alignment();
Warning
The alignment requirements may change between nvCOMPDx releases, therefore, please always make sure to query the alignment requirements before invoking nvCOMPDx methods. Violating alignment requirements for any of the required buffers is considered undefined behavior.