Device API#
This is the Device API reference for the NVIDIA® nvCOMP library.
Note
Disclaimer: Device APIs are experimental and are subject to change in future nvCOMP releases.
Operators#
-
namespace device#
Enums
-
enum nvcomp_direction#
Selection of compression or decompression.
Values:
-
enumerator compress#
-
enumerator decompress#
-
enum nvcomp_algo#
The compression algorithm to be selected.
Values:
-
enumerator ans#
-
enumerator zstd#
-
enumerator bitcomp#
-
enumerator lz4#
-
enumerator deflate#
-
enumerator gdeflate#
-
enum nvcomp_datatype#
The way in which the compression algorithm will interpret the input data.
Values:
-
enumerator uint8#
Data to be interpreted as consecutive bytes. If the input datatype is not included in the options below, uint8 should be selected.
-
enumerator float16#
Data to be interpreted as consecutive IEEE half-precision floats. Requires the total number of input bytes per chunk to be divisible by two.
-
enumerator bfloat16#
Data to be interpreted as consecutive bfloat16 values. Requires the total number of input bytes per chunk to be divisible by two.
-
namespace detail#
Enums
-
enum nvcomp_operator#
Values:
-
enumerator direction#
Selects between compression/decompression.
-
enumerator algo#
Selects the compression algorithm.
-
enumerator grouptype#
Selects the thread group type to work on.
-
enumerator datatype#
The format of the input data.
-
enumerator max_uncomp_chunk_size#
The maximum uncompressed chunk size (for compression only).
User API#
-
namespace device
-
namespace detail
-
template<class ...Operators>
class nvcomp_device_execution#
#include <user_api.hpp>
An object of type `nvcomp_device_execution` must be constructed in order to use the device-side API. To do this, start by declaring a user-defined type using the following operators:
`Direction()`, `Algo()`, `Grouptype()`, `Datatype()`, `MaxUncompChunkSize()`
See the descriptions of these operators in `operators.hpp`.
The custom type is created by combining the available operators with `+`. The resulting type is then used to initialize an object of type `nvcomp_device_execution`, and device API calls can be made through that object.
Example using ANS compression:
```
using namespace nvcomp::device;

constexpr size_t max_uncomp_chunk_size = 1 << 15;
constexpr size_t num_warps_per_block = 1;

using ans_base_type = decltype(Grouptype<nvcomp_grouptype::warp>() +
                               Algo<nvcomp_algo::ans>() +
                               Datatype<nvcomp_datatype::float16>());
using ans_compressor_type = decltype(ans_base_type() +
                                     Direction<nvcomp_direction::compress>() +
                                     MaxUncompChunkSize<max_uncomp_chunk_size>());

auto compressor = ans_compressor_type();
size_t shmem_size_warp = compressor.shmem_size_group();
size_t tmp_size_warp = compressor.tmp_size_group();
constexpr size_t scs = compressor.shmem_size_block(num_warps_per_block);
constexpr size_t shmem_alignment = compressor.shmem_alignment();
__shared__ uint8_t __align__(shmem_alignment) shared_buffer[scs];

auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);

// one warp per CTA in this example
auto wid = blockIdx.x;
auto wid_in_block = threadIdx.x / 32;

compressor.compress(uncomp_chunks[wid],
                    comp_chunks[wid],
                    uncomp_chunk_sizes[wid],
                    comp_chunk_sizes + wid,
                    shared_buffer + shmem_size_warp * wid_in_block,
                    tmp_buf + tmp_size_warp * wid,
                    warp);
```
Public Functions
- template<typename CG> inline void __device__ compress (const void *input_chunk, void *output_chunk, const size_t input_size, size_t *output_size, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)
Compresses a contiguous buffer of data.
- Parameters:
input_chunk – [in] The to-be-compressed chunk. Should be aligned to a 16-byte boundary.
output_chunk – [out] The resulting compressed chunk. Should be aligned to an 8-byte boundary.
input_size – [in] The size in bytes of the to-be-compressed chunk.
output_size – [out] The size in bytes of the resulting compressed chunk.
shared_mem_buf – [in] The shared memory buffer to be used internally by the API
tmp_buf – [in] The global scratch buffer to be used internally by the API
group – [in] The cooperative group which compresses the input
- template<typename CG> inline void __device__ decompress (const void *input_chunk, void *output_chunk, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)
Decompresses a contiguous buffer of data.
- Parameters:
input_chunk – [in] The to-be-decompressed chunk. Should be aligned to an 8-byte boundary.
output_chunk – [out] The resulting decompressed chunk.
shared_mem_buf – [in] The shared memory buffer to be used internally by the API
tmp_buf – [in] The global scratch buffer to be used internally by the API
group – [in] The cooperative group which decompresses the input
- template<typename CG> inline void __device__ decompress (const void *input_chunk, void *output_chunk, const size_t comp_chunk_size, size_t *const decomp_chunk_size, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)
Decompresses a contiguous buffer of data.
- Parameters:
input_chunk – [in] The to-be-decompressed chunk. Should be aligned to an 8-byte boundary.
output_chunk – [out] The resulting decompressed chunk.
comp_chunk_size – [in] The size in bytes of the compressed chunk.
decomp_chunk_size – [out] The size in bytes of the resulting decompressed chunk.
shared_mem_buf – [in] The shared memory buffer to be used internally by the API
tmp_buf – [in] The global scratch buffer to be used internally by the API
group – [in] The cooperative group which decompresses the input
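As a counterpart to the ANS compression example above, the overload taking explicit sizes can be used in a decompression kernel. The following is a minimal sketch, not a definitive implementation: the buffer names (`comp_chunks`, `decomp_chunks`, `comp_chunk_sizes`, `decomp_chunk_sizes`, `tmp_buf`) are placeholders supplied by the caller, and per the operator descriptions, `MaxUncompChunkSize()` is omitted because it applies to compression only.
```
using namespace nvcomp::device;

using ans_decompressor_type = decltype(Grouptype<nvcomp_grouptype::warp>() +
                                       Algo<nvcomp_algo::ans>() +
                                       Datatype<nvcomp_datatype::float16>() +
                                       Direction<nvcomp_direction::decompress>());

auto decompressor = ans_decompressor_type();
size_t shmem_size_warp = decompressor.shmem_size_group();
size_t tmp_size_warp = decompressor.tmp_size_group();

auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);

// one warp per CTA, mirroring the compression example
auto wid = blockIdx.x;
auto wid_in_block = threadIdx.x / 32;

decompressor.decompress(comp_chunks[wid],
                        decomp_chunks[wid],
                        comp_chunk_sizes[wid],
                        decomp_chunk_sizes + wid,
                        shared_buffer + shmem_size_warp * wid_in_block,
                        tmp_buf + tmp_size_warp * wid,
                        warp);
```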
Public Static Functions
- static inline __device__ __host__ constexpr size_t shmem_size_block (size_t num_warps_per_block)
Returns the amount of shared mem necessary for a given CTA.
- Parameters:
num_warps_per_block – [in] The number of warps per block
- Returns:
The amount of shared mem in bytes
- static inline __device__ __host__ constexpr size_t shmem_size_group ()
Returns the amount of shared mem necessary for each cooperative group. Not the same as `shmem_size_block`, because there could be multiple API invocations, each with a different cooperative group, all within the same CTA.
- Returns:
The amount of shared mem in bytes
- static inline __device__ __host__ constexpr size_t shmem_alignment ()
Returns the alignment necessary for the CG’s shared memory allocation.
- Returns:
The shared memory alignment size
-
static inline constexpr size_t max_comp_chunk_size()#
Returns the maximum compressed chunk size.
- Returns:
The max compressed chunk size in bytes
-
static inline constexpr size_t tmp_size_total(size_t num_warps)#
Returns the scratch space size needed for the whole kernel.
- Parameters:
num_warps – [in] Total number of warps in all kernel blocks.
- Returns:
The memory scratch space size in bytes
- static inline __device__ __host__ constexpr size_t tmp_size_group ()
Returns the global memory scratch space needed for each cooperative group. Not the same as `tmp_size_total` because there could be multiple API invocations per kernel, each requiring part of the total amount of global scratch memory.
- Returns:
The global memory scratch space size in bytes
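On the host side, the size queries above can be used to allocate the output and scratch buffers before launching a kernel. A minimal sketch, assuming the `ans_compressor_type` alias from the ANS example and a hypothetical `num_chunks` count with one warp processing each chunk:
```
// Per-chunk upper bound on compressed output size.
constexpr size_t max_comp_size = ans_compressor_type::max_comp_chunk_size();

// One warp per chunk, so the total warp count equals num_chunks.
size_t tmp_size = ans_compressor_type::tmp_size_total(num_chunks);

uint8_t *tmp_buf;
cudaMalloc(&tmp_buf, tmp_size);

uint8_t *comp_data;
cudaMalloc(&comp_data, max_comp_size * num_chunks);
```
Sizing the output buffer with `max_comp_chunk_size()` guarantees each chunk's compressed result fits regardless of how well the data compresses.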