Device API#

This is the Device API reference for the NVIDIA® nvCOMP library.

Note

Disclaimer: Device APIs are experimental and are subject to change in future nvCOMP releases.

Operators#

namespace device#

Enums

enum nvcomp_direction#

Selection of compression or decompression.

Values:

enumerator compress#
enumerator decompress#
enum nvcomp_algo#

The compression algorithm to be selected.

Values:

enumerator ans#
enumerator zstd#
enumerator bitcomp#
enumerator lz4#
enumerator deflate#
enumerator gdeflate#
enum nvcomp_datatype#

The way in which the compression algorithm will interpret the input data.

Values:

enumerator uint8#

Data to be interpreted as consecutive bytes. If the input datatype is not included in the options below, uint8 should be selected.

enumerator float16#

Data to be interpreted as consecutive IEEE half-precision floats. Requires the total number of input bytes per chunk to be divisible by two.

enumerator bfloat16#

Data to be interpreted as consecutive bfloat16 values. Requires the total number of input bytes per chunk to be divisible by two.

enum nvcomp_grouptype#

The type of thread group to work on.

Values:

enumerator warp#

The group provided to the API is expected to be single-warp-sized.

namespace detail#

Enums

enum nvcomp_operator#

Values:

enumerator direction#

Selects between compression/decompression.

enumerator algo#

Selects the compression algorithm.

enumerator grouptype#

Selects the thread group type to work on.

enumerator datatype#

The format of the input data.

enumerator max_uncomp_chunk_size#

The maximum uncompressed chunk size (for compression only).

User API#

namespace device
namespace detail
template<class ...Operators>
class nvcomp_device_execution#
#include <user_api.hpp>

To use the device-side API, construct an object of type `nvcomp_device_execution`. Start by declaring a user-defined type using the following operators:

`Direction()`, `Algo()`, `Grouptype()`, `Datatype()`, `MaxUncompChunkSize()`

See the descriptions of these operators in `operators.hpp`.

The custom type is created by combining the available operators with `+`. The resulting type is then used to initialize an object of type `nvcomp_device_execution`, and device API calls can be made through that object.

Example using ANS compression:

```
using namespace nvcomp::device;
namespace cg = cooperative_groups; // assumes <cooperative_groups.h> is included

constexpr size_t max_uncomp_chunk_size = 1 << 15;
constexpr size_t num_warps_per_block = 1;

using ans_base_type = decltype(Grouptype<nvcomp_grouptype::warp>() +
                               Algo<nvcomp_algo::ans>() +
                               Datatype<nvcomp_datatype::float16>());

using ans_compressor_type = decltype(ans_base_type() +
                                     Direction<nvcomp_direction::compress>() +
                                     MaxUncompChunkSize<max_uncomp_chunk_size>());

auto compressor = ans_compressor_type();
size_t shmem_size_warp = compressor.shmem_size_group();
size_t tmp_size_warp = compressor.tmp_size_group();
constexpr size_t scs = compressor.shmem_size_block(num_warps_per_block);
constexpr size_t shmem_alignment = compressor.shmem_alignment();
__shared__ uint8_t __align__(shmem_alignment) shared_buffer[scs];

auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);

// one warp per CTA in this example
auto wid = blockIdx.x;
auto wid_in_block = threadIdx.x / 32;

compressor.compress(uncomp_chunks[wid],
                    comp_chunks[wid],
                    uncomp_chunk_sizes[wid],
                    comp_chunk_sizes + wid,
                    shared_buffer + shmem_size_warp * wid_in_block,
                    tmp_buf + tmp_size_warp * wid,
                    warp);
```

Public Functions

template<typename CG> inline void __device__ compress (const void *input_chunk, void *output_chunk, const size_t input_size, size_t *output_size, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)

Compresses a contiguous buffer of data.

Parameters:
  • input_chunk[in] The to-be-compressed chunk. Should be aligned to a 16-byte boundary.

  • output_chunk[out] The resulting compressed chunk. Should be aligned to an 8-byte boundary.

  • input_size[in] The size in bytes of the to-be-compressed chunk.

  • output_size[out] The size in bytes of the resulting compressed chunk.

  • shared_mem_buf[in] The shared memory buffer to be used internally by the API.

  • tmp_buf[in] The global scratch buffer to be used internally by the API.

  • group[in] The cooperative group which compresses the input.

template<typename CG> inline void __device__ decompress (const void *input_chunk, void *output_chunk, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)

Decompresses a contiguous buffer of data.

Parameters:
  • input_chunk[in] The to-be-decompressed chunk. Should be aligned to an 8-byte boundary.

  • output_chunk[out] The resulting decompressed chunk.

  • shared_mem_buf[in] The shared memory buffer to be used internally by the API.

  • tmp_buf[in] The global scratch buffer to be used internally by the API.

  • group[in] The cooperative group which decompresses the input.

template<typename CG> inline void __device__ decompress (const void *input_chunk, void *output_chunk, const size_t comp_chunk_size, size_t *const decomp_chunk_size, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)

Decompresses a contiguous buffer of data.

Parameters:
  • input_chunk[in] The to-be-decompressed chunk. Should be aligned to an 8-byte boundary.

  • output_chunk[out] The resulting decompressed chunk.

  • comp_chunk_size[in] The size in bytes of the compressed chunk.

  • decomp_chunk_size[out] The size in bytes of the resulting decompressed chunk.

  • shared_mem_buf[in] The shared memory buffer to be used internally by the API.

  • tmp_buf[in] The global scratch buffer to be used internally by the API.

  • group[in] The cooperative group which decompresses the input.
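
A sketch of this size-returning overload in a warp-per-chunk kernel, mirroring the compression example above (the buffer names `comp_chunks`, `decomp_chunks`, `comp_chunk_sizes`, `decomp_chunk_sizes`, and `tmp_buf` are hypothetical):

```
using namespace nvcomp::device;
namespace cg = cooperative_groups;

constexpr size_t num_warps_per_block = 1;

// No MaxUncompChunkSize operator here: it applies to compression only.
using ans_decompressor_type = decltype(Grouptype<nvcomp_grouptype::warp>() +
                                       Algo<nvcomp_algo::ans>() +
                                       Datatype<nvcomp_datatype::float16>() +
                                       Direction<nvcomp_direction::decompress>());

auto decompressor = ans_decompressor_type();
size_t shmem_size_warp = decompressor.shmem_size_group();
size_t tmp_size_warp = decompressor.tmp_size_group();
constexpr size_t scs = ans_decompressor_type::shmem_size_block(num_warps_per_block);
constexpr size_t shmem_alignment = ans_decompressor_type::shmem_alignment();
__shared__ uint8_t __align__(shmem_alignment) shared_buffer[scs];

auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);

// one warp per CTA in this example
auto wid = blockIdx.x;
auto wid_in_block = threadIdx.x / 32;

decompressor.decompress(comp_chunks[wid],
                        decomp_chunks[wid],
                        comp_chunk_sizes[wid],
                        decomp_chunk_sizes + wid,
                        shared_buffer + shmem_size_warp * wid_in_block,
                        tmp_buf + tmp_size_warp * wid,
                        warp);
```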

Public Static Functions

static inline __device__ __host__ constexpr size_t shmem_size_block (size_t num_warps_per_block)

Returns the amount of shared memory necessary for a given CTA.

Parameters:

num_warps_per_block[in] The number of warps per block

Returns:

The amount of shared memory in bytes

static inline __device__ __host__ constexpr size_t shmem_size_group ()

Returns the amount of shared memory necessary for each cooperative group. Not the same as `shmem_size_block`, because there can be multiple API invocations, each with a different cooperative group, all within the same CTA.

Returns:

The amount of shared memory in bytes
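
For example, with the `ans_compressor_type` and `num_warps_per_block` from the example above, each warp-sized group indexes its own slice of the per-block shared memory allocation (a minimal sketch):

```
constexpr size_t per_group = ans_compressor_type::shmem_size_group();
constexpr size_t per_block = ans_compressor_type::shmem_size_block(num_warps_per_block);
// Group i in the block uses: shared_buffer + per_group * i
```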

static inline __device__ __host__ constexpr size_t shmem_alignment ()

Returns the alignment necessary for the CG’s shared memory allocation.

Returns:

The shared memory alignment size

static inline constexpr size_t max_comp_chunk_size()#

Returns the maximum compressed chunk size.

Returns:

The max compressed chunk size in bytes
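
Since `max_comp_chunk_size()` is `constexpr`, it can also be evaluated on the host to size the per-chunk output buffers before launch (a sketch, assuming the `ans_compressor_type` from the example above and a hypothetical `num_chunks`):

```
// Each compressed chunk needs at most max_comp_chunk_size() bytes.
constexpr size_t max_comp_size = ans_compressor_type::max_comp_chunk_size();
uint8_t* comp_data = nullptr;
cudaMalloc(&comp_data, max_comp_size * num_chunks);
```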

static inline constexpr size_t tmp_size_total(size_t num_warps)#

Returns the scratch space size needed for the whole kernel.

Parameters:

num_warps[in] Total number of warps in all kernel blocks.

Returns:

The memory scratch space size in bytes
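
A host-side sketch of allocating the global scratch buffer for a launch (assuming a hypothetical grid of `num_blocks` blocks, each with `num_warps_per_block` warps, and the `ans_compressor_type` from the example above):

```
// The scratch buffer must cover every warp across all blocks of the kernel.
size_t num_warps = num_blocks * num_warps_per_block;
uint8_t* tmp_buf = nullptr;
cudaMalloc(&tmp_buf, ans_compressor_type::tmp_size_total(num_warps));
```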

static inline __device__ __host__ constexpr size_t tmp_size_group ()

Returns the size of the global memory scratch space needed for each cooperative group. Not the same as `tmp_size_total`, because there can be multiple API invocations per kernel, each requiring part of the total amount of global scratch memory.

Returns:

The global memory scratch space size in bytes