Device API#

Note

Disclaimer: Device APIs are experimental and are subject to change in the next nvCOMP releases.

Operators#

namespace nvcomp
namespace device#

Enums

enum class nvcomp_direction#

Selection of compression or decompression.

Values:

enumerator compress#
enumerator decompress#
enum class nvcomp_algo#

The compression algorithm to be selected.

Values:

enumerator ans#
enumerator zstd#
enumerator bitcomp#
enumerator lz4#
enumerator deflate#
enumerator gdeflate#
enum class nvcomp_datatype#

The way in which the compression algo will interpret the input data.

Values:

enumerator uint8#

Data to be interpreted as consecutive bytes. If the input datatype is not included in the options below, uint8 should be selected.

enumerator float16#

Data to be interpreted as consecutive IEEE half-precision floats. Requires the total number of input bytes per chunk to be divisible by two.

enumerator bfloat16#

Data to be interpreted as consecutive bfloat16 values. Requires the total number of input bytes per chunk to be divisible by two.

enum class nvcomp_grouptype#

Threads group type to work on.

Values:

enumerator warp#

Group provided to API expected to be single-warp-sized.

namespace detail#

Enums

enum class nvcomp_operator#

Values:

enumerator direction#

Selects between compression/decompression.

enumerator algo#

Selects the compression algorithm.

enumerator grouptype#

Selects threads group type to work on.

enumerator datatype#

The format of the input data.

enumerator max_uncomp_chunk_size#

The maximum uncompressed chunk size. (For compression only).

User API#

namespace nvcomp
namespace device
namespace detail
template<class ...Operators>
class nvcomp_device_execution#
#include <user_api.hpp>

An object of type nvcomp_device_execution is to be constructed in order to use the device-side API. To do this, start by declaring a user-defined type using the following operators:

Direction() Algo() Grouptype() Datatype() MaxUncompChunkSize()

See the description of these operators in operators.hpp

The custom type is created using the + between the available operators. The resulting type is then used to initialize an object of type nvcomp_device_execution and device API calls can be made from that object.

Example using ANS compression:

using namespace nvcomp::device;

constexpr size_t max_uncomp_chunk_size = 1<<15;
constexpr size_t num_warps_per_block = 1;

using ans_base_type =
  decltype(Grouptype<nvcomp_grouptype::warp>() +
           Algo<nvcomp_algo::ans>() +
           Datatype<nvcomp_datatype::float16>());

using ans_compressor_type =
  decltype(ans_base_type() +
           Direction<nvcomp_direction::compress>() +
           MaxUncompChunkSize<max_uncomp_chunk_size>());

auto compressor = ans_compressor_type();
size_t shmem_size_warp = compressor.shmem_size_group();
size_t tmp_size_warp = compressor.tmp_size_group();
constexpr size_t scs = compressor.shmem_size_block(num_warps_per_block);
constexpr size_t shmem_alignment = compressor.shmem_alignment();
__shared__ uint8_t __align__(shmem_alignment) shared_buffer[scs];

auto block = cg::this_thread_block();
auto warp = cg::tiled_partition<32>(block);

// one warp per CTA in this example
auto wid = blockIdx.x;
auto wid_in_block = threadIdx.x/32;

compressor.compress(
  uncomp_chunks[wid],
  comp_chunks[wid],
  uncomp_chunk_sizes[wid],
  comp_chunk_sizes + wid,
  shared_buffer + shmem_size_warp*wid_in_block,
  tmp_buf + tmp_size_warp*wid,
  warp);

Public Functions

template<typename CG> inline void __device__ compress (const void *input_chunk, void *output_chunk, const size_t input_size, size_t *output_size, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)

Compresses a contiguous buffer of data.

Parameters:
  • input_chunk[in] The to-be-compressed chunk. Should be aligned to a 16 byte boundary.

  • output_chunk[in] The resulting compressed chunk. Should be aligned to an 8 bytes boundary.

  • input_size[in] The size in bytes of the to-be-comrpessed chunk

  • output_size[in] The size in bytes of the resulting compressed chunk

  • shared_mem_buf[in] The shared memory buffer to be used internally by the API

  • tmp_buf[in] The global scratch buffer to be used internally by the API

  • group[in] The cooperative group which compresses the input

template<typename CG> inline void __device__ decompress (const void *input_chunk, void *output_chunk, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)

Decompresses a contiguous buffer of data.

Parameters:
  • input_chunk[in] The to-be-decompressed chunk. Should be aligned to an 8 byte boundary.

  • output_chunk[in] The resulting decompressed chunk

  • shared_mem_buf[in] The shared memory buffer to be used internally by the API

  • tmp_buf[in] The global scratch buffer to be used internally by the API

  • group[in] The cooperative group which decompresses the input

template<typename CG> inline void __device__ decompress (const void *input_chunk, void *output_chunk, const size_t comp_chunk_size, size_t *const decomp_chunk_size, uint8_t *shared_mem_buf, uint8_t *tmp_buf, CG &group)

Decompresses a contiguous buffer of data.

Parameters:
  • input_chunk[in] The to-be-decompressed chunk. Should be aligned to an 8 byte boundary.

  • output_chunk[in] The resulting decompressed chunk

  • comp_chunk_size[in] The size of the compressed chunk

  • decomp_chunk_size[in] The size of the resulting decompressed chunk

  • shared_mem_buf[in] The shared memory buffer to be used internally by the API

  • tmp_buf[in] The global scratch buffer to be used internally by the API

  • group[in] The cooperative group which decompresses the input

Public Static Functions

static inline __device__ constexpr __host__ size_t shmem_size_block (size_t num_warps_per_block)

Returns the amount of shared mem necessary for a given CTA.

Parameters:

num_warps_per_block[in] The number of warps per block

Returns:

The amount of shared mem in bytes

static inline __device__ constexpr __host__ size_t shmem_size_group ()

Returns the amount of shared mem necessary for each cooperative group. Not the same as shmem_size_block because there could be multiple API invocations, each with a different cooperative group that are all part of the same CTA.

Returns:

The amount of shared mem in bytes

static inline __device__ constexpr __host__ size_t shmem_alignment ()

Returns the alignment necessary for the CG’s shared memory allocation.

Returns:

The shared memory alignment size

static inline constexpr size_t max_comp_chunk_size()#

Returns the maximium compressed chunk size.

Returns:

The max compressed chunk size in bytes

static inline constexpr size_t tmp_size_total(size_t num_warps)#

Returns the scratch space size needed for the whole kernel.

Parameters:

num_warps[in] Total number of warps in all kernel blocks.

Returns:

The memory scratch space size in bytes

static inline __device__ constexpr __host__ size_t tmp_size_group ()

Returns the global memory scratch space needed for each cooperative group. Not the same as tmp_size_total because there could be multiple API invocations per kernel, each requiring part of the total amount of global scratch memory.

Returns:

The global memory scratch space size in bytes

Private Types

using algo = typename get_operator<nvcomp_operator::algo, Operators...>::type#
using direction = typename get_operator<nvcomp_operator::direction, Operators...>::type#
using max_uncomp_chunk_size = typename get_operator<nvcomp_operator::max_uncomp_chunk_size, Operators...>::type#
using data_type = typename get_operator<nvcomp_operator::datatype, Operators...>::type#
using group_type = typename get_operator<nvcomp_operator::grouptype, Operators...>::type#