Execution Operators#

Execution operators configure how functions will run. When combined with description operators, they form a complete function descriptor that can be executed on a device. nvCOMPDx supports the following execution operators:

| Operator | Description |
|----------|-------------|
| Warp | Creates a warp execution object. |
| Block | Creates a block execution object. See Block Configuration Operators. |

Warning

Repeated executions of the exact same complete compressor function descriptor (with Direction<direction::compress>()) are not guaranteed to produce bit-identical results. Repeated executions of the exact same complete decompressor function descriptor (with Direction<direction::decompress>()), however, will always produce bit-identical results.

Warp Operator#

nvcompdx::Warp()

Generates a collective operation that runs in a single CUDA warp. Threads cooperate to compute the collective operation. The number of threads participating in the execution is fixed at 32, but their layout can be arbitrary as long as the threads are consecutive. The first thread participating in a chunk's computation must have %laneid = 0, which is satisfied whenever its linearized thread ID is a multiple of 32.

For example, the following code snippet creates a function descriptor for a compressor that will run in a single CUDA warp with 32 threads:

#include <nvcompdx.hpp>

using namespace nvcompdx;

// Compressor type using Warp()
using COMP = decltype(Algorithm<algorithm::ans>() +
                      DataType<datatype::uint8>() +
                      Direction<direction::compress>() +
                      MaxUncompChunkSize<32768>() +
                      Warp() +
                      SM<700>());

Note

When the Warp operator is used, the user is responsible for ensuring that exactly one warp of threads calls execute(...). Please consult the examples section to see this in practice.

Block Operator#

nvcompdx::Block()

Generates a collective operation to run in a single CUDA thread block. Threads will cooperate to compute the collective operation. The layout and number of threads participating in the execution can be configured using Block Configuration Operators.

For example, the following code snippet creates a function descriptor for a compressor that will run in a single CUDA thread block with 128 threads (4 warps):

#include <nvcompdx.hpp>

using namespace nvcompdx;

// Compressor base type
using COMP = decltype(Algorithm<algorithm::ans>() +
                      DataType<datatype::uint8>() +
                      Direction<direction::compress>() +
                      MaxUncompChunkSize<32768>() +
                      Block() +
                      SM<700>());

// Compressor type using BlockDim<>()
using COMP_BLOCKDIM = decltype(COMP() +
                               BlockDim<128,1,1>());

// Compressor type using BlockWarp<>()
using COMP_BLOCKWARP = decltype(COMP() +
                                BlockWarp<4, true>());

Block Configuration Operators#

Block configuration operators allow users to configure the dimensions of a single CUDA thread block. See the BlockDim Operator and BlockWarp Operator sections below for details.

| Operator | Default value | Description |
|----------|---------------|-------------|
| BlockDim<X, Y, Z> | N/A | Number of threads used to perform compression or decompression. |
| BlockWarp<NumWarps, Complete> | N/A | Number of warps used to perform compression or decompression. |

Note

Block configuration operators can only be used with a Block Operator.

Warning

nvCOMPDx cannot validate all kernel launch configurations at runtime and check that the requirements are met. Therefore, it is the user’s responsibility to adhere to the rules listed below. Violating these rules is considered undefined behavior and can lead to incorrect results and/or failures.

BlockDim Operator#

nvcompdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()

Sets the CUDA block size to (X, Y, Z) to configure the execution, meaning it sets the number of threads participating in the execute(...) call. Afterwards, the specified block configuration can be accessed via the nvcompdx::block_dim_of<COMP> trait.

Adding BlockDim<X, Y, Z> to the description introduces the following requirements for the execution of the COMP function:

  • The kernel must be launched with the 3D block dimension dim3(X1, Y1, Z1), where X1*Y1*Z1 == X*Y*Z

  • Exactly X * Y * Z threads must participate in the execution

  • The product X * Y * Z must be greater than or equal to 32, and it is recommended to be an integer multiple of 32

If the function is fused with other operations, we recommend measuring the performance of the kernel with different values (see Achieving High Performance).

Examples

  • BlockDim<128,1,1> is used, while the CUDA thread block is dim3(128, 1, 1) - Valid

  • BlockDim<128,1,1> is used, while the CUDA thread block is dim3(64, 2, 1) - Valid

  • BlockDim<128,1,1> is used, while the CUDA thread block is dim3(256, 1, 1) - Invalid

BlockWarp Operator#

nvcompdx::BlockWarp<unsigned int NumWarps, bool Complete>()

Sets the execution width to NumWarps warps; Complete indicates whether the thread block contains exactly NumWarps warps (true) or possibly more (false). Afterwards, the specified block configuration can be accessed via the nvcompdx::block_warp_of<COMP> trait.

Adding BlockWarp<NumWarps, Complete> to the description introduces the following requirements for the execution of the COMP function:

  • The kernel must be launched with the 3D block dimension dim3(X, Y, Z), where X * Y * Z >= NumWarps * 32

  • The participating threads must be consecutive (adjacent) threads, and the first thread must have the index (threadIdx.x, threadIdx.y, threadIdx.z) = (0, 0, 0)

  • At least NumWarps warps must participate in the execution, and NumWarps must be greater than or equal to 1

  • When Complete is set to true, the actual CUDA thread block dimension must contain an integer number of warps, and their number must be exactly NumWarps. The thread block layout can be arbitrary.

  • When Complete is set to false, the actual number of warps in the thread block can be equal to or greater than NumWarps. The thread block layout can be arbitrary.

If the function is fused with other operations, we recommend measuring the performance of the kernel with different values (see Achieving High Performance).

Examples

  • BlockWarp<4, true> is used, while the CUDA thread block is dim3(128, 1, 1) - Valid

  • BlockWarp<4, true> is used, while the CUDA thread block is dim3(64, 2, 1) - Valid

  • BlockWarp<4, false> is used, while the CUDA thread block is dim3(256, 1, 1) - Valid

  • BlockWarp<4, true> is used, while the CUDA thread block is dim3(256, 1, 1) - Invalid