Execution Operators#
Execution operators configure how the function will run on the GPU. Combined with description operators, they form a complete function descriptor that can be executed on a GPU.
| Operator | Default value | Description |
|---|---|---|
| `Thread` | Not set | Creates a thread execution object. |
| `Block` | Not set | Creates a block execution object. See Block Configuration Operators. |
Thread Operator#
cusolverdx::Thread()
Generates an operation to run in a thread context. Each thread independently runs a Solver operation, described using description operators.
The following code example creates a function descriptor to compute singular value decomposition for a bidiagonal FP64 matrix of size 8x8 in a thread context:
```cpp
#include <cusolverdx.hpp>
using namespace cusolverdx;

using Solver = decltype(Size<8>()
                        + Precision<double>()
                        + Type<type::real>()
                        + Function<function::bdsvd>()
                        + SM<900>()
                        + Thread());
```
Restrictions:

- Mutually exclusive with the Block Operator.
- Compilation will fail when used with block-only operators: the BatchesPerBlock Operator and the BlockDim Operator.
Important
The Thread operator is intended for operations with small problem sizes that can entirely fit into registers for the best performance. We generally recommend using the Thread operator only for dense problems with M x N <= 100 or sparse bidiagonal or tridiagonal problems with N <= 50, although the optimal sizes vary depending on the specific function and workflow, data type, precision, and GPU architecture.
Important
Thread execution functions accept input and output data residing in global memory, shared memory, or registers. When the data resides in shared memory or registers, it is the user's responsibility to select a launch configuration (i.e., thread block size) that ensures the data fits within the thread block's shared memory and does not cause register spilling.
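As an illustration of thread-context execution, the sketch below runs one independent 8x8 bidiagonal SVD per thread on global-memory data. The kernel shape and the `execute()` argument list are assumptions for illustration, not the documented API; consult the execution-method reference for the actual signature.

```cpp
// Hypothetical usage sketch -- the execute() signature is an assumption.
#include <cusolverdx.hpp>
using namespace cusolverdx;

using Solver = decltype(Size<8>() + Precision<double>() + Type<type::real>()
                        + Function<function::bdsvd>() + SM<900>() + Thread());

__global__ void bdsvd_thread_kernel(double* diag, double* offdiag, double* svals) {
    // Each thread solves its own independent 8x8 bidiagonal SVD; the data
    // here comes from global memory (shared memory or registers also work).
    const unsigned batch = blockIdx.x * blockDim.x + threadIdx.x;
    Solver().execute(diag + batch * 8,      // 8 diagonal elements
                     offdiag + batch * 7,   // 7 off-diagonal elements
                     svals + batch * 8);    // 8 singular values
}
```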
Block Operator#
cusolverdx::Block()
Generates a collective operation to run in a single CUDA block. Threads will cooperate to compute the collective operation. The layout and the number of threads participating in the execution can be configured using description operators and Block Configuration Operators.
The following code example creates a function descriptor for a Solver function that will run in a single CUDA block:
```cpp
#include <cusolverdx.hpp>
using namespace cusolverdx;

using Solver = decltype(Size<32>()
                        + Precision<double>()
                        + Type<type::real>()
                        + Function<function::potrf>()
                        + FillMode<lower>()
                        + Arrangement<row_major>()
                        + SM<900>()
                        + Block());
```
Restrictions:

- Mutually exclusive with the Thread Operator.
- Each function has specific size limitations when used with Block execution, depending on the shared memory size required to run the function for the given problem size. Compilation will fail if the required shared memory size exceeds the available shared memory on the GPU.
Important
Block execution functions accept input and output data in either global memory or shared memory; however, using shared memory is highly recommended to achieve optimal performance. Users should use the Solver::get_shared_memory_size trait to obtain the required shared memory size for the given problem size and should ensure that the data is copied to shared memory before calling the Solver function.
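Putting this together, a block-execution launch might look like the following sketch. The `Solver::block_dim` and `Solver::get_shared_memory_size` traits are the ones named in this documentation; the `execute()` argument list and the explicit global-to-shared copies are assumptions for illustration.

```cpp
// Hypothetical sketch of a block-execution launch; execute() signature assumed.
#include <cusolverdx.hpp>
using namespace cusolverdx;

using Solver = decltype(Size<32>() + Precision<double>() + Type<type::real>()
                        + Function<function::potrf>() + FillMode<lower>()
                        + Arrangement<row_major>() + SM<900>() + Block());

__global__ void potrf_kernel(double* A_global, int* info) {
    extern __shared__ __align__(16) char smem[];
    auto* A_shared = reinterpret_cast<double*>(smem);

    // Copy the 32x32 input into shared memory before calling the solver
    // (assumes a 1D thread block).
    for (unsigned i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        A_shared[i] = A_global[i];
    __syncthreads();

    Solver().execute(A_shared, info);   // assumed signature

    __syncthreads();
    for (unsigned i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        A_global[i] = A_shared[i];
}

// Host side: launch with exactly the block dimensions and shared memory
// size the descriptor requires.
void launch(double* A_global, int* info, cudaStream_t stream) {
    potrf_kernel<<<1, Solver::block_dim,
                   Solver::get_shared_memory_size(), stream>>>(A_global, info);
}
```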
Block Configuration Operators#
Block-configuration operators allow the user to configure the block size of a single CUDA block.
| Operator | Default value | Description |
|---|---|---|
| `BlockDim` | Based on heuristics | Number of threads used to perform the Solver function. |
| `BatchesPerBlock` | 1 | Number of batches to execute in parallel within a single CUDA block. |
Note
Block configuration operators can only be used with Block Operator.
Warning
It is not guaranteed that executions of exactly the same Solver function with exactly the same inputs, but with a different

- CUDA architecture (SM),
- number of threads (BlockDim), or
- number of batches per block (BatchesPerBlock),

will produce bit-identical results.
BlockDim Operator#
cusolverdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()
Sets the CUDA block size to (X, Y, Z), i.e., the number of threads participating in the execution. The block dimensions can be accessed via the Solver::block_dim trait.
If the BlockDim operator is not set, the default block dimensions are used (the default value is Solver::suggested_block_dim). The suggested block dimensions lead to optimal performance for the cuSolverDx function in most cases, but if the function is fused with other operations, we recommend measuring the performance of the kernel first, then experimenting with different values (see Performance).
If the set block dimensions BlockDim<X, Y, Z> are different from the suggested block dimensions, the cuSolverDx operations will still run correctly.
Warning
cuSolverDx enforces the following requirement for kernel launch configurations:
The kernel must be launched with exactly the block dimensions Solver::block_dim, which equals BlockDim<X, Y, Z> if the BlockDim operator is specified, or BlockDim<Solver::suggested_block_dim> otherwise.
Important
cuSolverDx cannot validate all kernel launch configurations at runtime and check that the requirement is met; thus, it is the user’s responsibility to adhere to the rules listed above. Violating these rules is considered undefined behavior and can lead to incorrect results and/or failures.
BatchesPerBlock Operator#
cusolverdx::BatchesPerBlock<unsigned int BPB>()
Specifies the number of batches to compute in parallel within a single CUDA block. The default is 1 batch per block.
Tip
Using multiple batches per block directly impacts both performance and shared memory usage. We recommend using the Suggested batches per block trait to achieve optimal performance.
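For illustration, selecting the batch count from the suggested trait might be sketched as below. The trait name `suggested_batches_per_block` is an assumption inferred from the text above; check the traits reference for the actual identifier.

```cpp
// Hypothetical sketch: extend a base descriptor with the suggested
// batches-per-block value. The trait name is an assumption.
#include <cusolverdx.hpp>
using namespace cusolverdx;

using Base = decltype(Size<32>() + Precision<double>() + Type<type::real>()
                      + Function<function::potrf>() + FillMode<lower>()
                      + SM<900>() + Block());

// Query the suggested value, then rebuild the descriptor with it applied.
using Solver = decltype(Base()
                        + BatchesPerBlock<Base::suggested_batches_per_block>());
```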