################################################################################
Getting Started
################################################################################

In this section, we show how to implement a sparse matrix-matrix multiplication
using cuSPARSELt. We first give an overview of the workflow, presenting the main
steps needed to set up the computation. Then, we describe how to install the
library and how to compile programs against it. Lastly, we present a
step-by-step code example with additional comments.

* `cuSPARSELt Workflow`_
* `Installation and Compilation`_

  * `Linux`_
  * `Windows`_

* `Code Example`_

.. _`cuSPARSELt Workflow`:

================================================================================
cuSPARSELt Workflow
================================================================================

| cuSPARSELt follows an equivalent approach and adopts similar concepts to
  `cuBLASLt <https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublaslt-api>`_
  and `cuTENSOR <https://docs.nvidia.com/cuda/cutensor/index.html>`_.
  The library programming model requires organizing the computation in such a
  way that the same setup can be reused with different inputs.
| In particular, the model relies on the following high-level stages:

| *A.* **Problem definition**: Specify the matrix shapes, data types, operations, etc.
| *B.* **User preferences/constraints**: Select the algorithm, or limit the search space of viable implementations (candidates)
| *C.* **Plan**: Gather the descriptors for the execution and "find" the best implementation if needed
| *D.* **Execution**: Perform the actual computation

In more detail, the common workflow consists of the following steps:

| **1.** Initialize the library handle: `cusparseLtInit()`.
| **2.** Specify the input/output matrix characteristics: `cusparseLtDenseDescriptorInit()`, `cusparseLtStructuredDescriptorInit()`.
| **3.** Initialize the matrix multiplication descriptor and its properties (e.g. operations, compute type, etc.): `cusparseLtMatmulDescriptorInit()`.
| **4.** Initialize the algorithm selection descriptor: `cusparseLtMatmulAlgSelectionInit()`.
| **5.** Initialize the matrix multiplication plan: `cusparseLtMatmulPlanInit()`.
| **6.** Prune the `A` matrix: `cusparseLtSpMMAPrune()`. This step is not needed if the user provides an already-pruned matrix.
| **7.** Compress the pruned matrix: `cusparseLtSpMMACompress()`.
| **8.** Compute the required workspace size: `cusparseLtMatmulGetWorkspace()`. Allocate a device buffer of this size.
| **9.** Execute the matrix multiplication: `cusparseLtMatmul()`. This step can be repeated multiple times with different input values.
| **10.** Destroy the matrix descriptors, the matrix multiplication plan, and the library handle: `cusparseLtMatDescriptorDestroy()`, `cusparseLtMatmulPlanDestroy()`, `cusparseLtDestroy()`.

.. image:: workflow.png
    :width: 700px
    :align: center
    :alt: workflow

.. _`Installation and Compilation`:

================================================================================
Installation and Compilation
================================================================================

Download the cuSPARSELt package from
`developer.nvidia.com/cusparselt/downloads <https://developer.nvidia.com/cusparselt/downloads>`_.

================================================================================
Prerequisites
================================================================================

* `CUDA 12.0 toolkit (or above) <https://developer.nvidia.com/cuda-downloads>`_
  and a compatible driver (see `CUDA Driver Release Notes
  <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html>`_).
* *Dependencies*: `cudart`, `cuda`, `cusparse.h` header
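Before compiling, it can be useful to confirm that a suitable toolkit and driver
are visible on the system. A quick check (Linux commands shown; `nvcc` ships
with the CUDA toolkit and `nvidia-smi` with the driver):

.. code-block:: bash

    nvcc --version   # CUDA toolkit version; must report release 12.0 or above
    nvidia-smi       # installed driver and the highest CUDA version it supports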
.. _`Linux`:

--------------------------------------------------------------------------------
Linux
--------------------------------------------------------------------------------

Assuming cuSPARSELt has been extracted to `CUSPARSELT_DIR`, we update the
library path accordingly:

.. code-block:: bash

    export LD_LIBRARY_PATH=${CUSPARSELT_DIR}/lib64:${LD_LIBRARY_PATH}

To compile the sample code discussed below (`matmul.cu`):

.. code-block:: bash

    nvcc matmul.cu -I${CUSPARSELT_DIR}/include -L${CUSPARSELT_DIR}/lib64 -lcusparseLt -ldl -lcuda -o matmul

Note that the previous command links `cusparseLt` as a *shared* library. Linking
the code with the *static* version of the library requires additional flags:

.. code-block:: bash

    nvcc matmul.cu -I${CUSPARSELT_DIR}/include                   \
         -Xlinker=${CUSPARSELT_DIR}/lib64/libcusparseLt_static.a \
         -o matmul_static -ldl -lcuda

.. _`Windows`:

--------------------------------------------------------------------------------
Windows
--------------------------------------------------------------------------------

Assuming cuSPARSELt has been extracted to `CUSPARSELT_DIR`, we update the
library path accordingly:

.. code-block:: bash

    setx PATH "%CUSPARSELT_DIR%\lib:%PATH%"

To compile the sample code discussed below (`matmul.cu`):

.. code-block:: bash

    nvcc.exe matmul.cu -I "%CUSPARSELT_DIR%\include" -lcusparseLt -lcuda -o matmul.exe

Note that the previous command links `cusparseLt` as a *shared* library. Linking
the code with the *static* version of the library requires additional flags:

.. code-block:: bash

    nvcc.exe matmul.cu -I %CUSPARSELT_DIR%\include                               \
             -Xlinker=/WHOLEARCHIVE:"%CUSPARSELT_DIR%\lib\cusparseLt_static.lib" \
             -Xlinker=/FORCE -lcuda -o matmul.exe

.. _`Code Example`:

================================================================================
Code Example
================================================================================

| The following code example shows the common steps needed to use cuSPARSELt and
  performs the matrix multiplication.
| The full code can be found in
  `cuSPARSELt Example 1 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul>`_.
| A more advanced example that demonstrates the use of Batched Sparse GEMM,
  activation function, and bias can be found in
  `cuSPARSELt Example 2 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul_advanced>`_.

.. code-block:: cpp

    #include <cuda_runtime_api.h> // cudaMalloc, cudaMemcpyAsync, etc.
    #include <cuda_fp16.h>        // __half
    #include <cusparseLt.h>       // cusparseLt header
    #include <cstdio>             // std::printf
    #include <cstdlib>            // EXIT_FAILURE

    // Device pointers and coefficient definitions
    float alpha = 1.0f;
    float beta  = 0.0f;
    __half* dA = ...
    __half* dB = ...
    __half* dC = ...
    __half* dD = ...
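    //--------------------------------------------------------------------------
    // Problem definition. NOTE: the concrete values and derivations below are
    // an illustrative sketch added for completeness (the original snippet
    // leaves them to the reader); adjust the sizes and enums to your problem.
    // A is the m x k structured (sparse) matrix, B is the k x n dense matrix,
    // and C/D are the m x n dense matrices; everything is row-major and
    // non-transposed here.
    constexpr int m = 32, n = 32, k = 32;
    auto     order        = CUSPARSE_ORDER_ROW;
    auto     opA          = CUSPARSE_OPERATION_NON_TRANSPOSE;
    auto     opB          = CUSPARSE_OPERATION_NON_TRANSPOSE;
    auto     type         = CUDA_R_16F;
    auto     compute_type = CUSPARSE_COMPUTE_16F; // the exact enum value
                                                  // depends on the cuSPARSELt
                                                  // version
    int      num_A_rows   = m, num_A_cols = k;
    int      num_B_rows   = k, num_B_cols = n;
    int      num_C_rows   = m, num_C_cols = n;
    int      lda          = num_A_cols; // leading dimension = row length
    int      ldb          = num_B_cols; // (row-major storage)
    int      ldc          = num_C_cols;
    unsigned alignment    = 16;         // alignment of the device pointers,
                                        // in bytes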
    //--------------------------------------------------------------------------
    // cusparseLt data structures and handle initialization
    cusparseLtHandle_t             handle;
    cusparseLtMatDescriptor_t      matA, matB, matC;
    cusparseLtMatmulDescriptor_t   matmul;
    cusparseLtMatmulAlgSelection_t alg_sel;
    cusparseLtMatmulPlan_t         plan;
    cudaStream_t                   stream = nullptr;
    cusparseLtInit(&handle);
    //--------------------------------------------------------------------------
    // matrix descriptor initialization
    cusparseLtStructuredDescriptorInit(&handle, &matA, num_A_rows, num_A_cols,
                                       lda, alignment, type, order,
                                       CUSPARSELT_SPARSITY_50_PERCENT);
    cusparseLtDenseDescriptorInit(&handle, &matB, num_B_rows, num_B_cols, ldb,
                                  alignment, type, order);
    cusparseLtDenseDescriptorInit(&handle, &matC, num_C_rows, num_C_cols, ldc,
                                  alignment, type, order);
    //--------------------------------------------------------------------------
    // matmul, algorithm selection, and plan initialization
    cusparseLtMatmulDescriptorInit(&handle, &matmul, opA, opB, &matA, &matB,
                                   &matC, &matC, compute_type);
    cusparseLtMatmulAlgSelectionInit(&handle, &alg_sel, &matmul,
                                     CUSPARSELT_MATMUL_ALG_DEFAULT);
    cusparseLtMatmulPlanInit(&handle, &plan, &matmul, &alg_sel);
    //--------------------------------------------------------------------------
    // Prune the A matrix (in-place) and check the correctness
    cusparseLtSpMMAPrune(&handle, &matmul, dA, dA, CUSPARSELT_PRUNE_SPMMA_TILE,
                         stream);
    int* d_valid;
    cudaMalloc((void**) &d_valid, sizeof(int));
    cusparseLtSpMMAPruneCheck(&handle, &matmul, dA, d_valid, stream);
    int is_valid;
    cudaMemcpyAsync(&is_valid, d_valid, sizeof(int), cudaMemcpyDeviceToHost,
                    stream);
    cudaStreamSynchronize(stream);
    if (is_valid != 0) {
        std::printf("!!!! The matrix has been pruned in a wrong way. "
                    "cusparseLtMatmul will not provide correct results\n");
        return EXIT_FAILURE;
    }
    //--------------------------------------------------------------------------
    // Matrix A compression
    size_t  compressed_size;
    __half* dA_compressed;
    cusparseLtSpMMACompressedSize(&handle, &plan, &compressed_size);
    cudaMalloc((void**) &dA_compressed, compressed_size);
    cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed, stream);
    //--------------------------------------------------------------------------
    // Allocate workspace
    size_t workspace_size;
    void*  d_workspace = nullptr;
    cusparseLtMatmulGetWorkspace(&handle, &plan, &workspace_size);
    cudaMalloc((void**) &d_workspace, workspace_size);
    //--------------------------------------------------------------------------
    // Perform the matrix multiplication
    int           num_streams = 0;
    cudaStream_t* streams     = nullptr;
    cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB, &beta, dC, dD,
                     d_workspace, streams, num_streams);
    //--------------------------------------------------------------------------
    // Destroy descriptors, plan and handle
    cusparseLtMatDescriptorDestroy(&matA);
    cusparseLtMatDescriptorDestroy(&matB);
    cusparseLtMatDescriptorDestroy(&matC);
    cusparseLtMatmulPlanDestroy(&plan);
    cusparseLtDestroy(&handle);
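For brevity, the example above does not check return values. Every cuSPARSELt
entry point returns a `cusparseStatus_t` and every CUDA runtime call returns a
`cudaError_t`, and both should be verified in real code. Below is a minimal
sketch of a status-checking macro (the name `CHECK_CUSPARSE` is illustrative,
not a library symbol):

.. code-block:: cpp

    #include <cusparseLt.h>
    #include <cstdio>
    #include <cstdlib>

    // Wrap each library call, e.g. CHECK_CUSPARSE( cusparseLtInit(&handle) ).
    // Intended for use inside a function that returns int (such as main).
    #define CHECK_CUSPARSE(func)                                               \
    {                                                                          \
        cusparseStatus_t status = (func);                                      \
        if (status != CUSPARSE_STATUS_SUCCESS) {                               \
            std::printf("cuSPARSELt API failed at line %d with error %d\n",    \
                        __LINE__, (int) status);                               \
            return EXIT_FAILURE;                                               \
        }                                                                      \
    }

Finally, once the computation has completed, release the temporary device
buffers (`d_valid`, `dA_compressed`, `d_workspace`) with `cudaFree()`.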