gemm.h
Functions for matrix multiplication.
Functions
- void nvte_cublas_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, cudaStream_t stream)
- Compute matrix multiplication of two matrices, potentially fused with other operations. Computes:
- D = AB if both bias and pre_gelu_out are empty tensors
- D = AB + bias if pre_gelu_out is empty and bias is not empty
- D = GELU(AB + bias) if both bias and pre_gelu_out are not empty tensors
 - Parameters
- A – [in] The A matrix. 
- B – [in] The B matrix. 
- D – [inout] Output matrix. 
- bias – [in] Bias tensor. 
- pre_gelu_out – [inout] Output matrix before GELU activation. 
- transa – [in] Whether A matrix is transposed. 
- transb – [in] Whether B matrix is transposed. 
- grad – [in] Whether this operation is part of the gradient computation. 
- workspace – [out] Workspace tensor. 
- accumulate – [in] Whether to accumulate the result into the D matrix. 
- use_split_accumulator – [in] Whether to use split accumulator in the FP8 GEMM. 
- math_sm_count – [in] Number of GPU SMs to use (default=0: use cuBLAS heuristics) 
- stream – [in] CUDA stream used for the operation. 
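The three fused cases listed above can be sketched as a small CPU reference. This is only an illustration of the documented semantics, not the library's implementation: `Matrix`, `gelu`, and `reference_gemm` are hypothetical names, row-major storage and a per-column bias broadcast are assumptions, the erf-based GELU is an assumed variant, and `transa`, `transb`, `accumulate`, and the FP8 options are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical CPU-side stand-in for a tensor; not a Transformer Engine type.
// An empty `data` vector plays the role of an "empty tensor".
struct Matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;  // row-major, assumed for illustration only
  bool empty() const { return data.empty(); }
};

// Exact (erf-based) GELU. Which GELU variant the library applies is an assumption.
inline float gelu(float x) {
  return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// A is m x k, B is k x n, D becomes m x n.
// bias is empty or holds n values (assumed: one per output column).
// pre_gelu_out is empty or pre-sized to m x n; when used it receives AB + bias.
void reference_gemm(const Matrix& A, const Matrix& B, Matrix& D,
                    const Matrix& bias, Matrix& pre_gelu_out) {
  if (A.cols != B.rows) throw std::invalid_argument("inner dimensions differ");
  const std::size_t m = A.rows, k = A.cols, n = B.cols;
  if (!bias.empty() && bias.data.size() != n)
    throw std::invalid_argument("bias must have one entry per output column");
  // The documentation lists no case with an empty bias and a non-empty
  // pre_gelu_out, so this sketch rejects it rather than guess.
  if (bias.empty() && !pre_gelu_out.empty())
    throw std::invalid_argument("combination not covered by the documented cases");
  const bool apply_gelu = !bias.empty() && !pre_gelu_out.empty();
  if (apply_gelu && pre_gelu_out.data.size() != m * n)
    throw std::invalid_argument("pre_gelu_out must be m x n");

  D.rows = m;
  D.cols = n;
  D.data.assign(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) acc += A.data[i * k + p] * B.data[p * n + j];
      if (!bias.empty()) acc += bias.data[j];   // D = AB + bias
      if (apply_gelu) {
        pre_gelu_out.data[i * n + j] = acc;     // keep the value before activation
        acc = gelu(acc);                        // D = GELU(AB + bias)
      }
      D.data[i * n + j] = acc;
    }
  }
}
```

Passing an empty `bias` and an empty `pre_gelu_out` gives the plain product; supplying only `bias` adds it; supplying both additionally applies GELU and stores the pre-activation values.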
 
 
- void nvte_cublas_atomic_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, int m_split, int n_split, bool gemm_producer, const NVTETensor counter, cudaStream_t stream)
- Compute matrix multiplication of two matrices with chunking and atomic counters. Computes:
- D = AB if both bias and pre_gelu_out are empty tensors
- D = AB + bias if pre_gelu_out is empty and bias is not empty
- D = GELU(AB + bias) if both bias and pre_gelu_out are not empty tensors
 - Warning: cuBLAS atomic GEMM uses a beta API and is not tested for all use cases.
 - Parameters
- A – [in] The A matrix. 
- B – [in] The B matrix. 
- D – [inout] Output matrix. 
- bias – [in] Bias tensor. 
- pre_gelu_out – [inout] Output matrix before GELU activation. 
- transa – [in] Whether A matrix is transposed. 
- transb – [in] Whether B matrix is transposed. 
- grad – [in] Whether this operation is part of the gradient computation. 
- workspace – [out] Workspace tensor. 
- accumulate – [in] Whether to accumulate the result into the D matrix. 
- use_split_accumulator – [in] Whether to use split accumulator in the FP8 GEMM. 
- math_sm_count – [in] Number of GPU SMs to use (default=0: use cuBLAS heuristics) 
- m_split – [in] Number of chunks/splits along m-dimension for Atomic GEMM. 
- n_split – [in] Number of chunks/splits along n-dimension for Atomic GEMM. 
- gemm_producer – [in] Whether Atomic GEMM is the producer or consumer. 
- counter – [inout] counter[chunk_i]=0 indicates chunk_i has been produced. 
- stream – [in] CUDA stream used for the operation.
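The chunking and counter convention described by `m_split`, `n_split`, and `counter` can be illustrated with a single-threaded host model. This is a sketch under stated assumptions, not how the library or cuBLAS implements it: `ChunkRange`, `split_extent`, and `ChunkedBuffer` are hypothetical names, the extent is assumed to divide evenly by the split count, and counters are assumed to start at a non-zero value (here 1) until the producer writes 0. A real producer and consumer run concurrently on the GPU and need atomic operations with proper memory ordering, which this model omits.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Half-open index range [begin, end) covered by one chunk (hypothetical helper type).
struct ChunkRange {
  std::size_t begin;
  std::size_t end;
};

// Split one dimension into `splits` equal chunks.
// Assumption: the extent is a positive multiple of the split count.
std::vector<ChunkRange> split_extent(std::size_t extent, std::size_t splits) {
  if (splits == 0 || extent % splits != 0)
    throw std::invalid_argument("extent must be a positive multiple of splits");
  const std::size_t chunk = extent / splits;
  std::vector<ChunkRange> out;
  for (std::size_t c = 0; c < splits; ++c) out.push_back({c * chunk, (c + 1) * chunk});
  return out;
}

// Single-threaded model of the handshake: counter[chunk_i] == 0 means
// chunk_i has been produced and may be read by the consumer.
struct ChunkedBuffer {
  std::vector<ChunkRange> chunks;
  std::vector<int> counter;   // assumed to start at 1 ("not yet produced")
  std::vector<float> data;

  ChunkedBuffer(std::size_t extent, std::size_t splits)
      : chunks(split_extent(extent, splits)), counter(splits, 1), data(extent, 0.0f) {}

  // Producer side: write the chunk, then mark it as produced.
  void produce(std::size_t chunk_i, float value) {
    for (std::size_t i = chunks[chunk_i].begin; i < chunks[chunk_i].end; ++i) data[i] = value;
    counter[chunk_i] = 0;
  }

  // Consumer side: read the chunk only if its counter says it is ready.
  bool try_consume(std::size_t chunk_i, float& sum) const {
    if (counter[chunk_i] != 0) return false;
    sum = 0.0f;
    for (std::size_t i = chunks[chunk_i].begin; i < chunks[chunk_i].end; ++i) sum += data[i];
    return true;
  }
};
```

In this model a consumer that polls a chunk before it is produced simply gets `false` back; once `produce` has set the counter to 0, the same call succeeds and reads that chunk's data.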