gemm.h¶
Functions for matrix multiplication.
Functions
-
void nvte_cublas_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, cudaStream_t stream)¶
Compute the matrix multiplication of two matrices, potentially fused with other operations.
Computes:
- D = AB, if both bias and pre_gelu_out are empty tensors
- D = AB + bias, if pre_gelu_out is empty and bias is not empty
- D = GELU(AB + bias), if both bias and pre_gelu_out are not empty tensors
- Parameters:
A – [in] The A matrix.
B – [in] The B matrix.
D – [inout] Output matrix.
bias – [in] Bias tensor.
pre_gelu_out – [inout] Output matrix before GELU activation.
transa – [in] Whether A matrix is transposed.
transb – [in] Whether B matrix is transposed.
grad – [in] Whether this operation is part of the gradient computation.
workspace – [out] Workspace tensor.
accumulate – [in] Whether to accumulate the result into the D matrix.
use_split_accumulator – [in] Whether to use split accumulator in the FP8 GEMM.
math_sm_count – [in] Number of GPU SMs to use (default=0: use cuBLAS heuristics).
stream – [in] CUDA stream used for the operation.
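A minimal usage sketch for the fused bias + GELU path. The wrapper name and the assumption that all NVTETensor handles were created beforehand with the Transformer Engine tensor API (with shapes and dtypes compatible with the requested GEMM) are illustrative, not part of this header:

    #include <cuda_runtime.h>

    #include <transformer_engine/transformer_engine.h>
    #include <transformer_engine/gemm.h>

    // Sketch: A, B, D, bias, pre_gelu_out and workspace are assumed to have been
    // created elsewhere with the Transformer Engine tensor API.
    void gemm_bias_gelu(const NVTETensor A, const NVTETensor B, NVTETensor D,
                        const NVTETensor bias, NVTETensor pre_gelu_out,
                        NVTETensor workspace, cudaStream_t stream) {
      // D = GELU(AB + bias); pre_gelu_out receives the pre-activation AB + bias.
      nvte_cublas_gemm(A, B, D, bias, pre_gelu_out,
                       /*transa=*/false, /*transb=*/false,
                       /*grad=*/false,                  // forward pass
                       workspace,
                       /*accumulate=*/false,            // overwrite D rather than add to it
                       /*use_split_accumulator=*/false,
                       /*math_sm_count=*/0,             // 0: let cuBLAS heuristics decide
                       stream);
    }

Passing empty tensors for bias and pre_gelu_out selects the plain D = AB path described above.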
-
void nvte_cublas_atomic_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, int m_split, int n_split, bool gemm_producer, const NVTETensor counter, cudaStream_t stream)¶
Compute the matrix multiplication of two matrices with chunking and atomic counters.
Computes:
- D = AB, if both bias and pre_gelu_out are empty tensors
- D = AB + bias, if pre_gelu_out is empty and bias is not empty
- D = GELU(AB + bias), if both bias and pre_gelu_out are not empty tensors
Warning
cuBLAS atomic GEMM uses a beta API and is not tested for all use cases.
- Parameters:
A – [in] The A matrix.
B – [in] The B matrix.
D – [inout] Output matrix.
bias – [in] Bias tensor.
pre_gelu_out – [inout] Output matrix before GELU activation.
transa – [in] Whether A matrix is transposed.
transb – [in] Whether B matrix is transposed.
grad – [in] Whether this operation is part of the gradient computation.
workspace – [out] Workspace tensor.
accumulate – [in] Whether to accumulate the result into the D matrix.
use_split_accumulator – [in] Whether to use split accumulator in the FP8 GEMM.
math_sm_count – [in] Number of GPU SMs to use (default=0: use cuBLAS heuristics).
m_split – [in] Number of chunks/splits along m-dimension for Atomic GEMM.
n_split – [in] Number of chunks/splits along n-dimension for Atomic GEMM.
gemm_producer – [in] Whether Atomic GEMM is the producer or consumer.
counter – [inout] counter[chunk_i]=0 indicates chunk_i has been produced.
stream – [in] CUDA stream used for the operation.
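A hedged producer-side sketch. The chunk counts, the wrapper name, and the assumption that all tensor handles (including the counter) were created and initialized elsewhere are illustrative; only the call itself comes from this header. Note the beta-API warning above.

    #include <cuda_runtime.h>

    #include <transformer_engine/transformer_engine.h>
    #include <transformer_engine/gemm.h>

    // Sketch: all NVTETensor handles are assumed to be created elsewhere. The
    // counter tensor is assumed to hold one entry per output chunk; per the
    // documentation, counter[chunk_i] == 0 signals that chunk_i has been produced,
    // so a consumer can poll it.
    void producer_atomic_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D,
                              const NVTETensor empty_bias, NVTETensor empty_gelu,
                              NVTETensor workspace, const NVTETensor counter,
                              cudaStream_t stream) {
      const int n_split = 4;  // produce the output in 4 chunks along the n-dimension
      nvte_cublas_atomic_gemm(A, B, D, empty_bias, empty_gelu,
                              /*transa=*/false, /*transb=*/false, /*grad=*/false,
                              workspace,
                              /*accumulate=*/false,
                              /*use_split_accumulator=*/false,
                              /*math_sm_count=*/0,
                              /*m_split=*/1, n_split,
                              /*gemm_producer=*/true,  // this GEMM produces the chunks
                              counter, stream);
    }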
-
void nvte_multi_stream_cublas_gemm(const NVTETensor *A, const NVTETensor *B, NVTETensor *D, const NVTETensor *bias, NVTETensor *pre_gelu_out, const int num_gemms, bool transa, bool transb, bool grad, NVTETensor *workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, cudaStream_t stream)¶
Compute multiple matrix multiplications, potentially fused with other operations, on multiple streams.
Computes, for each pair of input matrices:
- D = AB, if both bias and pre_gelu_out are empty tensors
- D = AB + bias, if pre_gelu_out is empty and bias is not empty
- D = GELU(AB + bias), if both bias and pre_gelu_out are not empty tensors
- Parameters:
A – [in] The list of A matrices.
B – [in] The list of B matrices.
D – [inout] List of output matrices.
bias – [in] List of bias tensors.
pre_gelu_out – [inout] List of output matrices before GELU activation.
num_gemms – [in] Number of GEMMs to compute.
transa – [in] Whether A matrix is transposed.
transb – [in] Whether B matrix is transposed.
grad – [in] Whether this operation is part of the gradient computation.
workspace – [out] List of workspace tensors.
accumulate – [in] Whether to accumulate the result into the D matrix.
use_split_accumulator – [in] Whether to use split accumulator in the FP8 GEMM.
math_sm_count – [in] Number of GPU SMs to use (default=0: use cuBLAS heuristics).
stream – [in] CUDA stream to wait on.
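A grouped-GEMM sketch. The wrapper name and the use of std::vector are illustrative; each vector is assumed to hold NVTETensor handles created elsewhere, with matching lengths for the per-GEMM lists:

    #include <vector>

    #include <cuda_runtime.h>

    #include <transformer_engine/transformer_engine.h>
    #include <transformer_engine/gemm.h>

    // Sketch: runs num_gemms independent GEMMs; per the stream parameter note above,
    // the library's internal streams wait on `stream` before launching.
    void grouped_gemm(const std::vector<NVTETensor> &A, const std::vector<NVTETensor> &B,
                      std::vector<NVTETensor> &D, const std::vector<NVTETensor> &bias,
                      std::vector<NVTETensor> &pre_gelu_out,
                      std::vector<NVTETensor> &workspace, cudaStream_t stream) {
      const int num_gemms = static_cast<int>(A.size());
      nvte_multi_stream_cublas_gemm(A.data(), B.data(), D.data(),
                                    bias.data(), pre_gelu_out.data(),
                                    num_gemms,
                                    /*transa=*/false, /*transb=*/false, /*grad=*/false,
                                    workspace.data(),
                                    /*accumulate=*/false,
                                    /*use_split_accumulator=*/false,
                                    /*math_sm_count=*/0,
                                    stream);
    }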