gemm.h

Functions for matrix multiplication.

Functions

void nvte_cublas_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, cudaStream_t stream)

Compute matrix multiplication of 2 matrices, potentially fused with other operations.

Computes:

  • D = AB if both bias and pre_gelu_out are empty tensors

  • D = AB + bias if pre_gelu_out is empty and bias is not empty

  • D = GELU(AB + bias) if both bias and pre_gelu_out are non-empty tensors

Parameters:
  • A[in] The A matrix.

  • B[in] The B matrix.

  • D[inout] Output matrix.

  • bias[in] Bias tensor.

  • pre_gelu_out[inout] Output matrix before GELU activation.

  • transa[in] Whether A matrix is transposed.

  • transb[in] Whether B matrix is transposed.

  • grad[in] Whether this operation is part of the gradient computation.

  • workspace[out] Workspace tensor.

  • accumulate[in] Whether to accumulate the result into the D matrix.

  • use_split_accumulator[in] Whether to use split accumulator in the FP8 GEMM.

  • math_sm_count[in] Number of GPU SMs to use (default=0: use cuBLAS heuristics).

  • stream[in] CUDA stream used for the operation.
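
The sketch below shows a minimal host-side call for the plain D = AB case. Only the nvte_cublas_gemm invocation follows the signature documented above; make_empty_tensor is a hypothetical placeholder for whatever tensor-construction API your Transformer Engine version provides, and the A, B, D, and workspace handles are assumed to have been created elsewhere.

  #include <cuda_runtime.h>
  #include <transformer_engine/gemm.h>

  // Hypothetical helper standing in for the Transformer Engine tensor API;
  // it returns a tensor handle with no data attached (an "empty tensor").
  NVTETensor make_empty_tensor();

  void simple_gemm(NVTETensor A, NVTETensor B, NVTETensor D,
                   NVTETensor workspace, cudaStream_t stream) {
    // Both bias and pre_gelu_out are empty, so this computes D = AB
    // with no bias addition and no GELU fusion.
    NVTETensor bias = make_empty_tensor();
    NVTETensor pre_gelu_out = make_empty_tensor();

    nvte_cublas_gemm(A, B, D, bias, pre_gelu_out,
                     /*transa=*/false, /*transb=*/false, /*grad=*/false,
                     workspace,
                     /*accumulate=*/false, /*use_split_accumulator=*/false,
                     /*math_sm_count=*/0,  // 0: defer to cuBLAS heuristics
                     stream);
  }

Passing a non-empty bias selects the D = AB + bias variant, and additionally passing a non-empty pre_gelu_out selects D = GELU(AB + bias), as listed above.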

void nvte_cublas_atomic_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias, NVTETensor pre_gelu_out, bool transa, bool transb, bool grad, NVTETensor workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, int m_split, int n_split, bool gemm_producer, const NVTETensor counter, cudaStream_t stream)

Compute matrix multiplication of 2 matrices with chunking and atomic counters.

Computes:

  • D = AB if both bias and pre_gelu_out are empty tensors

  • D = AB + bias if pre_gelu_out is empty and bias is not empty

  • D = GELU(AB + bias) if both bias and pre_gelu_out are non-empty tensors

Warning

cuBLAS atomic GEMM uses a beta API and is not tested for all use cases.

Parameters:
  • A[in] The A matrix.

  • B[in] The B matrix.

  • D[inout] Output matrix.

  • bias[in] Bias tensor.

  • pre_gelu_out[inout] Output matrix before GELU activation.

  • transa[in] Whether A matrix is transposed.

  • transb[in] Whether B matrix is transposed.

  • grad[in] Whether this operation is part of the gradient computation.

  • workspace[out] Workspace tensor.

  • accumulate[in] Whether to accumulate the result into the D matrix.

  • use_split_accumulator[in] Whether to use split accumulator in the FP8 GEMM.

  • math_sm_count[in] Number of GPU SMs to use (default=0: use cuBLAS heuristics).

  • m_split[in] Number of chunks/splits along m-dimension for Atomic GEMM.

  • n_split[in] Number of chunks/splits along n-dimension for Atomic GEMM.

  • gemm_producer[in] Whether this atomic GEMM acts as the producer (true) or the consumer (false).

  • counter[inout] Counter tensor; counter[chunk_i] = 0 indicates that chunk_i has been produced.

  • stream[in] CUDA stream used for the operation.
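
As a hedged sketch only (this is a beta API), the call below shows how the chunking arguments compose with the base signature for a producer-side GEMM; the split values are illustrative assumptions, and make_empty_tensor is the same hypothetical helper as in the previous example.

  // Producer-side atomic GEMM split into four chunks along the n-dimension
  // (illustrative values only).
  void atomic_gemm_producer(NVTETensor A, NVTETensor B, NVTETensor D,
                            NVTETensor workspace, NVTETensor counter,
                            cudaStream_t stream) {
    NVTETensor bias = make_empty_tensor();          // hypothetical helper
    NVTETensor pre_gelu_out = make_empty_tensor();  // hypothetical helper

    nvte_cublas_atomic_gemm(A, B, D, bias, pre_gelu_out,
                            /*transa=*/false, /*transb=*/false, /*grad=*/false,
                            workspace,
                            /*accumulate=*/false, /*use_split_accumulator=*/false,
                            /*math_sm_count=*/0,
                            /*m_split=*/1, /*n_split=*/4,  // chunk along n only
                            /*gemm_producer=*/true,
                            counter,  // counter[chunk_i] == 0 once chunk_i is produced
                            stream);
  }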

void nvte_multi_stream_cublas_gemm(const NVTETensor *A, const NVTETensor *B, NVTETensor *D, const NVTETensor *bias, NVTETensor *pre_gelu_out, const int num_gemms, bool transa, bool transb, bool grad, NVTETensor *workspace, bool accumulate, bool use_split_accumulator, int math_sm_count, cudaStream_t stream)

Compute matrix multiplication for multiple pairs of matrices, potentially fused with other operations, across multiple CUDA streams.

Computes:

  • D = AB if both bias and pre_gelu_out are empty tensors

  • D = AB + bias if pre_gelu_out is empty and bias is not empty

  • D = GELU(AB + bias) if both bias and pre_gelu_out are non-empty tensors

Parameters:
  • A[in] The list of A matrices.

  • B[in] The list of B matrices.

  • D[inout] List of output matrices.

  • bias[in] List of bias tensors.

  • pre_gelu_out[inout] List of output matrices before GELU activation.

  • num_gemms[in] Number of GEMMs to compute.

  • transa[in] Whether A matrix is transposed.

  • transb[in] Whether B matrix is transposed.

  • grad[in] Whether this operation is part of the gradient computation.

  • workspace[out] List of workspace tensors.

  • accumulate[in] Whether to accumulate the result into the D matrix.

  • use_split_accumulator[in] Whether to use split accumulator in the FP8 GEMM.

  • math_sm_count[in] Number of GPU SMs to use (default=0: use cuBLAS heuristics).

  • stream[in] CUDA stream to wait on.
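
The sketch below groups several independent products into one call: each argument becomes an array of num_gemms handles, and the library spreads the GEMMs over its internal streams. As before, make_empty_tensor is a hypothetical helper, and the A, B, D, and workspace handles are assumed to be created elsewhere.

  #include <vector>

  void grouped_gemm(const std::vector<NVTETensor> &A,
                    const std::vector<NVTETensor> &B,
                    std::vector<NVTETensor> &D,
                    std::vector<NVTETensor> &workspace,
                    cudaStream_t stream) {
    const int num_gemms = static_cast<int>(A.size());

    // One empty bias / pre_gelu_out handle per GEMM: every product is plain D = AB.
    std::vector<NVTETensor> bias(num_gemms), pre_gelu_out(num_gemms);
    for (int i = 0; i < num_gemms; ++i) {
      bias[i] = make_empty_tensor();          // hypothetical helper
      pre_gelu_out[i] = make_empty_tensor();  // hypothetical helper
    }

    nvte_multi_stream_cublas_gemm(A.data(), B.data(), D.data(),
                                  bias.data(), pre_gelu_out.data(), num_gemms,
                                  /*transa=*/false, /*transb=*/false, /*grad=*/false,
                                  workspace.data(),
                                  /*accumulate=*/false, /*use_split_accumulator=*/false,
                                  /*math_sm_count=*/0,
                                  stream);
  }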

namespace transformer_engine

Namespace containing the C++ API of Transformer Engine.

Variables

constexpr int num_streams = 4