cast.h

Functions to cast to/from FP8/MXFP8.

Functions

void nvte_quantize(const NVTETensor input, NVTETensor output, cudaStream_t stream)

Casts the input tensor to FP8/MXFP8. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

Parameters:
  • input[in] Input tensor to be cast.

  • output[inout] Output FP8/MXFP8 tensor.

  • stream[in] CUDA stream used for the operation.
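
The per-tensor FP8 scaling that this cast performs can be sketched in pure Python. This is a reference for the math only, not the C API; the E4M3 maximum of 448 and simple clamping are assumptions, and the actual kernel also handles rounding and the MXFP8 block path:

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)

def quantize_fp8(values, amax):
    """Reference semantics of a per-tensor FP8 cast: scale so that amax
    maps to the FP8 maximum, then clamp into the representable range."""
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v * scale)) for v in values]
    return quantized, 1.0 / scale  # data plus scale_inv for later dequantization

q, scale_inv = quantize_fp8([0.5, -2.0, 2.0], amax=2.0)
```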

void nvte_quantize_noop(const NVTETensor input, NVTETensor output, NVTETensor noop, cudaStream_t stream)

Casts the input tensor to FP8/MXFP8, with the option to exit the kernel immediately based on the value of the ‘noop’ tensor. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

Parameters:
  • input[in] Input tensor to be cast.

  • output[inout] Output FP8/MXFP8 tensor.

  • noop[out] Noop tensor.

  • stream[in] CUDA stream used for the operation.
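
The early-exit behavior can be illustrated with a small Python sketch. This is illustrative only, assuming (as the description implies) that a nonzero noop value makes the kernel return without writing the output:

```python
def quantize_noop(values, amax, noop):
    """Illustrative early-exit semantics: when the noop flag is nonzero,
    skip the cast entirely and leave the output unwritten (None here)."""
    if noop:
        return None
    scale = 448.0 / amax  # FP8 E4M3 maximum, assumed
    return [max(-448.0, min(448.0, v * scale)) for v in values]

quantize_noop([1.0], amax=1.0, noop=1)  # skipped: returns None
```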

void nvte_quantize_dbias(const NVTETensor input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Casts the input tensor to FP8/MXFP8. Additionally, reduces the input along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(input)

  • dbias is equal to reduce(input, dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
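
The workspace-query convention above is a common two-call pattern: query the requirement, allocate, then call again. A Python sketch of the control flow (illustrative only; the hypothetical "required" dict stands in for the shape and dtype the real API writes into the workspace tensor):

```python
def quantize_dbias(n_rows, n_cols, workspace):
    """Illustrative two-call pattern: an empty workspace turns the call
    into a size query instead of performing the computation."""
    required = {"shape": (n_cols,), "dtype": "float32"}  # hypothetical requirement
    if not workspace:
        return required       # query call: report the required workspace
    return "computed"         # real call: perform the operation

ws = quantize_dbias(16, 8, {})      # first call fills in the requirement
result = quantize_dbias(16, 8, ws)  # second call does the work
```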

void nvte_quantize_dbias_dgelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the GeLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the GeLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
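
Here dact is the GeLU derivative evaluated at act_input, multiplied element-wise by the incoming gradient (input). A pure-Python reference of that math, assuming the exact erf-based GeLU (the library kernel may use a tanh approximation):

```python
import math

def dgelu(grad, x):
    """Backward of GeLU: grad * d/dx [x * Phi(x)], where Phi is the
    standard normal CDF and phi its density."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return grad * (cdf + x * pdf)
```

dbias then sums these per-element gradients along the reduction dimension, one value per bias element.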

void nvte_quantize_dbias_dsilu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the SiLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the SiLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
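
For SiLU (x * sigmoid(x)), the dact term is grad * sigmoid(x) * (1 + x * (1 - sigmoid(x))). A pure-Python reference of that derivative (math only, not the C API):

```python
import math

def dsilu(grad, x):
    """Backward of SiLU (x * sigmoid(x)):
    grad * sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return grad * s * (1.0 + x * (1.0 - s))
```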

void nvte_quantize_dbias_drelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the ReLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the ReLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
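
For ReLU, the dact term simply masks the incoming gradient. A one-line Python reference of that math:

```python
def drelu(grad, x):
    """Backward of ReLU: pass the gradient where x > 0, zero elsewhere."""
    return grad if x > 0 else 0.0
```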

void nvte_quantize_dbias_dqgelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the Quick GeLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the Quick GeLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
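
Quick GeLU is commonly defined as x * sigmoid(alpha * x) with alpha = 1.702; both the definition and the constant are assumptions here, not taken from this header. A pure-Python reference of the corresponding derivative:

```python
import math

def dqgelu(grad, x, alpha=1.702):
    """Backward of Quick GeLU (x * sigmoid(alpha * x)), assuming the
    common alpha = 1.702 constant."""
    s = 1.0 / (1.0 + math.exp(-alpha * x))
    return grad * s * (1.0 + alpha * x * (1.0 - s))
```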

void nvte_quantize_dbias_dsrelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the Squared ReLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the Squared ReLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
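
For Squared ReLU (max(x, 0)^2), the derivative is 2 * max(x, 0). A one-line Python reference of the dact term:

```python
def dsrelu(grad, x):
    """Backward of Squared ReLU (max(x, 0)**2): grad * 2 * max(x, 0)."""
    return grad * 2.0 * max(x, 0.0)
```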

void nvte_dequantize(const NVTETensor input, NVTETensor output, cudaStream_t stream)

Casts the input tensor from reduced to higher precision. If the scaling mode of the input tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block dequantization with the specified block shape is used. For MXFP8 dequantization, the dequantized values are stored in the rowwise data of the output tensor, regardless of whether rowwise or columnwise scaling is used.

Parameters:
  • input[in] Input FP8/MXFP8 tensor to be cast.

  • output[inout] Output tensor.

  • stream[in] CUDA stream used for the operation.
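
Dequantization is the inverse of the scaled cast: each stored low-precision value is multiplied by the inverse scale recorded at quantization time. A pure-Python sketch of that math (reference only, not the C API):

```python
def dequantize(quantized, scale_inv):
    """Reference semantics of dequantization: multiply each stored FP8
    value by the inverse of the scale used when quantizing."""
    return [q * scale_inv for q in quantized]

x = dequantize([112.0, -448.0, 448.0], scale_inv=1.0 / 224.0)
```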