cast.h

Functions to cast to/from FP8/MXFP8.

Functions

void nvte_quantize(const NVTETensor input, NVTETensor output, cudaStream_t stream)

Casts the input tensor to FP8/MXFP8. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

Parameters:
  • input[in] Input tensor to be cast.

  • output[inout] Output FP8/MXFP8 tensor.

  • stream[in] CUDA stream used for the operation.
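
The per-tensor FP8 scaling that this cast performs can be sketched in pure Python. This is a reference for the math only, not the C API; the E4M3 maximum of 448 and simple clamping are assumptions, and the actual kernel also handles rounding and the MXFP8 block path:

```python
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)

def quantize_fp8(values, amax):
    """Reference semantics of a per-tensor FP8 cast: scale so that amax
    maps to the FP8 maximum, then clamp into the representable range."""
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v * scale)) for v in values]
    return quantized, 1.0 / scale  # data plus scale_inv for later dequantization

q, scale_inv = quantize_fp8([0.5, -2.0, 2.0], amax=2.0)
```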

void nvte_quantize_noop(const NVTETensor input, NVTETensor output, NVTETensor noop, cudaStream_t stream)

Casts the input tensor to FP8/MXFP8, with the option to exit the kernel immediately based on the value of the ‘noop’ tensor. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

Parameters:
  • input[in] Input tensor to be cast.

  • output[inout] Output FP8/MXFP8 tensor.

  • noop[out] Noop tensor.

  • stream[in] CUDA stream used for the operation.
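
The early-exit behavior can be illustrated with a small Python sketch. This is illustrative only, assuming (as the description implies) that a nonzero noop value makes the kernel return without writing the output:

```python
def quantize_noop(values, amax, noop):
    """Illustrative early-exit semantics: when the noop flag is nonzero,
    skip the cast entirely and leave the output unwritten (None here)."""
    if noop:
        return None
    scale = 448.0 / amax  # FP8 E4M3 maximum, assumed
    return [max(-448.0, min(448.0, v * scale)) for v in values]

quantize_noop([1.0], amax=1.0, noop=1)  # skipped: returns None
```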

void nvte_quantize_dbias(const NVTETensor input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Casts the input tensor to FP8/MXFP8. Additionally, reduces the input along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(input)

  • dbias is equal to reduce(input, dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
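
The workspace-query convention above is a common two-call pattern: query the requirement, allocate, then call again. A Python sketch of the control flow (illustrative only; the hypothetical "required" dict stands in for the shape and dtype the real API writes into the workspace tensor):

```python
def quantize_dbias(n_rows, n_cols, workspace):
    """Illustrative two-call pattern: an empty workspace turns the call
    into a size query instead of performing the computation."""
    required = {"shape": (n_cols,), "dtype": "float32"}  # hypothetical requirement
    if not workspace:
        return required       # query call: report the required workspace
    return "computed"         # real call: perform the operation

ws = quantize_dbias(16, 8, {})      # first call fills in the requirement
result = quantize_dbias(16, 8, ws)  # second call does the work
```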

void nvte_quantize_dbias_dgelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the GeLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the GeLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
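
Here dact is the GeLU derivative evaluated at act_input, multiplied element-wise by the incoming gradient (input). A pure-Python reference of that math, assuming the exact erf-based GeLU (the library kernel may use a tanh approximation):

```python
import math

def dgelu(grad, x):
    """Backward of GeLU: grad * d/dx [x * Phi(x)], where Phi is the
    standard normal CDF and phi its density."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return grad * (cdf + x * pdf)
```

dbias then sums these per-element gradients along the reduction dimension, one value per bias element.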

void nvte_quantize_dbias_dsilu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the SiLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the SiLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
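
For SiLU (x * sigmoid(x)), the dact term is grad * sigmoid(x) * (1 + x * (1 - sigmoid(x))). A pure-Python reference of that derivative (math only, not the C API):

```python
import math

def dsilu(grad, x):
    """Backward of SiLU (x * sigmoid(x)):
    grad * sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return grad * s * (1.0 + x * (1.0 - s))
```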

void nvte_quantize_dbias_drelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the ReLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the ReLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
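
For ReLU, the dact term simply masks the incoming gradient. A one-line Python reference of that math:

```python
def drelu(grad, x):
    """Backward of ReLU: pass the gradient where x > 0, zero elsewhere."""
    return grad if x > 0 else 0.0
```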

void nvte_quantize_dbias_dqgelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the Quick GeLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the Quick GeLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
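
Quick GeLU is commonly defined as x * sigmoid(alpha * x) with alpha = 1.702; both the definition and the constant are assumptions here, not taken from this header. A pure-Python reference of the corresponding derivative:

```python
import math

def dqgelu(grad, x, alpha=1.702):
    """Backward of Quick GeLU (x * sigmoid(alpha * x)), assuming the
    common alpha = 1.702 constant."""
    s = 1.0 / (1.0 + math.exp(-alpha * x))
    return grad * s * (1.0 + alpha * x * (1.0 - s))
```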

void nvte_quantize_dbias_dsrelu(const NVTETensor input, const NVTETensor act_input, NVTETensor output, NVTETensor dbias, NVTETensor workspace, cudaStream_t stream)

Computes the backward pass of the Squared ReLU operation on the input, then casts to FP8/MXFP8. Additionally, reduces the result of the Squared ReLU backward pass along columns. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape is used.

This function produces 2 results:

  • output is equal to cast(dact(input))

  • dbias is equal to reduce(dact(input), dim=1)

Calling this function with an empty workspace tensor does not perform the operation; instead, it sets the shape and type of the workspace tensor to the required values.

Parameters:
  • input[in] Input tensor to be cast.

  • act_input[in] Activation input tensor.

  • output[inout] Output FP8/MXFP8 tensor.

  • dbias[out] Result of the reduction of the input along columns.

  • workspace[out] Workspace tensor.

  • stream[in] CUDA stream used for the operation.
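
For Squared ReLU (max(x, 0)^2), the derivative is 2 * max(x, 0). A one-line Python reference of the dact term:

```python
def dsrelu(grad, x):
    """Backward of Squared ReLU (max(x, 0)**2): grad * 2 * max(x, 0)."""
    return grad * 2.0 * max(x, 0.0)
```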

void nvte_dequantize(const NVTETensor input, NVTETensor output, cudaStream_t stream)

Casts the input tensor from reduced to higher precision. If the scaling mode of the input tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block dequantization with the specified block shape is used. For MXFP8 dequantization, the dequantized values are stored in the rowwise data of the output tensor, regardless of whether rowwise or columnwise scaling is used.

Parameters:
  • input[in] Input FP8/MXFP8 tensor to be cast.

  • output[inout] Output tensor.

  • stream[in] CUDA stream used for the operation.
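
Dequantization is the inverse of the scaled cast: each stored low-precision value is multiplied by the inverse scale recorded at quantization time. A pure-Python sketch of that math (reference only, not the C API):

```python
def dequantize(quantized, scale_inv):
    """Reference semantics of dequantization: multiply each stored FP8
    value by the inverse of the scale used when quantizing."""
    return [q * scale_inv for q in quantized]

x = dequantize([112.0, -448.0, 448.0], scale_inv=1.0 / 224.0)
```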