activation.h
Activation functions.
Enums
-
enum class NVTE_Activation_Type
Supported activation function types. Each enumerator corresponds to one of the forward functions listed under Functions below; an illustrative dispatch sketch follows the list of values.
Values:
-
enumerator GELU
-
enumerator GEGLU
-
enumerator SILU
-
enumerator SWIGLU
-
enumerator RELU
-
enumerator REGLU
-
enumerator QGELU
-
enumerator QGEGLU
-
enumerator SRELU
-
enumerator SREGLU
-
enumerator CLAMPED_SWIGLU
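The mapping from enumerator to entry point is one-to-one. The sketch below is illustrative only (such a dispatch helper is not part of the API) and assumes the usual transformer_engine/ include prefix; note that CLAMPED_SWIGLU is the only value whose entry point, nvte_clamped_swiglu, takes the extra limit and alpha arguments.

    #include <cuda_runtime.h>
    #include <transformer_engine/activation.h>

    // Illustrative only: forward one activation call per NVTE_Activation_Type.
    // CLAMPED_SWIGLU is the only value that needs extra parameters.
    inline void apply_activation(NVTE_Activation_Type act, const NVTETensor input,
                                 NVTETensor output, float limit, float alpha,
                                 cudaStream_t stream) {
      switch (act) {
        case NVTE_Activation_Type::GELU:    nvte_gelu(input, output, stream); break;
        case NVTE_Activation_Type::GEGLU:   nvte_geglu(input, output, stream); break;
        case NVTE_Activation_Type::SILU:    nvte_silu(input, output, stream); break;
        case NVTE_Activation_Type::SWIGLU:  nvte_swiglu(input, output, stream); break;
        case NVTE_Activation_Type::RELU:    nvte_relu(input, output, stream); break;
        case NVTE_Activation_Type::REGLU:   nvte_reglu(input, output, stream); break;
        case NVTE_Activation_Type::QGELU:   nvte_qgelu(input, output, stream); break;
        case NVTE_Activation_Type::QGEGLU:  nvte_qgeglu(input, output, stream); break;
        case NVTE_Activation_Type::SRELU:   nvte_srelu(input, output, stream); break;
        case NVTE_Activation_Type::SREGLU:  nvte_sreglu(input, output, stream); break;
        case NVTE_Activation_Type::CLAMPED_SWIGLU:
          nvte_clamped_swiglu(input, output, limit, alpha, stream); break;
      }
    }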
Functions
-
void nvte_gelu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the GeLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used. A call sketch follows the parameter list.
- Parameters:
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
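A minimal call sketch for the elementwise forward functions, assuming the input and output buffers already live on the device. How an NVTETensor handle is constructed (dtype, shape, scaling metadata) is defined in transformer_engine.h and varies across library versions, so wrap_device_buffer below is a hypothetical placeholder, not a real API.

    #include <cuda_runtime.h>
    #include <transformer_engine/activation.h>

    // Hypothetical helper: wraps an existing [rows, cols] device buffer in an
    // NVTETensor handle. The real constructor lives in transformer_engine.h.
    NVTETensor wrap_device_buffer(void *dptr, size_t rows, size_t cols);

    // Elementwise forward: output has the same shape as input.
    void gelu_forward(void *x_dev, void *y_dev, size_t N, size_t H,
                      cudaStream_t stream) {
      NVTETensor input  = wrap_device_buffer(x_dev, N, H);  // X: [N, H]
      NVTETensor output = wrap_device_buffer(y_dev, N, H);  // Y: [N, H]
      nvte_gelu(input, output, stream);  // Y = GELU(X), asynchronous on stream
    }

The other elementwise forward functions (nvte_silu, nvte_relu, nvte_qgelu, nvte_srelu) take the same arguments.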
-
void nvte_silu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the SiLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_relu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the ReLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_qgelu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the Quick GeLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_srelu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the Squared ReLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_dgelu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the GeLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used. A call sketch follows the parameter list.
- Parameters:
grad – [in] Incoming gradient.
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
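The elementwise backward functions add the incoming gradient as the first argument; all three tensors share the input's shape, and output receives the gradient with respect to the activation input. A sketch reusing the hypothetical wrap_device_buffer helper from the forward example above:

    // dX = dY * GELU'(X); grad, input and output are all [N, H].
    void gelu_backward(void *dy_dev, void *x_dev, void *dx_dev,
                       size_t N, size_t H, cudaStream_t stream) {
      NVTETensor grad   = wrap_device_buffer(dy_dev, N, H);  // dY: incoming gradient
      NVTETensor input  = wrap_device_buffer(x_dev,  N, H);  // X: saved forward input
      NVTETensor output = wrap_device_buffer(dx_dev, N, H);  // dX: outgoing gradient
      nvte_dgelu(grad, input, output, stream);
    }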
-
void nvte_dsilu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the SiLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient.
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_drelu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the ReLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient.
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_dqgelu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the Quick GeLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient.
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_dsrelu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the Squared ReLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient.
input – [in] Input tensor for activation.
output – [inout] Output tensor.
stream – [in] CUDA stream used for the operation.
-
void nvte_geglu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated GeLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used. A call sketch follows the parameter list.
- Parameters:
input – [in] Input tensor of shape [N, H * 2].
output – [inout] Output tensor of shape [N, H]. It computes Act(input[N, :H]) x input[N, H:]
stream – [in] CUDA stream used for the operation.
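For the gated activations the input packs two halves along the last dimension: the half that goes through the activation comes first, the linear half second, so the output is half as wide as the input. A sketch using the hypothetical wrap_device_buffer helper from above:

    // Gated GeLU: output = GELU(input[:, :H]) * input[:, H:].
    void geglu_forward(void *x_dev, void *y_dev, size_t N, size_t H,
                       cudaStream_t stream) {
      NVTETensor input  = wrap_device_buffer(x_dev, N, 2 * H);  // [N, 2H], two packed halves
      NVTETensor output = wrap_device_buffer(y_dev, N, H);      // [N, H]
      nvte_geglu(input, output, stream);
    }

The other gated forward functions (nvte_swiglu, nvte_reglu, nvte_qgeglu, nvte_sreglu) follow the same shape convention.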
-
void nvte_swiglu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated Swish activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor of shape [N, H * 2].
output – [inout] Output tensor of shape [N, H]. It computes Act(input[N, :H]) x input[N, H:]
stream – [in] CUDA stream used for the operation.
-
void nvte_clamped_swiglu(const NVTETensor input, NVTETensor output, float limit, float alpha, cudaStream_t stream)
Computes the clamped, gated Swish (SwiGLU) activation of the input, as used in GPT OSS.
See https://github.com/openai/gpt-oss/blob/a0a84273e9e0c14a233cb9befdfd159c2bcfa6cd/gpt_oss/torch/model.py#L250. This gated activation differs from the original SwiGLU in two ways:
1. Both the gate and the pre-activation are clipped based on the parameter limit.
2. The activation uses sigmoid(alpha * x) instead of the sigmoid(x) used in the Swish activation, inspired by the original GELU paper (https://arxiv.org/pdf/1606.08415).
If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used. A call sketch follows the parameter list.
- Parameters:
input – [in] Input tensor of shape [N, H * 2].
output – [inout] Output tensor of shape [N, H]. It computes Act(input[N, :H]) x input[N, H:]
limit – [in] Clipping limits for gate and pre-activation.
alpha – [in] Scaling factor for the sigmoid function used in the activation.
stream – [in] CUDA stream used for the operation.
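Calling the clamped variant only adds the two scalar parameters. The constants below are illustrative values in the spirit of the linked gpt-oss reference (1.702 is the usual sigmoid-based GELU approximation constant); the library itself does not define defaults here.

    // Clamped SwiGLU forward; limit clips both halves of the input and alpha
    // scales the sigmoid argument. The values below are illustrative only.
    void clamped_swiglu_forward(void *x_dev, void *y_dev, size_t N, size_t H,
                                cudaStream_t stream) {
      const float limit = 7.0f;
      const float alpha = 1.702f;
      NVTETensor input  = wrap_device_buffer(x_dev, N, 2 * H);  // [N, 2H]
      NVTETensor output = wrap_device_buffer(y_dev, N, H);      // [N, H]
      nvte_clamped_swiglu(input, output, limit, alpha, stream);
    }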
-
void nvte_reglu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated ReLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor of shape [N, H * 2].
output – [inout] Output tensor of shape [N, H]. It computes Act(input[N, :H]) x input[N, H:]
stream – [in] CUDA stream used for the operation.
-
void nvte_qgeglu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated Quick GeLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor of shape [N, H * 2].
output – [inout] Output tensor of shape [N, H]. It computes Act(input[N, :H]) x input[N, H:]
stream – [in] CUDA stream used for the operation.
-
void nvte_sreglu(const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated Squared ReLU activation of the input. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
input – [in] Input tensor of shape [N, H * 2].
output – [inout] Output tensor of shape [N, H]. It computes Act(input[N, :H]) x input[N, H:]
stream – [in] CUDA stream used for the operation.
-
void nvte_dgeglu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated GeLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used. A shape sketch follows the parameter list.
- Parameters:
grad – [in] Incoming gradient of shape [N, H].
input – [in] Forward input tensor of shape [N, H * 2].
output – [inout] Outgoing gradient of shape [N, H * 2].
stream – [in] CUDA stream used for the operation.
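Shape conventions for the gated backward functions, again with the hypothetical wrap_device_buffer helper: the incoming gradient matches the forward output ([N, H]), while the saved forward input and the outgoing gradient are both packed ([N, 2H]).

    // Gated GeLU backward: dX has the packed layout of the forward input.
    void geglu_backward(void *dy_dev, void *x_dev, void *dx_dev,
                        size_t N, size_t H, cudaStream_t stream) {
      NVTETensor grad   = wrap_device_buffer(dy_dev, N, H);      // dY: [N, H]
      NVTETensor input  = wrap_device_buffer(x_dev,  N, 2 * H);  // X:  [N, 2H]
      NVTETensor output = wrap_device_buffer(dx_dev, N, 2 * H);  // dX: [N, 2H]
      nvte_dgeglu(grad, input, output, stream);
    }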
-
void nvte_dswiglu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated Swish activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient of shape [N, H].
input – [in] Forward input tensor of shape [N, H * 2].
output – [inout] Outgoing gradient of shape [N, H * 2].
stream – [in] CUDA stream used for the operation.
-
void nvte_clamped_dswiglu(const NVTETensor grad, const NVTETensor input, NVTETensor output, float limit, float alpha, cudaStream_t stream)
Computes the gradient of the clamped, gated Swish (SwiGLU) activation used in GPT OSS.
See https://github.com/openai/gpt-oss/blob/a0a84273e9e0c14a233cb9befdfd159c2bcfa6cd/gpt_oss/torch/model.py#L250. This activation differs from the original SwiGLU in two ways:
1. Both the gate and the pre-activation are clipped based on the parameter limit.
2. The activation uses sigmoid(alpha * x) instead of the sigmoid(x) used in the Swish activation, inspired by the original GELU paper (https://arxiv.org/pdf/1606.08415).
If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient of shape [N, H].
input – [in] Forward input tensor of shape [N, H * 2].
output – [inout] Outgoing gradient of shape [N, H * 2].
limit – [in] Clipping limits for gate and pre-activation.
alpha – [in] Scaling factor for the sigmoid function used in the activation.
stream – [in] CUDA stream used for the operation.
-
void nvte_dreglu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated ReLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient of shape [N, H].
input – [in] Forward input tensor of shape [N, H * 2].
output – [inout] Outgoing gradient of shape [N, H * 2].
stream – [in] CUDA stream used for the operation.
-
void nvte_dqgeglu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated Quick GeLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient of shape [N, H].
input – [in] Forward input tensor of shape [N, H * 2].
output – [inout] Outgoing gradient of shape [N, H * 2].
stream – [in] CUDA stream used for the operation.
-
void nvte_dsreglu(const NVTETensor grad, const NVTETensor input, NVTETensor output, cudaStream_t stream)
Computes the gated Squared ReLU activation gradient. If the scaling mode of the output tensor is set to NVTE_MXFP8_1D_SCALING, MXFP8 block quantization with the specified block shape will be used.
- Parameters:
grad – [in] Incoming gradient of shape [N, H].
input – [in] Forward input tensor of shape [N, H * 2].
output – [inout] Outgoing gradient of shape [N, H * 2].
stream – [in] CUDA stream used for the operation.