core.fusions.fused_bias_geglu#
Module Contents#
Classes#
BiasGeGLUFunction | Custom autograd function for GEGLU activation with bias support. |
GeGLUFunction | Custom autograd function for GEGLU activation without bias. |
WeightedQuickGeGLUFunction | Autograd function for token-wise weighted Quick-GEGLU (no bias). |
WeightedBiasQuickGeGLUFunction | Autograd function for token-wise weighted Quick-GEGLU with bias support. |
Functions#
geglu | Performs GEGLU (GELU-Gated Linear Unit) activation. |
bias_geglu | Performs GEGLU activation with bias addition. |
geglu_back | Computes the gradient for the GEGLU activation. |
bias_geglu_back | Computes the gradient for the biased GEGLU activation. |
bias_geglu_impl | Implementation of biased GEGLU that handles different input shapes. |
quick_gelu | Sigmoid approximation of GELU. |
quick_geglu | Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset). |
weighted_quick_geglu | Token-wise-weighted Quick-GEGLU activation. |
quick_geglu_back | Backward helper for Quick-GEGLU. |
weighted_quick_geglu_back | Backward helper for weighted Quick-GEGLU. Returns gradients w.r.t. input y and weights. |
weighted_bias_quick_geglu | Token-wise weighted Quick-GEGLU activation with bias. |
weighted_bias_quick_geglu_back | Backward helper for weighted Quick-GEGLU with bias. |
weighted_bias_quick_geglu_impl | Token-wise-weighted bias quick_geglu fusion. input: [num_selected_experts * seq_len, hidden_size * 2]; bias: None; weights: [num_selected_experts * seq_len, 1]; fp8_input_store: bool; linear_offset: float; output: [num_selected_experts * seq_len, hidden_size]. |
API#
- core.fusions.fused_bias_geglu.geglu(y)#
Performs GEGLU (GELU-Gated Linear Unit) activation.
- Parameters:
y (torch.Tensor) – Input tensor to be split into two halves along the last dimension.
- Returns:
Result of GEGLU activation: GELU(y1) * y2, where y1, y2 are the split halves.
- Return type:
torch.Tensor
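As a reading aid, the split-and-gate computation described above can be sketched in plain PyTorch. This is an unfused reference under stated assumptions, not the module's implementation: the name geglu_reference is hypothetical, and it uses the exact GELU from torch.nn.functional, whereas a fused kernel may use an approximation.

```python
import torch
import torch.nn.functional as F


def geglu_reference(y: torch.Tensor) -> torch.Tensor:
    """Unfused GEGLU sketch (hypothetical helper, not part of the module)."""
    # Split the last dimension into two equal halves.
    y1, y2 = torch.chunk(y, 2, dim=-1)
    # Gate the second half with GELU of the first half.
    return F.gelu(y1) * y2


y = torch.randn(4, 8)
out = geglu_reference(y)
print(out.shape)  # torch.Size([4, 4])
```

The output has half the feature width of the input, because one half of y only acts as the gate.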
- core.fusions.fused_bias_geglu.bias_geglu(bias, y)#
Performs GEGLU activation with bias addition.
- Parameters:
bias (torch.Tensor) – Bias tensor to be added to the input.
y (torch.Tensor) – Input tensor to be split and gated.
- Returns:
Result of bias addition followed by GEGLU activation.
- Return type:
torch.Tensor
- core.fusions.fused_bias_geglu.geglu_back(g, y)#
Computes the gradient for the GEGLU activation.
- Parameters:
g (torch.Tensor) – Gradient tensor from the subsequent layer.
y (torch.Tensor) – Input tensor that was used in the forward pass.
- Returns:
Gradient with respect to the input tensor.
- Return type:
torch.Tensor
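The gradient with respect to the input has two parts, one per half of y. The sketch below derives both by hand for the exact GELU and checks them against autograd. The helper name geglu_back_reference is hypothetical, and the exact-GELU derivative is an assumption; the module's own helper may differentiate an approximate GELU instead.

```python
import math

import torch
import torch.nn.functional as F


def geglu_back_reference(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Hand-written GEGLU gradient for exact GELU (hypothetical helper)."""
    y1, y2 = torch.chunk(y, 2, dim=-1)
    cdf = 0.5 * (1.0 + torch.erf(y1 / math.sqrt(2.0)))          # Phi(y1)
    pdf = torch.exp(-0.5 * y1 * y1) / math.sqrt(2.0 * math.pi)  # phi(y1)
    dgelu = cdf + y1 * pdf                                       # d GELU / d y1
    grad_y1 = g * y2 * dgelu
    grad_y2 = g * (y1 * cdf)                                     # g * GELU(y1)
    return torch.cat((grad_y1, grad_y2), dim=-1)


y = torch.randn(3, 6, dtype=torch.float64, requires_grad=True)
g = torch.randn(3, 3, dtype=torch.float64)
a, b = torch.chunk(y, 2, dim=-1)
(F.gelu(a) * b).backward(g)

manual = geglu_back_reference(g, y.detach())
print(torch.allclose(manual, y.grad))
```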
- core.fusions.fused_bias_geglu.bias_geglu_back(g, y, bias)#
Computes the gradient for the biased GEGLU activation.
- Parameters:
g (torch.Tensor) – Gradient tensor from the subsequent layer.
y (torch.Tensor) – Input tensor that was used in the forward pass.
bias (torch.Tensor) – Bias tensor that was added in the forward pass.
- Returns:
Gradient with respect to the input tensor after bias addition.
- Return type:
torch.Tensor
- class core.fusions.fused_bias_geglu.BiasGeGLUFunction#
Bases: torch.autograd.Function
Custom autograd function for GEGLU activation with bias support.
- static forward(ctx, input, bias)#
Forward pass of biased GEGLU activation.
- Parameters:
ctx – Autograd context object for saving tensors for backward pass.
input (torch.Tensor) – Input tensor to apply GEGLU to.
bias (torch.Tensor) – Bias tensor to be added to input before GEGLU.
- Returns:
Result of applying bias addition followed by GEGLU activation.
- Return type:
torch.Tensor
- static backward(ctx, grad_output)#
Backward pass of biased GEGLU activation.
- Parameters:
ctx – Autograd context object containing saved tensors from forward pass.
grad_output (torch.Tensor) – Gradient of the loss with respect to the output.
- Returns:
Tuple containing gradients with respect to the input and bias tensors.
- Return type:
tuple
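The forward/backward contract above can be illustrated with a small torch.autograd.Function. This is a sketch only: BiasGeGLUSketch is a hypothetical name, the exact GELU is assumed, and the bias gradient is reduced explicitly with sum_to_size so that a broadcast bias receives a gradient of its own shape.

```python
import math

import torch
import torch.nn.functional as F


class BiasGeGLUSketch(torch.autograd.Function):
    """Illustrative biased GEGLU (hypothetical, not the module's class)."""

    @staticmethod
    def forward(ctx, input, bias):
        ctx.save_for_backward(input, bias)
        y1, y2 = torch.chunk(input + bias, 2, dim=-1)
        return F.gelu(y1) * y2

    @staticmethod
    def backward(ctx, grad_output):
        input, bias = ctx.saved_tensors
        y1, y2 = torch.chunk(input + bias, 2, dim=-1)
        cdf = 0.5 * (1.0 + torch.erf(y1 / math.sqrt(2.0)))
        pdf = torch.exp(-0.5 * y1 * y1) / math.sqrt(2.0 * math.pi)
        grad_in = torch.cat(
            (grad_output * y2 * (cdf + y1 * pdf), grad_output * y1 * cdf), dim=-1
        )
        # One gradient per forward argument: (input, bias).
        return grad_in, grad_in.sum_to_size(bias.shape)


x = torch.randn(5, 8, dtype=torch.float64, requires_grad=True)
b = torch.randn(8, dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(BiasGeGLUSketch.apply, (x, b))
print(ok)
```

gradcheck compares the hand-written backward against finite differences, which is a convenient way to validate any custom backward of this kind.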
- class core.fusions.fused_bias_geglu.GeGLUFunction#
Bases: torch.autograd.Function
Custom autograd function for GEGLU activation without bias.
- static forward(ctx, input)#
Forward pass of GEGLU activation.
- Parameters:
ctx – Autograd context object for saving tensors for backward pass.
input (torch.Tensor) – Input tensor to apply GEGLU to.
- Returns:
Result of applying GEGLU activation.
- Return type:
torch.Tensor
- static backward(ctx, grad_output)#
Backward pass of GEGLU activation.
- Parameters:
ctx – Autograd context object containing saved tensors from forward pass.
grad_output (torch.Tensor) – Gradient of the loss with respect to the output.
- Returns:
Gradient with respect to the input tensor.
- Return type:
torch.Tensor
- core.fusions.fused_bias_geglu.bias_geglu_impl(input, bias)#
Implementation of biased GEGLU that handles different input shapes.
This function reshapes the input if necessary, applies the GEGLU activation (with or without bias), and restores the original shape.
- Parameters:
input (torch.Tensor) – Input tensor to apply GEGLU activation.
bias (torch.Tensor, optional) – Bias tensor to be added to input. If None, uses the bias-free GEGLU variant.
- Returns:
Result of biased GEGLU activation.
- Return type:
torch.Tensor
- Raises:
AssertionError – If input tensor does not have 2 or 3 dimensions.
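The reshape-apply-restore flow described above might look like the following. The function geglu_any_shape is a hypothetical stand-in that uses an unfused exact-GELU GEGLU; only the shape handling and the 2D/3D assertion mirror the documented behaviour.

```python
import torch
import torch.nn.functional as F


def geglu_any_shape(input: torch.Tensor, bias=None) -> torch.Tensor:
    """Shape-handling sketch (hypothetical helper, unfused)."""
    assert input.dim() in (2, 3), "expected a 2D or 3D input"
    ori_shape = input.shape
    # Collapse any leading dimensions so the activation sees [tokens, 2H].
    flat = input.reshape(-1, ori_shape[-1])
    if bias is not None:
        flat = flat + bias
    y1, y2 = torch.chunk(flat, 2, dim=-1)
    out = F.gelu(y1) * y2
    # Restore the leading dimensions for 3D inputs.
    if len(ori_shape) == 3:
        out = out.reshape(ori_shape[0], ori_shape[1], -1)
    return out


x = torch.randn(2, 3, 8)
print(geglu_any_shape(x).shape)                  # torch.Size([2, 3, 4])
print(geglu_any_shape(x, torch.zeros(8)).shape)  # torch.Size([2, 3, 4])
```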
- core.fusions.fused_bias_geglu.quick_gelu(y: torch.Tensor) → torch.Tensor#
Sigmoid approximation of GELU.
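For orientation, the sigmoid approximation is commonly written as y * sigmoid(1.702 * y). The sketch below assumes that constant (it is not stated on this page) and measures how far the approximation drifts from the exact GELU on a small grid; quick_gelu_reference is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def quick_gelu_reference(y: torch.Tensor) -> torch.Tensor:
    """Sigmoid approximation of GELU; 1.702 is the commonly used constant (assumed)."""
    return y * torch.sigmoid(1.702 * y)


x = torch.linspace(-6.0, 6.0, steps=241)
max_gap = (quick_gelu_reference(x) - F.gelu(x)).abs().max().item()
print(f"max |quick_gelu - gelu| on [-6, 6]: {max_gap:.4f}")
```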
- core.fusions.fused_bias_geglu.quick_geglu(y: torch.Tensor, linear_offset: float = 0.0)#
Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset).
- Parameters:
y – Input tensor split into two halves on the last dimension.
linear_offset – Optional linear offset added to the second half before gating.
- Returns:
Tensor after applying the GEGLU activation.
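The effect of linear_offset is easiest to see on a tiny input. The following sketch assumes the 1.702 sigmoid constant for Quick-GELU and uses a hypothetical helper name; it only illustrates the formula quick_gelu(y1) * (y2 + offset) given above.

```python
import torch


def quick_geglu_reference(y: torch.Tensor, linear_offset: float = 0.0) -> torch.Tensor:
    """Unfused Quick-GEGLU sketch (hypothetical helper, 1.702 constant assumed)."""
    y1, y2 = torch.chunk(y, 2, dim=-1)
    return (y1 * torch.sigmoid(1.702 * y1)) * (y2 + linear_offset)


# y1 = [0, 2] is the gated half, y2 = [3, -1] is the linear half.
y = torch.tensor([[0.0, 2.0, 3.0, -1.0]])
no_offset = quick_geglu_reference(y)
with_offset = quick_geglu_reference(y, linear_offset=1.0)
print(no_offset)
print(with_offset)  # second column is 0 because -1 + 1 == 0
```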
- core.fusions.fused_bias_geglu.weighted_quick_geglu(y: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0)#
Token-wise-weighted Quick-GEGLU activation.
The weights tensor is expected to have the same first-dimension length as y and a trailing singleton dimension so that it broadcasts over the feature dimension.
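The broadcasting requirement can be checked with a short sketch. It assumes the 1.702 Quick-GELU constant and assumes the result is cast back to the dtype of y (stated on this page only for the bias variant); the helper name is hypothetical.

```python
import torch


def weighted_quick_geglu_reference(
    y: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0
) -> torch.Tensor:
    """Token-wise weighted Quick-GEGLU sketch (hypothetical helper)."""
    dtype = y.dtype
    y1, y2 = torch.chunk(y, 2, dim=-1)
    # weights has shape [tokens, 1], so it scales every feature of a token equally.
    res = (y1 * torch.sigmoid(1.702 * y1)) * (y2 + linear_offset) * weights
    # Assumed: return in the input dtype even if weights are higher precision.
    return res.to(dtype)


tokens, hidden = 6, 4
y = torch.randn(tokens, 2 * hidden, dtype=torch.bfloat16)
weights = torch.rand(tokens, 1, dtype=torch.float32)
out = weighted_quick_geglu_reference(y, weights)
print(out.shape, out.dtype)  # torch.Size([6, 4]) torch.bfloat16
```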
- core.fusions.fused_bias_geglu.quick_geglu_back(g, y, linear_offset: float = 0.0) → torch.Tensor#
Backward helper for Quick-GEGLU.
- Parameters:
g (torch.Tensor) – Upstream gradient tensor.
y (torch.Tensor) – Input tensor used in the forward pass.
linear_offset (float, optional) – Linear offset used in the forward pass. Defaults to 0.0.
- Returns:
Gradient with respect to the input tensor.
- Return type:
torch.Tensor
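A hand-derived version of this gradient is sketched below and compared with autograd. It assumes Quick-GELU is y1 * sigmoid(1.702 * y1), whose derivative is s * (1 + 1.702 * y1 * (1 - s)) with s = sigmoid(1.702 * y1); the helper name is hypothetical.

```python
import torch


def quick_geglu_back_reference(
    g: torch.Tensor, y: torch.Tensor, linear_offset: float = 0.0
) -> torch.Tensor:
    """Hand-written Quick-GEGLU gradient (hypothetical helper, 1.702 assumed)."""
    y1, y2 = torch.chunk(y, 2, dim=-1)
    sig = torch.sigmoid(1.702 * y1)
    dquick = sig * (1.0 + 1.702 * y1 * (1.0 - sig))  # d quick_gelu / d y1
    grad_y1 = g * dquick * (y2 + linear_offset)
    grad_y2 = g * (y1 * sig)                         # g * quick_gelu(y1)
    return torch.cat((grad_y1, grad_y2), dim=-1)


y = torch.randn(3, 6, dtype=torch.float64, requires_grad=True)
g = torch.randn(3, 3, dtype=torch.float64)
a, b = torch.chunk(y, 2, dim=-1)
((a * torch.sigmoid(1.702 * a)) * (b + 0.5)).backward(g)

manual = quick_geglu_back_reference(g, y.detach(), linear_offset=0.5)
print(torch.allclose(manual, y.grad))
```

Note that linear_offset only affects the gradient of the gated half; the gradient of the linear half does not depend on it.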
- core.fusions.fused_bias_geglu.weighted_quick_geglu_back(g, y, weights, linear_offset: float = 0.0)#
Backward helper for weighted Quick-GEGLU. Returns gradients w.r.t. input y and weights.
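Because weights has shape [tokens, 1] and broadcasts over the feature dimension, its gradient is a sum over that dimension. The sketch below shows only the weights part, with a hypothetical helper name and the assumed 1.702 Quick-GELU constant, and checks it against autograd.

```python
import torch


def weights_grad_reference(
    g: torch.Tensor, y: torch.Tensor, linear_offset: float = 0.0
) -> torch.Tensor:
    """Gradient w.r.t. per-token weights (hypothetical helper)."""
    y1, y2 = torch.chunk(y, 2, dim=-1)
    act = (y1 * torch.sigmoid(1.702 * y1)) * (y2 + linear_offset)
    # Reduce over features so the result matches the [tokens, 1] weights shape.
    return (g * act).sum(dim=-1, keepdim=True)


y = torch.randn(5, 8, dtype=torch.float64)
w = torch.rand(5, 1, dtype=torch.float64, requires_grad=True)
g = torch.randn(5, 4, dtype=torch.float64)

y1, y2 = torch.chunk(y, 2, dim=-1)
((y1 * torch.sigmoid(1.702 * y1)) * y2 * w).backward(g)

manual = weights_grad_reference(g, y)
print(manual.shape)  # torch.Size([5, 1])
```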
- core.fusions.fused_bias_geglu.weighted_bias_quick_geglu(y: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0)#
Token-wise weighted Quick-GEGLU activation with bias.
- Parameters:
y – Input tensor before bias addition.
bias – Bias tensor broadcastable to y.
weights – Weight tensor with shape [tokens, 1] broadcasting over feature dim.
linear_offset – Optional linear offset for the second half before gating.
- Returns:
Activated tensor with same dtype as y.
- core.fusions.fused_bias_geglu.weighted_bias_quick_geglu_back(g, y, bias, weights, linear_offset: float = 0.0)#
Backward helper for weighted Quick-GEGLU with bias.
Returns gradients w.r.t. input y, bias, and weights.
- class core.fusions.fused_bias_geglu.WeightedQuickGeGLUFunction#
Bases: torch.autograd.Function
Autograd function for token-wise weighted Quick-GEGLU (no bias).
- static forward(ctx, input: torch.Tensor, weights: torch.Tensor, fp8_input_store: bool, linear_offset: torch.Tensor)#
Forward pass of weighted Quick-GEGLU.
- Parameters:
ctx – Autograd context object for saving tensors for backward pass.
input (torch.Tensor) – Input tensor of shape [N, 2H].
weights (torch.Tensor) – Per-token weights of shape [N, 1].
fp8_input_store (bool) – If True, stores input for backward in FP8.
linear_offset (torch.Tensor) – Scalar tensor offset added to the linear half.
- Returns:
Output tensor of shape [N, H] after weighted Quick-GEGLU.
- Return type:
torch.Tensor
- static backward(ctx, grad_output)#
Backward pass of weighted Quick-GEGLU.
- Parameters:
ctx – Autograd context object containing saved tensors from forward pass.
grad_output (torch.Tensor) – Upstream gradient w.r.t. the output.
- Returns:
Gradients with respect to (input, weights, fp8_input_store, linear_offset). The latter two gradients are None.
- Return type:
tuple
- class core.fusions.fused_bias_geglu.WeightedBiasQuickGeGLUFunction#
Bases: torch.autograd.Function
Autograd function for token-wise weighted Quick-GEGLU with bias support.
- static forward(ctx, input: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor, fp8_input_store: bool, linear_offset: torch.Tensor)#
Forward pass of weighted Quick-GEGLU with bias.
- Parameters:
ctx – Autograd context object for saving tensors for backward pass.
input (torch.Tensor) – Input tensor of shape [N, 2H].
bias (torch.Tensor) – Bias tensor of shape [N, 1].
weights (torch.Tensor) – Per-token weights of shape [N, 1].
fp8_input_store (bool) – If True, stores input for backward in FP8.
linear_offset (torch.Tensor) – Scalar tensor offset added to the linear half.
- Returns:
Output tensor of shape [N, H] after weighted Quick-GEGLU with bias.
- Return type:
torch.Tensor
- static backward(ctx, grad_output)#
Backward pass of weighted Quick-GEGLU with bias.
- Parameters:
ctx – Autograd context object containing saved tensors from forward pass.
grad_output (torch.Tensor) – Upstream gradient w.r.t. the output.
- Returns:
Gradients with respect to (input, bias, weights, fp8_input_store, linear_offset). The latter two gradients are None.
- Return type:
tuple
- core.fusions.fused_bias_geglu.weighted_bias_quick_geglu_impl(input, bias, weights, fp8_input_store=False, linear_offset=0.0, clamp_value=None)#
Token-wise-weighted bias quick_geglu fusion.
input: [num_selected_experts * seq_len, hidden_size * 2]
bias: None
weights: [num_selected_experts * seq_len, 1]
fp8_input_store: bool
linear_offset: float
output: [num_selected_experts * seq_len, hidden_size]