core.fusions.fused_bias_geglu#

Module Contents#

Classes#

BiasGeGLUFunction

Custom autograd function for GEGLU activation with bias support.

GeGLUFunction

Custom autograd function for GEGLU activation without bias.

WeightedQuickGeGLUFunction

Autograd function for token-wise weighted Quick-GEGLU (no bias).

WeightedBiasQuickGeGLUFunction

Autograd function for token-wise weighted Quick-GEGLU with bias support.

Functions#

geglu

Performs GEGLU (GELU-Gated Linear Unit) activation.

bias_geglu

Performs GEGLU activation with bias addition.

geglu_back

Computes the gradient for the GEGLU activation.

bias_geglu_back

Computes the gradient for the biased GEGLU activation.

bias_geglu_impl

Implementation of biased GEGLU that handles different input shapes.

quick_gelu

Sigmoid approximation of GELU.

quick_geglu

Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset).

weighted_quick_geglu

Token-wise-weighted Quick-GEGLU activation.

quick_geglu_back

Backward helper for Quick-GEGLU.

weighted_quick_geglu_back

Backward helper for weighted Quick-GEGLU. Returns gradients w.r.t. the input y and the weights.

weighted_bias_quick_geglu

Token-wise weighted Quick-GEGLU activation with bias.

weighted_bias_quick_geglu_back

Backward helper for weighted Quick-GEGLU with bias.

weighted_bias_quick_geglu_impl

Token-wise-weighted bias quick_geglu fusion.

API#

core.fusions.fused_bias_geglu.geglu(y)#

Performs GEGLU (GELU-Gated Linear Unit) activation.

Parameters:

y (torch.Tensor) – Input tensor to be split into two halves along the last dimension.

Returns:

Result of GEGLU activation: GELU(y1) * y2, where y1, y2 are the split halves.

Return type:

torch.Tensor

core.fusions.fused_bias_geglu.bias_geglu(bias, y)#

Performs GEGLU activation with bias addition.

Parameters:
  • bias (torch.Tensor) – Bias tensor to be added to the input.

  • y (torch.Tensor) – Input tensor to be split and gated.

Returns:

Result of bias addition followed by GEGLU activation.

Return type:

torch.Tensor
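
For orientation, here is a minimal unfused sketch of geglu and bias_geglu. The exact GELU variant inside the fused kernel is an assumption; PyTorch's default GELU stands in for it here.

```python
import torch
import torch.nn.functional as F

def geglu_reference(y: torch.Tensor) -> torch.Tensor:
    # Split the last dimension into a gate half (y_1) and a linear half (y_2).
    y_1, y_2 = torch.chunk(y, 2, dim=-1)
    # GELU-gated linear unit: GELU(y_1) * y_2.
    return F.gelu(y_1) * y_2

def bias_geglu_reference(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The bias is added before the split-and-gate.
    return geglu_reference(y + bias)
```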

core.fusions.fused_bias_geglu.geglu_back(g, y)#

Computes the gradient for the GEGLU activation.

Parameters:
  • g (torch.Tensor) – Gradient tensor from the subsequent layer.

  • y (torch.Tensor) – Input tensor that was used in the forward pass.

Returns:

Gradient with respect to the input tensor.

Return type:

torch.Tensor
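
As a sanity check, geglu_back should agree with autograd run through an unfused forward. A sketch, reusing the hypothetical geglu_reference helper above:

```python
import torch

y = torch.randn(4, 16, dtype=torch.double, requires_grad=True)
out = geglu_reference(y)        # unfused forward from the sketch above
g = torch.ones_like(out)        # upstream gradient
out.backward(g)
# y.grad should now match geglu_back(g, y.detach()) up to the GELU variant
# (exact vs. tanh approximation) used by the fused implementation.
```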

core.fusions.fused_bias_geglu.bias_geglu_back(g, y, bias)#

Computes the gradient for the biased GEGLU activation.

Parameters:
  • g (torch.Tensor) – Gradient tensor from the subsequent layer.

  • y (torch.Tensor) – Input tensor that was used in the forward pass.

  • bias (torch.Tensor) – Bias tensor that was added in the forward pass.

Returns:

Gradient with respect to the input tensor after bias addition.

Return type:

torch.Tensor

class core.fusions.fused_bias_geglu.BiasGeGLUFunction#

Bases: torch.autograd.Function

Custom autograd function for GEGLU activation with bias support.

static forward(ctx, input, bias)#

Forward pass of biased GEGLU activation.

Parameters:
  • ctx – Autograd context object for saving tensors for backward pass.

  • input (torch.Tensor) – Input tensor to apply GEGLU to.

  • bias (torch.Tensor) – Bias tensor to be added to input before GEGLU.

Returns:

Result of applying bias addition followed by GEGLU activation.

Return type:

torch.Tensor

static backward(ctx, grad_output)#

Backward pass of biased GEGLU activation.

Parameters:
  • ctx – Autograd context object containing saved tensors from forward pass.

  • grad_output (torch.Tensor) – Gradient of the loss with respect to the output.

Returns:

Tuple containing gradients with respect to the input and bias tensors.

Return type:

tuple
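
Call sites go through Function.apply, which is what registers the custom backward with autograd. A usage sketch with illustrative shapes:

```python
import torch
from core.fusions.fused_bias_geglu import BiasGeGLUFunction

x = torch.randn(8, 64, requires_grad=True)   # [tokens, 2 * hidden]
b = torch.randn(64, requires_grad=True)      # bias broadcastable to x
out = BiasGeGLUFunction.apply(x, b)          # [tokens, hidden] == [8, 32]
out.sum().backward()                         # populates x.grad and b.grad
```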

class core.fusions.fused_bias_geglu.GeGLUFunction#

Bases: torch.autograd.Function

Custom autograd function for GEGLU activation without bias.

static forward(ctx, input)#

Forward pass of GEGLU activation.

Parameters:
  • ctx – Autograd context object for saving tensors for backward pass.

  • input (torch.Tensor) – Input tensor to apply GEGLU to.

Returns:

Result of applying GEGLU activation.

Return type:

torch.Tensor

static backward(ctx, grad_output)#

Backward pass of GEGLU activation.

Parameters:
  • ctx – Autograd context object containing saved tensors from forward pass.

  • grad_output (torch.Tensor) – Gradient of the loss with respect to the output.

Returns:

Gradient with respect to the input tensor.

Return type:

torch.Tensor

core.fusions.fused_bias_geglu.bias_geglu_impl(input, bias)#

Implementation of biased GEGLU that handles different input shapes.

This function reshapes the input if necessary, applies the GEGLU activation (with or without bias), and restores the original shape.

Parameters:
  • input (torch.Tensor) – Input tensor to apply GEGLU activation.

  • bias (torch.Tensor, optional) – Bias tensor to be added to input. If None, uses the bias-free GEGLU variant.

Returns:

Result of biased GEGLU activation.

Return type:

torch.Tensor

Raises:

AssertionError – If input tensor does not have 2 or 3 dimensions.
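
A usage sketch of the shape handling; the [seq, batch, 2 * hidden] layout for the 3D case is an assumption based on the 2-or-3 dimension constraint:

```python
import torch
from core.fusions.fused_bias_geglu import bias_geglu_impl

x3d = torch.randn(128, 4, 512)        # assumed [seq, batch, 2 * hidden]
bias = torch.randn(512)
out = bias_geglu_impl(x3d, bias)      # output restored to [128, 4, 256]

x2d = torch.randn(512, 512)           # 2D inputs need no reshape
out2 = bias_geglu_impl(x2d, None)     # bias=None uses the bias-free variant
```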

core.fusions.fused_bias_geglu.quick_gelu(y: torch.Tensor) torch.Tensor#

Sigmoid approximation of GELU.
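
The usual definition behind this approximation; the 1.702 constant is the standard QuickGELU choice and is assumed here:

```python
import torch

def quick_gelu_reference(y: torch.Tensor) -> torch.Tensor:
    # x * sigmoid(1.702 * x) closely tracks GELU at lower cost.
    return y * torch.sigmoid(1.702 * y)
```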

core.fusions.fused_bias_geglu.quick_geglu(
y: torch.Tensor,
linear_offset: float = 0.0,
) torch.Tensor#

Performs Quick-GELU-based GEGLU activation: quick_gelu(y1) * (y2 + offset).

Parameters:
  • y – Input tensor split into two halves on the last dimension.

  • linear_offset – Optional linear offset added to the second half before gating.

Returns:

Tensor after applying the GEGLU activation.

core.fusions.fused_bias_geglu.weighted_quick_geglu(
y: torch.Tensor,
weights: torch.Tensor,
linear_offset: float = 0.0,
) torch.Tensor#

Token-wise-weighted Quick-GEGLU activation.

The weights tensor is expected to have the same first-dimension length as y and a trailing singleton dimension so that it broadcasts over the feature dimension.
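
A minimal sketch of the weighted variant, reusing the hypothetical quick_gelu_reference helper above:

```python
import torch

def weighted_quick_geglu_reference(
    y: torch.Tensor, weights: torch.Tensor, linear_offset: float = 0.0
) -> torch.Tensor:
    # y: [tokens, 2H]; weights: [tokens, 1] broadcasts over the feature dim.
    y_1, y_2 = torch.chunk(y, 2, dim=-1)
    return quick_gelu_reference(y_1) * (y_2 + linear_offset) * weights
```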

core.fusions.fused_bias_geglu.quick_geglu_back(g, y, linear_offset: float = 0.0) torch.Tensor#

Backward helper for Quick-GEGLU.

Parameters:
  • g (torch.Tensor) – Upstream gradient tensor.

  • y (torch.Tensor) – Input tensor used in the forward pass.

  • linear_offset (float, optional) – Linear offset used in the forward pass. Defaults to 0.0.

Returns:

Gradient with respect to the input tensor.

Return type:

torch.Tensor
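
A closed-form sketch of this backward, under the assumption that quick_gelu(x) = x * sigmoid(1.702 * x):

```python
import torch

def quick_geglu_back_reference(g, y, linear_offset: float = 0.0):
    y_1, y_2 = torch.chunk(y, 2, dim=-1)
    sig = torch.sigmoid(1.702 * y_1)
    # d/dx [x * sigmoid(1.702 x)]
    #   = sigmoid(1.702 x) * (1 + 1.702 x * (1 - sigmoid(1.702 x)))
    dquick = sig * (1 + 1.702 * y_1 * (1 - sig))
    grad_gate = g * (y_2 + linear_offset) * dquick   # w.r.t. the gate half
    grad_linear = g * (y_1 * sig)                    # w.r.t. the linear half
    return torch.cat([grad_gate, grad_linear], dim=-1)
```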

core.fusions.fused_bias_geglu.weighted_quick_geglu_back(g, y, weights, linear_offset: float = 0.0)#

Backward helper for weighted Quick-GEGLU. Returns gradients w.r.t. the input y and the weights.

core.fusions.fused_bias_geglu.weighted_bias_quick_geglu(
y: torch.Tensor,
bias: torch.Tensor,
weights: torch.Tensor,
linear_offset: float = 0.0,
) torch.Tensor#

Token-wise weighted Quick-GEGLU activation with bias.

Parameters:
  • y – Input tensor before bias addition.

  • bias – Bias tensor broadcastable to y.

  • weights – Weight tensor with shape [tokens, 1] broadcasting over feature dim.

  • linear_offset – Optional linear offset for the second half before gating.

Returns:

Activated tensor with same dtype as y.

core.fusions.fused_bias_geglu.weighted_bias_quick_geglu_back(
g,
y,
bias,
weights,
linear_offset: float = 0.0,
)#

Backward helper for weighted Quick-GEGLU with bias.

Returns gradients w.r.t. the input y, the bias, and the weights.

class core.fusions.fused_bias_geglu.WeightedQuickGeGLUFunction#

Bases: torch.autograd.Function

Autograd function for token-wise weighted Quick-GEGLU (no bias).

static forward(
ctx,
input: torch.Tensor,
weights: torch.Tensor,
fp8_input_store: bool,
linear_offset: torch.Tensor,
)#

Forward pass of weighted Quick-GEGLU.

Parameters:
  • ctx – Autograd context object for saving tensors for backward pass.

  • input (torch.Tensor) – Input tensor of shape [N, 2H].

  • weights (torch.Tensor) – Per-token weights of shape [N, 1].

  • fp8_input_store (bool) – If True, stores input for backward in FP8.

  • linear_offset (torch.Tensor) – Scalar tensor offset added to the linear half.

Returns:

Output tensor of shape [N, H] after weighted Quick-GEGLU.

Return type:

torch.Tensor

static backward(ctx, grad_output)#

Backward pass of weighted Quick-GEGLU.

Parameters:
  • ctx – Autograd context object containing saved tensors from forward pass.

  • grad_output (torch.Tensor) – Upstream gradient w.r.t. the output.

Returns:

Gradients with respect to (input, weights, fp8_input_store, linear_offset). The latter two gradients are None.

Return type:

tuple
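
A usage sketch via Function.apply; the shapes and the scalar-tensor offset are illustrative:

```python
import torch
from core.fusions.fused_bias_geglu import WeightedQuickGeGLUFunction

x = torch.randn(16, 128, requires_grad=True)   # [N, 2H]
w = torch.rand(16, 1, requires_grad=True)      # per-token weights [N, 1]
out = WeightedQuickGeGLUFunction.apply(x, w, False, torch.tensor(0.0))
out.sum().backward()   # x.grad and w.grad are populated; the last two get None
```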

class core.fusions.fused_bias_geglu.WeightedBiasQuickGeGLUFunction#

Bases: torch.autograd.Function

Autograd function for token-wise weighted Quick-GEGLU with bias support.

static forward(
ctx,
input: torch.Tensor,
bias: torch.Tensor,
weights: torch.Tensor,
fp8_input_store: bool,
linear_offset: torch.Tensor,
)#

Forward pass of weighted Quick-GEGLU.

Parameters:
  • ctx – Autograd context object for saving tensors for backward pass.

  • input (torch.Tensor) – Input tensor of shape [N, 2H].

  • bias (torch.Tensor) – Bias tensor broadcastable to input.

  • weights (torch.Tensor) – Per-token weights of shape [N, 1].

  • fp8_input_store (bool) – If True, stores input for backward in FP8.

  • linear_offset (torch.Tensor) – Scalar tensor offset added to the linear half.

Returns:

Output tensor of shape [N, H] after weighted Quick-GEGLU with bias.

Return type:

torch.Tensor

static backward(ctx, grad_output)#

Backward pass of weighted Quick-GEGLU with bias.

Parameters:
  • ctx – Autograd context object containing saved tensors from forward pass.

  • grad_output (torch.Tensor) – Upstream gradient w.r.t. the output.

Returns:

Gradients with respect to (input, bias, weights, fp8_input_store, linear_offset). The latter two gradients are None.

Return type:

tuple

core.fusions.fused_bias_geglu.weighted_bias_quick_geglu_impl(
input,
bias,
weights,
fp8_input_store=False,
linear_offset=0.0,
clamp_value=None,
)#

Token-wise-weighted bias quick_geglu fusion.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [num_selected_experts * seq_len, hidden_size * 2].

  • bias (torch.Tensor, optional) – Bias tensor added to the input, or None.

  • weights (torch.Tensor) – Per-token weights of shape [num_selected_experts * seq_len, 1].

  • fp8_input_store (bool) – If True, stores the input for backward in FP8.

  • linear_offset (float) – Linear offset added to the second half before gating.

Returns:

Output tensor of shape [num_selected_experts * seq_len, hidden_size].

Return type:

torch.Tensor
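
A usage sketch with MoE-style shapes; the sizes are illustrative and bias is passed as None per the docstring:

```python
import torch
from core.fusions.fused_bias_geglu import weighted_bias_quick_geglu_impl

num_tokens, hidden = 4 * 256, 1024   # num_selected_experts * seq_len, hidden_size
x = torch.randn(num_tokens, hidden * 2, requires_grad=True)
w = torch.rand(num_tokens, 1)
out = weighted_bias_quick_geglu_impl(x, None, w)
assert out.shape == (num_tokens, hidden)
```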