core.fusions.fused_bias_swiglu#

Module Contents#

Classes#

BiasSwiGLUFunction

Custom autograd function for SwiGLU activation with bias support.

SwiGLUFunction

Custom autograd function for SwiGLU activation without bias.

WeightedSwiGLUFunction

Functions#

swiglu

Performs the SwiGLU (Swish-Gated Linear Unit) activation function.

bias_swiglu

Performs SwiGLU activation with bias addition.

weighted_swiglu

swiglu_back

Computes the gradient for the SwiGLU activation function.

bias_swiglu_back

Computes the gradient for the biased SwiGLU activation function.

weighted_swiglu_back

bias_swiglu_impl

Implementation of biased SwiGLU that handles different input shapes.

weighted_bias_swiglu_impl

Token-wise weighted bias SwiGLU fusion.

API#

core.fusions.fused_bias_swiglu.swiglu(y)#

Performs the SwiGLU (Swish-Gated Linear Unit) activation function.

Parameters:

y (torch.Tensor) – Input tensor to be split into two halves along the last dimension.

Returns:

Result of SwiGLU activation: SiLU(y1) * y2, where y1, y2 are the split halves.

Return type:

torch.Tensor
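
For reference, the documented behavior corresponds to this unfused sketch (a minimal illustration in plain PyTorch, not the fused kernel itself):

```python
import torch
import torch.nn.functional as F

def swiglu_reference(y: torch.Tensor) -> torch.Tensor:
    # Split the last dimension into two equal halves: gate and value.
    y_1, y_2 = torch.chunk(y, 2, dim=-1)
    # SwiGLU: SiLU(y_1) gates y_2; the output has half the input's last dimension.
    return F.silu(y_1) * y_2
```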

core.fusions.fused_bias_swiglu.bias_swiglu(y, bias)#

Performs SwiGLU activation with bias addition.

Parameters:
  • y (torch.Tensor) – Input tensor.

  • bias (torch.Tensor) – Bias tensor to be added to input.

Returns:

Result of bias addition followed by SwiGLU activation.

Return type:

torch.Tensor
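
Functionally this is the bias addition followed by the SwiGLU split-and-gate; a minimal unfused sketch:

```python
import torch
import torch.nn.functional as F

def bias_swiglu_reference(y: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # The bias is broadcast-added to the input before the gated activation.
    y_1, y_2 = torch.chunk(y + bias, 2, dim=-1)
    return F.silu(y_1) * y_2
```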

core.fusions.fused_bias_swiglu.weighted_swiglu(y, weights)#

core.fusions.fused_bias_swiglu.swiglu_back(g, y)#

Computes the gradient for the SwiGLU activation function.

Parameters:
  • g (torch.Tensor) – Gradient tensor from the subsequent layer.

  • y (torch.Tensor) – Input tensor that was used in the forward pass.

Returns:

Gradient with respect to the input tensor, computed using the chain rule and the derivative of the SiLU activation function.

Return type:

torch.Tensor
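
The gradient follows from the product rule together with d/dx SiLU(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x))); an unfused sketch of the same computation:

```python
import torch

def swiglu_back_reference(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # y is the forward-pass input; g is the upstream gradient (half of y's last dim).
    y_1, y_2 = torch.chunk(y, 2, dim=-1)
    sig = torch.sigmoid(y_1)
    dsilu = sig * (1 + y_1 * (1 - sig))  # derivative of SiLU(y_1)
    # Gradients w.r.t. the two halves, concatenated back to the input shape.
    return torch.cat((g * dsilu * y_2, g * y_1 * sig), dim=-1)
```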

core.fusions.fused_bias_swiglu.bias_swiglu_back(g, y, bias)#

Computes the gradient for the biased SwiGLU activation function.

Parameters:
  • g (torch.Tensor) – Gradient tensor from the subsequent layer.

  • y (torch.Tensor) – Input tensor that was used in the forward pass.

  • bias (torch.Tensor) – Bias tensor that was added in the forward pass.

Returns:

Gradient with respect to the input tensor, computed after applying the bias addition.

Return type:

torch.Tensor
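
Because the bias enters through a plain addition (whose derivative is 1), the input gradient is the SwiGLU gradient evaluated at y + bias; a minimal sketch:

```python
import torch

def bias_swiglu_back_reference(g: torch.Tensor, y: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # d(y + bias)/dy = 1, so differentiate SwiGLU at the biased input (y + bias).
    z_1, z_2 = torch.chunk(y + bias, 2, dim=-1)
    sig = torch.sigmoid(z_1)
    dsilu = sig * (1 + z_1 * (1 - sig))  # derivative of SiLU(z_1)
    return torch.cat((g * dsilu * z_2, g * z_1 * sig), dim=-1)
```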

core.fusions.fused_bias_swiglu.weighted_swiglu_back(g, y, weights)#

class core.fusions.fused_bias_swiglu.BiasSwiGLUFunction#

Bases: torch.autograd.Function

Custom autograd function for SwiGLU activation with bias support.

static forward(ctx, input, bias, fp8_input_store, cpu_offload_input)#

Forward pass of biased SwiGLU activation.

Parameters:
  • ctx – Autograd context object for saving tensors for backward pass.

  • input (torch.Tensor) – Input tensor to apply SwiGLU to.

  • bias (torch.Tensor) – Bias tensor to be added to input before SwiGLU.

  • fp8_input_store (bool) – If True, stores intermediate values in FP8 format.

  • cpu_offload_input (bool) – If True, the input saved for the backward pass is offloaded to CPU memory.

Returns:

Result of applying bias addition followed by SwiGLU activation.

Return type:

torch.Tensor

static backward(ctx, grad_output)#

Backward pass of biased SwiGLU activation.

Parameters:
  • ctx – Autograd context object containing saved tensors from forward pass.

  • grad_output (torch.Tensor) – Gradient of the loss with respect to the output.

Returns:

Tuple containing:

  • Gradient with respect to the input tensor

  • Gradient with respect to the bias tensor

  • None for the fp8_input_store and cpu_offload_input parameters

Return type:

tuple
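
As with any torch.autograd.Function, the class is driven through apply() with positional arguments. A usage sketch (the 2 * ffn_hidden last dimension and the CUDA device are illustrative assumptions):

```python
import torch
from core.fusions.fused_bias_swiglu import BiasSwiGLUFunction  # module path as shown on this page

ffn_hidden = 4096
y = torch.randn(8, 2 * ffn_hidden, device="cuda", requires_grad=True)
bias = torch.randn(2 * ffn_hidden, device="cuda", requires_grad=True)

# Positional arguments mirror forward(): input, bias, fp8_input_store, cpu_offload_input.
out = BiasSwiGLUFunction.apply(y, bias, False, False)
print(out.shape)      # torch.Size([8, 4096]); last dimension is halved
out.sum().backward()  # populates y.grad and bias.grad
```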

class core.fusions.fused_bias_swiglu.SwiGLUFunction#

Bases: torch.autograd.Function

Custom autograd function for SwiGLU activation without bias.

static forward(ctx, input, fp8_input_store, cpu_offload_input)#

Forward pass of SwiGLU activation.

Parameters:
  • ctx – Autograd context object for saving tensors for backward pass.

  • input (torch.Tensor) – Input tensor to apply SwiGLU to.

  • fp8_input_store (bool) – If True, stores intermediate values in FP8 format.

  • cpu_offload_input (bool) – If True, the input saved for the backward pass is offloaded to CPU memory.

Returns:

Result of applying SwiGLU activation.

Return type:

torch.Tensor

static backward(ctx, grad_output)#

Backward pass of SwiGLU activation.

Parameters:
  • ctx – Autograd context object containing saved tensors from forward pass.

  • grad_output (torch.Tensor) – Gradient of the loss with respect to the output.

Returns:

Tuple containing:

  • Gradient with respect to the input tensor

  • None for the fp8_input_store and cpu_offload_input parameters

Return type:

tuple
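
Usage mirrors BiasSwiGLUFunction above, minus the bias argument (sketch under the same assumptions):

```python
import torch
from core.fusions.fused_bias_swiglu import SwiGLUFunction  # module path as shown on this page

y = torch.randn(8, 8192, device="cuda", requires_grad=True)
out = SwiGLUFunction.apply(y, False, False)  # input, fp8_input_store, cpu_offload_input
out.sum().backward()
```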

class core.fusions.fused_bias_swiglu.WeightedSwiGLUFunction#

Bases: torch.autograd.Function

static forward(ctx, input, weights, fp8_input_store)#

static backward(ctx, grad_output)#

core.fusions.fused_bias_swiglu.bias_swiglu_impl(input, bias, fp8_input_store=False, cpu_offload_input=False)#

Implementation of biased SwiGLU that handles different input shapes.

This function reshapes the input if necessary, applies the SwiGLU activation (with or without bias), and restores the original shape.

Parameters:
  • input (torch.Tensor) – Input tensor to apply SwiGLU activation.

  • bias (torch.Tensor, optional) – Bias tensor to be added to input. If None, uses the bias-free SwiGLU variant.

  • fp8_input_store (bool, optional) – Whether to store intermediate values in FP8 format. Defaults to False.

  • cpu_offload_input (bool, optional) – Whether to offload the input saved for the backward pass to CPU memory. Defaults to False.

Returns:

Result of biased SwiGLU activation.

Return type:

torch.Tensor

Raises:

AssertionError – If input tensor does not have 2 or 3 dimensions.
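
A usage sketch with a 3-D activation tensor (the [sequence, batch, 2 * ffn_hidden] layout and CUDA device are illustrative assumptions); the leading dimensions are flattened and restored internally, and passing bias=None selects the bias-free variant:

```python
import torch
from core.fusions.fused_bias_swiglu import bias_swiglu_impl  # module path as shown on this page

seq, batch, ffn_hidden = 1024, 4, 4096
x = torch.randn(seq, batch, 2 * ffn_hidden, device="cuda", requires_grad=True)
bias = torch.randn(2 * ffn_hidden, device="cuda", requires_grad=True)

out = bias_swiglu_impl(x, bias)          # biased SwiGLU
out_no_bias = bias_swiglu_impl(x, None)  # bias-free SwiGLU
print(out.shape)                         # torch.Size([1024, 4, 4096])
```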

core.fusions.fused_bias_swiglu.weighted_bias_swiglu_impl(input, bias, weights, fp8_input_store=False)#

Token-wise weighted bias SwiGLU fusion.
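
The weighted variant additionally scales each token's activation by a per-token weight; a plausible unfused equivalent, assuming weights broadcasts over the hidden dimension (the exact weighting contract is an assumption rather than something documented on this page):

```python
import torch
import torch.nn.functional as F

def weighted_bias_swiglu_reference(y: torch.Tensor, bias: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # Bias-SwiGLU followed by token-wise scaling (weights assumed to broadcast
    # over the hidden dimension, e.g. shape [num_tokens, 1]).
    y_1, y_2 = torch.chunk(y + bias, 2, dim=-1)
    return F.silu(y_1) * y_2 * weights
```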