nemo_automodel.components.models.common.combined_projection.combined_mlp#

Combined gate_up MLP projection for SwiGLU and similar activations.

This module provides CombinedGateUpMLP, which fuses gate_proj and up_proj into a single gate_up projection, reducing kernel launch overhead and improving memory efficiency.

Module Contents#

Classes#

CombinedGateUpMLP

SwiGLU MLP with combined gate_up projection for efficiency.

API#

class nemo_automodel.components.models.common.combined_projection.combined_mlp.CombinedGateUpMLP(config)#

Bases: torch.nn.Module

SwiGLU MLP with combined gate_up projection for efficiency.

This module combines gate_proj and up_proj into a single projection, then splits the result. This can improve efficiency by reducing kernel launches, though the benefit depends on the specific hardware and tensor sizes.

Works with any activation function that follows the gate * up pattern.
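
A minimal sketch of the pattern (assuming plain nn.Linear layers and SiLU; the names FusedGateUpSketch, gate_up_proj, and down_proj are illustrative, not the module's verbatim internals):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedGateUpSketch(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, bias: bool = False):
        super().__init__()
        # One matmul produces both the gate and up halves at once.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=bias)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        # SwiGLU: act(gate) * up, then project back to hidden_size.
        return self.down_proj(F.silu(gate) * up)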

Parameters:

config –

Model config with attributes:

  • hidden_size: Model hidden dimension

  • intermediate_size: MLP intermediate dimension

  • hidden_act: Activation function name (e.g., "silu", "gelu")

  • mlp_bias: Whether to use bias (optional, defaults to False)

Example#

# For Llama-style SwiGLU:
mlp = CombinedGateUpMLP(config)  # config.hidden_act = "silu"

# For Qwen2-style SwiGLU:
mlp = CombinedGateUpMLP(config)  # config.hidden_act = "silu"
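
A fuller usage sketch, assuming a simple namespace config carrying the attributes listed under Parameters (the values below are placeholders for illustration):

from types import SimpleNamespace

import torch
from nemo_automodel.components.models.common.combined_projection.combined_mlp import CombinedGateUpMLP

# Placeholder values; any config object with these attributes works.
config = SimpleNamespace(
    hidden_size=4096,
    intermediate_size=11008,
    hidden_act="silu",
    mlp_bias=False,
)

mlp = CombinedGateUpMLP(config)
x = torch.randn(2, 16, config.hidden_size)  # [batch, seq_len, hidden_size]
y = mlp(x)                                  # [batch, seq_len, hidden_size]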


forward(x: torch.Tensor) → torch.Tensor#

Forward pass with combined gate_up projection.

Handles tensor parallelism by dynamically computing split sizes based on actual tensor dimensions.

Parameters:

x – Input tensor [batch, seq_len, hidden_size]

Returns:

Output tensor [batch, seq_len, hidden_size]
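
The dynamic split can be pictured as follows (an assumed sketch of the behavior described above, not the module's source): under tensor parallelism the combined projection's output dimension may be a shard of 2 * intermediate_size, so the split point comes from the tensor itself rather than the config:

import torch

def split_gate_up(gate_up: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # The last dim may be sharded under tensor parallelism, so derive the
    # local split size from the tensor instead of config.intermediate_size.
    local_intermediate = gate_up.shape[-1] // 2
    return gate_up[..., :local_intermediate], gate_up[..., local_intermediate:]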