nemo_automodel.components.models.common.combined_projection.combined_mlp#
Combined gate_up MLP projection for SwiGLU and similar activations.
This module provides a combined gate_up projection that fuses gate_proj and up_proj into a single linear layer, reducing kernel launch overhead and improving memory efficiency.
Module Contents#
Classes#
CombinedGateUpMLP: SwiGLU MLP with combined gate_up projection for efficiency.
API#
- class nemo_automodel.components.models.common.combined_projection.combined_mlp.CombinedGateUpMLP(config)#
Bases: torch.nn.Module

SwiGLU MLP with combined gate_up projection for efficiency.
This module combines gate_proj and up_proj into a single projection, then splits the result. This can improve efficiency by reducing kernel launches, though the benefit depends on the specific hardware and tensor sizes.
Works with any activation function that follows the gate * up pattern.
- Parameters:
  config – Model config with the following attributes:
  - hidden_size: model hidden dimension
  - intermediate_size: MLP intermediate dimension
  - hidden_act: activation function name (e.g., "silu", "gelu")
  - mlp_bias: whether to use bias (optional, defaults to False)
Example#

For Llama-style SwiGLU:

mlp = CombinedGateUpMLP(config)  # config.hidden_act = "silu"

For Qwen2-style SwiGLU:

mlp = CombinedGateUpMLP(config)  # config.hidden_act = "silu"
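The combined projection pattern described above can be illustrated with a minimal sketch. This is not the library's source, only an assumed shape of the technique: one linear layer emits both the gate and up halves, which are then split before applying the activation and down projection. The class and attribute names mirror the documented config contract.

```python
import types

import torch
import torch.nn as nn
import torch.nn.functional as F


class CombinedGateUpSketch(nn.Module):
    """Illustrative sketch of a combined gate_up SwiGLU MLP (assumption, not the library source)."""

    def __init__(self, config):
        super().__init__()
        bias = getattr(config, "mlp_bias", False)  # optional, defaults to False
        # Single fused projection: hidden_size -> 2 * intermediate_size
        self.gate_up_proj = nn.Linear(config.hidden_size, 2 * config.intermediate_size, bias=bias)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=bias)
        self.act_fn = F.silu if config.hidden_act == "silu" else F.gelu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_up = self.gate_up_proj(x)
        # Split the fused output back into the gate and up halves.
        gate, up = gate_up.chunk(2, dim=-1)
        return self.down_proj(self.act_fn(gate) * up)


# Hypothetical minimal config for demonstration only.
config = types.SimpleNamespace(hidden_size=64, intermediate_size=128, hidden_act="silu")
mlp = CombinedGateUpSketch(config)
out = mlp(torch.randn(2, 5, 64))  # output keeps the [batch, seq_len, hidden_size] shape
```

One fused matmul replaces two separate gate/up matmuls, which is where the kernel-launch saving comes from.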
Initialization
- forward(x: torch.Tensor) → torch.Tensor#
Forward pass with combined gate_up projection.
Handles tensor parallelism by dynamically computing split sizes based on actual tensor dimensions.
- Parameters:
x – Input tensor [batch, seq_len, hidden_size]
- Returns:
Output tensor [batch, seq_len, hidden_size]
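The note about tensor parallelism can be made concrete with a small sketch. Under TP the fused gate_up output may be sharded along the last dimension, so the split point should be derived from the actual tensor width rather than from config.intermediate_size. This helper is an assumption about how such a split could look, not the library's implementation:

```python
import torch


def split_gate_up(gate_up: torch.Tensor):
    """Split a fused gate_up tensor using its actual (possibly sharded) width."""
    # Under tensor parallelism, the last dim is 2 * (intermediate_size / tp_size),
    # so the local split size is computed from the tensor itself.
    local_intermediate = gate_up.shape[-1] // 2
    return gate_up.split(local_intermediate, dim=-1)


# e.g. intermediate_size sharded to 128 columns per rank -> fused width 256
fused = torch.randn(2, 5, 256)
gate, up = split_gate_up(fused)
```

Computing the split size from the tensor keeps the same code path correct whether or not the projection is sharded.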