core.models.gpt.experimental_attention_variant_module_specs#

Module Contents#

Functions#

get_gated_delta_net_module_spec

Build module spec for GatedDeltaNet attention.

get_dsa_module_spec_for_backend

Helper function to get module spec for Sparse Attention.

get_experimental_attention_variant_module_spec

Helper function to get the module spec for an experimental attention variant.

get_transformer_block_with_experimental_attention_variant_spec

Build transformer block spec with experimental attention variants (e.g., linear attention).

is_linear_attention_variant

Check if the experimental attention variant is a linear attention variant.

get_moe_layer_pattern

Parse config.moe_layer_freq to get per-layer MoE pattern (1=MoE, 0=dense).

get_linear_attention_pattern

Parse config.linear_attention_freq to get per-layer attention pattern (1=LA, 0=SDPA).

_get_backend_spec_provider

Get backend spec provider for experimental attention variant.

_get_self_attention_module_spec

Get the non-experimental self-attention module spec, for hybrid models that mix experimental and non-experimental attention architectures.

_get_dense_mlp_module_spec

Get the dense MLP module spec, for hybrid models that mix dense MLP and experimental attention architectures.

_get_moe_module_spec

Get the MoE module spec, for hybrid models that mix MoE and experimental attention architectures.

API#

core.models.gpt.experimental_attention_variant_module_specs.get_gated_delta_net_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Build module spec for GatedDeltaNet attention.

core.models.gpt.experimental_attention_variant_module_specs.get_dsa_module_spec_for_backend(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Helper function to get module spec for Sparse Attention.

core.models.gpt.experimental_attention_variant_module_specs.get_experimental_attention_variant_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Helper function to get the module spec for an experimental attention variant.

core.models.gpt.experimental_attention_variant_module_specs.get_transformer_block_with_experimental_attention_variant_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
vp_stage: Optional[int] = None,
pp_rank: Optional[int] = None,
) → megatron.core.transformer.transformer_block.TransformerBlockSubmodules#

Build transformer block spec with experimental attention variants (e.g., linear attention).

This function constructs a heterogeneous transformer block that supports mixing different attention mechanisms (experimental vs. standard) and MLP types (MoE vs. dense) across layers. Note that this is an experimental API in the short term and might be deprecated in the future. In the long run, we will move to a new design that better supports hybrid models.

Key Design:

1. Attention and MLP Patterns: The attention pattern and MLP pattern are orthogonal
   and determined independently. This allows flexible combinations (e.g., linear
   attention with MoE, or standard attention with dense MLP).

  • Attention pattern: derived from config.linear_attention_freq or config.experimental_attention_variant.

  • MLP pattern: derived from config.moe_layer_freq.

2. Per-Layer Spec Construction: Iterates through layers, constructing transformer
   layer specs based on attention and MLP patterns.

3. Pipeline Slicing: Extracts layer specs for the current pipeline stage.

Parameters:
  • config – Transformer configuration containing model hyperparameters and feature flags.

  • vp_stage – Virtual pipeline stage index for interleaved pipeline parallelism.

  • pp_rank – Pipeline model parallel rank.

Returns:

TransformerBlockSubmodules containing per-layer specs and final layer norm.

Note:

Currently only supports the transformer_engine backend. The Kitchen backend can be used as a wrapper with TE fallback for unsupported operations.
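The orthogonal-pattern design described above can be sketched in plain Python: two independent 0/1 patterns are zipped into per-layer (attention, MLP) choices, and pipeline slicing then keeps only this stage's contiguous chunk. The function and kind names here are illustrative, not the module's actual API.

```python
from typing import List, Tuple

def select_layer_kinds(
    attn_pattern: List[int], mlp_pattern: List[int]
) -> List[Tuple[str, str]]:
    """Illustrative sketch: combine orthogonal per-layer patterns.

    attn_pattern: 1 = linear attention, 0 = standard self-attention.
    mlp_pattern:  1 = MoE, 0 = dense MLP.
    """
    assert len(attn_pattern) == len(mlp_pattern), "patterns must cover the same layers"
    kinds = []
    for la, moe in zip(attn_pattern, mlp_pattern):
        attn = "linear_attention" if la else "self_attention"
        mlp = "moe" if moe else "dense_mlp"
        kinds.append((attn, mlp))
    return kinds

def slice_for_stage(
    layer_kinds: List[Tuple[str, str]], offset: int, layers_per_stage: int
) -> List[Tuple[str, str]]:
    """Pipeline slicing: extract this stage's contiguous block of layer specs."""
    return layer_kinds[offset : offset + layers_per_stage]
```

Because the two patterns are independent, any of the four attention/MLP combinations can appear at a given layer; the per-layer spec is built from whichever pair the patterns select.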

core.models.gpt.experimental_attention_variant_module_specs.is_linear_attention_variant(
experimental_attention_variant: Optional[str],
) → bool#

Check if the experimental attention variant is a linear attention variant.

core.models.gpt.experimental_attention_variant_module_specs.get_moe_layer_pattern(
config: megatron.core.transformer.transformer_config.TransformerConfig,
) → List[int]#

Parse config.moe_layer_freq to get per-layer MoE pattern (1=MoE, 0=dense).

  • int N: one MoE layer every N layers (e.g., N=2 -> [1,0,1,0,…])

  • list: use directly as the pattern.
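A minimal sketch of the expansion rule above, assuming a `num_layers` count is passed explicitly (the real function reads both values from `config`):

```python
from typing import List, Union

def moe_layer_pattern(
    moe_layer_freq: Union[int, List[int]], num_layers: int
) -> List[int]:
    """Sketch of the moe_layer_freq expansion rule (1 = MoE, 0 = dense)."""
    if isinstance(moe_layer_freq, int):
        # One MoE layer every N layers, starting at layer 0: N=2 -> [1, 0, 1, 0, ...]
        return [1 if i % moe_layer_freq == 0 else 0 for i in range(num_layers)]
    assert len(moe_layer_freq) == num_layers, "explicit pattern must cover all layers"
    return list(moe_layer_freq)
```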

core.models.gpt.experimental_attention_variant_module_specs.get_linear_attention_pattern(
config: megatron.core.transformer.transformer_config.TransformerConfig,
) → List[int]#

Parse config.linear_attention_freq to get per-layer attention pattern (1=LA, 0=SDPA).

  • int N: one SDPA layer every N layers (e.g., N=4 -> [1,1,1,0,1,1,1,0,…])

  • list: use directly as the pattern.
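The attention pattern follows the inverse convention of the MoE pattern: an integer N marks every N-th layer as SDPA rather than as the variant. A sketch under the same assumption that `num_layers` is passed in:

```python
from typing import List, Union

def linear_attention_pattern(
    linear_attention_freq: Union[int, List[int]], num_layers: int
) -> List[int]:
    """Sketch of the linear_attention_freq expansion rule (1 = LA, 0 = SDPA)."""
    if isinstance(linear_attention_freq, int):
        # One SDPA layer every N layers: N=4 -> [1, 1, 1, 0, 1, 1, 1, 0, ...]
        return [
            0 if (i + 1) % linear_attention_freq == 0 else 1
            for i in range(num_layers)
        ]
    assert len(linear_attention_freq) == num_layers, "explicit pattern must cover all layers"
    return list(linear_attention_freq)
```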

core.models.gpt.experimental_attention_variant_module_specs._get_backend_spec_provider(
config: megatron.core.transformer.transformer_config.TransformerConfig,
) → megatron.core.models.backends.BackendSpecProvider#

Get backend spec provider for experimental attention variant.

core.models.gpt.experimental_attention_variant_module_specs._get_self_attention_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Get the non-experimental self-attention module spec, for hybrid models that mix experimental and non-experimental attention architectures.

Warning: This function may be deprecated in the future.

core.models.gpt.experimental_attention_variant_module_specs._get_dense_mlp_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Get the dense MLP module spec, for hybrid models that mix dense MLP and experimental attention architectures.

Warning: This function may be deprecated in the future.

core.models.gpt.experimental_attention_variant_module_specs._get_moe_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Get the MoE module spec, for hybrid models that mix MoE and experimental attention architectures.

Warning: This function may be deprecated in the future.