core.models.gpt.experimental_attention_variant_module_specs#

Module Contents#

Functions#

get_gated_delta_net_module_spec

Build module spec for GatedDeltaNet attention.

get_dsa_module_spec_for_backend

Helper function to get module spec for Sparse Attention.

get_experimental_attention_variant_module_spec

Helper function to get the module spec for an experimental attention variant.

get_transformer_block_with_experimental_attention_variant_spec

Build transformer block spec with experimental attention variants (e.g., linear attention).

is_linear_attention_variant

Check if the experimental attention variant is a linear attention variant.

get_moe_layer_pattern

Parse config.moe_layer_freq to get per-layer MoE pattern (1=MoE, 0=dense).

get_linear_attention_pattern

Parse config.linear_attention_freq to get per-layer attention pattern (1=LA, 0=SDPA).

_get_backend_spec_provider

Get backend spec provider for experimental attention variant.

_get_self_attention_module_spec

Get the non-experimental self-attention module spec, for hybrid models that mix experimental and non-experimental attention architectures.

_get_dense_mlp_module_spec

Get the dense MLP module spec, for hybrid models that mix dense MLP and experimental attention architectures.

_get_moe_module_spec

Get the MoE module spec, for hybrid models that mix MoE and experimental attention architectures.

API#

core.models.gpt.experimental_attention_variant_module_specs.get_gated_delta_net_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Build module spec for GatedDeltaNet attention.

core.models.gpt.experimental_attention_variant_module_specs.get_dsa_module_spec_for_backend(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Helper function to get module spec for Sparse Attention.

core.models.gpt.experimental_attention_variant_module_specs.get_experimental_attention_variant_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Helper function to get the module spec for an experimental attention variant.

core.models.gpt.experimental_attention_variant_module_specs.get_transformer_block_with_experimental_attention_variant_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
vp_stage: Optional[int] = None,
pp_rank: Optional[int] = None,
) → megatron.core.transformer.transformer_block.TransformerBlockSubmodules#

Build transformer block spec with experimental attention variants (e.g., linear attention).

This function constructs a heterogeneous transformer block that supports mixing different attention mechanisms (experimental vs. standard) and MLP types (MoE vs. dense) across layers. Note that this is an experimental API in the short term and might be deprecated in the future. In the long run, we will move to a new design that better supports hybrid models.

Key Design:

1. Attention and MLP Patterns: The attention pattern and MLP pattern are orthogonal
   and determined independently. This allows flexible combinations (e.g., linear
   attention with MoE, or standard attention with dense MLP).

  • Attention pattern: derived from config.linear_attention_freq or config.experimental_attention_variant.

  • MLP pattern: derived from config.moe_layer_freq.

2. Per-Layer Spec Construction: Iterates through layers, constructing transformer
   layer specs based on attention and MLP patterns.

3. Pipeline Slicing: Extracts layer specs for the current pipeline stage.

Parameters:
  • config – Transformer configuration containing model hyperparameters and feature flags.

  • vp_stage – Virtual pipeline stage index for interleaved pipeline parallelism.

  • pp_rank – Pipeline model parallel rank.

Returns:

TransformerBlockSubmodules containing per-layer specs and final layer norm.

Note:

Currently only supports the transformer_engine backend. The Kitchen backend can be used as a wrapper with TE fallback for unsupported operations.
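The orthogonal-pattern design described above can be sketched in plain Python: two independent 0/1 patterns are zipped into per-layer (attention, MLP) choices, and pipeline slicing then keeps only this stage's contiguous chunk. The function and kind names here are illustrative, not the module's actual API.

```python
from typing import List, Tuple

def select_layer_kinds(
    attn_pattern: List[int], mlp_pattern: List[int]
) -> List[Tuple[str, str]]:
    """Illustrative sketch: combine orthogonal per-layer patterns.

    attn_pattern: 1 = linear attention, 0 = standard self-attention.
    mlp_pattern:  1 = MoE, 0 = dense MLP.
    """
    assert len(attn_pattern) == len(mlp_pattern), "patterns must cover the same layers"
    kinds = []
    for la, moe in zip(attn_pattern, mlp_pattern):
        attn = "linear_attention" if la else "self_attention"
        mlp = "moe" if moe else "dense_mlp"
        kinds.append((attn, mlp))
    return kinds

def slice_for_stage(
    layer_kinds: List[Tuple[str, str]], offset: int, layers_per_stage: int
) -> List[Tuple[str, str]]:
    """Pipeline slicing: extract this stage's contiguous block of layer specs."""
    return layer_kinds[offset : offset + layers_per_stage]
```

Because the two patterns are independent, any of the four attention/MLP combinations can appear at a given layer; the per-layer spec is built from whichever pair the patterns select.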

core.models.gpt.experimental_attention_variant_module_specs.is_linear_attention_variant(
experimental_attention_variant: Optional[str],
) → bool#

Check if the experimental attention variant is a linear attention variant.

core.models.gpt.experimental_attention_variant_module_specs.get_moe_layer_pattern(
config: megatron.core.transformer.transformer_config.TransformerConfig,
) → List[int]#

Parse config.moe_layer_freq to get per-layer MoE pattern (1=MoE, 0=dense).

  • int N: one MoE layer every N layers (e.g., N=2 -> [1,0,1,0,…])

  • list: use directly as the pattern.
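A minimal sketch of the expansion rule above, assuming a `num_layers` count is passed explicitly (the real function reads both values from `config`):

```python
from typing import List, Union

def moe_layer_pattern(
    moe_layer_freq: Union[int, List[int]], num_layers: int
) -> List[int]:
    """Sketch of the moe_layer_freq expansion rule (1 = MoE, 0 = dense)."""
    if isinstance(moe_layer_freq, int):
        # One MoE layer every N layers, starting at layer 0: N=2 -> [1, 0, 1, 0, ...]
        return [1 if i % moe_layer_freq == 0 else 0 for i in range(num_layers)]
    assert len(moe_layer_freq) == num_layers, "explicit pattern must cover all layers"
    return list(moe_layer_freq)
```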

core.models.gpt.experimental_attention_variant_module_specs.get_linear_attention_pattern(
config: megatron.core.transformer.transformer_config.TransformerConfig,
) → List[int]#

Parse config.linear_attention_freq to get per-layer attention pattern (1=LA, 0=SDPA).

  • int N: one SDPA layer every N layers (e.g., N=4 -> [1,1,1,0,1,1,1,0,…])

  • list: use directly as the pattern.
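The attention pattern follows the inverse convention of the MoE pattern: an integer N marks every N-th layer as SDPA rather than as the variant. A sketch under the same assumption that `num_layers` is passed in:

```python
from typing import List, Union

def linear_attention_pattern(
    linear_attention_freq: Union[int, List[int]], num_layers: int
) -> List[int]:
    """Sketch of the linear_attention_freq expansion rule (1 = LA, 0 = SDPA)."""
    if isinstance(linear_attention_freq, int):
        # One SDPA layer every N layers: N=4 -> [1, 1, 1, 0, 1, 1, 1, 0, ...]
        return [
            0 if (i + 1) % linear_attention_freq == 0 else 1
            for i in range(num_layers)
        ]
    assert len(linear_attention_freq) == num_layers, "explicit pattern must cover all layers"
    return list(linear_attention_freq)
```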

core.models.gpt.experimental_attention_variant_module_specs._get_backend_spec_provider(
config: megatron.core.transformer.transformer_config.TransformerConfig,
) → megatron.core.models.backends.BackendSpecProvider#

Get backend spec provider for experimental attention variant.

core.models.gpt.experimental_attention_variant_module_specs._get_self_attention_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Get the non-experimental self-attention module spec, for hybrid models that mix experimental and non-experimental attention architectures.

Warning: This function may be deprecated in the future.

core.models.gpt.experimental_attention_variant_module_specs._get_dense_mlp_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Get the dense MLP module spec, for hybrid models that mix dense MLP and experimental attention architectures.

Warning: This function may be deprecated in the future.

core.models.gpt.experimental_attention_variant_module_specs._get_moe_module_spec(
config: megatron.core.transformer.transformer_config.TransformerConfig,
backend: megatron.core.models.backends.BackendSpecProvider = None,
) → megatron.core.transformer.spec_utils.ModuleSpec#

Get the MoE module spec, for hybrid models that mix MoE and experimental attention architectures.

Warning: This function may be deprecated in the future.