core.models.gpt.experimental_attention_variant_module_specs#
Module Contents#
Functions#
| Function | Description |
|---|---|
| `get_gated_delta_net_module_spec` | Build module spec for GatedDeltaNet attention. |
| `get_dsa_module_spec_for_backend` | Helper function to get module spec for Sparse Attention. |
| `get_experimental_attention_variant_module_spec` | Helper function to get module spec for experimental attention variant. |
| `get_transformer_block_with_experimental_attention_variant_spec` | Build transformer block spec with experimental attention variants (e.g., linear attention). |
| `is_linear_attention_variant` | Check if the experimental attention variant is a linear attention variant. |
| `get_moe_layer_pattern` | Parse `config.moe_layer_freq` to get per-layer MoE pattern (1=MoE, 0=dense). |
| `get_linear_attention_pattern` | Parse `config.linear_attention_freq` to get per-layer attention pattern (1=LA, 0=SDPA). |
| `_get_backend_spec_provider` | Get backend spec provider for experimental attention variant. |
| `_get_self_attention_module_spec` | Get non-experimental self-attention module spec, for hybrid models that mix experimental and non-experimental attention architectures. |
| `_get_dense_mlp_module_spec` | Get dense MLP module spec, for hybrid models that mix dense MLP and experimental attention architectures. |
| `_get_moe_module_spec` | Get MoE module spec, for hybrid models that mix MoE and experimental attention architectures. |
API#
- core.models.gpt.experimental_attention_variant_module_specs.get_gated_delta_net_module_spec(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- backend: megatron.core.models.backends.BackendSpecProvider = None,
- )
Build module spec for GatedDeltaNet attention.
- core.models.gpt.experimental_attention_variant_module_specs.get_dsa_module_spec_for_backend(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- backend: megatron.core.models.backends.BackendSpecProvider = None,
- )
Helper function to get module spec for Sparse Attention.
- core.models.gpt.experimental_attention_variant_module_specs.get_experimental_attention_variant_module_spec(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- backend: megatron.core.models.backends.BackendSpecProvider = None,
- )
Helper function to get module spec for experimental attention variant.
- core.models.gpt.experimental_attention_variant_module_specs.get_transformer_block_with_experimental_attention_variant_spec(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- vp_stage: Optional[int] = None,
- pp_rank: Optional[int] = None,
- )
Build transformer block spec with experimental attention variants (e.g., linear attention).
This function constructs a heterogeneous transformer block that supports mixing different attention mechanisms (experimental vs. standard) and MLP types (MoE vs. dense) across layers. Note that this is an experimental API in the short term and may be deprecated; in the long run, we will move to a new design that better supports hybrid models.
Key Design:
1. Attention and MLP Patterns: The attention pattern and MLP pattern are orthogonal and determined independently. This allows flexible combinations (e.g., linear attention with MoE, or standard attention with dense MLP).
   - Attention pattern: derived from `config.linear_attention_freq` or `config.experimental_attention_variant`.
   - MLP pattern: derived from `config.moe_layer_freq`.
2. Per-Layer Spec Construction: Iterates through layers, constructing transformer layer specs based on the attention and MLP patterns.
3. Pipeline Slicing: Extracts layer specs for the current pipeline stage.
- Parameters:
config – Transformer configuration containing model hyperparameters and feature flags.
vp_stage – Virtual pipeline stage index for interleaved pipeline parallelism.
pp_rank – Pipeline model parallel rank.
- Returns:
TransformerBlockSubmodules containing per-layer specs and final layer norm.
Note: Currently only supports the transformer_engine backend. The Kitchen backend can be used as a wrapper with a TE fallback for unsupported operations.
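The orthogonal-pattern design can be sketched as follows. This is an illustrative sketch only; the helper name and the kind labels are hypothetical, not Megatron's actual API:

```python
# Illustrative sketch of the key design: the attention pattern and the
# MoE pattern are independent per-layer flags, so each layer's kind is
# chosen from their combination. Names here are hypothetical.
def build_layer_kinds(attention_pattern, moe_pattern):
    assert len(attention_pattern) == len(moe_pattern)
    kinds = []
    for la, moe in zip(attention_pattern, moe_pattern):
        attn = "linear_attention" if la else "self_attention"
        mlp = "moe" if moe else "dense_mlp"
        kinds.append((attn, mlp))
    return kinds

# Pipeline slicing would then extract the sub-list of layer specs owned
# by the current pipeline stage, e.g. kinds[start_layer:end_layer].
```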
- core.models.gpt.experimental_attention_variant_module_specs.is_linear_attention_variant(
- experimental_attention_variant: Optional[str],
- )
Check if the experimental attention variant is a linear attention variant.
- core.models.gpt.experimental_attention_variant_module_specs.get_moe_layer_pattern(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- )
Parse config.moe_layer_freq to get per-layer MoE pattern (1=MoE, 0=dense).
int N: one MoE layer every N layers (e.g., N=2 -> [1,0,1,0,…])
list: use directly as the pattern.
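As a hedged sketch (not the actual Megatron implementation), the parsing rule described above amounts to:

```python
def moe_layer_pattern_sketch(moe_layer_freq, num_layers):
    """Replicates the documented rule: int N -> one MoE layer every N
    layers; a list is used directly as the per-layer pattern."""
    if isinstance(moe_layer_freq, int):
        # 1 = MoE, 0 = dense; e.g. N=2 over 4 layers -> [1, 0, 1, 0]
        return [1 if i % moe_layer_freq == 0 else 0 for i in range(num_layers)]
    return list(moe_layer_freq)
```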
- core.models.gpt.experimental_attention_variant_module_specs.get_linear_attention_pattern(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- )
Parse config.linear_attention_freq to get per-layer attention pattern (1=LA, 0=SDPA).
int N: one SDPA layer every N layers (e.g., N=4 -> [1,1,1,0,1,1,1,0,…])
list: use directly as the pattern.
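Analogously, a hedged sketch of this parsing rule (again not the actual Megatron implementation):

```python
def linear_attention_pattern_sketch(linear_attention_freq, num_layers):
    """Replicates the documented rule: int N -> one SDPA layer every N
    layers (the rest use linear attention); a list is used directly."""
    if isinstance(linear_attention_freq, int):
        # 1 = linear attention, 0 = SDPA;
        # e.g. N=4 over 8 layers -> [1, 1, 1, 0, 1, 1, 1, 0]
        return [0 if (i + 1) % linear_attention_freq == 0 else 1
                for i in range(num_layers)]
    return list(linear_attention_freq)
```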
- core.models.gpt.experimental_attention_variant_module_specs._get_backend_spec_provider(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- )
Get backend spec provider for experimental attention variant.
- core.models.gpt.experimental_attention_variant_module_specs._get_self_attention_module_spec(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- backend: megatron.core.models.backends.BackendSpecProvider = None,
- )
Get non-experimental self-attention module spec. For hybrid models that mix experimental and non-experimental attention architectures.
Warning: This function may be deprecated in the future.
- core.models.gpt.experimental_attention_variant_module_specs._get_dense_mlp_module_spec(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- backend: megatron.core.models.backends.BackendSpecProvider = None,
- )
Get dense MLP module spec. For hybrid models that mix dense MLP and experimental attention architectures.
Warning: This function may be deprecated in the future.
- core.models.gpt.experimental_attention_variant_module_specs._get_moe_module_spec(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- backend: megatron.core.models.backends.BackendSpecProvider = None,
- )
Get MoE module spec. For hybrid models that mix MoE and experimental attention architectures.
Warning: This function may be deprecated in the future.