core.transformer.transformer_layer#
Module Contents#
Classes#
| Class | Description |
|---|---|
| `TransformerLayerSubmodules` | Configuration class for specifying the submodules of a transformer layer. |
| `BaseTransformerLayer` | A common parent class for `TransformerLayer`-like implementations. |
| `TransformerLayer` | A single transformer layer. |
| `MoETransformerLayer` | A Transformer layer specialized for Mixture-of-Experts (MoE) architectures. |
Functions#
| Function | Description |
|---|---|
| `get_transformer_layer_offset` | Get the index offset of the current pipeline stage, given the level of pipelining. |
Data#
API#
- core.transformer.transformer_layer.logger#
`getLogger(…)`
- core.transformer.transformer_layer.get_transformer_layer_offset(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- vp_stage: Optional[int] = None,
- pp_rank: Optional[int] = None,
Get the index offset of the current pipeline stage, given the level of pipelining.
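A minimal usage sketch (hedged): the import paths assume the `megatron.core` package layout, the `TransformerConfig` fields shown are standard required arguments, and the way the offset feeds into a 1-based global layer number mirrors how `TransformerLayer` numbers its layers, but treat the exact arithmetic as illustrative.

```python
# Hedged usage sketch: turn a stage-local layer index into a global layer number.
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_layer import get_transformer_layer_offset

config = TransformerConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)

# Offset of the first layer hosted by pipeline rank 0 (no virtual pipelining).
# Passing pp_rank explicitly avoids querying the global parallel state (assumption).
offset = get_transformer_layer_offset(config, vp_stage=None, pp_rank=0)

local_layer_index = 3                                   # 0-based index within this stage
global_layer_number = offset + local_layer_index + 1    # layer_number is 1-based
```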
- class core.transformer.transformer_layer.TransformerLayerSubmodules#
Configuration class for specifying the submodules of a transformer layer.
This class defines the structure and default implementations for various components of a transformer layer, allowing for flexible customization of the layer’s architecture.
- Parameters:
input_layernorm (Union[ModuleSpec, type]) – Specification for the input layer normalization.
self_attention (Union[ModuleSpec, type]) – Specification for the self-attention mechanism.
self_attn_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after self-attention.
pre_cross_attn_layernorm (Union[ModuleSpec, type]) – Specification for the layer normalization before cross-attention.
cross_attention (Union[ModuleSpec, type]) – Specification for the cross-attention mechanism.
cross_attn_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after cross-attention.
pre_mlp_layernorm (Union[ModuleSpec, type]) – Specification for the layer normalization before the MLP.
mlp (Union[ModuleSpec, type]) – Specification for the MLP in a dense layer.
mlp_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after the MLP.
sharded_state_dict_keys_map (Dict[str, str]) – Mapping for sharded tensor keys to be applied in the `sharded_state_dict` method.
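As a hedged illustration of how these fields are typically filled in, the sketch below builds a layer spec around `TransformerLayerSubmodules` via `ModuleSpec`. The concrete attention, MLP, and norm classes named here are assumptions drawn from common Megatron-Core layer specs; substitute the implementations your model actually uses.

```python
# Hedged sketch of a layer spec built from TransformerLayerSubmodules.
# The concrete modules chosen below are typical Megatron-Core choices, not the
# only valid ones; WrappedTorchNorm in particular is an assumption.
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_layer import (
    TransformerLayer,
    TransformerLayerSubmodules,
)
from megatron.core.transformer.attention import SelfAttention, SelfAttentionSubmodules
from megatron.core.transformer.dot_product_attention import DotProductAttention
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.enums import AttnMaskType
from megatron.core.transformer.torch_norm import WrappedTorchNorm
from megatron.core.tensor_parallel.layers import ColumnParallelLinear, RowParallelLinear
from megatron.core.fusions.fused_bias_dropout import get_bias_dropout_add

layer_spec = ModuleSpec(
    module=TransformerLayer,
    submodules=TransformerLayerSubmodules(
        input_layernorm=WrappedTorchNorm,
        self_attention=ModuleSpec(
            module=SelfAttention,
            params={"attn_mask_type": AttnMaskType.causal},
            submodules=SelfAttentionSubmodules(
                linear_qkv=ColumnParallelLinear,
                core_attention=DotProductAttention,
                linear_proj=RowParallelLinear,
            ),
        ),
        self_attn_bda=get_bias_dropout_add,
        pre_mlp_layernorm=WrappedTorchNorm,
        mlp=ModuleSpec(
            module=MLP,
            submodules=MLPSubmodules(
                linear_fc1=ColumnParallelLinear,
                linear_fc2=RowParallelLinear,
            ),
        ),
        mlp_bda=get_bias_dropout_add,
    ),
)
```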
- input_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- self_attention: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- self_attn_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- pre_cross_attn_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- cross_attention: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- cross_attn_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- pre_mlp_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- mlp: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- mlp_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- sharded_state_dict_keys_map: Dict[str, str]#
`field(…)`
- class core.transformer.transformer_layer.BaseTransformerLayer#
Bases: `abc.ABC`
A common parent class for `TransformerLayer`-like implementations.
A dummy class that is subclassed by similar `TransformerLayer`s, e.g. the `TransformerLayer` in this file and possibly other `TransformerLayer` implementations that aim to use `TransformerBlock` as the base module. The main purpose is to check whether any layer (or module) provided in the spec is a subclass of this class, which allows fanning out that spec to all the layers in the `TransformerBlock`. See the `_get_block_submodules` method implementation in `transformer_block.py` for more details.
Initialization
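A tiny hedged sketch of the subclass check this marker class enables:

```python
# Hedged sketch: the fan-out decision reduces to a subclass test against this
# marker base class (see _get_block_submodules in transformer_block.py).
from megatron.core.transformer.transformer_layer import BaseTransformerLayer, TransformerLayer

assert issubclass(TransformerLayer, BaseTransformerLayer)
```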
- class core.transformer.transformer_layer.TransformerLayer(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- submodules: core.transformer.transformer_layer.TransformerLayerSubmodules,
- layer_number: int = 1,
- hidden_dropout: Optional[float] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- vp_stage: Optional[int] = None,
Bases: `megatron.core.transformer.module.GraphableMegatronModule`, `core.transformer.transformer_layer.BaseTransformerLayer`
A single transformer layer.
The transformer layer takes input of size [s, b, h] and returns an output of the same size.
Initialization
- create_mcore_cudagraph_manager(config)#
Register the transformer layer for cudagraphs.
- static _get_layer_offset(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
Get the layer offset for the current pipeline stage.
Deprecated: please use `get_transformer_layer_offset` instead.
- forward(*args, **kwargs)#
Perform a forward pass through the transformer layer.
This method calls the core computation of a transformer layer, including self-attention, cross-attention (if applicable), and feed-forward operations.
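A hedged, shape-oriented sketch of this contract; constructing the layer itself is elided because it needs a `TransformerConfig` and a submodule spec (see the `TransformerLayerSubmodules` sketch above), and the `(output, context)` return pair is an assumption based on the attention/MLP helpers documented below.

```python
# Hedged shape sketch: a TransformerLayer maps [s, b, h] -> [s, b, h].
import torch

s, b, h = 2048, 4, 1024                 # sequence length, micro-batch size, hidden size
hidden_states = torch.randn(s, b, h)

# layer = TransformerLayer(config=config, submodules=layer_spec.submodules)  # hypothetical setup
# output, context = layer(hidden_states=hidden_states, attention_mask=None)  # assumed return pair
# assert output.shape == (s, b, h)
```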
- _forward_attention(
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- context: Optional[torch.Tensor] = None,
- context_mask: Optional[torch.Tensor] = None,
- rotary_pos_emb: Optional[torch.Tensor] = None,
- rotary_pos_cos: Optional[torch.Tensor] = None,
- rotary_pos_sin: Optional[torch.Tensor] = None,
- rotary_pos_cos_sin: Optional[torch.Tensor] = None,
- attention_bias: Optional[torch.Tensor] = None,
- inference_context: Optional[Any] = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- sequence_len_offset: Optional[torch.Tensor] = None,
- padding_mask: Optional[torch.Tensor] = None,
- *,
- inference_params: Optional[Any] = None,
Perform a forward pass through the attention layer and the layernorms before and after the attention operations.
- Parameters:
hidden_states (Tensor) – Input tensor of shape [s, b, h] where s is sequence length, b is batch size, and h is hidden size.
attention_mask (Tensor) – Mask tensor for self-attention.
context (Tensor, optional) – Context tensor for cross-attention.
context_mask (Tensor, optional) – Mask tensor for cross-attention.
rotary_pos_emb (Tensor, optional) – Rotary positional embeddings.
rotary_pos_cos (Optional[Tensor]) – Rotary embedding cosine.
rotary_pos_sin (Optional[Tensor]) – Rotary embedding sine.
rotary_pos_cos_sin (Optional[Tensor]) – Combined rotary embedding cosine and sine for RoPE. (Currently used exclusively for inference with dynamic batching and flashinfer.)
attention_bias (Tensor, optional) – Bias tensor for Q * K.T.
inference_context (object, optional) – Parameters for inference-time optimizations.
packed_seq_params (object, optional) – Parameters for packed sequence processing.
sequence_len_offset (Tensor, optional) – Offset along sequence dimension during inference.
- Returns:
A tuple containing: hidden_states (Tensor): Transformed hidden states before the MLP layernorm. context (Tensor): Updated context tensor if cross-attention is used, otherwise None.
- Return type:
Tuple[Tensor, Tensor]
- _forward_pre_mlp_layernorm(hidden_states)#
- _forward_mlp(hidden_states, inference_context=None, padding_mask=None)#
Perform a forward pass through the feed-forward layer.
- Parameters:
hidden_states (Tensor) – Transformed hidden states before the MLP layernorm. Shape [seq_length, batch_size, hidden_size].
inference_context – Inference context for optimizations.
padding_mask (Tensor, optional) – Padding mask for MoE routing. Shape [bsz, seq_length]. True = padding (exclude), False = valid (include). Only used for MoE layers to exclude padding tokens from aux loss computations. The MoELayer will internally transform this to [seq_length, bsz] format.
- Returns:
Transformed hidden states of shape [s, b, h].
- Return type:
output (Tensor)
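A hedged sketch of building the [bsz, seq_length] padding mask described above from per-sequence valid lengths:

```python
# Hedged sketch: build a [bsz, seq_length] padding mask (True = padding) from
# per-sequence valid token counts, matching the convention documented above.
import torch

seq_length, bsz = 8, 2
valid_lengths = torch.tensor([8, 5])                    # valid tokens per sequence
positions = torch.arange(seq_length).unsqueeze(0)       # [1, seq_length]
padding_mask = positions >= valid_lengths.unsqueeze(1)  # [bsz, seq_length]; True tokens are excluded from aux loss
```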
- _forward_post_mlp(mlp_output_with_bias, residual)#
Perform operations after the MLP computation.
- Parameters:
mlp_output_with_bias (Tensor) – Output tensor of the MLP layer with bias.
residual (Tensor) – Residual tensor.
- Returns:
Transformed hidden states of shape [s, b, h].
- Return type:
output (Tensor)
- sharded_state_dict(
- prefix: str = '',
- sharded_offsets: tuple = (),
- metadata: Optional[dict] = None,
Generate a sharded state dictionary for the transformer layer.
- Parameters:
prefix (str, optional) – Prefix to be added to all keys in the state dict.
sharded_offsets (tuple, optional) – Tuple of sharding offsets.
metadata (Optional[dict], optional) – Additional metadata for sharding.
- Returns:
A dictionary containing the sharded state of the transformer layer.
- Return type:
ShardedStateDict
- configure_fused_tp_inference(
- skip_qkv_norm_and_all_gather: bool = False,
- fc2_next_layer_norm_weights: Optional[torch.Tensor] = None,
Configure settings for fused TP communication in inference mode.
- Parameters:
skip_qkv_norm_and_all_gather (bool) – Whether to skip norm and all-gather for linear_qkv.
fc2_next_layer_norm_weights (Optional[Tensor]) – Next layer’s QKV norm weights for current layer’s MLP FC2.
- _set_proj_next_layer_norm_weights(weights: torch.Tensor)#
Set next layer norm weights for attention/mixer’s linear_proj.
- _set_fc2_next_layer_norm_weights(
- weights: Optional[torch.Tensor],
Set next layer norm weights for MLP FC2.
- _set_proj_residual(residual: torch.Tensor)#
Set residual for attention’s/mixer’s out_proj (linear_proj).
- _set_fc2_residual(residual: torch.Tensor)#
Set residual for MLP FC2.
- get_mlp_layer_norm_weights() torch.Tensor#
Get the MLP FC1 layer norm weights.
- Returns:
The layer norm weight data.
- Return type:
Tensor
- get_qkv_layer_norm_weights() torch.Tensor#
Get the QKV layer norm weights.
- Returns:
The layer norm weight data.
- Return type:
Tensor
- get_layer_static_inputs(seq_length, micro_batch_size)#
Get the static inputs for the transformer layer. Besides the hidden_states generated in GraphableMegatronModule, the attention_mask is also added.
- Returns:
A dictionary containing the static inputs for the layer.
- Return type:
Dict[str, torch.Tensor]
- _get_submodules_under_cudagraphs()#
Get the submodules that are covered by cudagraphs.
- _te_cuda_graph_capture(*args, **kwargs)#
CUDA graph capture for this layer using the TE interface. There are some differences from the normal pass:
- In some conditions the CUDA graph cannot cover the entire layer. The `cuda_graph_scope` attribute can be set to control the scope of the CUDA graph.
- If context is None, it cannot be returned as output.
- _te_cuda_graph_replay(*args, **kwargs)#
CUDA graph replay for this layer and microbatch `self.current_microbatch` using the TE interface. TransformerEngine versions >= 1.10 allow keyword arguments with CUDA graph. However, CUDA graph accepts only Tensor inputs. Hence, `inference_context` and `packed_seq_params` are excluded from the input list.
- _get_te_cuda_graph_replay_args(*args, **kwargs)#
Helper function to get tensor arguments for TE CUDA graph.
- _should_call_local_cudagraph(*args, **kwargs)#
Check if we should call the local cudagraph path.
- __call__(*args, **kwargs)#
- get_layer_norm_weights()#
Get the weights of all layernorms (attention and MLP) in the transformer layer.
- Returns:
A list of layernorm weight tensors.
- Return type:
List[Tensor]
- class core.transformer.transformer_layer.MoETransformerLayer(*args, **kwargs)#
Bases: `core.transformer.transformer_layer.TransformerLayer`
A Transformer layer specialized for Mixture-of-Experts (MoE) architectures.
Implements specific functionality to support CUDA graph capture for MoE layers. Due to the dynamic nature of MoE, capturing the entire layer in a single CUDA graph can be challenging. This class supports “partial” CUDA graphs by decomposing the MLP forward pass into router, expert-compute, and post-process stages.
Initialization
- create_mcore_cudagraph_manager(config)#
Initializes the CUDA graph manager(s) for the MoE layer.
Unlike the standard layer, which typically uses a single manager, this method can configure multiple graph managers if partial CUDA graphs are enabled via `cuda_graph_scope`. This allows capturing the static parts of the MoE pass while leaving the expert computation to execute eagerly.
- _forward_mlp_router(hidden_states, padding_mask=None)#
Executes the router phase of the MoE block.
This includes the pre-MLP layernorm and the routing logic. This method is isolated so it can be captured by `cudagraph_manager_router`.
- _forward_mlp_expert_compute(hidden_states, probs)#
Executes the actual computation of the experts.
This phase takes the routing information and inputs, dispatches them to the appropriate experts, and computes the results. In partial graph modes, this step runs eagerly between the router and postprocess graph replays.
- _forward_mlp_postprocess(
- residual,
- output,
- shared_expert_output,
- mlp_bias,
Executes the post-processing phase of the MoE block.
Handles combining the expert outputs, applying biases, re-registering activation recomputation hooks if necessary, and performing the final Bias-Dropout-Add. This method is isolated so it can be captured by cudagraphs.
- _forward_mlp(hidden_states, inference_context=None, padding_mask=None)#
Orchestrates the MLP forward pass, handling partial CUDA graph execution logic.
If `use_partial_cudagraphs` is True, this method stitches together the router, expert-compute, and postprocess calls.
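The decomposition can be summarized with a heavily hedged sketch: the phase methods and their parameters follow the entries documented above, but the return values unpacked below are assumptions made purely for illustration, not the actual contract.

```python
# Heavily hedged sketch of the partial-CUDA-graph MLP orchestration.
# Method names and parameters mirror the documentation above; the unpacked
# return values are assumptions for illustration only.
def moe_mlp_forward_sketch(layer, hidden_states, padding_mask=None):
    # Router phase (graph-capturable): pre-MLP layernorm + routing.
    probs, routed_input, residual = layer._forward_mlp_router(  # assumed returns
        hidden_states, padding_mask=padding_mask
    )

    # Expert compute: dispatch tokens to the selected experts and run them.
    # In partial graph modes this step executes eagerly between graph replays.
    expert_output, shared_expert_output, mlp_bias = layer._forward_mlp_expert_compute(  # assumed returns
        routed_input, probs
    )

    # Post-process phase (graph-capturable): combine expert outputs, apply bias,
    # and perform the final bias-dropout-add against the residual.
    return layer._forward_mlp_postprocess(residual, expert_output, shared_expert_output, mlp_bias)
```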