core.transformer.transformer_layer#
Module Contents#
Classes#
- TransformerLayerSubmodules – Configuration class for specifying the submodules of a transformer layer.
- BaseTransformerLayer – A common parent class for TransformerLayer-like implementations.
- TransformerLayer – A single transformer layer.
Functions#
- get_transformer_layer_offset – Get the index offset of the current pipeline stage, given the level of pipelining.
Data#
API#
- core.transformer.transformer_layer.logger#
‘getLogger(…)’
- core.transformer.transformer_layer.get_transformer_layer_offset(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- vp_stage: Optional[int] = None,
- pp_rank: Optional[int] = None,
- )#
Get the index offset of the current pipeline stage, given the level of pipelining.
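A minimal sketch of calling it directly, with pp_rank passed explicitly so no initialized parallel state is assumed (all config values below are illustrative):

```python
# Illustrative sketch: config values are arbitrary, and pp_rank is passed
# explicitly instead of being read from Megatron's initialized parallel state.
import torch

from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_layer import get_transformer_layer_offset

config = TransformerConfig(
    num_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    pipeline_model_parallel_size=4,   # 6 layers per pipeline stage
    pipeline_dtype=torch.bfloat16,
)

# Global index of the first layer owned by pipeline stage 1
# (expected to be 6 with an even split of 24 layers over 4 stages).
offset = get_transformer_layer_offset(config, vp_stage=None, pp_rank=1)
```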
- class core.transformer.transformer_layer.TransformerLayerSubmodules#
Configuration class for specifying the submodules of a transformer layer.
This class defines the structure and default implementations for various components of a transformer layer, allowing for flexible customization of the layer’s architecture.
- Parameters:
input_layernorm (Union[ModuleSpec, type]) – Specification for the input layer normalization.
self_attention (Union[ModuleSpec, type]) – Specification for the self-attention mechanism.
self_attn_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after self-attention.
pre_cross_attn_layernorm (Union[ModuleSpec, type]) – Specification for the layer normalization before cross-attention.
cross_attention (Union[ModuleSpec, type]) – Specification for the cross-attention mechanism.
cross_attn_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after cross-attention.
pre_mlp_layernorm (Union[ModuleSpec, type]) – Specification for the layer normalization before the MLP.
mlp (Union[ModuleSpec, type]) – Specification for the MLP in a Dense (non-MoE) layer.
mlp_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after the MLP.
sharded_state_dict_keys_map (Dict[str, str]) – Mapping for sharded tensor keys to be applied in the sharded_state_dict method.
- input_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- self_attention: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- self_attn_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- pre_cross_attn_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- cross_attention: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- cross_attn_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- pre_mlp_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- mlp: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- mlp_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- sharded_state_dict_keys_map: Dict[str, str]#
‘field(…)’
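As a hedged sketch of how these fields are usually populated, the prebuilt GPT layer spec (the helper name below is assumed from megatron.core.models.gpt) wraps a TransformerLayerSubmodules whose fields can be inspected or overridden:

```python
# Sketch: inspect the TransformerLayerSubmodules carried by the built-in GPT
# layer spec. The helper get_gpt_layer_local_spec is assumed to be available
# under megatron.core.models.gpt.gpt_layer_specs in your version.
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

layer_spec = get_gpt_layer_local_spec()   # ModuleSpec wrapping TransformerLayer
submodules = layer_spec.submodules        # TransformerLayerSubmodules instance

print(type(submodules).__name__)          # TransformerLayerSubmodules
print(submodules.self_attention)          # ModuleSpec for the self-attention block
print(submodules.mlp)                     # ModuleSpec (or type) for the MLP

# TransformerLayerSubmodules is a dataclass, so single fields can be swapped,
# e.g. dataclasses.replace(submodules, mlp=my_custom_mlp_spec) for a custom MLP.
```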
- class core.transformer.transformer_layer.BaseTransformerLayer#
Bases: abc.ABC
A common parent class for TransformerLayer-like implementations.
A dummy class that is subclassed by similar TransformerLayers, e.g. the TransformerLayer in this file and possibly other TransformerLayer implementations that aim to use TransformerBlock as the base module. The main purpose is to check if any layer (or module) provided in the spec is a subclass of this class, to allow fanning-out of that spec for all the layers in the TransformerBlock. See the _get_block_submodules method implementation in the transformer_block.py file for more details.
Initialization
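A minimal sketch of that check (the helper below is illustrative and not part of megatron.core):

```python
# Illustrative helper, not part of megatron.core: expand a single layer spec
# into one spec per layer when its module subclasses BaseTransformerLayer,
# mirroring the fan-out behaviour described above.
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_layer import BaseTransformerLayer


def fan_out_layer_spec(spec: ModuleSpec, num_layers: int):
    if isinstance(spec.module, type) and issubclass(spec.module, BaseTransformerLayer):
        return [spec] * num_layers
    return spec
```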
- class core.transformer.transformer_layer.TransformerLayer(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- submodules: core.transformer.transformer_layer.TransformerLayerSubmodules,
- layer_number: int = 1,
- hidden_dropout: Optional[float] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- vp_stage: Optional[int] = None,
- )#
Bases: megatron.core.transformer.module.GraphableMegatronModule, core.transformer.transformer_layer.BaseTransformerLayer
A single transformer layer.
Transformer layer takes input with size [s, b, h] and returns an output of the same size.
Initialization
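A minimal construction sketch, assuming torch.distributed and Megatron's model-parallel groups are already initialized and that the GPT layer spec helper below is available under that name:

```python
# Sketch only: constructing a layer requires torch.distributed and Megatron's
# model-parallel groups to be initialized beforehand (not shown here), and the
# GPT layer spec helper is assumed from megatron.core.models.gpt.
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_layer import TransformerLayer

config = TransformerConfig(num_layers=2, hidden_size=512, num_attention_heads=8)
layer_spec = get_gpt_layer_local_spec()

layer = TransformerLayer(
    config=config,
    submodules=layer_spec.submodules,
    layer_number=1,            # 1-based global layer index
)
```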
- static _get_layer_offset(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- )#
Get the layer offset for the current pipeline stage.
Deprecated: please use get_transformer_layer_offset instead.
- forward(*args, **kwargs)#
Perform a forward pass through the transformer layer.
This method calls the core computation of a transformer layer, including self-attention, cross-attention (if applicable), and feed-forward operations.
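A usage sketch, reusing the layer built in the construction example above (shapes and mask value are illustrative):

```python
# Sketch: run the layer on a dummy [s, b, h] batch. `layer` is assumed to come
# from the construction sketch above; shapes, device, and dtype are illustrative.
import torch

s, b, h = 128, 2, 512          # sequence length, batch size, hidden size
hidden_states = torch.randn(s, b, h)
attention_mask = None          # or a boolean mask for self-attention

output, context = layer(hidden_states=hidden_states, attention_mask=attention_mask)
print(output.shape)            # same [s, b, h] size as the input
```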
- _forward_attention(
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- context: Optional[torch.Tensor] = None,
- context_mask: Optional[torch.Tensor] = None,
- rotary_pos_emb: Optional[torch.Tensor] = None,
- rotary_pos_cos: Optional[torch.Tensor] = None,
- rotary_pos_sin: Optional[torch.Tensor] = None,
- rotary_pos_cos_sin: Optional[torch.Tensor] = None,
- attention_bias: Optional[torch.Tensor] = None,
- inference_context: Optional[Any] = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- sequence_len_offset: Optional[torch.Tensor] = None,
- *,
- inference_params: Optional[Any] = None,
- )#
Perform a forward pass through the attention layer and the layernorms before and after the attention operations.
- Parameters:
hidden_states (Tensor) – Input tensor of shape [s, b, h] where s is sequence length, b is batch size, and h is hidden size.
attention_mask (Tensor) – Mask tensor for self-attention.
context (Tensor, optional) – Context tensor for cross-attention.
context_mask (Tensor, optional) – Mask tensor for cross-attention.
rotary_pos_emb (Tensor, optional) – Rotary positional embeddings.
rotary_pos_cos (Optional[Tensor]) – Rotary embedding cosine.
rotary_pos_sin (Optional[Tensor]) – Rotary embedding sine.
rotary_pos_cos_sin (Optional[Tensor]) – Combined rotary embedding cosine and sine. Currently used exclusively for inference with dynamic batching and flashinfer RoPE.
attention_bias (Tensor, optional) – Bias tensor for Q * K.T.
inference_context (object, optional) – Parameters for inference-time optimizations.
packed_seq_params (object, optional) – Parameters for packed sequence processing.
sequence_len_offset (Tensor, optional) – Offset along sequence dimension during inference.
- Returns:
A tuple containing:
- hidden_states (Tensor): Transformed hidden states before the MLP layernorm.
- context (Tensor): Updated context tensor if cross-attention is used, otherwise None.
- Return type:
Tuple[Tensor, Tensor]
- _forward_mlp(hidden_states, inference_context=None)#
Perform a forward pass through the feed-forward layer.
- Parameters:
hidden_states (Tensor) – Transformed hidden states before the MLP layernorm.
- Returns:
Transformed hidden states of shape [s, b, h].
- Return type:
output (Tensor)
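Conceptually, forward() chains these two helpers; a simplified sketch reusing the layer and inputs from the earlier examples (the real forward pass also wraps them with CUDA-graph, recomputation, and offloading logic):

```python
# Conceptual sketch of the decomposition documented above; the real forward()
# wraps these calls with CUDA-graph, recomputation, and offloading logic.
hidden_states, context = layer._forward_attention(
    hidden_states=hidden_states,
    attention_mask=attention_mask,
)
output = layer._forward_mlp(hidden_states)
```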
- sharded_state_dict(
- prefix: str = '',
- sharded_offsets: tuple = (),
- metadata: Optional[dict] = None,
- )#
Generate a sharded state dictionary for the transformer layer.
- Parameters:
prefix (str, optional) – Prefix to be added to all keys in the state dict.
sharded_offsets (tuple, optional) – Tuple of sharding offsets.
metadata (Optional[dict], optional) – Additional metadata for sharding.
- Returns:
A dictionary containing the sharded state of the transformer layer.
- Return type:
ShardedStateDict
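A usage sketch; the prefix below is a hypothetical module path that depends on where the layer sits in the parent model:

```python
# Sketch: the 'decoder.layers.0.' prefix is hypothetical; the real prefix
# depends on where this layer sits in the parent model's module hierarchy.
sharded_sd = layer.sharded_state_dict(prefix='decoder.layers.0.')
for key in sorted(sharded_sd)[:5]:
    print(key)                 # sharded-tensor keys for this layer's parameters
```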
- get_layer_static_inputs(seq_length, micro_batch_size)#
Get the static inputs for the transformer layer. In addition to the hidden_states generated in GraphableMegatronModule, the attention_mask is also included.
- Returns:
A dictionary containing the static inputs for the layer.
- Return type:
Dict[str, torch.Tensor]
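A sketch of inspecting these static inputs (argument values are illustrative):

```python
# Sketch: argument values are illustrative; per the description above, the
# returned dict is expected to include at least 'hidden_states' and 'attention_mask'.
static_inputs = layer.get_layer_static_inputs(seq_length=128, micro_batch_size=2)
print({name: tensor.shape for name, tensor in static_inputs.items()})
```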
- _get_submodules_under_cudagraphs()#
Get the submodules that are covered by cudagraphs.
- _te_cuda_graph_capture(*args, **kwargs)#
CUDA graph capture for this layer using the TE interface. There are some differences from the normal forward pass:
- In some conditions the CUDA graph cannot cover the entire layer. The cuda_graph_scope attribute can be set to control the scope of the CUDA graph.
- If context is None, it cannot be returned as output.
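A hedged configuration sketch; the flag and value names below are assumptions to verify against TransformerConfig in your megatron.core version:

```python
# Assumed knobs: enable_cuda_graph and cuda_graph_scope are believed to live on
# TransformerConfig, and 'attn' is an assumed scope value; verify both against
# your megatron.core version. They are normally set before building the layer.
config.enable_cuda_graph = True
config.cuda_graph_scope = 'attn'   # capture only the attention part of the layer
```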
- _te_cuda_graph_replay(*args, **kwargs)#
CUDA graph replay for this layer and microbatch self.current_microbatch using the TE interface. TransformerEngine versions >= 1.10 allow keyword arguments with CUDA graphs. However, CUDA graphs accept only Tensor inputs, so inference_context and packed_seq_params are excluded from the input list.
- _get_te_cuda_graph_replay_args(*args, **kwargs)#
Helper function to get tensor arguments for TE CUDA graph.
- _should_call_local_cudagraph(*args, **kwargs)#
Check if we should call the local cudagraph path.
- __call__(*args, **kwargs)#