core.transformer.transformer_layer#

Module Contents#

Classes#

TransformerLayerSubmodules

Configuration class for specifying the submodules of a transformer layer.

BaseTransformerLayer

A common parent class for TransformerLayer-like implementations.

TransformerLayer

A single transformer layer.

Functions#

get_transformer_layer_offset

Get the index offset of the current pipeline stage, given the level of pipelining.

Data#

API#

core.transformer.transformer_layer.logger#

'getLogger(...)'

core.transformer.transformer_layer.get_transformer_layer_offset(
config: megatron.core.transformer.transformer_config.TransformerConfig,
vp_stage: Optional[int] = None,
pp_rank: Optional[int] = None,
)#

Get the index offset of the current pipeline stage, given the level of pipelining.
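A rough usage sketch (the TransformerConfig values are illustrative, and depending on the Megatron-Core version the call may also require Megatron's model-parallel state to be initialized):

```python
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_layer import get_transformer_layer_offset

# Hypothetical config: 24 layers split evenly across 4 pipeline stages.
config = TransformerConfig(
    num_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    pipeline_model_parallel_size=4,
    use_cpu_initialization=True,
)

# Offset of the first layer hosted on pipeline rank 2 (no virtual pipelining).
# When pp_rank is omitted, the current pipeline rank is taken from the
# initialized model-parallel state instead.
offset = get_transformer_layer_offset(config, vp_stage=None, pp_rank=2)

# A layer's global index is its local index within this stage plus the offset.
```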

class core.transformer.transformer_layer.TransformerLayerSubmodules#

Configuration class for specifying the submodules of a transformer layer.

This class defines the structure and default implementations for various components of a transformer layer, allowing for flexible customization of the layer’s architecture.

Parameters:
  • input_layernorm (Union[ModuleSpec, type]) – Specification for the input layer normalization.

  • self_attention (Union[ModuleSpec, type]) – Specification for the self-attention mechanism.

  • self_attn_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after self-attention.

  • pre_cross_attn_layernorm (Union[ModuleSpec, type]) – Specification for the layer normalization before cross-attention.

  • cross_attention (Union[ModuleSpec, type]) – Specification for the cross-attention mechanism.

  • cross_attn_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after cross-attention.

  • pre_mlp_layernorm (Union[ModuleSpec, type]) – Specification for the layer normalization before the MLP.

  • mlp (Union[ModuleSpec, type]) – Specification for the MLP in a dense layer.

  • mlp_bda (Union[ModuleSpec, type]) – Specification for the bias-dropout-add operation after the MLP.

  • sharded_state_dict_keys_map (Dict[str, str]) – Mapping for sharded tensor keys to be applied in the sharded_state_dict method.

input_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

self_attention: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

self_attn_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

pre_cross_attn_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

cross_attention: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

cross_attn_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

pre_mlp_layernorm: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

mlp: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

mlp_bda: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

sharded_state_dict_keys_map: Dict[str, str]#

'field(...)'
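A minimal usage sketch: the prebuilt GPT layer spec in megatron.core.models.gpt.gpt_layer_specs carries a TransformerLayerSubmodules instance, and because the class is a dataclass whose unspecified fields fall back to identity ops, a variant can be derived by overriding only the fields of interest. The customization shown here is illustrative, not a recommended configuration.

```python
from dataclasses import replace

from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_layer import (
    TransformerLayer,
    TransformerLayerSubmodules,
)

# The prebuilt GPT layer spec is a ModuleSpec whose `submodules` field is a
# TransformerLayerSubmodules instance.
layer_spec: ModuleSpec = get_gpt_layer_local_spec()
assert isinstance(layer_spec.submodules, TransformerLayerSubmodules)

# Derive a variant that reuses the existing self-attention and MLP specs but
# clears the sharded-state-dict key remapping (illustrative only).
custom_submodules = replace(layer_spec.submodules, sharded_state_dict_keys_map={})
custom_layer_spec = ModuleSpec(module=TransformerLayer, submodules=custom_submodules)
```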

class core.transformer.transformer_layer.BaseTransformerLayer#

Bases: abc.ABC

A common parent class for TransformerLayer-like implementations.

A placeholder class subclassed by concrete TransformerLayer implementations, e.g. the TransformerLayer in this file and other implementations that use TransformerBlock as the base module. Its main purpose is to let the block check whether a layer (or module) provided in the spec is a subclass of this class, so that the spec can be fanned out to all layers in the TransformerBlock. See the _get_block_submodules method in transformer_block.py for details.

Initialization
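The check described above can be sketched roughly as follows (a simplified illustration, not the actual _get_block_submodules implementation):

```python
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.transformer_layer import BaseTransformerLayer


def is_fannable_layer_spec(spec) -> bool:
    """Roughly the kind of test used to decide whether a block-level spec
    describes a single transformer layer that should be replicated for every
    layer in the TransformerBlock."""
    return (
        isinstance(spec, ModuleSpec)
        and isinstance(spec.module, type)
        and issubclass(spec.module, BaseTransformerLayer)
    )
```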

class core.transformer.transformer_layer.TransformerLayer(
config: megatron.core.transformer.transformer_config.TransformerConfig,
submodules: core.transformer.transformer_layer.TransformerLayerSubmodules,
layer_number: int = 1,
hidden_dropout: Optional[float] = None,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.transformer.module.GraphableMegatronModule, core.transformer.transformer_layer.BaseTransformerLayer

A single transformer layer.

A transformer layer takes an input of size [s, b, h] and returns an output of the same size.

Initialization
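A construction sketch (assumes torch.distributed and Megatron's model-parallel state are already initialized, e.g. via parallel_state.initialize_model_parallel; the config values are illustrative):

```python
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_layer import TransformerLayer

# Illustrative configuration; real models use much larger sizes.
config = TransformerConfig(
    num_layers=2,
    hidden_size=1024,
    num_attention_heads=16,
    use_cpu_initialization=True,
)

# Reuse the prebuilt GPT layer spec; its `submodules` field is a
# TransformerLayerSubmodules instance.
layer_spec = get_gpt_layer_local_spec()
layer = TransformerLayer(
    config=config,
    submodules=layer_spec.submodules,
    layer_number=1,
)
```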

static _get_layer_offset(
config: megatron.core.transformer.transformer_config.TransformerConfig,
)#

Get the layer offset for the current pipeline stage.

Deprecated: please use get_transformer_layer_offset instead.

forward(*args, **kwargs)#

Perform a forward pass through the transformer layer.

This method performs the core computation of a transformer layer: self-attention, cross-attention (if applicable), and the feed-forward (MLP) operations.
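A usage sketch, assuming layer and config were built as in the construction example above; the shapes and the boolean mask convention (True means masked) are placeholders for illustration:

```python
import torch

# Hypothetical shapes: sequence length s, micro-batch size b, hidden size h.
s, b, h = 128, 2, config.hidden_size
hidden_states = torch.randn(s, b, h)

# Causal mask of shape [b, 1, s, s]; True entries are masked out.
causal = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
attention_mask = causal[None, None, :, :].expand(b, 1, s, s)

output, context = layer(hidden_states=hidden_states, attention_mask=attention_mask)
# output: [s, b, h]; context is None unless cross-attention is configured.
```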

_forward_attention(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
context: Optional[torch.Tensor] = None,
context_mask: Optional[torch.Tensor] = None,
rotary_pos_emb: Optional[torch.Tensor] = None,
rotary_pos_cos: Optional[torch.Tensor] = None,
rotary_pos_sin: Optional[torch.Tensor] = None,
rotary_pos_cos_sin: Optional[torch.Tensor] = None,
attention_bias: Optional[torch.Tensor] = None,
inference_context: Optional[Any] = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
sequence_len_offset: Optional[torch.Tensor] = None,
*,
inference_params: Optional[Any] = None,
)#

Perform a forward pass through the attention layer and the layernorms before and after the attention operations.

Parameters:
  • hidden_states (Tensor) – Input tensor of shape [s, b, h] where s is sequence length, b is batch size, and h is hidden size.

  • attention_mask (Tensor) – Mask tensor for self-attention.

  • context (Tensor, optional) – Context tensor for cross-attention.

  • context_mask (Tensor, optional) – Mask tensor for cross-attention.

  • rotary_pos_emb (Tensor, optional) – Rotary positional embeddings.

  • rotary_pos_cos (Optional[Tensor]) – Rotary embedding cosine.

  • rotary_pos_sin (Optional[Tensor]) – Rotary embedding sine.

  • rotary_pos_cos_sin (Optional[Tensor]) – Combined rotary embedding cosine and sine. Currently used exclusively for inference with dynamic batching and FlashInfer RoPE.

  • attention_bias (Tensor, optional) – Bias tensor for Q * K.T.

  • inference_context (object, optional) – Parameters for inference-time optimizations.

  • packed_seq_params (object, optional) – Parameters for packed sequence processing.

  • sequence_len_offset (Tensor, optional) – Offset along sequence dimension during inference.

Returns:

A tuple containing:

  • hidden_states (Tensor) – Transformed hidden states before the MLP layernorm.

  • context (Tensor) – Updated context tensor if cross-attention is used, otherwise None.

Return type:

Tuple[Tensor, Tensor]

_forward_mlp(hidden_states, inference_context=None)#

Perform a forward pass through the feed-forward layer.

Parameters:

hidden_states (Tensor) – Transformed hidden states before the MLP layernorm.

Returns:

Transformed hidden states of shape [s, b, h].

Return type:

output (Tensor)

sharded_state_dict(
prefix: str = '',
sharded_offsets: tuple = (),
metadata: Optional[dict] = None,
) -> megatron.core.dist_checkpointing.mapping.ShardedStateDict#

Generate a sharded state dictionary for the transformer layer.

Parameters:
  • prefix (str, optional) – Prefix to be added to all keys in the state dict.

  • sharded_offsets (tuple, optional) – Tuple of sharding offsets.

  • metadata (Optional[dict], optional) – Additional metadata for sharding.

Returns:

A dictionary containing the sharded state of the transformer layer.

Return type:

ShardedStateDict
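A usage sketch, again assuming layer is the TransformerLayer built above; the prefix shown is a hypothetical position of the layer within a larger model:

```python
# Collect the layer's sharded state for distributed checkpointing.
sharded_sd = layer.sharded_state_dict(prefix="decoder.layers.0.")

# Each value describes how the local shard maps into the global checkpoint.
for key, sharded_item in list(sharded_sd.items())[:5]:
    print(key, type(sharded_item).__name__)
```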

get_layer_static_inputs(seq_length, micro_batch_size)#

Get the static inputs for the transformer layer. In addition to the hidden_states generated in GraphableMegatronModule, the attention_mask is also added.

Returns:

A dictionary containing the static inputs for the layer.

Return type:

Dict[str, torch.Tensor]
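A small sketch of inspecting these static inputs (assumes a CUDA device and the layer built above; the shapes are placeholders, and this is only meaningful when preparing CUDA graph capture):

```python
# Static tensors that CUDA graph capture would use for this layer.
static_inputs = layer.get_layer_static_inputs(seq_length=128, micro_batch_size=2)
print({name: tuple(t.shape) for name, t in static_inputs.items()})
```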

_get_submodules_under_cudagraphs()#

Get the submodules that are covered by cudagraphs.

_te_cuda_graph_capture(*args, **kwargs)#

CUDA graph capture for this layer using the TransformerEngine (TE) interface. There are some differences from the normal forward pass:

  1. In some conditions the CUDA graph cannot cover the entire layer. The cuda_graph_scope attribute can be set to control the scope of the CUDA graph.

  2. If context is None, it cannot be returned as an output.

_te_cuda_graph_replay(*args, **kwargs)#

CUDA graph replay for this layer and microbatch self.current_microbatch, using the TransformerEngine (TE) interface. TransformerEngine versions >= 1.10 allow keyword arguments with CUDA graphs; however, CUDA graphs accept only Tensor inputs, so inference_context and packed_seq_params are excluded from the input list.

_get_te_cuda_graph_replay_args(*args, **kwargs)#

Helper function to get tensor arguments for TE CUDA graph.

_should_call_local_cudagraph(*args, **kwargs)#

Check if we should call the local cudagraph path.

__call__(*args, **kwargs)#