core.transformer.transformer_block#
Module Contents#
Classes#
- `TransformerBlockSubmodules` – Dataclass for specifying the submodules of a transformer block.
- `TransformerBlock` – Transformer class.
Functions#
- `get_num_layers_to_build` – Determine the number of transformer layers to build for the current pipeline stage.
- `_get_block_submodules` – Retrieve or construct TransformerBlockSubmodules based on the provided specification.
Data#
API#
- core.transformer.transformer_block.get_cpu_offload_context#
None
- core.transformer.transformer_block.te_checkpoint#
None
- core.transformer.transformer_block.logger#
‘getLogger(…)’
- core.transformer.transformer_block.get_num_layers_to_build(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- vp_stage: Optional[int] = None,
- pp_rank: Optional[int] = None,
- )#
Determine the number of transformer layers to build for the current pipeline stage.
- Parameters:
config (TransformerConfig) – Configuration object containing transformer model parameters.
vp_stage (Optional[int]) – Virtual pipeline stage number.
pp_rank (Optional[int]) – Pipeline parallel rank.
- Returns:
The number of layers to be built for the current pipeline stage.
- Return type:
int
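A minimal, illustrative call is sketched below. It assumes a single pipeline stage so the result equals `num_layers`; the sizes and the explicit `pp_rank=0` are assumptions made for illustration, not requirements of the API.

```python
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_block import get_num_layers_to_build

# Illustrative config: 12 layers, no pipeline parallelism.
config = TransformerConfig(num_layers=12, hidden_size=512, num_attention_heads=8)

# Passing pp_rank explicitly avoids relying on globally initialized parallel
# state (assumption: with a single pipeline stage, every layer is local).
print(get_num_layers_to_build(config, vp_stage=None, pp_rank=0))  # expected: 12
```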
- class core.transformer.transformer_block.TransformerBlockSubmodules#
Dataclass for specifying the submodules of a transformer block.
This class defines the structure for configuring the layers and normalization within a transformer block, allowing for flexible and customizable architecture designs.
- Parameters:
layer_specs (List[ModuleSpec], optional) – A list of module specifications for the layers within the transformer block. Each specification typically defines a complete transformer layer (e.g., self-attention, feed-forward network).
layer_norm (Optional[Union[ModuleSpec, torch.nn.Module]], optional) – Specification or instance of the layer normalization to be applied.
- layer_specs: List[megatron.core.transformer.spec_utils.ModuleSpec]#
None
- layer_norm: Optional[Union[megatron.core.transformer.spec_utils.ModuleSpec, torch.nn.Module]]#
None
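A minimal sketch of constructing the dataclass by hand. It assumes `get_gpt_layer_local_spec` from the GPT layer specs module as the per-layer `ModuleSpec`; any layer `ModuleSpec` would do, and the layer count of four is arbitrary.

```python
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.transformer_block import TransformerBlockSubmodules

layer_spec = get_gpt_layer_local_spec()  # ModuleSpec for a single GPT layer

submodules = TransformerBlockSubmodules(
    layer_specs=[layer_spec] * 4,  # four identical layers (arbitrary choice)
    layer_norm=None,               # no final layer norm is built when this is None
)
```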
- core.transformer.transformer_block._get_block_submodules(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- spec: Union[core.transformer.transformer_block.TransformerBlockSubmodules, megatron.core.transformer.spec_utils.ModuleSpec],
- vp_stage: Optional[int] = None,
- pp_rank: Optional[int] = None,
- )#
Retrieve or construct TransformerBlockSubmodules based on the provided specification.
- Parameters:
config (TransformerConfig) – Configuration object for the transformer model.
spec (Union[TransformerBlockSubmodules, ModuleSpec]) – Specification for the transformer block submodules. Can be either a TransformerBlockSubmodules instance or a ModuleSpec.
vp_stage (Optional[int]) – Virtual pipeline stage number.
pp_rank (Optional[int]) – Pipeline parallel rank.
- Returns:
The submodules for the transformer block.
- Return type:
TransformerBlockSubmodules
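This helper is private; in practice either form is simply passed as `spec` to `TransformerBlock`, which normalizes it internally. A short sketch of the two accepted forms, reusing the assumptions from the previous example:

```python
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.transformer.transformer_block import TransformerBlockSubmodules

layer_spec = get_gpt_layer_local_spec()

# Form 1: a single per-layer ModuleSpec; it is replicated for every layer
# this pipeline stage builds (see get_num_layers_to_build above).
spec_a = layer_spec

# Form 2: an explicit TransformerBlockSubmodules, one entry per layer.
spec_b = TransformerBlockSubmodules(layer_specs=[layer_spec] * 4)
```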
- class core.transformer.transformer_block.TransformerBlock(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- spec: Union[core.transformer.transformer_block.TransformerBlockSubmodules, megatron.core.transformer.spec_utils.ModuleSpec],
- post_layer_norm: bool = True,
- pre_process: bool = True,
- post_process: bool = True,
- pg_collection: megatron.core.process_groups_config.ProcessGroupCollection = None,
- vp_stage: Optional[int] = None,
- )#
Bases: megatron.core.transformer.module.GraphableMegatronModule, megatron.core.transformer.module.MegatronModule
Transformer class.
Initialization
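A minimal single-process, single-GPU construction sketch. The distributed setup, the local GPT layer spec, and the tiny sizes are assumptions made for illustration, not prescribed by this class; it also assumes MASTER_ADDR/MASTER_PORT are set and a CUDA device is available.

```python
import torch
from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.transformer.transformer_block import TransformerBlock
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

# Single-rank distributed and model-parallel initialization.
torch.distributed.init_process_group(backend="nccl", world_size=1, rank=0)
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1, pipeline_model_parallel_size=1
)
model_parallel_cuda_manual_seed(1234)  # seed the tensor-parallel RNG tracker

config = TransformerConfig(
    num_layers=2, hidden_size=64, num_attention_heads=4, use_cpu_initialization=True
)

# A single per-layer ModuleSpec is sufficient; it is replicated for every
# layer owned by this pipeline stage.
block = TransformerBlock(config=config, spec=get_gpt_layer_local_spec()).cuda()
```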
- _build_layers()#
- _get_layer(layer_number: int)#
- _checkpointed_forward(
- hidden_states: torch.Tensor,
- attention_mask: torch.Tensor,
- context: torch.Tensor,
- context_mask: torch.Tensor,
- rotary_pos_emb: torch.Tensor,
- attention_bias: torch.Tensor,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams,
- use_inner_quantization_context: bool,
- )#
Forward method with activation checkpointing.
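This method is called internally by `forward()` when full activation recomputation is enabled in the config and the module is in training mode. A sketch of the relevant `TransformerConfig` fields follows; the values are illustrative only.

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=2,
    hidden_size=64,
    num_attention_heads=4,
    recompute_granularity="full",  # checkpoint whole transformer layers
    recompute_method="uniform",    # or "block"
    recompute_num_layers=1,        # layers grouped into each checkpoint unit
)
```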
- set_input_tensor(input_tensor: torch.Tensor)#
Set input tensor to be used instead of forward()’s input.
When doing pipeline parallelism, the input from the previous stage comes from communication, not from the forward call's input, so the model's forward_step_func won't have it. This function is therefore used by internal code to bypass the input provided by the forward_step_func.
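An illustrative fragment only (the real logic lives in the pipeline-parallel schedules): it assumes a `block` like the one constructed above but built with `pre_process=False`, so the activation received from the previous stage is attached via `set_input_tensor` before the block is called. The mask convention (True marks masked-out positions) is an assumption consistent with Megatron-Core's local attention implementation.

```python
import torch

seq_len, micro_batch = 16, 2
recv_buffer = torch.empty(
    seq_len, micro_batch, config.hidden_size, device="cuda", dtype=config.params_dtype
)
# ... point-to-point communication fills recv_buffer from the previous stage ...

block.set_input_tensor(recv_buffer)

causal = torch.tril(torch.ones(seq_len, seq_len, device="cuda", dtype=torch.bool))
attention_mask = ~causal.view(1, 1, seq_len, seq_len)  # True = masked out (assumed convention)

# On a non-first stage the attached input tensor is used in place of the
# hidden_states argument.
output = block(hidden_states=None, attention_mask=attention_mask)
```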
- _should_call_local_cudagraph(*args, **kwargs)#
Check if we should call the local cudagraph path.
- __call__(*args, **kwargs)#
- forward(
- hidden_states: Union[torch.Tensor, megatron.core.utils.WrappedTensor],
- attention_mask: Optional[torch.Tensor],
- context: Optional[torch.Tensor] = None,
- context_mask: Optional[torch.Tensor] = None,
- rotary_pos_emb: Optional[torch.Tensor] = None,
- rotary_pos_cos: Optional[torch.Tensor] = None,
- rotary_pos_sin: Optional[torch.Tensor] = None,
- rotary_pos_cos_sin: Optional[torch.Tensor] = None,
- attention_bias: Optional[torch.Tensor] = None,
- inference_context: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- sequence_len_offset: Optional[torch.Tensor] = None,
- *,
- inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- dynamic_inference_decode_only: Optional[bool] = None,
- )#
Perform the forward pass through the transformer block.
This method handles the core computation of the transformer, including self-attention, optional cross-attention, and feed-forward operations.
- Parameters:
hidden_states (Union[Tensor, WrappedTensor]) – Input tensor of shape [s, b, h] where s is the sequence length, b is the batch size, and h is the hidden size. Can be passed as a WrappedTensor during inference to avoid an obsolete reference in the calling function.
attention_mask (Tensor) – Boolean tensor of shape [1, 1, s, s] for masking self-attention.
context (Tensor, optional) – Context tensor for cross-attention.
context_mask (Tensor, optional) – Mask for the cross-attention context.
rotary_pos_emb (Tensor, optional) – Rotary positional embeddings.
rotary_pos_cos (Optional[Tensor]) – Rotary embedding cosine.
rotary_pos_sin (Optional[Tensor]) – Rotary embedding sine.
rotary_pos_cos_sin (Optional[Tensor]) – Combined rotary embedding cosine and sine. Currently used exclusively for inference with dynamic batching and flashinfer RoPE.
attention_bias (Tensor) – Bias tensor for Q * K.T, in a shape broadcastable to [b, num_head, sq, skv], e.g. [1, 1, sq, skv]. Used as an alternative to an attention mask for TE cuDNN attention.
inference_context (BaseInferenceContext, optional) – Parameters for inference-time optimizations.
packed_seq_params (PackedSeqParams, optional) – Parameters for packed sequence processing.
dynamic_inference_decode_only (Optional[bool]) – If true, indicates that the current inference context is decode-only. This argument is only used to uniquely identify decode and non-decode CUDA graph runners in the CUDA graph manager.
- Returns:
The output hidden states tensor of shape [s, b, h], and optionally the updated context tensor if cross-attention is used.
- Return type:
Union[Tensor, Tuple[Tensor, Tensor]]
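Continuing the construction sketch above (same `config` and `block`), the snippet below runs a plain forward pass with a causal mask. Shapes follow the documented [s, b, h] layout; the mask convention (True marks masked-out positions) is an assumption consistent with Megatron-Core's local attention implementation.

```python
import torch

seq_len, batch = 16, 2
hidden_states = torch.randn(
    seq_len, batch, config.hidden_size, device="cuda", dtype=config.params_dtype
)
causal = torch.tril(torch.ones(seq_len, seq_len, device="cuda", dtype=torch.bool))
attention_mask = ~causal.view(1, 1, seq_len, seq_len)  # True = masked out

output = block(hidden_states=hidden_states, attention_mask=attention_mask)
print(output.shape)  # torch.Size([16, 2, 64]) -- [s, b, h]
```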
- sharded_state_dict(
- prefix: str = '',
- sharded_offsets: tuple = (),
- metadata: dict = None,
- )#
Generate a sharded state dictionary for the transformer block.
- Parameters:
prefix (str, optional) – Prefix to be added to all keys in the state dict. Defaults to an empty string.
sharded_offsets (tuple, optional) – Tuple of sharding offsets.
metadata (dict, optional) – Additional metadata for sharding. Can specify if layers are non-homogeneous. Defaults to None.
- Returns:
A dictionary containing the sharded state of the model.
- Return type:
ShardedStateDict
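A short sketch of feeding the result into Megatron-Core's distributed checkpointing, again assuming the `block` built above; the prefix and path are arbitrary.

```python
from megatron.core import dist_checkpointing

sharded_sd = block.sharded_state_dict(prefix="decoder.")

# dist_checkpointing.save expects an existing, empty checkpoint directory.
dist_checkpointing.save(sharded_sd, "/tmp/transformer_block_ckpt")
```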