core.models.gpt.gpt_model#

Module Contents#

Classes#

GPTModel

GPT Transformer language model.

API#

class core.models.gpt.gpt_model.GPTModel(
config: megatron.core.transformer.transformer_config.TransformerConfig,
transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vocab_size: int,
max_sequence_length: int,
pre_process: bool = True,
post_process: bool = True,
fp16_lm_cross_entropy: bool = False,
parallel_output: bool = True,
share_embeddings_and_output_weights: bool = False,
position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'yarn', 'none'] = 'learned_absolute',
rotary_percent: float = 1.0,
rotary_base: int = 10000,
rope_scaling: bool = False,
rope_scaling_factor: float = 8.0,
scatter_embedding_sequence_parallel: bool = True,
seq_len_interpolation_factor: Optional[float] = None,
mtp_block_spec: Optional[megatron.core.transformer.spec_utils.ModuleSpec] = None,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.models.common.language_module.language_module.LanguageModule

GPT Transformer language model.

Parameters:
  • config (TransformerConfig) – Transformer config

  • transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers

  • vocab_size (int) – Vocabulary size

  • max_sequence_length (int) – Maximum sequence length. This is used for positional embeddings.

  • pre_process (bool, optional) – Include embedding layer (used with pipeline parallelism). Defaults to True.

  • post_process (bool, optional) – Include an output layer (used with pipeline parallelism). Defaults to True.

  • fp16_lm_cross_entropy (bool, optional) – If True, compute the unreduced language-model cross-entropy loss in fp16. Defaults to False.

  • parallel_output (bool, optional) – Do not gather the outputs, keep them split across tensor parallel ranks. Defaults to True.

  • share_embeddings_and_output_weights (bool, optional) – When True, input embeddings and output logit weights are shared. Defaults to False.

  • position_embedding_type (Literal['learned_absolute', 'rope', 'mrope', 'yarn', 'none'], optional) – Position embedding type. Defaults to 'learned_absolute'.

  • rotary_percent (float, optional) – Percent of rotary dimension to use for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 1.0.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 10000.

  • rope_scaling (bool, optional) – Toggle RoPE scaling. Defaults to False.

  • rope_scaling_factor (float) – RoPE scaling factor. Defaults to 8.0.

  • scatter_embedding_sequence_parallel (bool, optional) – Whether embeddings should be scattered across sequence parallel region or not. Defaults to True.

  • seq_len_interpolation_factor (Optional[float], optional) – scale of linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None.

  • mtp_block_spec (Optional[ModuleSpec], optional) – Specification for the Multi-Token Prediction (MTP) block. Defaults to None.

  • pg_collection (Optional[ProcessGroupCollection], optional) – Model communication process groups. Defaults to None.

  • vp_stage (Optional[int], optional) – Virtual pipeline parallel stage index. Defaults to None.

Initialization
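
A minimal construction sketch. It assumes torch.distributed and Megatron's model-parallel state have already been initialized (e.g. via megatron.core.parallel_state.initialize_model_parallel()); the hyperparameters and the get_gpt_layer_local_spec() layer spec are illustrative choices, not prescribed values.

```python
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.transformer.transformer_config import TransformerConfig

# Illustrative, tiny configuration; real models use much larger values.
config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
)

model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=512,
    position_embedding_type="rope",
    share_embeddings_and_output_weights=True,
)
```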

set_input_tensor(input_tensor: torch.Tensor) → None#

Sets input tensor to the model.

See megatron.model.transformer.set_input_tensor()

Parameters:

input_tensor (Tensor) – Sets the input tensor for the model.
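
Continuing the construction sketch above, a simplified illustration of how a pipeline-parallel schedule uses this hook on a non-first stage. The shapes and the placeholder tensor are assumptions; Megatron's built-in schedules call set_input_tensor() internally after receiving activations from the previous stage.

```python
import torch

seq_len, micro_batch = 512, 2  # illustrative shapes
# Placeholder for the activation received from the previous pipeline stage,
# shaped [sequence, batch, hidden]. Real schedules obtain it via p2p communication.
hidden_from_prev_stage = torch.zeros(seq_len, micro_batch, config.hidden_size)

# On stages constructed with pre_process=False the embedding layer is skipped;
# the injected tensor is used as the decoder input instead.
model.set_input_tensor(hidden_from_prev_stage)
```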

_preprocess(
input_ids: torch.Tensor,
position_ids: torch.Tensor,
decoder_input: torch.Tensor = None,
inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
)#

Preprocesses inputs for the transformer decoder.

Applies embeddings to input tokens, or uses decoder_input from a previous pipeline stage. Also sets up rotary positional embeddings.

forward(
input_ids: torch.Tensor,
position_ids: torch.Tensor,
attention_mask: torch.Tensor,
decoder_input: torch.Tensor = None,
labels: torch.Tensor = None,
inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
runtime_gather_output: Optional[bool] = None,
*,
inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
loss_mask: Optional[torch.Tensor] = None,
) → torch.Tensor#

Forward function of the GPT model. This function passes the input tensors through the embedding layer, then the decoder, and finally through the optional post-processing layer.

It returns the loss values if labels are given; otherwise it returns the final hidden states.

Parameters:

runtime_gather_output (bool) – Whether to gather the output across tensor-parallel ranks at runtime. Defaults to None, in which case the parallel_output argument from the constructor is used.
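
Continuing the construction sketch above, a minimal training-style call. The shapes, mask construction, and label shifting are illustrative assumptions; Megatron's boolean attention mask marks positions to be masked out with True.

```python
import torch

batch_size, seq_len = 2, 512
input_ids = torch.randint(0, 1024, (batch_size, seq_len))
position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch_size, -1)

# Causal mask: True marks positions that must NOT be attended to.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
attention_mask = (~causal).unsqueeze(0).unsqueeze(0).expand(
    batch_size, 1, seq_len, seq_len
)

labels = torch.roll(input_ids, shifts=-1, dims=1)  # next-token targets

# With labels, the last pipeline stage returns the unreduced per-token loss;
# without labels, it returns the hidden states (or logits after post-processing).
loss = model(input_ids, position_ids, attention_mask, labels=labels)
```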

_postprocess(
hidden_states,
input_ids,
position_ids,
labels,
rotary_pos_emb,
rotary_pos_cos,
rotary_pos_sin,
mtp_in_postprocess=None,
loss_mask=None,
decoder_input=None,
attention_mask=None,
inference_params=None,
packed_seq_params=None,
sequence_len_offset=None,
runtime_gather_output=None,
extra_block_kwargs=None,
inference_context=None,
)#

Postprocesses decoder hidden states to generate logits or compute loss.

Applies Multi-Token Prediction if enabled, generates output logits through the output layer, and computes language model loss when labels are provided.

shared_embedding_or_output_weight() → torch.Tensor#

Gets the embedding weight or output logit weight when share_embeddings_and_output_weights is set to True or when the Multi-Token Prediction (MTP) feature is used.

Returns:

During pre-processing or MTP processing it returns the input embedding weight. Otherwise, during post-processing, it returns the final output layer's weight.

Return type:

Tensor
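
A simplified sketch of the kind of use this method serves: keeping tied embedding/output weights consistent across the first and last pipeline stages. The real synchronization happens inside Megatron's training utilities and operates on gradient buffers; the all-reduce below is illustrative only.

```python
import torch
from megatron.core import parallel_state

weight = model.shared_embedding_or_output_weight()
if weight is not None and model.share_embeddings_and_output_weights:
    # Illustrative only: all-reduce the tied weight's gradient across the
    # embedding process group so the first and last pipeline stages agree.
    torch.distributed.all_reduce(
        weight.grad, group=parallel_state.get_embedding_group()
    )
```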

build_schedule_plan(
input_ids: torch.Tensor,
position_ids: torch.Tensor,
attention_mask: torch.Tensor,
decoder_input: torch.Tensor = None,
labels: torch.Tensor = None,
inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
runtime_gather_output: Optional[bool] = None,
inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
loss_mask: Optional[torch.Tensor] = None,
)#

Builds a computation schedule plan for the model.

This function creates a schedule plan for a model chunk, including preprocessing, transformer layers, and postprocessing. The schedule plan is used to optimize computation and memory usage in distributed environments.

Parameters:
  • input_ids (Tensor) – Input token IDs.

  • position_ids (Tensor) – Position IDs.

  • attention_mask (Tensor) – Attention mask.

  • decoder_input (Tensor, optional) – Decoder input tensor. Defaults to None.

  • labels (Tensor, optional) – Labels for loss computation. Defaults to None.

  • inference_context (BaseInferenceContext, optional) – Inference context. Defaults to None.

  • packed_seq_params (PackedSeqParams, optional) – Parameters for packed sequences. Defaults to None.

  • extra_block_kwargs (dict, optional) – Additional keyword arguments for blocks. Defaults to None.

  • runtime_gather_output (Optional[bool], optional) – Whether to gather output at runtime. Defaults to None.

  • inference_params (InferenceParams, optional) – Parameters for inference. Defaults to None.

  • loss_mask (Optional[Tensor], optional) – Loss mask. Defaults to None.

Returns:

The model chunk schedule plan.

Return type:

TransformerModelChunkSchedulePlan

sharded_state_dict(
prefix: str = '',
sharded_offsets: tuple = (),
metadata: Optional[Dict] = None,
) → megatron.core.dist_checkpointing.mapping.ShardedStateDict#

Sharded state dict implementation for GPTModel backward-compatibility.

Removes extra state, and ties the word embeddings and output layer during the MTP processing stage.

Parameters:
  • prefix (str) – Module name prefix.

  • sharded_offsets (tuple) – Pipeline-parallel related offsets, expected to be empty at this module level.

  • metadata (Optional[Dict]) – metadata controlling sharded state dict creation.

Returns:

sharded state dict for the GPTModel

Return type:

ShardedStateDict
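
A minimal sketch of pairing sharded_state_dict with Megatron's distributed checkpointing utilities. The checkpoint directory is an illustrative path, and the calls assume torch.distributed and model-parallel state are initialized on every participating rank.

```python
from megatron.core import dist_checkpointing

ckpt_dir = "/tmp/gpt_ckpt"  # illustrative path

# Save: each rank contributes only the shards described by its own
# ShardedStateDict.
dist_checkpointing.save(model.sharded_state_dict(prefix=""), ckpt_dir)

# Load: build a sharded state dict describing the current (possibly different)
# parallel layout, let dist_checkpointing resolve the shards, then load it.
state_dict = dist_checkpointing.load(model.sharded_state_dict(prefix=""), ckpt_dir)
model.load_state_dict(state_dict)
```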