Megatron Core User Guide

models.gpt package

This is the implementation of the popular GPT model. It supports several features, including model parallelism (tensor parallel, pipeline parallel, data parallel), mixture of experts, FP8, and the distributed optimizer. New features are added regularly; raise an issue if you would like something added.

class core.models.gpt.gpt_model.GPTModel(*args: Any, **kwargs: Any)

Bases: megatron.core.models.common.language_module.language_module.LanguageModule

GPT Transformer language model.

Parameters
  • config (TransformerConfig) – Transformer config

  • transformer_layer_spec (ModuleSpec) – Specifies module to use for transformer layers

  • vocab_size (int) – Vocabulary size

  • max_sequence_length (int) – Maximum sequence length. This is used for positional embedding.

  • pre_process (bool, optional) – Include embedding layer (used with pipeline parallelism). Defaults to True.

  • post_process (bool, optional) – Include an output layer (used with pipeline parallelism). Defaults to True.

  • fp16_lm_cross_entropy (bool, optional) – Defaults to False.

  • parallel_output (bool, optional) – Do not gather the outputs, keep them split across tensor parallel ranks. Defaults to True.

  • share_embeddings_and_output_weights (bool, optional) – When True, input embeddings and output logit weights are shared. Defaults to False.

  • position_embedding_type (Literal[learned_absolute,rope], optional) – Position embedding type. Defaults to ‘learned_absolute’.

  • rotary_percent (float, optional) – Percent of rotary dimension to use for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 1.0.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 10000.

  • scatter_embedding_sequence_parallel (bool, optional) – Whether embeddings should be scattered across sequence parallel region or not. Defaults to True.

  • seq_len_interpolation_factor (Optional[float], optional) – Scale factor for linearly interpolating RoPE for longer sequences. The value must be a float larger than 1.0. Defaults to None.
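
The three rotary parameters above interact, and a small sketch may help. It uses the standard RoPE formulation in plain Python and is not the library's code. The helper names `rope_inv_freq` and `rope_angles`, and the `head_dim` argument, are hypothetical and introduced only for illustration.

```python
from typing import List, Optional


def rope_inv_freq(head_dim: int, rotary_percent: float = 1.0,
                  rotary_base: int = 10000) -> List[float]:
    """Inverse frequencies for the rotated part of one attention head.

    Hypothetical helper: assumes the standard RoPE formulation.
    """
    # Only a fraction of the head dimension is rotated.
    rotary_dim = int(head_dim * rotary_percent)
    # One frequency per pair of rotated channels.
    return [1.0 / (rotary_base ** (i / rotary_dim))
            for i in range(0, rotary_dim, 2)]


def rope_angles(seq_len: int, inv_freq: List[float],
                seq_len_interpolation_factor: Optional[float] = None
                ) -> List[List[float]]:
    """Rotation angle for every (position, frequency) pair."""
    positions = [float(p) for p in range(seq_len)]
    if seq_len_interpolation_factor is not None:
        # Linear interpolation: squeeze positions so that a longer
        # sequence maps onto the original position range.
        positions = [p / seq_len_interpolation_factor for p in positions]
    return [[p * f for f in inv_freq] for p in positions]


inv_freq = rope_inv_freq(head_dim=8, rotary_percent=0.5)
angles = rope_angles(seq_len=4, inv_freq=inv_freq,
                     seq_len_interpolation_factor=2.0)
```

With `head_dim=8` and `rotary_percent=0.5`, only four channels are rotated, so the sketch produces two frequencies. A `seq_len_interpolation_factor` of 2.0 halves every position index before the angles are computed.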

forward(input_ids: torch.Tensor, position_ids: torch.Tensor, attention_mask: torch.Tensor, decoder_input: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, inference_params: Optional[megatron.core.InferenceParams] = None, packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None, extra_block_kwargs: Optional[dict] = None, runtime_gather_output: Optional[bool] = None) → torch.Tensor

Forward function of the GPT model. This function passes the input tensors through the embedding layer, then the decoder, and finally into the (optional) post-processing layer.

It returns the loss values if labels are given; otherwise it returns the final hidden units.

Parameters

runtime_gather_output (bool) – Gather output at runtime. Default None means parallel_output arg in the constructor will be used.
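
The fallback from runtime_gather_output to parallel_output can be read as a small decision rule. The sketch below is one interpretation of the descriptions above, not the library's code. The helpers `resolve_gather_output` and `forward_return_kind` are hypothetical, and the latter assumes that a stage built with post_process=False returns hidden states for the next stage.

```python
from typing import Optional


def resolve_gather_output(parallel_output: bool,
                          runtime_gather_output: Optional[bool] = None) -> bool:
    """Decide whether outputs are gathered across tensor parallel ranks.

    Hypothetical helper mirroring the parameter descriptions above.
    """
    if runtime_gather_output is None:
        # No per-call override: fall back to the constructor argument.
        # parallel_output=True means "do not gather".
        return not parallel_output
    return runtime_gather_output


def forward_return_kind(post_process: bool, has_labels: bool) -> str:
    """Name what forward() hands back (assumption, see lead-in)."""
    if not post_process:
        # No output layer on this stage: pass activations onward.
        return "hidden_states"
    return "loss" if has_labels else "output"


# Constructor asked for split outputs, but this call wants them gathered.
gather = resolve_gather_output(parallel_output=True, runtime_gather_output=True)
```

Passing runtime_gather_output explicitly overrides the constructor setting for that call only; leaving it as None keeps the behaviour chosen at construction time.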

set_input_tensor(input_tensor: torch.Tensor) → None

Sets input tensor to the model.

See megatron.model.transformer.set_input_tensor()

Parameters

input_tensor (Tensor) – The input tensor to set on the model.
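
To illustrate how set_input_tensor relates to the pre_process and post_process flags under pipeline parallelism, here is a toy sketch in plain Python. `ToyStage`, `build_stages`, and `run_pipeline` are hypothetical stand-ins and not Megatron code. The sketch assumes that only the first stage embeds the input, only the last stage applies the output layer, and every other stage receives its input through set_input_tensor.

```python
from typing import List, Optional


class ToyStage:
    """Hypothetical stand-in for one pipeline stage of a model."""

    def __init__(self, pre_process: bool, post_process: bool) -> None:
        self.pre_process = pre_process
        self.post_process = post_process
        self.input_tensor: Optional[List[float]] = None

    def set_input_tensor(self, input_tensor: List[float]) -> None:
        # Store activations produced by the previous stage.
        self.input_tensor = input_tensor

    def forward(self, input_ids: List[int]) -> List[float]:
        if self.pre_process:
            # Stand-in for the embedding layer.
            hidden = [float(i) for i in input_ids]
        else:
            # Later stages ignore input_ids and use the stored tensor.
            assert self.input_tensor is not None
            hidden = self.input_tensor
        # Stand-in for this stage's transformer layers.
        hidden = [h + 1.0 for h in hidden]
        if self.post_process:
            # Stand-in for the output layer.
            return [h * 2.0 for h in hidden]
        return hidden


def build_stages(num_stages: int) -> List[ToyStage]:
    # First stage embeds, last stage produces the final output.
    return [ToyStage(pre_process=(rank == 0),
                     post_process=(rank == num_stages - 1))
            for rank in range(num_stages)]


def run_pipeline(stages: List[ToyStage], input_ids: List[int]) -> List[float]:
    output: List[float] = []
    for stage in stages:
        if not stage.pre_process:
            stage.set_input_tensor(output)
        output = stage.forward(input_ids)
    return output


stages = build_stages(3)
result = run_pipeline(stages, [1, 2])
```

In a real run each stage lives on a different set of ranks and the tensor arrives through communication; the loop here only shows the order in which set_input_tensor and forward are called.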

sharded_state_dict(prefix: str = '', sharded_offsets: tuple = (), metadata: Optional[Dict] = None) → megatron.core.dist_checkpointing.mapping.ShardedStateDict

Sharded state dict implementation for GPTModel backward-compatibility (removing extra state).

Parameters
  • prefix (str) – Module name prefix.

  • sharded_offsets (tuple) – PP related offsets, expected to be empty at this module level.

  • metadata (Optional[Dict]) – Metadata controlling sharded state dict creation.

Returns

Sharded state dict for the GPTModel.

Return type

ShardedStateDict

© Copyright 2022-2025, NVIDIA. Last updated on Jan 14, 2025.