bridge.models.gpt.gpt_builder#

Module Contents#

Classes#

GPTModelConfig

Configuration for a Megatron Core GPT model.

GPTModelBuilder

Builder to construct Megatron Core GPT models.

Functions#

transformer_engine_layer_spec

Create a Transformer Engine layer specification based on the provided config.

transformer_engine_full_layer_spec

Create a full Transformer Engine layer specification with autocast support.

local_layer_spec

Create a local layer specification without Transformer Engine.

modelopt_transformer_layer_spec

Layer specification for quantization with ModelOpt.

default_layer_spec

Determine the most appropriate layer specification based on availability.

mtp_block_spec

Create MTP block spec if model has MTP layers.

Data#

API#

bridge.models.gpt.gpt_builder.logger#

'getLogger(...)'

bridge.models.gpt.gpt_builder.transformer_engine_layer_spec(
config: GPTModelConfig,
) → megatron.core.transformer.ModuleSpec#

Create a Transformer Engine layer specification based on the provided config.

bridge.models.gpt.gpt_builder.transformer_engine_full_layer_spec(
config: megatron.bridge.models.transformer_config.TransformerConfig,
) → megatron.core.transformer.ModuleSpec#

Create a full Transformer Engine layer specification with autocast support.

Parameters:

config – GPT configuration object

Returns:

Module specification for full TE layers

Return type:

ModuleSpec

bridge.models.gpt.gpt_builder.local_layer_spec(
config: megatron.bridge.models.transformer_config.TransformerConfig,
) → megatron.core.transformer.ModuleSpec#

Create a local layer specification without Transformer Engine.

Parameters:

config – GPT configuration object

Returns:

Module specification for local implementation layers

Return type:

ModuleSpec

bridge.models.gpt.gpt_builder.modelopt_transformer_layer_spec(
config: GPTModelConfig,
) → megatron.core.transformer.ModuleSpec#

Layer specification for quantization with ModelOpt.

bridge.models.gpt.gpt_builder.default_layer_spec(
config: GPTModelConfig,
) → megatron.core.transformer.ModuleSpec#

Determine the most appropriate layer specification based on availability.
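
For illustration, a minimal sketch of resolving a layer spec explicitly instead of leaving transformer_layer_spec unset. The import paths and the TransformerConfig constructor arguments below are assumptions based on the signatures on this page; the actual fallback logic lives inside default_layer_spec.

```python
# Sketch only: import paths and TransformerConfig arguments are assumptions
# based on the signatures documented on this page.
from megatron.bridge.models.transformer_config import TransformerConfig
from megatron.bridge.models.gpt.gpt_builder import (
    GPTModelConfig,
    default_layer_spec,
    local_layer_spec,
)

transformer_cfg = TransformerConfig(num_layers=2, hidden_size=512, num_attention_heads=8)
model_cfg = GPTModelConfig(transformer=transformer_cfg, vocab_size=32000, seq_length=1024)

# Resolve a concrete ModuleSpec up front ...
spec = default_layer_spec(model_cfg)

# ... or store a callable and let GPTModelBuilder resolve it at build time,
# matching the ModuleSpec | Callable[[GPTModelConfig], ModuleSpec] type of
# GPTModelConfig.transformer_layer_spec.
model_cfg.transformer_layer_spec = local_layer_spec
```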

class bridge.models.gpt.gpt_builder.GPTModelConfig#

Bases: megatron.bridge.models.common.ModelConfig

Configuration for a Megatron Core GPT model.

This is purely a configuration object. All model construction logic lives in GPTModelBuilder.

Contains a TransformerConfig alongside GPT-specific parameters. Attributes on the embedded transformer config are accessible directly on this object via __getattr__/__setattr__ proxying.

Note: vocab_size must be set before passing this config to GPTModelBuilder.
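
A short sketch of the attribute proxying described above; the TransformerConfig fields used here (num_layers, hidden_size, num_attention_heads) are assumed for illustration.

```python
from megatron.bridge.models.transformer_config import TransformerConfig
from megatron.bridge.models.gpt.gpt_builder import GPTModelConfig

cfg = GPTModelConfig(
    transformer=TransformerConfig(num_layers=2, hidden_size=512, num_attention_heads=8),
    vocab_size=32000,
    seq_length=1024,
)

# Reads and writes of transformer fields are proxied through the config.
assert cfg.num_layers == cfg.transformer.num_layers
cfg.hidden_size = 1024
assert cfg.transformer.hidden_size == 1024
```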

builder: ClassVar[str]#

'megatron.bridge.models.GPTModelBuilder'

transformer: megatron.bridge.models.transformer_config.TransformerConfig#

None

transformer_layer_spec: megatron.core.transformer.ModuleSpec | Callable[[bridge.models.gpt.gpt_builder.GPTModelConfig], megatron.core.transformer.ModuleSpec]#

None

vocab_size: int | None#

None

This represents the unpadded vocab size. The padded vocab size is automatically calculated in the GPTModelBuilder.

make_vocab_size_divisible_by: int#

128

should_pad_vocab: bool#

False

Controls whether the vocab size should be padded for tensor parallelism. Set this when the tokenizer provides the vocab size; in that case, the vocab size will be padded.
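
As an illustration of how this interacts with make_vocab_size_divisible_by, a hypothetical helper sketching the usual Megatron-style padding arithmetic (round the vocab size up to a multiple of the divisor, typically scaled by the tensor-parallel size). The exact calculation is performed internally by GPTModelBuilder and may differ.

```python
import math

def padded_vocab_size(vocab_size: int, divisible_by: int = 128, tp_size: int = 1) -> int:
    """Hypothetical helper for illustration; GPTModelBuilder does this internally."""
    multiple = divisible_by * tp_size
    return math.ceil(vocab_size / multiple) * multiple

padded_vocab_size(32000)               # 32000 (already a multiple of 128)
padded_vocab_size(32003)               # 32128
padded_vocab_size(50257, tp_size=8)    # 51200
```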

seq_length: int#

1024

fp16_lm_cross_entropy: bool#

False

parallel_output: bool#

True

share_embeddings_and_output_weights: bool#

False

position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'yarn', 'none']#

'learned_absolute'

rotary_percent: float#

1.0

rotary_base: int#

10000

rope_scaling: bool#

False

rope_scaling_factor: float#

8.0

scatter_embedding_sequence_parallel: bool#

True

seq_len_interpolation_factor: float | None#

None

tp_comm_overlap_cfg: str | dict[str, Any] | None#

None

Config file when tp_comm_overlap is enabled.

use_transformer_engine_full_layer_spec: bool#

False

use_transformer_engine_op_fuser: bool#

False

use_arbitrary_attention_mask: bool | None#

None

__getattr__(name: str, /) → Any#
__setattr__(name: str, value: Any, /) → None#
finalize() → None#

One-time validation to run once the config is ready to be used by the builder.
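
A minimal sketch of the intended flow, reusing a GPTModelConfig instance (model_cfg) like the ones constructed in the sketches above:

```python
model_cfg.vocab_size = 32000        # must be set (see the note on GPTModelConfig)
model_cfg.should_pad_vocab = True   # e.g. when the tokenizer supplied the vocab size
model_cfg.finalize()                # one-time validation before handing to GPTModelBuilder
```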

class bridge.models.gpt.gpt_builder.GPTModelBuilder(
model_config: bridge.models.gpt.gpt_builder.GPTModelConfig,
)#

Bases: megatron.bridge.models.common.ModelBuilder[megatron.core.models.gpt.GPTModel, bridge.models.gpt.gpt_builder.GPTModelConfig]

Builder to construct Megatron Core GPT models.

Example:

    transformer_cfg = TransformerConfig(num_layers=32, hidden_size=4096, ...)
    model_cfg = GPTModelConfig(transformer=transformer_cfg, vocab_size=32000, seq_length=2048, ...)

    # Single stage (e.g. inference)
    model = GPTModelBuilder(model_cfg).build_model(pg_collection)

    # Distributed training
    models = GPTModelBuilder(model_cfg).build_distributed_models(pg_collection)

Initialization

build_model(
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection,
pre_process: bool | None = None,
post_process: bool | None = None,
vp_stage: int | None = None,
) → megatron.core.models.gpt.GPTModel#

Build a single MCoreGPTModel stage.

Parameters:
  • pg_collection – Process groups for distributed training

  • pre_process – Include embedding layer

  • post_process – Include output layer

  • vp_stage – Virtual pipeline stage

Returns:

The constructed model

Note: Virtual pipeline model parallelism is not supported for Mamba models.
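
For example, a sketch of building individual pipeline stages by hand, reusing the model_cfg from the example above. Constructing the ProcessGroupCollection (pg_collection below) is outside the scope of this page and is assumed to exist already.

```python
builder = GPTModelBuilder(model_cfg)

# First stage carries the embedding, last stage carries the output layer.
first_stage = builder.build_model(pg_collection, pre_process=True, post_process=False)
last_stage = builder.build_model(pg_collection, pre_process=False, post_process=True)

# With virtual pipeline parallelism, pass the virtual stage index as well.
middle_chunk = builder.build_model(
    pg_collection, pre_process=False, post_process=False, vp_stage=1
)
```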

build_distributed_models(
pg_collection: megatron.core.process_groups_config.ProcessGroupCollection,
ddp_config: megatron.core.distributed.DistributedDataParallelConfig | None = None,
overlap_param_gather_with_optimizer_step: bool = False,
use_megatron_fsdp: bool = False,
use_torch_fsdp2: bool = False,
wrap_with_ddp: bool = True,
data_parallel_random_init: bool = True,
mixed_precision_wrapper: Callable[[Any, megatron.core.transformer.MegatronModule], megatron.core.transformer.MegatronModule] | None = Float16Module,
model_type: megatron.core.enums.ModelType = ModelType.encoder_or_decoder,
) → list[megatron.core.models.gpt.GPTModel]#

Build model stages and wrap for distributed training.

Parameters:
  • pg_collection – Model communication process groups.

  • ddp_config – DistributedDataParallel configuration

  • overlap_param_gather_with_optimizer_step – Whether to overlap parameter gather with optimizer step.

  • use_megatron_fsdp – Whether to use Megatron FSDP

  • use_torch_fsdp2 – Whether to use Torch FSDP 2.0

  • wrap_with_ddp – Set to False to skip the DDP/FSDP wrapper.

  • data_parallel_random_init – Whether to use data parallel random initialization

  • mixed_precision_wrapper – Mixed precision wrapper, e.g. Float16Module

  • model_type – Deprecated flag, only used for backwards compatibility.

Returns:

List of model stages.
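
A sketch of the common distributed-training path, again assuming model_cfg and pg_collection already exist; the DistributedDataParallelConfig fields used below (overlap_grad_reduce, use_distributed_optimizer) are assumptions and are not documented on this page.

```python
from megatron.core.distributed import DistributedDataParallelConfig

# Field names here are assumptions; see the Megatron Core DDP config for the real set.
ddp_cfg = DistributedDataParallelConfig(
    overlap_grad_reduce=True,
    use_distributed_optimizer=True,
)

models = GPTModelBuilder(model_cfg).build_distributed_models(
    pg_collection,
    ddp_config=ddp_cfg,
    wrap_with_ddp=True,  # set False to receive bare stages for custom wrapping
)

# One entry per (virtual) pipeline stage owned by this rank.
for stage in models:
    print(type(stage).__name__)
```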

bridge.models.gpt.gpt_builder.mtp_block_spec(
config: bridge.models.gpt.gpt_builder.GPTModelConfig,
transformer_layer_spec: megatron.core.transformer.ModuleSpec,
vp_stage: int | None = None,
) → megatron.core.transformer.ModuleSpec | None#

Create MTP block spec if model has MTP layers.

Parameters:
  • config – Full model config

  • transformer_layer_spec – Layer specification for the transformer layers

  • vp_stage – Virtual pipeline stage

Returns:

The MTP module specification

Return type:

ModuleSpec
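
A sketch of the intended usage based on the signature above, assuming a model_cfg like the ones constructed earlier: resolve the transformer layer spec first, then request the MTP block spec, which is None when the config defines no MTP layers.

```python
layer_spec = default_layer_spec(model_cfg)
mtp_spec = mtp_block_spec(model_cfg, transformer_layer_spec=layer_spec, vp_stage=None)

if mtp_spec is not None:
    # The config defines MTP layers; pass the spec along when constructing the model.
    ...
```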