nemo_automodel.components.models.gpt2#

GPT-2 model utility wrappers for NeMo Automodel.

The canonical way to instantiate a GPT-2 model with custom sizes is to pass a transformers.GPT2Config into NeMoAutoModelForCausalLM.from_config. For YAML-driven workflows, however, spelling out the entire nested config can be verbose. This module therefore provides a flat builder function that exposes the most common GPT-2 hyper-parameters directly.

Example (YAML):

model:
  _target_: nemo_automodel.components.models.gpt2.build_gpt2_model
  n_layer: 24           # GPT-2 Medium
  n_embd: 1024
  n_head: 16
  vocab_size: 50257
  n_positions: 2048
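
The same model can also be built programmatically. A minimal sketch mirroring the YAML above, assuming the package is installed and importable:

Example (Python):

from nemo_automodel.components.models.gpt2 import build_gpt2_model

# GPT-2 Medium-sized configuration, matching the YAML example.
model = build_gpt2_model(
    n_layer=24,
    n_embd=1024,
    n_head=16,
    vocab_size=50257,
    n_positions=2048,
)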

Module Contents#

Classes#

CausalSelfAttention

Multi-head self-attention with a causal mask.

MLP

GPT-2 feed-forward network (GEGLU → Linear).

TransformerBlock

A single transformer block (LN → Attn → Add → LN → MLP → Add).

GPT2LMHeadModel

Minimal GPT-2 Causal-LM with tied input/output embeddings.

Functions#

build_gpt2_model

Instantiate and return a pure-PyTorch GPT-2 language model.

Data#

API#

nemo_automodel.components.models.gpt2.__all__#

['build_gpt2_model']

class nemo_automodel.components.models.gpt2.CausalSelfAttention(
embed_dim: int,
num_heads: int,
attn_dropout: float = 0.0,
)#

Bases: torch.nn.Module

Multi-head self-attention with a causal mask.

Initialization

forward(x: torch.Tensor) → torch.Tensor#
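
For orientation, a minimal sketch of what a forward pass of this shape typically looks like, using torch.nn.functional.scaled_dot_product_attention with a causal mask; this is an illustrative re-implementation, not the module's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, attn_dropout: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # fused Q/K/V projection
        self.proj = nn.Linear(embed_dim, embed_dim)     # output projection
        self.attn_dropout = attn_dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape (B, T, C) -> (B, num_heads, T, head_dim)
        q, k, v = (y.view(b, t, self.num_heads, c // self.num_heads).transpose(1, 2) for y in (q, k, v))
        # is_causal=True applies the lower-triangular (causal) mask
        y = F.scaled_dot_product_attention(
            q, k, v, dropout_p=self.attn_dropout if self.training else 0.0, is_causal=True
        )
        y = y.transpose(1, 2).reshape(b, t, c)
        return self.proj(y)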
class nemo_automodel.components.models.gpt2.MLP(embed_dim: int, expansion_factor: int = 4)#

Bases: torch.nn.Module

GPT-2 feed-forward network (GEGLU → Linear).

Initialization

forward(x: torch.Tensor) → torch.Tensor#
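
The GEGLU → Linear structure can be sketched as follows; the exact gating split is an assumption about how the module realizes GEGLU, shown only for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUMLPSketch(nn.Module):
    def __init__(self, embed_dim: int, expansion_factor: int = 4):
        super().__init__()
        hidden = expansion_factor * embed_dim
        # One up-projection producing both the value and the gate halves.
        self.up = nn.Linear(embed_dim, 2 * hidden)
        self.down = nn.Linear(hidden, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.up(x).chunk(2, dim=-1)
        return self.down(value * F.gelu(gate))  # GEGLU, then the final Linear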
class nemo_automodel.components.models.gpt2.TransformerBlock(embed_dim: int, num_heads: int, dropout: float = 0.0)#

Bases: torch.nn.Module

A single transformer block (LN → Attn → Add → LN → MLP → Add).

Initialization

forward(x: torch.Tensor) → torch.Tensor#
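
The LN → Attn → Add → LN → MLP → Add ordering corresponds to a pre-norm residual block. A minimal sketch reusing the two illustrative classes above:

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttentionSketch(embed_dim, num_heads, attn_dropout=dropout)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.mlp = GEGLUMLPSketch(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))  # LN -> Attn -> Add
        x = x + self.mlp(self.ln2(x))   # LN -> MLP -> Add
        return x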
class nemo_automodel.components.models.gpt2.GPT2LMHeadModel(
*,
vocab_size: int,
n_positions: int,
n_embd: int,
n_layer: int,
n_head: int,
dropout: float = 0.1,
)#

Bases: torch.nn.Module

Minimal GPT-2 Causal-LM with tied input/output embeddings.

Initialization

forward(input_ids: torch.LongTensor) → torch.Tensor#
_init_weights()#

Parameter initialization following GPT-2 conventions.
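
A sketch of how such a minimal causal LM typically fits together (token plus learned positional embeddings, a stack of blocks, a final LayerNorm, and an output head tied to the token embedding); this continues the illustrative classes above and is not the module's exact implementation:

import torch
import torch.nn as nn

class GPT2LMHeadSketch(nn.Module):
    def __init__(self, *, vocab_size: int, n_positions: int, n_embd: int,
                 n_layer: int, n_head: int, dropout: float = 0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(n_positions, n_embd)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList(
            TransformerBlockSketch(n_embd, n_head, dropout) for _ in range(n_layer)
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # tie input/output embeddings

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.drop(self.tok_emb(input_ids) + self.pos_emb(pos))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # (batch, seq_len, vocab_size) logits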

nemo_automodel.components.models.gpt2.build_gpt2_model(
*,
vocab_size: int = 50257,
n_positions: int = 2048,
n_ctx: int | None = None,
n_embd: int = 768,
n_layer: int = 12,
n_head: int = 12,
bos_token_id: int = 50256,
eos_token_id: int = 50256,
attn_implementation: str = 'flash_attention_2',
**extra_cfg: Any,
) → torch.nn.Module#

Instantiate and return a pure-PyTorch GPT-2 language model.

The function intentionally keeps the same signature as the original wrapper so existing YAML/CLI configurations continue to work. Extra keyword arguments are silently ignored.
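
A brief usage sketch; the stray keyword name below is hypothetical and only illustrates that unrecognized keys (e.g. leftover YAML entries) do not raise:

from nemo_automodel.components.models.gpt2 import build_gpt2_model

# Defaults give a GPT-2 Small-sized model (12 layers, 768 hidden, 12 heads).
model = build_gpt2_model()

# Extra keyword arguments are ignored per the docstring above.
model = build_gpt2_model(n_layer=6, some_unused_key=True)  # some_unused_key is hypothetical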