nemo_automodel.components.models.gpt2

GPT-2 model utility wrappers for NeMo Automodel.

The canonical way to instantiate a GPT-2 model with custom sizes is to pass a transformers.GPT2Config to NeMoAutoModelForCausalLM.from_config. For YAML-driven workflows, however, specifying the entire nested config can be verbose. This module provides a flat builder function that exposes the most common GPT-2 hyperparameters directly as keyword arguments.

Example (YAML):

model:
  _target_: nemo_automodel.components.models.gpt2.build_gpt2_model
  n_layer: 24           # GPT-2 Medium
  n_embd: 1024
  n_head: 16
  vocab_size: 50257
  n_positions: 2048
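
A config entry of this shape is usually turned into a function call by importing the dotted `_target_` path and passing the remaining keys as keyword arguments. The sketch below illustrates that general pattern with a hypothetical `instantiate` helper. It is not NeMo Automodel's actual loader, and it uses a standard-library target so that it runs without the package installed.

```python
import importlib
from typing import Any, Mapping


def instantiate(cfg: Mapping[str, Any]) -> Any:
    """Hypothetical helper: call the object named by `_target_` with the other keys."""
    kwargs = dict(cfg)
    # Split "package.module.attr" into the module path and the attribute name.
    module_path, _, attr = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr)
    return target(**kwargs)


# Stand-in target from the standard library. With NeMo Automodel installed, the
# same call shape would apply to the `_target_` string shown in the YAML above.
example = instantiate({"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4})
print(example)  # 3/4
```

Under this reading, every key other than `_target_` in the `model:` section becomes a keyword argument of `build_gpt2_model`.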

Module Contents

Classes

Name	Description
`CausalSelfAttention`	Multi-head self-attention with a causal mask.
`GPT2LMHeadModel`	Minimal GPT-2 Causal-LM with tied input/output embeddings.
`MLP`	GPT-2 feed-forward network (Linear → GELU → Linear).
`TransformerBlock`	A single transformer block (LN → Attn → Add → LN → MLP → Add).

Functions

Name	Description
`build_gpt2_model`	Instantiate and return a pure-PyTorch GPT-2 language model.

Data

__all__

API

class nemo_automodel.components.models.gpt2.CausalSelfAttention(
    embed_dim: int,
    num_heads: int,
    attn_dropout: float = 0.0
)

Bases: Module

Multi-head self-attention with a causal mask.

head_dim

= embed_dim // num_heads

out_proj

= nn.Linear(embed_dim, embed_dim)

qkv_proj

= nn.Linear(embed_dim, 3 * embed_dim)

nemo_automodel.components.models.gpt2.CausalSelfAttention.forward(
    x: torch.Tensor
) -> torch.Tensor
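
The attributes above suggest a fused QKV projection followed by an output projection. The following is a minimal sketch of how such a module could be written, not the library's implementation. It assumes PyTorch 2.0 or later for `torch.nn.functional.scaled_dot_product_attention`, and the class name `CausalSelfAttentionSketch` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttentionSketch(nn.Module):
    """Illustrative stand-in, not the nemo_automodel class."""

    def __init__(self, embed_dim: int, num_heads: int, attn_dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.attn_dropout = attn_dropout
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, embed_dim = x.shape
        # One projection yields Q, K and V; split them along the last dimension.
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # (batch, seq, embed) -> (batch, heads, seq, head_dim)
        q, k, v = (
            t.reshape(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
            for t in (q, k, v)
        )
        # is_causal=True prevents each position from attending to later positions.
        dropout_p = self.attn_dropout if self.training else 0.0
        out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=True)
        # (batch, heads, seq, head_dim) -> (batch, seq, embed)
        out = out.transpose(1, 2).reshape(bsz, seq_len, embed_dim)
        return self.out_proj(out)


torch.manual_seed(0)
attn = CausalSelfAttentionSketch(embed_dim=64, num_heads=4).eval()
x = torch.randn(2, 8, 64)
y = attn(x)

# Perturbing only the last token must leave all earlier outputs unchanged.
x_perturbed = x.clone()
x_perturbed[:, -1] += 1.0
y_perturbed = attn(x_perturbed)
```

The perturbation check at the end is a quick way to confirm that the mask is causal: outputs at positions before the modified token should be identical in `y` and `y_perturbed`.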

class nemo_automodel.components.models.gpt2.GPT2LMHeadModel(
    vocab_size: int,
    n_positions: int,
    n_embd: int,
    n_layer: int,
    n_head: int,
    dropout: float = 0.1
)

Bases: Module

Minimal GPT-2 Causal-LM with tied input/output embeddings.

drop

= nn.Dropout(dropout)

lm_head

= nn.Linear(n_embd, vocab_size, bias=False)

ln_f

= nn.LayerNorm(n_embd)

wpe

= nn.Embedding(n_positions, n_embd)

wte

= nn.Embedding(vocab_size, n_embd)
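
To see what tying the input and output embeddings saves, the parameter count can be worked out from the layer shapes listed on this page. The sketch below is plain arithmetic under stated assumptions: `hidden_dim = expansion_factor * n_embd`, every `nn.Linear` except `lm_head` has a bias, each `nn.LayerNorm` has a weight and a bias, and a tied `lm_head` contributes no parameters of its own. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(
    vocab_size: int = 50257,
    n_positions: int = 2048,
    n_embd: int = 768,
    n_layer: int = 12,
    expansion_factor: int = 4,
    tied: bool = True,
) -> int:
    """Count parameters from the documented layer shapes (illustrative only)."""
    hidden_dim = expansion_factor * n_embd  # assumed relation
    embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
    layer_norm = 2 * n_embd  # weight + bias
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)  # qkv_proj + out_proj
    mlp = (n_embd * hidden_dim + hidden_dim) + (hidden_dim * n_embd + n_embd)  # fc1 + fc2
    block = 2 * layer_norm + attention + mlp  # ln_1, attn, ln_2, mlp
    lm_head = 0 if tied else n_embd * vocab_size  # bias=False, shared with wte when tied
    return embeddings + n_layer * block + layer_norm + lm_head  # final term is ln_f


print(gpt2_param_count())                  # 125226240
print(gpt2_param_count(n_positions=1024))  # 124439808
print(gpt2_param_count(tied=False) - gpt2_param_count())  # 38597376
```

With these assumptions, tying saves `vocab_size * n_embd` parameters, roughly 38.6 million at the default sizes.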

nemo_automodel.components.models.gpt2.GPT2LMHeadModel._init_weights()

Parameter initialization following GPT-2 conventions.
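
The page does not spell out which conventions are meant. As a rough illustration only, many GPT-2 implementations draw linear and embedding weights from a normal distribution with standard deviation 0.02, zero the biases, and leave layer norms at weight 1 and bias 0. The helper `init_gpt2_style` below is hypothetical and may differ from what `_init_weights` actually does.

```python
import torch
import torch.nn as nn


def init_gpt2_style(module: nn.Module, std: float = 0.02) -> None:
    """Hypothetical initializer following a common GPT-2 recipe."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)


torch.manual_seed(0)
demo = nn.Sequential(nn.Embedding(512, 256), nn.Linear(256, 256), nn.LayerNorm(256))
demo.apply(init_gpt2_style)  # nn.Module.apply visits every submodule
```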

nemo_automodel.components.models.gpt2.GPT2LMHeadModel.forward(
    input_ids: torch.LongTensor,
    kwargs = {}
) -> torch.Tensor

nemo_automodel.components.models.gpt2.GPT2LMHeadModel.initialize_weights()

class nemo_automodel.components.models.gpt2.MLP(
    embed_dim: int,
    expansion_factor: int = 4
)

Bases: Module

GPT-2 feed-forward network (Linear → GELU → Linear).

act

= nn.GELU()

fc1

= nn.Linear(embed_dim, hidden_dim)

fc2

= nn.Linear(hidden_dim, embed_dim)

nemo_automodel.components.models.gpt2.MLP.forward(
    x: torch.Tensor
) -> torch.Tensor
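
Given the attributes listed above (`fc1`, `act = nn.GELU()`, `fc2`), the forward pass plausibly applies an expanding linear layer, a GELU, and a projecting linear layer in that order. The sketch below shows that reading. The class name `MLPSketch` is hypothetical, and `hidden_dim = expansion_factor * embed_dim` is an assumption about how `hidden_dim` is derived.

```python
import torch
import torch.nn as nn


class MLPSketch(nn.Module):
    """Illustrative stand-in, not the nemo_automodel class."""

    def __init__(self, embed_dim: int, expansion_factor: int = 4):
        super().__init__()
        hidden_dim = expansion_factor * embed_dim  # assumed relation
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expand, apply the nonlinearity, then project back to embed_dim.
        return self.fc2(self.act(self.fc1(x)))


mlp = MLPSketch(embed_dim=16)
mlp_out = mlp(torch.randn(2, 5, 16))
mlp_params = sum(p.numel() for p in mlp.parameters())
```

The output keeps the input shape, so the module can sit on a residual path without any reshaping.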

class nemo_automodel.components.models.gpt2.TransformerBlock(
    embed_dim: int,
    num_heads: int,
    dropout: float = 0.0
)

Bases: Module

A single transformer block (LN → Attn → Add → LN → MLP → Add).

attn

= CausalSelfAttention(embed_dim, num_heads, dropout)

ln_1

= nn.LayerNorm(embed_dim)

ln_2

= nn.LayerNorm(embed_dim)

mlp

= MLP(embed_dim)

nemo_automodel.components.models.gpt2.TransformerBlock.forward(
    x: torch.Tensor
) -> torch.Tensor
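
The ordering LN → Attn → Add → LN → MLP → Add describes a pre-norm block: each sublayer sees a normalized input, and its output is added back to the unnormalized residual stream. A minimal sketch of that wiring follows. The class `TransformerBlockSketch` and the helper `zero_linear` are hypothetical, and the attention and MLP sublayers are passed in as placeholders so the example stays short.

```python
import torch
import torch.nn as nn


class TransformerBlockSketch(nn.Module):
    """Illustrative pre-norm residual wiring, not the nemo_automodel class."""

    def __init__(self, embed_dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln_1 = nn.LayerNorm(embed_dim)
        self.attn = attn
        self.ln_2 = nn.LayerNorm(embed_dim)
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x))  # LN -> Attn -> Add
        x = x + self.mlp(self.ln_2(x))   # LN -> MLP -> Add
        return x


def zero_linear(dim: int) -> nn.Linear:
    """Placeholder sublayer whose output is always zero."""
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer


torch.manual_seed(0)
block = TransformerBlockSketch(32, attn=zero_linear(32), mlp=zero_linear(32))
block_in = torch.randn(2, 5, 32)
block_out = block(block_in)
```

With sublayers that output zero, the block reduces to the identity, which makes the residual path easy to verify in isolation.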

nemo_automodel.components.models.gpt2.build_gpt2_model(
    vocab_size: int = 50257,
    n_positions: int = 2048,
    n_ctx: int | None = None,
    n_embd: int = 768,
    n_layer: int = 12,
    n_head: int = 12,
    bos_token_id: int = 50256,
    eos_token_id: int = 50256,
    attn_implementation: str = 'flash_attention_2',
    extra_cfg: typing.Any = {}
) -> torch.nn.Module

Instantiate and return a pure-PyTorch GPT-2 language model.

The function intentionally keeps the same signature as the original wrapper so existing YAML/CLI configurations continue to work. Extra keyword arguments are quietly ignored.

nemo_automodel.components.models.gpt2.__all__ = ['build_gpt2_model']