nemo_automodel.components.models.gpt2


GPT-2 model utility wrappers for NeMo Automodel.

The canonical way to instantiate a GPT-2 with custom sizes is to pass a transformers.GPT2Config into NeMoAutoModelForCausalLM.from_config. For YAML-driven workflows, however, specifying the entire nested config can be verbose. This module provides a single-level builder function that exposes the most common GPT-2 hyper-parameters directly.

Example (YAML):

model:
  _target_: nemo_automodel.components.models.gpt2.build_gpt2_model
  n_layer: 24  # GPT-2 Medium
  n_embd: 1024
  n_head: 16
  vocab_size: 50257
  n_positions: 2048
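
The `_target_` key in the YAML above names a callable by its dotted import path, and the remaining keys become keyword arguments. The sketch below illustrates that idea under the assumption that `_target_` follows the Hydra-style convention; `resolve_target` and `instantiate` are hypothetical helpers written for this example, not functions from NeMo Automodel, and a standard-library stand-in target is used so the snippet runs without NeMo installed.

```python
import importlib


def resolve_target(path: str):
    """Import the object named by a dotted path such as 'pkg.module.attr'."""
    module_name, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def instantiate(cfg: dict):
    """Call the `_target_` callable with every other key as a keyword argument."""
    kwargs = {key: value for key, value in cfg.items() if key != "_target_"}
    return resolve_target(cfg["_target_"])(**kwargs)


# Stand-in target from the standard library (an assumption made for the demo).
demo_cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 6}
result = instantiate(demo_cfg)

# The YAML example corresponds to a mapping like this one. With NeMo Automodel
# installed, instantiate(gpt2_cfg) would call build_gpt2_model(n_layer=24, ...).
gpt2_cfg = {
    "_target_": "nemo_automodel.components.models.gpt2.build_gpt2_model",
    "n_layer": 24,
    "n_embd": 1024,
    "n_head": 16,
    "vocab_size": 50257,
    "n_positions": 2048,
}
gpt2_kwargs = {key: value for key, value in gpt2_cfg.items() if key != "_target_"}
```

Keeping the builder's arguments flat means the YAML stays a single mapping instead of a nested config object.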

Module Contents

Classes

| Name | Description |
| --- | --- |
| CausalSelfAttention | Multi-head self-attention with a causal mask. |
| GPT2LMHeadModel | Minimal GPT-2 Causal-LM with tied input/output embeddings. |
| MLP | GPT-2 feed-forward network (Linear → GELU → Linear). |
| TransformerBlock | A single transformer block (LN → Attn → Add → LN → MLP → Add). |

Functions

| Name | Description |
| --- | --- |
| build_gpt2_model | Instantiate and return a pure-PyTorch GPT-2 language model. |

Data

__all__

API

class nemo_automodel.components.models.gpt2.CausalSelfAttention(
embed_dim: int,
num_heads: int,
attn_dropout: float = 0.0
)

Bases: Module

Multi-head self-attention with a causal mask.

head_dim = embed_dim // num_heads
out_proj = nn.Linear(embed_dim, embed_dim)
qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)

nemo_automodel.components.models.gpt2.CausalSelfAttention.forward(
x: torch.Tensor
) -> torch.Tensor
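
The attributes above (a fused `qkv_proj`, a per-head `head_dim`, and an `out_proj`) are enough to sketch how causal multi-head attention is commonly computed. The class below is a minimal illustration, not the module's actual source: the name `CausalSelfAttentionSketch`, the explicit mask construction, and the placement of dropout on the attention weights are assumptions.

```python
import math

import torch
import torch.nn as nn


class CausalSelfAttentionSketch(nn.Module):
    """Illustrative causal self-attention; not the NeMo Automodel implementation."""

    def __init__(self, embed_dim: int, num_heads: int, attn_dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.attn_dropout = nn.Dropout(attn_dropout)  # assumed placement

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, embed_dim = x.shape
        # One projection yields queries, keys and values; split on the last dim.
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, embed) -> (batch, heads, seq, head_dim)
            return t.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # True above the diagonal marks future positions that must be hidden.
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        scores = scores.masked_fill(future, float("-inf"))
        weights = self.attn_dropout(torch.softmax(scores, dim=-1))
        # (batch, heads, seq, head_dim) -> (batch, seq, embed)
        out = (weights @ v).transpose(1, 2).reshape(bsz, seq_len, embed_dim)
        return self.out_proj(out)


torch.manual_seed(0)
attn = CausalSelfAttentionSketch(embed_dim=32, num_heads=4).eval()
x = torch.randn(2, 5, 32)
y = attn(x)

# Perturb only the last token: earlier outputs should be unaffected.
x_future = x.clone()
x_future[:, -1] += 1.0
y_future = attn(x_future)
```

Perturbing the final token and comparing outputs at earlier positions is a quick way to check that the mask really blocks information from the future.
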
class nemo_automodel.components.models.gpt2.GPT2LMHeadModel(
vocab_size: int,
n_positions: int,
n_embd: int,
n_layer: int,
n_head: int,
dropout: float = 0.1
)

Bases: Module

Minimal GPT-2 Causal-LM with tied input/output embeddings.

drop = nn.Dropout(dropout)
h
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
ln_f = nn.LayerNorm(n_embd)
wpe = nn.Embedding(n_positions, n_embd)
wte = nn.Embedding(vocab_size, n_embd)

nemo_automodel.components.models.gpt2.GPT2LMHeadModel._init_weights()

Parameter initialization following GPT-2 conventions.

nemo_automodel.components.models.gpt2.GPT2LMHeadModel.forward(
input_ids: torch.LongTensor,
kwargs = {}
) -> torch.Tensor

nemo_automodel.components.models.gpt2.GPT2LMHeadModel.initialize_weights()
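
Because the layer shapes are listed explicitly, the parameter count for a given configuration can be estimated by hand. The sketch below does this in plain Python under several assumptions that this page does not state outright: `h` holds `n_layer` transformer blocks, every `nn.Linear` except `lm_head` keeps its default bias, each `nn.LayerNorm` has a weight and a bias, the MLP hidden size is `4 * n_embd`, and the tied `lm_head` adds no parameters of its own. The function names are hypothetical.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a dense layer with an optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)


def gpt2_param_count(
    vocab_size: int = 50257,
    n_positions: int = 2048,
    n_embd: int = 768,
    n_layer: int = 12,
    expansion_factor: int = 4,  # assumed MLP width multiplier
) -> int:
    """Estimate trainable parameters, counting the tied embedding matrix once."""
    hidden_dim = expansion_factor * n_embd
    layer_norm = 2 * n_embd  # weight + bias
    embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
    attention = linear_params(n_embd, 3 * n_embd) + linear_params(n_embd, n_embd)
    mlp = linear_params(n_embd, hidden_dim) + linear_params(hidden_dim, n_embd)
    block = 2 * layer_norm + attention + mlp  # ln_1, attn, ln_2, mlp
    return embeddings + n_layer * block + layer_norm  # final term is ln_f


small_1024 = gpt2_param_count(n_positions=1024)  # 124,439,808
small_default = gpt2_param_count()               # 125,226,240
medium_yaml = gpt2_param_count(n_embd=1024, n_layer=24, n_positions=2048)
```

Under these assumptions, the only difference between the 1024-position and 2048-position variants is the size of `wpe`; `n_head` does not affect the count.
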
class nemo_automodel.components.models.gpt2.MLP(
embed_dim: int,
expansion_factor: int = 4
)

Bases: Module

GPT-2 feed-forward network (Linear → GELU → Linear).

act = nn.GELU()
fc1 = nn.Linear(embed_dim, hidden_dim)
fc2 = nn.Linear(hidden_dim, embed_dim)

nemo_automodel.components.models.gpt2.MLP.forward(
x: torch.Tensor
) -> torch.Tensor
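
The attribute list (`fc1`, `act = nn.GELU()`, `fc2`) suggests a plain two-layer feed-forward network with a GELU in between. A minimal sketch follows, assuming `hidden_dim = expansion_factor * embed_dim` (the page does not define `hidden_dim`) and using the hypothetical class name `MLPSketch`.

```python
import torch
import torch.nn as nn


class MLPSketch(nn.Module):
    """Illustrative feed-forward block: expand, apply GELU, project back."""

    def __init__(self, embed_dim: int, expansion_factor: int = 4):
        super().__init__()
        hidden_dim = expansion_factor * embed_dim  # assumed definition
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same weights are applied independently at every sequence position.
        return self.fc2(self.act(self.fc1(x)))


mlp = MLPSketch(embed_dim=16)
tokens = torch.randn(2, 5, 16)
mlp_out = mlp(tokens)
```
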
class nemo_automodel.components.models.gpt2.TransformerBlock(
embed_dim: int,
num_heads: int,
dropout: float = 0.0
)

Bases: Module

A single transformer block (LN → Attn → Add → LN → MLP → Add).

attn = CausalSelfAttention(embed_dim, num_heads, dropout)
ln_1 = nn.LayerNorm(embed_dim)
ln_2 = nn.LayerNorm(embed_dim)
mlp = MLP(embed_dim)

nemo_automodel.components.models.gpt2.TransformerBlock.forward(
x: torch.Tensor
) -> torch.Tensor
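
The ordering LN → Attn → Add → LN → MLP → Add describes a pre-layer-norm residual block: each sub-layer sees a normalized input, and its output is added back to the unnormalized stream. The sketch below shows that wiring only. To stay self-contained it substitutes `nn.MultiheadAttention` with a causal mask for `CausalSelfAttention` and an `nn.Sequential` for `MLP`; both stand-ins and the class name `PreLNBlockSketch` are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PreLNBlockSketch(nn.Module):
    """Illustrative pre-LN block; sub-modules are stand-ins, not the real classes."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        self.ln_1 = nn.LayerNorm(embed_dim)
        # Stand-in for CausalSelfAttention.
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ln_2 = nn.LayerNorm(embed_dim)
        # Stand-in for MLP, assuming a 4x hidden expansion.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # True entries are positions a query is NOT allowed to attend to.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln_1(x)                                   # LN
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                                   # Attn -> Add
        x = x + self.mlp(self.ln_2(x))                     # LN -> MLP -> Add
        return x


torch.manual_seed(0)
block = PreLNBlockSketch(embed_dim=32, num_heads=4).eval()
block_in = torch.randn(2, 6, 32)
block_out = block(block_in)

# Changing the last token should leave earlier positions untouched.
perturbed = block_in.clone()
perturbed[:, -1] += 1.0
block_out_perturbed = block(perturbed)
```
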
nemo_automodel.components.models.gpt2.build_gpt2_model(
vocab_size: int = 50257,
n_positions: int = 2048,
n_ctx: int | None = None,
n_embd: int = 768,
n_layer: int = 12,
n_head: int = 12,
bos_token_id: int = 50256,
eos_token_id: int = 50256,
attn_implementation: str = 'flash_attention_2',
extra_cfg: typing.Any = {}
) -> torch.nn.Module

Instantiate and return a pure-PyTorch GPT-2 language model.

The function intentionally keeps the same signature as the original wrapper so that existing YAML/CLI configurations continue to work. Extra keyword arguments (extra_cfg) are accepted but silently ignored.
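
The pattern described here, a stable signature that tolerates unknown keys, can be sketched in a few lines. The function below is a hypothetical stand-in that returns a plain dictionary instead of a model; it only illustrates how a `**extra_cfg` catch-all lets older configurations keep loading, and it is not the body of `build_gpt2_model`.

```python
def build_config_sketch(
    vocab_size: int = 50257,
    n_positions: int = 2048,
    n_embd: int = 768,
    n_layer: int = 12,
    n_head: int = 12,
    **extra_cfg,
) -> dict:
    """Collect the recognized hyperparameters; unknown keys are never read."""
    # extra_cfg is accepted for backward compatibility and deliberately unused.
    return {
        "vocab_size": vocab_size,
        "n_positions": n_positions,
        "n_embd": n_embd,
        "n_layer": n_layer,
        "n_head": n_head,
    }


# A config carrying a key the sketch does not recognize still builds.
legacy = build_config_sketch(n_layer=24, n_embd=1024, n_head=16, some_legacy_flag=True)
clean = build_config_sketch(n_layer=24, n_embd=1024, n_head=16)
```

One consequence of this design is that a misspelled key fails silently, so it is worth checking YAML keys against the signature above.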

nemo_automodel.components.models.gpt2.__all__ = ['build_gpt2_model']