LLM Pre-Training with NeMo AutoModel

This guide covers FineWeb data preparation, defining a NanoGPT‑style model, and launching and monitoring a NeMo AutoModel pre‑training run.

Set Up Your Environment

In this guide, we will use an interactive environment to install NeMo AutoModel from Git. You can also install NeMo AutoModel from PyPI or use our bi-monthly Docker container (see the Installation Guide).

$ # clone / install AutoModel (editable for local hacks)
$ cd /path/to/workspace/ # adjust this path as needed
$ git clone https://github.com/NVIDIA-NeMo/Automodel.git
$ cd Automodel/
$ pip install -e ".[all]"    # installs NeMo AutoModel + optional extras

For this guide, we will use a single machine equipped with 8xH100 NVIDIA GPUs.

To run this guide on a single GPU, use the single-GPU command in the Launch Training section below and scale down the YAML (for example, reduce step_scheduler.global_batch_size / local_batch_size, and shrink the model using model.n_layer / model.n_embd / model.n_head). For more launch patterns, see Run on Your Local Workstation.

Preprocess the FineWeb Dataset

File Size Limitation: The nanogpt_data_processor.py script has a 4GB file size limit (~2^32 bytes) due to 32-bit position tracking in the BOS index. This translates to:

~2 billion tokens when using uint16 (vocabularies < 65,536 tokens, e.g., GPT-2)
~1 billion tokens when using uint32 (larger vocabularies)

Always use the --max-tokens flag to stay within these limits (e.g., --max-tokens 2B or --max-tokens 1.5B when tokens are stored as uint16; stay below 1B when they are stored as uint32).

For larger datasets, please see pretraining.md which supports sharded preprocessing without these constraints.
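The two token ceilings above follow directly from dividing the file limit by the bytes stored per token. The short sketch below reproduces that arithmetic; it assumes the limit applies to raw token bytes only and ignores any header the real script may write. The helper names are illustrative, not part of the tool.

```python
# Sketch: derive the token ceilings from a 2**32-byte file limit.
# Assumption: the limit covers raw token bytes; header bytes are ignored.
FILE_LIMIT_BYTES = 2**32  # ~4 GB, from 32-bit position tracking

BYTES_PER_TOKEN = {"uint16": 2, "uint32": 4}


def max_tokens(dtype: str) -> int:
    """Largest token count whose raw bytes fit under the file limit."""
    return FILE_LIMIT_BYTES // BYTES_PER_TOKEN[dtype]


def pick_dtype(vocab_size: int) -> str:
    """Follow the guide's rule: vocabularies below 65,536 use uint16."""
    return "uint16" if vocab_size < 2**16 else "uint32"


for name in ("uint16", "uint32"):
    print(f"{name}: {max_tokens(name):,} tokens")
```

With the GPT-2 vocabulary (50,257 entries), pick_dtype selects uint16, so the roughly 2-billion-token ceiling applies.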

Quick Introduction to the FineWeb Dataset

The FineWeb dataset consists of more than 18.5T tokens of cleaned and deduplicated English web data from CommonCrawl. For this guide, we use the sample-10BT subset (10 billion tokens), from which we extract a smaller sample (e.g., 500M tokens) that fits within the preprocessing tool’s limits.

Briefly, FineWeb is built by extracting main text from CommonCrawl WARC HTML, keeping English pages using fastText language scoring, applying multiple quality filters (e.g., Gopher repetition/quality checks, C4-style rules, and custom heuristics for list-like or repeated/poorly formatted lines), and then MinHash-deduplicating each crawl independently (5-gram shingling with 14×8 hash functions). Basic PII normalization is applied (e.g., anonymizing emails and public IPs). The result is released per-crawl (and convenient sampled subsets), ready for high-throughput streaming.

To train on more than 2B tokens from FineWeb, see pretraining.md which uses Megatron Core’s sharded dataset format without file size constraints.

Preprocessing and Tokenization

For this guide, we provide a data preprocessing tool, nanogpt_data_processor.py, that streams datasets from the Hugging Face Hub, tokenizes them with Hugging Face’s transformers.AutoTokenizer (default: GPT-2), and writes the tokens to binary shard files that can be memory-mapped. During training, the NanogptDataset class streams from these files efficiently.

$ # Step into repo root
$ cd /path/to/workspace/Automodel/
$ 
$ # Generate 500 million tokens using the 10B raw split
$ python tools/nanogpt_data_processor.py \
>   --dataset HuggingFaceFW/fineweb \
>   --set-name sample-10BT \
>   --max-tokens 500M      # stop after 500 million tokens; specify as needed, reduce for smaller runs.
$ 
$ # Shards are stored in:  tools/fineweb_max_tokens_500M/
$ #    dataset.bin (single binary file with all tokens)

How the preprocessor works: The script streams data iteratively from the Hugging Face Hub (avoiding loading the entire dataset into memory), uses a multiprocessing pipeline with separate reader and writer processes, and parallelizes tokenization across multiple CPU cores using ProcessPoolExecutor. This design enables efficient processing of very large datasets while keeping memory overhead low. By default, the script uses the gpt2 tokenizer; pass the --tokenizer option to use a different one.
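To make the write path concrete, here is a minimal single-process sketch of "tokenize a stream, append ids to a binary file, stop at a token budget". It is not the script's actual code. The tokenizer is a toy byte-level stand-in (hypothetical toy_tokenize with BOS_ID = 256), the document stream is hard-coded, and the file is assumed to be a headerless run of uint16 ids, which may differ from the layout nanogpt_data_processor.py writes.

```python
import array
import os
import tempfile
from typing import Iterable, Iterator

BOS_ID = 256  # hypothetical BOS id for the toy byte-level tokenizer below


def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: BOS followed by the UTF-8 bytes of the text."""
    return [BOS_ID] + list(text.encode("utf-8"))


def stream_docs() -> Iterator[str]:
    """Stand-in for a dataset streamed from the Hugging Face Hub."""
    yield from ["hello world", "fineweb sample", "third document"]


def write_tokens(docs: Iterable[str], path: str, max_tokens: int) -> int:
    """Tokenize documents one at a time and append uint16 ids to `path`.

    Stops once `max_tokens` ids are written, truncating the last document
    if needed. Returns the number of ids written.
    """
    written = 0
    with open(path, "wb") as f:
        for doc in docs:
            ids = toy_tokenize(doc)[: max_tokens - written]
            array.array("H", ids).tofile(f)  # "H" = unsigned 16-bit
            written += len(ids)
            if written >= max_tokens:
                break
    return written


def read_tokens(path: str) -> list[int]:
    """Read the whole file back as a list of uint16 ids."""
    ids = array.array("H")
    count = os.path.getsize(path) // ids.itemsize
    with open(path, "rb") as f:
        ids.fromfile(f, count)
    return ids.tolist()


out_path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
n_written = write_tokens(stream_docs(), out_path, max_tokens=20)
tokens = read_tokens(out_path)
```

Because only one document is held in memory at a time, the memory footprint stays flat regardless of how many tokens are written, which is the property the real pipeline relies on.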

Consider the following options:

Adjust --max-tokens to control how many tokens to process (must stay within the 4GB file size limit mentioned above).
Adjust --chunk-size for processing batch size.
Use --num-workers to control parallelization.
Specify --output-dir to change the output location.
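Values such as 500M or 1.5B are shorthand for token counts. As an illustration only, a parser for that shorthand could look like the following. parse_token_count is a hypothetical helper, and the K suffix is an assumption beyond the M and B forms shown in this guide; the real script's parsing may differ.

```python
from decimal import Decimal

# Assumed suffix table; only M and B appear in this guide.
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_token_count(text: str) -> int:
    """Convert strings such as '500M' or '1.5B' to an integer token count."""
    text = text.strip().upper()
    if text and text[-1] in _MULTIPLIERS:
        # Decimal avoids float rounding for values like '1.1B'.
        return int(Decimal(text[:-1]) * _MULTIPLIERS[text[-1]])
    return int(text)
```

For example, parse_token_count("1.5B") yields 1,500,000,000, which fits under the uint16 ceiling but not the uint32 one.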

Understand the NeMo AutoModel Training Workflow

NeMo AutoModel follows a simple but powerful flow for training:

A Python recipe script (for example, examples/llm_pretrain/pretrain.py) serves as the entry point that wires up all training components based on a YAML configuration file. Any configuration option can be overridden using CLI arguments (e.g., --model.name abc).
The YAML file describes each component of the training job (such as model, dataset, optimizer, distributed, checkpoint, and optional wandb).
Each component is constructed from its _target_, which points to a Python callable (function or class constructor) to instantiate. The remaining keys in that YAML block become keyword arguments for that callable.
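CLI overrides of the form --model.name abc can be pictured as writes into the nested configuration by dotted key. The sketch below shows that idea on a plain dict. apply_overrides and _coerce are hypothetical helpers written for this guide; the actual NeMo AutoModel argument parser may accept other syntaxes and coerce types differently.

```python
from typing import Any


def _coerce(raw: str) -> Any:
    """Best-effort conversion of a CLI string to int or float."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw  # leave everything else as a string


def apply_overrides(cfg: dict[str, Any], argv: list[str]) -> dict[str, Any]:
    """Apply `--a.b.c value` pairs to a nested dict in place."""
    if len(argv) % 2:
        raise ValueError("overrides must come in `--key value` pairs")
    for flag, raw in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected --dotted.key, got {flag!r}")
        *parents, leaf = flag[2:].split(".")
        node = cfg
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections
        node[leaf] = _coerce(raw)
    return cfg


config = {"model": {"n_layer": 12, "n_embd": 768}}
apply_overrides(config, ["--model.n_layer", "6", "--optimizer.lr", "3e-4"])
```

After the call, config["model"]["n_layer"] is 6 and a new optimizer section holds the learning rate, while untouched keys such as n_embd keep their YAML values.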

How _target_ is resolved:

Import path to a Python object (for example, my_pkg.models.build_model).
Local Python file path plus object name (for example, /abs/path/to/my_model.py:build_model).
Library callables such as Hugging Face transformers.AutoModelForCausalLM.from_config.

Nested objects can also specify their own _target_ (common when building Hugging Face config objects first and passing them into a from_config method). Any YAML key can be overridden at launch time from the CLI, making it easy to tweak hyperparameters without editing files.
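The resolution rules above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the NeMo AutoModel implementation: resolve and instantiate are hypothetical names, only dotted import paths are handled, and the demo uses datetime objects in place of models so that it runs without extra dependencies.

```python
import importlib
from typing import Any


def resolve(target: str) -> Any:
    """Import a dotted path, trying progressively shorter module prefixes.

    This lets paths end in a class attribute, e.g. `pkg.Class.method`.
    """
    parts = target.split(".")
    for i in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ModuleNotFoundError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {target!r}")


def instantiate(node: Any) -> Any:
    """Build nested `_target_` mappings depth-first."""
    if isinstance(node, dict):
        # Children are built first, so nested objects arrive as kwargs.
        built = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        if "_target_" in node:
            return resolve(node["_target_"])(**built)
        return built
    if isinstance(node, list):
        return [instantiate(v) for v in node]
    return node


# Demo: the nested timedelta is built before being passed to timezone(...).
tz = instantiate(
    {
        "_target_": "datetime.timezone",
        "offset": {"_target_": "datetime.timedelta", "hours": 2},
        "name": "demo",
    }
)
```

The same depth-first order is what makes a config-first Hugging Face flow possible: the nested config node is constructed first and then handed to the outer callable as a keyword argument.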

With this context, let’s define a model using _target_, then point the dataset at your preprocessed shards, and finally review the full YAML.

Define Your Own Model Architecture

NeMo AutoModel relies on a YAML-driven configuration to build every training component. In particular, the model._target_ must reference a callable that returns an nn.Module (or a compatible Hugging Face model). You can point _target_ at:

An import path to a Python object.
A local Python file plus the object name using path.py:object_name.
A library callable such as transformers.AutoModelForCausalLM.from_config.

Below are examples for each pattern.

NanoGPT Source and File-Path `_target_`

Below is the minimal GPT‑2 implementation used for this NanoGPT‑style pretraining flow. It is a pure‑PyTorch model with tied embeddings and standard transformer blocks:

"""
Self-contained GPT-2 (Causal LM) implementation.
This module defines a pure-PyTorch model and defines the necessary
building blocks (attention, MLP, transformer block, and language-model head).
The public *build_gpt2_model* helper returns an ``nn.Module``.
"""
import math
from typing import Any
import torch
import torch.nn as nn
import torch.nn.functional as F
# The attention layer
class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask."""
    def __init__(self, embed_dim: int, num_heads: int, attn_dropout: float = 0.0):
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.attn_dropout = attn_dropout
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, C)
        bsz, seq_len, _ = x.shape
        # Project to QKV and reshape: (B, T, 3*C) → (B, n_head, T, head_dim)
        qkv = self.qkv_proj(x).view(bsz, seq_len, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, n_head, T, head_dim)
        # Use torch's optimized SDPA when available (PyTorch ≥2.0).
        # SDPA applies dropout whenever dropout_p > 0, so disable it outside training.
        if hasattr(F, "scaled_dot_product_attention"):
            attn_output = F.scaled_dot_product_attention(
                q, k, v, dropout_p=self.attn_dropout if self.training else 0.0, is_causal=True
            )  # (B, n_head, T, head_dim)
        else:
            # Fallback implementation with an explicit causal mask
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
            causal_mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device, dtype=torch.bool))
            scores = scores.masked_fill(~causal_mask, float("-inf"))
            attn_weights = F.softmax(scores, dim=-1)
            attn_weights = F.dropout(attn_weights, p=self.attn_dropout, training=self.training)
            attn_output = attn_weights @ v  # (B, n_head, T, head_dim)
        # Merge heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(bsz, seq_len, self.embed_dim)
        return self.out_proj(attn_output)
# The MLP
class MLP(nn.Module):
    """GPT-2 feed-forward network (Linear → GELU → Linear)."""
    def __init__(self, embed_dim: int, expansion_factor: int = 4):
        super().__init__()
        hidden_dim = expansion_factor * embed_dim
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, embed_dim)
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, C)
        return self.fc2(self.act(self.fc1(x)))
# The transformer block
class TransformerBlock(nn.Module):
    """A single transformer block (LN → Attn → Add → LN → MLP → Add)."""
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        self.ln_1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads, dropout)
        self.ln_2 = nn.LayerNorm(embed_dim)
        self.mlp = MLP(embed_dim)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
# The GPT-2 model definition
class GPT2LMHeadModel(nn.Module):
    """Minimal GPT-2 Causal-LM with tied input/output embeddings."""
    def __init__(
        self,
        *,
        vocab_size: int,
        n_positions: int,
        n_embd: int,
        n_layer: int,
        n_head: int,
        dropout: float = 0.1,
    ) -> None:
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.wpe = nn.Embedding(n_positions, n_embd)
        self.drop = nn.Dropout(dropout)
        self.h = nn.ModuleList([TransformerBlock(n_embd, n_head, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        # Language model head (weights tied to token embedding matrix)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # weight tying
        # Initialize parameters following GPT-2 scheme
        self._init_weights()
    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:  # (B, T) → (B, T, V)
        batch_size, seq_len = input_ids.shape
        if seq_len > self.wpe.num_embeddings:
            raise ValueError(f"Sequence length {seq_len} exceeds maximum context size {self.wpe.num_embeddings}.")
        pos_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, seq_len)
        x = self.wte(input_ids) + self.wpe(pos_ids)
        x = self.drop(x)
        for block in self.h:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        return logits
    def _init_weights(self):
        """Parameter initialization following GPT-2 conventions."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                # GPT-2 uses normal(0, 0.02)
                nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                nn.init.normal_(module.weight, mean=0.0, std=0.02)
# Helper entrypoint
def build_gpt2_model(
    *,
    vocab_size: int = 50257,
    n_positions: int = 2048,
    n_ctx: int | None = None,
    n_embd: int = 768,
    n_layer: int = 12,
    n_head: int = 12,
    bos_token_id: int = 50256,  # kept for API backward-compat (unused)
    eos_token_id: int = 50256,  # kept for API backward-compat (unused)
    attn_implementation: str = "flash_attention_2",  # retained but ignored
    **extra_cfg: Any,  # ignored to preserve call-sites that used to pass config tweaks
) -> nn.Module:
    """Instantiate and return a *pure-PyTorch* GPT-2 language model.
    The function intentionally keeps the same signature as the original
    wrapper so existing YAML/CLI configurations continue to work.
    Extra keyword arguments are quietly ignored.
    """
    # Map legacy *n_ctx* to *n_positions* if provided.
    if n_ctx is not None and n_ctx != n_positions:
        n_positions = n_ctx
    # Issue a gentle warning if the user passes unused extra kwargs.
    if extra_cfg:
        invalid = ", ".join(extra_cfg.keys())
        print(
            f"[build_gpt2_model] Warning: Ignoring unsupported keyword arguments: {invalid}.",
            flush=True,
        )
    return GPT2LMHeadModel(
        vocab_size=vocab_size,
        n_positions=n_positions,
        n_embd=n_embd,
        n_layer=n_layer,
        n_head=n_head,
    )

In short, build_gpt2_model(...) constructs a compact GPT‑2 with configurable depth/width/heads and returns an nn.Module that outputs logits over the vocabulary. It’s intentionally lean (no KV‑cache or generation helpers) but perfectly suited for forward/backward passes and next‑token prediction.
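When scaling depth or width, it helps to know how many parameters a given configuration produces. The sketch below counts them from the layer shapes in the listing above (biased Linear layers, LayerNorm with weight and bias, and an lm_head tied to wte). gpt2_param_count is a helper written for this guide, not part of NeMo AutoModel.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count trainable parameters of the minimal GPT-2 defined above."""
    c = n_embd
    embeddings = vocab_size * c + n_positions * c  # wte + wpe
    attn = (c * 3 * c + 3 * c) + (c * c + c)       # qkv_proj + out_proj
    mlp = (c * 4 * c + 4 * c) + (4 * c * c + c)    # fc1 + fc2
    layer_norms = 2 * (2 * c)                      # ln_1 + ln_2 (weight, bias)
    block = attn + mlp + layer_norms               # = 12*c**2 + 13*c
    final_ln = 2 * c                               # ln_f
    # lm_head shares its weight with wte and has no bias, so it adds nothing.
    return embeddings + n_layer * block + final_ln


total = gpt2_param_count(vocab_size=50258, n_positions=2048, n_embd=768, n_layer=12)
print(f"{total:,} parameters")
```

Note that n_head does not appear: splitting the embedding into more or fewer heads changes the attention pattern but not the parameter count.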

To use this exact implementation directly from a file path, point _target_ to the file and object name (path.py:object). Absolute paths are recommended:

model:
  _target_: /abs/path/to/repo/nemo_automodel/components/models/gpt2.py:build_gpt2_model
  vocab_size: 50258
  n_positions: 2048
  n_embd: 768
  n_layer: 12
  n_head: 12

This loads the file on disk and calls build_gpt2_model(...) with the remaining keys as keyword arguments.
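Loading a callable from a path.py:object string can be done with importlib alone. The following is a minimal sketch of that mechanism under our own assumptions: load_from_file is a hypothetical helper, and the demo writes a tiny stand-in builder to a temporary file instead of using the real gpt2.py.

```python
import importlib.util
import os
import tempfile
from typing import Any


def load_from_file(target: str) -> Any:
    """Resolve '/path/to/file.py:object_name' to the named object."""
    path, _, name = target.rpartition(":")
    if not path or not name:
        raise ValueError(f"expected 'path.py:object', got {target!r}")
    spec = importlib.util.spec_from_file_location("_target_module", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, name)


# Demo with a stand-in builder that returns a dict instead of an nn.Module.
demo_path = os.path.join(tempfile.mkdtemp(), "my_model.py")
with open(demo_path, "w") as f:
    f.write(
        "def build_model(*, n_layer=12, n_embd=768):\n"
        "    return {'n_layer': n_layer, 'n_embd': n_embd}\n"
    )

builder = load_from_file(f"{demo_path}:build_model")
model = builder(n_layer=6)  # remaining YAML keys would arrive as kwargs
```

Splitting on the last colon keeps the sketch usable with paths that themselves contain a colon, such as Windows drive letters.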

Import Path to a Callable (Function or Class)

Instead of a file path, you can reference the callable using its import path:

# examples/llm_pretrain/nanogpt_pretrain.yaml
model:
  _target_: nemo_automodel.components.models.gpt2.build_gpt2_model
  vocab_size: 50258
  n_positions: 2048
  n_embd: 768
  n_layer: 12
  n_head: 12

Hugging Face Models using `from_config` Function

You can instantiate any Hugging Face causal LM with a config-first flow by targeting a from_config callable and providing a nested config node. The nested node is itself resolved using _target_, so you can compose Hugging Face configs directly in YAML.

model:
  _target_: transformers.AutoModelForCausalLM.from_config
  # Nested object: built first, then passed to from_config(config=...)
  config:
    _target_: transformers.AutoConfig.from_pretrained
    pretrained_model_name_or_path: gpt2   # or "Qwen/Qwen2-1.5B", etc.
    n_layer: 12
    n_head: 12
    n_positions: 2048
    vocab_size: 50258

Alternatively, target a specific architecture:

model:
  _target_: transformers.GPT2LMHeadModel.from_config
  config:
    _target_: transformers.GPT2Config
    n_layer: 12
    n_head: 12
    n_positions: 2048
    vocab_size: 50258

The model._target_ may reference an import path or a local Python file using the path.py:object form.
Any nested mapping that includes _target_ (e.g., config:) is instantiated first and its result is passed upward. This is how the Hugging Face from_config pattern works.
You can keep using the same training recipe (optimizer, data, distributed settings); only the model: block changes.

Inspect and Adjust the YAML Configuration

examples/llm_pretrain/nanogpt_pretrain.yaml is a complete configuration that:

Defines a GPT-2 model using the build_gpt2_model shorthand (easy to scale up).
Points file_pattern at preprocessed binary data files (configure based on your preprocessing output).
Uses the new NanogptDataset with seq_len=1024.
Sets a vanilla AdamW optimizer with learning rate 2e-4.
Includes FSDP2 distributed training configuration.

Key configuration sections:

# Model configuration (two options available)
model:
  _target_: nemo_automodel.components.models.gpt2.build_gpt2_model
  vocab_size: 50258
  n_positions: 2048
  n_embd: 768
  n_layer: 12
  n_head: 12

# Dataset configuration
dataset:
  _target_: nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset
  file_pattern: "tools/fineweb_max_tokens_500M/dataset.bin"
  seq_len: 1024
  shuffle_files: true

# Distributed training
distributed:
  strategy: fsdp2
  dp_size: null
  tp_size: 1
  cp_size: 1

About _target_ configuration: The _target_ field specifies the import path of a class or function within the nemo_automodel package (or any other Python module). For example, nemo_automodel.components.models.gpt2.build_gpt2_model imports and calls the GPT-2 model builder function. You can also point it at your own code, either by import path (e.g., my_custom_models.MyTransformer) or by local file path using the path.py:object form, to use custom nn.Module implementations while reusing the training infrastructure.

Update the file_pattern to match your data location. For example, if using tools/nanogpt_data_processor.py with the default settings: "tools/fineweb_max_tokens_500M/dataset.bin"
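For intuition about what seq_len controls, the sketch below shows one common way to turn a flat token stream into next-token-prediction examples: each window holds seq_len inputs, and the labels are the same tokens shifted by one. This is an illustration under our own assumptions (make_examples is a hypothetical helper); the real NanogptDataset may handle strides, shuffling, and file boundaries differently.

```python
from typing import Iterator


def make_examples(tokens: list[int], seq_len: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (inputs, labels) pairs where labels are inputs shifted by one."""
    # Each example reads seq_len + 1 tokens. Windows advance by seq_len, so
    # the last label of one window is the first input of the next.
    for start in range(0, len(tokens) - seq_len, seq_len):
        window = tokens[start : start + seq_len + 1]
        yield window[:-1], window[1:]


examples = list(make_examples(list(range(10)), seq_len=4))
for inputs, labels in examples:
    print(inputs, "->", labels)
```

Trailing tokens that cannot fill a complete window are dropped in this sketch, which is negligible for a file with hundreds of millions of tokens.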

Scale width/depth, batch_size, or seq_len as needed - the recipe is model-agnostic.

Launch Training

$ # Single-GPU run (good for local testing)
$ python examples/llm_pretrain/pretrain.py \
>   --config examples/llm_pretrain/nanogpt_pretrain.yaml
$ 
$ # Multi-GPU (e.g., 8x H100)
$ torchrun --standalone --nproc-per-node 8 \
>   examples/llm_pretrain/pretrain.py \
>   --config examples/llm_pretrain/nanogpt_pretrain.yaml
$ 
$ # Using the automodel CLI:
$ # single-GPU
$ automodel examples/llm_pretrain/nanogpt_pretrain.yaml
$ 
$ # multi-GPU (8 GPUs)
$ automodel --nproc-per-node 8 examples/llm_pretrain/nanogpt_pretrain.yaml

Adjust the distributed section in the YAML config to switch between strategies such as DDP and FSDP2.
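The example configuration leaves dp_size as null. A common convention, which we assume here rather than confirm from the NeMo AutoModel source, is that an unset data-parallel size is derived from the number of launched processes divided by the tensor-parallel and context-parallel sizes. infer_dp_size below is a hypothetical helper that sketches that arithmetic.

```python
from typing import Optional


def infer_dp_size(
    world_size: int,
    tp_size: int = 1,
    cp_size: int = 1,
    dp_size: Optional[int] = None,
) -> int:
    """Derive the data-parallel size from the total process count.

    Assumption: world_size == dp_size * tp_size * cp_size.
    """
    model_parallel = tp_size * cp_size
    if world_size % model_parallel:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp_size*cp_size={model_parallel}"
        )
    inferred = world_size // model_parallel
    if dp_size is not None and dp_size != inferred:
        raise ValueError(f"dp_size={dp_size} conflicts with inferred value {inferred}")
    return inferred


print(infer_dp_size(world_size=8))             # 8 GPUs, pure data parallel
print(infer_dp_size(world_size=8, tp_size=2))  # 8 GPUs, 2-way tensor parallel
```

Under this assumption, the 8-GPU launch with tp_size: 1 and cp_size: 1 gives eight data-parallel replicas.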

The TrainFinetuneRecipeForNextTokenPrediction class handles:

Distributed (FSDP2 / TP / CP) wrapping if requested in the YAML.
Gradient accumulation, LR scheduling, checkpointing, optional W&B logging.
Validation loops if you supply validation_dataset.
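Gradient accumulation ties together the global batch size, the per-GPU local batch size, and the number of data-parallel replicas. The sketch below shows the usual relationship and the resulting tokens per optimizer step. The numeric values are hypothetical placeholders (only seq_len = 1024 comes from this guide), and the exact formula the recipe uses is an assumption.

```python
def accumulation_steps(global_batch_size: int, local_batch_size: int, dp_size: int) -> int:
    """Micro-batches each replica processes before one optimizer step."""
    samples_per_pass = local_batch_size * dp_size
    if global_batch_size % samples_per_pass:
        raise ValueError("global_batch_size must be a multiple of local_batch_size * dp_size")
    return global_batch_size // samples_per_pass


# Hypothetical values; read the real ones from your YAML.
GLOBAL_BATCH_SIZE = 512
LOCAL_BATCH_SIZE = 8
DP_SIZE = 8
SEQ_LEN = 1024  # matches the seq_len used in this guide

micro_steps = accumulation_steps(GLOBAL_BATCH_SIZE, LOCAL_BATCH_SIZE, DP_SIZE)
tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN
steps_for_500m = 500_000_000 // tokens_per_step

print(f"accumulation steps: {micro_steps}")
print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"optimizer steps for one pass over 500M tokens: {steps_for_500m}")
```

On a single GPU (DP_SIZE = 1), the same global batch size would need eight times as many accumulation steps, which is why reducing global_batch_size is the simpler way to scale down.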

Checkpoints are written under checkpoints/ by default as safetensors or torch_save (YAML-configurable).

Monitor and Evaluate Training

TPS (tokens per second), gradient norm, and loss statistics are printed at every optimization step.
Enable wandb in the YAML for dashboards (wandb.project, wandb.entity, etc.).
Periodic checkpoints can be loaded using TrainFinetuneRecipeForNextTokenPrediction.load_checkpoint().

Example W&B configuration:

wandb:
  project: "nanogpt-pretraining"
  entity: "your-wandb-entity"
  name: "nanogpt-500M-tokens"

Explore Further Work

Scaling up: Swap the GPT-2 config for LlamaForCausalLM, Qwen2, or any Hugging Face-compatible causal model; increase n_layer, n_embd, etc.
Mixed precision: Use FSDP2 with bfloat16 (dtype: bfloat16 in the distributed config) for memory savings.
Sequence packing: Set packed_sequence.packed_sequence_size > 0 to pack variable-length contexts and boost utilization.
Custom datasets: Implement your own IterableDataset or convert existing corpora to the .bin format using tools/nanogpt_data_processor.py as a template.
BOS alignment: Set align_to_bos: true in the dataset config to ensure sequences start with BOS tokens (requires the bos_token parameter).
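To illustrate what BOS alignment means in practice, the sketch below builds training windows that each begin at a BOS token, skipping BOS positions that fall inside the previous window. This is one plausible reading of the option, not the NanogptDataset implementation; bos_aligned_examples and the toy token ids are hypothetical.

```python
from typing import Iterator


def bos_aligned_examples(
    tokens: list[int], seq_len: int, bos_id: int
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (inputs, labels) windows whose first input is a BOS token."""
    bos_positions = [i for i, t in enumerate(tokens) if t == bos_id]
    next_free = 0  # first position not yet used as an input
    for pos in bos_positions:
        if pos < next_free:
            continue  # this BOS sits inside the previous window
        window = tokens[pos : pos + seq_len + 1]
        if len(window) < seq_len + 1:
            break  # not enough tokens left for a full window
        yield window[:-1], window[1:]
        next_free = pos + seq_len


BOS = 9  # toy BOS id for the demo stream below
stream = [9, 1, 2, 9, 3, 4, 5, 9, 6, 7, 9, 8]
aligned = list(bos_aligned_examples(stream, seq_len=3, bos_id=BOS))
for inputs, labels in aligned:
    print(inputs, "->", labels)
```

Precomputing the BOS positions is what makes this lookup cheap at training time, and it is consistent with the preprocessing script keeping a BOS index alongside the tokens.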