LLM Pre-Training with NeMo AutoModel
This guide covers FineWeb data preparation, defining a NanoGPT‑style model, and launching and monitoring a NeMo AutoModel pre‑training run.
Set Up Your Environment
In this guide, we will use an interactive environment to install NeMo AutoModel from Git. You can also install NeMo AutoModel from PyPI or use our bi-monthly Docker container (see the Installation Guide).
For this guide, we will use a single machine equipped with 8 NVIDIA H100 GPUs.
To run this guide on a single GPU, use the single-GPU command in the Launch Training section below and scale down the YAML (for example, reduce step_scheduler.global_batch_size / local_batch_size, and shrink the model using model.n_layer / model.n_embd / model.n_head). For more launch patterns, see Run on Your Local Workstation.
Preprocess the FineWeb Dataset
File Size Limitation: The nanogpt_data_processor.py script has a 4GB file size limit (~2^32 bytes) due to 32-bit position tracking in the BOS index. This translates to:
- ~2 billion tokens when using uint16 (vocabularies < 65,536 tokens, e.g., GPT-2)
- ~1 billion tokens when using uint32 (larger vocabularies)
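Always use the --max-tokens flag to stay within these limits (e.g., --max-tokens 2B or --max-tokens 1.5B with a uint16 vocabulary such as GPT-2; stay at or below roughly 1B tokens when token IDs are stored as uint32).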
For larger datasets, please see pretraining.md which supports sharded preprocessing without these constraints.
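The arithmetic behind these limits can be checked with a few lines of Python. This is a minimal sketch based only on the numbers stated above (a 2^32-byte file limit, 2 bytes per token for uint16, 4 bytes for uint32); the helper names max_tokens_for and fits are illustrative and are not part of nanogpt_data_processor.py.

```python
MAX_FILE_BYTES = 2**32  # ~4 GB limit from 32-bit position tracking


def max_tokens_for(vocab_size: int) -> int:
    """Largest token count that fits in one file for a given vocabulary size."""
    # Vocabularies below 65,536 entries fit in uint16 (2 bytes per token);
    # larger vocabularies need uint32 (4 bytes per token).
    bytes_per_token = 2 if vocab_size < 65_536 else 4
    return MAX_FILE_BYTES // bytes_per_token


def fits(max_tokens: int, vocab_size: int) -> bool:
    """True if a --max-tokens value stays within the file size limit."""
    return max_tokens <= max_tokens_for(vocab_size)


gpt2_limit = max_tokens_for(50_257)    # GPT-2 vocabulary, stored as uint16
large_limit = max_tokens_for(128_256)  # an example of a larger vocabulary, stored as uint32
```

With these helpers, fits(1_500_000_000, 50_257) is true while fits(1_500_000_000, 128_256) is false, which is why a 1.5B-token budget is only safe with a uint16 vocabulary.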
Quick Introduction to the FineWeb Dataset
The FineWeb dataset consists of more than 18.5T tokens of cleaned and deduplicated English web data from CommonCrawl. For this guide, we use the sample-10BT subset (10 billion tokens), from which we extract a smaller sample (e.g., 500M tokens) that fits within the preprocessing tool’s limits.
Briefly, FineWeb is built by extracting main text from CommonCrawl WARC HTML, keeping English pages using fastText language scoring, applying multiple quality filters (e.g., Gopher repetition/quality checks, C4-style rules, and custom heuristics for list-like or repeated/poorly formatted lines), and then MinHash-deduplicating each crawl independently (5-gram shingling with 14×8 hash functions). Basic PII normalization is applied (e.g., anonymizing emails and public IPs). The result is released per-crawl (and convenient sampled subsets), ready for high-throughput streaming.
To train on more than 2B tokens from FineWeb, see pretraining.md which uses Megatron Core’s sharded dataset format without file size constraints.
Preprocessing and Tokenization
For this guide, we provide a data preprocessing tool, nanogpt_data_processor.py, that streams datasets from the Hugging Face Hub, tokenizes them with Hugging Face’s transformers.AutoTokenizer (default: GPT-2), and writes the tokens to memory-mappable binary shard files. During training, the NanogptDataset class streams these shards efficiently.
How the preprocessor works: The script streams data iteratively from the Hugging Face Hub (avoiding loading the entire dataset into memory), uses a multiprocessing pipeline with separate reader and writer processes, and parallelizes tokenization across multiple CPU cores using ProcessPoolExecutor. This design enables efficient processing of very large datasets while maintaining low memory overhead. By default, the script uses the gpt2 tokenizer; pass the --tokenizer option to use a different one.
Consider the following options:
- Adjust --max-tokens to control how many tokens to process (must stay within the 4GB file size limit mentioned above).
- Adjust --chunk-size for processing batch size.
- Use --num-workers to control parallelization.
- Specify --output-dir to change the output location.
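The chunked, parallel tokenization described above can be sketched roughly as follows. This is a toy illustration, not the actual script: toy_tokenize is a hypothetical stand-in for the Hugging Face tokenizer, BOS_ID is an assumed value, and the parameter names merely echo the --max-tokens, --chunk-size, and --num-workers options. The real tool also uses separate reader and writer processes, which this sketch omits.

```python
from array import array
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice

BOS_ID = 0  # hypothetical BOS token id used by the toy tokenizer


def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer: BOS followed by one id (1..65000) per word."""
    return [BOS_ID] + [1 + (sum(word.encode()) % 65_000) for word in text.split()]


def chunked(docs, chunk_size):
    """Yield lists of up to chunk_size documents from any iterable."""
    it = iter(docs)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def tokenize_chunk(chunk):
    """Tokenize one chunk; this is the unit of work sent to a worker."""
    return [toy_tokenize(doc) for doc in chunk]


def preprocess(docs, max_tokens, chunk_size=2, num_workers=2,
               executor_cls=ProcessPoolExecutor):
    """Tokenize docs in parallel and collect at most max_tokens uint16 ids."""
    tokens = array("H")  # unsigned 16-bit storage, as for vocabularies < 65,536
    with executor_cls(max_workers=num_workers) as pool:
        # Note: Executor.map submits every chunk up front; a production
        # pipeline would bound the number of in-flight chunks instead.
        for tokenized in pool.map(tokenize_chunk, chunked(docs, chunk_size)):
            for doc_tokens in tokenized:
                room = max_tokens - len(tokens)
                if room <= 0:
                    return tokens
                tokens.extend(doc_tokens[:room])
    return tokens


# ThreadPoolExecutor is swapped in here only to keep the demo lightweight.
sample = preprocess(["a b", "c d e", "f"], max_tokens=6,
                    executor_cls=ThreadPoolExecutor)
```

The resulting array can be written to disk with sample.tofile(...), which gives a flat binary file of 2-byte token ids similar in spirit to the .bin shards.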
Understand the NeMo AutoModel Training Workflow
NeMo AutoModel follows a simple but powerful flow for training:
- A Python recipe script (for example, examples/llm_pretrain/pretrain.py) serves as the entry point that wires up all training components based on a YAML configuration file. Any configuration option can be overridden using CLI arguments (e.g., --model.name abc).
- The YAML file describes each component of the training job (such as model, dataset, optimizer, distributed, checkpoint, and optional wandb).
- Each component is constructed from its _target_, which points to a Python callable (function or class constructor) to instantiate. The remaining keys in that YAML block become keyword arguments for that callable.
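To make the CLI override idea concrete, here is a minimal sketch of how dotted flags such as --model.name abc could be merged into a nested configuration. It is an illustration under assumptions, not NeMo AutoModel's actual argument parser: apply_overrides and parse_scalar are hypothetical helpers, and the real recipe may parse values differently.

```python
import ast
import copy


def parse_scalar(raw: str):
    """Interpret '6' or '1e-4' as numbers; leave anything else as a string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: dict, argv: list[str]) -> dict:
    """Apply `--a.b.c value` pairs to a copy of a nested config dict."""
    if len(argv) % 2:
        raise ValueError("overrides must come in '--key value' pairs")
    result = copy.deepcopy(config)  # leave the loaded YAML config untouched
    for flag, raw in zip(argv[::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a --dotted.key, got {flag!r}")
        keys = flag[2:].split(".")
        node = result
        for key in keys[:-1]:
            node = node.setdefault(key, {})  # create intermediate sections as needed
        node[keys[-1]] = parse_scalar(raw)
    return result


base = {"model": {"n_layer": 12}, "optimizer": {"lr": 2e-4}}
merged = apply_overrides(
    base, ["--model.n_layer", "6", "--optimizer.lr", "1e-4", "--model.name", "abc"]
)
```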
How _target_ is resolved:
- Import path to a Python object (for example, my_pkg.models.build_model).
- Local Python file path plus object name (for example, /abs/path/to/my_model.py:build_model).
- Library callables such as Hugging Face transformers.AutoModelForCausalLM.from_config.
Nested objects can also specify their own _target_ (common when building Hugging Face config objects first and passing them into a from_config method). Any YAML key can be overridden at launch time from the CLI, making it easy to tweak hyperparameters without editing files.
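The resolution rules above can be sketched in plain Python. This is a simplified illustration of the general pattern, assuming import-path targets only; resolve_target and instantiate are hypothetical helper names, not NeMo AutoModel functions, and the standard-library targets used here merely stand in for model and config classes.

```python
import importlib


def resolve_target(target: str):
    """Return the object named by a dotted import path such as 'pkg.mod.attr'."""
    parts = target.split(".")
    # Try the longest module prefix first, then walk the rest as attributes,
    # so 'pkg.Class.method' style targets also resolve.
    for split in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except ModuleNotFoundError:
            continue
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {target!r}")


def instantiate(node):
    """Build a node that has _target_; nested _target_ values are built first."""
    if isinstance(node, dict) and "_target_" in node:
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        return resolve_target(node["_target_"])(**kwargs)
    return node  # plain values pass through unchanged


# Stand-in for a 'model' block whose nested 'inner' node has its own _target_.
config = {
    "_target_": "builtins.dict",
    "inner": {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4},
    "n": 2,
}
built = instantiate(config)
```

The nested node is constructed before the outer callable runs, which mirrors how a config object can be built first and then handed to a from_config style callable.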
With this context, let’s define a model using _target_, then point the dataset at your preprocessed shards, and finally review the full YAML.
Define Your Own Model Architecture
NeMo AutoModel relies on a YAML-driven configuration to build every training component. In particular, the model._target_ must reference a callable that returns an nn.Module (or a compatible Hugging Face model). You can point _target_ at:
- An import path to a Python object.
- A local Python file plus the object name using path.py:object_name.
- A library callable such as transformers.AutoModelForCausalLM.from_config.
Below are examples for each pattern.
NanoGPT Source and File-Path _target_
Below is the minimal GPT‑2 implementation used for this NanoGPT‑style pretraining flow. It is a pure‑PyTorch model with tied embeddings and standard transformer blocks:
In short, build_gpt2_model(...) constructs a compact GPT‑2 with configurable depth/width/heads and returns an nn.Module that outputs logits over the vocabulary. It’s intentionally lean (no KV‑cache or generation helpers) but perfectly suited for forward/backward passes and next‑token prediction.
To use this exact implementation directly from a file path, point _target_ to the file and object name (path.py:object). Absolute paths are recommended:
This loads the file on disk and calls build_gpt2_model(...) with the remaining keys as keyword arguments.
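For intuition, loading a callable from a path.py:object string can be done with the standard library as sketched below. This is an assumption about the general mechanism, not NeMo AutoModel's loader: load_from_file, my_model.py, and build_model are hypothetical names, and the toy builder returns a dict rather than an nn.Module so that the example has no dependencies.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_from_file(target: str):
    """Load `object` from a '/path/to/file.py:object' style string."""
    file_path, obj_name = target.rsplit(":", 1)
    spec = importlib.util.spec_from_file_location(Path(file_path).stem, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file's top-level code
    return getattr(module, obj_name)


# Write a tiny hypothetical model file, then load its builder by absolute path.
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "my_model.py"
    model_file.write_text(
        "def build_model(n_layer, n_embd):\n"
        "    return {'n_layer': n_layer, 'n_embd': n_embd}\n"
    )
    builder = load_from_file(f"{model_file}:build_model")
    model = builder(n_layer=2, n_embd=64)  # remaining keys become keyword arguments
```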
Import Path to a Callable (Function or Class)
Instead of a file path, you can reference the callable using its import path:
Hugging Face Models using from_config Function
You can instantiate any Hugging Face causal LM with a config-first flow by targeting a from_config callable and providing a nested config node. The nested node is itself resolved using _target_, so you can compose Hugging Face configs directly in YAML.
Alternatively, target a specific architecture:
- The model._target_ may reference an import path or a local Python file using the path.py:object form.
- Any nested mapping that includes _target_ (e.g., config:) is instantiated first and its result is passed upward. This is how the Hugging Face from_config pattern works.
- You can keep using the same training recipe (optimizer, data, distributed settings); only the model: block changes.
Inspect and Adjust the YAML Configuration
examples/llm_pretrain/nanogpt_pretrain.yaml is a complete configuration that:
- Defines a GPT-2 model using the build_gpt2_model shorthand (easy to scale up).
- Points file_pattern at preprocessed binary data files (configure based on your preprocessing output).
- Uses the new NanogptDataset with seq_len=1024.
- Sets a vanilla AdamW optimizer with learning rate 2e-4.
- Includes FSDP2 distributed training configuration.
Key configuration sections:
About _target_ configuration: The _target_ field specifies import paths to classes and functions within the nemo_automodel package (or any Python module). For example, nemo_automodel.components.models.gpt2.build_gpt2_model imports and calls the GPT-2 model builder function. You can also point to your own modules (e.g., my_custom_models.MyTransformer) or to a local file using the path.py:object form to use custom nn.Module implementations, allowing full flexibility in model architecture while leveraging the training infrastructure.
Update the file_pattern to match your data location. For example, if using tools/nanogpt_data_processor.py with the default settings: "tools/fineweb_max_tokens_500M/dataset.bin"
Scale width/depth, batch_size, or seq_len as needed - the recipe is model-agnostic.
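When scaling width and depth, a quick parameter-count estimate helps judge whether a configuration fits your hardware. The sketch below assumes the standard GPT-2 layout (learned position embeddings, biases on all linear layers, a 4x MLP expansion, and an output head tied to the token embedding); the build_gpt2_model implementation may differ in details, and the default vocab_size and n_ctx values are assumptions.

```python
def gpt2_param_count(n_layer: int, n_embd: int,
                     vocab_size: int = 50_257, n_ctx: int = 1024) -> int:
    """Estimate trainable parameters for a standard GPT-2 with tied embeddings."""
    # Token embedding table plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_ctx * n_embd
    # Attention: fused QKV projection and output projection (weights + biases).
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * n_embd
    final_norm = 2 * n_embd
    # The output head reuses the token embedding, so it adds no parameters.
    return embeddings + n_layer * (attn + mlp + layer_norms) + final_norm


gpt2_small = gpt2_param_count(n_layer=12, n_embd=768)
```

Note that n_head does not appear: under this layout, changing the number of heads splits the same projections differently without changing the parameter count.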
Launch Training
Adjust the distributed section in the YAML config to change between DDP, FSDP2, etc.
The TrainFinetuneRecipeForNextTokenPrediction class handles:
- Distributed (FSDP2 / TP / CP) wrapping if requested in the YAML.
- Gradient accumulation, LR scheduling, checkpointing, optional W&B logging.
- Validation loops if you supply validation_dataset.
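As a rough guide to how the batch-size settings interact with gradient accumulation, the sketch below uses the common convention that the global batch equals the local batch times the number of data-parallel ranks times the accumulation steps. Whether the recipe computes it exactly this way is an assumption, and grad_accum_steps is a hypothetical helper.

```python
def grad_accum_steps(global_batch_size: int, local_batch_size: int,
                     dp_world_size: int) -> int:
    """Micro-batches each rank accumulates before one optimizer step."""
    samples_per_micro_step = local_batch_size * dp_world_size
    if global_batch_size % samples_per_micro_step:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "local_batch_size * dp_world_size"
        )
    return global_batch_size // samples_per_micro_step


eight_gpu = grad_accum_steps(global_batch_size=512, local_batch_size=8, dp_world_size=8)
single_gpu = grad_accum_steps(global_batch_size=32, local_batch_size=4, dp_world_size=1)
```

Under this convention, moving from 8 GPUs to 1 GPU without changing the global batch size multiplies the accumulation steps by 8, which is one reason to reduce the global batch size for single-GPU runs.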
Checkpoints are written under checkpoints/ by default as safetensors or torch_save (YAML-configurable).
Monitor and Evaluate Training
- TPS (tokens per second), gradient norm, and loss statistics print every optimization step.
- Enable wandb in the YAML for dashboards (wandb.project, wandb.entity, etc.).
- Periodic checkpoints can be loaded using TrainFinetuneRecipeForNextTokenPrediction.load_checkpoint().
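To sanity-check the reported TPS, you can estimate it from the batch settings and the measured step time. This sketch assumes global_batch_size counts sequences and that every sequence contains exactly seq_len tokens; both helper names are illustrative rather than part of the recipe.

```python
import math


def tokens_per_second(global_batch_size: int, seq_len: int,
                      step_seconds: float) -> float:
    """Tokens processed per second for one optimizer step."""
    return global_batch_size * seq_len / step_seconds


def steps_for_token_budget(total_tokens: int, global_batch_size: int,
                           seq_len: int) -> int:
    """Optimizer steps needed to consume total_tokens once."""
    return math.ceil(total_tokens / (global_batch_size * seq_len))


tps = tokens_per_second(global_batch_size=512, seq_len=1024, step_seconds=2.0)
steps = steps_for_token_budget(500_000_000, global_batch_size=512, seq_len=1024)
```

Here a 2-second step with 512 sequences of 1024 tokens corresponds to 262,144 tokens per second, and a 500M-token sample takes 954 such steps.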
Example W&B configuration:
Explore Further Work
- Scaling up: Swap the GPT-2 config for LlamaForCausalLM, Qwen2, or any Hugging Face-compatible causal model; increase n_layer, n_embd, etc.
- Mixed precision: FSDP2 + bfloat16 (dtype: bfloat16 in distributed config) for memory savings.
- Sequence packing: set packed_sequence.packed_sequence_size > 0 to pack variable-length contexts and boost utilization.
- Custom datasets: implement your own IterableDataset or convert existing corpora to the .bin format using tools/nanogpt_data_processor.py as a template.
- BOS alignment: set align_to_bos: true in the dataset config to ensure sequences start with BOS tokens (requires the bos_token parameter).
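The idea behind BOS alignment can be illustrated with a small sketch: record where BOS tokens occur, then move each requested start position to a BOS boundary. This is only one plausible strategy (snapping forward to the next BOS); how NanogptDataset actually aligns sequences is not specified here, and both function names are hypothetical.

```python
from bisect import bisect_left


def bos_positions(tokens: list[int], bos_id: int) -> list[int]:
    """Sorted indices of every BOS token in the stream (a simple BOS index)."""
    return [i for i, tok in enumerate(tokens) if tok == bos_id]


def aligned_window(tokens: list[int], bos_index: list[int],
                   start: int, seq_len: int):
    """Return seq_len tokens starting at the first BOS at or after start.

    Returns None when no BOS remains or too few tokens follow it.
    """
    k = bisect_left(bos_index, start)  # binary search over the sorted BOS index
    if k == len(bos_index):
        return None
    begin = bos_index[k]
    window = tokens[begin:begin + seq_len]
    return window if len(window) == seq_len else None


stream = [0, 5, 6, 0, 7, 8, 9, 0, 1]  # 0 plays the role of the BOS id
index = bos_positions(stream, bos_id=0)
window = aligned_window(stream, index, start=1, seq_len=4)
```

Because every BOS position is stored in the index, the width of the integer type used for those positions bounds how large a file can be addressed, which is consistent with the file size limit described in the preprocessing section.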