NeMo 2.0 Data Modules#
NeMo provides two primary data modules for working with Large Language Models (LLMs):
PreTrainingDataModule#
Located in nemo.collections.llm.gpt.data.pre_training, this module is optimized for unsupervised pre-training of LLMs from scratch on large corpora of text data. For this workflow, the dataset is expected to be pre-tokenized and saved as token indices on disk in the Megatron dataset format.
It supports:
- Training on multiple data distributions with customizable weights
- Efficient data loading through memory mapping
- Automatic validation and test set creation
- Built-in data validation and accessibility checks
- Distributed training with Megatron-style data parallelism
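The snippet below is a minimal sketch of constructing the module for a weighted blend of two corpora. The dataset prefixes are hypothetical, and the constructor arguments shown (paths, seq_length, micro_batch_size, global_batch_size, split) follow the NeMo 2.0 API but should be checked against the API reference for your release.

```python
from nemo.collections import llm

# Hypothetical dataset prefixes: each refers to a pre-tokenized Megatron
# dataset (a .bin/.idx pair produced by the preprocessing tools).
data = llm.PreTrainingDataModule(
    # Interleaved weight/prefix list: roughly 30% of samples are drawn from
    # corpus A and 70% from corpus B.
    paths=["30", "/data/corpus_a_text_document", "70", "/data/corpus_b_text_document"],
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=32,
    split="900,50,50",  # relative proportions for train/validation/test
)
```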
FineTuningDataModule#
Located in nemo.collections.llm.gpt.data.fine_tuning, this module is designed for supervised fine-tuning (including parameter-efficient fine-tuning) of pre-trained models on specific tasks or domains.
Key features include:
- Support for standard fine-tuning datasets in JSONL format
- Packed-sequence training for improved efficiency
- Automatic handling of train/validation/test splits
- Integration with various tokenizers
- Memory-efficient data loading
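A corresponding sketch for fine-tuning is shown below. The dataset_root path is hypothetical and is assumed to hold training.jsonl, validation.jsonl, and test.jsonl files; as above, the argument names follow the NeMo 2.0 API but may vary between releases.

```python
from nemo.collections import llm

# Hypothetical dataset root; assumed to contain training.jsonl,
# validation.jsonl, and test.jsonl with one JSON record per line.
data = llm.FineTuningDataModule(
    dataset_root="/data/my_sft_dataset",
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=16,
    # Packed-sequence training is enabled through an additional specs argument;
    # see the packed-sequence documentation for details.
)
```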
Both modules inherit from PyTorch Lightning’s LightningDataModule, providing a consistent interface while being optimized for their respective use cases. The separation between pre-training and fine-tuning data modules reflects the distinct requirements and optimizations needed for these two phases of LLM development.
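Because both classes expose the standard LightningDataModule hooks, either instance can be plugged into the same training entry point. The sketch below assumes the NeMo 2.0 llm.train entry point together with the Trainer and MegatronStrategy from nemo.lightning; the model and configuration are placeholders.

```python
import nemo.lightning as nl
from nemo.collections import llm

# `data` is either of the data-module instances constructed above.
trainer = nl.Trainer(
    devices=8,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(tensor_model_parallel_size=1),
    max_steps=1000,
)

llm.train(
    model=llm.LlamaModel(llm.Llama2Config7B()),  # placeholder model/config
    data=data,
    trainer=trainer,
)
```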
For detailed usage of the two data modules, please see the following pages.