NeMo 2.0 Data Modules

NeMo provides two primary data modules for working with Large Language Models (LLMs):

PreTrainingDataModule

Located in nemo.collections.llm.gpt.data.pre_training, this module is optimized for unsupervised pre-training of LLMs from scratch on large corpora of text. For this workflow, the dataset is expected to be pre-tokenized and saved to disk as token indices in the Megatron dataset format.

It supports the following (a configuration sketch appears after the list):

  • Training on multiple data distributions with customizable weights

  • Efficient data loading through memory mapping

  • Automatic validation and test set creation

  • Built-in data validation and accessibility checks

  • Support for distributed training with Megatron-style data parallelism

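As a minimal configuration sketch (the dataset path prefixes below are placeholders, and the constructor argument names should be checked against your installed NeMo version), two Megatron-format corpora can be blended with custom weights:

```python
from nemo.collections.llm.gpt.data.pre_training import PreTrainingDataModule

# Sketch: blend two pre-tokenized Megatron-format corpora with 70/30 sampling weights.
# The path prefixes are hypothetical and should point at your own
# <prefix>.bin / <prefix>.idx pairs produced by the Megatron preprocessing tools.
data = PreTrainingDataModule(
    paths=["/data/web_text_document", "/data/code_text_document"],
    weights=[0.7, 0.3],      # relative sampling weights across the two corpora
    seq_length=2048,         # training sequence length
    micro_batch_size=2,      # per-GPU batch size
    global_batch_size=256,   # effective batch size across data-parallel ranks
    split="99,1,0",          # train/validation/test split over the blended data
)
```
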
FineTuningDataModule

Located in nemo.collections.llm.gpt.data.fine_tuning, this module is designed for supervised fine-tuning (including parameter-efficient fine-tuning) of pre-trained models on specific tasks or domains.

Key features include the following (a usage sketch appears after the list):

  • Support for standard fine-tuning datasets in JSONL format

  • Packed sequence training for improved efficiency

  • Automatic handling of train/validation/test splits

  • Integration with various tokenizers

  • Memory-efficient data loading

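As a minimal usage sketch (the dataset directory is a placeholder, and the exact constructor arguments should be verified against your NeMo version), the module is pointed at a directory of JSONL files:

```python
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule

# Sketch: supervised fine-tuning over a JSONL dataset directory.
# dataset_root is assumed to contain training.jsonl, validation.jsonl, and
# test.jsonl files; the path below is a placeholder.
data = FineTuningDataModule(
    dataset_root="/data/my_sft_dataset",
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=32,
)
```

Packed sequence training is enabled through an additional constructor option on this module (a packed-sequence specification argument in recent releases); see the packed sequence documentation for details.
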
Both modules inherit from PyTorch Lightning’s LightningDataModule, providing a consistent interface while being optimized for their respective use cases. The separation between pre-training and fine-tuning data modules reflects the distinct requirements and optimizations needed for these two phases of LLM development.
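
Because both classes implement the LightningDataModule interface, either one can be handed to the same NeMo training entry point. The sketch below assumes the high-level llm.train API and a small GPT configuration; class and argument names may differ slightly across NeMo versions:

```python
import nemo.lightning as nl
from nemo.collections import llm

# Sketch: either data module (pre-training or fine-tuning) plugs into the
# same training call because both expose the LightningDataModule interface.
# The model configuration and trainer settings here are illustrative only.
model = llm.GPTModel(llm.GPTConfig126M())
trainer = nl.Trainer(
    devices=8,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(),
)

# `data` is a PreTrainingDataModule or FineTuningDataModule built as above.
llm.train(
    model=model,
    data=data,
    trainer=trainer,
    tokenizer="data",  # assumption: take the tokenizer attached to the data module
)
```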

For detailed usage of the two data modules, please see the following pages.