NeMo 2.0 Data Modules

NeMo provides two primary data modules for working with Large Language Models (LLMs):

PreTrainingDataModule

Located in nemo.collections.llm.gpt.data.pre_training, this module is optimized for unsupervised pre-training of LLMs from scratch on large corpora of text. For this workflow, the dataset is expected to be pre-tokenized and saved to disk as token indices in the Megatron dataset format.

It supports the following (a configuration sketch appears after the list):

  • Training on multiple data distributions with customizable weights

  • Efficient data loading through memory mapping

  • Automatic validation and test set creation

  • Built-in data validation and accessibility checks

  • Support for distributed training with Megatron-style data parallelism

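As a minimal configuration sketch (the dataset path prefixes below are placeholders, and the constructor argument names should be checked against your installed NeMo version), two Megatron-format corpora can be blended with custom weights:

```python
from nemo.collections.llm.gpt.data.pre_training import PreTrainingDataModule

# Sketch: blend two pre-tokenized Megatron-format corpora with 70/30 sampling weights.
# The path prefixes are hypothetical and should point at your own
# <prefix>.bin / <prefix>.idx pairs produced by the Megatron preprocessing tools.
data = PreTrainingDataModule(
    paths=["/data/web_text_document", "/data/code_text_document"],
    weights=[0.7, 0.3],      # relative sampling weights across the two corpora
    seq_length=2048,         # training sequence length
    micro_batch_size=2,      # per-GPU batch size
    global_batch_size=256,   # effective batch size across data-parallel ranks
    split="99,1,0",          # train/validation/test split over the blended data
)
```
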
FineTuningDataModule

Located in nemo.collections.llm.gpt.data.fine_tuning, this module is designed for supervised fine-tuning (including parameter-efficient fine-tuning) of pre-trained models on specific tasks or domains.

Key features include the following (a usage sketch appears after the list):

  • Support for standard fine-tuning datasets in JSONL format

  • Packed sequence training for improved efficiency

  • Automatic handling of train/validation/test splits

  • Integration with various tokenizers

  • Memory-efficient data loading

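As a minimal usage sketch (the dataset directory is a placeholder, and the exact constructor arguments should be verified against your NeMo version), the module is pointed at a directory of JSONL files:

```python
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule

# Sketch: supervised fine-tuning over a JSONL dataset directory.
# dataset_root is assumed to contain training.jsonl, validation.jsonl, and
# test.jsonl files; the path below is a placeholder.
data = FineTuningDataModule(
    dataset_root="/data/my_sft_dataset",
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=32,
)
```

Packed sequence training is enabled through an additional constructor option on this module (a packed-sequence specification argument in recent releases); see the packed sequence documentation for details.
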
Both modules inherit from PyTorch Lightning’s LightningDataModule, providing a consistent interface while being optimized for their respective use cases. The separation between pre-training and fine-tuning data modules reflects the distinct requirements and optimizations needed for these two phases of LLM development.
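
Because both classes implement the LightningDataModule interface, either one can be handed to the same NeMo training entry point. The sketch below assumes the high-level llm.train API and a small GPT configuration; class and argument names may differ slightly across NeMo versions:

```python
import nemo.lightning as nl
from nemo.collections import llm

# Sketch: either data module (pre-training or fine-tuning) plugs into the
# same training call because both expose the LightningDataModule interface.
# The model configuration and trainer settings here are illustrative only.
model = llm.GPTModel(llm.GPTConfig126M())
trainer = nl.Trainer(
    devices=8,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(),
)

# `data` is a PreTrainingDataModule or FineTuningDataModule built as above.
llm.train(
    model=model,
    data=data,
    trainer=trainer,
    tokenizer="data",  # assumption: take the tokenizer attached to the data module
)
```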

For detailed usage of the two data modules, please see the following pages.