Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Data Preparation to Use Megatron-Energon Dataloader#

The Megatron-Energon data loader is designed to handle large-scale distributed training environments efficiently, especially for multimodal models. This documentation outlines how to prepare data in the required format to use it with the NeMo Framework.

For details about Megatron-Energon, refer to the Megatron-Energon introduction. For details about the WebDataset format, see the WebDataset documentation.

The first step in using the Megatron-Energon data loader is to prepare data in the supported format. This involves converting raw datasets into the WebDataset (WDS) format and generating metadata required by Megatron-Energon.

In this example, we show how to convert the LLaVA-Pretrain dataset (in JSON format) to WebDataset format and then prepare it for consumption by Energon datamodules in NeMo.

Step 1: Download the Dataset#

Download the LLaVA-Pretrain dataset from Hugging Face.

Below is an example of the JSON structure:

[
  {
    "id": "GCC_train_002582585",
    "image": "GCC_train_002582585.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Provide a brief description of the given image.\n<image>"
      },
      {
        "from": "gpt",
        "value": "Olive oil is a healthy ingredient used liberally."
      }
    ]
  }
]

Step 2: Convert JSON Data to WebDataset Format#

The JSON dataset can be converted to WebDataset format using a Python script. You can also refer to this example provided in Megatron-LM: convert_llava_pretrain_to_wds.py. Below is an example Python script that performs the conversion. Modify the script as necessary to account for the structure and keys of your data. The script can be run from any directory; simply set the input path (llava_pretrain_dir) and the output path below. The converted data is stored in the output path <path_to_LLaVA-Pretrain>/wds.

import json
import os
import webdataset as wds
from tqdm import tqdm

# Set the path to the LLaVA-Pretrain dataset directory
llava_pretrain_dir = '<path_to_LLaVA-Pretrain>'

# Paths to the dataset files
json_file = os.path.join(llava_pretrain_dir, 'blip_laion_cc_sbu_558k.json')
output_path = os.path.join(llava_pretrain_dir, 'wds')

os.makedirs(output_path, exist_ok=True)

# Load data
with open(json_file, 'r') as f:
    data = json.load(f)

# Convert JSON to WebDataset
with wds.ShardWriter(os.path.join(output_path, 'pretrain-%d.tar'), maxcount=10000) as shard_writer:
    for entry in tqdm(data):
        with open(os.path.join(llava_pretrain_dir, entry['image']), "rb") as img_file:
            image_data = img_file.read()
        sample = {
            "__key__": entry['id'],
            "jpg": image_data,
            "json": json.dumps(entry['conversations']).encode("utf-8"),
        }
        shard_writer.write(sample)

print("Dataset successfully converted to WebDataset format.")
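To make the target layout concrete, the following standard-library-only sketch builds a one-sample shard in memory with tarfile: in WebDataset format, all files sharing a basename (the sample's `__key__`) belong to one sample, and the file extension selects the field (`jpg`, `json`, ...). The key and conversation text come from the example above; the image bytes are a placeholder.

```python
import io
import json
import tarfile

# Hypothetical sample data mirroring the LLaVA-Pretrain example above.
sample_key = "GCC_train_002582585"          # basename shared by the sample's files
fake_image = b"\xff\xd8\xff\xe0 placeholder JPEG bytes"
conversations = [
    {"from": "human", "value": "Provide a brief description of the given image.\n<image>"},
    {"from": "gpt", "value": "Olive oil is a healthy ingredient used liberally."},
]

# A WebDataset shard is a plain tar archive; write one sample as two members
# that share the key and differ only in extension.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for ext, payload in (("jpg", fake_image),
                         ("json", json.dumps(conversations).encode("utf-8"))):
        info = tarfile.TarInfo(name=f"{sample_key}.{ext}")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read the shard back: a WebDataset reader regroups members by basename.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
print(names)  # ['GCC_train_002582585.jpg', 'GCC_train_002582585.json']
```

This is only an illustration of the on-disk layout; in practice, use `wds.ShardWriter` as in the script above, which handles sharding and member ordering for you.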

Step 3: Generate Metadata for Megatron-Energon#

Once the dataset is converted to WebDataset format, metadata must be generated for the Megatron-Energon data loader. Navigate to the output directory containing the WebDataset files and run the following command:

energon prepare .

or

energon prepare <path_to_LLaVA-Pretrain>/wds

This command creates a .nv-meta folder containing essential metadata for the Megatron-Energon data loader.

For more details, refer to the data preparation section in the Megatron-Energon documentation.

Once the dataset is converted and metadata is generated, your folder will look like this:

.
├── .nv-meta
│   ├── dataset.yaml
│   ├── split.yaml
├── pretrain-0.tar
├── pretrain-16.tar
├── pretrain-16.tar.idx
├── pretrain-23.tar
├── pretrain-23.tar.idx
├── pretrain-31.tar
├── pretrain-31.tar.idx
├── pretrain-39.tar
├── pretrain-39.tar.idx
├── pretrain-47.tar
├── pretrain-47.tar.idx
├── pretrain-55.tar
└── pretrain-55.tar.idx

.nv-meta Directory Contents#

The .nv-meta folder contains metadata files essential for the Megatron-Energon dataloader:

.nv-meta/
├── dataset.yaml
└── split.yaml
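The split.yaml file records which shards belong to the train, validation, and test splits; the interactive `energon prepare` wizard generates it from the split ratio you choose. Below is an illustrative example (the shard assignment will differ for your run):

```
exclude: []
split_parts:
  train:
    - pretrain-0.tar
    - pretrain-16.tar
  val:
    - pretrain-23.tar
  test: []
```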

Example Content of dataset.yaml#

The dataset.yaml file provides information about the dataset structure and field mappings. Below is an example:

__class__: VQAWebdataset
__module__: megatron.energon
field_map:
  answers: json[1][value]
  context: json[0][value]
  image: jpg
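Each field_map value is an index/key path into the sample's files: for example, `json[1][value]` selects element 1 of the decoded `.json` file, then its `value` field. The sketch below resolves such paths against the example conversation record from Step 1; the `resolve` helper is an illustrative stand-in for demonstration, not Energon's actual implementation.

```python
import json

# The conversations list as stored in each sample's .json file (see Step 2).
raw_json = json.dumps([
    {"from": "human", "value": "Provide a brief description of the given image.\n<image>"},
    {"from": "gpt", "value": "Olive oil is a healthy ingredient used liberally."},
])

sample = {"json": json.loads(raw_json), "jpg": b"<image-bytes>"}

def resolve(sample, path):
    """Hypothetical helper: follow a field_map path like 'json[1][value]'
    down to sample['json'][1]['value']."""
    field, *parts = path.replace("]", "").split("[")
    value = sample[field]
    for part in parts:
        value = value[int(part)] if part.isdigit() else value[part]
    return value

context = resolve(sample, "json[0][value]")  # the human prompt
answers = resolve(sample, "json[1][value]")  # the model response
print(answers)  # Olive oil is a healthy ingredient used liberally.
```

With this mapping, Energon can populate the `context`, `answers`, and `image` fields of its VQA sample type directly from the shard files, without any custom loader code.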

Additional References#

For more information related to dataset formats and samples, refer to: Megatron-Energon Dataset Formats

For details on creating custom samples and custom dataloaders, refer to: Advanced Data Format Documentation.