Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Note

Attention: CodeGemma Users

The following guide prepares a natural language dataset as an example. Code datasets can be prepared in a similar fashion. See Starcoder2 Data Preparation for an example.

Data Preparation for SFT and PEFT

This section provides detailed steps to prepare a packed Supervised Fine-Tuning (SFT) dataset for Gemma models, using the “dolly” dataset as an example. Although we focus on “dolly”, the methodology should apply to any dataset. Packing the dataset provides a significant boost to SFT / PEFT performance in NeMo.

Databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. For more details about the data, refer to databricks-dolly-15k | Hugging Face.

Step 1: Download the Dataset

Download the dolly-15k dataset from Hugging Face.

git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k

Once downloaded, check the size of the file (databricks-dolly-15k.jsonl).

$ du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M     databricks-dolly-15k/databricks-dolly-15k.jsonl

If the size does not match, the clone may have fetched only a Git LFS pointer instead of the actual data. Delete the old file, copy the download link address from the Hugging Face page, and fetch the file directly with wget:

wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
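
As an optional sanity check, the short sketch below (not part of the NeMo tooling; the path is the one used in this guide) parses the file line by line to confirm it is valid JSONL and reports the record count, which should be roughly 15,000:

import json

path = "databricks-dolly-15k/databricks-dolly-15k.jsonl"

# Count records and make sure every line parses as JSON.
num_records = 0
with open(path, "r", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            raise SystemExit(f"Line {line_number} is not valid JSON: {err}")
        num_records += 1

print(f"{path}: {num_records} records")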

Step 2: Data Preprocessing

  1. Next, you need to pre-process the data to ensure it is in the correct format. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.

  2. To run the pre-processing, use the script that has already been prepared for you. Run this script and pass your jsonl file as --input. To run the script, you need to launch the container.

If the container is not already running, use the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash

Next, run the data preprocessing script:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl

The following shows the example output:

Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl .

Check that the output jsonl file exists:

$ ls databricks-dolly-15k/
.git/
.gitattributes
README.md
databricks-dolly-15k-output.jsonl
databricks-dolly-15k.jsonl

Check the first example in the output jsonl file:

$ head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}

Step 3: Split the Data Into Train, Validation, and Test

Generate the train, validation, and test splits. You can use your own script to do this, or create a new script from the following sample split_train_val.py and copy it into the databricks-dolly-15k directory.

import json
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training and validation
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
    random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index+val_index]
test_data = lines[train_index+val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")

Then go to the databricks-dolly-15k directory and generate the splits:

python3 split_train_val.py

Check for the train, test, and validation jsonl files:

$ ls
README.md
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
split_train_val.py
training.jsonl
validation.jsonl
test.jsonl
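
Optionally, confirm that the splits roughly follow the 80/15/5 proportions with a short check such as the following (a standalone sketch, not part of the NeMo tooling), run from the databricks-dolly-15k directory:

# Count the records in each split and report their share of the total.
splits = ["training.jsonl", "validation.jsonl", "test.jsonl"]
counts = {}
for name in splits:
    with open(name, "r", encoding="utf-8") as f:
        counts[name] = sum(1 for line in f if line.strip())

total = sum(counts.values())
for name, count in counts.items():
    print(f"{name}: {count} records ({count / total:.1%})")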

Step 4: Prepare Packed Dataset

Please refer to Sequence Packing for SFT/PEFT for instructions on preparing the packed sequence dataset. An example is provided below.

python scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
       model.data.train_ds.file_names=[/path/to/training.jsonl] \
       model.data.train_ds.max_seq_length=2048 \
       +tokenizer_path=/path/to/tokenizer.model \
       +output_dir=/path/to/output_folder \
       +pack_sizes=[2048,4096,8192]
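
After the script finishes, the output folder should contain one packed dataset file per entry in pack_sizes. As a rough sanity check, you can list and load the produced files with a sketch like the one below (this assumes the packed data is written as .npy files in the output folder; adjust the path to match your +output_dir):

import glob
import numpy as np

# Assumption: the packed dataset is written as .npy files in the output folder.
output_dir = "/path/to/output_folder"
for path in sorted(glob.glob(f"{output_dir}/*.npy")):
    data = np.load(path, allow_pickle=True)
    print(f"{path}: {len(data)} packed sequences")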