Data Preparation for SFT and PEFT

This section provides detailed steps to prepare a packed Supervised Fine-Tuning (SFT) dataset for Gemma models, using the “dolly” dataset as an example. Although we focus on “dolly”, the methodology is applicable to any dataset. Sequence packing provides a significant boost to SFT and PEFT performance in NeMo.

Databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. For more details about the data, refer to the databricks-dolly-15k dataset card on Hugging Face.

Download the databricks-dolly-15k dataset from Hugging Face:


git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k

Once downloaded, check the size of the file (databricks-dolly-15k.jsonl)


$ du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M     databricks-dolly-15k/databricks-dolly-15k.jsonl

If the size does not match (about 13 MB), delete the old file, copy the download link address from Hugging Face, and fetch the file directly with wget:


wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl

  1. Next we need to pre-process the data to ensure it’s in the correct format.

  2. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.

  3. To run the pre-processing, use the script that has already been prepared for you, passing your JSONL file as --input. The script must be run inside the container.

If the container is not already running, use the following command:


docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.01.gemma bash

Then run the following data preprocessing script:


python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl

Example output


Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl .

Check that the output JSONL file exists:


$ ls databricks-dolly-15k/
.git/  .gitattributes  README.md  databricks-dolly-15k-output.jsonl  databricks-dolly-15k.jsonl

Check the first example in the output JSONL file:


$ head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
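You can also verify the whole file programmatically. The short check below is not part of the NeMo tooling; it is an optional sanity check that every line parses as JSON and contains the expected input and output keys:

import json

# Optional sanity check: every line should be valid JSON with "input" and "output" keys.
with open("databricks-dolly-15k/databricks-dolly-15k-output.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        assert "input" in example and "output" in example, f"Bad record on line {i}"
print(f"Checked {i} records: all contain 'input' and 'output'.")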

Generate the training, validation, and test splits. You may use your own script to do this, or create a new script from the following sample split_train_val.py and copy it into the databricks-dolly-15k directory:


import json
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")

Then go to the databricks-dolly-15k directory and generate the splits:


python3 split_train_val.py

Check for the training, validation, and test JSONL files:


$ ls
README.md  databricks-dolly-15k.jsonl  databricks-dolly-15k-output.jsonl  split_train_val.py  training.jsonl  validation.jsonl  test.jsonl
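As an optional sanity check (a minimal sketch, assuming it is run from the databricks-dolly-15k directory), you can confirm that the splits roughly follow the 80/15/5 proportions:

# Count the records in each split and report the resulting proportions.
counts = {}
for name in ("training.jsonl", "validation.jsonl", "test.jsonl"):
    with open(name) as f:
        counts[name] = sum(1 for _ in f)

total = sum(counts.values())
for name, count in counts.items():
    print(f"{name}: {count} records ({count / total:.1%})")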

Tokenize Dataset: Tokenize the entire dataset to obtain tokenized sequences, represented by indices.


# model.restore_from_path: any Gemma .nemo model works here
python /workspace/sequence_packing/tokenize_dataset.py \
  model.data.train_ds.file_names=[/path/to/training.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  model.restore_from_path=/path/to/gemma-7b.nemo \
  +output_path=/path/to/my_dataset.npy

Note
  • model.data.train_ds.max_seq_length is used to truncate long sequences.

  • A full .nemo model file is required for simplicity and readability.

Group by Length: Group the tokenized sequences by their sequence length.


python /workspace/sequence_packing/create_hist.py \
  --input_file=my_dataset.npy \
  [--output_dir=./per_seq_data] \
  [--output_histogram=histogram.npy] \
  [--truncate_seq_len=2048]
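For intuition about what this step computes, the sketch below builds a sequence-length histogram by hand. It assumes my_dataset.npy stores a pickled array in which each element is one sequence of token IDs; the exact on-disk format depends on the tokenize_dataset.py version in your container, so treat this as illustrative rather than a replacement for create_hist.py:

from collections import Counter
import numpy as np

# Illustrative only: assumes a pickled object array of token-ID sequences.
sequences = np.load("my_dataset.npy", allow_pickle=True)

# Count how many sequences fall into each length bucket.
histogram = Counter(len(seq) for seq in sequences)
for length in sorted(histogram)[:10]:
    print(f"length {length}: {histogram[length]} sequences")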

Run Packing Algorithm: Run the packing algorithm to find an efficient packing strategy.


python run_packing.py \
  --output_dir <OUTPUT_DIR> \
  --pack_size <PACK_SIZE> \
  [--input_dir=./per_seq_data] \
  [--histogram=histogram.npy] \
  [--packing_algorithm=first_fit_shuffle] \
  [--seed=0]
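To see what the packing algorithm is optimizing, here is a minimal, illustrative first-fit sketch (not the NeMo implementation): each sequence is placed into the first bin that still has room for it, so every packed bin stays within pack_size tokens.

def first_fit_pack(seq_lengths, pack_size):
    # Greedy first-fit: place each sequence into the first bin with enough room.
    bins = []       # each bin is a list of sequence lengths
    remaining = []  # remaining token capacity of each bin
    for length in seq_lengths:
        for i, capacity in enumerate(remaining):
            if length <= capacity:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            # No existing bin fits this sequence, so open a new one.
            bins.append([length])
            remaining.append(pack_size - length)
    return bins

# Example: pack a handful of sequence lengths into bins of 4096 tokens.
print(first_fit_pack([4000, 100, 2048, 2000, 3000, 1000], pack_size=4096))

The first_fit_shuffle option referenced above is a variant of this idea that randomizes the order of sequences before applying first fit.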

Another Example for SQuAD

First, run the Python script below to process SQuAD into JSONL format:


from datasets import load_dataset
import json
import os

main_dataset = load_dataset("squad")
subset = "squad"

isExist = os.path.exists(subset)
if not isExist:
    os.makedirs(subset)

# Loop through the splits and save them as JSONL files
splits = ["train", "validation"]
save_splits = {}
for split in splits:
    dataset = main_dataset.get(split, None)
    if dataset is None:
        continue
    if split == "train":
        split_dataset = dataset.train_test_split(test_size=0.05, seed=1234)
        save_splits['train'] = split_dataset['train']
        save_splits['validation'] = split_dataset['test']
    elif split == "validation":
        save_splits['test'] = dataset
    else:
        save_splits['test_no_gt'] = dataset

for split_name, dataset in save_splits.items():
    output_file = f"squad/{split_name}.jsonl"
    with open(output_file, "w", encoding="utf-8") as f:
        for example in dataset:
            # Write each example as a JSON line in the output file
            example["input"] = "Context: " + example["context"] + " Question: " + example["question"] + " Answer:"
            example["output"] = example["answers"]["text"][0]
            example["original_answers"] = example["answers"]["text"]
            f.write(json.dumps(example) + "\n")
    print(f"{split_name} split saved to {output_file}")
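The script writes squad/train.jsonl, squad/validation.jsonl, and squad/test.jsonl. To confirm the prompt format it produced, you can peek at the first training record (a minimal sketch, assuming it is run from the same directory):

import json

# Inspect the first training record produced by the script above.
with open("squad/train.jsonl") as f:
    example = json.loads(f.readline())

print(example["input"])   # "Context: ... Question: ... Answer:"
print(example["output"])  # the first reference answer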

Then prepare the packed dataset as follows:


python tokenize_dataset.py \
  model.data.train_ds.file_names=[/path/to/datasets/squad/1_squad_train.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  model.restore_from_path=/path/to/gemma-7b.nemo \
  +output_path=gemma_squad_packed/my_dataset.npy

python create_hist.py --input_file=gemma_squad_packed/my_dataset.npy

python run_packing.py --output_dir gemma_squad_packed --pack_size 4096
