Note
Attention: Dedicated Container for Gemma
For Gemma models, please use the nvcr.io/nvidia/nemo:24.05 container. Also check our Gemma playbooks.
Note
Attention: Dedicated Container for CodeGemma
For CodeGemma models, please use the nvcr.io/nvidia/nemo:24.05 container. The following guide prepares a natural language dataset as an example. Code datasets can be prepared in a similar fashion. See Starcoder2 Data Preparation for an example.
Data Preparation for SFT and PEFT
This section provides detailed steps to prepare a packed Supervised Fine-Tuning (SFT) dataset for Gemma models, using the “dolly” dataset as an example. Although we focus on “dolly”, the methodology should be applicable to any dataset. Sequence packing provides a significant boost to SFT / PEFT performance in NeMo.
Databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. For more details about the data, refer to databricks-dolly-15k | Hugging Face.
Step 1: Download dataset
Download the dolly-15k dataset from Hugging Face
git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k
Once downloaded, check the size of the file (databricks-dolly-15k.jsonl)
$ du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M databricks-dolly-15k/databricks-dolly-15k.jsonl
If the size does not match, delete the old file and download the file directly with wget
wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
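One possible cause of a size mismatch, assumed here rather than stated in this guide, is that the clone produced a small Git LFS pointer file instead of the data itself. The sketch below inspects the first bytes of a file for the standard LFS pointer header. The helper names `looks_like_lfs_pointer` and `check_download` are hypothetical, and the sample inputs are made up for illustration.

```python
# Header that begins every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the given leading bytes match a Git LFS pointer header."""
    return head.startswith(LFS_POINTER_PREFIX)


def check_download(path: str) -> str:
    """Classify a file as 'pointer' or 'data' based on its first bytes."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return "pointer" if looks_like_lfs_pointer(head) else "data"


# Made-up inputs: a pointer header and the start of a JSONL record.
pointer_head = b"version https://git-lfs.github.com/spec/v1\noid sha256:0000\n"
jsonl_head = b'{"instruction": "example"}'
print(looks_like_lfs_pointer(pointer_head))
print(looks_like_lfs_pointer(jsonl_head))
```

To check the real download, call `check_download("databricks-dolly-15k/databricks-dolly-15k.jsonl")`. If it reports a pointer, the wget command above fetches the actual file.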
Step 2: Data Preprocessing
Next, pre-process the data to ensure it is in the correct format.
The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.
A pre-processing script has already been prepared for you. Run it inside the container, passing your JSONL file as --input.
If the container is not already running, use the following command to launch it
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.05 bash
Then run the following data pre-processing script
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl
Example output
Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl .
Check that the output JSONL file exists
$ ls databricks-dolly-15k/
.git/
.gitattributes
README.md
databricks-dolly-15k-output.jsonl
databricks-dolly-15k.jsonl
Check the first example in the output jsonl file
$ head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
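Before moving on, it can help to confirm that every record follows the expected input/output format. The following is a minimal sketch, not part of the NeMo tooling: the function name `validate_sft_lines` and the sample records are assumptions made for illustration.

```python
import json


def validate_sft_lines(lines):
    """Return (line_number, problem) tuples for records that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        # Both fields must be present and must be strings.
        for key in ("input", "output"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems


# Made-up sample: the first record is valid, the second lacks "output".
sample = [
    '{"input": "When did Virgin Australia start operating?", "output": "31 August 2000", "category": "closed_qa"}',
    '{"input": "no output field"}',
]
print(validate_sft_lines(sample))
```

To validate the real file, pass an open file handle, for example `validate_sft_lines(open("databricks-dolly-15k/databricks-dolly-15k-output.jsonl", encoding="utf-8"))`. An empty list means no problems were found.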
Step 3: Split the data into train, validation and test sets
Generate the train, validation and test splits. You may use your own script, or copy the following sample into the databricks-dolly-15k directory as split_train_val.py
import json
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05  # the test split receives whatever remains

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate the number of lines in the training and validation sets
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index+val_index]
test_data = lines[train_index+val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")
Then go to the databricks-dolly-15k directory and generate the splits:
python3 split_train_val.py
Check for the train, test and validation jsonl files
$ ls
README.md
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
split_train_val.py
training.jsonl
validation.jsonl
test.jsonl
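The sample script shuffles without a fixed seed, so each run produces different splits. If reproducible splits matter, the same slicing logic can be wrapped in a function that takes a seed. This is a sketch under that assumption: the function name `split_records`, the default seed and the synthetic records are all hypothetical.

```python
import random


def split_records(lines, train_proportion=0.80, validation_proportion=0.15, seed=1234):
    """Shuffle with a fixed seed, then cut into train, validation and test lists."""
    shuffled = list(lines)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    train_count = int(len(shuffled) * train_proportion)
    val_count = int(len(shuffled) * validation_proportion)
    train = shuffled[:train_count]
    validation = shuffled[train_count:train_count + val_count]
    test = shuffled[train_count + val_count:]  # everything that remains
    return train, validation, test


# Synthetic records standing in for lines of the pre-processed JSONL file.
records = [f'{{"input": "q{i}", "output": "a{i}"}}' for i in range(200)]
train, validation, test = split_records(records)
print(len(train), len(validation), len(test))
```

Because the test split takes the remainder, the three lists always cover every record exactly once, even when the proportions do not divide the record count evenly.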
Step 4: Prepare Packed Dataset
Tokenize Dataset: Tokenize the entire dataset to obtain tokenized sequences, represented by indices.
# Any Gemma .nemo model works for model.restore_from_path
python /workspace/sequence_packing/tokenize_dataset.py \
model.data.train_ds.file_names=[/path/to/training.jsonl] \
model.data.train_ds.max_seq_length=4096 \
model.restore_from_path=/path/to/gemma-7b.nemo \
+output_path=/path/to/my_dataset.npy
Note
model.data.train_ds.max_seq_length is used to truncate long sequences. The script requires a full .nemo model file, which keeps it simple and readable.
Group by Length: Group the tokenized sequences by their sequence length.
python /workspace/sequence_packing/create_hist.py \
--input_file=my_dataset.npy \
[--output_dir=./per_seq_data] \
[--output_histogram=histogram.npy] \
[--truncate_seq_len=2048]
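The grouping step can be pictured as bucketing sequences by length and counting each bucket. The sketch below illustrates the idea only and is not the implementation of create_hist.py. It assumes that sequences longer than the truncation length are clipped to it. The function name `group_by_length` and the toy token IDs are made up.

```python
from collections import defaultdict


def group_by_length(sequences, truncate_seq_len):
    """Bucket token sequences by length and build a length histogram."""
    buckets = defaultdict(list)
    for seq in sequences:
        clipped = seq[:truncate_seq_len]  # assumption: long sequences are clipped
        buckets[len(clipped)].append(clipped)
    # histogram[n] is the number of sequences with exactly n tokens.
    histogram = [len(buckets.get(n, [])) for n in range(truncate_seq_len + 1)]
    return dict(buckets), histogram


# Toy tokenized dataset; the last sequence is longer than the limit of 4.
sequences = [[1, 2, 3], [4, 5], [6, 7, 8], [9], [1, 2, 3, 4, 5, 6]]
buckets, histogram = group_by_length(sequences, truncate_seq_len=4)
print(histogram)
```

The histogram is the only input a packing strategy needs: it says how many sequences of each length are available, without regard to their content.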
Run Packing Algorithm: Run the packing algorithm to find an efficient packing strategy.
python run_packing.py \
--output_dir <OUTPUT_DIR> \
--pack_size <PACK_SIZE> \
[--input_dir=./per_seq_data] \
[--histogram=histogram.npy] \
[--packing_algorithm=first_fit_shuffle] \
[--seed=0]
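To give a sense of what a first-fit strategy with shuffling does, the sketch below packs sequence lengths into bins of at most pack_size tokens. It is an illustration of the general first-fit heuristic, not the code in run_packing.py. The function names and the toy lengths are assumptions.

```python
import random


def first_fit(lengths, pack_size):
    """Place each length into the first bin with enough room, else open a new bin."""
    bins = []  # each bin is a list of sequence lengths
    for length in lengths:
        if length > pack_size:
            raise ValueError(f"sequence of length {length} exceeds pack size {pack_size}")
        for current in bins:
            if sum(current) + length <= pack_size:
                current.append(length)
                break
        else:
            bins.append([length])  # no existing bin had room
    return bins


def first_fit_shuffle(lengths, pack_size, seed=0):
    """Shuffle the lengths with a fixed seed, then apply first-fit."""
    shuffled = list(lengths)
    random.Random(seed).shuffle(shuffled)
    return first_fit(shuffled, pack_size)


lengths = [5, 3, 4, 2, 6]
print(first_fit(lengths, pack_size=8))
print(first_fit_shuffle(lengths, pack_size=8, seed=0))
```

Every bin holds at most pack_size tokens, so each packed sample fits in one training sequence. Fewer bins means less padding per batch.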
Another Example for SQuAD
First, run the Python script below to convert SQuAD to JSONL format:
from datasets import load_dataset
import json
import os

main_dataset = load_dataset("squad")
subset = "squad"
# Create the output directory if it does not exist
os.makedirs(subset, exist_ok=True)

# Map each source split to the split names it is saved under
splits = ["train", "validation"]
save_splits = {}
for split in splits:
    dataset = main_dataset.get(split, None)
    if dataset is None:
        continue
    if split == "train":
        # Hold out 5% of the original train split for validation
        split_dataset = dataset.train_test_split(test_size=0.05, seed=1234)
        save_splits['train'] = split_dataset['train']
        save_splits['validation'] = split_dataset['test']
    elif split == "validation":
        # The original validation split is saved as the test set
        save_splits['test'] = dataset
    else:
        save_splits['test_no_gt'] = dataset

# Loop through the splits and save them as JSONL files
for split_name, dataset in save_splits.items():
    output_file = f"squad/{split_name}.jsonl"
    with open(output_file, "w", encoding="utf-8") as f:
        for example in dataset:
            # Write each example as a JSON line in the output file
            example["input"] = "Context: " + example["context"] + " Question: " + example['question'] + " Answer:"
            example["output"] = example["answers"]["text"][0]
            example["original_answers"] = example["answers"]["text"]
            f.write(json.dumps(example) + "\n")
    print(f"{split_name} split saved to {output_file}")
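The per-example conversion in the script can be checked in isolation, without downloading SQuAD. The sketch below mirrors the three field assignments from the script on a single record. The helper name `to_sft_record` and the sample example are made up for illustration.

```python
def to_sft_record(example):
    """Add the input/output fields used for fine-tuning to one SQuAD-style example."""
    record = dict(example)  # shallow copy so the caller's example is not modified
    record["input"] = (
        "Context: " + example["context"]
        + " Question: " + example["question"]
        + " Answer:"
    )
    record["output"] = example["answers"]["text"][0]  # first reference answer
    record["original_answers"] = example["answers"]["text"]  # all reference answers
    return record


# Made-up example with the same field layout as a SQuAD record.
example = {
    "context": "NeMo is a framework.",
    "question": "What is NeMo?",
    "answers": {"text": ["a framework", "framework"], "answer_start": [8, 10]},
}
record = to_sft_record(example)
print(record["input"])
print(record["output"])
```

Only the first reference answer becomes the training target. The full answer list is kept in original_answers so it remains available for evaluation.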
Then prepare the packed dataset as follows:
python tokenize_dataset.py \
model.data.train_ds.file_names=[/path/to/datasets/squad/train.jsonl] \
model.data.train_ds.max_seq_length=4096 \
model.restore_from_path=/path/to/gemma-7b.nemo \
+output_path=gemma_squad_packed/my_dataset.npy
python create_hist.py --input_file=gemma_squad_packed/my_dataset.npy
python run_packing.py --output_dir gemma_squad_packed --pack_size 4096