Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Note
Attention: CodeGemma Users
The following guide prepares a natural language dataset as an example. Code datasets can be prepared in a similar fashion. See Starcoder2 Data Preparation for an example.
Data Preparation for SFT and PEFT
This section provides detailed steps to prepare a packed Supervised Fine-Tuning (SFT) dataset for Gemma models, using the “dolly” dataset as an example. Although we focus on “dolly”, the methodology should be applicable to any dataset. Packing the dataset provides a significant boost to SFT/PEFT performance in NeMo.
Databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. For more details about the data, refer to databricks-dolly-15k | Hugging Face.
Step 1: Download the Dataset
Download the dolly-15k dataset from Hugging Face.
git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k
Once downloaded, check the size of the file (databricks-dolly-15k.jsonl).
$ du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M databricks-dolly-15k/databricks-dolly-15k.jsonl
If the sizes do not match, delete the old file, copy the download link address from Hugging Face, and download the file directly with wget:
wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
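Optionally, you can sanity-check the download before preprocessing. The following is a minimal sketch (not part of the official workflow) that counts the records and prints the raw field names; the raw dataset uses instruction, context, response, and category fields, which the preprocessing step in the next section maps to input/output pairs.

import json

# Parse the raw JSONL file and report how many records it contains.
path = "databricks-dolly-15k/databricks-dolly-15k.jsonl"
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records")      # roughly 15,000 records expected
print(sorted(records[0].keys()))      # ['category', 'context', 'instruction', 'response']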
Step 2: Data Preprocessing
Next, you need to pre-process the data to ensure it is in the correct format. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.
To run the pre-processing, use the script that has already been prepared for you and pass your jsonl file as --input. The script must be run inside the container.
If the container is not already running, use the following command:
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash
Next, run the following data preprocess script:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl
The following shows the example output:
Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl .
Check that the output jsonl file exists:
$ ls databricks-dolly-15k/
.git/
.gitattributes
README.md
databricks-dolly-15k-output.jsonl
databricks-dolly-15k.jsonl
Check the first example in the output jsonl file:
$ head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
Step 3: Split the Data Into Train, Validation, and Test
Generate the train, validation, and test splits. You can use your own script to do this, or copy the following sample split_train_val.py into the databricks-dolly-15k directory.
import json
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training and validation
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
    random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training and validation sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")
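Note that the script does not seed the random number generator, so each run produces a different shuffle; if you need reproducible splits, add random.seed(<some integer>) before the call to random.shuffle(lines). Also note that the test split is simply the remainder after the training and validation splits are taken, so test_proportion is not used directly.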
Then go to the databricks-dolly-15k directory and generate the splits:
python3 split_train_val.py
Check for the train, test, and validation jsonl files:
$ ls
README.md
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
split_train_val.py
training.jsonl
validation.jsonl
test.jsonl
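If you want to confirm the split sizes, a quick check such as the following (run from inside the databricks-dolly-15k directory) works. This is a convenience sketch, not part of the official workflow.

# Count the lines in each split and report its share of the original file.
def count_lines(path):
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

total = count_lines("databricks-dolly-15k-output.jsonl")
for name in ("training.jsonl", "validation.jsonl", "test.jsonl"):
    n = count_lines(name)
    print(f"{name}: {n} lines ({n / total:.1%} of {total})")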
Step 4: Prepare Packed Dataset
Please refer to Sequence Packing for SFT/PEFT for instructions on preparing the packed sequence dataset. An example is provided below.
python scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
    model.data.train_ds.file_names=[/path/to/training.jsonl] \
    model.data.train_ds.max_seq_length=2048 \
    +tokenizer_path=/path/to/tokenizer.model \
    +output_dir=/path/to/output_folder \
    +pack_sizes=[2048,4096,8192]
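After the script finishes, you can inspect the output folder. The sketch below assumes the packed splits are written as .npy files with one entry per packed sequence; the exact file names and layout are documented in Sequence Packing for SFT/PEFT, so treat this as an illustrative check rather than a definitive reference.

import glob
import numpy as np

# List whatever .npy files the packing script produced and report their sizes.
for path in sorted(glob.glob("/path/to/output_folder/*.npy")):
    data = np.load(path, allow_pickle=True)  # packed records are assumed to be Python objects
    print(f"{path}: {len(data)} packed sequences")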