Data Preparation for SFT and PEFT

This section provides detailed steps to prepare a packed Supervised Fine-Tuning (SFT) dataset for Starcoder2 models, using the Alpaca Python Code Instructions dataset as an example. Although we focus on Alpaca, the methodology should be applicable to any code dataset. Packed sequence data enables a significant boost to SFT/PEFT performance in NeMo.

For more details about the data, please refer to CodeAlpaca-20k and CodeAlpaca-python-instructions-18k on Hugging Face.

Download the Alpaca code instructions dataset from Hugging Face:


git clone git@hf.co:datasets/iamtarun/python_code_instructions_18k_alpaca

Once downloaded, check the size of the python_code_instructions_18k_alpaca directory; it should be about 11M:


$ du -sh python_code_instructions_18k_alpaca
11M     python_code_instructions_18k_alpaca

If the size does not match, delete the old files, then copy the download link from Hugging Face and fetch the Parquet file directly with wget:


wget https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca/resolve/main/data/train-00000-of-00001-8b6e212f3e1ece96.parquet

Next, we need to pre-process the data to ensure it is in the correct format. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.
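For example, a single line of the target JSONL file might look like this (illustrative values only):

{"input": "Write a Python function that reverses a string.", "output": "def reverse_string(s):\n    return s[::-1]"}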

If the container is not already running, start it with the following command:


docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.01.starcoder2 bash

Then run the following data preprocessing script:


import glob
import json

import pandas as pd

print('Preprocessing data to jsonl format...')

output_file = 'alpaca_python_18k-output.jsonl'
# There is a single Parquet file under data/, so joining the glob result
# yields its path.
parquet_file_path = glob.glob('./data/*.parquet')
parquet_file_list = ''.join(parquet_file_path)

df = pd.read_parquet(parquet_file_list)
instruct2code_list = df.to_dict('records')
with open(output_file, 'wt') as f:
    for o in instruct2code_list:
        # Keep everything before the '### Output' marker as the prompt.
        prompt = o['prompt'][:o['prompt'].find('### Output')]
        completion = o['output']
        f.write(json.dumps({"input": prompt, "output": completion}) + "\n")
print(f'Data was successfully preprocessed and saved to {output_file}')

Example output


Preprocessing data to jsonl format...
Data was successfully preprocessed and saved to alpaca_python_18k-output.jsonl

Check that the output jsonl file exists:


$ ls python_code_instructions_18k_alpaca/
.git/  .gitattributes  data/  alpaca_python_18k-output.jsonl

Check the first example in the output jsonl file


$ head -n 1 python_code_instructions_18k_alpaca/alpaca_python_18k-output.jsonl
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a function to calculate the sum of a sequence of integers.\n\n### Input:\n[1, 2, 3, 4, 5]\n\n", "output": "# Python code\ndef sum_sequence(sequence):\n sum = 0\n for num in sequence:\n sum += num\n return sum"}
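If you want to verify the whole file rather than just the first line, a minimal sketch like the following confirms that every record parses as JSON and carries the expected keys (the path is the one produced above):

import json

path = 'python_code_instructions_18k_alpaca/alpaca_python_18k-output.jsonl'
with open(path) as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)  # raises an error on malformed JSON
        assert 'input' in record and 'output' in record, f'missing keys on line {i}'
print(f'All {i} records parsed with the expected keys')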

Generate the train, validation, and test splits. You may use your own script to do this, or use the following sample split_train_val.py by copying it into the python_code_instructions_18k_alpaca directory.


import json
import random

input_file = "alpaca_python_18k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
    random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")

Then go to the python_code_instructions_18k_alpaca directory and generate the splits:


python3 split_train_val.py

Check for the train, validation, and test jsonl files:


$ ls python_code_instructions_18k_alpaca/
.git/  .gitattributes  data/  alpaca_python_18k-output.jsonl  training.jsonl  validation.jsonl  test.jsonl
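As an optional sanity check, you can confirm that the split sizes roughly match the 80/15/5 proportions. A minimal sketch, run from inside the python_code_instructions_18k_alpaca directory:

counts = {}
for split in ('training', 'validation', 'test'):
    with open(f'{split}.jsonl') as f:
        counts[split] = sum(1 for _ in f)  # one JSON object per line

total = sum(counts.values())
for split, n in counts.items():
    print(f'{split}: {n} examples ({n / total:.1%})')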

Tokenize Dataset: Tokenize the entire dataset to obtain tokenized sequences, represented by indices.


# Any Starcoder2 .nemo model works for model.restore_from_path.
python /workspace/sequence_packing/tokenize_dataset.py \
    model.data.train_ds.file_names=[/path/to/training.jsonl] \
    model.data.train_ds.max_seq_length=4096 \
    model.restore_from_path=/path/to/starcoder2.nemo \
    +output_path=/path/to/my_dataset.npy

Note
  • model.data.train_ds.max_seq_length is used to truncate long sequences.

  • A full .nemo model file is required for simplicity and readability.
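To get a feel for the tokenized output, you can inspect the resulting file. The sketch below assumes tokenize_dataset.py saves a NumPy object array in which each element is one sequence of integer token IDs; verify this against your version of the script:

import numpy as np

# Load the tokenized dataset (object array of variable-length sequences)
sequences = np.load('/path/to/my_dataset.npy', allow_pickle=True)
print(f'{len(sequences)} tokenized sequences')

lengths = [len(seq) for seq in sequences]
print(f'shortest/longest sequence: {min(lengths)}/{max(lengths)} tokens')
print('first 16 token IDs of the first sequence:', list(sequences[0][:16]))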

Group by Length: Group the tokenized sequences by their sequence length.


python /workspace/sequence_packing/create_hist.py \
    --input_file=my_dataset.npy \
    [--output_dir=./per_seq_data] \
    [--output_histogram=histogram.npy] \
    [--truncate_seq_len=2048]
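Conceptually, this step buckets the tokenized sequences by length and counts how many fall into each bucket. A simplified Python equivalent of the idea (assuming the same object-array format as above; not the actual create_hist.py implementation):

from collections import Counter

import numpy as np

sequences = np.load('my_dataset.npy', allow_pickle=True)
truncate_seq_len = 2048

# Count how many sequences fall into each (truncated) length bucket.
histogram = Counter(min(len(seq), truncate_seq_len) for seq in sequences)
for length in sorted(histogram):
    print(f'length {length}: {histogram[length]} sequences')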

Run Packing Algorithm: Run the packing algorithm to find an efficient packing strategy.


python run_packing.py \
    --output_dir <OUTPUT_DIR> \
    --pack_size <PACK_SIZE> \
    [--input_dir=./per_seq_data] \
    [--histogram=histogram.npy] \
    [--packing_algorithm=first_fit_shuffle] \
    [--seed=0]
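The packing algorithm is a bin-packing heuristic: each sequence goes into the first pack that still has room, and a new pack is opened when none fits. The following is a simplified sketch of the first_fit_shuffle idea, not the exact NeMo implementation:

import random

def first_fit_shuffle(seq_lengths, pack_size, seed=0):
    """Pack sequence lengths into packs of capacity pack_size (first-fit)."""
    random.seed(seed)
    lengths = list(seq_lengths)
    random.shuffle(lengths)       # shuffle to avoid pathological orderings
    packs = []                    # each pack is a list of sequence lengths
    for length in lengths:
        for pack in packs:
            if sum(pack) + length <= pack_size:
                pack.append(length)   # first pack with room wins
                break
        else:
            packs.append([length])    # nothing fits: open a new pack
    return packs

packs = first_fit_shuffle([512, 1024, 2048, 256, 768], pack_size=2048)
print(f'{len(packs)} packs:', packs)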
