Data Preparation for SFT and PEFT

This section provides detailed steps to prepare a packed Supervised Fine-Tuning (SFT) dataset for Starcoder2 models, using the Alpaca Python Code Instructions dataset as an example. Although we focus on “Alpaca”, the methodology should apply to any code dataset. Packed sequence data significantly improves SFT / PEFT performance in NeMo.

For more details about the data, refer to the CodeAlpaca-20k and CodeAlpaca-python-instructions-18k dataset pages on Hugging Face.

Download the Alpaca code instructions dataset from Hugging Face:

git clone git@hf.co:datasets/iamtarun/python_code_instructions_18k_alpaca

Once downloaded, check the size of the downloaded python_code_instructions_18k_alpaca directory:

$ du -sh python_code_instructions_18k_alpaca
11M python_code_instructions_18k_alpaca

If the size does not match, delete the old download, copy the download link manually, and fetch the file directly with wget:

wget https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca/resolve/main/data/train-00000-of-00001-8b6e212f3e1ece96.parquet

  1. Next, pre-process the data to ensure it is in the correct format.

  2. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs, one JSON object per line.
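
A quick way to confirm that a file matches this format is to parse every line and check its keys. The sketch below is illustrative only: the helper name `find_bad_lines` is hypothetical and not part of NeMo, and it assumes that each record must be a JSON object with string-valued `input` and `output` fields.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, reason) pairs for records that break the expected format.

    Hypothetical helper for illustration; not part of NeMo.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in ("input", "output"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems


sample = [
    '{"input": "xxx", "output": "yyy"}',
    '{"input": "xxx"}',
    'not json',
]
print(find_bad_lines(sample))
```

To check a real file, pass an open file handle instead of the sample list; an empty result means every line has the expected shape.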

If the container is not already running, start it with the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.01.starcoder2 bash

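Then run the following data preprocessing script from inside the python_code_instructions_18k_alpaca directory (the script looks for ./data/*.parquet):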

import glob
import json

import pandas as pd

print('Preprocessing data to jsonl format...')
output_file = 'alpaca_python_18k-output.jsonl'

# The dataset ships a single parquet shard under ./data
parquet_file_path = glob.glob('./data/*.parquet')
parquet_file_list = ''.join(parquet_file_path)
df = pd.read_parquet(parquet_file_list)
instruct2code_list = df.to_dict('records')

with open(output_file, 'wt') as f:
    for o in instruct2code_list:
        # Keep everything before the '### Output' marker as the model input;
        # partition() leaves the prompt intact if the marker is absent
        prompt = o['prompt'].partition('### Output')[0]
        completion = o['output']
        f.write(json.dumps({"input": prompt, "output": completion}) + "\n")

print(f'Data was successfully preprocessed and saved by {output_file}')

Example output

Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by python_code_instructions_18k_alpaca/alpaca_python_18k-output.jsonl

Check that the output JSONL file exists:

$ ls python_code_instructions_18k_alpaca/
.git/ .gitattributes data/ alpaca_python_18k-output.jsonl

Check the first example in the output jsonl file

$ head -n 1 python_code_instructions_18k_alpaca/alpaca_python_18k-output.jsonl
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a function to calculate the sum of a sequence of integers.\n\n### Input:\n[1, 2, 3, 4, 5]\n\n", "output": "# Python code\ndef sum_sequence(sequence):\n sum = 0\n for num in sequence:\n sum += num\n return sum"}

Generate the train, validation, and test splits. You may use your own script, or copy the following sample split_train_val.py into the python_code_instructions_18k_alpaca directory:

import random

input_file = "alpaca_python_18k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05  # informational only: the test split takes whatever remains

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")

Then go to the python_code_instructions_18k_alpaca directory and generate the splits:

python3 split_train_val.py

Check for the train, test and validation jsonl files

$ ls python_code_instructions_18k_alpaca/
.git/ .gitattributes data/ alpaca_python_18k-output.jsonl training.jsonl validation.jsonl test.jsonl
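
As an optional sanity check, the split arithmetic used by the sample script can be reproduced on a small in-memory list. This sketch is not part of the original workflow: `split_lines` is a hypothetical helper, and the fixed seed is an added assumption that makes the shuffle repeatable.

```python
import random


def split_lines(lines, train_proportion=0.80, validation_proportion=0.15, seed=0):
    """Shuffle and split lines with the same index arithmetic as the sample script.

    Hypothetical helper; the seed argument is an addition for reproducibility.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    train_index = int(len(shuffled) * train_proportion)
    val_index = int(len(shuffled) * validation_proportion)
    train = shuffled[:train_index]
    validation = shuffled[train_index:train_index + val_index]
    test = shuffled[train_index + val_index:]  # the remainder goes to test
    return train, validation, test


records = [f"record-{i}" for i in range(100)]
train, validation, test = split_lines(records)
print(len(train), len(validation), len(test))  # 80 15 5
```

Because the test split is simply the remainder, the three splits never overlap and together cover every input record.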

Tokenize Dataset: Tokenize the entire dataset to obtain tokenized sequences, represented by indices.

# Any Starcoder2 .nemo model works for model.restore_from_path
python /workspace/sequence_packing/tokenize_dataset.py \
  model.data.train_ds.file_names=[/path/to/training.jsonl] \
  model.data.train_ds.max_seq_length=4096 \
  model.restore_from_path=/path/to/starcoder2.nemo \
  +output_path=/path/to/my_dataset.npy

Note
  • model.data.train_ds.max_seq_length is used to truncate long sequences.

  • A full .nemo model file is required; this keeps the script simple and readable.
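
To make "sequences represented by indices" concrete, here is a toy sketch. It assumes a whitespace tokenizer and a vocabulary built on the fly, neither of which reflects the Starcoder2 tokenizer that tokenize_dataset.py loads from the .nemo file; all names are hypothetical. It only illustrates how text becomes a list of integer ids and how a maximum sequence length truncates that list.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated token."""
    vocab = {}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def tokenize(text, vocab, max_seq_length):
    """Map tokens to ids and keep at most max_seq_length of them."""
    ids = [vocab[token] for token in text.split()]
    return ids[:max_seq_length]


texts = ["def add(a, b): return a + b", "print(add(1, 2))"]
vocab = build_vocab(texts)
print(tokenize(texts[0], vocab, max_seq_length=4))  # [0, 1, 2, 3]
```

The first text has seven whitespace-separated tokens, so a limit of 4 drops the last three ids.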

Group by Length: Group the tokenized sequences by their sequence length.

python /workspace/sequence_packing/create_hist.py \
  --input_file=my_dataset.npy \
  [--output_dir=./per_seq_data] \
  [--output_histogram=histogram.npy] \
  [--truncate_seq_len=2048]
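
Conceptually, this step buckets sequences by length and records how many fall into each bucket. The following is a minimal sketch of that idea in plain Python, not the create_hist.py implementation; the function name and the choice to cut over-long sequences before bucketing are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_length(sequences, truncate_seq_len):
    """Bucket token-id sequences by length and count each bucket.

    Hypothetical helper; sequences longer than truncate_seq_len are cut to that length.
    """
    buckets = defaultdict(list)
    for seq in sequences:
        seq = seq[:truncate_seq_len]
        buckets[len(seq)].append(seq)
    # histogram[n] is the number of sequences of length n
    histogram = [len(buckets.get(n, [])) for n in range(truncate_seq_len + 1)]
    return dict(buckets), histogram


sequences = [[5, 1], [7, 2, 9], [4, 4], [3, 8, 6, 1, 2]]
buckets, histogram = group_by_length(sequences, truncate_seq_len=4)
print(histogram)  # [0, 0, 2, 1, 1]
```

The histogram is what a packing algorithm needs: it can reason about how many sequences of each length exist without touching the token ids themselves.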

Run Packing Algorithm: Run the packing algorithm to find an efficient packing strategy.

python run_packing.py \
  --output_dir <OUTPUT_DIR> \
  --pack_size <PACK_SIZE> \
  [--input_dir=./per_seq_data] \
  [--histogram=histogram.npy] \
  [--packing_algorithm=first_fit_shuffle] \
  [--seed=0]
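
The option name first_fit_shuffle suggests a first-fit bin-packing heuristic applied to shuffled sequence lengths. The sketch below shows generic first-fit packing under that assumption; it is not taken from run_packing.py, and the function name and sample lengths are hypothetical. Each sequence goes into the first pack with enough remaining room, and a new pack is opened when none fits.

```python
import random


def first_fit_shuffle(seq_lengths, pack_size, seed=0):
    """Pack sequence lengths into bins of capacity pack_size using first-fit.

    Generic heuristic for illustration; lengths are shuffled first with a fixed seed.
    """
    lengths = list(seq_lengths)
    random.Random(seed).shuffle(lengths)
    packs = []  # each pack is a list of sequence lengths
    for length in lengths:
        if length > pack_size:
            raise ValueError(f"sequence of length {length} exceeds pack size {pack_size}")
        for pack in packs:
            if sum(pack) + length <= pack_size:
                pack.append(length)
                break
        else:
            # No existing pack has room, so open a new one
            packs.append([length])
    return packs


lengths = [1200, 300, 2048, 900, 700, 1500, 64]
packs = first_fit_shuffle(lengths, pack_size=2048)
for pack in packs:
    print(pack, "-> used", sum(pack), "of 2048")
```

Without packing, these seven sequences would occupy seven padded rows of 2048 tokens each; packing places them in fewer rows, which is where the reduction in padding comes from.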

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.