Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Note
Dedicated Container for StarCoder2
To use StarCoder2 models, use the nvcr.io/nvidia/nemo:24.01.starcoder2 container. Additionally, take a look at our `StarCoder2 playbooks <https://github.com/NVIDIA/GenerativeAIExamples/tree/main/models/StarCoder2>`__.
Data Preparation for SFT and PEFT
This section explains how to prepare a packed Supervised Fine-Tuning (SFT) dataset for StarCoder2 models, using the Alpaca Python code instructions dataset as an example. Although we focus on Alpaca, the methodology should be applicable to any code dataset. Packed sequence data provides a significant boost to SFT and PEFT performance in NeMo.
For more details about the data, please refer to CodeAlpaca-20k | Hugging Face and CodeAlpaca-python-instructions-18k | Hugging Face.
Step 1: Download the Dataset
Download the Alpaca code instructions dataset from Hugging Face:
git clone git@hf.co:datasets/iamtarun/python_code_instructions_18k_alpaca
Once downloaded, check the size of the python_code_instructions_18k_alpaca directory:
$ du -sh python_code_instructions_18k_alpaca
11M python_code_instructions_18k_alpaca
If the size does not match, delete the downloaded files, copy the download link from the Hugging Face page, and fetch the Parquet file directly with wget:
wget https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca/resolve/main/data/train-00000-of-00001-8b6e212f3e1ece96.parquet
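To sanity-check the download, you can load the Parquet file with pandas and confirm it has roughly 18k rows and the columns used by the preprocessing script in the next step. The sketch below assumes the file sits under ./data (the cloned repository layout) or in the current directory (if fetched with wget):
import glob

import pandas as pd

# Look for the Parquet file under ./data (git clone layout) or the current directory (wget).
parquet_files = glob.glob('./data/*.parquet') or glob.glob('./*.parquet')
assert parquet_files, 'No Parquet file found; check the download.'

df = pd.read_parquet(parquet_files[0])
print(df.shape)             # expect roughly 18k rows
print(df.columns.tolist())  # includes the 'prompt' and 'output' columns used below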
Step 2: Data Preprocessing
Next, you need to preprocess the data to ensure it is in the correct format. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.
If the container is not already running, start it with the following command:
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.01.starcoder2 bash
Next, run the following data preprocessing script:
import glob
import json

import pandas as pd

print('Preprocessing data to jsonl format...')

output_file = 'alpaca_python_18k-output.jsonl'

# Locate the downloaded Parquet file and load it into a DataFrame (assumes a single file).
parquet_file_path = glob.glob('./data/*.parquet')
parquet_file_list = ''.join(parquet_file_path)
df = pd.read_parquet(parquet_file_list)
instruct2code_list = df.to_dict('records')

with open(output_file, 'wt') as f:
    for o in instruct2code_list:
        # Keep only the instruction/input portion of the prompt; the code after
        # '### Output' becomes the completion.
        prompt = o['prompt'][:o['prompt'].find('### Output')]
        completion = o['output']
        f.write(json.dumps({"input": prompt, "output": completion}) + "\n")

print(f'Data was successfully preprocessed and saved to {output_file}.')
The following shows an example of the output:
Preprocessing data to jsonl format...
Data was successfully preprocessed and saved to alpaca_python_18k-output.jsonl.
Check that the output jsonl file exists:
$ ls python_code_instructions_18k_alpaca/
.git/
.gitattributes
data/
alpaca_python_18k-output.jsonl
Check the first example in the output jsonl file:
$ head -n 1 python_code_instructions_18k_alpaca/alpaca_python_18k-output.jsonl
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a function to calculate the sum of a sequence of integers.\n\n### Input:\n[1, 2, 3, 4, 5]\n\n", "output": "# Python code\ndef sum_sequence(sequence):\n sum = 0\n for num in sequence:\n sum += num\n return sum"}
Step 3: Split the Data Into Train, Validation, and Test
Generate the train, validation, and test splits. You can use your own script to do this, or copy the following sample script, split_train_val.py, into the python_code_instructions_18k_alpaca directory.
import json
import random

input_file = "alpaca_python_18k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")
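Note that the sample script does not seed the random number generator, so each run produces a different shuffle. If you want reproducible splits, you can seed it before shuffling, for example:
import random

random.seed(42)        # any fixed value makes the split reproducible
random.shuffle(lines)  # replaces the unseeded shuffle in the script above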
Then, go to the python_code_instructions_18k_alpaca directory and generate the splits:
python3 split_train_val.py
Check for the training, validation, and test jsonl files:
$ ls python_code_instructions_18k_alpaca/
.git/
.gitattributes
data/
alpaca_python_18k-output.jsonl
training.jsonl
validation.jsonl
test.jsonl
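From inside the python_code_instructions_18k_alpaca directory, you can also confirm that the splits roughly follow the 80/15/5 proportions by counting the records in each file:
# Count the records in each split to confirm the 80/15/5 proportions.
for name in ('training.jsonl', 'validation.jsonl', 'test.jsonl'):
    with open(name) as f:
        print(name, sum(1 for _ in f))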
Step 4: Prepare Packed Dataset
Tokenize Dataset: Tokenize the entire dataset to obtain tokenized sequences, represented by indices.
# Any StarCoder2 .nemo model works for model.restore_from_path.
python /workspace/sequence_packing/tokenize_dataset.py \
model.data.train_ds.file_names=[/path/to/training.jsonl] \
model.data.train_ds.max_seq_length=4096 \
model.restore_from_path=/path/to/starcoder2.nemo \
+output_path=/path/to/my_dataset.npy
Note
model.data.train_ds.max_seq_length is used to truncate long sequences. A full .nemo model file is required for simplicity and readability.
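Optionally, you can inspect the tokenized dataset before packing. The sketch below assumes my_dataset.npy stores one tokenized sequence (a list of token IDs) per example and therefore must be loaded with allow_pickle=True; adjust if the layout differs in your NeMo version:
import numpy as np

# Assumption: my_dataset.npy holds an object array with one list of token IDs per example.
dataset = np.load('my_dataset.npy', allow_pickle=True)

lengths = [len(seq) for seq in dataset]
print(f'{len(dataset)} sequences; lengths range from {min(lengths)} to {max(lengths)} tokens')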
Group by Length: Group the tokenized sequences by their sequence length.
python /workspace/sequence_packing/create_hist.py \
--input_file=my_dataset.npy \
[--output_dir=./per_seq_data] \
[--output_histogram=histogram.npy] \
[--truncate_seq_len=2048]
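The histogram summarizes how many sequences fall into each length bucket, which the packing algorithm in the next step uses to combine short sequences efficiently. If you want a quick look at it, the sketch below assumes histogram.npy is a 1-D array of counts indexed by sequence length; the exact layout may differ in your NeMo version:
import numpy as np

# Assumption: histogram.npy is a 1-D array where index i counts the sequences of length i.
hist = np.load('histogram.npy', allow_pickle=True)
nonzero_lengths = np.nonzero(hist)[0]
print(f'{int(hist.sum())} sequences; lengths span {nonzero_lengths.min()}..{nonzero_lengths.max()} tokens')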
Run Packing Algorithm: Run the packing algorithm to find an efficient packing strategy.
python run_packing.py \
--output_dir <OUTPUT_DIR> \
--pack_size <PACK_SIZE> \
[--input_dir=./per_seq_data] \
[--histogram=histogram.npy] \
[--packing_algorithm=first_fit_shuffle] \
[--seed=0]