Learning Goals
Often we want to adapt or customize foundation models to be more performant on our specific task. Fine-tuning refers to how we can modify the weights of a pre-trained foundation model with additional custom data. Supervised fine-tuning (SFT) refers to unfreezing all the weights and layers in our model and training on a newly labeled set of examples. We can fine-tune to incorporate new, domain-specific knowledge, or teach the foundation model what type of response to provide. One specific type of SFT is also referred to as “instruction tuning” where we use SFT to teach a model to follow instructions better.
In this project, you’ll test out the supervised fine-tuning method on the Llama 2 model using an instructive dataset.
NeMo Tools and Resources
NeMo Framework Training container:
nvcr.io/ea-bignlp/ga-participants/nemofw-training:24.03
Software Requirements
Use the latest NeMo Framework Training container
This playbook has been tested using the nemo 24.03 container. It is expected to work similarly on other environments.
Hardware Requirements
Minimum 8xA100 80G (1 node) for SFT on 7B and 13B
SFT can be run on all (7B/13B/70B) model sizes on multiple nodes
Data
Databricks-dolly-15k is an open-source dataset created by the collaborative efforts of Databricks employees. It consists of high-quality human-generated prompt/response pairs specifically designed for instruction tuning LLMs. These pairs cover a diverse range of behaviors, from brainstorming and content generation to information extraction and summarization. For more information, refer to databricks-dolly-15k | Hugging Face.
If you already have a .nemo file for Llama models, you can skip this step.
Optional Step 1: Download Llama 2 in Hugging Face format
First, request download permission from both Hugging Face and Meta. Then, you need to create the destination directory. Two options are available.
Download by CLI login
mkdir llama2-7b-hf
huggingface-cli login
Utilize your Hugging Face API token to download data by running the following Python code
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="llama2-7b-hf",
    local_dir_use_symlinks=False,
    token="<YOUR HF TOKEN>",  # replace with your Hugging Face API token
)
In this example, the Llama 2 Hugging Face model will be downloaded to ./llama2-7b-hf.
Optional Step 2: Convert to .nemo
Run the container using the following command
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.03 bash
Convert the Hugging Face model to .nemo model
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=./llama2-7b-hf/ --output_path=llama2-7b.nemo
The generated llama2-7b.nemo file uses distributed checkpointing. It can be loaded with any Tensor Parallel (TP) or Pipeline Parallel (PP) combination without reshaping/splitting.
Step 1: Download dataset
Download the databricks-dolly-15k dataset from Hugging Face
git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k
Once downloaded, check the size of the file (databricks-dolly-15k.jsonl)
du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M databricks-dolly-15k/databricks-dolly-15k.jsonl
If the file sizes do not match, delete the old file, manually copy the download link address, and directly wget the file
wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
Step 2: Data Preprocessing
Next, you need to preprocess the data to ensure it’s in the correct format. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs, one JSON object per line.
To run the preprocessing, use the script that has already been prepared for you, passing your JSONL file as --input. The script must be run inside the container.
If the container is not already running, use the following command
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.03 bash
And then run the following data preprocess script
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl
Example output
Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl .
Check that the output jsonl file exists
ls databricks-dolly-15k/
.git/
.gitattributes
README.md
databricks-dolly-15k-output.jsonl
databricks-dolly-15k.jsonl
Check the first example in the output jsonl file
head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
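Before splitting the data, it can be worth confirming that every line of the preprocessed file matches the expected format. The following is a minimal sketch, not part of the NeMo tooling: the function name check_sft_jsonl and the REQUIRED_KEYS constant are hypothetical, and the only assumption about the format is the one stated above (each line is a JSON object with "input" and "output" keys; extra keys such as "category" are allowed).

```python
import json

# Keys named in the expected format described above.
REQUIRED_KEYS = ("input", "output")


def check_sft_jsonl(lines):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: not valid JSON ({err})") from err
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        count += 1
    return count


if __name__ == "__main__":
    sample = [
        '{"input": "xxx", "output": "yyy"}',
        '{"input": "xxx", "output": "yyy", "category": "closed_qa"}',
    ]
    print(check_sft_jsonl(sample))  # prints 2
```

To check the real file, pass an open file handle instead of the sample list, for example check_sft_jsonl(open("databricks-dolly-15k/databricks-dolly-15k-output.jsonl")).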
Step 3: Split the data into train, validation and test
To create the train, validation, and test splits, you can either use your own script or start from the sample below. To use the sample, save it as split_train_val.py in the databricks-dolly-15k directory
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05  # informational only: the test split receives the remainder

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")
Then go to the databricks-dolly-15k directory and generate the splits
python3 split_train_val.py
Check for the train, test, and validation jsonl files
ls
README.md
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
split_train_val.py
training.jsonl
validation.jsonl
test.jsonl
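The sample script shuffles without a fixed seed, so each run produces different splits. If reproducible splits are needed, the same slicing can be wrapped in a function that takes a seed. This is a sketch only: the function name split_lines and the seed parameter are hypothetical additions, while the 0.80/0.15 proportions and the "test gets the remainder" behavior mirror split_train_val.py.

```python
import random


def split_lines(lines, train_proportion=0.80, validation_proportion=0.15, seed=None):
    """Shuffle a copy of `lines` and split it into train/validation/test lists.

    The test split receives whatever remains after the first two splits,
    mirroring the slicing in split_train_val.py.
    """
    shuffled = list(lines)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * train_proportion)
    val_end = train_end + int(len(shuffled) * validation_proportion)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


if __name__ == "__main__":
    records = [f"record-{i}" for i in range(1000)]
    train, validation, test = split_lines(records, seed=1234)
    print(len(train), len(validation), len(test))  # prints 800 150 50
```

Calling split_lines twice with the same seed returns identical splits, which makes later evaluation numbers easier to compare across runs.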
Set the following environment variables to the paths of your model and your train, validation, and test data files. The variable names must match the ones referenced in the SFT command
MODEL="YOUR PATH TO llama2-7b.nemo"
TRAIN_DS="[YOUR PATH TO databricks-dolly-15k/training.jsonl]"
VALID_DS="[YOUR PATH TO databricks-dolly-15k/validation.jsonl]"
TEST_DS="[YOUR PATH TO databricks-dolly-15k/test.jsonl]"
VALID_NAMES="[databricks-dolly-15k]"
Set the concat sampling probability. It depends on the number of files in the train set and on the share of fine-tuning data you want to draw from each file. The concat sampling probabilities must sum to 1.0. For example, the following sets the concat sampling probabilities for a train set with two jsonl files.
TRAIN_DS="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"
In our example, we are using one train file, so CONCAT_SAMPLING_PROBS="[1.0]".
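When several training files are used, the two constraints above (one probability per file, probabilities summing to 1.0) are easy to get wrong by hand. The following sketch validates them and builds the bracketed strings in the same shape as the examples above. The helper names hydra_list and build_train_overrides are hypothetical and not part of NeMo. Note that the sum is compared with a tolerance, because 0.3 + 0.7 is not exactly 1.0 in floating point.

```python
import math


def hydra_list(values):
    """Format values as a bracketed, comma-separated list such as [a,b]."""
    return "[" + ",".join(str(value) for value in values) + "]"


def build_train_overrides(file_names, probabilities):
    """Validate the sampling probabilities and return the two bracketed strings."""
    if len(file_names) != len(probabilities):
        raise ValueError("need exactly one probability per training file")
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    # Compare with a tolerance: floating-point sums are rarely exactly 1.0.
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {sum(probabilities)}, expected 1.0")
    return hydra_list(file_names), hydra_list(probabilities)


if __name__ == "__main__":
    files, probs = build_train_overrides(
        ["/path/to/dataset_1.jsonl", "/path/to/dataset_2.jsonl"], [0.3, 0.7]
    )
    print(files)  # prints [/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]
    print(probs)  # prints [0.3,0.7]
```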
Set the TP and PP values based on the model you are using.
CONCAT_SAMPLING_PROBS="[1]"
TP_SIZE=2
PP_SIZE=1
Run the SFT command and set the values for the parameters, including the number of steps, model checkpoint path, batch sizes, etc. For a full reference of parameter settings, refer to the config file
torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
trainer.precision=bf16 \
trainer.devices=8 \
trainer.num_nodes=1 \
trainer.val_check_interval=0.1 \
trainer.max_steps=50 \
model.restore_from_path=${MODEL} \
model.micro_batch_size=1 \
model.global_batch_size=128 \
model.tensor_model_parallel_size=${TP_SIZE} \
model.pipeline_model_parallel_size=${PP_SIZE} \
model.megatron_amp_O2=True \
model.sequence_parallel=True \
model.activations_checkpoint_granularity=selective \
model.activations_checkpoint_method=uniform \
model.optim.name=distributed_fused_adam \
model.optim.lr=5e-6 \
model.answer_only_loss=True \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.validation_ds.file_names=${VALID_DS} \
model.data.test_ds.file_names=${TEST_DS} \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.max_seq_length=2048 \
model.data.validation_ds.max_seq_length=2048 \
model.data.train_ds.micro_batch_size=1 \
model.data.train_ds.global_batch_size=128 \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=128 \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.global_batch_size=256 \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.test_ds.num_workers=0 \
model.data.validation_ds.metric.name=loss \
model.data.test_ds.metric.name=loss \
exp_manager.create_wandb_logger=False \
exp_manager.explicit_log_dir=/results \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=validation_loss \
exp_manager.checkpoint_callback_params.save_best_model=False \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
++cluster_type=BCP
For Llama 2 13B SFT, change the following settings
model.tensor_model_parallel_size=4
model.pipeline_model_parallel_size=1
For Llama 2 70B SFT, use four nodes and change the following settings
model.tensor_model_parallel_size=8
model.pipeline_model_parallel_size=4
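The TP and PP values interact with the GPU count and the batch sizes. As a rough sketch of the usual Megatron-style arithmetic, the GPUs are divided into groups of TP x PP that each hold one model replica, the number of replicas is the data-parallel size, and the global batch is reached by gradient accumulation over micro-batches. The function name parallel_layout is hypothetical, and 8 GPUs per node for the four-node 70B case is an assumption based on the hardware requirements above.

```python
def parallel_layout(num_nodes, gpus_per_node, tp_size, pp_size,
                    global_batch_size, micro_batch_size):
    """Derive the data-parallel size and gradient accumulation steps for a run."""
    world_size = num_nodes * gpus_per_node
    model_parallel_size = tp_size * pp_size  # GPUs needed for one model replica
    if world_size % model_parallel_size != 0:
        raise ValueError("TP x PP must divide the total number of GPUs")
    data_parallel_size = world_size // model_parallel_size
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro batch x DP size")
    return {
        "world_size": world_size,
        "data_parallel_size": data_parallel_size,
        "grad_accumulation_steps": global_batch_size // samples_per_pass,
    }


if __name__ == "__main__":
    # 7B settings from the SFT command: 1 node, 8 GPUs, TP=2, PP=1.
    print(parallel_layout(1, 8, 2, 1, 128, 1))
    # 13B settings: TP=4, PP=1 on the same node.
    print(parallel_layout(1, 8, 4, 1, 128, 1))
    # 70B settings: TP=8, PP=4 on four nodes (8 GPUs per node is assumed).
    print(parallel_layout(4, 8, 8, 4, 128, 1))
```

Under these assumptions the 7B run has a data-parallel size of 4 with 32 accumulation steps, the 13B run has 2 with 64, and the 70B run has 1 with 128.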
Note: To run SFT on multiple nodes (for example, 70B model) on a Slurm cluster, replace the torchrun --nproc_per_node=8 with python.
After SFT completes, you should see output similar to the following. To get the wandb output, set exp_manager.create_wandb_logger=True and sign up for W&B to obtain an API key. The prompts in the terminal walk you through the setup.
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: consumed_samples ▁▃▅▆█
wandb: epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: global_step ▁▃▅▆█
wandb: grad_norm █▃▄▃▁
wandb: lr ▁▁▁▁▁
wandb: reduced_train_loss █▅▇▆▁
wandb: train_backward_timing ▇█▅▁▇
wandb: train_step_timing ▃▁█▆▂
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: val_loss █▅▄▃▂▂▂▁▁▁
wandb: validation_loss █▅▄▃▂▂▂▁▁▁
wandb: validation_loss_databricks-dolly-15k █▅▄▃▂▂▂▁▁▁
wandb: validation_step_timing ▂██▂▂▆▃█▂██▂▂▆▇█▂██▂█▆▇█▂█▁▃█▆▇█▂▂▁▃█▆▇▇
wandb:
wandb: Run summary:
wandb: consumed_samples 6272.0
wandb: epoch 0
wandb: global_step 49.0
wandb: grad_norm 10.05424
wandb: lr 0.0
wandb: reduced_train_loss 1.66673
wandb: train_backward_timing 5e-05
wandb: train_step_timing 17.50282
wandb: trainer/global_step 49
wandb: val_loss 1.65022
wandb: validation_loss 1.65022
wandb: validation_loss_databricks-dolly-15k 1.65022
wandb: validation_step_timing 9.01902
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /results/wandb/offline-run-20230714_032640-iu65oacs
wandb: Find logs at: /results/wandb/offline-run-20230714_032640-iu65oacs/logs
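The run summary above can be sanity-checked against the batch settings in the SFT command. The logged values are consistent with consumed_samples being the logged global_step multiplied by the global batch size. The helper name samples_consumed is hypothetical, and the relationship is inferred from the numbers shown here rather than taken from the NeMo documentation.

```python
def samples_consumed(optimizer_steps, global_batch_size):
    """Number of training samples consumed after the given number of optimizer steps."""
    return optimizer_steps * global_batch_size


if __name__ == "__main__":
    # global_step=49 and model.global_batch_size=128 match consumed_samples=6272.
    print(samples_consumed(49, 128))  # prints 6272
    # All 50 steps requested by trainer.max_steps=50.
    print(samples_consumed(50, 128))  # prints 6400
```

Since the full run touches only a few thousand samples, it covers less than one pass over the training split, which agrees with the logged epoch value of 0.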
Step 6: Run evaluation
Run evaluation using megatron_gpt_generate.py. First, set the appropriate model checkpoint path, test file path, batch sizes, number of tokens, etc. Then, run evaluation on the test file
PATH_TO_TRAINED_MODEL=/results/checkpoints/megatron_gpt_sft.nemo
TEST_DS="[YOUR PATH TO test.jsonl]"
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
trainer.devices=8 \
model.data.test_ds.file_names=${TEST_DS} \
model.data.test_ds.names=['dolly-15k_test'] \
model.data.test_ds.global_batch_size=16 \
model.data.test_ds.micro_batch_size=2 \
model.data.test_ds.tokens_to_generate=20 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
inference.greedy=True \
model.data.test_ds.output_file_path_prefix=/results/sft_results \
model.data.test_ds.write_predictions_to_file=True
Sample Output
tail -n 4 sft_results.jsonl
{"sentence": "What is Azure HDInsight? Azure HDInsight is a cloud service that provides a high-performance, scalable, and cost-effective way to run Apache Hadoop on the"}
{"sentence": "What is carnitine? Carnitine is a fatty acid that is found in the body. It is used to produce energy in the mitochondria of the cells. Carnit"}
{"sentence": "List some TV shows that Canadian actor William B. Davis has been in."}
{"sentence": "Identify which instrument is string or percussion: Handbell, Dobro, Drum"}
Note: This is only a sample output (based on a toy SFT example) and your output may vary. The performance can be further improved by fine-tuning the model for more steps.
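To read the predictions more easily, the generated text can be separated from the prompt. The sketch below makes two assumptions that should be verified against your own output: each "sentence" value is the prompt followed by the generated tokens, and predictions appear in the same order as the inputs in test.jsonl. The function names generated_text and pair_predictions are hypothetical.

```python
import json


def generated_text(prompt, sentence):
    """Return the part of `sentence` after `prompt`, or None if it does not start with it."""
    if not sentence.startswith(prompt):
        return None
    return sentence[len(prompt):].strip()


def pair_predictions(prompts, prediction_lines):
    """Pair each prompt with the text generated for it (order is assumed to match)."""
    pairs = []
    for prompt, line in zip(prompts, prediction_lines):
        sentence = json.loads(line)["sentence"]
        pairs.append((prompt, generated_text(prompt, sentence)))
    return pairs


if __name__ == "__main__":
    prompts = ["What is carnitine?", "List some TV shows."]
    predictions = [
        '{"sentence": "What is carnitine? Carnitine is a fatty acid."}',
        '{"sentence": "List some TV shows."}',
    ]
    for prompt, generation in pair_predictions(prompts, predictions):
        print(repr(prompt), "->", repr(generation))
```

An empty string in the second position marks a prompt for which nothing was generated, as in the last two lines of the sample output above.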