NeMo Framework Supervised fine-tuning (SFT) with Llama2

Learning Goals

Often we want to adapt or customize foundation models to be more performant on our specific task. Fine-tuning refers to how we can modify the weights of a pre-trained foundation model with additional custom data. Supervised fine-tuning (SFT) refers to unfreezing all the weights and layers in our model and training on a newly labeled set of examples. We can fine-tune to incorporate new, domain-specific knowledge or teach the foundation model what type of response to provide. One specific type of SFT is also referred to as “instruction tuning” where we use SFT to teach a model to follow instructions better.

In this project, you'll try out supervised fine-tuning on the Llama 2 model using an instruction-following dataset.

NeMo Tools and Resources

  1. NeMo GitHub repo

  2. NeMo Framework Training container: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03

Software Requirements

  1. Use the latest NeMo Framework Training container

  2. This playbook has been tested using the container: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03 on DGX Cloud. It is expected to work similarly on other environments.

Hardware Requirements

  1. Minimum 8xA100 80G (1 node) for SFT on 7B and 13B.

  2. SFT on all model sizes (7B, 13B, and 70B) can also be run across multiple nodes.

Data

Databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. For more details about the data, refer to databricks-dolly-15k | Hugging Face.

The following steps have been tested with this container: nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03

If you already have a .nemo file for the Llama 2 model, you can skip Optional Steps 1 and 2.

Optional Step 1: Download llama2 in huggingface format

First, request download permission from both Hugging Face and Meta, and create the destination directory. Then you can either log in first with the Hugging Face CLI:

mkdir llama2-7b-hf
huggingface-cli login

Or pass your Hugging Face API token directly when downloading with the following Python code:

from huggingface_hub import snapshot_download

# Replace the placeholder with your own Hugging Face access token
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="llama2-7b-hf",
    local_dir_use_symlinks=False,
    token="<YOUR HF TOKEN>",
)

In this example, the Llama 2 Hugging Face model is downloaded to ./llama2-7b-hf.

Optional Step 2: Convert to .nemo

Run the container using the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it \
    -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results \
    nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03 bash

Convert the Hugging Face model to a .nemo model:

python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path=./llama2-7b-hf/ \
    --output_path=llama2-7b.nemo

The generated llama2-7b.nemo file uses distributed checkpointing and can be loaded with any tensor parallel (tp) or pipeline parallel (pp) combination without reshaping/splitting.

Step 1: Download dataset

Download the dolly-15k dataset from huggingface

git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k

Once downloaded, check the size of the file (databricks-dolly-15k.jsonl):

$ du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M databricks-dolly-15k/databricks-dolly-15k.jsonl

If the size does not match, delete the old file, then copy the file's download link address and fetch the file directly with wget:

wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
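One possible reason for a size mismatch after git clone, offered here as an assumption rather than something the playbook states, is that Git LFS was not available and only a small LFS pointer file was checked out. The following minimal sketch checks for the standard Git LFS pointer header; the helper names are hypothetical.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def describe_download(path):
    """Summarize the file size and whether it looks like an LFS pointer."""
    size_mib = Path(path).stat().st_size / (1024 * 1024)
    if looks_like_lfs_pointer(path):
        kind = "Git LFS pointer (re-download needed)"
    else:
        kind = "regular file"
    return f"{path}: {size_mib:.1f} MiB, {kind}"
```

For example, print(describe_download("databricks-dolly-15k/databricks-dolly-15k.jsonl")) reports whether the wget fallback is likely to be needed.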

Step 2: Data Preprocessing

  1. Next, pre-process the data to ensure it is in the correct format.

  2. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs.

  3. To run the pre-processing, use the script that has already been prepared for you, passing your jsonl file as --input. The script must be run inside the container.

If the container is not already running, use the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it \
    -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results \
    nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03 bash

Then run the following data preprocessing script:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl

Example output:

Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl .

Check that the output jsonl file exists:

$ ls databricks-dolly-15k/
.git/ .gitattributes README.md databricks-dolly-15k-output.jsonl databricks-dolly-15k.jsonl

Check the first example in the output jsonl file:

$ head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
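Looking at the first record is a good spot check, but it can also help to confirm that every line follows the expected input/output format. The following is a minimal sketch, not part of the NeMo tooling; the function name and the returned structure are assumptions made for illustration.

```python
import json


def validate_sft_jsonl(path, required_keys=("input", "output")):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((line_number, "missing keys: " + ", ".join(missing)))
    return problems
```

Calling validate_sft_jsonl("databricks-dolly-15k/databricks-dolly-15k-output.jsonl") should return an empty list if the pre-processing succeeded.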

Step 3: Split the data into train, validation and test.

Generate the train, validation, and test splits. You can use your own script, or copy the following sample script into the databricks-dolly-15k directory as split_train_val.py:

import json
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05  # informational only: the test split takes whatever remains

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")

Then go to the databricks-dolly-15k directory and generate the splits:

python3 split_train_val.py

Check for the train, test and validation jsonl files:

$ ls
README.md databricks-dolly-15k.jsonl databricks-dolly-15k-output.jsonl split_train_val.py training.jsonl validation.jsonl test.jsonl
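To see how many records end up in each file, the index arithmetic from the sample split script can be isolated into a small function. This is a sketch that mirrors the script's logic (truncating with int and giving the remainder to the test split); the function name is hypothetical.

```python
def split_sizes(total, train_proportion=0.80, validation_proportion=0.15):
    """Return (train, validation, test) record counts for a dataset of `total` lines."""
    train = int(total * train_proportion)
    validation = int(total * validation_proportion)
    test = total - train - validation  # the test split takes whatever remains
    return train, validation, test


if __name__ == "__main__":
    # Count the lines of the pre-processed file and report the expected split sizes.
    with open("databricks-dolly-15k-output.jsonl", "r") as f:
        total = sum(1 for _ in f)
    print(split_sizes(total))
```

Because the two int() calls round down, the test split can be slightly larger than 5% of the data; the three counts always add up to the total.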

Set the environment variables, passing the paths to your train, validation, and test data files:

MODEL="YOUR PATH TO llama2-7b.nemo"
TRAIN="[YOUR PATH TO databricks-dolly-15k/training.jsonl]"
VALID="[YOUR PATH TO databricks-dolly-15k/validation.jsonl]"
TEST="[YOUR PATH TO databricks-dolly-15k/test.jsonl]"
VALID_NAMES="[databricks-dolly-15k]"

Set the concat sampling probabilities. These depend on the number of files in the train set and on what fraction of the fine-tuning data you want to draw from each file. Note that the concat sampling probabilities must sum to 1.0. For example, the following sets the concat sampling probabilities for a train set with two jsonl files:

TRAIN="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"
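The two constraints described above (one probability per train file, and probabilities summing to 1.0) are easy to check before launching a job. The following sketch builds the bracketed list strings in the same form as the example; the helper names are hypothetical and not part of NeMo.

```python
import math


def bracketed_list(values):
    """Format values as a bracketed, comma-separated list without spaces."""
    return "[" + ",".join(str(value) for value in values) + "]"


def build_train_overrides(file_names, probabilities):
    """Validate the sampling probabilities and return the TRAIN and CONCAT_SAMPLING_PROBS strings."""
    if len(file_names) != len(probabilities):
        raise ValueError("need exactly one probability per train file")
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1.0")
    return {
        "TRAIN": bracketed_list(file_names),
        "CONCAT_SAMPLING_PROBS": bracketed_list(probabilities),
    }
```

For the two-file example, build_train_overrides(["/path/to/dataset_1.jsonl", "/path/to/dataset_2.jsonl"], [0.3, 0.7]) yields the same two strings shown in the shell snippet.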

In our example, we are using one train file, so CONCAT_SAMPLING_PROBS="[1.0]". Set the tensor parallelism and pipeline parallelism values based on the model you are using.

CONCAT_SAMPLING_PROBS="[1]"
TP_SIZE=8
PP_SIZE=1

Run the SFT command, setting appropriate values for parameters such as the number of steps, model checkpoint path, and batch sizes. For a full reference of parameter settings, refer to the config file.

torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=1 \
    trainer.val_check_interval=0.1 \
    trainer.max_steps=50 \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.megatron_amp_O2=True \
    model.sequence_parallel=True \
    model.activations_checkpoint_granularity=selective \
    model.activations_checkpoint_method=uniform \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.validation_ds.file_names=${VALID} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=2048 \
    model.data.validation_ds.max_seq_length=2048 \
    model.data.train_ds.micro_batch_size=1 \
    model.data.train_ds.global_batch_size=128 \
    model.data.validation_ds.micro_batch_size=1 \
    model.data.validation_ds.global_batch_size=128 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.global_batch_size=256 \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=0 \
    model.data.test_ds.num_workers=0 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=/results \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    ++cluster_type=BCP

Change the following settings for Llama 2 13B SFT:

trainer.num_nodes=1
trainer.devices=8
model.micro_batch_size=1
model.tensor_model_parallel_size=4
model.pipeline_model_parallel_size=1

Change the following settings for Llama 2 70B SFT:

trainer.num_nodes=32
trainer.devices=256
model.tensor_model_parallel_size=8
model.pipeline_model_parallel_size=1
model.activations_checkpoint_granularity=full
model.activations_checkpoint_num_layers=1

Note: For running SFT on multiple nodes (for example, 70B model) on a Slurm cluster, replace the torchrun --nproc_per_node=8 with python.
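The tensor and pipeline parallel sizes also determine how the batch sizes play out. As a rough sketch, assuming the usual Megatron-style layout in which the GPUs not consumed by tensor or pipeline parallelism form data-parallel replicas, the hypothetical helper below (not part of NeMo) derives the data-parallel size and the number of gradient accumulation steps from the single-node 7B and 13B settings in this playbook.

```python
def data_parallel_layout(num_gpus, tp_size, pp_size, micro_batch_size, global_batch_size):
    """Return (data_parallel_size, gradient_accumulation_steps) for a Megatron-style layout."""
    model_parallel_size = tp_size * pp_size
    if num_gpus % model_parallel_size != 0:
        raise ValueError("num_gpus must be divisible by tp_size * pp_size")
    data_parallel_size = num_gpus // model_parallel_size
    samples_per_micro_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel_size"
        )
    return data_parallel_size, global_batch_size // samples_per_micro_step


if __name__ == "__main__":
    # 7B settings from this playbook: 8 GPUs, TP=8, PP=1, micro batch 1, global batch 128
    print(data_parallel_layout(8, 8, 1, 1, 128))
    # 13B settings from this playbook: 8 GPUs, TP=4, PP=1, micro batch 1, global batch 128
    print(data_parallel_layout(8, 4, 1, 1, 128))
```

Under these assumptions, lowering the tensor parallel size on the same node increases the number of data-parallel replicas and reduces the accumulation steps needed to reach the same global batch size.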

After completion of SFT, you should get an output similar to the following. If you want to get the wandb output, make sure to set exp_manager.create_wandb_logger=True and sign up for W&B to get the API key. You can follow the steps on the terminal to accomplish that.

wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: consumed_samples ▁▃▅▆█
wandb: epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: global_step ▁▃▅▆█
wandb: grad_norm █▃▄▃▁
wandb: lr ▁▁▁▁▁
wandb: reduced_train_loss █▅▇▆▁
wandb: train_backward_timing ▇█▅▁▇
wandb: train_step_timing ▃▁█▆▂
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: val_loss █▅▄▃▂▂▂▁▁▁
wandb: validation_loss █▅▄▃▂▂▂▁▁▁
wandb: validation_loss_databricks-dolly-15k █▅▄▃▂▂▂▁▁▁
wandb: validation_step_timing ▂██▂▂▆▃█▂██▂▂▆▇█▂██▂█▆▇█▂█▁▃█▆▇█▂▂▁▃█▆▇▇
wandb:
wandb: Run summary:
wandb: consumed_samples 6272.0
wandb: epoch 0
wandb: global_step 49.0
wandb: grad_norm 10.05424
wandb: lr 0.0
wandb: reduced_train_loss 1.66673
wandb: train_backward_timing 5e-05
wandb: train_step_timing 17.50282
wandb: trainer/global_step 49
wandb: val_loss 1.65022
wandb: validation_loss 1.65022
wandb: validation_loss_databricks-dolly-15k 1.65022
wandb: validation_step_timing 9.01902
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /results/wandb/offline-run-20230714_032640-iu65oacs
wandb: Find logs at: /results/wandb/offline-run-20230714_032640-iu65oacs/logs
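One quick sanity check on the run summary is that consumed_samples matches the logged global_step multiplied by the global batch size. A minimal sketch, using the values from the sample log and the model.global_batch_size=128 setting from the SFT command; the helper name is hypothetical.

```python
def expected_consumed_samples(global_step, global_batch_size):
    """Samples consumed after `global_step` optimizer steps at a fixed global batch size."""
    return global_step * global_batch_size


if __name__ == "__main__":
    # Values taken from the sample run summary: global_step 49, global batch size 128
    print(expected_consumed_samples(49, 128))
```

For the sample log this gives 6272, which agrees with the consumed_samples value in the run summary.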

Step 6: Run evaluation

Run evaluation using megatron_gpt_generate.py

Set the appropriate model checkpoint path, test file path, batch sizes, number of tokens etc. and run evaluation on the test file

PATH_TO_TRAINED_MODEL=/results/checkpoints/megatron_gpt_sft.nemo
TEST_DS="[YOUR PATH TO test.jsonl]"
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=8 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=['dolly-15k_test'] \
    model.data.test_ds.global_batch_size=16 \
    model.data.test_ds.micro_batch_size=2 \
    model.data.test_ds.tokens_to_generate=20 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=/results/sft_results \
    model.data.test_ds.write_predictions_to_file=True

Sample Output

$ tail -n 4 sft_results.jsonl
{"sentence": "What is Azure HDInsight? Azure HDInsight is a cloud service that provides a high-performance, scalable, and cost-effective way to run Apache Hadoop on the"}
{"sentence": "What is carnitine? Carnitine is a fatty acid that is found in the body. It is used to produce energy in the mitochondria of the cells. Carnit"}
{"sentence": "List some TV shows that Canadian actor William B. Davis has been in."}
{"sentence": "Identify which instrument is string or percussion: Handbell, Dobro, Drum"}
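To inspect the predictions programmatically rather than with tail, the results file can be read line by line. This sketch assumes each line is a JSON object with a "sentence" field, as in the sample output; the function name is hypothetical, and the file name written under output_file_path_prefix may differ in your run, so the path is left as a parameter.

```python
import json


def load_sentences(path, limit=None):
    """Read a predictions JSONL file and return the "sentence" field of each record."""
    sentences = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            sentences.append(json.loads(line)["sentence"])
            if limit is not None and len(sentences) >= limit:
                break
    return sentences
```

For example, load_sentences("sft_results.jsonl", limit=5) returns the first five generated sentences as a list of strings.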

Note: This is only a sample output (based on a toy SFT example), and your output may vary. Performance can be improved further by fine-tuning the model for more steps.

© Copyright 2023-2024, NVIDIA. Last updated on Apr 12, 2024.