Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
NeMo Framework SFT with Mixtral-8x7B and Nemotron 4 340B
Project Description
Learning Goals
Often, we want to adapt or customize foundation models to make them more performant on a specific task. Fine-tuning refers to modifying the weights of a pre-trained foundation model with additional custom data. Supervised Fine-Tuning (SFT) refers to unfreezing all the weights and layers in the model and training it on a newly labeled set of examples. We can fine-tune to incorporate new, domain-specific knowledge, or to teach the foundation model what type of response to provide. One specific type of SFT is also referred to as "instruction tuning", where we use SFT to teach a model to follow instructions better.
In this project, you’ll test out the Supervised Fine-Tuning method on the Mixtral-8x7B or Nemotron 340B model using an instructive dataset.
NeMo Tools and Resources
NeMo Framework Training container:
nvcr.io/nvidia/nemo:24.07
Software Requirements
Use the latest NeMo Framework Training container
This playbook has been tested on nvcr.io/nvidia/nemo:24.07. It is expected to work similarly in other environments.
Hardware Requirements
Minimum 32xA100 80G (4 nodes) for SFT on Mixtral-8x7B
Minimum 96xA100 80G (12 nodes) for SFT on Nemotron 340B
Data
Databricks-dolly-15k is an open-source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. For more details about the data, refer to databricks-dolly-15k | Hugging Face.
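If you want a quick look at the raw records before working through the steps below, a minimal sketch using the Hugging Face datasets library (not required by this playbook) is:

from datasets import load_dataset

# Peek at the raw dolly-15k records; each has instruction, context,
# response, and category fields.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(ds))   # ~15k records
print(ds[0])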
The following steps have been tested with this container: nvcr.io/nvidia/nemo:24.07.
For Nemotron, you can skip the following conversion step and download directly from NVIDIA NGC: Nemotron-4-340B-Base.
Convert Mixtral-8x7B from Hugging Face Format to NeMo Format
If you already have a .nemo file for the Mixtral-8x7B model, you can skip this step.
Step 1: Download Mixtral-8x7B from the Hugging Face Hub
Request download permission from the model’s Hugging Face page and create the destination directory. Two options are available to download the checkpoint to local disk.
To download using the CLI tool:
mkdir mixtral-8x7B-hf
HF_TOKEN=<your-hf-token> huggingface-cli download mistralai/Mixtral-8x7B-v0.1 --local-dir mixtral-8x7B-hf
To download using the Hugging Face API, run the following Python code and replace the value for the token with your Hugging Face token:
from huggingface_hub import snapshot_download

snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1",
                  local_dir="mixtral-8x7B-hf",
                  local_dir_use_symlinks=False,
                  token="<YOUR HF TOKEN>")
If you are not logged in to your Hugging Face account, you must provide your Hugging Face token to download the checkpoint: pass it via the HF_TOKEN environment variable to huggingface-cli, or via the token argument to the Python API. If you are already logged in to your Hugging Face account on the machine running these commands, you can skip passing the token.
In this example, the Mixtral-8x7B Hugging Face model will be downloaded to ./mixtral-8x7B-hf.
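As a sketch of the environment-variable approach with the Python API, you can read HF_TOKEN from the environment instead of hard-coding it in the script:

import os
from huggingface_hub import snapshot_download

# Read the token from the environment rather than embedding it in the
# script; if HF_TOKEN is unset, the cached login (if any) is used.
snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1",
                  local_dir="mixtral-8x7B-hf",
                  token=os.environ.get("HF_TOKEN"))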
Step 2: Convert to .nemo
Run the container using the following command:
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash
Convert the Hugging Face model to .nemo model:
torchrun --nproc_per_node=1 /opt/NeMo/scripts/checkpoint_converters/convert_mixtral_hf_to_nemo.py --input_name_or_path=./mixtral-8x7B-hf/ --output_path=mixtral.nemo
The generated mixtral.nemo file uses distributed checkpointing and can be loaded with any tensor parallel (tp) or pipeline parallel (pp) combination without modifying (e.g. reshaping/splitting) the mixtral.nemo checkpoint.
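A .nemo file is a tar archive bundling the model config and checkpoint contents, so a quick sanity check after conversion is to list the first few archive entries; a minimal sketch using only the standard library:

import tarfile

# List the first entries to confirm the conversion produced a well-formed
# .nemo archive (model config plus checkpoint contents).
with tarfile.open("mixtral.nemo") as archive:
    for i, member in enumerate(archive):
        print(member.name)
        if i >= 9:
            break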
Prepare Data
Step 1: Download the dataset
Download the dolly-15k dataset from Hugging Face:
git clone https://huggingface.co/datasets/databricks/databricks-dolly-15k;
wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl -O databricks-dolly-15k/databricks-dolly-15k.jsonl
Once downloaded, verify the size of the file (databricks-dolly-15k.jsonl):
$ du -sh databricks-dolly-15k/databricks-dolly-15k.jsonl
13M databricks-dolly-15k/databricks-dolly-15k.jsonl
In addition, you can verify the integrity of the file using checksum:
$ sha256sum databricks-dolly-15k/databricks-dolly-15k.jsonl
2df9083338b4abd6bceb5635764dab5d833b393b55759dffb0959b6fcbf794ec databricks-dolly-15k/databricks-dolly-15k.jsonl
If the size or checksum does not match, inspect the logs to confirm that all commands ran successfully.
Step 2: Data Preprocessing
You'll need to preprocess the data to ensure it is in the correct format. The expected format is a JSONL file with {"input": "xxx", "output": "yyy"} pairs. To run the preprocessing, use the script that has already been prepared for you and pass your .jsonl file via --input.
To run the script, you need to launch the container. If the container is not already running, use the following command:
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash
Next, run the following data preprocess script:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input databricks-dolly-15k/databricks-dolly-15k.jsonl
The following shows an example output:
Preprocessing data to jsonl format...
Data was successfully preprocessed and saved by databricks-dolly-15k/databricks-dolly-15k-output.jsonl.
Check that the output .jsonl file exists:
$ ls databricks-dolly-15k/
.git/
.gitattributes
README.md
databricks-dolly-15k-output.jsonl
databricks-dolly-15k.jsonl
Check the first example in the output jsonl file:
$ head -n 1 databricks-dolly-15k/databricks-dolly-15k-output.jsonl
{"input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.\n\nWhen did Virgin Australia start operating?", "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.", "category": "closed_qa"}
Step 3: Split the Data into Train, Validation, and Test
Generate the train, validation, and test splits. You can use your own script or create a new one. If you need a starting point, use the following sample split_train_val.py by copying it into the databricks-dolly-15k directory:
import json
import random

input_file = "databricks-dolly-15k-output.jsonl"
training_output_file = "training.jsonl"
validation_output_file = "validation.jsonl"
test_output_file = "test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.80
validation_proportion = 0.15
test_proportion = 0.05

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
random.shuffle(lines)

# Calculate split sizes
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index + val_index]
test_data = lines[train_index + val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")
Then go to the databricks-dolly-15k directory and generate the splits:
python3 split_train_val.py
Check for the train, test, and validation jsonl files:
$ ls
README.md
databricks-dolly-15k.jsonl
databricks-dolly-15k-output.jsonl
split_train_val.py
training.jsonl
validation.jsonl
test.jsonl
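You can also confirm that the splits partition the data in roughly the 80/15/5 proportions chosen above; a minimal sketch, run from the databricks-dolly-15k directory:

# Count records per split; the totals should sum to the number of lines
# in databricks-dolly-15k-output.jsonl (~15k).
for name in ("training.jsonl", "validation.jsonl", "test.jsonl"):
    with open(name) as f:
        print(name, sum(1 for _ in f))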
Step 4: Run the SFT Script
Set the environment variables and then pass the paths to your train, test, and validation data files:
MODEL="YOUR PATH TO model.nemo or nemo-model-repository"
TRAIN_DS="[YOUR PATH TO databricks-dolly-15k/train.jsonl]"
VALID_DS="[YOUR PATH TO databricks-dolly-15k/validation.jsonl]"
TEST_DS="[YOUR PATH TO databricks-dolly-15k/test.jsonl]"
VALID_NAMES="[databricks-dolly-15k]"
Set the concat sampling probability. The value depends on the number of files passed in the train set and the percentage of the fine-tuning data you want to use from each file.
The sum of the concat sampling probabilities should be 1.0.
The following example shows how to set the concat sampling probability for a train set with two jsonl files.
TRAIN_DS="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"
In our example, we are using one train file, so CONCAT_SAMPLING_PROBS="[1.0]".
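If you start from raw mixture weights rather than probabilities, normalizing them guarantees they sum to 1.0; a minimal sketch (the weights here are hypothetical):

# Hypothetical per-dataset mixture weights; normalize so the
# concat sampling probabilities sum to exactly 1.0.
weights = [3, 7]                       # e.g. dataset_1.jsonl, dataset_2.jsonl
probs = [w / sum(weights) for w in weights]
print(probs)                           # [0.3, 0.7] -> CONCAT_SAMPLING_PROBS="[0.3,0.7]"
assert abs(sum(probs) - 1.0) < 1e-9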
Set the TP and PP values based on the model you are using. For Mixtral-8x7B, use:
CONCAT_SAMPLING_PROBS="[1]"
TP_SIZE=8
PP_SIZE=4
For Nemotron 340B, use:
CONCAT_SAMPLING_PROBS="[1]"
TP_SIZE=8
PP_SIZE=12
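As a sanity check on these settings, the product of the TP and PP sizes is the number of GPUs holding one model replica; dividing by the GPUs per node gives the minimum node count. A small sketch, assuming the 8-GPU-per-node configuration used in this playbook:

# TP x PP = GPUs per model replica; with 8 GPUs per node this reproduces
# the hardware requirements above (4 nodes for Mixtral, 12 for Nemotron).
GPUS_PER_NODE = 8
for model, tp, pp in (("Mixtral-8x7B", 8, 4), ("Nemotron 340B", 8, 12)):
    gpus = tp * pp
    print(model, gpus, "GPUs =", gpus // GPUS_PER_NODE, "nodes")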
Run the SFT command, setting appropriate values for parameters such as the number of steps, the model checkpoint path, and the batch sizes. For a full reference of parameter settings, refer to the config file. Since these models require multiple nodes, we will use a SLURM cluster to run the NeMo fine-tuning commands. Alternatively, you can use the NeMo-Framework-Launcher.
For our Mixtral-8x7B model, we will use 4 nodes and the following script:
#!/bin/bash
#SBATCH -p <partition> --exclusive --nodes=4 --mem=0 --overcommit --ntasks-per-node=8 --time=4:30:00 --dependency=singleton --job-name=multinode_sft_example:mixtral
# Load necessary modules and set environment variables
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Set model and training parameters
TRAIN="[/path/to/databricks-dolly-15k/training.jsonl]"
VALID="[/path/to/databricks-dolly-15k/validation.jsonl]"
CKPT="/path/to/mixtral_extracted"
TS=$(date +%s)
OUTPUT_PATH="/path/to/output"
RESULTS_DIR="$OUTPUT_PATH/results_${TS}"
CONCAT_SAMPLING_PROBS="[1]"
TP_SIZE=8
PP_SIZE=1
BS=8
MAX_LEN=512
# The NeMo command to run on each node.
run_cmd="python3 /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
trainer.precision=bf16 \
trainer.devices=\$SLURM_NTASKS_PER_NODE \
trainer.num_nodes=\$SLURM_JOB_NUM_NODES \
trainer.val_check_interval=0.5 \
trainer.max_steps=50 \
model.restore_from_path=${CKPT} \
model.micro_batch_size=1 \
model.global_batch_size=${BS} \
model.tensor_model_parallel_size=${TP_SIZE} \
model.pipeline_model_parallel_size=${PP_SIZE} \
model.sequence_parallel=True \
model.megatron_amp_O2=True \
model.activations_checkpoint_granularity=selective \
model.activations_checkpoint_method=uniform \
model.optim.name=distributed_fused_adam \
model.optim.lr=5e-6 \
model.answer_only_loss=True \
model.data.train_ds.file_names=${TRAIN} \
model.data.validation_ds.file_names=${VALID} \
model.data.test_ds.file_names=${TEST} \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.max_seq_length=${MAX_LEN} \
model.data.validation_ds.max_seq_length=${MAX_LEN} \
model.data.train_ds.micro_batch_size=1 \
model.data.train_ds.global_batch_size=${BS} \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=${BS} \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.global_batch_size=${BS} \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.test_ds.num_workers=0 \
model.data.validation_ds.metric.name=loss \
model.data.test_ds.metric.name=loss \
++model.peft.peft_scheme='none' \
exp_manager.create_wandb_logger=False \
exp_manager.explicit_log_dir=${RESULTS_DIR} \
exp_manager.resume_if_exists=True \
exp_manager.resume_ignore_no_checkpoint=True \
exp_manager.create_checkpoint_callback=True \
exp_manager.checkpoint_callback_params.monitor=validation_loss \
exp_manager.checkpoint_callback_params.save_best_model=False \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True"
# Container
CONT=nvcr.io/nvidia/nemo:24.07
CONT_NAME=nemofw-training
CONT_MOUNT=/path/to/:/path/to/
# run on SLURM
srun -l \
--ntasks-per-node=8 \
--container-name="${CONT_NAME}" \
--container-image="${CONT}" \
--container-mounts="${CONT_MOUNT}" \
--container-entrypoint \
--no-container-mount-home \
bash -c "${run_cmd}"
Modify the above script to set the appropriate paths: the extracted checkpoint (CKPT="/path/to/mixtral_extracted"), the output location (OUTPUT_PATH="/path/to/output"), the required mount points (CONT_MOUNT=/path/to/:/path/to/), the data paths (TRAIN="[/path/to/databricks-dolly-15k/training.jsonl]", VALID="[/path/to/databricks-dolly-15k/validation.jsonl]", and TEST="[/path/to/databricks-dolly-15k/test.jsonl]"), and the SLURM partition on the second line of the file (#SBATCH -p <partition>). After adjusting these settings, save the script to a file named sft_example.sh and launch it by running sbatch sft_example.sh.
After SFT completes, you should see output similar to the following. To get the wandb output, set exp_manager.create_wandb_logger=True and sign up for Weights & Biases to obtain an API key; you can follow the prompts in the terminal to log in.
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: consumed_samples ▁▃▅▆█
wandb: epoch ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: global_step ▁▃▅▆█
wandb: grad_norm █▃▄▃▁
wandb: lr ▁▁▁▁▁
wandb: reduced_train_loss █▅▇▆▁
wandb: train_backward_timing ▇█▅▁▇
wandb: train_step_timing ▃▁█▆▂
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: val_loss █▅▄▃▂▂▂▁▁▁
wandb: validation_loss █▅▄▃▂▂▂▁▁▁
wandb: validation_loss_databricks-dolly-15k █▅▄▃▂▂▂▁▁▁
wandb: validation_step_timing ▂██▂▂▆▃█▂██▂▂▆▇█▂██▂█▆▇█▂█▁▃█▆▇█▂▂▁▃█▆▇▇
wandb:
wandb: Run summary:
wandb: consumed_samples 6272.0
wandb: epoch 0
wandb: global_step 49.0
wandb: grad_norm 10.05424
wandb: lr 0.0
wandb: reduced_train_loss 1.66673
wandb: train_backward_timing 5e-05
wandb: train_step_timing 17.50282
wandb: trainer/global_step 49
wandb: val_loss 1.65022
wandb: validation_loss 1.65022
wandb: validation_loss_databricks-dolly-15k 1.65022
wandb: validation_step_timing 9.01902
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /results/wandb/offline-run-20230714_032640-iu65oacs
wandb: Find logs at: /results/wandb/offline-run-20230714_032640-iu65oacs/logs
Step 5: Run Evaluation
Run evaluation using megatron_gpt_generate.py.
Set the appropriate model checkpoint path (as obtained from training), the test file path, the batch sizes, the number of tokens to generate, and so on. To run this command with the Mixtral-8x7B model, we will use one node equipped with eight GPUs with 80 GB of VRAM each. Then, run the evaluation on the test file:
PATH_TO_TRAINED_MODEL=/results/checkpoints/megatron_gpt_sft.nemo
TEST_DS="[YOUR PATH TO test.jsonl]"
python3 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
trainer.devices=8 \
model.data.test_ds.file_names=${TEST_DS} \
model.data.test_ds.names=['dolly-15k_test'] \
model.data.test_ds.global_batch_size=16 \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.tokens_to_generate=20 \
model.tensor_model_parallel_size=8 \
model.pipeline_model_parallel_size=1 \
inference.greedy=True \
model.data.test_ds.output_file_path_prefix=/results/sft_results \
model.data.test_ds.write_predictions_to_file=True
The following shows a sample output:
$ tail -n 4 sft_results.jsonl
{"sentence": "What is Azure HDInsight? Azure HDInsight is a cloud service that provides a high-performance, scalable, and cost-effective way to run Apache Hadoop on the"}
{"sentence": "What is carnitine? Carnitine is a fatty acid that is found in the body. It is used to produce energy in the mitochondria of the cells. Carnit"}
{"sentence": "List some TV shows that Canadian actor William B. Davis has been in."}
{"sentence": "Identify which instrument is string or percussion: Handbell, Dobro, Drum"}
This is only a sample output from a toy SFT example, and your output may vary. Performance can be further improved by fine-tuning the model for more steps.
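To eyeball quality beyond the raw generations, you can pair each prediction with its reference answer from test.jsonl; a minimal sketch (the file names below are assumptions based on this playbook's settings):

import json

# Pair generated sentences with reference outputs for a quick qualitative
# review; paths for sft_results.jsonl and test.jsonl assumed from above.
with open("databricks-dolly-15k/test.jsonl") as f:
    references = [json.loads(line)["output"] for line in f]
with open("sft_results.jsonl") as f:
    predictions = [json.loads(line)["sentence"] for line in f]
for pred, ref in zip(predictions[:3], references[:3]):
    print("PRED:", pred)
    print("REF: ", ref)
    print()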