Note

Attention: Dedicated Container for Gemma

For Gemma models, please use the nvcr.io/nvidia/nemo:24.05 container. Also check our Gemma playbooks.

Note

Attention: Dedicated Container for CodeGemma

For CodeGemma models, please use the nvcr.io/nvidia/nemo:24.05 container. Please modify the dataset files accordingly in the commands below.

Parameter Efficient Fine-Tuning (PEFT)

Please prepare the datasets according to the Data Preparation for SFT and PEFT section before proceeding.

Run PEFT inside NeMo Container

Step 1: Start NeMo Container

If the container is not already running, start it with the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.05 bash

Step 2: Run PEFT

The megatron_gpt_finetuning_config.yaml file configures PEFT training jobs in NeMo, including the P-Tuning and LoRA techniques for language model tuning. Set the environment variables below, passing the paths to your train, validation, and test data files:

MODEL="YOUR PATH TO gemma-7b.nemo"
TRAIN="[YOUR PATH TO databricks-dolly-15k/train.jsonl]"
VALID="[YOUR PATH TO databricks-dolly-15k/validation.jsonl]"
TEST="[YOUR PATH TO databricks-dolly-15k/test.jsonl]"
VALID_NAMES="[databricks-dolly-15k]"
SCHEME="lora"

Set the concat sampling probability. It depends on the number of files passed in the train set and on the fraction of the fine-tuning data you want to draw from each file. The concat sampling probabilities must sum to 1.0. For example, for a train set with two jsonl files:

TRAIN="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"
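
As a quick sanity check before launching, the constraint above can be verified in a few lines of Python. This is a minimal sketch and not part of NeMo: the helper names parse_list and check_sampling_probs are hypothetical, and the bracketed-string parsing simply mirrors how TRAIN and CONCAT_SAMPLING_PROBS are written in this guide.

```python
import math


def parse_list(value: str) -> list[str]:
    """Split a bracketed, comma-separated string such as "[a,b]" into items."""
    inner = value.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


def check_sampling_probs(train: str, probs: str) -> list[float]:
    """Return the parsed probabilities, or raise ValueError if they are invalid."""
    files = parse_list(train)
    values = [float(p) for p in parse_list(probs)]
    if len(files) != len(values):
        raise ValueError(f"{len(files)} files but {len(values)} probabilities")
    if any(v < 0 for v in values):
        raise ValueError("probabilities must be non-negative")
    # Compare with a tolerance because 0.3 + 0.7 is not exactly 1.0 in floating point.
    if not math.isclose(sum(values), 1.0, abs_tol=1e-6):
        raise ValueError(f"probabilities sum to {sum(values)}, expected 1.0")
    return values


print(check_sampling_probs(
    "[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]", "[0.3,0.7]"
))
```

The check fails early if the number of probabilities does not match the number of train files, which is an easy mistake to make when adding a dataset.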

In our example we use a single train file, so CONCAT_SAMPLING_PROBS="[1.0]". Set the tensor parallelism and pipeline parallelism values based on the model you are using.

CONCAT_SAMPLING_PROBS="[1.0]"
TP_SIZE=1
PP_SIZE=1

Run the PEFT command, setting values such as the number of steps, the model checkpoint path, and the batch sizes as appropriate. For a full reference of parameter settings, refer to the config file:

torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=8 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.val_check_interval=20 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=False \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.num_workers=0 \
    model.data.train_ds.add_bos=True \
    model.data.validation_ds.num_workers=0 \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.validation_ds.file_names=${VALID} \
    model.peft.peft_scheme=${SCHEME} \
    exp_manager.explicit_log_dir=/results \
    ++model.bias_activation_fusion=True \
    ++model.fp8=False \
    ++model.fp8_e4m3=False \
    ++model.fp8_hybrid=True \
    ++model.fp8_margin=0 \
    ++model.fp8_interval=1 \
    ++model.fp8_amax_history_len=128 \
    ++model.fp8_amax_compute_algo=max \
    ++model.fp8_params=True

Note: To run PEFT on multiple nodes (for example, on a Slurm cluster), replace torchrun --nproc_per_node=8 with python.
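
The batch-size and parallelism values in the command above are tied together. In Megatron-style training, the data-parallel size is the total number of GPUs divided by TP_SIZE times PP_SIZE, and the global batch size has to be a multiple of the micro batch size times that data-parallel size. The sketch below checks this arithmetic before a job is launched; the function name grad_accumulation_steps is hypothetical and not part of NeMo.

```python
def grad_accumulation_steps(devices, num_nodes, tp_size, pp_size,
                            micro_batch_size, global_batch_size):
    """Return the number of gradient-accumulation steps implied by the settings.

    Raises ValueError when the values cannot be combined.
    """
    world_size = devices * num_nodes
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError("devices * num_nodes must be divisible by TP * PP")
    data_parallel = world_size // model_parallel
    samples_per_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_step


# Values from the command above: 8 GPUs on 1 node, TP=1, PP=1.
print(grad_accumulation_steps(8, 1, 1, 1, micro_batch_size=1, global_batch_size=8))  # prints 1
```

With these settings every optimizer step consumes exactly one micro batch per GPU. Raising TP_SIZE to 2 halves the data-parallel size, so the same global batch size would then need two accumulation steps.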

Tuning with packed dataset: Enable training with packed sequences by adjusting the configs. Because each packed sequence holds several examples, the micro batch size and global batch size usually need to be reduced as well. Here we set micro_batch_size=1 and global_batch_size=8 with sequence length 4096:

model.data.train_ds.file_names=/path/to/dolly/packed_4096_seed0.npy \
+model.data.train_ds.packed_sequence=True \
model.micro_batch_size=1 \
model.global_batch_size=8

Tuning with FP8: Enable training with FP8 by adjusting configs:

++model.fp8=True

Step 3: Run evaluation

Run evaluation using megatron_gpt_generate.py.

Set the appropriate model checkpoint path, test file path, batch sizes, number of tokens, and so on, then run evaluation on the test file:

PATH_TO_TRAINED_MODEL=/results/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo
TEST_DS="[YOUR PATH TO test.jsonl]"
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=8 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=['dolly-15k_test'] \
    model.data.test_ds.global_batch_size=16 \
    model.data.test_ds.micro_batch_size=2 \
    model.data.test_ds.tokens_to_generate=20 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=/results/sft_results \
    model.data.test_ds.write_predictions_to_file=True

Sample Output

$ tail -n 4 sft_results.jsonl

{"sentence": "What is Azure HDInsight? Azure HDInsight is a cloud service that provides a high-performance, scalable, and cost-effective way to run Apache Hadoop on the"}
{"sentence": "What is carnitine? Carnitine is a fatty acid that is found in the body. It is used to produce energy in the mitochondria of the cells. Carnit"}
{"sentence": "List some TV shows that Canadian actor William B. Davis has been in."}
{"sentence": "Identify which instrument is string or percussion: Handbell, Dobro, Drum"}

Note: This is only a sample output (based on a toy SFT example), and your output may vary. Performance can be improved further by fine-tuning the model for more steps.
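
To inspect predictions programmatically rather than with tail, the output file can be read line by line as JSON. This is a minimal sketch, assuming each line is a JSON object with a "sentence" field as in the sample above. The helper name load_sentences is hypothetical, and the file name is taken from the sample; the actual name depends on output_file_path_prefix.

```python
import json
from pathlib import Path


def load_sentences(path):
    """Return the "sentence" value from every non-empty line of a JSONL file."""
    sentences = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                sentences.append(json.loads(line)["sentence"])
    return sentences


if __name__ == "__main__":
    results = Path("sft_results.jsonl")  # file name taken from the sample above
    if results.exists():
        # Print the last four generations, like `tail -n 4`.
        for sentence in load_sentences(results)[-4:]:
            print(sentence)
```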

Step 4 (Optional): Merge LoRA weights

If needed, you can merge the LoRA weights into a base GPT LM. Currently, only PP=1 is supported.

PATH_TO_MERGED_MODEL=/results/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_lora_merged.nemo
# Use trainer.accelerator=cpu if the model cannot fit in GPU memory.
python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    trainer.accelerator=gpu \
    tensor_model_parallel_size=${TP_SIZE} \
    pipeline_model_parallel_size=1 \
    gpt_model_file=${MODEL} \
    lora_model_path=${PATH_TO_TRAINED_MODEL} \
    merged_model_path=${PATH_TO_MERGED_MODEL}

To find the TP of the LoRA checkpoint, you can visually examine the output of:

tar -tvf ${PATH_TO_TRAINED_MODEL}

Here ${PATH_TO_TRAINED_MODEL} is the path to the LoRA checkpoint passed as lora_model_path in the merge command.
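
The same inspection can be scripted, since a .nemo checkpoint is a tar archive. The sketch below counts the distinct rank directories in the archive. It assumes that tensor-parallel shards are stored under directories named like mp_rank_00 or tp_rank_00_pp_rank_000, and that a checkpoint without such directories has TP=1. This guide does not confirm those naming conventions, so verify them against the tar listing of your own checkpoint. The function name infer_tp_size is hypothetical.

```python
import re
import tarfile

# Assumed directory naming for sharded checkpoints (not confirmed by this guide).
RANK_PATTERN = re.compile(r"(?:^|/)(?:mp_rank_|tp_rank_)(\d+)")


def infer_tp_size(nemo_path):
    """Count distinct tensor-parallel rank directories in a .nemo archive."""
    ranks = set()
    with tarfile.open(nemo_path, "r:*") as archive:
        for name in archive.getnames():
            match = RANK_PATTERN.search(name)
            if match:
                ranks.add(int(match.group(1)))
    # No rank directories is treated as an unsharded (TP=1) checkpoint.
    return max(len(ranks), 1)
```

Under those assumptions, the returned value could be passed as tensor_model_parallel_size in the merge command.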

Run PEFT with NeMo Launcher

To run PEFT, update conf/config.yaml:

defaults:
  - peft: gemma/squad

stages:
  - peft

Execute the launcher pipeline: python3 main.py.

Configuration

Default configurations for PEFT with squad can be found in conf/peft/gemma/squad.yaml. The fine-tuning configuration is divided into four sections: run, trainer, exp_manager, and model.

run:
  name: peft_gemma_7b
  time_limit: "04:00:00"
  dependency: "singleton"
  task_name: "squad"
  results_dir: ${base_results_dir}/peft_${.name}

Set the number of nodes and devices for fine-tuning:

trainer:
  num_nodes: 1
  devices: 8

model:
  restore_from_path: /path/to/gemma-7b.nemo

restore_from_path sets the path to the .nemo checkpoint used for fine-tuning.

peft_scheme sets the fine-tuning scheme to be used. Supported schemes include: lora, adapter, ia3, ptuning.