Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Note
Dedicated Container for StarCoder2
For StarCoder2 models, please use the nvcr.io/nvidia/nemo:24.07 container. Also check our StarCoder2 playbooks.
Parameter Efficient Fine-Tuning (PEFT)
Please prepare the datasets according to the Data Preparation for SFT and PEFT section before proceeding.
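For reference, each line of the prepared JSONL files is a single JSON record. The sketch below shows roughly what a record looks like, assuming the default input/output field names used by NeMo's SFT/PEFT data loader; the actual prompt text comes from the data preparation step.
$ head -n 1 python_code_instructions_18k_alpaca/train.jsonl
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n...\n\n### Input:\n...\n\n", "output": "..."}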
Run PEFT inside NeMo Container
Step 1: Start NeMo Container
If the container is not already running, use the following command:
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.07 bash
Step 2: Run PEFT
The megatron_gpt_finetuning_config.yaml file configures the parameters for running PEFT training jobs in NeMo with the P-Tuning and LoRA techniques for language model tuning. Set the environment variables, passing the paths to your training, validation, and test data files:
MODEL="YOUR PATH TO starcoder2.nemo"
TRAIN="[YOUR PATH TO python_code_instructions_18k_alpaca/train.jsonl]"
VALID="[YOUR PATH TO python_code_instructions_18k_alpaca/validation.jsonl]"
TEST="[YOUR PATH TO python_code_instructions_18k_alpaca/test.jsonl]"
VALID_NAMES="[python_code_instructions_18k_alpaca]"
SCHEME="lora"
Set the concat sampling probability. This depends on the number of files passed in the training set and what percentage of the fine-tuning data you would like to use from each file. Note that the concat sampling probabilities must sum to 1.0. For example, the following sets the concat sampling probabilities for a training set with two JSONL files:
TRAIN="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"
In our example we are using one training file, so CONCAT_SAMPLING_PROBS="[1.0]".
Set the tensor parallelism and pipeline parallelism values based on the model you are using.
CONCAT_SAMPLING_PROBS="[1.0]"
TP_SIZE=4
PP_SIZE=1
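As a quick sanity check (not part of the original workflow), the total number of GPUs, trainer.devices * trainer.num_nodes, must be divisible by TP_SIZE * PP_SIZE, since the remaining factor becomes the data-parallel size. A minimal bash sketch, assuming the single-node, 8-GPU setup used below:
# Hypothetical sanity check: total GPUs must be divisible by TP_SIZE * PP_SIZE.
DEVICES=8
NUM_NODES=1
if (( (DEVICES * NUM_NODES) % (TP_SIZE * PP_SIZE) != 0 )); then
  echo "Error: TP_SIZE * PP_SIZE must divide DEVICES * NUM_NODES" >&2
fi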
Run the PEFT command, setting appropriate values for parameters such as the number of steps, model checkpoint path, and batch sizes. For a full reference of parameter settings, refer to the config file:
torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
trainer.devices=8 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.val_check_interval=20 \
trainer.max_steps=50 \
model.megatron_amp_O2=False \
++model.mcore_gpt=True \
model.tensor_model_parallel_size=${TP_SIZE} \
model.pipeline_model_parallel_size=${PP_SIZE} \
model.micro_batch_size=1 \
model.global_batch_size=32 \
model.restore_from_path=${MODEL} \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.train_ds.file_names=${TRAIN} \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.validation_ds.file_names=${VALID} \
model.data.validation_ds.names=${VALID_NAMES} \
model.peft.peft_scheme=${SCHEME} \
exp_manager.explicit_log_dir=/results \
++model.bias_activation_fusion=True \
++model.fp8=False \
++model.fp8_e4m3=False \
++model.fp8_hybrid=True \
++model.fp8_margin=0 \
++model.fp8_interval=1 \
++model.fp8_amax_history_len=128 \
++model.fp8_amax_compute_algo=max \
++model.fp8_params=True
Note: For running PEFT on multiple nodes (for example, on a Slurm cluster), replace torchrun --nproc_per_node=8 with python.
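For illustration only, a minimal Slurm batch script might look like the sketch below. The partition, account, container setup, and remaining Hydra overrides are cluster-specific assumptions and are omitted here; PyTorch Lightning picks up the Slurm environment, so the script is launched with plain python under srun.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

# Same overrides as the single-node command, with trainer.num_nodes updated.
srun python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
trainer.num_nodes=2 \
trainer.devices=8 \
model.restore_from_path=${MODEL}
# ...remaining overrides as in the single-node command above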
Tuning with a packed dataset: Enable training with packed sequences by adjusting the configs. The micro batch size must be set to 1 and the global batch size reduced to account for packing. Here we set global_batch_size=8 and micro_batch_size=1 with a sequence length of 4096:
model.data.train_ds.file_names=/path/to/python_code_instructions_18k_alpaca/packed_4096_seed0.npy \
+model.data.train_ds.packed_sequence=True \
model.micro_batch_size=1 \
model.global_batch_size=8
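The packed .npy file referenced above is produced offline with NeMo's prepare_packed_ft_dataset.py script. The sketch below shows roughly how it can be invoked; the flag names and tokenizer path are assumptions based on recent NeMo releases, so verify them against the script's own config before use.
python /opt/NeMo/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
model.data.train_ds.file_names=[/path/to/python_code_instructions_18k_alpaca/train.jsonl] \
model.data.train_ds.max_seq_length=4096 \
+tokenizer_path=/path/to/tokenizer.model \
+output_dir=/path/to/python_code_instructions_18k_alpaca \
+pack_sizes=[4096]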
Tuning with FP8: Enable training with FP8 by adjusting configs:
++model.fp8=True
Step 3: Run evaluation
Run evaluation using megatron_gpt_generate.py. Set the appropriate model checkpoint path, test file path, batch sizes, number of tokens, and other parameters, then run evaluation on the test file:
PATH_TO_TRAINED_MODEL=/results/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo
TEST_DS="[YOUR PATH TO test.jsonl]"
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
trainer.devices=1 \
model.data.test_ds.file_names=${TEST_DS} \
model.data.test_ds.names=['python_code_instructions_18k_alpaca_test'] \
model.data.test_ds.global_batch_size=2 \
model.data.test_ds.micro_batch_size=2 \
model.data.test_ds.tokens_to_generate=20 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
inference.greedy=True \
model.data.test_ds.output_file_path_prefix=/results/sft_results \
model.data.test_ds.write_predictions_to_file=True
Sample Output
$ tail -n 2 sft_results.jsonl
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a python 3 script to generate a list of integers from 0 to 100.\n\n### Input:\n\n\n", "pred": " def generate_list():\n return list(range(101))\n\nprint(generate", "label": " list_of_integers = [x for x in range(0, 101)]"}
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite a Python program to compute the sum of items in a given list and multiply it by a given number.\n\n### Input:\n{'list': [1, 3, 5, 7], 'num': 3}\n\n", "pred": " def sum_list_multiply(list, num):\n return sum(list) * num\n", "label": " #initialize variables\nlist = [1, 3, 5, 7]\nnum = 3\n\n# compute sum\nsum = 0\nfor i in list:\n sum = sum + i\n\n# compute product\nresult = sum * num\n\n# Print result\nprint(\"Result: \", result)"}
Note: this is only a sample output (based on a toy SFT example) and your output may vary. The performance can be further improved by fine-tuning the model for more steps.
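To eyeball more predictions against their labels, something like the following works, assuming jq is available and that the prediction file uses the input/pred/label keys shown above:
# Print the first five prediction/label pairs, tab-separated.
jq -r '[.pred, .label] | @tsv' sft_results.jsonl | head -n 5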
Step 4 (Optional): Merge LoRA weights
If needed, you can merge the LoRA weights into the base GPT LM (StarCoder2). Currently, only PP=1 is supported.
PATH_TO_MERGED_MODEL=/results/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_lora_merged.nemo
# Use trainer.accelerator=cpu if the model cannot fit in GPU memory
python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
trainer.accelerator=gpu \
tensor_model_parallel_size=${TP_SIZE} \
pipeline_model_parallel_size=1 \
gpt_model_file=${MODEL} \
lora_model_path=${PATH_TO_TRAINED_MODEL} \
merged_model_path=${PATH_TO_MERGED_MODEL}
To find the TP size of the LoRA checkpoint, you can visually examine the output of:
tar -tvf ${PATH_TO_TRAINED_MODEL}
Replace ${PATH_TO_TRAINED_MODEL} with the path to your trained LoRA checkpoint.
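Once merged, the resulting .nemo file behaves like a regular checkpoint. As an illustrative sketch (the prompt text and generation settings are placeholders, and the parallelism values are assumed to match the merged checkpoint), it can be smoke-tested with NeMo's megatron_gpt_eval.py:
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
gpt_model_file=${PATH_TO_MERGED_MODEL} \
trainer.devices=${TP_SIZE} \
trainer.num_nodes=1 \
tensor_model_parallel_size=${TP_SIZE} \
pipeline_model_parallel_size=1 \
inference.greedy=True \
inference.tokens_to_generate=64 \
'prompts=["Write a Python function that reverses a string."]'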
Run PEFT with NeMo Launcher
To run PEFT, update conf/config.yaml:
defaults:
- peft: starcoder2/sft
stages:
- peft
Execute the launcher pipeline: python3 main.py.
Configuration
Default configurations for PEFT can be found in conf/peft/starcoder2/sft.yaml.
The fine-tuning configuration is divided into four sections: run, trainer, exp_manager, and model.
run:
name: peft_starcoder2
time_limit: "04:00:00"
dependency: "singleton"
task_name: "peft"
results_dir: ${base_results_dir}/peft_${.name}
Set the number of nodes and devices for fine-tuning:
trainer:
num_nodes: 1
devices: 8
model:
restore_from_path: /path/to/starcoder2.nemo
restore_from_path sets the path to the .nemo checkpoint used for fine-tuning.
peft_scheme sets the fine-tuning scheme to be used. Supported schemes include: lora, adapter, ia3, and ptuning.
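For example, to select LoRA in the launcher configuration, the model section of sft.yaml would contain something along these lines (a minimal sketch; only restore_from_path and peft_scheme are taken from this guide, and the peft nesting follows the standard NeMo fine-tuning config layout):
model:
  restore_from_path: /path/to/starcoder2.nemo
  peft:
    peft_scheme: "lora"  # one of: lora, adapter, ia3, ptuning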