Supervised Fine-tuning (SFT)

Before proceeding, prepare the datasets as described in the Data Preparation for SFT and PEFT section.

Run SFT inside NeMo Container

Step 1: Start NeMo Container

If the container is not already running, start it with the following command:

docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.01.starcoder2 bash

Step 2: Run SFT

Set the environment variables, passing the paths to your train, validation, and test data files:

MODEL="YOUR PATH TO starcoder2.nemo"
TRAIN="[YOUR PATH TO python_code_instructions_18k_alpaca/train.jsonl]"
VALID="[YOUR PATH TO python_code_instructions_18k_alpaca/validation.jsonl]"
TEST="[YOUR PATH TO python_code_instructions_18k_alpaca/test.jsonl]"
VALID_NAMES="[python_code_instructions_18k_alpaca]"

Set the concat sampling probabilities. These depend on the number of files in the train set and on the share of the fine-tuning data you want to draw from each file. The probabilities must sum to 1.0. The following example sets the concat sampling probabilities for a train set with two jsonl files:

TRAIN="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"
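The two constraints above (one probability per train file, and a total of 1.0) can be checked before launching a job. The following is a minimal sketch, not part of NeMo: check_concat_probs is a hypothetical helper name, and it assumes the bracketed, comma-separated list format used in this guide.

```shell
# Hypothetical helper (not part of NeMo): checks a list such as "[0.3,0.7]"
# against a file list such as "[a.jsonl,b.jsonl]".
check_concat_probs() {
    files=$1
    probs=$2
    # Strip brackets and spaces, then count the comma-separated entries.
    n_files=$(printf '%s\n' "$files" | sed 's/[][ ]//g' | awk -F',' '{ print NF }')
    printf '%s\n' "$probs" | sed 's/[][ ]//g' | awk -F',' -v n="$n_files" '
        {
            sum = 0
            for (i = 1; i <= NF; i++) sum += $i
            diff = sum - 1.0
            if (diff < 0) diff = -diff
            if (NF != n) {
                print "count mismatch: " NF " probabilities for " n " files"
                exit 1
            }
            if (diff > 1e-6) {
                print "probabilities sum to " sum ", expected 1.0"
                exit 1
            }
            print "ok"
        }'
}

# The two-file example from this section.
check_concat_probs "[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]" "[0.3,0.7]"
```

The helper prints ok and returns 0 when both checks pass. Otherwise it prints the reason and returns a non-zero status, so it can gate a launch script.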

In our example we use one train file, so CONCAT_SAMPLING_PROBS="[1.0]". Set the tensor parallelism and pipeline parallelism values based on the model you are using:

CONCAT_SAMPLING_PROBS="[1.0]"
TP_SIZE=4
PP_SIZE=2
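As a sanity check on these values, the product of the tensor-parallel and pipeline-parallel sizes should divide the total number of GPUs, and the quotient is the data-parallel size. The sketch below assumes the 8 GPUs used by the command in this guide. data_parallel_size is a hypothetical helper name, not a NeMo utility.

```shell
# Hypothetical helper: prints the data-parallel size for a given GPU count,
# tensor-parallel size, and pipeline-parallel size.
data_parallel_size() {
    gpus=$1
    tp=$2
    pp=$3
    model_parallel=$((tp * pp))
    if [ $((gpus % model_parallel)) -ne 0 ]; then
        echo "error: $gpus GPUs is not divisible by TP*PP=$model_parallel" >&2
        return 1
    fi
    echo $((gpus / model_parallel))
}

TP_SIZE=4
PP_SIZE=2
# 8 GPUs is an assumption taken from trainer.devices=8 and trainer.num_nodes=1.
DP_SIZE=$(data_parallel_size 8 "$TP_SIZE" "$PP_SIZE")
echo "data-parallel size: $DP_SIZE"
```

With TP_SIZE=4 and PP_SIZE=2 on 8 GPUs, a single model replica spans all devices, so the data-parallel size is 1.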

Run the SFT command, setting parameters such as the number of steps, model checkpoint path, and batch sizes as appropriate. For a full reference of parameter settings, refer to the config file.

torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
   trainer.precision=bf16 \
   trainer.devices=8 \
   trainer.num_nodes=1 \
   trainer.val_check_interval=0.1 \
   trainer.max_steps=50 \
   model.restore_from_path=${MODEL} \
   model.peft.peft_scheme=none \
   model.micro_batch_size=1 \
   model.global_batch_size=32 \
   model.tensor_model_parallel_size=${TP_SIZE} \
   model.pipeline_model_parallel_size=${PP_SIZE} \
   model.megatron_amp_O2=True \
   model.sequence_parallel=True \
   model.activations_checkpoint_granularity=selective \
   model.activations_checkpoint_method=uniform \
   model.optim.name=distributed_fused_adam \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.train_ds.file_names=${TRAIN} \
   model.data.validation_ds.file_names=${VALID} \
   model.data.test_ds.file_names=${TEST} \
   model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
   model.data.train_ds.max_seq_length=4096 \
   model.data.validation_ds.max_seq_length=4096 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=32 \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=32 \
   model.data.test_ds.micro_batch_size=1 \
   model.data.test_ds.global_batch_size=256 \
   model.data.train_ds.num_workers=0 \
   model.data.validation_ds.num_workers=0 \
   model.data.test_ds.num_workers=0 \
   model.data.validation_ds.metric.name=loss \
   model.data.test_ds.metric.name=loss \
   exp_manager.create_wandb_logger=False \
   exp_manager.explicit_log_dir=/results \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss \
   exp_manager.checkpoint_callback_params.save_best_model=False \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   ++cluster_type=BCP \
   ++model.bias_activation_fusion=True \
   ++model.apply_rope_fusion=True \
   ++model.optim.overlap_grad_sync=True \
   ++model.optim.overlap_param_sync=True \
   ++model.optim.contiguous_grad_buffer=True \
   ++model.optim.grad_sync_dtype=bf16 \
   ++model.fp8=False \
   ++model.fp8_e4m3=False \
   ++model.fp8_hybrid=True \
   ++model.fp8_margin=0 \
   ++model.fp8_interval=1 \
   ++model.fp8_amax_history_len=128 \
   ++model.fp8_amax_compute_algo=max

Note: To run SFT on multiple nodes (for example, on a Slurm cluster), replace torchrun --nproc_per_node=8 with python.

Tuning with packed dataset: Enable training with packed sequences by adjusting the configs. Set the micro batch size to 1 and reduce the global batch size, since each packed sequence contains multiple examples. Here we set global_batch_size=8 and micro_batch_size=1 with sequence length 4096:

model.data.train_ds.file_names=/path/to/python_code_instructions_18k_alpaca/packed_4096_seed0.npy \
+model.data.train_ds.packed_sequence=True \
model.micro_batch_size=1 \
model.global_batch_size=8
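When changing batch sizes, it helps to confirm that the global batch size is a multiple of micro batch size times data-parallel size. The quotient is the number of micro-batches accumulated per optimizer step. The sketch below is illustrative only: grad_accum_steps is a hypothetical helper, and the data-parallel size of 1 assumes 8 GPUs with TP_SIZE=4 and PP_SIZE=2 as in this guide.

```shell
# Hypothetical helper: prints the number of micro-batches accumulated per
# optimizer step for a global batch size, micro batch size, and
# data-parallel size.
grad_accum_steps() {
    gbs=$1
    mbs=$2
    dp=$3
    per_pass=$((mbs * dp))
    if [ $((gbs % per_pass)) -ne 0 ]; then
        echo "error: global batch size $gbs is not divisible by $per_pass" >&2
        return 1
    fi
    echo $((gbs / per_pass))
}

# Data-parallel size 1 is an assumption (8 GPUs, TP=4, PP=2).
echo "unpacked: $(grad_accum_steps 32 1 1) micro-batches per step"
echo "packed:   $(grad_accum_steps 8 1 1) micro-batches per step"
```

Under these assumptions, the unpacked settings accumulate 32 micro-batches per step and the packed settings accumulate 8.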

Tuning with FP8: Enable training with FP8 by setting the following override:

++model.fp8=True

Step 3: Run evaluation

Run evaluation using megatron_gpt_generate.py.

Set the appropriate model checkpoint path, test file path, batch sizes, number of tokens to generate, and so on, then run evaluation on the test file:

PATH_TO_TRAINED_MODEL=/results/megatron_gpt_peft_none_tuning/checkpoints/megatron_gpt_peft_none_tuning.nemo
TEST_DS="[YOUR PATH TO test.jsonl]"
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=8 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=['python_code_instructions_18k_alpaca_test'] \
    model.data.test_ds.global_batch_size=16 \
    model.data.test_ds.micro_batch_size=2 \
    model.data.test_ds.tokens_to_generate=20 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=/results/sft_results \
    model.data.test_ds.write_predictions_to_file=True

Sample Output

$ tail -n 2 sft_results.jsonl

{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nCreate a python 3 script to generate a list of integers from 0 to 100.\n\n### Input:\n\n\n", "pred": " def generate_list():\n    return list(range(101))\n\nprint(generate", "label": " list_of_integers = [x for x in range(0, 101)]"}
{"input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite a Python program to compute the sum of items in a given list and multiply it by a given number.\n\n### Input:\n{'list': [1, 3, 5, 7], 'num': 3}\n\n", "pred": " def sum_list_multiply(list, num):\n    return sum(list) * num\n", "label": " #initialize variables\nlist = [1, 3, 5, 7]\nnum = 3\n\n# compute sum\nsum = 0\nfor i in list:\n    sum = sum + i\n\n# compute product\nresult = sum * num\n\n# Print result\nprint(\"Result: \", result)"}

Note: This is only a sample output (based on a toy SFT example), and your output may vary. Performance can be improved further by fine-tuning the model for more steps.
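Before reading individual predictions, a quick structural check of the output file can be useful. The sketch below counts the lines that look like records and how many of them carry all of the input, pred, and label keys seen in the sample output. summarize_predictions is a hypothetical helper that uses plain text matching rather than a JSON parser, and the demo records are made up.

```shell
# Hypothetical helper (not part of NeMo): reports how many lines of a
# predictions file look like records, and how many of those carry all of the
# "input", "pred", and "label" keys. Plain text matching, not a JSON parser.
summarize_predictions() {
    file=$1
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # helper usable under "set -e".
    total=$(grep -c '^{' "$file" || true)
    complete=$(grep '"input":' "$file" | grep '"pred":' | grep -c '"label":' || true)
    echo "records=$total complete=$complete"
}

# Demo on a throwaway file with two made-up records.
sample=$(mktemp)
printf '%s\n' \
    '{"input": "q1", "pred": " a1", "label": " a1"}' \
    '{"input": "q2", "pred": " a2", "label": " b2"}' > "$sample"
summarize_predictions "$sample"
rm -f "$sample"
```

To use it on real output, pass the path of the predictions file, for example the sft_results.jsonl file shown above.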

Run SFT with NeMo Launcher

Refer to the Launcher Guide section for the NeMo Launcher basics. To run SFT, update conf/config.yaml:

defaults:
  - peft: starcoder2/sft

stages:
  - peft

Execute the launcher pipeline: python3 main.py

Configuration

Default configurations for PEFT can be found in conf/peft/starcoder2/sft.yaml. The fine-tuning configuration is divided into four sections: run, trainer, exp_manager, and model.

run:
  name: sft_starcoder2
  time_limit: "04:00:00"
  dependency: "singleton"
  model_train_name: starcoder2
  convert_dir: ${base_results_dir}/${peft.run.model_train_name}/${peft.run.convert_name}
  task_name: "sft"
  results_dir: ${base_results_dir}/${.model_train_name}/peft_${.task_name}

Set the number of nodes and devices for fine-tuning:

trainer:
  num_nodes: 1
  devices: 8

model:
  restore_from_path: /path/to/starcoder2.nemo

restore_from_path sets the path to the .nemo checkpoint used for fine-tuning.