Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Supervised Fine-tuning (SFT)

Before proceeding, prepare the datasets as described in the Data Preparation for SFT and PEFT section.

Run SFT inside NeMo Container

Run SFT

Set the environment variables to the paths of your model checkpoint and your train, validation, and test data files:

MODEL="YOUR PATH TO griffin-2b.nemo"
TRAIN="[YOUR PATH TO squad/train.jsonl]"
VALID="[YOUR PATH TO squad/validation.jsonl]"
TEST="[YOUR PATH TO squad/test.jsonl]"
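
Before launching, it can help to confirm that each data file parses as JSON Lines. The following Python sketch is not part of NeMo: the helper name check_jsonl is hypothetical, and the required keys ("input" and "output") are an assumption about what the data preparation step writes, so adjust them to match your files.

```python
import json
from pathlib import Path


def check_jsonl(path, required_keys=("input", "output")):
    """Return the number of records in a JSONL file; raise ValueError on a bad line.

    required_keys is an assumption about the SFT record layout. Change it to
    match the fields your data preparation step actually writes.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            missing = [key for key in required_keys if key not in record]
            if missing:
                raise ValueError(f"{path}:{line_number}: missing keys {missing}")
            count += 1
    return count
```

Calling check_jsonl on each of the train, validation, and test paths returns the record count for a well-formed file and raises an error that names the first offending line otherwise.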

Set the concat sampling probabilities. The values depend on the number of files in the train set and on the fraction of the fine-tuning data you want to draw from each file. The probabilities must sum to 1.0. The following example sets the concat sampling probabilities for a train set with two jsonl files:

TRAIN="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"
CONCAT_SAMPLING_PROBS="[0.3,0.7]"

This example uses a single train file, so CONCAT_SAMPLING_PROBS="[1.0]".
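
The two constraints on these values (one probability per train file, and a total of 1.0) can be checked before launching. This is a minimal Python sketch rather than anything NeMo provides; both function names are hypothetical, and the parsing only handles the simple bracketed, comma-separated form used in this guide.

```python
import math


def parse_bracket_list(text):
    """Split a bracketed list string such as "[a,b]" into its items."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError(f"expected a bracketed list, got {text!r}")
    return [item.strip() for item in inner[1:-1].split(",") if item.strip()]


def check_sampling_probs(train, probs):
    """Pair each train file with its probability, validating count and sum."""
    files = parse_bracket_list(train)
    values = [float(item) for item in parse_bracket_list(probs)]
    if len(files) != len(values):
        raise ValueError(f"{len(files)} files but {len(values)} probabilities")
    if any(value < 0 for value in values):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(values), 1.0, abs_tol=1e-6):
        raise ValueError(f"probabilities sum to {sum(values)}, expected 1.0")
    return dict(zip(files, values))


mapping = check_sampling_probs(
    "[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]",
    "[0.3,0.7]",
)
```

Here mapping pairs each file with its sampling probability. Passing "[0.3,0.6]" instead would raise a ValueError because the values do not sum to 1.0.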

Run the command, setting parameters such as the number of steps, model checkpoint path, and batch sizes as appropriate. For a full reference of the parameter settings, refer to the config file.

torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/megatron_griffin_finetuning.py \
   trainer.precision=bf16 \
   trainer.devices=8 \
   trainer.accelerator=gpu \
   trainer.log_every_n_steps=1 \
   trainer.val_check_interval=50 \
   trainer.limit_val_batches=128 \
   +trainer.num_sanity_val_steps=0 \
   +trainer.accumulate_grad_batches=1 \
   trainer.max_steps=600 \
   trainer.gradient_clip_val=1.0 \
   model.peft.peft_scheme=null \
   model.megatron_amp_O2=True \
   model.encoder_seq_length=2048 \
   model.data.validation_ds.pad_to_max_length=True \
   model.data.train_ds.pad_to_max_length=True \
   model.optim.name="distributed_fused_adam" \
   +model.gradient_accumulation_fusion=True \
   +model.optim.bucket_cap_mb=400 \
   +model.optim.overlap_grad_sync=True \
   +model.optim.overlap_param_sync=True \
   +model.optim.contiguous_grad_buffer=True \
   +model.optim.contiguous_param_buffer=True \
   model.activations_checkpoint_recurrent='recurrent' \
   model.data.train_ds.max_seq_length=2048 \
   model.data.validation_ds.max_seq_length=2048 \
   model.mcore_gpt=True \
   model.micro_batch_size=4 \
   model.global_batch_size=128 \
   model.restore_from_path=${MODEL} \
   model.data.train_ds.file_names=${TRAIN} \
   model.data.validation_ds.file_names=${VALID} \
   model.data.test_ds.file_names=${TEST} \
   model.optim.lr=5e-6
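
The batch settings in the command are related. In Megatron-style training, the global batch size equals the micro batch size times the data-parallel size times the number of micro-batches each data-parallel rank processes per optimizer step. The Python sketch below works through that arithmetic for the values above. It assumes tensor and pipeline parallel sizes of 1, which the command does not set explicitly, and the function name is illustrative.

```python
def micro_batches_per_step(global_batch_size, micro_batch_size, devices,
                           tensor_parallel=1, pipeline_parallel=1):
    """Micro-batches each data-parallel rank processes per optimizer step.

    tensor_parallel and pipeline_parallel default to 1, which is an
    assumption; set them to match your model configuration.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if devices % model_parallel != 0:
        raise ValueError("devices must be divisible by the model-parallel size")
    data_parallel = devices // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_pass


# Values from the command above: 8 devices, micro batch 4, global batch 128.
steps = micro_batches_per_step(128, 4, 8)  # 128 / (4 * 8) = 4

# Samples consumed over the whole run: max_steps * global_batch_size.
samples_seen = 600 * 128  # 76800
```

If you change trainer.devices or model.micro_batch_size, keep model.global_batch_size divisible by their product (under the same parallelism assumption).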

Run evaluation

Run evaluation using megatron_griffin_generate.py

Set the model checkpoint path, test file path, batch sizes, number of tokens to generate, and so on, then run evaluation on the test file:

PATH_TO_TRAINED_MODEL="PATH TO THE TRAINED MODEL"
TEST_DS="[YOUR PATH TO test.jsonl]"
SAVE_DIR="PATH TO SAVING DIRECTORY"

 torchrun --nproc_per_node=8 /opt/NeMo/examples/nlp/language_modeling/megatron_griffin_generate.py \
      trainer.devices=8 \
      trainer.precision=bf16 \
      trainer.accelerator=gpu \
      trainer.log_every_n_steps=1 \
      trainer.val_check_interval=10 \
      trainer.limit_val_batches=20 \
      trainer.max_steps=1000 \
      trainer.gradient_clip_val=1.0 \
      exp_manager.exp_dir=${SAVE_DIR} \
      model.megatron_amp_O2=True \
      model.micro_batch_size=16 \
      model.global_batch_size=128 \
      model.restore_from_path=${PATH_TO_TRAINED_MODEL} \
      model.peft.restore_from_path=False \
      +model.peft.restore_from_ckpt.checkpoint_dir=False \
      +model.peft.restore_from_ckpt.checkpoint_name=False \
      model.data.test_ds.file_names=${TEST_DS} \
      model.data.test_ds.names=["DATASET_NAME"] \
      model.data.test_ds.global_batch_size=128 \
      model.data.test_ds.micro_batch_size=16 \
      model.data.test_ds.tokens_to_generate=30 \
      model.answer_only_loss=True \
      inference.greedy=True \
      exp_manager.checkpoint_callback_params.monitor=validation_loss \
      ++inference.verbose=True \
      model.data.test_ds.write_predictions_to_file=True \
      model.data.test_ds.output_file_path_prefix=${SAVE_DIR}/eval
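
With write_predictions_to_file enabled, predictions are written under the ${SAVE_DIR}/eval prefix. This guide does not state the exact file name or record layout, so the following Python sketch rests on assumptions: that the output is a JSONL file, and that each record carries the prediction and the reference answer under the keys "pred" and "label". Inspect one line of your output file and adjust the key names before relying on the score. The function name is hypothetical.

```python
import json


def exact_match_rate(path, pred_key="pred", label_key="label"):
    """Fraction of records whose prediction equals the label after trimming.

    pred_key and label_key are assumed field names, not confirmed by this
    guide; change them to match your predictions file.
    """
    total = 0
    matches = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            if str(record[pred_key]).strip() == str(record[label_key]).strip():
                matches += 1
    if total == 0:
        raise ValueError(f"{path} contains no records")
    return matches / total
```

Exact string match is a deliberately strict score. For SQuAD-style answers you may prefer a normalized comparison (lowercasing, stripping punctuation) or a token-level F1.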

Run SFT with NeMo Launcher

Refer to the Launcher Guide section for NeMo Launcher basics. To run SFT, update conf/config.yaml:

defaults:
  - peft: griffin/sft

stages:
  - peft

Execute the launcher pipeline: python3 main.py.

Configuration

The default configuration for SFT on SQuAD is in conf/peft/griffin/sft.yaml. The fine-tuning configuration is divided into four sections: run, trainer, exp_manager, and model.

run:
  name: griffin_2b
  time_limit: "04:00:00"
  dependency: "singleton"
  results_dir: ${base_results_dir}/sft_${.name}

Set the number of devices for fine-tuning:

trainer:
  num_nodes: 1
  devices: 8

model:
  restore_from_path: /path/to/griffin-2b.nemo

restore_from_path is the path to the .nemo checkpoint to fine-tune.