Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Optimizing Models with Knowledge Distillation

Knowledge Distillation (KD) involves using information from an existing trained model to train a second (usually smaller, faster) model, thereby “distilling” knowledge from one to the other. The effectiveness of this approach has been showcased with the NVIDIA Minitron models [Compact Language Models via Pruning and Knowledge Distillation] [LLM Pruning and Distillation in Practice: The Minitron Approach]. Combined with pruning, state-of-the-art language models can be trained with up to 40x fewer tokens.

Distillation has two primary benefits: faster convergence and higher final accuracy than traditional training.

In NeMo, distillation is enabled by the NVIDIA TensorRT Model Optimizer (ModelOpt), a library for optimizing deep-learning models for inference on GPUs.

The logits-distillation process consists of the following steps:

  1. Load both the student and teacher model checkpoints (both must support the same parallelism strategy, if any).

  2. Train until convergence, running forward passes on both models (and backward passes only on the student) while computing a distillation loss between the two models' logits (a minimal sketch of this loss follows the list).

  3. Save the final student model.
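The core of step 2 is the loss computed between the student and teacher logits. Below is a minimal PyTorch sketch of that idea, assuming a temperature-scaled KL divergence between the two logit distributions and a frozen teacher; it is for illustration only and does not reproduce the exact loss or training loop used by NeMo and ModelOpt. All names and tensor shapes (student, teacher, tokens, [batch, sequence, vocab]) are placeholders.

import torch
import torch.nn.functional as F

def logits_kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # Temperature-scaled KL divergence between student and teacher logits.
    # Both tensors are assumed to have shape [batch, sequence, vocab].
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten batch and sequence dims so "batchmean" averages over all tokens,
    # and scale by T^2 to keep gradient magnitudes comparable across temperatures.
    s = s.reshape(-1, s.size(-1))
    t = t.reshape(-1, t.size(-1))
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

def distillation_step(student, teacher, optimizer, tokens, temperature: float = 1.0):
    teacher.eval()
    with torch.no_grad():                 # forward pass only on the teacher
        teacher_logits = teacher(tokens)
    student_logits = student(tokens)      # forward pass on the student
    loss = logits_kd_loss(student_logits, teacher_logits, temperature)
    loss.backward()                       # backward pass only on the student
    optimizer.step()
    optimizer.zero_grad()
    return loss

In NeMo, this loss and training loop are handled by the distillation script and ModelOpt; the sketch only illustrates why forward passes run on both models while gradients flow through the student alone.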

Example

The example below shows how to run the distillation script with Llama 3.1 models.

The script must be launched with the number of processes equal to the tensor parallelism size. This is achieved with the torchrun command below:

STUDENT_CKPT="path/to/student.nemo"  # can also be None (will use default architecture found in examples/nlp/language_modeling/conf/megatron_llama_distill.yaml)
TEACHER_CKPT="path/to/teacher.nemo"
TOKENIZER="path/to/tokenizer.model"
DATA_PATHS="[1.0,path/to/tokenized/data]"
FINAL_SAVE_FILE="final_checkpoint.nemo"
TP=4

NPROC=$TP
launch_config="torchrun --nproc_per_node=$NPROC"

${launch_config} /NeMo/examples/nlp/language_modeling/megatron_gpt_distillation.py \
    model.restore_from_path=$STUDENT_CKPT \
    model.kd_teacher_restore_from_path=$TEACHER_CKPT \
    model.tensor_model_parallel_size=$TP \
    model.tokenizer.model=$TOKENIZER \
    model.data.data_prefix=$DATA_PATHS \
    model.nemo_path=$FINAL_SAVE_FILE \
    trainer.precision=bf16 \
    trainer.devices=$NPROC

For large models, the command can be run in a multi-node setting, for example with the NeMo Framework Launcher using Slurm.

Limitations

  • Only Megatron Core-based GPT models are supported. Hugging Face models can be converted to NeMo using checkpoint converters, distilled, and then converted back to Hugging Face format.

  • Only logit-pair distillation is supported for now.

  • Pipeline parallelism is not yet supported.

  • Fully Sharded Data Parallel (FSDP) strategy is not yet supported.