Pruning#

This section explains how to prune GPT-based Large Language Models (LLMs) such as Llama3.1 8B or Mistral NeMo 12B using the Minitron approach described in [Compact Language Models via Pruning and Knowledge Distillation] and [LLM Pruning and Distillation in Practice: The Minitron Approach]. See the blog post for more details on the approach and its results.

NeMo offers pruning across various dimensions of the model, such as depth (layers) and width (embedding hidden size, FFN hidden size, attention heads, attention query groups). Pruning can be used to train a smaller draft model for speculative decoding or to accelerate an existing model by removing less important parameters. These pruning features are powered by the NVIDIA TensorRT Model Optimizer library.

Use with torchrun or Slurm#

For more fine-grained control over the pruning process, you can use the pruning script with torchrun or Slurm as follows:

Drop Model Layers (Depth Pruning)#

To trim the number of model layers based on the [Block Influence] metric (cosine similarity), use the following script:

torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
    --devices 8 \
    --tp_size 1 \
    --pp_size 8 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --seq_length 8192 \
    --data_paths 1.0 path/to/tokenized/data \
    --index_mapping_dir path/to/index_mapping_dir \
    --target_num_layers 16 \
    --save_path llama3.1-8b-depth-pruned

Note

--tp_size must be 1 because of the current prune API limitation.

To trim model layers by manually specifying the layers to be dropped, use the following script:

torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
    --devices 8 \
    --tp_size 8 \
    --pp_size 1 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --drop_layers 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 \
    --save_path llama3.1-8b-dropped-layers

Note

The layer indices start from 1.

Drop Model Width (Width Pruning)#

To trim model width (embedding hidden size, FFN hidden size, attention heads, attention query groups) according to activation-based importance scores, use the following script:

torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
    --devices 8 \
    --tp_size 1 \
    --pp_size 8 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --seq_length 8192 \
    --data_paths 1.0 path/to/tokenized/data \
    --index_mapping_dir path/to/index_mapping_dir \
    --target_ffn_hidden_size 9216 \
    --target_hidden_size 3072 \
    --target_num_attention_heads 32 \
    --target_num_query_groups 8 \
    --save_path llama3.1-8b-width-pruned

Note

--tp_size must be 1 because of the current prune API limitation. --target_ffn_hidden_size and --target_hidden_size should be multiples of 64.

Drop Model Depth and Width#

To trim model depth (layers) based on cosine similarity and model width (embedding hidden size, FFN hidden size, attention heads, attention query groups) based on activation-based importance scores, use the following script:

torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
    --devices 8 \
    --tp_size 1 \
    --pp_size 8 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --seq_length 8192 \
    --data_paths 30 path/to/dataset_1_prefix 70 path/to/dataset_2_prefix \
    --index_mapping_dir path/to/index_mapping_dir \
    --target_ffn_hidden_size 9216 \
    --target_hidden_size 3072 \
    --target_num_attention_heads 32 \
    --target_num_query_groups 8 \
    --target_num_layers 16 \
    --save_path llama3.1-8b-width-depth-pruned

Note

--tp_size must be 1 because of the current prune API limitation. --target_ffn_hidden_size and --target_hidden_size should be multiples of 64.
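
The same commands can also be launched through Slurm. Below is a minimal single-node sbatch sketch wrapping the depth-pruning command above; the SBATCH directives (job name, GPU count, time limit) are placeholders that you will need to adapt to your cluster and container setup:

#!/bin/bash
#SBATCH --job-name=gpt_prune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00

# Launch the depth-pruning command inside the allocation; torchrun spawns one process per GPU.
srun torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
    --devices 8 \
    --tp_size 1 \
    --pp_size 8 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --seq_length 8192 \
    --data_paths 1.0 path/to/tokenized/data \
    --index_mapping_dir path/to/index_mapping_dir \
    --target_num_layers 16 \
    --save_path llama3.1-8b-depth-pruned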

Use NeMo-Run Recipes#

Note

Prerequisite: Before proceeding, follow the example in Quickstart with NeMo-Run to familiarize yourself with NeMo-Run.

import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm.modelopt.recipes import prune_recipe

recipe = prune_recipe(
    nemo_checkpoint="/path/to/llama3.1-8b/nemo-ckpt/",
    save_path="/path/to/pruned/llama3.1-8b/nemo-ckpt/",
)

# Override the configuration with desired components:
recipe.devices = 8
recipe.pp_size = 8
recipe.data = run.Config(
    llm.PreTrainingDataModule,
    paths=["1.0", "path/to/tokenized/data"],
    seq_length=8192,
    micro_batch_size=1,
    global_batch_size=1,  # should be equal to micro_batch_size
)
recipe.pruning_config.target_ffn_hidden_size = 9216
recipe.pruning_config.target_hidden_size = 3072
...

run.run(recipe)
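
Alternatively, you can pass an executor to run.run to launch the recipe on multiple GPUs locally. This is a minimal sketch assuming NeMo-Run's LocalExecutor with the torchrun launcher; adjust ntasks_per_node to match the number of devices configured above:

# Launch with 8 local processes via torchrun (one per GPU).
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
run.run(recipe, executor=executor)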

Tips for Pruning#

  • Pruning should always be followed by knowledge distillation to ensure the pruned model can still match the performance of the full model. Please refer to the Distillation section for more details.

  • If you want to use fewer than 8 GPUs, such as 2, set --nproc_per_node 2, --devices 2, and --pp_size 2, as shown in the sketch after this list.

  • If you do not pass --data_paths and --index_mapping_dir, the script will use mock data for calibration. This will result in a randomly pruned model, which is useful for testing the pruning pipeline.

  • By default, pruning will run forward passes on 1024 samples to calibrate the importance scores. If you want to speed up the calibration process, you can pass --num_train_samples XYZ to use fewer samples, which may lead to a slightly worse pruned model.

  • If you have enough memory per GPU or enough GPUs, you can increase the micro batch size or pipeline parallel size (both default to 1) via --mbs XYZ and --pp_size XYZ.

  • Pruning Llama3.1 8B with 1x 80GB H100 GPU uses ~32GB of GPU memory and takes ~30 minutes for 1024 samples.
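
As an example of the fewer-GPU setup mentioned in the tips above, the depth-pruning command can be launched on 2 GPUs as follows (the paths are the same placeholders as before; this is a sketch, not the only valid parallelism split):

torchrun --nproc_per_node 2 scripts/llm/gpt_prune.py \
    --devices 2 \
    --tp_size 1 \
    --pp_size 2 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --seq_length 8192 \
    --data_paths 1.0 path/to/tokenized/data \
    --index_mapping_dir path/to/index_mapping_dir \
    --target_num_layers 16 \
    --save_path llama3.1-8b-depth-pruned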

Troubleshooting#

If you encounter an error like the following:

RuntimeError: Error(s) in loading state_dict for GPTModel:
    Missing key(s) in state_dict: "decoder.final_layernorm._extra_state"

This is likely due to loading a checkpoint saved with a previous version of TransformerEngine and can be safely ignored. Add the --legacy_ckpt flag to suppress the error.
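
For example, this is the depth-pruning command shown earlier with the flag appended (paths are the same placeholders as before):

torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
    --devices 8 \
    --tp_size 1 \
    --pp_size 8 \
    --restore_path <path/to/llama3.1-8b-nemo2> \
    --seq_length 8192 \
    --data_paths 1.0 path/to/tokenized/data \
    --index_mapping_dir path/to/index_mapping_dir \
    --target_num_layers 16 \
    --legacy_ckpt \
    --save_path llama3.1-8b-depth-pruned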