Pruning#
This section explains how to prune GPT-based Large Language Models (LLMs) such as Llama3.1 8B or Mistral NeMo 12B using the Minitron approach, described in Compact Language Models via Pruning and Knowledge Distillation and LLM Pruning and Distillation in Practice: The Minitron Approach. Check out the blog post for more details on the approach and the results.
NeMo offers pruning across various dimensions of the model, such as depth (layers) and width (embedding hidden size, FFN hidden size, attention heads, attention query groups). Pruning can be used to train a smaller draft model for speculative decoding or to accelerate an existing model by removing less important parameters. These pruning features are powered by the NVIDIA TensorRT Model Optimizer library.
Drop Model Layers (Depth Pruning)#
To trim the number of model layers based on the Block Influence metric (cosine similarity), use the following script:
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 1 \
--pp_size 8 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--seq_length 8192 \
--data_paths 1.0 path/to/tokenized/data \
--index_mapping_dir path/to/index_mapping_dir \
--prune_num_layers 16 \
--save_path llama3.1-8b-depth-pruned
Note
--tp_size must be 1 due to a limitation in the current pruning API.
To trim model layers by manually specifying the layers to be dropped, use the following script:
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 8 \
--pp_size 1 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--drop_layers 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 \
--save_path llama3.1-8b-dropped-layers
Note
The layer indices start from 1.
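For example, to drop only the last few layers instead, list their 1-based indices explicitly. This is a minimal sketch assuming the 32 decoder layers of Llama3.1 8B (indices 1-32); the save path is illustrative:
# Sketch: drops the last 4 of 32 layers (1-based indices); save path is illustrative
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 8 \
--pp_size 1 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--drop_layers 29 30 31 32 \
--save_path llama3.1-8b-last-layers-dropped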
Drop Model Width (Width Pruning)#
To trim model width (embedding hidden size, FFN hidden size, attention heads, attention query groups) according to activation-based importance scores, use the following script:
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 1 \
--pp_size 8 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--seq_length 8192 \
--data_paths 1.0 path/to/tokenized/data \
--index_mapping_dir path/to/index_mapping_dir \
--prune_ffn_hidden_size 9216 \
--prune_hidden_size 3072 \
--prune_num_attention_heads 32 \
--prune_num_query_groups 8 \
--save_path llama3.1-8b-width-pruned
Note
--tp_size must be 1 due to a limitation in the current pruning API.
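The width flags can also be passed selectively. The following is a sketch that prunes only the FFN hidden size, assuming dimensions you do not specify keep their original sizes:
# Sketch: prunes only the FFN hidden size; assumes unspecified width dimensions are left unchanged
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 1 \
--pp_size 8 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--seq_length 8192 \
--data_paths 1.0 path/to/tokenized/data \
--index_mapping_dir path/to/index_mapping_dir \
--prune_ffn_hidden_size 9216 \
--save_path llama3.1-8b-ffn-pruned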
Drop Model Depth and Width#
To trim model depth (layers) based on cosine similarity and width (embedding hidden size, FFN hidden size, attention heads, attention query groups) according to activation-based importance scores, use the following script:
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 1 \
--pp_size 8 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--seq_length 8192 \
--data_paths 30 path/to/dataset_1_prefix 70 path/to/dataset_2_prefix \
--index_mapping_dir path/to/index_mapping_dir \
--prune_ffn_hidden_size 9216 \
--prune_hidden_size 3072 \
--prune_num_attention_heads 32 \
--prune_num_query_groups 8 \
--prune_num_layers 16 \
--save_path llama3.1-8b-width-depth-pruned
Note
--tp_size must be 1 due to a limitation in the current pruning API.
Tips for Pruning#
- Pruning should always be followed by knowledge distillation to ensure the pruned model can still match the performance of the full model. Please refer to the Distillation section for more details.
- If you want to use fewer than 8 GPUs, such as 2, set --nproc_per_node 2, --devices 2, and --pp_size 2 (see the example after this list).
- If you do not pass --data_paths and --index_mapping_dir, the script will use mock data for calibration. This will result in a randomly pruned model, which is useful for testing the pruning pipeline.
- By default, pruning runs forward passes on 1024 samples to calibrate the importance scores. If you want to speed up the calibration process, you can pass --num_train_samples XYZ to use fewer samples, which may lead to a slightly worse pruned model.
- If you have enough memory per GPU or enough GPUs, you can increase the micro batch size or pipeline parallel size (defaults to 1) via --mbs XYZ and --pp_size XYZ.
- Pruning Llama3.1 8B on 1x 80GB H100 GPU uses ~32 GB of GPU memory and takes ~30 minutes for 1024 samples.
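As an example of combining these tips, a 2-GPU depth-pruning run with a smaller calibration set might look like the sketch below (the 256-sample count and save path are illustrative):
# Sketch: 2-GPU depth pruning with fewer calibration samples (values are illustrative)
torchrun --nproc_per_node 2 scripts/llm/gpt_prune.py \
--devices 2 \
--tp_size 1 \
--pp_size 2 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--seq_length 8192 \
--data_paths 1.0 path/to/tokenized/data \
--index_mapping_dir path/to/index_mapping_dir \
--num_train_samples 256 \
--prune_num_layers 16 \
--save_path llama3.1-8b-depth-pruned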
Troubleshooting#
If you encounter an error like the following:
RuntimeError: Error(s) in loading state_dict for GPTModel:
Missing key(s) in state_dict: "decoder.final_layernorm._extra_state"
This is likely due to loading a checkpoint from a previous version of TransformerEngine and can be safely ignored. Please add the flag --legacy_ckpt to suppress the error.
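For example, the depth-pruning command from above with the flag added:
# Same depth-pruning command, with --legacy_ckpt added for checkpoints from older TransformerEngine versions
torchrun --nproc_per_node 8 scripts/llm/gpt_prune.py \
--devices 8 \
--tp_size 1 \
--pp_size 8 \
--restore_path <path/to/llama3.1-8b-nemo2> \
--legacy_ckpt \
--seq_length 8192 \
--data_paths 1.0 path/to/tokenized/data \
--index_mapping_dir path/to/index_mapping_dir \
--prune_num_layers 16 \
--save_path llama3.1-8b-depth-pruned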