Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Optimizing Models with Pruning

This section explains how to prune Large Language Models (LLMs) using the Minitron approach described in [Compact Language Models via Pruning and Knowledge Distillation] and [LLM Pruning and Distillation in Practice: The Minitron Approach]. The steps below use various scripts to prune a model and validate the result, ensuring the pruned model maintains the expected accuracy.

Drop Model Layers (Depth Pruning)

To trim the number of model layers by manually specifying the layers to be dropped, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=<tensor_model_parallel_size> \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<tensor_model_parallel_size * pipeline_model_parallel_size> \
  'prune.drop_layers=[1, 2, 3, 4]' \
  export.save_path=/path/to/save/trimmed_model.nemo
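
For example, dropping the last four layers of a hypothetical 32-layer model on a single node with 8 GPUs (tensor parallel 4, pipeline parallel 2) might look as follows; the paths, parallelism settings, and layer choice are illustrative only:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/base_model.nemo \
  model.tensor_model_parallel_size=4 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=8 \
  'prune.drop_layers=[29, 30, 31, 32]' \
  export.save_path=/models/base_model_depth_pruned.nemo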

Note

The layer indices start from 1.

To trim the number of model layers based on the [Block Influence] metric (cosine similarity between each layer's input and output activations), use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<pipeline_model_parallel_size> \
  prune.num_layers=<pruned_value> \
  export.save_path=/path/to/save/trimmed_model.nemo
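
For instance, to reduce a hypothetical 32-layer model to its 24 most influential layers, letting the script decide which layers to drop, the invocation might look like this; the paths and the pipeline parallel size of 2 are placeholders:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/base_model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=2 \
  prune.num_layers=24 \
  export.save_path=/models/base_model_depth_pruned.nemo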

Drop Model Width (Width Pruning)

To trim model width (attention heads, hidden size, FFN size) according to activation-based importance scores, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<pipeline_model_parallel_size> \
  prune.ffn_hidden_size=<pruned_value_or_null> \
  prune.num_attention_heads=<pruned_value_or_null> \
  prune.num_query_groups=<pruned_value_or_null> \
  prune.hidden_size=<pruned_value_or_null> \
  export.save_path=/path/to/save/trimmed_model.nemo
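
As a concrete illustration, the Minitron work width-prunes Llama 3.1 8B from hidden size 4096 and FFN size 14336 down to 3072 and 9216, respectively, while keeping all attention heads and query groups. A sketch of a corresponding invocation, with placeholder paths and parallelism settings, might be:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/llama3.1-8b.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=2 \
  prune.ffn_hidden_size=9216 \
  prune.num_attention_heads=null \
  prune.num_query_groups=null \
  prune.hidden_size=3072 \
  export.save_path=/models/llama3.1-8b-width-pruned.nemo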

Note

Due to a limitation in the width pruning API, both model.tensor_model_parallel_size and inference.batch_size must currently be set to 1.

Drop Model Depth and Width

To trim model depth (layers) based on cosine similarity and model width (attention heads, hidden size, FFN size) based on activation-based importance scores, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<pipeline_model_parallel_size> \
  prune.ffn_hidden_size=<pruned_value_or_null> \
  prune.num_attention_heads=<pruned_value_or_null> \
  prune.num_query_groups=<pruned_value_or_null> \
  prune.hidden_size=<pruned_value_or_null> \
  prune.num_layers=<pruned_value_or_null> \
  export.save_path=/path/to/save/trimmed_model.nemo
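
For example, a combined run that both narrows a hypothetical model's width and trims it to 24 layers in a single pass might look like this; all values are illustrative:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/base_model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=2 \
  prune.ffn_hidden_size=9216 \
  prune.num_attention_heads=null \
  prune.num_query_groups=null \
  prune.hidden_size=3072 \
  prune.num_layers=24 \
  export.save_path=/models/base_model_depth_width_pruned.nemo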

Note

Due to a limitation in the width pruning API, both model.tensor_model_parallel_size and inference.batch_size must currently be set to 1.

Validate Trimmed Model

To compute validation loss on the trimmed model, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/path/to/folder/with/model/config \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=<limit_val_batches> \
  model.restore_from_path=/path/to/trimmed_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mock \
  model.data.data_prefix=[]

To use a specific dataset instead of a mock dataset, modify the model.data parameters as follows:

model.data.data_impl=mmap \
model.data.data_prefix=["path/to/datafile1", "path/to/datafile2"]
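
Putting the pieces together, a validation run over, say, 32 batches of a real dataset might look like this; the config directory, batch count, and data paths are placeholders:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/models/trimmed_model_config_dir \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=32 \
  model.restore_from_path=/models/trimmed_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mmap \
  'model.data.data_prefix=["/data/my_corpus_text_document"]'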

Validate Depth Pruning of Original Model on the Fly

To compute validation loss on the original model with specific layers dropped on the fly, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/path/to/folder/with/model/config \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=<limit_val_batches> \
  model.restore_from_path=/path/to/original_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mock \
  model.data.data_prefix=[] \
  model.drop_layers=[1,2,3,4]
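
For example, to evaluate a hypothetical 32-layer model with its last four layers dropped, without writing a trimmed checkpoint first, the invocation might look like this; the paths and batch count are illustrative:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/models/original_model_config_dir \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=32 \
  model.restore_from_path=/models/original_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mock \
  model.data.data_prefix=[] \
  model.drop_layers=[29,30,31,32]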