Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Optimizing Models with Pruning

This section explains how to prune Large Language Models (LLMs) using the Minitron approach described in [Compact Language Models via Pruning and Knowledge Distillation] and [LLM Pruning and Distillation in Practice: The Minitron Approach]. The steps below use various scripts to prune a model and validate the result, ensuring the pruned model maintains the expected accuracy.

Drop Model Layers (Depth Pruning)

To trim the number of model layers by manually specifying the layers to be dropped, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=<tensor_model_parallel_size> \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<tensor_model_parallel_size * pipeline_model_parallel_size> \
  'prune.drop_layers=[1, 2, 3, 4]' \
  export.save_path=/path/to/save/trimmed_model.nemo
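
For example, dropping the last four layers of a hypothetical 32-layer model on a single node with 8 GPUs (tensor parallel 4, pipeline parallel 2) might look as follows; the paths, parallelism settings, and layer choice are illustrative only:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/base_model.nemo \
  model.tensor_model_parallel_size=4 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=8 \
  'prune.drop_layers=[29, 30, 31, 32]' \
  export.save_path=/models/base_model_depth_pruned.nemo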

Note

The layer indices start from 1.

To trim the number of model layers based on the [Block Influence] metric (cosine similarity between each layer's input and output activations), use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<pipeline_model_parallel_size> \
  prune.num_layers=<pruned_value> \
  export.save_path=/path/to/save/trimmed_model.nemo
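
For instance, to reduce a hypothetical 32-layer model to its 24 most influential layers, letting the script decide which layers to drop, the invocation might look like this; the paths and the pipeline parallel size of 2 are placeholders:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/base_model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=2 \
  prune.num_layers=24 \
  export.save_path=/models/base_model_depth_pruned.nemo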

Drop Model Width (Width Pruning)

To trim model width (attention heads, hidden size, FFN size) according to activation-based importance scores, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<pipeline_model_parallel_size> \
  prune.ffn_hidden_size=<pruned_value_or_null> \
  prune.num_attention_heads=<pruned_value_or_null> \
  prune.num_query_groups=<pruned_value_or_null> \
  prune.hidden_size=<pruned_value_or_null> \
  export.save_path=/path/to/save/trimmed_model.nemo
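
As a concrete illustration, the Minitron work width-prunes Llama 3.1 8B from hidden size 4096 and FFN size 14336 down to 3072 and 9216, respectively, while keeping all attention heads and query groups. A sketch of a corresponding invocation, with placeholder paths and parallelism settings, might be:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/llama3.1-8b.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=2 \
  prune.ffn_hidden_size=9216 \
  prune.num_attention_heads=null \
  prune.num_query_groups=null \
  prune.hidden_size=3072 \
  export.save_path=/models/llama3.1-8b-width-pruned.nemo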

Note

Due to a limitation in the width pruning API, both model.tensor_model_parallel_size and inference.batch_size must currently be set to 1.

Drop Model Depth and Width

To trim model depth (layers) based on cosine similarity and model width (attention heads, hidden size, FFN size) based on activation-based importance scores, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/path/to/model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=<pipeline_model_parallel_size> \
  prune.ffn_hidden_size=<pruned_value_or_null> \
  prune.num_attention_heads=<pruned_value_or_null> \
  prune.num_query_groups=<pruned_value_or_null> \
  prune.hidden_size=<pruned_value_or_null> \
  prune.num_layers=<pruned_value_or_null> \
  export.save_path=/path/to/save/trimmed_model.nemo
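
For example, a combined run that both narrows a hypothetical model's width and trims it to 24 layers in a single pass might look like this; all values are illustrative:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
  model.restore_from_path=/models/base_model.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=2 \
  +model.dist_ckpt_load_strictness=log_all \
  inference.batch_size=1 \
  trainer.num_nodes=1 \
  trainer.precision=bf16 \
  trainer.devices=2 \
  prune.ffn_hidden_size=9216 \
  prune.num_attention_heads=null \
  prune.num_query_groups=null \
  prune.hidden_size=3072 \
  prune.num_layers=24 \
  export.save_path=/models/base_model_depth_width_pruned.nemo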

Note

Due to a limitation in the width pruning API, both model.tensor_model_parallel_size and inference.batch_size must currently be set to 1.

Validate Trimmed Model

To compute validation loss on the trimmed model, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/path/to/folder/with/model/config \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=<limit_val_batches> \
  model.restore_from_path=/path/to/trimmed_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mock \
  model.data.data_prefix=[]

To use a specific dataset instead of a mock dataset, modify the model.data parameters as follows:

model.data.data_impl=mmap \
model.data.data_prefix=["path/to/datafile1", "path/to/datafile2"]
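
Putting the pieces together, a validation run over, say, 32 batches of a real dataset might look like this; the config directory, batch count, and data paths are placeholders:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/models/trimmed_model_config_dir \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=32 \
  model.restore_from_path=/models/trimmed_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mmap \
  'model.data.data_prefix=["/data/my_corpus_text_document"]'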

Validate Depth Pruning of Original Model on the Fly

To compute validation loss on the original model with specific layers dropped on the fly, use the following script:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/path/to/folder/with/model/config \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=<limit_val_batches> \
  model.restore_from_path=/path/to/original_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mock \
  model.data.data_prefix=[] \
  model.drop_layers=[1,2,3,4]
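
For example, to evaluate a hypothetical 32-layer model with its last four layers dropped, without writing a trimmed checkpoint first, the invocation might look like this; the paths and batch count are illustrative:

python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-path=/models/original_model_config_dir \
  --config-name=model_config.yaml \
  trainer.limit_val_batches=32 \
  model.restore_from_path=/models/original_model.nemo \
  model.skip_train=True \
  model.data.data_impl=mock \
  model.data.data_prefix=[] \
  model.drop_layers=[29,30,31,32]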