Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Optimizing Models with Pruning#
This section explains how to prune Large Language Models (LLMs) using the Minitron approach described in [Compact Language Models via Pruning and Knowledge Distillation] and [LLM Pruning and Distillation in Practice: The Minitron Approach]. The steps below use various scripts to prune a model and validate the changes, ensuring that the pruned model maintains the expected accuracy.
Drop Model Layers (Depth Pruning)#
To trim the number of model layers by manually specifying the layers to be dropped, use the following script:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=<tensor_model_parallel_size> \
model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
+model.dist_ckpt_load_strictness=log_all \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=<tensor_model_parallel_size * pipeline_model_parallel_size> \
'prune.drop_layers=[1, 2, 3, 4]' \
export.save_path=/path/to/save/trimmed_model.nemo
Note
The layer indices start from 1.
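For example, assuming a 32-layer source model loaded with tensor parallelism 4 and pipeline parallelism 1 (so trainer.devices=4), the following illustrative invocation drops the last four layers; all paths and parallelism values are placeholders to adapt to your setup:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=4 \
model.pipeline_model_parallel_size=1 \
+model.dist_ckpt_load_strictness=log_all \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=4 \
'prune.drop_layers=[29, 30, 31, 32]' \
export.save_path=/path/to/save/trimmed_model.nemo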
To trim the number of model layers based on the [Block Influence] metric (cosine similarity), use the following script:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
+model.dist_ckpt_load_strictness=log_all \
inference.batch_size=1 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=<pipeline_model_parallel_size> \
prune.num_layers=<pruned_value> \
export.save_path=/path/to/save/trimmed_model.nemo
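For example, the following illustrative invocation prunes the model down to 24 layers, dropping the layers with the lowest Block Influence scores (the layer count, parallelism, and paths are placeholders):
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=2 \
+model.dist_ckpt_load_strictness=log_all \
inference.batch_size=1 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=2 \
prune.num_layers=24 \
export.save_path=/path/to/save/trimmed_model.nemo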
Drop Model Width (Width Pruning)#
To trim model width (attention heads, query groups, hidden size, and FFN hidden size) according to activation-based importance scores, use the following script:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
+model.dist_ckpt_load_strictness=log_all \
inference.batch_size=1 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=<pipeline_model_parallel_size> \
prune.ffn_hidden_size=<pruned_value_or_null> \
prune.num_attention_heads=<pruned_value_or_null> \
prune.num_query_groups=<pruned_value_or_null> \
prune.hidden_size=<pruned_value_or_null> \
export.save_path=/path/to/save/trimmed_model.nemo
Note
Due to a limitation in the width pruning API, both model.tensor_model_parallel_size and inference.batch_size must currently be set to 1.
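An axis left at null is not pruned. For example, the following illustrative invocation prunes only the FFN and hidden dimensions, leaving the attention axes untouched; the target sizes are placeholders and must be chosen to suit the source model:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=2 \
+model.dist_ckpt_load_strictness=log_all \
inference.batch_size=1 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=2 \
prune.ffn_hidden_size=9216 \
prune.num_attention_heads=null \
prune.num_query_groups=null \
prune.hidden_size=3072 \
export.save_path=/path/to/save/trimmed_model.nemo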
Drop Model Depth and Width#
To trim model depth (layers, based on cosine similarity) and width (attention heads, query groups, hidden size, and FFN hidden size, based on activation-based importance scores) in a single pass, use the following script:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=<pipeline_model_parallel_size> \
+model.dist_ckpt_load_strictness=log_all \
inference.batch_size=1 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=<pipeline_model_parallel_size> \
prune.ffn_hidden_size=<pruned_value_or_null> \
prune.num_attention_heads=<pruned_value_or_null> \
prune.num_query_groups=<pruned_value_or_null> \
prune.hidden_size=<pruned_value_or_null> \
prune.num_layers=<pruned_value_or_null> \
export.save_path=/path/to/save/trimmed_model.nemo
Note
Due to a limitation in the width pruning API, both model.tensor_model_parallel_size and inference.batch_size must currently be set to 1.
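For example, a combined Minitron-style run might target fewer layers and smaller width axes at once; all target values below are illustrative, not a recommendation:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=2 \
+model.dist_ckpt_load_strictness=log_all \
inference.batch_size=1 \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=2 \
prune.ffn_hidden_size=9216 \
prune.num_attention_heads=null \
prune.num_query_groups=null \
prune.hidden_size=3072 \
prune.num_layers=24 \
export.save_path=/path/to/save/trimmed_model.nemo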
Validate Trimmed Model#
To compute validation loss on the trimmed model, use the following script:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/path/to/folder/with/model/config \
--config-name=model_config.yaml \
trainer.limit_val_batches=<limit_val_batches> \
model.restore_from_path=/path/to/trimmed_model.nemo \
model.skip_train=True \
model.data.data_impl=mock \
model.data.data_prefix=[]
To use a specific dataset instead of a mock dataset, modify the model.data parameters as follows:
model.data.data_impl=mmap \
'model.data.data_prefix=["path/to/datafile1","path/to/datafile2"]'
Validate Depth Pruning of Original Model on the Fly#
To compute validation loss of the original model with selected layers dropped on the fly, without exporting a pruned checkpoint, use the following script:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/path/to/folder/with/model/config \
--config-name=model_config.yaml \
trainer.limit_val_batches=<limit_val_batches> \
model.restore_from_path=/path/to/original_model.nemo \
model.skip_train=True \
model.data.data_impl=mock \
model.data.data_prefix=[] \
'model.drop_layers=[1, 2, 3, 4]'
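Once a drop set with acceptable validation loss has been found this way, the same list can be passed to megatron_gpt_prune.py (see Drop Model Layers above) to export the trimmed checkpoint, for example:
python /NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
model.restore_from_path=/path/to/original_model.nemo \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
+model.dist_ckpt_load_strictness=log_all \
trainer.num_nodes=1 \
trainer.precision=bf16 \
trainer.devices=1 \
'prune.drop_layers=[1, 2, 3, 4]' \
export.save_path=/path/to/save/trimmed_model.nemo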