Quantization#
NeMo offers Post-Training Quantization (PTQ) to convert an FP16/BF16 model to a lower-precision format for efficient deployment. The following sections detail how to use it.
Post-Training Quantization#
PTQ enables deploying a model in a low-precision format – FP8, INT4, or INT8 – for efficient serving. Different quantization methods are available including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.
Model quantization has three primary benefits: reduced model memory requirements, lower memory bandwidth pressure, and increased inference throughput.
In NeMo, quantization is enabled by the NVIDIA TensorRT Model Optimizer (ModelOpt) – a library to quantize and compress deep learning models for optimized inference on GPUs.
The quantization process consists of the following steps:
1. Load a model checkpoint using an appropriate parallelism strategy.
2. Calibrate the model to obtain scaling factors for lower-precision GEMMs.
3. Produce a TensorRT-LLM checkpoint with the model config (json) and quantized weights (safetensors). Additionally, the necessary context to set up the model tokenizer is saved.
Loading models requires using a custom ModelOpt spec defined in the nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library (see Deploy NeMo Models by Exporting TensorRT-LLM). We refer to this checkpoint as the qnemo checkpoint henceforth.
The quantization algorithm can also be conveniently set to "no_quant" to perform only the weights export step using the default precision for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
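For instance, using the PTQ script from the Example section below, only the algorithm flag changes relative to an FP8 run (a minimal sketch; the export path name is illustrative):
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/ptq.py \
--nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
--calibration_tp=8 \
--algorithm=no_quant \
--export_path=/opt/checkpoints/llama3-70b-base-bf16-qnemo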
Support Matrix#
The table below presents a verified model support matrix for popular LLM architectures. Support for other model families is experimental.
| Model Name | Model Parameters | Decoder Type | FP8 | INT8 SQ | INT4 AWQ |
|---|---|---|---|---|---|
| GPT | 2B, 8B, 43B | gptnext | ✓ | ✓ | ✓ |
| Nemotron-3 | 8B, 22B | gptnext | ✓ | ✓ | ✓ |
| Nemotron-4 | 15B, 340B | gptnext | ✓ | ✓ | ✓ |
| Llama 2 | 7B, 13B, 70B | llama | ✓ | ✓ | ✓ |
| Llama 3 | 8B, 70B | llama | ✓ | ✓ | ✓ |
| Llama 3.1 | 8B, 70B, 405B | llama | ✓ | ✓ | ✓ |
| Falcon | 7B, 40B | falcon | ✗ | ✗ | ✗ |
| Gemma 1 | 2B, 7B | gemma | ✓ | ✓ | ✓ |
| StarCoder 1 | 15B | gpt2 | ✓ | ✓ | ✓ |
| StarCoder 2 | 3B, 7B, 15B | gptnext | ✓ | ✓ | ✓ |
| Mistral | 7B | llama | ✓ | ✓ | ✓ |
| Mixtral | 8x7B | llama | ✓ | ✗ | ✗ |
When running PTQ, the decoder type for the exported TensorRT-LLM checkpoint is detected automatically based on the model used. If necessary, it can be overridden using the decoder_type parameter.
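For example, with the PTQ script used in the Example section below, the override could be passed as an additional argument (a sketch, assuming the script exposes the parameter as --decoder_type; the value should be one of the decoder types listed in the table above, and paths are illustrative):
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/ptq.py \
--nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
--calibration_tp=8 \
--algorithm=fp8 \
--decoder_type=llama \
--export_path=/opt/checkpoints/llama3-70b-base-fp8-qnemo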
Example#
The example below shows how to quantize the Llama 3 70B model to FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is designed for serving with 2 H100 GPUs, as specified with the inference_tp parameter.
The quantization workflow can be launched with the NeMo CLI or using the PTQ script with torchrun or Slurm, as shown below.
Use the NeMo CLI#
The command below can be launched inside a NeMo container (only single-node use cases are supported):
CALIB_TP=8
INFER_TP=2
nemo llm ptq \
nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
calibration_tp=$CALIB_TP \
quantization_config.algorithm=fp8 \
export_config.inference_tp=$INFER_TP \
export_config.path=/opt/checkpoints/llama3-70b-base-fp8-qnemo \
run.executor=torchrun \
run.executor.ntasks_per_node=$CALIB_TP
Use the PTQ script with torchrun or Slurm#
Alternatively, the torchrun command and scripts/llm/ptq.py can be used directly. The script must be launched with the number of processes equal to the tensor parallelism size:
CALIB_TP=8
CALIB_PP=1
INFER_TP=2
torchrun --nproc_per_node $CALIB_TP /opt/NeMo/scripts/llm/ptq.py \
--nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
--calibration_tp=$CALIB_TP \
--calibration_pp=$CALIB_PP \
--algorithm=fp8 \
--inference_tp=$INFER_TP \
--export_path=/opt/checkpoints/llama3-70b-base-fp8-qnemo
For large models, this script can be launched on Slurm for multi-node use cases by setting the --calibration_tp and --calibration_pp parameters along with the corresponding Slurm --ntasks-per-node and --nodes parameters, respectively:
CALIB_TP=8
CALIB_PP=2
INFER_TP=8
srun --nodes $CALIB_PP --ntasks-per-node $CALIB_TP ... \
python /opt/NeMo/scripts/llm/ptq.py \
--nemo_checkpoint=/opt/checkpoints/nemotron4-340b-base \
--calibration_tp=$CALIB_TP \
--calibration_pp=$CALIB_PP \
...
For the Llama 3 70B example, the output directory has the following structure:
llama3-70b-base-fp8-qnemo/
├── config.json
├── nemo_context/
├── rank0.safetensors
└── rank1.safetensors
The next step is to build a TensorRT-LLM engine for the produced checkpoint. The engine can be conveniently built and run using the TensorRTLLM class available in the nemo.export module; see Deploy NeMo Models by Exporting TensorRT-LLM for details. Alternatively, you can use the TensorRT-LLM trtllm-build command directly.
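A minimal sketch of the latter option is shown below, assuming the qnemo checkpoint produced above; the engine output directory name is illustrative, and further build options (for example, maximum batch size and sequence lengths) can be added as needed:
trtllm-build \
--checkpoint_dir /opt/checkpoints/llama3-70b-base-fp8-qnemo \
--output_dir /opt/checkpoints/llama3-70b-base-fp8-engine
The tensor parallelism of the resulting engines follows the quantized checkpoint, so serving this example uses the 2 GPUs specified earlier with inference_tp.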
References#
Please refer to the following papers for more details on quantization techniques: