Quantization#
NeMo offers Post-Training Quantization (PTQ) to post-process an FP16/BF16 model into a lower-precision format for efficient deployment. The following sections detail how to use it.
Post-Training Quantization#
PTQ enables deploying a model in a low-precision format – FP8, INT4, or INT8 – for efficient serving. Different quantization methods are available, including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.
Model quantization has three primary benefits: reduced model memory requirements, lower memory bandwidth pressure, and increased inference throughput.
In NeMo, quantization is enabled by the NVIDIA TensorRT Model Optimizer (ModelOpt) – a library to quantize and compress deep learning models for optimized inference on GPUs.
The quantization process consists of the following steps:
1. Load a model checkpoint using an appropriate parallelism strategy.
2. Calibrate the model to obtain scaling factors for lower-precision GEMMs.
3. Produce a TensorRT-LLM checkpoint with the model config (json) and quantized weights (safetensors). Additionally, the necessary context to set up the model tokenizer is saved.
Loading models requires using a custom ModelOpt spec defined in the nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library (see Deploy NeMo Models by Exporting TensorRT-LLM). We refer to this checkpoint as the qnemo checkpoint henceforth.
The quantization algorithm can also be conveniently set to "no_quant" to perform only the weights export step using the default precision for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
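For instance, a baseline export with the default precision could be launched with the NeMo CLI as follows (a minimal sketch mirroring the full CLI example later on this page; the checkpoint path, output path, and parallelism settings are illustrative):
# Baseline export without quantization; paths and parallelism values are illustrative
nemo llm ptq \
    nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
    calibration_tp=8 \
    quantization_config.algorithm=no_quant \
    export_config.path=/opt/checkpoints/llama3-70b-base-bf16-qnemo \
    run.executor=torchrun \
    run.executor.ntasks_per_node=8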
Support Matrix#
The table below presents a verified model support matrix for popular LLM architectures. Support for other model families is experimental.
| Model Name  | Model Parameters | Decoder Type | FP8 | INT8 SQ | INT4 AWQ |
|-------------|------------------|--------------|-----|---------|----------|
| GPT         | 2B, 8B, 43B      | gptnext      | ✓   | ✓       | ✓        |
| Nemotron-3  | 8B, 22B          | gptnext      | ✓   | ✓       | ✓        |
| Nemotron-4  | 15B, 340B        | gptnext      | ✓   | ✓       | ✓        |
| Llama 2     | 7B, 13B, 70B     | llama        | ✓   | ✓       | ✓        |
| Llama 3     | 8B, 70B          | llama        | ✓   | ✓       | ✓        |
| Llama 3.1   | 8B, 70B, 405B    | llama        | ✓   | ✓       | ✓        |
| Llama 3.2   | 1B, 3B           | llama        | ✓   | ✓       | ✓        |
| Falcon      | 7B, 40B          | falcon       | ✗   | ✗       | ✗        |
| Gemma 1     | 2B, 7B           | gemma        | ✓   | ✓       | ✓        |
| StarCoder 1 | 15B              | gpt2         | ✓   | ✓       | ✓        |
| StarCoder 2 | 3B, 7B, 15B      | gptnext      | ✓   | ✓       | ✓        |
| Mistral     | 7B               | llama        | ✓   | ✓       | ✓        |
| Mixtral     | 8x7B             | llama        | ✓   | ✗       | ✗        |
When running PTQ, the decoder type for exporting the TensorRT-LLM checkpoint is detected automatically based on the model used. If necessary, it can be overridden using the decoder_type parameter, as illustrated below.
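For example, if automatic detection does not produce the desired decoder type, the override can be passed alongside the other PTQ arguments. The sketch below is based on the torchrun invocation shown in the Example section; the --decoder_type flag spelling and the checkpoint paths are assumptions:
# The --decoder_type flag name and the checkpoint paths are assumed for illustration
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/ptq.py \
    --nemo_checkpoint=/opt/checkpoints/my-custom-model \
    --calibration_tp=8 \
    --algorithm=fp8 \
    --export_path=/opt/checkpoints/my-custom-model-fp8-qnemo \
    --decoder_type=llama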
Example#
The example below shows how to quantize the Llama 3 70B model to FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is designed for serving with 2 H100 GPUs, as specified with the inference_tp export parameter.
The quantization workflow can be launched with the NeMo CLI or using a PTQ script with torchrun or Slurm, as shown below.
Use the NeMo CLI#
The command below can be launched inside a NeMo container (only single-node use cases are supported):
CALIB_TP=8
INFER_TP=2
nemo llm ptq \
nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
calibration_tp=$CALIB_TP \
quantization_config.algorithm=fp8 \
export_config.inference_tp=$INFER_TP \
export_config.path=/opt/checkpoints/llama3-70b-base-fp8-qnemo \
run.executor=torchrun \
run.executor.ntasks_per_node=$CALIB_TP
Use the PTQ script with torchrun or Slurm#
Alternatively, the torchrun command and the scripts/llm/ptq.py script can be used directly. The script must be launched with the number of processes equal to the tensor parallelism size:
CALIB_TP=8
CALIB_PP=1
INFER_TP=2
torchrun --nproc_per_node $CALIB_TP /opt/NeMo/scripts/llm/ptq.py \
--nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
--calibration_tp=$CALIB_TP \
--calibration_pp=$CALIB_PP \
--algorithm=fp8 \
--inference_tp=$INFER_TP \
--export_path=/opt/checkpoints/llama3-70b-base-fp8-qnemo
For large models, this script can be launched on Slurm for multi-node use cases by setting the --calibration_tp and --calibration_pp parameters along with the corresponding Slurm --ntasks-per-node and --nodes parameters, respectively:
CALIB_TP=8
CALIB_PP=2
INFER_TP=8
srun --nodes $CALIB_PP --ntasks-per-node $CALIB_TP ... \
python /opt/NeMo/scripts/llm/ptq.py \
--nemo_checkpoint=/opt/checkpoints/nemotron4-340b-base \
--calibration_tp=$CALIB_TP \
--calibration_pp=$CALIB_PP \
...
For the Llama 3 70B example, the output directory has the following structure:
llama3-70b-base-fp8-qnemo/
├── config.json
├── nemo_context/
├── rank0.safetensors
└── rank1.safetensors
The next step is to build a TensorRT-LLM engine for the checkpoint produced. This can be conveniently achieved and run using the TensorRTLLM class available in the nemo.export module; see Deploy NeMo Models by Exporting TensorRT-LLM for details. Alternatively, you can use the TensorRT-LLM trtllm-build command directly.
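For reference, a direct engine build could look like the sketch below (the engine output directory is illustrative, and additional trtllm-build options such as maximum batch size or sequence length may be needed for a particular deployment):
# Build a TensorRT-LLM engine from the qnemo checkpoint; the output directory name is illustrative
trtllm-build \
    --checkpoint_dir /opt/checkpoints/llama3-70b-base-fp8-qnemo \
    --output_dir /opt/checkpoints/llama3-70b-base-fp8-engine
The engine is then run with the same tensor parallelism the checkpoint was exported for (2 GPUs in this example).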
References#
Please refer to the following papers for more details on quantization techniques: