Quantization#
NeMo offers Post-Training Quantization (PTQ) to postprocess an FP16/BF16 model into a lower-precision format for efficient deployment. The following sections detail how to use it.
Post-Training Quantization#
PTQ enables deploying a model in a low-precision format – FP8, INT4, or INT8 – for efficient serving. Different quantization methods are available, including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.
Model quantization has three primary benefits: reduced model memory requirements, lower memory bandwidth pressure, and increased inference throughput.
In NeMo, quantization is enabled by the NVIDIA TensorRT Model Optimizer (ModelOpt) – a library to quantize and compress deep learning models for optimized inference on GPUs.
The quantization process consists of the following steps:
- Load a model checkpoint using an appropriate parallelism strategy. 
- Calibrate the model to obtain scaling factors for lower-precision GEMMs. 
- Produce a TensorRT-LLM checkpoint with a model config (JSON) and quantized weights (safetensors). Additionally, the necessary context for setting up the model tokenizer is saved. 
Loading models requires using a custom ModelOpt spec defined in the megatron.core.post_training.modelopt module for both Transformer and Mamba-type models. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library (see Deploy NeMo Models by Exporting TensorRT-LLM). We refer to this checkpoint as the qnemo checkpoint henceforth.
The quantization algorithm can also be conveniently set to "no_quant" to perform only the weights export step using the default precision for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
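A minimal sketch of such a baseline export, mirroring the CLI example shown later in this section with the algorithm switched to no_quant (the checkpoint paths and parallelism values are illustrative placeholders):
# Export weights in the default precision (no quantization) to obtain a
# baseline TensorRT-LLM checkpoint for accuracy/performance comparison.
# Paths and parallelism values below are illustrative.
nemo llm ptq \
    model_path=/opt/checkpoints/llama3-70b-base \
    calibration_tp=8 \
    quantization_config.algorithm=no_quant \
    export_config.inference_tp=2 \
    export_config.path=/opt/checkpoints/llama3-70b-base-baseline-qnemo \
    run.executor=torchrun \
    run.executor.ntasks_per_node=8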
Support Matrix#
The table below presents a verified model support matrix for popular LLM architectures. Support for other model families is experimental.
| Model Name | Model Parameters | Decoder Type | FP8 | INT8 SQ | INT4 AWQ | 
|---|---|---|---|---|---|
| GPT | 2B, 8B, 43B | gptnext | ✓ | ✓ | ✓ | 
| Nemotron-3 | 8B, 22B | gptnext | ✓ | ✓ | ✓ | 
| Nemotron-4 | 15B, 340B | gptnext | ✓ | ✓ | ✓ | 
| Llama 2 | 7B, 13B, 70B | llama | ✓ | ✓ | ✓ | 
| Llama 3 | 8B, 70B | llama | ✓ | ✓ | ✓ | 
| Llama 3.1 | 8B, 70B, 405B | llama | ✓ | ✓ | ✓ | 
| Llama 3.2 | 1B, 3B | llama | ✓ | ✓ | ✓ | 
| Falcon | 7B, 40B | falcon | ✗ | ✗ | ✗ | 
| Gemma 1 | 2B, 7B | gemma | ✓ | ✓ | ✓ | 
| StarCoder 1 | 15B | gpt2 | ✓ | ✓ | ✓ | 
| StarCoder 2 | 3B, 7B, 15B | gptnext | ✓ | ✓ | ✓ | 
| Mistral | 7B | llama | ✓ | ✓ | ✓ | 
| Mixtral | 8x7B | llama | ✓ | ✗ | ✗ | 
When running PTQ, the decoder type used for exporting the TensorRT-LLM checkpoint is detected automatically based on the model. If necessary, it can be overridden using the decoder_type parameter.
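For instance, if automatic detection does not yield the desired type, the override can be passed to the PTQ script introduced in the Example section below. This is a hedged sketch that assumes the script exposes the parameter as --decoder_type (verify against the script's --help); the checkpoint paths are illustrative:
# Assumed flag spelling (--decoder_type); checkpoint paths are illustrative.
# Per the support matrix above, Mistral uses the llama decoder type.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/ptq.py \
    --nemo_checkpoint=/opt/checkpoints/mistral-7b-base \
    --calibration_tp=8 \
    --algorithm=fp8 \
    --export_path=/opt/checkpoints/mistral-7b-base-fp8-qnemo \
    --decoder_type=llama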
Example#
The example below shows how to quantize the Llama 3 70B model to FP8 precision, using a tensor parallelism of 8 on a single DGX H100 node. The quantized model is intended for serving on 2 H100 GPUs, as specified with the inference_tp parameter.
The quantization workflow can be launched with the NeMo CLI or by running the PTQ script with torchrun or Slurm. Both approaches are shown below.
Use the NeMo CLI#
The command below can be launched inside a NeMo container (only single-node use cases are supported):
CALIB_TP=8
INFER_TP=2
nemo llm ptq \
    model_path=/opt/checkpoints/llama3-70b-base \
    calibration_tp=$CALIB_TP \
    quantization_config.algorithm=fp8 \
    export_config.inference_tp=$INFER_TP \
    export_config.path=/opt/checkpoints/llama3-70b-base-fp8-qnemo \
    run.executor=torchrun \
    run.executor.ntasks_per_node=$CALIB_TP
Use the PTQ script with torchrun or Slurm#
Alternatively, the torchrun command and scripts/llm/ptq.py can be used directly. The script must be launched with the number of processes equal to the tensor parallelism size:
CALIB_TP=8
CALIB_PP=1
INFER_TP=2
torchrun --nproc_per_node $CALIB_TP /opt/NeMo/scripts/llm/ptq.py \
    --nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
    --calibration_tp=$CALIB_TP \
    --calibration_pp=$CALIB_PP \
    --algorithm=fp8 \
    --inference_tp=$INFER_TP \
    --export_path=/opt/checkpoints/llama3-70b-base-fp8-qnemo
For large models, this script can be launched on Slurm for multi-node use cases by setting the --calibration_tp and --calibration_pp arguments along with the corresponding Slurm --ntasks-per-node and --nodes parameters, respectively:
CALIB_TP=8
CALIB_PP=2
INFER_TP=8
srun --nodes $CALIB_PP --ntasks-per-node $CALIB_TP ... \
    python /opt/NeMo/scripts/llm/ptq.py \
        --nemo_checkpoint=/opt/checkpoints/nemotron4-340b-base \
        --calibration_tp=$CALIB_TP \
        --calibration_pp=$CALIB_PP \
        ...
For the Llama 3 70B example, the output directory has the following structure:
llama3-70b-base-fp8-qnemo/
├── config.json
├── nemo_context/
├── rank0.safetensors
└── rank1.safetensors
The next step is to build a TensorRT-LLM engine for the produced checkpoint. This can be done conveniently with the TensorRTLLM class available in the nemo.export module, which can both build and run the engine. See Deploy NeMo Models by Exporting TensorRT-LLM for details. Alternatively, you can use the TensorRT-LLM trtllm-build command directly.
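If you choose the direct route, a minimal trtllm-build sketch could look as follows. The engine output path is an illustrative placeholder, and additional build options (for example, batch size or sequence length limits) depend on your deployment target and TensorRT-LLM version:
# Build a serving engine from the qnemo checkpoint directory produced above.
# The output directory and any extra build options are deployment-specific.
trtllm-build \
    --checkpoint_dir /opt/checkpoints/llama3-70b-base-fp8-qnemo \
    --output_dir /opt/checkpoints/llama3-70b-base-fp8-engine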
References#
Please refer to the following papers for more details on quantization techniques: