Quantization#
NeMo offers Post-Training Quantization (PTQ) to postprocess an FP16/BF16 model into a lower-precision format for efficient deployment. The following sections detail how to use it.
Post-Training Quantization#
PTQ enables deploying a model in a low-precision format – FP8, INT4, or INT8 – for efficient serving. Different quantization methods are available, including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.
Model quantization has three primary benefits: reduced model memory requirements, lower memory bandwidth pressure, and increased inference throughput.
In NeMo, quantization is enabled by the NVIDIA TensorRT Model Optimizer (ModelOpt) – a library to quantize and compress deep learning models for optimized inference on GPUs.
The quantization process consists of the following steps:
- Load a model checkpoint using an appropriate parallelism strategy. 
- Calibrate the model to obtain scaling factors for lower-precision GEMMs. 
- Produce a TensorRT-LLM checkpoint with a model config (JSON) and quantized weights (safetensors). Additionally, the necessary context for setting up the model tokenizer is saved. 
Loading models requires using a custom ModelOpt spec defined in the megatron.core.post_training.modelopt module for both Transformer and Mamba-type models. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library (see Deploy NeMo Models by Exporting TensorRT-LLM). We refer to this checkpoint as the qnemo checkpoint henceforth.
The quantization algorithm can also be conveniently set to "no_quant" to perform only the weights export step using the default precision for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
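A minimal sketch of such a baseline export, mirroring the CLI example shown later in this section with the algorithm switched to no_quant (the checkpoint paths and parallelism values are illustrative placeholders):
# Export weights in the default precision (no quantization) to obtain a
# baseline TensorRT-LLM checkpoint for accuracy/performance comparison.
# Paths and parallelism values below are illustrative.
nemo llm ptq \
    model_path=/opt/checkpoints/llama3-70b-base \
    calibration_tp=8 \
    quantization_config.algorithm=no_quant \
    export_config.inference_tp=2 \
    export_config.path=/opt/checkpoints/llama3-70b-base-baseline-qnemo \
    run.executor=torchrun \
    run.executor.ntasks_per_node=8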
Support Matrix#
The table below presents a verified model support matrix for popular LLM architectures. Support for other model families is experimental.
| Model Name | Model Parameters | Decoder Type | FP8 | INT8 SQ | INT4 AWQ | 
|---|---|---|---|---|---|
| GPT | 2B, 8B, 43B | gptnext | ✓ | ✓ | ✓ | 
| Nemotron-3 | 8B, 22B | gptnext | ✓ | ✓ | ✓ | 
| Nemotron-4 | 15B, 340B | gptnext | ✓ | ✓ | ✓ | 
| Llama 2 | 7B, 13B, 70B | llama | ✓ | ✓ | ✓ | 
| Llama 3 | 8B, 70B | llama | ✓ | ✓ | ✓ | 
| Llama 3.1 | 8B, 70B, 405B | llama | ✓ | ✓ | ✓ | 
| Llama 3.2 | 1B, 3B | llama | ✓ | ✓ | ✓ | 
| Falcon | 7B, 40B | falcon | ✗ | ✗ | ✗ | 
| Gemma 1 | 2B, 7B | gemma | ✓ | ✓ | ✓ | 
| StarCoder 1 | 15B | gpt2 | ✓ | ✓ | ✓ | 
| StarCoder 2 | 3B, 7B, 15B | gptnext | ✓ | ✓ | ✓ | 
| Mistral | 7B | llama | ✓ | ✓ | ✓ | 
| Mixtral | 8x7B | llama | ✓ | ✗ | ✗ | 
When running PTQ, the decoder type used for exporting the TensorRT-LLM checkpoint is detected automatically based on the model. If necessary, it can be overridden using the decoder_type parameter.
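For instance, if automatic detection does not yield the desired type, the override can be passed to the PTQ script introduced in the Example section below. This is a hedged sketch that assumes the script exposes the parameter as --decoder_type (verify against the script's --help); the checkpoint paths are illustrative:
# Assumed flag spelling (--decoder_type); checkpoint paths are illustrative.
# Per the support matrix above, Mistral uses the llama decoder type.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/ptq.py \
    --nemo_checkpoint=/opt/checkpoints/mistral-7b-base \
    --calibration_tp=8 \
    --algorithm=fp8 \
    --export_path=/opt/checkpoints/mistral-7b-base-fp8-qnemo \
    --decoder_type=llama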
Example#
The example below shows how to quantize the Llama 3 70B model to FP8 precision, using a tensor parallelism of 8 on a single DGX H100 node. The quantized model is intended for serving on 2 H100 GPUs, as specified with the inference_tp parameter.
The quantization workflow can be launched with the NeMo CLI or by running the PTQ script with torchrun or Slurm. Both approaches are shown below.
Use the NeMo CLI#
The command below can be launched inside a NeMo container (only single-node use cases are supported):
CALIB_TP=8
INFER_TP=2
nemo llm ptq \
    model_path=/opt/checkpoints/llama3-70b-base \
    calibration_tp=$CALIB_TP \
    quantization_config.algorithm=fp8 \
    export_config.inference_tp=$INFER_TP \
    export_config.path=/opt/checkpoints/llama3-70b-base-fp8-qnemo \
    run.executor=torchrun \
    run.executor.ntasks_per_node=$CALIB_TP
Use the PTQ script with torchrun or Slurm#
Alternatively, the torchrun command and scripts/llm/ptq.py can be used directly. The script must be launched with the number of processes equal to the tensor parallelism size:
CALIB_TP=8
CALIB_PP=1
INFER_TP=2
torchrun --nproc_per_node $CALIB_TP /opt/NeMo/scripts/llm/ptq.py \
    --nemo_checkpoint=/opt/checkpoints/llama3-70b-base \
    --calibration_tp=$CALIB_TP \
    --calibration_pp=$CALIB_PP \
    --algorithm=fp8 \
    --inference_tp=$INFER_TP \
    --export_path=/opt/checkpoints/llama3-70b-base-fp8-qnemo
For large models, this script can be launched on Slurm for multi-node use cases by setting the --calibration_tp and --calibration_pp arguments along with the corresponding Slurm --ntasks-per-node and --nodes parameters, respectively:
CALIB_TP=8
CALIB_PP=2
INFER_TP=8
srun --nodes $CALIB_PP --ntasks-per-node $CALIB_TP ... \
    python /opt/NeMo/scripts/llm/ptq.py \
        --nemo_checkpoint=/opt/checkpoints/nemotron4-340b-base \
        --calibration_tp=$CALIB_TP \
        --calibration_pp=$CALIB_PP \
        ...
For the Llama 3 70B example, the output directory has the following structure:
llama3-70b-base-fp8-qnemo/
├── config.json
├── nemo_context/
├── rank0.safetensors
└── rank1.safetensors
The next step is to build a TensorRT-LLM engine for the produced checkpoint. This can be done conveniently with the TensorRTLLM class available in the nemo.export module, which can both build and run the engine. See Deploy NeMo Models by Exporting TensorRT-LLM for details. Alternatively, you can use the TensorRT-LLM trtllm-build command directly.
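If you choose the direct route, a minimal trtllm-build sketch could look as follows. The engine output path is an illustrative placeholder, and additional build options (for example, batch size or sequence length limits) depend on your deployment target and TensorRT-LLM version:
# Build a serving engine from the qnemo checkpoint directory produced above.
# The output directory and any extra build options are deployment-specific.
trtllm-build \
    --checkpoint_dir /opt/checkpoints/llama3-70b-base-fp8-qnemo \
    --output_dir /opt/checkpoints/llama3-70b-base-fp8-engine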
References#
Please refer to the following papers for more details on quantization techniques: