Quantization

Post-Training Quantization (PTQ)

PTQ enables deploying a model in a low-precision format – FP8, INT4, or INT8 – for efficient serving. Different quantization methods are available including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.

Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput.

In NeMo, quantization is powered by the NVIDIA TensorRT Model Optimizer (ModelOpt) library, which quantizes and compresses deep learning models for optimized inference on GPUs.

The quantization process consists of the following steps:

  1. Loading a model checkpoint using an appropriate parallelism strategy

  2. Calibrating the model to obtain appropriate algorithm-specific scaling factors

  3. Producing an output directory or a .qnemo tarball with the model config (JSON), quantized weights (safetensors), and tokenizer config (YAML)

Loading models requires using a ModelOpt spec defined in the nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec module. The calibration step is typically lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The resulting output directory (or .qnemo tarball) is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library. The engine build step is also available in the NeMo project in the nemo.deploy and nemo.export modules.
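For illustration, the calibration performed by ModelOpt boils down to running a handful of forward passes while the library collects activation statistics. The sketch below shows the general shape of that API on a toy model with random calibration data; the NeMo PTQ script wraps this for Megatron models, so you do not need to call it yourself, and exact config names (e.g. FP8_DEFAULT_CFG) may differ between ModelOpt versions.

import torch
import modelopt.torch.quantization as mtq

# Toy stand-ins for the real Megatron model and calibration dataloader.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16))
calib_dataloader = [torch.randn(4, 16) for _ in range(8)]

# Forward loop used for calibration: ModelOpt observes activations here
# to derive the algorithm-specific scaling factors.
def forward_loop(m):
    for batch in calib_dataloader:
        m(batch)

# Replace supported layers with quantized versions and calibrate them.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)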

The quantization algorithm can also be conveniently set to "null" to perform only the weights export step using the default precision for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
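For example, a baseline export might look like the following sketch (paths are illustrative; the remaining parallelism and trainer overrides follow the FP8 example in the Example section below):

torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_gpt_ptq.py \
    model.restore_from_path=llama3-70b-base-bf16.nemo \
    quantization.algorithm=null \
    export.save_path=llama3-70b-base-bf16-qnemo \
    # other parallelism and trainer parameters as in the FP8 example below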

Support Matrix

The table below presents the verified model support matrix for popular LLM architectures. Some model entries also provide a download link to a corresponding NeMo checkpoint for testing purposes. Support for other model families is experimental.

Model Support Matrix

Model Family                                 FP8    INT8_SQ    INT4_AWQ
Llama (1, 2, 3)                               ✓        ✓          ✓
Mistral                                       ✓        ✓          ✓
GPT-3                                         ✓        ✓          ✓
Nemotron-3 8b                                 ✓        ✓          ✓
Nemotron-4 15b                                ✓        ✓          ✓
Nemotron-4 340b (Base, Instruct, Reward)      ✓        ✓          ✓
StarCoder 2                                   ✓        ✓          ✓
Gemma                                         ✓        ✓          ✓

Example

The example below shows how to quantize the Llama 3 70B model to FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is intended for serving with 2 GPUs, as specified by the export.inference_tensor_parallel parameter.

The script must be launched with a number of processes equal to the tensor parallelism size. This is achieved with the torchrun command below:

torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_gpt_ptq.py \
    model.restore_from_path=llama3-70b-base-bf16.nemo \
    model.tensor_model_parallel_size=8 \
    model.pipeline_model_parallel_size=1 \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    trainer.precision=bf16 \
    quantization.algorithm=fp8 \
    export.decoder_type=llama \
    export.inference_tensor_parallel=2 \
    export.save_path=llama3-70b-base-fp8-qnemo

For large models, the command can be used in a multi-node setting. For example, this can be done with the NeMo Framework Launcher using Slurm.
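For illustration only, a minimal Slurm batch script for a two-node run might look like the sketch below. Cluster-specific settings such as account, partition, and container setup are omitted, and the parallelism values are illustrative and must match your checkpoint and node count.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# One task per GPU; the trainer picks up the Slurm environment automatically.
srun python examples/nlp/language_modeling/megatron_gpt_ptq.py \
    model.restore_from_path=llama3-70b-base-bf16.nemo \
    model.tensor_model_parallel_size=8 \
    model.pipeline_model_parallel_size=2 \
    trainer.num_nodes=2 \
    trainer.devices=8 \
    trainer.precision=bf16 \
    quantization.algorithm=fp8 \
    export.decoder_type=llama \
    export.inference_tensor_parallel=2 \
    export.save_path=llama3-70b-base-fp8-qnemo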

The output directory stores the following files:

llama3-70b-base-fp8-qnemo/
├── config.json
├── rank0.safetensors
├── rank1.safetensors
├── tokenizer.model
└── tokenizer_config.yaml

The TensorRT-LLM engine can be conveniently built and run using the TensorRTLLM class available in the nemo.export submodule:

from nemo.export.tensorrt_llm import TensorRTLLM

# Build a TensorRT-LLM engine from the quantized qnemo checkpoint.
trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
trt_llm_exporter.export(
    nemo_checkpoint_path="llama3-70b-base-fp8-qnemo",
    model_type="llama",
)
# Run a quick generation to verify the engine.
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])

Alternatively, the engine can be built directly using the trtllm-build command; see the TensorRT-LLM documentation for details:

trtllm-build \
    --checkpoint_dir llama3-70b-base-fp8-qnemo \
    --output_dir /path/to/trt_llm_engine_folder \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --strongly_typed
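
Once built, the engine can be smoke-tested with the run.py example script shipped in the TensorRT-LLM repository. The paths below are illustrative and the exact flags may vary between TensorRT-LLM versions:

python examples/run.py \
    --engine_dir /path/to/trt_llm_engine_folder \
    --tokenizer_dir llama3-70b-base-fp8-qnemo \
    --input_text "Hi, how are you?" \
    --max_output_len 128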

Known issues

  • Currently, building TensorRT-LLM engines for quantized “qnemo” models with the nemo.export module is limited to single-node deployments.

Quantization-Aware Training (QAT)

QAT is the technique of fine-tuning a quantized model to recover the model quality degradation caused by quantization. During QAT, the quantization scaling factors computed during PTQ are frozen and the model weights are fine-tuned. While QAT requires much more compute than PTQ, it is highly effective at recovering model quality. To perform QAT on a calibrated model from PTQ, further fine-tune the model on a downstream task using a small dataset before exporting to TensorRT-LLM. You can reuse your existing training pipeline for QAT.

As a rule of thumb, we recommend running QAT for 1-10% of the original training duration with a small learning rate, e.g. 1e-5 for the Adam optimizer. If you are performing QAT on an SFT model, where the learning rate and fine-tuning dataset size are already small, you can continue using the same SFT learning rate and dataset size as a starting point for QAT. Since QAT is performed after PTQ, the supported model families are the same as for PTQ.
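For example, if the original training run used 1,000 steps, this rule of thumb translates into overrides along the following lines (the model.optim.lr path mirrors standard NeMo configs and is illustrative here):

# Roughly 1-10% of the original 1,000 training steps, with a small learning rate
trainer.max_steps=100
model.optim.lr=1e-5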

Example

The example below shows how to perform PTQ and QAT on a supervised fine-tuned (SFT) Llama 2 7B model to quantize it to INT4 precision. The script has been tested using tensor parallelism of 8 on 8x RTX 6000 Ada 48GB GPUs. Alternatively, a single DGX A100 node with 8x 40GB GPUs can be used. For bigger models such as Llama 2 70B, you may need one or more DGX H100 nodes with 8x 80GB GPUs each.

The example is a modified version of the SFT with Llama 2 playbook. Please refer to the playbook for more details on setting up a BF16 NeMo model and the databricks-dolly-15k instruction dataset.

First, run the SFT example command from the playbook as-is to train a Llama 2 7B SFT model for 100 steps. Make sure to change trainer.max_steps=50 to trainer.max_steps=100 for the examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py script. This will take ~2 hours and produce a model checkpoint with a validation loss of approximately 1.15, which we will use for PTQ and QAT next.
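For reference, the only change relative to the playbook command is the step count; a sketch (all other arguments stay exactly as in the playbook):

torchrun --nproc-per-node 8 examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.max_steps=100 \
    # other parameters from the SFT playbook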

For quantization, we use a modified version of the SFT script and config file that adds quantization and TensorRT-LLM export support. In addition to the new parameters, make sure to pass the same parameters you passed for SFT training, except that the model restore path now points to the SFT output .nemo file. The example command below performs PTQ on the SFT model checkpoint, followed by SFT again (QAT); the result can then be exported for TensorRT-LLM inference. The script takes ~2-3 hours to complete.

torchrun --nproc-per-node 8 examples/nlp/language_modeling/tuning/megatron_gpt_qat.py \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    trainer.precision=bf16 \
    trainer.max_steps=100 \
    model.restore_from_path=<llama2-7b-sft-nemo-path> \
    model.global_batch_size=128 \
    quantization.algorithm=int4 \
    # other parameters from sft training

As you can see from the logs, the INT4 PTQ model has a validation loss of approximately 1.31, while the QAT model has a validation loss of approximately 1.17, which is very close to the BF16 model loss of 1.15. The script produces a quantized .nemo checkpoint in the experiment manager log directory (specified in the config YAML file) that can be used for further training. It can also optionally produce an exported TensorRT-LLM engine directory or a .qnemo tarball for inference by setting the export parameters, similar to the PTQ example. Note that you may tweak the number of QAT trainer steps and the learning rate if needed to achieve better model quality.
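For example, export can be enabled by appending overrides analogous to the PTQ example to the QAT command above. The save path is illustrative, and the exact set of export parameters is defined in the QAT config file:

    export.decoder_type=llama \
    export.inference_tensor_parallel=1 \
    export.save_path=llama2-7b-sft-int4-qnemo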

References

Please refer to the following papers for more details on quantization techniques: