Quantization#

This guide covers model quantization in Megatron Bridge using NVIDIA ModelOpt, including post-training quantization (PTQ) and quantization-aware training (QAT).

Note: The scripts referenced in this guide are minimal, hard-coded reference examples. For more comprehensive, production-quality scripts covering quantization, export, and Quantization-Aware Distillation (QAD), see examples/megatron_bridge/ in the NVIDIA Model Optimizer repository.

Overview#

Quantization is an effective model optimization technique that compresses models by reducing precision from high-precision formats (FP16/BF16) to lower-precision formats (NVFP4, FP8, INT8, INT4). Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality.
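The 2x-4x figure follows from the number of bits stored per weight. The sketch below makes that arithmetic explicit; the bit widths are nominal values for each format, the per-block scale-factor overhead of formats such as NVFP4 is ignored, and the function names are illustrative rather than part of any library.

```python
# Nominal bits per weight for each format (scale-factor overhead is ignored).
BITS_PER_WEIGHT = {"bf16": 16, "fp16": 16, "fp8": 8, "int8": 8, "nvfp4": 4, "int4": 4}


def compression_ratio(source: str, target: str) -> float:
    """Return how many times smaller the weights become going from source to target."""
    return BITS_PER_WEIGHT[source] / BITS_PER_WEIGHT[target]


def weight_gib(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GiB for a model with num_params parameters."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


for target in ("fp8", "nvfp4"):
    ratio = compression_ratio("bf16", target)
    size = weight_gib(8e9, target)
    print(f"bf16 to {target}: {ratio:.0f}x smaller, about {size:.1f} GiB for 8B parameters")
```

Real checkpoints land slightly below these ratios because scaling factors and any layers left unquantized still take space.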

In Megatron Bridge, quantization is provided by NVIDIA Model Optimizer (ModelOpt), a library for quantizing and compressing deep learning models for optimized inference on GPUs. Model Optimizer supports high-performance quantization formats including FP8, INT8, INT4, and NVFP4, as well as advanced algorithms such as SmoothQuant and AWQ, through easy-to-use Python APIs.

Quantization Methods#

Megatron Bridge supports two quantization approaches:

Post-Training Quantization (PTQ)#

PTQ reduces model precision after training to improve inference efficiency without requiring retraining. This is the fastest approach and works well for most models.

Process:

1. Load a pretrained model checkpoint
2. Calibrate the model using a small dataset (typically 128-512 samples) to obtain scaling factors
3. Produce a quantized checkpoint
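To make the calibration step concrete, here is a minimal sketch of one simple scheme: per-tensor absolute-maximum calibration onto a signed INT8 grid. It is a pure-Python illustration under that assumption, not ModelOpt's implementation, and all function names are hypothetical.

```python
def calibrate_amax(calibration_batches):
    """Track the largest absolute value seen across all calibration batches."""
    amax = 0.0
    for batch in calibration_batches:
        amax = max(amax, max(abs(x) for x in batch))
    return amax


def int8_scale(amax):
    """Map the observed range [-amax, amax] onto the signed INT8 range [-127, 127]."""
    return amax / 127.0


def quantize_dequantize(values, scale):
    """Round to the INT8 grid, clamp, and map back to floats (simulated quantization)."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


# Two tiny "calibration batches" stand in for the 128-512 samples used in practice.
batches = [[0.5, -1.27, 0.1], [0.9, -0.3, 1.0]]
amax = calibrate_amax(batches)
scale = int8_scale(amax)

# Values inside the calibrated range survive with small rounding error;
# values outside it (2.0 here) saturate at amax.
restored = quantize_dequantize([0.5, -1.27, 2.0], scale)
```

The saturation of the out-of-range value shows why the calibration data should be representative of what the model sees at inference time.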

Quantization-Aware Training (QAT)#

QAT can improve model accuracy beyond what PTQ achieves. It is especially useful for preserving accuracy at low precisions (e.g., INT4, or FP4 on the NVIDIA Blackwell platform).

Process:

1. Train/fine-tune the model in the original precision without quantization
2. Quantize the model from step 1 with mtq.quantize()
3. Train/fine-tune the quantized model with a small learning rate (e.g., 1e-5 for Adam optimizer)

Note: Step 3 is the actual quantization-aware training step. The optimal hyperparameters for QAT can vary depending on the model and training dataset.

Note: QAT without prior training/fine-tuning in the original precision (i.e., skipping Step 1) typically yields lower accuracy. For best accuracy, train or fine-tune in the original precision first, then run QAT.
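Steps 2 and 3 can be sketched with ModelOpt's Python API as follows. This is a hedged outline that assumes a plain torch.nn.Module whose forward pass accepts one batch; calibration_batches and both function names are illustrative, and the Megatron Bridge scripts described later in this guide wire the same call into a distributed setup differently.

```python
def quantize_trained_model(model, calibration_batches):
    """Step 2: insert quantizers into a model that was already trained in full precision."""
    # Imported lazily so this sketch can be imported without ModelOpt installed.
    import modelopt.torch.quantization as mtq

    def forward_loop(m):
        # Calibration pass: ModelOpt calls this with the model so the quantizers
        # can observe activations and derive scaling factors.
        for batch in calibration_batches:
            m(batch)

    # FP8_DEFAULT_CFG is one of ModelOpt's predefined configs; pick the one that
    # matches the target format.
    return mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)


def build_qat_optimizer(model, lr=1e-5):
    """Step 3 setup: a small learning rate for fine-tuning the quantized model."""
    import torch

    return torch.optim.Adam(model.parameters(), lr=lr)
```

After these two calls, the quantized model goes through the same training loop that was used in step 1.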

Post-Training Quantization (PTQ)#

PTQ quantizes a pretrained model by running calibration with a small dataset to compute scaling factors. The complete workflow includes: quantize → resume and generate → export.

Quantize#

Use the examples/quantization/quantize.py script for LLM PTQ:

uv run python -m torch.distributed.run --nproc_per_node 2 examples/quantization/quantize.py \
    --hf-model-id meta-llama/Llama-3.2-1B \
    --export-quant-cfg fp8 \
    --tp 2 \
    --megatron-save-path ./llama3_2_1b_fp8

Use the examples/quantization/quantize_vlm.py script for VLM PTQ:

uv run python -m torch.distributed.run --nproc_per_node 8 examples/quantization/quantize_vlm.py \
    --hf-model-id Qwen/Qwen3-VL-30B-A3B-Instruct \
    --export-quant-cfg fp8 \
    --megatron-save-path ./Qwen3-VL-30B-A3B-Instruct_fp8 \
    --tp 4 \
    --etp 4 \
    --pp 2 \
    --calib-size 256

Key Arguments:

--hf-model-id - Hugging Face model ID or local path
--export-quant-cfg - Quantization format (fp8, nvfp4, etc.)
--megatron-save-path - Output checkpoint path
--tp - Tensor parallelism size
--pp - Pipeline parallelism size
--ep - Expert parallelism for MoE models
--etp - Expert tensor parallelism for MoE models
--calib-size - Number of calibration samples used for quantization (default: 512)
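When choosing --tp and --pp, the process count passed to --nproc_per_node has to be compatible with them. The helper below is a small sketch of that check, assuming the usual Megatron rule that the number of processes must be divisible by tp x pp (the remainder becomes data parallelism). It does not model the expert settings --ep and --etp, and the function name is illustrative.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1) -> int:
    """Return the data-parallel size implied by world_size, tp, and pp.

    Raises ValueError when the processes cannot be split evenly into tp x pp groups.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel


# The LLM example above: 2 processes with --tp 2.
llm_dp = data_parallel_size(2, tp=2)

# The VLM example above: 8 processes with --tp 4 --pp 2.
vlm_dp = data_parallel_size(8, tp=4, pp=2)
```

Both example commands use every process for model parallelism, so the implied data-parallel size is 1 in each case.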

Resume and Generate#

Resume the quantized checkpoint and test with text generation using examples/quantization/ptq_generate.py for LLM:

uv run python -m torch.distributed.run --nproc_per_node 2 examples/quantization/ptq_generate.py \
    --hf-model-id meta-llama/Llama-3.2-1B \
    --megatron-load-path ./llama3_2_1b_fp8 \
    --tp 2

Resume the quantized checkpoint and test with text generation using examples/quantization/ptq_generate_vlm.py for VLM:

uv run python -m torch.distributed.run --nproc_per_node 8 examples/quantization/ptq_generate_vlm.py \
    --hf-model-id Qwen/Qwen3-VL-30B-A3B-Instruct \
    --megatron-load-path ./Qwen3-VL-30B-A3B-Instruct_fp8 \
    --tp 8 \
    --ep 8 \
    --image-path ./demo.jpeg \
    --prompts "Describe this image."

Key Arguments:

--megatron-load-path - Path to quantized checkpoint
--hf-model-id - Hugging Face model ID or local path (for tokenizer)
--image-path - Path to the input image file used for vision-language model generation
--prompts - Test prompts

Export#

Export the quantized checkpoint to the unified Hugging Face format using examples/quantization/export.py:

uv run python -m torch.distributed.run --nproc_per_node 2 examples/quantization/export.py \
    --hf-model-id meta-llama/Llama-3.2-1B \
    --megatron-load-path ./llama3_2_1b_fp8 \
    --export-dir ./llama3_2_1b_fp8_hf \
    --tp 2 \
    --dtype bfloat16

Key Arguments:

--export-dir - Output directory for the unified Hugging Face checkpoint
--dtype - Export data type

Programmatic ModelOpt Export#

Use AutoBridge.export_hf_weights_modelopt() when you need to stream ModelOpt deployment weights from an already-loaded Megatron model instead of writing a full checkpoint through the export script. This is useful for integrations that consume Hugging Face weight names directly, such as inference-engine refit paths.

The API currently supports only quant_mode="nvfp4". Each quantized parameter is yielded under its original Hugging Face *.weight name, together with its ModelOpt NVFP4 scale tensors:

*.weight
*.weight_scale
*.weight_scale_2

Unquantized parameters are yielded under their regular Hugging Face names. Quantizer-internal tensors are skipped.

from safetensors.torch import save_file

state_dict = {}
for name, weight in bridge.export_hf_weights_modelopt(
    model,
    quant_mode="nvfp4",
    cpu=True,
    show_progress=False,
):
    state_dict[name] = weight.contiguous()

save_file(state_dict, "modelopt-nvfp4.safetensors")
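One way to sanity-check an export is to group the yielded names by logical parameter and confirm that every quantized parameter carries both scale tensors. The sketch below works on a plain list of names, so it can be applied to state_dict.keys(); the sample names are hypothetical and the helper functions are not part of Megatron Bridge.

```python
from collections import defaultdict

SUFFIXES = (".weight_scale_2", ".weight_scale", ".weight")


def group_by_parameter(names):
    """Map each logical parameter name to the set of tensor suffixes exported for it."""
    groups = defaultdict(set)
    for name in names:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[name[: -len(suffix)]].add(suffix)
                break
        else:
            # Tensors without a known suffix (for example a bias) are kept as-is.
            groups[name].add("")
    return dict(groups)


def quantized_parameters(names):
    """Return logical names that were exported with both NVFP4 scale tensors."""
    required = {".weight", ".weight_scale", ".weight_scale_2"}
    return sorted(p for p, s in group_by_parameter(names).items() if required <= s)


# Hypothetical names, shaped like the output described above.
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.self_attn.q_proj.weight_scale_2",
    "model.norm.weight",  # unquantized: only the regular name is present
]
quantized = quantized_parameters(example_names)
```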

For large models, consume the iterator directly in the downstream writer or refit path instead of materializing the full state_dict.

for name, weight in bridge.export_hf_weights_modelopt(
    model,
    quant_mode="nvfp4",
    ignore_patterns=["lm_head", "*self_attn.o_proj*"],
    show_progress=False,
):
    refit_engine.replace_weight(name, weight)

ignore_patterns are matched against Hugging Face parameter names. The matcher handles the optional model. prefix and ModelOpt scale suffixes, so a pattern can target the logical parameter name without separately listing *.weight_scale and *.weight_scale_2.
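The matching behavior described above can be approximated as follows. This is an illustrative re-implementation built on Python's fnmatch, written only to show the intended semantics; it is not the matcher Megatron Bridge uses, and details such as how a pattern that itself starts with model. is treated are assumptions left out here.

```python
from fnmatch import fnmatchcase

SCALE_AND_WEIGHT_SUFFIXES = (".weight_scale_2", ".weight_scale", ".weight")


def logical_name(hf_name: str) -> str:
    """Strip the optional "model." prefix and any weight or scale suffix."""
    if hf_name.startswith("model."):
        hf_name = hf_name[len("model."):]
    for suffix in SCALE_AND_WEIGHT_SUFFIXES:
        if hf_name.endswith(suffix):
            return hf_name[: -len(suffix)]
    return hf_name


def is_ignored(hf_name: str, ignore_patterns) -> bool:
    """Return True when the logical parameter name matches any glob pattern."""
    name = logical_name(hf_name)
    return any(fnmatchcase(name, pattern) for pattern in ignore_patterns)


patterns = ["lm_head", "*self_attn.o_proj*"]
checks = {
    "lm_head.weight": is_ignored("lm_head.weight", patterns),
    "o_proj scale": is_ignored("model.layers.0.self_attn.o_proj.weight_scale_2", patterns),
    "q_proj weight": is_ignored("model.layers.0.self_attn.q_proj.weight", patterns),
}
```

Under these assumptions a single pattern covers the weight and both scale tensors of a parameter, which is the behavior the paragraph above describes.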

Supported Models For PTQ#

Model	fp8	nvfp4
Llama-3.2-1B	✅	✅
Qwen3-8B	✅	✅
Qwen3-30B-A3B	✅	✅
Nemotron-H-8B-Base-8K	✅	✅
Qwen3-VL-8B-Instruct	✅	✅
Qwen3-VL-30B-A3B-Instruct	✅	✅

Quantization-Aware Training (QAT)#

In QAT, a model quantized using mtq.quantize() can be directly fine-tuned with the original training pipeline. During QAT, the scaling factors inside quantizers are frozen and the model weights are fine-tuned.
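The idea of frozen scaling factors with trainable weights can be illustrated on a toy problem. The sketch below simulates quantization in the forward pass and updates the full-precision weights with a straight-through estimator, a common way to train through rounding. The text above does not say which estimator ModelOpt uses, so treat this as a conceptual illustration with hypothetical names, not as the library's method.

```python
def fake_quantize(w, scale, qmax=127):
    """Simulate quantization in the forward pass: snap w to the grid defined by scale."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def qat_step(weights, scale, x, target, lr):
    """One SGD step on loss = (sum(q(w_i) * x_i) - target)^2."""
    w_q = [fake_quantize(w, scale) for w in weights]
    pred = sum(wq * xi for wq, xi in zip(w_q, x))
    error = pred - target
    # Straight-through estimator: treat d q(w) / d w as 1, so the gradient computed
    # with the quantized weights is applied to the full-precision weights.
    grads = [2.0 * error * xi for xi in x]
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, error * error


weights = [0.30, -0.80]
# The scale is calibrated once from the initial weights and then kept frozen.
scale = max(abs(w) for w in weights) / 127
x, target = [1.0, 2.0], -1.0

losses = []
for _ in range(50):
    weights, loss = qat_step(weights, scale, x, target, lr=0.01)
    losses.append(loss)
```

Only the weights change across the 50 steps; the loss measured with the quantized weights falls well below its starting value while the scale stays fixed.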

Complete QAT Workflow#

Step 1: Create Initial Quantized Checkpoint (PTQ)#

uv run python -m torch.distributed.run --nproc_per_node 8 examples/quantization/quantize.py \
    --hf-model-id meta-llama/Meta-Llama-3-8B \
    --export-quant-cfg fp8 \
    --tp 8 \
    --megatron-save-path /models/llama3_8b_fp8_init

Step 2: Configure Training#

Create a YAML configuration file (e.g., conf/my_qat_config.yaml):

model:
  tensor_model_parallel_size: 4
  gradient_accumulation_fusion: False

train:
  train_iters: 20
  global_batch_size: 8
  eval_iters: 0 

scheduler:
  lr_warmup_iters: 10

logger:
  log_interval: 1

checkpoint:
  pretrained_checkpoint: /models/llama3_8b_fp8_init
  save_interval: 20
  finetune: true
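The values in this configuration describe a very short run. The arithmetic below spells out what they imply, assuming the usual convention that a checkpoint is written whenever the iteration number is a multiple of save_interval; the dictionary simply mirrors the YAML above.

```python
config = {
    "train": {"train_iters": 20, "global_batch_size": 8},
    "scheduler": {"lr_warmup_iters": 10},
    "checkpoint": {"save_interval": 20},
}

train_iters = config["train"]["train_iters"]

# Total training samples consumed over the run.
samples_seen = train_iters * config["train"]["global_batch_size"]

# Share of the run spent in learning-rate warmup.
warmup_fraction = config["scheduler"]["lr_warmup_iters"] / train_iters

# Iterations at which a checkpoint would be written (assumed convention).
save_interval = config["checkpoint"]["save_interval"]
checkpoint_iters = [i for i in range(1, train_iters + 1) if i % save_interval == 0]

print(samples_seen, warmup_fraction, checkpoint_iters)
```

With 160 samples and half of the run spent in warmup, this setup is suited to verifying that the pipeline works; a real QAT run would raise train_iters.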

Step 3: Run QAT Training#

Use examples/quantization/pretrain_quantized_llama3_8b.py:

uv run python -m torch.distributed.run --nproc_per_node 4 examples/quantization/pretrain_quantized_llama3_8b.py \
  --config-file=conf/my_qat_config.yaml \
  --hf-path=meta-llama/Meta-Llama-3-8B

Configuration Overrides:

You can also use command-line overrides:

uv run python -m torch.distributed.run --nproc_per_node 4 examples/quantization/pretrain_quantized_llama3_8b.py \
    model.tensor_model_parallel_size=4 \
    model.gradient_accumulation_fusion=False \
    checkpoint.pretrained_checkpoint=/models/llama3_8b_fp8_init
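Each override has the form section.key=value and replaces the matching entry in the nested configuration. The sketch below shows that mapping on plain dictionaries. It illustrates the semantics only: the parsing rules and function names are assumptions, and the script's real override handling is not shown in this guide.

```python
import copy


def parse_value(raw: str):
    """Interpret a raw override value as bool, int, float, or string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw_value)
    return result


base = {"model": {"tensor_model_parallel_size": 8}, "checkpoint": {}}
merged = apply_overrides(
    base,
    [
        "model.tensor_model_parallel_size=4",
        "model.gradient_accumulation_fusion=False",
        "checkpoint.pretrained_checkpoint=/models/llama3_8b_fp8_init",
    ],
)
```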

Supported Models For QAT#

Model	Support
Meta-Llama-3-8B	✅