Post-Training Quantization#
Post-Training Quantization (PTQ) converts a trained FP32/FP16 model to a lower-precision representation to reduce latency and memory with minimal or no retraining. In TAO, PTQ is provided via the TAO Quant library. See Quantizing a model in TAO (TAO Quant) for the full guide and backend details.
When to Use PTQ#
Use PTQ when you:
Cannot or do not want to retrain, for example because you need faster turnaround than QAT allows
Want to establish a performance/accuracy baseline before investing in QAT
Are targeting edge deployment where INT8/FP8 inference and memory savings matter
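To make the memory argument concrete, the following back-of-the-envelope sketch estimates weight storage at different precisions. The function name weight_memory_mib and the 40-million parameter count are illustrative assumptions, not values from TAO. The estimate ignores quantization scales, activations, and any layers left in their native dtype, so real savings will be somewhat smaller.

```python
# Rough weight-storage estimate per precision (illustrative only).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "fp8": 1}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return approximate weight storage in MiB for `num_params` parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


if __name__ == "__main__":
    num_params = 40_000_000  # hypothetical model size, not a TAO figure
    for dtype in ("fp32", "fp16", "int8", "fp8"):
        print(f"{dtype}: {weight_memory_mib(num_params, dtype):.1f} MiB")
```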
Backends at a glance#
TorchAO (weight-only PTQ): The simplest path; quantizes weights only, needs no calibration loop, and yields modest speedups.
NVIDIA ModelOpt (static PTQ): Quantizes weights and activations; requires calibration data; yields larger speed gains.
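The difference between the two approaches can be illustrated without either library. The sketch below is plain Python, not the TorchAO or ModelOpt API. It assumes symmetric per-tensor INT8 quantization with an absolute-maximum scale, and all function names are hypothetical. Weight-only PTQ derives its scale from the weights themselves, while static PTQ also needs a scale for activations, which has to be estimated from calibration samples.

```python
from typing import Iterable, List


def absmax_scale(values: Iterable[float], qmax: int = 127) -> float:
    """Scale that maps the largest magnitude in `values` to `qmax`."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize(values: Iterable[float], scale: float, qmax: int = 127) -> List[int]:
    """Round to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q: Iterable[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


def calibrate_activation_scale(batches: Iterable[Iterable[float]]) -> float:
    """Static PTQ step: track the largest activation seen over calibration batches."""
    peak = 0.0
    for batch in batches:
        peak = max(peak, max(abs(v) for v in batch))
    return peak / 127 if peak > 0 else 1.0


# Weight-only: the scale comes from the weights, so no data is needed.
weights = [0.50, -1.27, 0.03, 0.88]
w_scale = absmax_scale(weights)
w_int8 = quantize(weights, w_scale)

# Static: the activation scale is fixed ahead of time from sample inputs.
calibration_batches = [[0.1, 2.0, -0.7], [3.81, -1.2, 0.4]]
a_scale = calibrate_activation_scale(calibration_batches)
```

Real backends typically use finer-grained schemes (per-channel scales, other calibration statistics), so treat this only as a model of why one path needs data and the other does not.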
Quick Workflow#
Train your model as usual (FP32 or with AMP).
Add a quantize section to your experiment spec, selecting a backend and mode.
Run the task-specific quantize command.
Evaluate with the quantized artifact and validate accuracy on your data.
Deploy. For TensorRT export pipelines, follow the task’s export section and the TAO Quant documentation.
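The evaluation step usually comes down to comparing one metric before and after quantization. The helper below is a hypothetical sketch, not part of TAO: the function names, the metric values, and the one-point tolerance are placeholders to be replaced with numbers from your own evaluation runs.

```python
def accuracy_drop(baseline: float, quantized: float) -> float:
    """Absolute drop in a higher-is-better metric (for example mAP or top-1)."""
    return baseline - quantized


def within_budget(baseline: float, quantized: float, max_drop: float) -> bool:
    """True if the quantized model loses no more than `max_drop`."""
    return accuracy_drop(baseline, quantized) <= max_drop


# Placeholder numbers; substitute metrics measured on your validation set.
baseline_map = 0.532
quantized_map = 0.527
ok = within_budget(baseline_map, quantized_map, max_drop=0.01)
```

Fixing the acceptable drop before running the comparison keeps the decision about whether to deploy from being made after the fact.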
Minimal example (RT-DETR)#
quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"  # or "modelopt"
  mode: "weight_only_ptq"  # torchao
  # mode: "static_ptq"  # modelopt
  default_layer_dtype: "native"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
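Because the example pairs each backend with a specific mode, a mismatched backend/mode combination is an easy mistake to make. The following sketch checks a spec that has already been loaded into a Python dictionary. The check_quantize_spec function is hypothetical and not part of TAO, and the pairing table is taken only from the backends and modes named on this page, not from an official schema.

```python
# Backend/mode pairs named on this page; not an official TAO schema.
SUPPORTED_MODES = {
    "torchao": "weight_only_ptq",
    "modelopt": "static_ptq",
}


def check_quantize_spec(spec: dict) -> None:
    """Raise ValueError if the `quantize` section pairs a backend with the wrong mode."""
    section = spec["quantize"]
    backend, mode = section["backend"], section["mode"]
    if backend not in SUPPORTED_MODES:
        raise ValueError(f"unknown backend: {backend!r}")
    expected = SUPPORTED_MODES[backend]
    if mode != expected:
        raise ValueError(f"backend {backend!r} expects mode {expected!r}, got {mode!r}")


# Dictionary form of the spec above, as a YAML loader would produce it.
spec = {
    "quantize": {
        "model_path": "/path/to/trained_rtdetr.ckpt",
        "results_dir": "/path/to/quantized_output",
        "backend": "torchao",
        "mode": "weight_only_ptq",
        "layers": [{"module_name": "Linear", "weights": {"dtype": "int8"}}],
    }
}
check_quantize_spec(spec)  # returns without raising for a consistent pair
```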
Limitations#
Current backends: torchao (weight-only PTQ) and modelopt (static PTQ).
Modes: PTQ only; QAT support in TAO Quant is planned but not yet available here.
Dtypes: INT8 and FP8 (E4M3FN/E5M2).
Tasks: classification_pyt and rtdetr.
Runtime: PyTorch; ONNX/TensorRT export is experimental.
For the most up-to-date, comprehensive list, see Limitations and Current Status.
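The two FP8 variants listed under Dtypes trade precision for range. As a worked illustration, the sketch below computes the largest finite value of each format from its bit layout, following the commonly used FP8 definitions: E4M3FN has no infinities and reserves only the all-ones bit pattern for NaN, while E5M2 follows IEEE-style conventions. The function name is hypothetical and the code is independent of TAO.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of an FP8 format with the given bit layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN; the mantissa can be all ones.
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 2 ** man_bits - 1
    else:
        # "FN" variant: top exponent code is usable; only an all-ones mantissa is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 2 ** man_bits - 2
    return 2.0 ** max_exp * (1 + max_man / 2 ** man_bits)


e4m3fn_max = fp8_max(4, 3, ieee_like=False)
e5m2_max = fp8_max(5, 2, ieee_like=True)
```

E4M3FN tops out at 448 with three mantissa bits, whereas E5M2 reaches 57344 with only two, so the choice between them is a choice between finer resolution and wider dynamic range.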