Post-Training Quantization

Post-Training Quantization (PTQ) converts a trained FP32/FP16 model to a lower-precision representation to reduce latency and memory with minimal or no retraining. In TAO, PTQ is provided via the TAO Quant library. See Quantizing a model in TAO (TAO Quant) for the full guide and backend details.

When to Use PTQ

Use PTQ when you:

Cannot or do not want to retrain and need faster turnaround than QAT
Want to establish a performance/accuracy baseline before investing in QAT
Are targeting edge deployment where INT8/FP8 inference and memory savings matter

Backends at a glance

TorchAO (weight-only PTQ): The simplest path; quantizes weights only, needs no calibration loop, and yields modest speedups.
NVIDIA ModelOpt (static PTQ): Quantizes both weights and activations; requires calibration data; yields larger speedups.
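
The difference between the two modes can be illustrated with a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It is a conceptual illustration only, not the TorchAO or ModelOpt implementation; the helper names (`symmetric_scale`, `quantize`, `dequantize`) and the sample values are hypothetical.

```python
from typing import List


def symmetric_scale(max_abs: float, qmax: int = 127) -> float:
    """Scale that maps [-max_abs, max_abs] onto the signed INT8 range."""
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float, qmax: int = 127) -> List[int]:
    """Round to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [x * scale for x in q]


# Weight-only PTQ: the scale is derived from the weights themselves,
# so no input data (and no calibration loop) is needed.
weights = [0.50, -1.27, 0.03, 0.90]          # made-up example weights
w_scale = symmetric_scale(max(abs(w) for w in weights))
w_q = quantize(weights, w_scale)             # [50, -127, 3, 90]

# Static PTQ: activation ranges are not known ahead of time, so they are
# observed on calibration batches and the resulting scale is frozen.
calibration_batches = [[0.2, 1.5, -0.7], [2.54, -0.1, 0.4]]  # made-up activations
act_max = max(abs(a) for batch in calibration_batches for a in batch)
a_scale = symmetric_scale(act_max)

# At inference the frozen scale is reused; values outside the calibrated
# range saturate at the INT8 limits.
a_q = quantize([1.0, -3.0], a_scale)         # [50, -127]
```

The saturated second activation shows why static PTQ depends on calibration data that is representative of the deployment inputs.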

Quick Workflow

Train your model as usual (FP32 or with AMP).
Add a quantize section to your experiment spec, selecting a backend and mode.
Run the task-specific quantize command.
Evaluate with the quantized artifact and validate accuracy on your data.
Deploy. For TensorRT export pipelines, follow the task’s export section and the TAO Quant documentation.
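
The validation step can be reduced to a simple acceptance check once you have the same metric for the original and the quantized model. The sketch below assumes you obtain both numbers from your own evaluation runs; `accuracy_drop_ok` and the 0.01 default threshold are hypothetical and not part of TAO.

```python
def accuracy_drop_ok(baseline: float, quantized: float, max_drop: float = 0.01) -> bool:
    """Return True if the quantized metric is within max_drop of the baseline.

    Both metrics are assumed to be "higher is better" values on the same
    scale (for example, top-1 accuracy or mAP as fractions).
    """
    return (baseline - quantized) <= max_drop


# Placeholder numbers for illustration only; substitute your measured metrics.
if not accuracy_drop_ok(baseline=0.80, quantized=0.795):
    raise SystemExit("Quantized model lost too much accuracy; consider another backend or mode.")
```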

Minimal example (RT-DETR)

quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"            # or "modelopt"
  mode: "weight_only_ptq"       # use with backend "torchao"
  # mode: "static_ptq"          # use with backend "modelopt"
  default_layer_dtype: "native"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

Limitations

Current backends: torchao (weight-only PTQ) and modelopt (static PTQ).
Modes: PTQ only; QAT support in TAO Quant is planned but not yet available here.
Dtypes: INT8 and FP8 (E4M3FN/E5M2).
Tasks: classification_pyt and rtdetr.
Runtime: PyTorch; ONNX/TensorRT export is experimental.
For the most up-to-date, comprehensive list, see Limitations and Current Status.
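
To make the two FP8 variants named above more concrete, the sketch below derives their largest finite values from the bit layout. It assumes the usual conventions: E5M2 reserves the top exponent code for infinities and NaN, as IEEE formats do, while E4M3FN has no infinities and reserves only the all-ones bit pattern for NaN. The function `fp8_max_finite` is hypothetical and for illustration only.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of an FP8 format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2).
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # "FN" variant: no infinities; only the all-ones mantissa in the
        # top exponent code is NaN (E4M3FN).
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)


print(fp8_max_finite(4, 3, ieee_like=False))  # E4M3FN: 448.0
print(fp8_max_finite(5, 2, ieee_like=True))   # E5M2: 57344.0
```

E4M3FN trades range for an extra mantissa bit, while E5M2 covers a much wider range with coarser steps.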