Quantizing a model in TAO (TAO Quant)#
TAO Quant is an extensible library integrated into TAO Toolkit that provides quantization capabilities. It currently offers Post-Training Quantization (PTQ) for both PyTorch and ONNX models to reduce inference latency and memory footprint while preserving accuracy. Use this page to select a backend and get started. Refer to the subpages for complete details.
Quantization at a Glance#
Quantization converts high-precision numbers (FP32) in your model to lower precision (INT8/FP8), making inference faster and reducing its memory footprint.
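For a concrete picture, one common affine formulation (generic background, not a description of TAO's internals) maps a floating-point value x to an integer q using a scale s and zero point z:

$$
q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right),
\qquad \hat{x} = s \cdot (q - z)
$$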
In TAO, you can do this without retraining by using PTQ: you load a trained model, run prepare/calibrate (if needed), and produce a quantized checkpoint.
Trade-offs depend on the approach you choose (refer to the section below). Always validate accuracy on your data.
Weight-Only vs Weights+Activations#
Weight-only PTQ (e.g., TorchAO)

- Pros: simplest to run; no calibration loop; often minimal accuracy impact; works broadly.
- Cons: activations remain in floating point, so speedups and compression are modest; effectiveness depends on kernel/runtime support.

Static PTQ for weights+activations (e.g., ModelOpt PyTorch and ONNX)

- Pros: larger speed and memory wins from quantizing both weights and activations; more control (algorithms, per-layer settings).
- Cons: needs calibration on representative data; more knobs to tune; accuracy can drop if the data is not representative; requires a supported runtime.
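The difference shows up directly in the layers block of the quantize spec. The two alternative layers blocks below are a minimal sketch reusing the schema from the Quick Start on this page; the int8 dtype is illustrative only:

# Weight-only PTQ (torchao): only the weights get a low-precision dtype
layers:
  - module_name: "Linear"
    weights: { dtype: "int8" }

# Static PTQ (modelopt.*): weights and activations are both quantized,
# which is why a calibration pass over representative data is required
layers:
  - module_name: "Linear"
    weights: { dtype: "int8" }
    activations: { dtype: "int8" }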
Learn more about the backends:
- TorchAO: pytorch/ao
- NVIDIA ModelOpt: NVIDIA/TensorRT-Model-Optimizer (supports both PyTorch and ONNX formats)
TAO Quant Vision#
- Unified, friendly API: one configuration schema and command across tasks.
- Pluggable backends: choose TorchAO or ModelOpt today; bring your own backend with a small adapter.
- Safe defaults: sensible dtypes and modes that work out of the box for common models.
- Clear workflows: a quick start for novices; deeper backend pages for advanced users.
- Growing coverage: starts with classification_pyt and rtdetr, expanding with community and NVIDIA backends. Support for QAT (Quantization-Aware Training) is coming soon.
Choosing a Backend#
TAO Quant supports three backends with different capabilities and use cases:
torchao

- Purpose: Weight-only PTQ for PyTorch models
- Best for: Quick quantization experiments; minimal accuracy drop; broad model support
- Limitations: Activations remain FP32; modest speedups; depends on kernel support
- Input: PyTorch checkpoint (.pth or .ckpt)
- Output: PyTorch state dict
modelopt.pytorch

- Purpose: Static PTQ with calibration for PyTorch models
- Best for: Experimenting with weight+activation quantization; prototyping
- Limitations: Uses fake-quant ops in the PyTorch runtime, so speedups are limited; the focus is accurate calibration scales rather than deployment speed
- Input: PyTorch checkpoint (.pth or .ckpt)
- Output: PyTorch checkpoint with calibrated scales
modelopt.onnx (Recommended for NVIDIA TensorRT™ deployment)

- Purpose: Static PTQ with calibration for ONNX models
- Best for: Production TensorRT deployment; maximum runtime performance gains
- Limitations: Requires a pre-exported ONNX model; no per-layer mixed precision (the first dtype applies globally)
- Input: ONNX model file (.onnx)
- Output: Quantized ONNX model ready for TensorRT
- Why preferred: When deployed to TensorRT, it provides the best runtime speedups and memory savings. The ONNX format ensures compatibility with TensorRT's optimized kernels and allows full hardware acceleration.
Decision guide: Use modelopt.onnx if your target is TensorRT inference. Use torchao or modelopt.pytorch for quick PyTorch experiments or when ONNX export is not available.
Quick Start#
1. Ensure you have a trained model checkpoint (PyTorch for torchao / modelopt.pytorch; an ONNX model for modelopt.onnx).
2. Add a quantize section to your experiment spec and run the task-specific quantize command.
Example (RT-DETR with PyTorch backend):
quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"        # or "modelopt.pytorch"
  mode: "weight_only_ptq"   # torchao
  # mode: "static_ptq"      # modelopt.pytorch
  # algorithm: "minmax"     # for modelopt.pytorch
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
Example (ONNX model quantization for TensorRT):
quantize:
  model_path: "/path/to/model.onnx"     # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                      # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"                  # Global quantization setting
      weights: { dtype: "fp8_e5m2" }    # or "fp8_e4m3fn", "int8"
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]               # Skip output head layers
Evaluate or run inference with the quantized checkpoint by setting evaluate.is_quantized or inference.is_quantized to true and pointing to the produced artifact. The PyTorch backends save artifacts as quantized_model_<backend>.pth; the ONNX backend saves quantized_model.onnx.
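For example, an evaluation spec for a TorchAO-quantized RT-DETR model might look like the following sketch; the checkpoint field name and the paths are assumptions to adapt to your task's evaluate spec:

evaluate:
  is_quantized: true
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"   # artifact produced by the quantize command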
What’s Supported#
The supported backends are:

- torchao (weight-only PTQ for PyTorch models)
- modelopt.pytorch (static PTQ for PyTorch models)
- modelopt.onnx (static PTQ for ONNX models)

The supported modes are PTQ only: weight-only PTQ via TorchAO, and static PTQ via ModelOpt for both PyTorch and ONNX models.

The supported dtypes are INT8 and FP8 (E4M3FN/E5M2); float8_* aliases in configurations are accepted and normalized.

The supported tasks are classification_pyt and rtdetr.

The supported runtimes are PyTorch; ONNX/TensorRT export is experimental.
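As an illustration of the alias handling, the two weights entries below are treated as equivalent after normalization (a sketch; the float8_e4m3fn spelling is assumed to be one of the accepted float8_* aliases):

layers:
  - module_name: "Linear"
    weights: { dtype: "fp8_e4m3fn" }        # canonical form
    # weights: { dtype: "float8_e4m3fn" }   # assumed alias, normalized to fp8_e4m3fn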
Dive Deeper#
Contents
- 1. Terminology
- 2. Getting Started
- 3. Choosing a Backend
- 4. Configuration
- 5. ModelQuantizationConfig Fields
- 6. Skipping Layers From Quantization
- 7. TorchAO Backend (Weight-Only PTQ)
- 8. ModelOpt PyTorch Backend (Static PTQ)
- 9. ModelOpt ONNX Backend (Static PTQ)
- 10. API Reference
- 11. Extending TAO Quant With a Custom Backend
- 12. Limitations and Current Status
Workflows#
- Classification PTQ: Classification quantization workflow
- RT-DETR PTQ: RT-DETR quantization workflow