Quantizing a model in TAO (TAO Quant)#

TAO Quant is an extensible library integrated into TAO Toolkit that provides quantization capabilities. It currently offers Post-Training Quantization (PTQ) for both PyTorch and ONNX models to reduce inference latency and memory footprint while preserving accuracy. Use this page to select a backend and get started. Refer to the subpages for complete details.

Quantization at a Glance#

  • Quantization converts high-precision numbers (FP32) in your model to lower precision (INT8/FP8) to make inference faster and smaller in memory.

  • In TAO, you can do this without retraining using PTQ. You load a trained model, run prepare/calibrate (if needed), and produce a quantized checkpoint.

  • Trade-offs depend on the approach you choose (refer to the section below). Always validate accuracy on your data.

Weight-Only vs Weights+Activations#

  • Weight-only PTQ (e.g., TorchAO)

    • Pros: simplest to run; no calibration loop; often minimal accuracy impact; works broadly.

    • Cons: activations remain in floating point, so speedups and compression are modest; effectiveness depends on kernel and runtime support.

  • Static PTQ for weights+activations (e.g., ModelOpt PyTorch and ONNX)

    • Pros: larger speed and memory wins by quantizing both weights and activations; more control (algorithms, per-layer settings).

    • Cons: needs calibration on representative data; more knobs to tune; accuracy can drop if the calibration data is not representative; requires a supported runtime.
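
In the quantize spec, this difference shows up in the per-layer configuration: a weight-only backend assigns a dtype to weights only, while a static PTQ backend assigns dtypes to activations as well. A minimal sketch reusing the layer schema from the Quick Start examples below (the int8 activation setting is illustrative, not a recommended default):

# Weight-only PTQ (e.g., torchao): only weights get a dtype
layers:
  - module_name: "Linear"
    weights: { dtype: "int8" }

# Static PTQ (e.g., modelopt.pytorch): weights and activations both get a dtype
layers:
  - module_name: "Linear"
    weights: { dtype: "int8" }
    activations: { dtype: "int8" }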

Learn more about each backend on its dedicated subpage.

TAO Quant Vision#

  • Unified, friendly API: one configuration schema and command across tasks.

  • Pluggable backends: choose TorchAO or ModelOpt today; bring your own backend with a small adapter.

  • Safe defaults: sensible dtypes and modes that work out of the box for common models.

  • Clear workflows: quick start for novices; deeper backend pages for advanced users.

  • Growing coverage: start with classification_pyt and rtdetr, expand with community and NVIDIA backends.

  • Support for QAT (Quantization Aware Training) coming soon.

Choosing a Backend#

TAO Quant supports three backends with different capabilities and use cases:

  • torchao

    • Purpose: Weight-only PTQ for PyTorch models

    • Best for: Quick quantization experiments; minimal accuracy drop; broad model support

    • Limitations: Activations remain FP32; modest speedups; depends on kernel support

    • Input: PyTorch checkpoint (.pth or .ckpt)

    • Output: PyTorch state dict

  • modelopt.pytorch

    • Purpose: Static PTQ with calibration for PyTorch models

    • Best for: Experimenting with weight+activation quantization; prototyping

    • Limitations: Runs fake-quant ops in the PyTorch runtime, so speedups are limited; the focus is on producing accurate calibration scales rather than runtime acceleration

    • Input: PyTorch checkpoint (.pth or .ckpt)

    • Output: PyTorch checkpoint with calibrated scales

  • modelopt.onnx (Recommended for NVIDIA TensorRT™ deployment)

    • Purpose: Static PTQ with calibration for ONNX models

    • Best for: Production TensorRT deployment; maximum runtime performance gains

    • Limitations: Requires pre-exported ONNX model; no mixed-precision per layer (first dtype applies globally)

    • Input: ONNX model file (.onnx)

    • Output: Quantized ONNX model ready for TensorRT

    • Why preferred: When the quantized model is deployed to TensorRT, this path provides the best runtime speedups and memory savings. The ONNX format ensures compatibility with TensorRT’s optimized kernels and allows full hardware acceleration.

Decision guide: Use modelopt.onnx if your target is TensorRT inference. Use torchao or modelopt.pytorch for quick PyTorch experiments or when ONNX export is not available.

Quick Start#

  1. Ensure you have a trained model checkpoint (PyTorch for torchao/modelopt.pytorch; ONNX for modelopt.onnx).

  2. Add a quantize section to your experiment spec and run the task-specific quantize command (see the examples and sample invocation below).

Example (RT-DETR with PyTorch backend):

quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"                 # or "modelopt.pytorch"
  mode: "weight_only_ptq"            # torchao
  # mode: "static_ptq"               # modelopt.pytorch
  # algorithm: "minmax"              # for modelopt.pytorch
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

Example (ONNX model quantization for TensorRT):

quantize:
  model_path: "/path/to/model.onnx"  # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                   # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"               # Global quantization setting
      weights: { dtype: "fp8_e5m2" }  # or "fp8_e4m3fn", "int8"
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]            # Skip output head layers
  3. Evaluate or infer with the quantized checkpoint by setting evaluate.is_quantized or inference.is_quantized to true and pointing to the produced artifact. A PyTorch backend saves artifacts as quantized_model_<backend>.pth; the ONNX backend saves as quantized_model.onnx.
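
For example, a minimal evaluate section for the torchao artifact produced above might look like this (a sketch; the checkpoint field name follows common TAO task specs and may vary by task):

evaluate:
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"
  is_quantized: true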

What’s Supported#

  • The supported backends are:

    • torchao (weight-only PTQ for PyTorch models)

    • modelopt.pytorch (static PTQ for PyTorch models)

    • modelopt.onnx (static PTQ for ONNX models)

  • The supported modes are PTQ only: weight-only PTQ (via TorchAO) and static PTQ (via ModelOpt, for both PyTorch and ONNX).

  • The supported dtypes are INT8 and FP8 (E4M3FN/E5M2). float8_* aliases in configurations are accepted and normalized (see the snippet after this list).

  • The supported tasks are classification_pyt, rtdetr.

  • The supported runtimes are PyTorch; ONNX/TensorRT export is experimental.
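
For instance, the following two weight settings should be equivalent after normalization (a sketch that assumes the float8_* alias maps to the matching fp8_* name):

weights: { dtype: "fp8_e4m3fn" }      # canonical name
weights: { dtype: "float8_e4m3fn" }   # accepted alias, normalized to fp8_e4m3fn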

Dive Deeper#

Workflows#