Quantizing a model in TAO (TAO Quant)#

TAO Quant is an extensible library integrated into TAO Toolkit that provides quantization capabilities. It currently offers Post-Training Quantization (PTQ) for both PyTorch and ONNX models to reduce inference latency and memory footprint while preserving accuracy. Use this page to select a backend and get started. Refer to the subpages for complete details.

Quantization at a Glance#

  • Quantization converts high-precision numbers (FP32) in your model to lower precision (INT8/FP8) to make inference faster and smaller in memory.

  • In TAO, you can do this without retraining using PTQ. You load a trained model, run prepare/calibrate (if needed), and produce a quantized checkpoint.

  • Trade-offs depend on the approach you choose (refer to the section below). Always validate accuracy on your data.

Weight-Only vs Weights+Activations#

  • Weight-only PTQ (e.g., TorchAO)

    • Pros: simplest to run; no calibration loop; often minimal accuracy impact; works broadly.

    • Cons: activations remain in floating point, so speedups and compression are modest; effectiveness depends on kernel and runtime support.

  • Static PTQ for weights+activations (e.g., ModelOpt PyTorch and ONNX)

    • Pros: larger speed and memory wins by quantizing both weights and activations; more control (algorithms, per-layer settings).

    • Cons: needs calibration on representative data; more knobs to tune; accuracy can drop if the calibration data is not representative; requires a supported runtime.
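
In the quantize spec, this difference shows up in the per-layer configuration: a weight-only backend assigns a dtype to weights only, while a static PTQ backend assigns dtypes to activations as well. A minimal sketch reusing the layer schema from the Quick Start examples below (the int8 activation setting is illustrative, not a recommended default):

# Weight-only PTQ (e.g., torchao): only weights get a dtype
layers:
  - module_name: "Linear"
    weights: { dtype: "int8" }

# Static PTQ (e.g., modelopt.pytorch): weights and activations both get a dtype
layers:
  - module_name: "Linear"
    weights: { dtype: "int8" }
    activations: { dtype: "int8" }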

Learn more about each backend on its dedicated subpage.

TAO Quant Vision#

  • Unified, friendly API: one configuration schema and command across tasks.

  • Pluggable backends: choose TorchAO or ModelOpt today; bring your own backend with a small adapter.

  • Safe defaults: sensible dtypes and modes that work out of the box for common models.

  • Clear workflows: quick start for novices; deeper backend pages for advanced users.

  • Growing coverage: start with classification_pyt and rtdetr, expand with community and NVIDIA backends.

  • Support for QAT (Quantization Aware Training) coming soon.

Choosing a Backend#

TAO Quant supports three backends with different capabilities and use cases:

  • torchao

    • Purpose: Weight-only PTQ for PyTorch models

    • Best for: Quick quantization experiments; minimal accuracy drop; broad model support

    • Limitations: Activations remain FP32; modest speedups; depends on kernel support

    • Input: PyTorch checkpoint (.pth or .ckpt)

    • Output: PyTorch state dict

  • modelopt.pytorch

    • Purpose: Static PTQ with calibration for PyTorch models

    • Best for: Experimenting with weight+activation quantization; prototyping

    • Limitations: Runs fake-quant ops in the PyTorch runtime, so speedups are limited; the focus is on producing accurate calibration scales rather than runtime acceleration

    • Input: PyTorch checkpoint (.pth or .ckpt)

    • Output: PyTorch checkpoint with calibrated scales

  • modelopt.onnx (Recommended for NVIDIA TensorRT™ deployment)

    • Purpose: Static PTQ with calibration for ONNX models

    • Best for: Production TensorRT deployment; maximum runtime performance gains

    • Limitations: Requires pre-exported ONNX model; no mixed-precision per layer (first dtype applies globally)

    • Input: ONNX model file (.onnx)

    • Output: Quantized ONNX model ready for TensorRT

    • Why preferred: When the quantized model is deployed to TensorRT, this path provides the best runtime speedups and memory savings. The ONNX format ensures compatibility with TensorRT’s optimized kernels and allows full hardware acceleration.

Decision guide: Use modelopt.onnx if your target is TensorRT inference. Use torchao or modelopt.pytorch for quick PyTorch experiments or when ONNX export is not available.

Quick Start#

  1. Ensure you have a trained model checkpoint (PyTorch for torchao/modelopt.pytorch; ONNX for modelopt.onnx).

  2. Add a quantize section to your experiment spec and run the task-specific quantize command (see the examples and sample invocation below).

Example (RT-DETR with PyTorch backend):

quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"                 # or "modelopt.pytorch"
  mode: "weight_only_ptq"            # torchao
  # mode: "static_ptq"               # modelopt.pytorch
  # algorithm: "minmax"              # for modelopt.pytorch
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

Example (ONNX model quantization for TensorRT):

quantize:
  model_path: "/path/to/model.onnx"  # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                   # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"               # Global quantization setting
      weights: { dtype: "fp8_e5m2" }  # or "fp8_e4m3fn", "int8"
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]            # Skip output head layers
  3. Evaluate or infer with the quantized checkpoint by setting evaluate.is_quantized or inference.is_quantized to true and pointing to the produced artifact. A PyTorch backend saves artifacts as quantized_model_<backend>.pth; the ONNX backend saves as quantized_model.onnx.
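
For example, a minimal evaluate section for the torchao artifact produced above might look like this (a sketch; the checkpoint field name follows common TAO task specs and may vary by task):

evaluate:
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"
  is_quantized: true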

What’s Supported#

  • The supported backends are:

    • torchao (weight-only PTQ for PyTorch models)

    • modelopt.pytorch (static PTQ for PyTorch models)

    • modelopt.onnx (static PTQ for ONNX models)

  • The supported modes are PTQ only: weight-only PTQ (via TorchAO) and static PTQ (via ModelOpt, for both PyTorch and ONNX).

  • The supported dtypes are INT8 and FP8 (E4M3FN/E5M2). float8_* aliases in configurations are accepted and normalized (see the snippet after this list).

  • The supported tasks are classification_pyt, rtdetr.

  • The supported runtimes are PyTorch; ONNX/TensorRT export is experimental.
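
For instance, the following two weight settings should be equivalent after normalization (a sketch that assumes the float8_* alias maps to the matching fp8_* name):

weights: { dtype: "fp8_e4m3fn" }      # canonical name
weights: { dtype: "float8_e4m3fn" }   # accepted alias, normalized to fp8_e4m3fn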

Dive Deeper#

Workflows#