Quantizing a model in TAO (TAO Quant)#

TAO Quant is an extensible quantization library integrated into the TAO Toolkit. It currently offers Post-Training Quantization (PTQ) for PyTorch models in TAO, reducing inference latency and memory footprint while preserving accuracy. Use this page to select a backend and get started; refer to the subpages for complete details.

Quantization at a glance#

  • Quantization converts the high-precision numbers (FP32) in your model to lower precision (INT8/FP8) so that inference runs faster and the model takes less memory; for example, INT8 weights use one byte per parameter instead of four, roughly a 4x reduction in weight storage.

  • In TAO, you can do this without retraining using PTQ. You load a trained model, run prepare/calibrate (if needed), and produce a quantized checkpoint.

  • Trade-offs depend on the approach you choose (see below). Always validate accuracy on your data.

Weight-only vs weights+activations#

  • Weight-only PTQ (e.g., TorchAO):
    - Pros: simplest to run; no calibration loop; often minimal accuracy impact; works broadly.
    - Cons: activations remain in floating point, so speed and memory gains are modest; effectiveness depends on kernel/runtime support.

  • Static PTQ for weights+activations (e.g., ModelOpt):
    - Pros: larger speed and memory wins from quantizing both weights and activations; more control (algorithms, per-layer settings).
    - Cons: needs calibration on representative data; more knobs to tune; accuracy can drop if the calibration data is not representative; requires a supported runtime.

A minimal configuration sketch contrasting the two approaches follows.
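
The sketch reuses the field names from the RT-DETR example later on this page; the per-layer activations entry and the ModelOpt calibration details are assumptions here, so check the backend subpages for the authoritative fields.

# Weight-only PTQ (TorchAO): only weights are quantized; no calibration loop.
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

# Static PTQ (ModelOpt): weights and activations are quantized;
# calibration on representative data is required (configured as described on the ModelOpt subpage).
quantize:
  backend: "modelopt"
  mode: "static_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }   # assumed per-layer activation entry; see the ModelOpt subpage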

Learn more about each backend on its dedicated subpage.

TAO Quant vision#

  • Unified, friendly API: one configuration schema and command across tasks.

  • Pluggable backends: choose TorchAO or ModelOpt today; bring your own backend with a small adapter.

  • Safe defaults: sensible dtypes and modes that work out of the box for common models.

  • Clear workflows: quick start for novices; deeper backend pages for advanced users.

  • Growing coverage: start with classification_pyt and rtdetr, expand with community and NVIDIA backends.

  • Support for QAT (Quantization Aware Training) coming soon.

Quick start#

  1. Pick a backend:

  • torchao: weight-only PTQ (INT8/FP8 weights). No calibration loop. Fast and simple. Activation settings are ignored.

  • modelopt: static PTQ with calibration. Weights and activations (INT8/FP8). More control.

  2. Add a quantize section to your experiment spec and run the task-specific quantize command.

Example (RT-DETR):

quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"            # or "modelopt"
  mode: "weight_only_ptq"       # torchao
  # mode: "static_ptq"          # modelopt
  default_layer_dtype: "native"   # currently ignored by backends; set per-layer
  default_activation_dtype: "native"  # ignored by torchao; set per-layer for modelopt
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

  3. Evaluate or infer with the quantized checkpoint by setting evaluate.is_quantized or inference.is_quantized to true and pointing to the produced artifact. Artifacts are saved under results_dir as quantized_model_<backend>.pth.
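
For example, a minimal evaluate section for a TorchAO-quantized RT-DETR checkpoint might look like the sketch below; the checkpoint field name is assumed from other TAO PyTorch evaluation specs, so adjust it to your task's spec.

evaluate:
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"   # artifact written by the quantize run
  is_quantized: true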

What’s supported#

  • Backends: torchao (weight-only PTQ), modelopt (static PTQ).

  • Modes: weight_only_ptq (TorchAO) and static_ptq (ModelOpt).

  • Dtypes: INT8 and FP8 (E4M3FN/E5M2). float8_* aliases in configurations are accepted and normalized (see the sketch after this list).

  • Tasks: classification_pyt, rtdetr.

  • Runtime: PyTorch; ONNX/TensorRT export is experimental.
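
As an illustration of an FP8 per-layer setting, the sketch below uses a float8_* alias as noted above; the exact dtype strings are assumptions, so confirm the accepted spellings on the backend subpages.

layers:
  - module_name: "Linear"
    weights: { dtype: "float8_e4m3fn" }   # FP8 E4M3FN weights via a float8_* alias (string assumed); an E5M2 variant would be named analogously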

Dive deeper#

Workflows#