Quantizing a model in TAO (TAO Quant)#

TAO Quant is an extensible quantization library integrated into the TAO Toolkit. It currently offers Post-Training Quantization (PTQ) for PyTorch models in TAO, reducing inference latency and memory footprint while preserving accuracy. Use this page to select a backend and get started; refer to the subpages for complete details.

Quantization at a glance#

  • Quantization converts the high-precision numbers (FP32) in your model to lower precision (INT8/FP8) so that inference runs faster and the model takes less memory; for example, INT8 weights use one byte per parameter instead of four, roughly a 4x reduction in weight storage.

  • In TAO, you can do this without retraining using PTQ. You load a trained model, run prepare/calibrate (if needed), and produce a quantized checkpoint.

  • Trade-offs depend on the approach you choose (see below). Always validate accuracy on your data.

Weight-only vs weights+activations#

  • Weight-only PTQ (e.g., TorchAO):
    - Pros: simplest to run; no calibration loop; often minimal accuracy impact; works broadly.
    - Cons: activations remain in floating point, so speed and memory gains are modest; effectiveness depends on kernel/runtime support.

  • Static PTQ for weights+activations (e.g., ModelOpt):
    - Pros: larger speed and memory wins from quantizing both weights and activations; more control (algorithms, per-layer settings).
    - Cons: needs calibration on representative data; more knobs to tune; accuracy can drop if the calibration data is not representative; requires a supported runtime.

A minimal configuration sketch contrasting the two approaches follows.
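
The sketch reuses the field names from the RT-DETR example later on this page; the per-layer activations entry and the ModelOpt calibration details are assumptions here, so check the backend subpages for the authoritative fields.

# Weight-only PTQ (TorchAO): only weights are quantized; no calibration loop.
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

# Static PTQ (ModelOpt): weights and activations are quantized;
# calibration on representative data is required (configured as described on the ModelOpt subpage).
quantize:
  backend: "modelopt"
  mode: "static_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }   # assumed per-layer activation entry; see the ModelOpt subpage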

Learn more about each backend on its dedicated subpage.

TAO Quant vision#

  • Unified, friendly API: one configuration schema and command across tasks.

  • Pluggable backends: choose TorchAO or ModelOpt today; bring your own backend with a small adapter.

  • Safe defaults: sensible dtypes and modes that work out of the box for common models.

  • Clear workflows: quick start for novices; deeper backend pages for advanced users.

  • Growing coverage: start with classification_pyt and rtdetr, expand with community and NVIDIA backends.

  • Support for QAT (Quantization Aware Training) coming soon.

Quick start#

  1. Pick a backend:

  • torchao: weight-only PTQ (INT8/FP8 weights). No calibration loop. Fast and simple. Activation settings are ignored.

  • modelopt: static PTQ with calibration. Weights and activations (INT8/FP8). More control.

  2. Add a quantize section to your experiment spec and run the task-specific quantize command.

Example (RT-DETR):

quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"            # or "modelopt"
  mode: "weight_only_ptq"       # torchao
  # mode: "static_ptq"          # modelopt
  default_layer_dtype: "native"   # currently ignored by backends; set per-layer
  default_activation_dtype: "native"  # ignored by torchao; set per-layer for modelopt
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

  3. Evaluate or infer with the quantized checkpoint by setting evaluate.is_quantized or inference.is_quantized to true and pointing to the produced artifact. Artifacts are saved under results_dir as quantized_model_<backend>.pth.
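
For example, a minimal evaluate section for a TorchAO-quantized RT-DETR checkpoint might look like the sketch below; the checkpoint field name is assumed from other TAO PyTorch evaluation specs, so adjust it to your task's spec.

evaluate:
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"   # artifact written by the quantize run
  is_quantized: true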

What’s supported#

  • Backends: torchao (weight-only PTQ), modelopt (static PTQ).

  • Modes: weight_only_ptq (TorchAO) and static_ptq (ModelOpt).

  • Dtypes: INT8 and FP8 (E4M3FN/E5M2). float8_* aliases in configurations are accepted and normalized (see the sketch after this list).

  • Tasks: classification_pyt, rtdetr.

  • Runtime: PyTorch; ONNX/TensorRT export is experimental.
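
As an illustration of an FP8 per-layer setting, the sketch below uses a float8_* alias as noted above; the exact dtype strings are assumptions, so confirm the accepted spellings on the backend subpages.

layers:
  - module_name: "Linear"
    weights: { dtype: "float8_e4m3fn" }   # FP8 E4M3FN weights via a float8_* alias (string assumed); an E5M2 variant would be named analogously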

Dive deeper#

Workflows#