Quantizing a model in TAO (TAO Quant)#
TAO Quant is an extensible quantization library integrated into the TAO Toolkit. It currently offers Post-Training Quantization (PTQ) for PyTorch models in TAO to reduce inference latency and memory footprint while preserving accuracy. Use this page to choose a backend and get started; refer to the subpages for complete details.
Quantization at a glance#
Quantization converts the high-precision values (FP32) in your model to lower-precision formats (INT8/FP8), making inference faster and reducing memory use.
In TAO, you can do this without retraining by using PTQ: load a trained model, run the prepare/calibrate steps (if needed), and produce a quantized checkpoint.
Trade-offs depend on the approach you choose (see below). Always validate accuracy on your data.
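
As a rough mental model of what the conversion does (an illustrative sketch, not TAO or backend code; the helper names are made up), the snippet below quantizes a single FP32 tensor to INT8 with one symmetric scale and measures the round-trip error. Real backends choose scales per tensor or per channel and fuse the arithmetic into optimized kernels.

```python
import torch

def int8_symmetric_quantize(x: torch.Tensor):
    """Illustrative symmetric INT8 quantization of one tensor (not TAO's implementation)."""
    scale = x.abs().max() / 127.0                       # map the largest magnitude onto the INT8 range
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate FP32 tensor from the INT8 values and the scale."""
    return q.to(torch.float32) * scale

w = torch.randn(64, 64)                                 # stand-in for an FP32 weight matrix
q, scale = int8_symmetric_quantize(w)
print("max abs round-trip error:", (w - dequantize(q, scale)).abs().max().item())
```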
Weight-only vs weights+activations#
Weight-only PTQ (e.g., TorchAO):
- Pros: simplest to run, no calibration loop, often minimal accuracy impact, works broadly.
- Cons: activations remain in floating point, so speed and compression gains are modest; effectiveness depends on the available kernels/runtime.

Static PTQ for weights+activations (e.g., ModelOpt):
- Pros: larger speed/memory wins from quantizing both weights and activations; more control (algorithms, per-layer settings).
- Cons: needs calibration on representative data; more knobs to tune; accuracy can drop if the calibration data is not representative; requires a supported runtime.
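
To make the trade-off concrete, here is a hedged sketch of what each backend's native entry point looks like when used directly (TAO drives this for you through the spec; the imports and config names below follow the upstream TorchAO and ModelOpt APIs and may differ between versions):

```python
import torch
from torch import nn

def make_model():
    return nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Weight-only PTQ (TorchAO): no calibration loop, only the weights change representation.
from torchao.quantization import quantize_, int8_weight_only

weight_only_model = make_model()
quantize_(weight_only_model, int8_weight_only())    # modifies the model in place

# Static PTQ (ModelOpt): calibration forward passes observe activation ranges
# before both weights and activations are quantized.
import modelopt.torch.quantization as mtq

def forward_loop(m):
    for _ in range(8):                              # a few representative calibration batches
        m(torch.randn(4, 512))

static_model = mtq.quantize(make_model(), mtq.INT8_DEFAULT_CFG, forward_loop)
```

In TAO, the same choice is made declaratively with the `backend` and `mode` fields shown in the quick start below.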
Learn more about the backends:
- TorchAO: pytorch/ao
- NVIDIA ModelOpt: NVIDIA/TensorRT-Model-Optimizer
TAO Quant vision#
- Unified, friendly API: one configuration schema and command across tasks.
- Pluggable backends: choose TorchAO or ModelOpt today; bring your own backend with a small adapter.
- Safe defaults: sensible dtypes and modes that work out of the box for common models.
- Clear workflows: quick start for novices; deeper backend pages for advanced users.
- Growing coverage: start with `classification_pyt` and `rtdetr`, then expand with community and NVIDIA backends.

Support for QAT (Quantization-Aware Training) is coming soon.
Quick start#
Pick a backend:

- `torchao`: weight-only PTQ (INT8/FP8 weights). No calibration loop. Fast and simple. Activation settings are ignored.
- `modelopt`: static PTQ with calibration. Weights and activations (INT8/FP8). More control.

Add a `quantize` section to your experiment spec and run the task-specific `quantize` command.
Example (RT-DETR):
```yaml
quantize:
  model_path: "/path/to/trained_rtdetr.ckpt"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"                    # or "modelopt"
  mode: "weight_only_ptq"               # torchao
  # mode: "static_ptq"                  # modelopt
  default_layer_dtype: "native"         # currently ignored by backends; set per-layer
  default_activation_dtype: "native"    # ignored by torchao; set per-layer for modelopt
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
```
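
The `layers` list selects which modules receive per-layer settings. As a hypothetical illustration of the idea (not TAO's actual matching logic; see the backend pages for the real rules), an entry like `module_name: "Linear"` can be read as "apply this weight dtype to every module whose class name is Linear":

```python
from torch import nn

def select_modules(model: nn.Module, module_name: str):
    """Hypothetical per-layer selection by class name, for illustration only."""
    return [
        (name, module)
        for name, module in model.named_modules()
        if type(module).__name__ == module_name
    ]

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
for name, module in select_modules(model, "Linear"):
    print(f"would quantize weights of '{name}' to int8")
```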
Evaluate or infer with the quantized checkpoint by setting `evaluate.is_quantized` or `inference.is_quantized` to `true` and pointing to the produced artifact. Artifacts are saved under `results_dir` as `quantized_model_<backend>.pth`.
What’s supported#
- Backends: `torchao` (weight-only PTQ), `modelopt` (static PTQ).
- Modes: PTQ (weight-only PTQ via TorchAO; static PTQ via ModelOpt).
- Dtypes: INT8, FP8 (E4M3FN/E5M2). `float8_*` aliases in configurations are accepted and normalized.
- Tasks: `classification_pyt`, `rtdetr`.
- Runtime: PyTorch; ONNX/TensorRT export is experimental.
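
The two FP8 formats trade precision for range differently: E4M3FN keeps more mantissa bits (finer precision, maximum around 448), while E5M2 keeps more exponent bits (coarser precision, maximum around 57344). PyTorch exposes both as native dtypes, which is also where the `float8_*` alias names come from; the small check below prints their ranges (assuming a PyTorch build with FP8 dtypes, 2.1 or newer):

```python
import torch

# Compare the numeric range of the two FP8 formats accepted by TAO Quant.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}")
```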
Dive deeper#
Workflows#
- Classification PTQ: Classification quantization workflow
- RT-DETR PTQ: RT-DETR quantization workflow