5. TorchAO backend (weight-only PTQ)
5.1. Overview
Weight-only PTQ for INT8 and FP8 (E4M3FN/E5M2) weights.
Activation settings are ignored.
Layer-wise control via module_name patterns.
5.2. Supported options
- mode: weight_only_ptq.
- weights.dtype: int8, fp8_e4m3fn, or fp8_e5m2.
- default_layer_dtype: ignored by this backend; specify weights.dtype per layer.
- skip_names: remove modules from quantization.
5.3. How it works
Internally, TAO builds a per-module mapping from your patterns to TorchAO configs and calls torchao.quantization.quantize_
on a deep copy of the model.
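The pattern-to-config resolution can be sketched as follows. This is a minimal illustration under assumptions: the resolve_dtype helper, the rule schema, and the glob-style name matching are hypothetical, not the backend's actual internals.

```python
import fnmatch

def resolve_dtype(fqn, module_type, layer_rules, skip_names):
    """Map one module to a weight dtype from the layer rules, or None to skip.

    layer_rules: list of {"module_name": pattern, "weights": {"dtype": ...}}
    fqn: the module's fully qualified name in the model.
    """
    if fqn in skip_names:
        return None  # skip_names removes the module from quantization
    for rule in layer_rules:
        pattern = rule["module_name"]
        # A pattern may name a module class (e.g. "Linear") or match the
        # fully qualified name with shell-style wildcards (assumed here).
        if module_type == pattern or fnmatch.fnmatch(fqn, pattern):
            return rule["weights"]["dtype"]
    return None  # no rule matched; leave the module unquantized

rules = [{"module_name": "Linear", "weights": {"dtype": "int8"}}]
print(resolve_dtype("encoder.fc1", "Linear", rules, ["classifier.fc"]))  # int8
print(resolve_dtype("classifier.fc", "Linear", rules, ["classifier.fc"]))  # None
```

The resulting per-module mapping is what gets translated into TorchAO configs before torchao.quantization.quantize_ is applied to the model copy.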
5.4. Example configs
INT8 on all Linear layers:
```yaml
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  default_layer_dtype: "int8"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
```
FP8 E4M3FN on all Linear layers, skipping the classifier head:
```yaml
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  default_layer_dtype: "fp8_e4m3fn"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "fp8_e4m3fn" }
  skip_names: ["classifier.fc"]
```
5.5. Outputs
The saved artifact in results_dir is named quantized_model_torchao.pth (a state_dict).
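A sketch of reloading the artifact, assuming the default filename above. The placeholder model, the results_dir value, and the round-trip save are illustrative only; a real checkpoint must be loaded into a model whose structure matches the quantized state_dict.

```python
import os
import torch
import torch.nn as nn

results_dir = "results"  # assumed; use your configured results_dir
path = os.path.join(results_dir, "quantized_model_torchao.pth")

# Placeholder standing in for the real model class. Because the artifact
# is a state_dict, the model must be constructed before loading into it.
model = nn.Linear(4, 2)

# For this self-contained sketch only: write a state_dict so the load works.
os.makedirs(results_dir, exist_ok=True)
torch.save(model.state_dict(), path)

state_dict = torch.load(path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```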
5.6. Limitations
Weight-only: activations are not quantized.
No calibration loop.
Speedups depend on runtime kernel support and are model-dependent.
5.7. External links
TorchAO (PyTorch native quantization and sparsity): pytorch/ao.