5. TorchAO backend (weight-only PTQ)
5.1. Overview
Weight-only PTQ for INT8 and FP8 (E4M3FN/E5M2) weights.
Activation settings are ignored.
Layer-wise control via module_name patterns.
5.2. Supported options
- mode: weight_only_ptq.
- weights.dtype: int8, fp8_e4m3fn, or fp8_e5m2.
- default_layer_dtype: ignored by this backend; specify weights.dtype per layer.
- skip_names: remove modules from quantization.
5.3. How it works
Internally, TAO builds a per-module mapping from your patterns to TorchAO configs and calls torchao.quantization.quantize_
on a deep copy of the model.
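The pattern-to-config resolution can be sketched as follows. This is a minimal illustration under assumptions: the resolve_dtype helper, the rule schema, and the glob-style name matching are hypothetical, not the backend's actual internals.

```python
import fnmatch

def resolve_dtype(fqn, module_type, layer_rules, skip_names):
    """Map one module to a weight dtype from the layer rules, or None to skip.

    layer_rules: list of {"module_name": pattern, "weights": {"dtype": ...}}
    fqn: the module's fully qualified name in the model.
    """
    if fqn in skip_names:
        return None  # skip_names removes the module from quantization
    for rule in layer_rules:
        pattern = rule["module_name"]
        # A pattern may name a module class (e.g. "Linear") or match the
        # fully qualified name with shell-style wildcards (assumed here).
        if module_type == pattern or fnmatch.fnmatch(fqn, pattern):
            return rule["weights"]["dtype"]
    return None  # no rule matched; leave the module unquantized

rules = [{"module_name": "Linear", "weights": {"dtype": "int8"}}]
print(resolve_dtype("encoder.fc1", "Linear", rules, ["classifier.fc"]))  # int8
print(resolve_dtype("classifier.fc", "Linear", rules, ["classifier.fc"]))  # None
```

The resulting per-module mapping is what gets translated into TorchAO configs before torchao.quantization.quantize_ is applied to the model copy.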
5.4. Example configs
INT8 on all Linear layers:
```yaml
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  default_layer_dtype: "int8"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
```
FP8 E4M3FN on all Linear layers, skipping the classifier head:
```yaml
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  default_layer_dtype: "fp8_e4m3fn"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "fp8_e4m3fn" }
  skip_names: ["classifier.fc"]
```
5.5. Outputs
The saved artifact in results_dir is named quantized_model_torchao.pth (a state_dict).
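A sketch of reloading the artifact, assuming the default filename above. The placeholder model, the results_dir value, and the round-trip save are illustrative only; a real checkpoint must be loaded into a model whose structure matches the quantized state_dict.

```python
import os
import torch
import torch.nn as nn

results_dir = "results"  # assumed; use your configured results_dir
path = os.path.join(results_dir, "quantized_model_torchao.pth")

# Placeholder standing in for the real model class. Because the artifact
# is a state_dict, the model must be constructed before loading into it.
model = nn.Linear(4, 2)

# For this self-contained sketch only: write a state_dict so the load works.
os.makedirs(results_dir, exist_ok=True)
torch.save(model.state_dict(), path)

state_dict = torch.load(path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```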
5.6. Limitations
Weight-only: activations are not quantized.
No calibration loop.
Speedups depend on runtime kernel support and are model-dependent.
5.7. External links
TorchAO (PyTorch native quantization and sparsity): pytorch/ao.