5. TorchAO backend (weight-only PTQ)

5.1. Overview

  • Weight-only PTQ for INT8 and FP8 (E4M3FN/E5M2) weights.

  • Activation settings are ignored; activations remain in their native dtype.

  • Layer-wise control via module_name patterns.

5.2. Supported options

  • mode: weight_only_ptq.

  • weights.dtype: int8, fp8_e4m3fn, fp8_e5m2.

  • default_layer_dtype: ignored by this backend; specify weights.dtype per layer (the example configs below still set it, but the value has no effect).

  • skip_names: exclude modules from quantization, matched by fully qualified name (e.g. classifier.fc).
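
To make the dtype options concrete, here is an illustrative sketch of how the weights.dtype strings could map onto TorchAO's weight-only configs. The mapping and the exact TorchAO entry points are assumptions (they vary across TorchAO versions); the backend's internal table is not shown here.

import torch
from torchao.quantization import int8_weight_only, float8_weight_only

# Illustrative mapping from weights.dtype strings to TorchAO weight-only
# configs; the names and kwargs here are assumptions, not the backend's code.
DTYPE_TO_CONFIG = {
    "int8": int8_weight_only(),
    "fp8_e4m3fn": float8_weight_only(weight_dtype=torch.float8_e4m3fn),
    "fp8_e5m2": float8_weight_only(weight_dtype=torch.float8_e5m2),
}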

5.3. How it works

Internally, TAO builds a per-module mapping from your module_name patterns to TorchAO quantization configs, then calls torchao.quantization.quantize_ on a deep copy of the model, leaving the original model unmodified.
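
A minimal, self-contained sketch of that flow (the Net model and make_filter helper are hypothetical; quantize_ and its filter_fn(module, fqn) callback are TorchAO's documented API):

import copy
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.classifier = nn.Linear(16, 4)

    def forward(self, x):
        return self.classifier(self.backbone(x))

def make_filter(pattern, skip_names):
    # Hypothetical matcher: quantize modules whose class name matches the
    # module_name pattern, unless their fully qualified name is skipped.
    def filter_fn(module, fqn):
        return type(module).__name__ == pattern and fqn not in skip_names
    return filter_fn

model = Net()
qmodel = copy.deepcopy(model)  # the original model is left untouched
quantize_(qmodel, int8_weight_only(),
          filter_fn=make_filter("Linear", {"classifier"}))

With this filter, backbone is quantized to INT8 while classifier keeps its float weights.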

5.4. Example configs

INT8 on all Linear layers:

quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  default_layer_dtype: "int8"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

FP8 E4M3FN on all Linear layers, skipping the classifier head:

quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  default_layer_dtype: "fp8_e4m3fn"
  default_activation_dtype: "native"
  layers:
    - module_name: "Linear"
      weights: { dtype: "fp8_e4m3fn" }
  skip_names: ["classifier.fc"]

5.5. Outputs

  • The quantized model's state_dict is saved in results_dir as quantized_model_torchao.pth (see the loading sketch below).
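
A hedged sketch of loading that artifact back, assuming it is a plain state_dict containing TorchAO's quantized tensor subclasses (build_model is a hypothetical constructor for the original float architecture):

import torch

model = build_model()  # hypothetical: rebuild the original architecture
# weights_only=False is needed unless the tensor subclasses are allowlisted
# for safe loading; assign=True swaps the quantized tensors into the module
# instead of copying values into the existing float parameters.
state = torch.load("results_dir/quantized_model_torchao.pth", weights_only=False)
model.load_state_dict(state, assign=True)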

5.6. Limitations

  • Weight-only: activations are not quantized.

  • No calibration loop: quantization parameters are derived from the weights alone, not from sample activations.

  • Speedups depend on runtime kernel support and are model-dependent.