1. Terminology#

This page explains common terms used throughout TAO Quant. It aims to be beginner-friendly and concise.

1.1. Core concepts#

  • Quantization: Converting model numbers from high precision (e.g., FP32) to lower precision (e.g., INT8 or FP8) to reduce memory use and improve speed (see the sketch after this list).

  • PTQ (Post-Training Quantization): Quantize a pretrained model without fine-tuning; may require a small calibration dataset.

  • QAT (Quantization-Aware Training): Train (or fine-tune) the model with fake-quant operators to recover or retain accuracy at lower precision. QAT is not covered by the current TAO Quant release.

  • Weights: Learnable parameters of layers (e.g., kernels, matrices).

  • Activations: Intermediate outputs produced when the model processes inputs.

  • Dtype (Data type): Numeric precision or format, such as int8, fp8_e4m3fn, fp8_e5m2, or native (use original precision).
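The relationship between weights, dtypes, and quantization is easiest to see in code. The following is a minimal, backend-agnostic sketch of symmetric per-tensor INT8 weight quantization written in plain PyTorch for illustration only; it is not how TAO Quant is invoked.

```python
import torch

# Full-precision (FP32) weight tensor, as found in a Linear or Conv layer.
w = torch.randn(64, 128)

# Symmetric per-tensor INT8 quantization (illustrative only):
# one scale maps the FP32 range onto the 8-bit integer range [-127, 127].
scale = w.abs().max() / 127.0
w_int8 = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)

# Dequantize to inspect the approximation error introduced by quantization.
w_dequant = w_int8.to(torch.float32) * scale
print("max abs error:", (w - w_dequant).abs().max().item())
print("memory: fp32 =", w.element_size() * w.nelement(), "bytes,",
      "int8 =", w_int8.element_size() * w_int8.nelement(), "bytes")
```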

1.2. Framework pieces#

  • Backend: A plug-in that implements the quantization steps for a specific library. TAO supports:

      - torchao: Weight-only PTQ; no calibration loop; ignores activation settings.

      - modelopt: Static PTQ with calibration; quantizes weights and activations.

  • Calibration: A short pass over representative data to compute ranges and scales for activations and weights (used by backends like ModelOpt); see the sketch after this list.

  • Observer or fake-quant: Modules inserted into the model to measure value ranges (observers) or to simulate lower-precision behavior (fake-quant) during inference or training.
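To make calibration and observers concrete, here is a minimal conceptual sketch of what a static-PTQ calibration pass does: run a few representative batches through the model while hooks observe activation ranges, then derive a scale per observed module. This is plain PyTorch written for illustration; it does not use the ModelOpt API, and the toy model and data are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and calibration batches, purely for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).eval()
calib_batches = [torch.randn(4, 16) for _ in range(8)]

# "Observers": forward hooks that record the min/max of each module's output.
ranges = {}

def make_observer(name):
    def hook(module, inputs, output):
        lo, hi = output.amin().item(), output.amax().item()
        old_lo, old_hi = ranges.get(name, (lo, hi))
        ranges[name] = (min(lo, old_lo), max(hi, old_hi))
    return hook

handles = [m.register_forward_hook(make_observer(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

# Calibration pass: a short loop over representative data, no training.
with torch.no_grad():
    for batch in calib_batches:
        model(batch)

for h in handles:
    h.remove()

# Derive a symmetric INT8 scale per module from the observed ranges.
for name, (lo, hi) in ranges.items():
    scale = max(abs(lo), abs(hi)) / 127.0
    print(f"{name}: range=({lo:.3f}, {hi:.3f}) scale={scale:.5f}")
```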

1.3. Configuration terms#

  • ``quantize`` section: Where you specify the backend, mode, default dtypes, per-layer rules, and paths (see the sketch after this list).

  • ``mode``:

      - weight_only_ptq: quantize only weights (e.g., TorchAO)

      - static_ptq: quantize weights and activations with calibration (e.g., ModelOpt)

  • Per-layer rules: Entries in the ``layers`` list, each with a ``module_name`` pattern and optional ``weights`` and ``activations`` dtypes.

  • ``skip_names``: Patterns to exclude modules from quantization.
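Putting the configuration terms together, the sketch below builds an illustrative ``quantize``-style specification as a Python dict and shows how per-layer rules and ``skip_names`` patterns might select modules by name. Only the field names described on this page are used; the default-dtype keys, the exact spec file format (typically YAML), and the pattern-matching semantics (glob-style matching is assumed here) are illustrative assumptions, so treat this as a sketch rather than a schema reference.

```python
from fnmatch import fnmatch

# Illustrative spec mirroring the terms above; not an authoritative schema.
quantize_spec = {
    "backend": "modelopt",            # or "torchao"
    "mode": "static_ptq",             # or "weight_only_ptq"
    "default_weights_dtype": "int8",       # key name assumed for illustration
    "default_activations_dtype": "int8",   # key name assumed for illustration
    "layers": [
        # Per-layer rule: a module_name pattern plus optional dtypes.
        {"module_name": "backbone.*",
         "weights": "fp8_e4m3fn",
         "activations": "fp8_e4m3fn"},
    ],
    "skip_names": ["*head*"],         # modules excluded from quantization
}

# Hypothetical module names, as produced by model.named_modules().
module_names = ["backbone.layer1.conv", "backbone.layer2.conv", "head.classifier"]

for name in module_names:
    if any(fnmatch(name, pat) for pat in quantize_spec["skip_names"]):
        print(f"{name}: skipped (matches skip_names)")
        continue
    rule = next((r for r in quantize_spec["layers"]
                 if fnmatch(name, r["module_name"])), None)
    weights_dtype = rule["weights"] if rule else quantize_spec["default_weights_dtype"]
    print(f"{name}: quantize weights as {weights_dtype}")
```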

1.4. Good to know#

  • Both FP8 variants (fp8_e4m3fn and fp8_e5m2) are accepted dtypes; some backends treat them equivalently (see the sketch after this list).

  • Always validate accuracy after quantization; the representativeness of the calibration data matters.
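As an illustration of why two FP8 variants exist, the snippet below compares their numeric properties using PyTorch's FP8 dtypes (available in recent PyTorch releases): e4m3fn trades range for precision, while e5m2 does the opposite. This is a general PyTorch illustration, independent of any particular backend.

```python
import torch

# Compare the two FP8 formats' dynamic range and precision.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# Round-tripping through FP8 shows the rounding each format introduces;
# e5m2 (fewer mantissa bits) is coarser than e4m3fn.
x = torch.tensor([0.1, 0.2, 0.5, 0.7])
print(x.to(torch.float8_e4m3fn).to(torch.float32))
print(x.to(torch.float8_e5m2).to(torch.float32))
```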