1. Terminology#

This page explains common terms used throughout TAO Quant. It aims to be beginner-friendly and concise.

1.1. Core concepts#

  • Quantization: Converting model numbers from high precision (e.g., FP32) to lower precision (e.g., INT8 or FP8) to reduce memory use and improve speed (see the sketch after this list).

  • PTQ (Post-Training Quantization): Quantize a pretrained model without fine-tuning; may require a small calibration dataset.

  • QAT (Quantization-Aware Training): Train (or fine-tune) the model with fake-quant operators to recover or retain accuracy at lower precision. QAT is not covered by the current TAO Quant release.

  • Weights: Learnable parameters of layers (e.g., kernels, matrices).

  • Activations: Intermediate outputs produced when the model processes inputs.

  • Dtype (Data type): Numeric precision or format, such as int8, fp8_e4m3fn, fp8_e5m2, or native (use original precision).
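The relationship between weights, dtypes, and quantization is easiest to see in code. The following is a minimal, backend-agnostic sketch of symmetric per-tensor INT8 weight quantization written in plain PyTorch for illustration only; it is not how TAO Quant is invoked.

```python
import torch

# Full-precision (FP32) weight tensor, as found in a Linear or Conv layer.
w = torch.randn(64, 128)

# Symmetric per-tensor INT8 quantization (illustrative only):
# one scale maps the FP32 range onto the 8-bit integer range [-127, 127].
scale = w.abs().max() / 127.0
w_int8 = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)

# Dequantize to inspect the approximation error introduced by quantization.
w_dequant = w_int8.to(torch.float32) * scale
print("max abs error:", (w - w_dequant).abs().max().item())
print("memory: fp32 =", w.element_size() * w.nelement(), "bytes,",
      "int8 =", w_int8.element_size() * w_int8.nelement(), "bytes")
```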

1.2. Framework pieces#

  • Backend: A plug-in that implements the quantization steps for a specific library. TAO supports:

      - torchao: Weight-only PTQ; no calibration loop; ignores activation settings.

      - modelopt: Static PTQ with calibration; quantizes weights and activations.

  • Calibration: A short pass over representative data to compute ranges and scales for activations and weights (used by backends like ModelOpt); see the sketch after this list.

  • Observer or fake-quant: Modules inserted into the model to measure value ranges (observers) or to simulate lower-precision behavior (fake-quant) during inference or training.
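To make calibration and observers concrete, here is a minimal conceptual sketch of what a static-PTQ calibration pass does: run a few representative batches through the model while hooks observe activation ranges, then derive a scale per observed module. This is plain PyTorch written for illustration; it does not use the ModelOpt API, and the toy model and data are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and calibration batches, purely for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).eval()
calib_batches = [torch.randn(4, 16) for _ in range(8)]

# "Observers": forward hooks that record the min/max of each module's output.
ranges = {}

def make_observer(name):
    def hook(module, inputs, output):
        lo, hi = output.amin().item(), output.amax().item()
        old_lo, old_hi = ranges.get(name, (lo, hi))
        ranges[name] = (min(lo, old_lo), max(hi, old_hi))
    return hook

handles = [m.register_forward_hook(make_observer(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

# Calibration pass: a short loop over representative data, no training.
with torch.no_grad():
    for batch in calib_batches:
        model(batch)

for h in handles:
    h.remove()

# Derive a symmetric INT8 scale per module from the observed ranges.
for name, (lo, hi) in ranges.items():
    scale = max(abs(lo), abs(hi)) / 127.0
    print(f"{name}: range=({lo:.3f}, {hi:.3f}) scale={scale:.5f}")
```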

1.3. Configuration terms#

  • ``quantize`` section: Where you specify the backend, mode, default dtypes, per-layer rules, and paths (see the sketch after this list).

  • ``mode``:

      - weight_only_ptq: quantize only weights (e.g., TorchAO)

      - static_ptq: quantize weights and activations with calibration (e.g., ModelOpt)

  • Per-layer rules: Entries in the ``layers`` list, each with a ``module_name`` pattern and optional ``weights`` and ``activations`` dtypes.

  • ``skip_names``: Patterns to exclude modules from quantization.
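Putting the configuration terms together, the sketch below builds an illustrative ``quantize``-style specification as a Python dict and shows how per-layer rules and ``skip_names`` patterns might select modules by name. Only the field names described on this page are used; the default-dtype keys, the exact spec file format (typically YAML), and the pattern-matching semantics (glob-style matching is assumed here) are illustrative assumptions, so treat this as a sketch rather than a schema reference.

```python
from fnmatch import fnmatch

# Illustrative spec mirroring the terms above; not an authoritative schema.
quantize_spec = {
    "backend": "modelopt",            # or "torchao"
    "mode": "static_ptq",             # or "weight_only_ptq"
    "default_weights_dtype": "int8",       # key name assumed for illustration
    "default_activations_dtype": "int8",   # key name assumed for illustration
    "layers": [
        # Per-layer rule: a module_name pattern plus optional dtypes.
        {"module_name": "backbone.*",
         "weights": "fp8_e4m3fn",
         "activations": "fp8_e4m3fn"},
    ],
    "skip_names": ["*head*"],         # modules excluded from quantization
}

# Hypothetical module names, as produced by model.named_modules().
module_names = ["backbone.layer1.conv", "backbone.layer2.conv", "head.classifier"]

for name in module_names:
    if any(fnmatch(name, pat) for pat in quantize_spec["skip_names"]):
        print(f"{name}: skipped (matches skip_names)")
        continue
    rule = next((r for r in quantize_spec["layers"]
                 if fnmatch(name, r["module_name"])), None)
    weights_dtype = rule["weights"] if rule else quantize_spec["default_weights_dtype"]
    print(f"{name}: quantize weights as {weights_dtype}")
```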

1.4. Good to know#

  • Both FP8 variants (fp8_e4m3fn and fp8_e5m2) are accepted dtypes; some backends treat them equivalently (see the sketch after this list).

  • Always validate accuracy after quantization; the representativeness of the calibration data matters.
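As an illustration of why two FP8 variants exist, the snippet below compares their numeric properties using PyTorch's FP8 dtypes (available in recent PyTorch releases): e4m3fn trades range for precision, while e5m2 does the opposite. This is a general PyTorch illustration, independent of any particular backend.

```python
import torch

# Compare the two FP8 formats' dynamic range and precision.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# Round-tripping through FP8 shows the rounding each format introduces;
# e5m2 (fewer mantissa bits) is coarser than e4m3fn.
x = torch.tensor([0.1, 0.2, 0.5, 0.7])
print(x.to(torch.float8_e4m3fn).to(torch.float32))
print(x.to(torch.float8_e5m2).to(torch.float32))
```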