9. Limitations and current status
9.1. Runtime support
The PyTorch runtime is the primary supported target for quantized execution.
ONNX and TensorRT export and runtime support are experimental; operator coverage and conversion steps may vary by backend and model.
9.2. Backends
TorchAO: Weight-only PTQ; activations are not quantized, and no calibration is required. FP8 weight-only support depends on the maturity of TorchAO and the hardware stack.
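To make the weight-only scheme concrete, here is a minimal pure-Python sketch (not TorchAO's API) of symmetric int8 weight-only quantization, applied per output channel. Because only weights are quantized, no calibration data is involved:

```python
# Illustrative sketch (not TorchAO's API): symmetric int8 weight-only
# quantization of one weight matrix, one scale per output channel (row).

def quantize_weights_int8(weights):
    """Quantize each row of a weight matrix to int8.

    Returns (q_rows, scales): int8 values and one float scale per row.
    """
    q_rows, scales = [], []
    for row in weights:
        # Symmetric scheme: map the largest magnitude in the row to 127.
        scale = max(abs(w) for w in row) / 127 or 1.0  # avoid zero scale
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate float weights for float-precision compute."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.5, -1.27, 0.01], [2.54, 0.0, -2.54]]
q, s = quantize_weights_int8(w)
w_hat = dequantize(q, s)
```

Values whose magnitude is an exact multiple of the scale round-trip losslessly; everything else incurs a rounding error of at most half a scale step.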
ModelOpt: Fake quantization in the PyTorch runtime, with limited speedups; the focus is on accurate scales and exportability.
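The following is a minimal pure-Python sketch (not ModelOpt's API) of what "fake" quantization means: tensors are rounded through an integer grid but kept in float, so accuracy reflects quantization error while kernels still run at full precision (hence the limited speedups):

```python
# Illustrative sketch (not ModelOpt's API): fake quantization rounds
# values onto an int8 grid and maps them straight back to float.

def fake_quant(values, scale, qmin=-128, qmax=127):
    """Quantize-dequantize: round to the int grid, clamp, return floats."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))
        out.append(q * scale)  # still a float; no int8 kernels involved
    return out

y = fake_quant([0.1234, 3.0], scale=0.02)
```

With a scale of 0.02, the first value snaps to the nearest grid point (0.12) and the second saturates at the clamp boundary (127 * 0.02 = 2.54), mimicking the error a true int8 deployment would see.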
9.3. Algorithms
ModelOpt supports minmax and entropy. SmoothQuant and additional algorithms may be added later.
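As a rough illustration of the minmax approach (a sketch, not ModelOpt's implementation), a minmax observer tracks the extreme activation values seen over calibration batches and derives a symmetric scale from the widest magnitude; entropy calibration instead searches for a clipping threshold that minimizes the divergence between the float and quantized distributions:

```python
# Illustrative sketch (not ModelOpt's API): minmax calibration derives
# an activation scale from the running min/max over calibration batches.

class MinMaxObserver:
    """Track the running min/max of activations across batches."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def scale(self, qmax=127):
        # Symmetric scale covering the widest observed magnitude.
        return max(abs(self.lo), abs(self.hi)) / qmax

obs = MinMaxObserver()
obs.observe([-1.0, 2.0])
obs.observe([0.5, 2.54])
s = obs.scale()
```

Minmax is cheap and deterministic but sensitive to outliers: a single extreme activation inflates the scale for every value in the tensor, which is the gap entropy-style calibration aims to close.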
9.4. Tasks and coverage
Supported tasks today: classification_pyt, rtdetr.
9.5. General guidance
Always validate accuracy against FP32 (full precision) and FP16 (half precision) baselines, and adjust calibration settings or per-layer rules as needed.
Run an initial pilot on a representative subset before large-scale deployment.
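One simple pilot-stage check is top-1 agreement between the quantized model and the FP32 baseline on a representative subset. The helper below is a hypothetical sketch, not part of any toolkit named here:

```python
# Illustrative sketch: fraction of samples where the quantized model
# and the FP32 baseline predict the same class (top-1 agreement).

def top1_agreement(baseline_logits, quantized_logits):
    """Compare per-sample argmax predictions of two models."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)

    same = sum(
        argmax(b) == argmax(q)
        for b, q in zip(baseline_logits, quantized_logits)
    )
    return same / len(baseline_logits)

rate = top1_agreement(
    [[0.1, 0.9], [0.8, 0.2]],   # baseline predicts classes 1, 0
    [[0.2, 0.8], [0.3, 0.7]],   # quantized predicts classes 1, 1
)
```

A low agreement rate on the pilot subset is a signal to revisit calibration or exclude sensitive layers before scaling up deployment.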