9. Limitations and current status
9.1. Runtime support
The PyTorch runtime is the primary supported target for quantized execution.
ONNX and TensorRT export and runtime support are experimental; operator coverage and conversion steps may vary by backend and model.
9.2. Backends
TorchAO: Weight-only PTQ; activations are not quantized, and no calibration is required. FP8 weight-only support depends on the maturity of TorchAO and the hardware stack.
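To make the weight-only scheme concrete, here is a minimal pure-Python sketch (not TorchAO's API) of symmetric int8 weight-only quantization, applied per output channel. Because only weights are quantized, no calibration data is involved:

```python
# Illustrative sketch (not TorchAO's API): symmetric int8 weight-only
# quantization of one weight matrix, one scale per output channel (row).

def quantize_weights_int8(weights):
    """Quantize each row of a weight matrix to int8.

    Returns (q_rows, scales): int8 values and one float scale per row.
    """
    q_rows, scales = [], []
    for row in weights:
        # Symmetric scheme: map the largest magnitude in the row to 127.
        scale = max(abs(w) for w in row) / 127 or 1.0  # avoid zero scale
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate float weights for float-precision compute."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.5, -1.27, 0.01], [2.54, 0.0, -2.54]]
q, s = quantize_weights_int8(w)
w_hat = dequantize(q, s)
```

Values whose magnitude is an exact multiple of the scale round-trip losslessly; everything else incurs a rounding error of at most half a scale step.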
ModelOpt: Fake quantization in the PyTorch runtime, with limited speedups; the focus is on accurate scales and exportability.
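The following is a minimal pure-Python sketch (not ModelOpt's API) of what "fake" quantization means: tensors are rounded through an integer grid but kept in float, so accuracy reflects quantization error while kernels still run at full precision (hence the limited speedups):

```python
# Illustrative sketch (not ModelOpt's API): fake quantization rounds
# values onto an int8 grid and maps them straight back to float.

def fake_quant(values, scale, qmin=-128, qmax=127):
    """Quantize-dequantize: round to the int grid, clamp, return floats."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))
        out.append(q * scale)  # still a float; no int8 kernels involved
    return out

y = fake_quant([0.1234, 3.0], scale=0.02)
```

With a scale of 0.02, the first value snaps to the nearest grid point (0.12) and the second saturates at the clamp boundary (127 * 0.02 = 2.54), mimicking the error a true int8 deployment would see.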
9.3. Algorithms
ModelOpt supports minmax and entropy. SmoothQuant and additional algorithms may be added later.
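As a rough illustration of the minmax approach (a sketch, not ModelOpt's implementation), a minmax observer tracks the extreme activation values seen over calibration batches and derives a symmetric scale from the widest magnitude; entropy calibration instead searches for a clipping threshold that minimizes the divergence between the float and quantized distributions:

```python
# Illustrative sketch (not ModelOpt's API): minmax calibration derives
# an activation scale from the running min/max over calibration batches.

class MinMaxObserver:
    """Track the running min/max of activations across batches."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def scale(self, qmax=127):
        # Symmetric scale covering the widest observed magnitude.
        return max(abs(self.lo), abs(self.hi)) / qmax

obs = MinMaxObserver()
obs.observe([-1.0, 2.0])
obs.observe([0.5, 2.54])
s = obs.scale()
```

Minmax is cheap and deterministic but sensitive to outliers: a single extreme activation inflates the scale for every value in the tensor, which is the gap entropy-style calibration aims to close.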
9.4. Tasks and coverage
Supported tasks today: classification_pyt, rtdetr.
9.5. General guidance
Always validate accuracy against FP32 (full precision) and FP16 (half precision) baselines, and adjust calibration settings or per-layer rules as needed.
Run an initial pilot on a representative subset before large-scale deployment.
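One simple pilot-stage check is top-1 agreement between the quantized model and the FP32 baseline on a representative subset. The helper below is a hypothetical sketch, not part of any toolkit named here:

```python
# Illustrative sketch: fraction of samples where the quantized model
# and the FP32 baseline predict the same class (top-1 agreement).

def top1_agreement(baseline_logits, quantized_logits):
    """Compare per-sample argmax predictions of two models."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)

    same = sum(
        argmax(b) == argmax(q)
        for b, q in zip(baseline_logits, quantized_logits)
    )
    return same / len(baseline_logits)

rate = top1_agreement(
    [[0.1, 0.9], [0.8, 0.2]],   # baseline predicts classes 1, 0
    [[0.2, 0.8], [0.3, 0.7]],   # quantized predicts classes 1, 1
)
```

A low agreement rate on the pilot subset is a signal to revisit calibration or exclude sensitive layers before scaling up deployment.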