9. Limitations and current status#

9.1. Runtime support#

  • PyTorch runtime is the primary supported target for quantized execution.

  • ONNX and TensorRT export and runtime support are experimental; operator coverage and conversion steps may vary by backend and model.

9.2. Backends#

  • TorchAO: Weight-only PTQ; activations are not quantized and no calibration step is required. FP8 weight-only support depends on the maturity of the TorchAO and hardware stacks.

  • ModelOpt: Fake quantization (quantize-dequantize) in the PyTorch runtime, so speedups are limited; the focus is on accurate scales and exportability.
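Fake quantization keeps tensors in floating point but injects int8 rounding error, which is why it produces accurate scales without runtime speedups. A minimal per-tensor sketch (pure Python with illustrative names, not the ModelOpt API):

```python
def fake_quantize(values, amax, num_bits=8):
    """Simulate int8 quantization: scale, round, clamp, then dequantize.

    The result stays in floating point, so the model still executes at
    full precision in the PyTorch runtime -- only rounding error is added.
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = amax / qmax                      # real-valued quantization step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]    # dequantize back to float

weights = [0.51, -1.27, 0.003, 1.27]
# Each value snaps to the nearest multiple of amax / 127;
# 0.003 is below half a step and collapses to 0.0.
print(fake_quantize(weights, amax=1.27))
```

Because the quantize-dequantize round trip is a plain floating-point function, the same scales can later be exported for a true integer backend.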

9.3. Algorithms#

  • ModelOpt supports the minmax and entropy calibration algorithms. SmoothQuant and additional algorithms may be added later.
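Minmax calibration tracks the largest absolute value (amax) seen over the calibration data and derives the int8 scale from it; entropy calibration instead searches for a clipping threshold that minimizes KL divergence between the original and quantized distributions. A sketch of the simpler minmax variant (illustrative helper name, not the ModelOpt API):

```python
def minmax_calibrate(batches, num_bits=8):
    """Track the running max |x| over calibration batches; return the scale."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(x) for x in batch))
    qmax = 2 ** (num_bits - 1) - 1   # 127 for int8
    return amax / qmax               # quantization step size

# Three calibration batches of activations; the observed amax is 5.08.
batches = [[0.2, -1.5, 3.0], [5.08, -2.2], [0.9, 4.4]]
scale = minmax_calibrate(batches)
print(scale)  # roughly 0.04, i.e. 5.08 / 127
```

Minmax is fast and robust but sensitive to outliers, which is the gap entropy calibration is designed to close.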

9.4. Tasks and coverage#

  • Supported tasks today: classification_pyt, rtdetr.

9.5. General guidance#

  • Always validate accuracy against FP32 (full precision) and FP16 (half precision) baselines, and adjust calibration settings or per-layer quantization rules as needed.

  • Run an initial pilot on a representative subset before large-scale deployment.
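The validation step above can be automated with a small harness that measures top-1 agreement between the quantized model and its FP32 baseline on a pilot subset. A sketch with hypothetical names; any callable mapping an input to a list of class scores works:

```python
def top1(scores):
    """Index of the highest class score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def agreement_rate(baseline_model, quant_model, samples):
    """Fraction of samples where both models predict the same top-1 class."""
    matches = sum(
        top1(baseline_model(x)) == top1(quant_model(x)) for x in samples
    )
    return matches / len(samples)

# Toy stand-ins: the "quantized" model has slightly perturbed scores.
fp32_model = lambda x: [x, 2 * x, 3 * x]
quant_model = lambda x: [x + 0.01, 2 * x, 3 * x - 0.02]
samples = [0.5, 1.0, -1.0, 0.0]
print(agreement_rate(fp32_model, quant_model, samples))  # 1.0
```

If the agreement rate on the pilot subset drops noticeably below 1.0, revisit calibration settings or exclude sensitive layers before scaling up deployment.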