12. Limitations and Current Status#

12.1. Runtime Support#

  • PyTorch runtime: Supported by the torchao and modelopt.pytorch backends.

  • ONNX Runtime: Fully supported by the modelopt.onnx backend.

  • NVIDIA TensorRT™: Recommended production target for the modelopt.onnx backend. Provides the best performance gains (2-5x for FP8, 2-4x for INT8).

Note

TensorRT deployment via modelopt.onnx is the preferred workflow when maximum production performance is required; a sketch using ONNX Runtime's TensorRT execution provider follows.
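
The snippet below is a minimal sketch of serving a quantized ONNX model through ONNX Runtime with the TensorRT execution provider, which is one common way to reach TensorRT from Python. The model path, input name, and input shape are placeholders; adapt them to your exported model.

    import numpy as np
    import onnxruntime as ort

    # Prefer TensorRT when available, then fall back to CUDA and CPU.
    session = ort.InferenceSession(
        "model_quantized.onnx",  # placeholder path to the quantized model
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )

    input_name = session.get_inputs()[0].name
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed NCHW input
    outputs = session.run(None, {input_name: dummy})
    print(outputs[0].shape)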

12.2. Backend-Specific Limitations#

TorchAO:

  • Weight-only PTQ; activations remain FP32 (see the usage sketch after this list)

  • No calibration required (a pro for simplicity, a con for accuracy tuning)

  • Runtime speedups depend on kernel support and hardware

  • FP8 support maturity varies by hardware generation
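
As a usage sketch, TorchAO weight-only INT8 PTQ can be applied in place with the quantize_ API. This assumes a recent torchao release (helper names such as int8_weight_only may differ across versions), and the model here is a stand-in.

    import torch
    from torch import nn
    from torchao.quantization import quantize_, int8_weight_only

    # Stand-in model; int8_weight_only targets nn.Linear layers.
    model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

    # In-place weight-only INT8 quantization: no calibration data needed,
    # and activations stay FP32.
    quantize_(model, int8_weight_only())

    x = torch.randn(1, 512)
    with torch.no_grad():
        y = model(x)  # speedups depend on kernel and hardware support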

ModelOpt PyTorch:

  • Uses fake-quantization operations in the PyTorch runtime

  • Limited speedups in PyTorch (the focus is on accurate scales, not kernel acceleration)

  • Best used for prototyping before ONNX export

  • Requires calibration data (see the calibration sketch after this list)
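
The following is a minimal sketch of the ModelOpt PyTorch calibration flow using mtq.quantize with a forward loop; model and calib_loader are placeholders for your own network and data pipeline.

    import modelopt.torch.quantization as mtq

    def forward_loop(model):
        # Run representative batches so activation ranges can be calibrated;
        # calib_loader is a placeholder DataLoader of calibration samples.
        for images, _ in calib_loader:
            model(images)

    # INT8_DEFAULT_CFG is one of the preset configs shipped with ModelOpt.
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

    # The result contains fake-quant ops: scales are accurate, but PyTorch
    # speedups are limited; export to ONNX for deployment.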

ModelOpt ONNX:

  • Requires pre-exported ONNX model (cannot quantize PyTorch directly)

  • No per-layer mixed precision (the first dtype applies globally)

  • Currently limited to classification_pyt and rtdetr models

  • Requires calibration data for best results (a sketch of the flow follows this list)
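
A minimal sketch of the ModelOpt ONNX flow is shown below. It assumes the modelopt.onnx.quantization.quantize entry point; the parameter names and the input key "input" follow common ModelOpt usage but should be checked against your installed version and exported model.

    import numpy as np
    from modelopt.onnx.quantization import quantize

    # Representative calibration inputs, keyed by the ONNX input name
    # ("input" and the shape are placeholders).
    calib_data = {"input": np.random.rand(32, 3, 224, 224).astype(np.float32)}

    quantize(
        onnx_path="model_fp32.onnx",    # pre-exported ONNX model (required)
        quantize_mode="int8",           # first dtype applies globally
        calibration_data=calib_data,
        calibration_method="max",       # often recommended for TensorRT
        output_path="model_quantized.onnx",
    )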

12.3. Calibration Algorithms#

  • ModelOpt backends (PyTorch and ONNX) support the minmax, max, and entropy calibration algorithms (see the config sketch after this list)

  • max is often recommended for the ONNX backend when deploying to TensorRT

  • TorchAO does not require calibration (weight-only quantization)
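
As a config sketch, the calibration algorithm on the ModelOpt PyTorch side can be switched by copying a preset config and overriding its algorithm field; the "algorithm" key follows the documented config layout, but verify the accepted values against your installed version.

    import copy
    import modelopt.torch.quantization as mtq

    # Start from a preset and swap the calibration algorithm.
    config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    config["algorithm"] = "entropy"  # e.g. "max" (preset default) or "entropy"

    model = mtq.quantize(model, config, forward_loop)  # forward_loop as above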

12.4. Tasks and Coverage#

  • Supported tasks today: classification_pyt, rtdetr.

12.5. General Guidance#

  • Always validate accuracy against FP32 (full-precision) and FP16 (half-precision) baselines, and adjust calibration settings or per-layer rules as needed (see the validation sketch below).

  • Run an initial pilot on a representative subset before large-scale deployment.
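
The sketch below illustrates the validation step: measure top-1 accuracy for the FP32 baseline and the quantized model on the same held-out set. Here fp32_model, quantized_model, and eval_loader are placeholders for your own pipeline.

    import torch

    def top1_accuracy(model, loader, device="cuda"):
        # Count correct top-1 predictions over the whole evaluation set.
        model.eval().to(device)
        correct = total = 0
        with torch.no_grad():
            for images, labels in loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        return correct / total

    fp32_acc = top1_accuracy(fp32_model, eval_loader)        # placeholder models
    quant_acc = top1_accuracy(quantized_model, eval_loader)
    print(f"FP32: {fp32_acc:.4f}  quantized: {quant_acc:.4f}  "
          f"drop: {fp32_acc - quant_acc:.4f}")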