12. Limitations and Current Status#

12.1. Runtime Support#

  • PyTorch runtime: Supported by the torchao and modelopt.pytorch backends.

  • ONNX Runtime: Fully supported by the modelopt.onnx backend.

  • NVIDIA TensorRT™: Recommended production target for the modelopt.onnx backend. Provides the best performance gains (2-5x for FP8, 2-4x for INT8).

Note

TensorRT deployment via modelopt.onnx is the preferred workflow when maximum production performance is required; a sketch using ONNX Runtime's TensorRT execution provider follows.
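
The snippet below is a minimal sketch of serving a quantized ONNX model through ONNX Runtime with the TensorRT execution provider, which is one common way to reach TensorRT from Python. The model path, input name, and input shape are placeholders; adapt them to your exported model.

    import numpy as np
    import onnxruntime as ort

    # Prefer TensorRT when available, then fall back to CUDA and CPU.
    session = ort.InferenceSession(
        "model_quantized.onnx",  # placeholder path to the quantized model
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )

    input_name = session.get_inputs()[0].name
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed NCHW input
    outputs = session.run(None, {input_name: dummy})
    print(outputs[0].shape)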

12.2. Backend-Specific Limitations#

TorchAO:

  • Weight-only PTQ; activations remain FP32 (see the usage sketch after this list)

  • No calibration required (a pro for simplicity, a con for accuracy tuning)

  • Runtime speedups depend on kernel support and hardware

  • FP8 support maturity varies by hardware generation
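
As a usage sketch, TorchAO weight-only INT8 PTQ can be applied in place with the quantize_ API. This assumes a recent torchao release (helper names such as int8_weight_only may differ across versions), and the model here is a stand-in.

    import torch
    from torch import nn
    from torchao.quantization import quantize_, int8_weight_only

    # Stand-in model; int8_weight_only targets nn.Linear layers.
    model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

    # In-place weight-only INT8 quantization: no calibration data needed,
    # and activations stay FP32.
    quantize_(model, int8_weight_only())

    x = torch.randn(1, 512)
    with torch.no_grad():
        y = model(x)  # speedups depend on kernel and hardware support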

ModelOpt PyTorch:

  • Uses fake-quantization operations in the PyTorch runtime

  • Limited speedups in PyTorch (the focus is on accurate scales, not kernel acceleration)

  • Best used for prototyping before ONNX export

  • Requires calibration data (see the calibration sketch after this list)
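
The following is a minimal sketch of the ModelOpt PyTorch calibration flow using mtq.quantize with a forward loop; model and calib_loader are placeholders for your own network and data pipeline.

    import modelopt.torch.quantization as mtq

    def forward_loop(model):
        # Run representative batches so activation ranges can be calibrated;
        # calib_loader is a placeholder DataLoader of calibration samples.
        for images, _ in calib_loader:
            model(images)

    # INT8_DEFAULT_CFG is one of the preset configs shipped with ModelOpt.
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

    # The result contains fake-quant ops: scales are accurate, but PyTorch
    # speedups are limited; export to ONNX for deployment.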

ModelOpt ONNX:

  • Requires pre-exported ONNX model (cannot quantize PyTorch directly)

  • No per-layer mixed precision (the first dtype applies globally)

  • Currently limited to classification_pyt and rtdetr models

  • Requires calibration data for best results (a sketch of the flow follows this list)
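
A minimal sketch of the ModelOpt ONNX flow is shown below. It assumes the modelopt.onnx.quantization.quantize entry point; the parameter names and the input key "input" follow common ModelOpt usage but should be checked against your installed version and exported model.

    import numpy as np
    from modelopt.onnx.quantization import quantize

    # Representative calibration inputs, keyed by the ONNX input name
    # ("input" and the shape are placeholders).
    calib_data = {"input": np.random.rand(32, 3, 224, 224).astype(np.float32)}

    quantize(
        onnx_path="model_fp32.onnx",    # pre-exported ONNX model (required)
        quantize_mode="int8",           # first dtype applies globally
        calibration_data=calib_data,
        calibration_method="max",       # often recommended for TensorRT
        output_path="model_quantized.onnx",
    )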

12.3. Calibration Algorithms#

  • ModelOpt backends (PyTorch and ONNX) support the minmax, max, and entropy calibration algorithms (see the config sketch after this list)

  • max is often recommended for the ONNX backend when deploying to TensorRT

  • TorchAO does not require calibration (weight-only quantization)
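
As a config sketch, the calibration algorithm on the ModelOpt PyTorch side can be switched by copying a preset config and overriding its algorithm field; the "algorithm" key follows the documented config layout, but verify the accepted values against your installed version.

    import copy
    import modelopt.torch.quantization as mtq

    # Start from a preset and swap the calibration algorithm.
    config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    config["algorithm"] = "entropy"  # e.g. "max" (preset default) or "entropy"

    model = mtq.quantize(model, config, forward_loop)  # forward_loop as above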

12.4. Tasks and Coverage#

  • Supported tasks today: classification_pyt, rtdetr.

12.5. General Guidance#

  • Always validate accuracy against FP32 (full-precision) and FP16 (half-precision) baselines, and adjust calibration settings or per-layer rules as needed (see the validation sketch below).

  • Run an initial pilot on a representative subset before large-scale deployment.
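
The sketch below illustrates the validation step: measure top-1 accuracy for the FP32 baseline and the quantized model on the same held-out set. Here fp32_model, quantized_model, and eval_loader are placeholders for your own pipeline.

    import torch

    def top1_accuracy(model, loader, device="cuda"):
        # Count correct top-1 predictions over the whole evaluation set.
        model.eval().to(device)
        correct = total = 0
        with torch.no_grad():
            for images, labels in loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        return correct / total

    fp32_acc = top1_accuracy(fp32_model, eval_loader)        # placeholder models
    quant_acc = top1_accuracy(quantized_model, eval_loader)
    print(f"FP32: {fp32_acc:.4f}  quantized: {quant_acc:.4f}  "
          f"drop: {fp32_acc - quant_acc:.4f}")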