12. Limitations and Current Status#
12.1. Runtime Support#
PyTorch runtime: Supported for the torchao and modelopt.pytorch backends.
ONNX Runtime: Fully supported for the modelopt.onnx backend.
NVIDIA TensorRT™: Recommended production target for the modelopt.onnx backend. Provides the best performance gains (2-5x for FP8, 2-4x for INT8).
Note
TensorRT deployment via modelopt.onnx is the preferred workflow for production deployments requiring maximum performance.
12.2. Backend-Specific Limitations#
TorchAO:
Weight-only PTQ; activations remain FP32
No calibration required (a convenience, but also a limitation: scales are not tuned on representative data)
Runtime speedups depend on kernel support and hardware
FP8 support maturity varies by hardware generation
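The following is a minimal sketch of weight-only PTQ with TorchAO. The exact API varies across torchao releases, and the model here is a placeholder, so treat it as illustrative rather than definitive.

import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# Placeholder model; in practice this is your trained network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Quantize weights to INT8 in place; activations stay FP32, and no
# calibration data is needed because only weights are transformed.
quantize_(model, int8_weight_only())

# Inference proceeds as usual; speedups depend on kernel and hardware support.
with torch.inference_mode():
    out = model(torch.randn(1, 512))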
ModelOpt PyTorch:
Uses fake-quant operations in PyTorch runtime
Limited speedups in PyTorch (focus is on scale accuracy)
Best used for prototyping before ONNX export
Requires calibration data
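A minimal sketch of the ModelOpt PyTorch PTQ flow with calibration follows, assuming the nvidia-modelopt package; the model and calibration data are placeholders, and config names may differ between releases.

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Hypothetical calibration batches standing in for a real dataset.
calib_batches = [torch.randn(8, 512) for _ in range(16)]

def forward_loop(m):
    # Run representative data through the model to collect activation ranges.
    with torch.inference_mode():
        for batch in calib_batches:
            m(batch)

# Inserts fake-quant ops and calibrates scales; meaningful speedups come
# later, after ONNX export and TensorRT deployment.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)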
ModelOpt ONNX:
Requires pre-exported ONNX model (cannot quantize PyTorch directly)
No per-layer mixed precision (the first dtype in the configuration applies to the whole graph)
Currently limited to classification_pyt and rtdetr models
Requires calibration data for best results
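Below is a sketch of quantizing a pre-exported ONNX model with ModelOpt. The keyword names used here (quantize_mode, calibration_data, calibration_method, output_path) and the input name "input" are assumptions; verify them against the documentation for your installed ModelOpt version.

import numpy as np
from modelopt.onnx.quantization import quantize

# Representative calibration inputs, keyed by the ONNX graph's input name
# ("input" is a placeholder here).
calib_data = {"input": np.random.randn(32, 3, 224, 224).astype(np.float32)}

quantize(
    onnx_path="model.onnx",          # model must already be exported to ONNX
    quantize_mode="int8",            # a single dtype applies to the whole graph
    calibration_data=calib_data,
    calibration_method="max",        # often recommended for TensorRT targets
    output_path="model_quantized.onnx",
)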
12.3. Calibration Algorithms#
ModelOpt backends (PyTorch and ONNX) support the minmax, max, and entropy calibration algorithms.
max is often recommended for the ONNX backend with TensorRT deployment.
TorchAO does not require calibration (weight-only quantization).
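To make the distinction concrete, here is a toy sketch of how max-style calibration derives a symmetric INT8 scale from observed activations. Real implementations track statistics across many batches, and entropy calibration additionally searches for a clipping threshold; this standalone numpy example is only illustrative.

import numpy as np

def max_calibrate_scale(activations: np.ndarray, num_bits: int = 8) -> float:
    # Symmetric range: map the largest observed magnitude to the int extreme.
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return float(np.abs(activations).max()) / qmax

acts = np.random.randn(1024).astype(np.float32)
scale = max_calibrate_scale(acts)
quantized = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)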
12.4. Tasks and Coverage#
Supported tasks today: classification_pyt and rtdetr.
12.5. General Guidance#
Always validate quantized accuracy against FP32 (full-precision) and FP16 (half-precision) baselines, and adjust calibration settings or per-layer rules if accuracy drops.
Run an initial pilot on a representative subset before large-scale deployment.
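A short sketch of this validation step is shown below: compare the quantized model's accuracy against FP32 and FP16 baselines on a held-out subset. The evaluate helper and val_subset loader are hypothetical stand-ins for your own evaluation utilities.

import torch

def evaluate(model, loader) -> float:
    # Top-1 accuracy over a (images, labels) dataloader.
    correct = total = 0
    with torch.inference_mode():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Hypothetical usage:
# acc_fp32 = evaluate(model_fp32, val_subset)
# acc_fp16 = evaluate(model_fp32.half(), val_subset)  # also cast inputs to half
# acc_int8 = evaluate(model_int8, val_subset)
# Revisit calibration or per-layer rules if acc_int8 drops noticeably.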