3. Choosing a Backend#
TAO Quant provides three quantization backends, each optimized for specific use cases and deployment targets. This guide helps you select the right backend for your needs.
3.1. Backend Comparison#
| Backend | Input Format | Output Format | Best Use Case | NVIDIA TensorRT™ Ready |
|---|---|---|---|---|
| torchao | PyTorch (.pth) | PyTorch (.pth) | Quick experiments | No |
| modelopt.pytorch | PyTorch (.pth) | PyTorch (.pth) | Prototyping | Partial |
| modelopt.onnx | ONNX (.onnx) | ONNX (.onnx) | Production TensorRT | Yes |
3.2. Detailed Breakdown#
3.2.1. torchao#
Purpose: Weight-only post-training quantization for PyTorch models.
When to use:
Quick quantization experiments without calibration
Minimal setup and configuration needed
When you want to preserve activation precision
Strengths:
Simplest to configure and run
No calibration data required
Often minimal accuracy impact
Broad model compatibility
Limitations:
Weight-only (activations remain FP32)
Modest speedups and compression
Runtime gains depend on kernel support
Not optimized for TensorRT deployment
Typical workflow:
Load trained PyTorch model
Configure weight quantization (INT8 or FP8)
Quantize and save
Evaluate in PyTorch runtime
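For reference, the sketch below shows roughly what this workflow looks like when driving the torchao library directly from Python; the toy model, tensor shapes, and output file name are placeholders, and in TAO Quant the same steps are configured through the quantize spec rather than written by hand.

```python
# Minimal weight-only PTQ sketch using the torchao library directly.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# A toy classifier stands in for your trained model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Swap Linear weights for INT8 weight-only quantized versions, in place.
# Activations are untouched, so no calibration data is required.
quantize_(model, int8_weight_only())

# Evaluate in the PyTorch runtime, then save.
with torch.inference_mode():
    _ = model(torch.randn(4, 512))
torch.save(model.state_dict(), "model_w8.pth")
```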
3.2.2. modelopt.pytorch#
Purpose: Static PTQ with calibration for PyTorch models (weights + activations).
When to use:
Experimenting with full quantization (weights and activations)
Prototyping quantization strategies before ONNX export
When you need calibration but want to stay in PyTorch
Strengths:
Quantizes both weights and activations
Supports calibration algorithms (minmax, entropy)
Fine-grained per-layer control
Good for accuracy validation
Limitations:
Uses fake-quant operations in the PyTorch runtime
Limited speedups in PyTorch (the focus is on accurate calibration scales, not fast kernels)
Not fully optimized for TensorRT deployment
Requires calibration data
Typical workflow:
Load trained PyTorch model
Configure quantization with calibration data
Calibrate and quantize
Evaluate accuracy in PyTorch
Optionally export to ONNX
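As a rough illustration of this flow outside of TAO, the sketch below drives NVIDIA ModelOpt's PyTorch quantization API directly; the toy model, random calibration batches, and the INT8 default config are placeholder assumptions, and the exact API surface may differ between ModelOpt releases.

```python
# Static PTQ sketch with calibration using ModelOpt's PyTorch API.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and random batches stand in for a trained network and a real
# calibration set (a few hundred representative samples in practice).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
calib_batches = [torch.randn(8, 128) for _ in range(32)]

def forward_loop(m):
    # ModelOpt runs this loop so the inserted quantizers can record
    # activation ranges and compute scales.
    for batch in calib_batches:
        m(batch)

# Insert fake-quant nodes for weights and activations, then calibrate.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Accuracy can now be validated in PyTorch before any ONNX export.
```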
3.2.3. modelopt.onnx (Recommended for TensorRT)#
Purpose: Static PTQ with calibration for ONNX models, optimized for TensorRT deployment.
When to use:
Production deployments targeting TensorRT
Maximum runtime performance is critical
When you have an ONNX-exported model
When the final deployment is on NVIDIA GPUs
Strengths:
Best TensorRT runtime speedups (up to 3-5x for FP8, 2-4x for INT8)
Optimized for TensorRT kernels and hardware acceleration
Produces TensorRT-ready quantized ONNX models
Clean deployment path: ONNX → TensorRT engine
Supports INT8 and FP8 (E4M3FN, E5M2) quantization
Full weight + activation quantization
Limitations:
Requires ONNX model as input (must export first)
Does not support mixed precision per layer (first layer’s dtype applies globally)
Only works with ONNX models (not PyTorch)
Calibration data must be numpy arrays
Why preferred for TensorRT:
When you quantize an ONNX model with modelopt.onnx and then build a TensorRT engine, you get:
Native TensorRT quantization nodes that map directly to hardware instructions
Optimized kernel fusion that combines quantized operations
Full GPU utilization with INT8/FP8 Tensor Cores
Memory bandwidth savings from smaller data types
Validated calibration scales that transfer correctly to TensorRT
This results in real measured speedups (not just theoretical speedups), typically:
FP8: 2-5x faster inference vs FP32
INT8: 2-4x faster inference vs FP32
Memory: 2-4x reduction in model size
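For concreteness, one way to build such an engine from an already-quantized ONNX file is sketched below using the TensorRT Python API (running trtexec with --onnx and --saveEngine is an equivalent command-line route); the file names are placeholders and the builder flags can vary by TensorRT version.

```python
# Build a TensorRT engine from a Q/DQ-quantized ONNX model.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_quant.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# The Q/DQ nodes in the quantized ONNX already carry calibration scales,
# so no separate TensorRT calibrator is needed for explicit quantization.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```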
Typical workflow:
Export trained PyTorch model to ONNX
Configure modelopt.onnx quantization
Provide calibration data (numpy format)
Quantize ONNX model
Build TensorRT engine from quantized ONNX
Deploy with TensorRT runtime
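The quantization step itself can be sketched as follows with ModelOpt's ONNX API; the file names, input shape, and keyword arguments are illustrative assumptions about that API and should be checked against the installed ModelOpt release (in TAO Quant the same options come from the spec file).

```python
# Quantize an exported ONNX model for TensorRT deployment.
import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration data must be numpy arrays that match the model's input shape.
calib_data = np.random.rand(64, 3, 224, 224).astype(np.float32)

quantize(
    onnx_path="model.onnx",           # FP32 model exported from PyTorch
    calibration_data=calib_data,
    calibration_method="max",
    output_path="model_quant.onnx",   # TensorRT-ready quantized ONNX
    quantize_mode="int8",             # or "fp8" on GPUs with FP8 support
)
```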
3.3. Decision Flowchart#
1. Are you deploying to TensorRT for production?
   Yes → Use modelopt.onnx (requires calibration data; export to ONNX first if needed)
   No → Continue to question 2
2. Do you need to quantize activations (not just weights)?
   Yes → Use modelopt.pytorch (requires calibration data)
   No → Use torchao (weight-only, no calibration needed)
3.4. Common Patterns#
3.4.1. Experimentation to Production#
Many users follow this progression:
Start with torchao: Quick weight-only experiments to validate quantization feasibility
Move to modelopt.pytorch: Add activation quantization and calibration
Graduate to modelopt.onnx: Export to ONNX and quantize for TensorRT deployment
3.4.2. TensorRT-First Workflow#
If you know you’re targeting TensorRT:
Train and validate model in PyTorch
Export to ONNX using TAO's export command (classification_pyt export or rtdetr export)
Verify exported ONNX model accuracy
Use modelopt.onnx to quantize the ONNX model
Build TensorRT engine and deploy
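The verification step in this workflow can be as simple as running the exported ONNX model through ONNX Runtime and comparing its outputs (or validation metrics) against the original PyTorch model; the sketch below assumes a single 224x224 RGB input and placeholder file names.

```python
# Sanity-check an exported ONNX model with ONNX Runtime before quantizing.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

(logits,) = session.run(None, {input_name: batch})
print(logits.shape)  # compare against the PyTorch model's output / metrics
```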
3.5. Examples#
Quick weight-only experiment:
quantize:
backend: "torchao"
mode: "weight_only_ptq"
# Fast, no calibration needed
Full quantization prototyping:
quantize:
backend: "modelopt.pytorch"
mode: "static_ptq"
algorithm: "minmax"
# Requires calibration data
Production TensorRT deployment:
quantize:
backend: "modelopt.onnx"
mode: "static_ptq"
algorithm: "max"
# Best for TensorRT runtime performance