3. Choosing a Backend#
TAO Quant provides three quantization backends, each optimized for specific use cases and deployment targets. This guide helps you select the right backend for your needs.
3.1. Backend Comparison#
| Backend | Input Format | Output Format | Best Use Case | NVIDIA TensorRT™ Ready |
|---|---|---|---|---|
| torchao | PyTorch (.pth) | PyTorch (.pth) | Quick experiments | No |
| modelopt.pytorch | PyTorch (.pth) | PyTorch (.pth) | Prototyping | Partial |
| modelopt.onnx | ONNX (.onnx) | ONNX (.onnx) | Production TensorRT | Yes |
3.2. Detailed Breakdown#
3.2.1. torchao#
Purpose: Weight-only post-training quantization for PyTorch models.
When to use:
Quick quantization experiments without calibration
Minimal setup and configuration needed
When you want to preserve activation precision
Strengths:
Simplest to configure and run
No calibration data required
Often minimal accuracy impact
Broad model compatibility
Limitations:
Weight-only (activations remain FP32)
Modest speedups and compression
Runtime gains depend on kernel support
Not optimized for TensorRT deployment
Typical workflow:
Load trained PyTorch model
Configure weight quantization (INT8 or FP8)
Quantize and save
Evaluate in PyTorch runtime
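For reference, the sketch below shows roughly what this workflow looks like when driving the torchao library directly from Python; the toy model, tensor shapes, and output file name are placeholders, and in TAO Quant the same steps are configured through the quantize spec rather than written by hand.

```python
# Minimal weight-only PTQ sketch using the torchao library directly.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# A toy classifier stands in for your trained model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Swap Linear weights for INT8 weight-only quantized versions, in place.
# Activations are untouched, so no calibration data is required.
quantize_(model, int8_weight_only())

# Evaluate in the PyTorch runtime, then save.
with torch.inference_mode():
    _ = model(torch.randn(4, 512))
torch.save(model.state_dict(), "model_w8.pth")
```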
3.2.2. modelopt.pytorch#
Purpose: Static PTQ with calibration for PyTorch models (weights + activations).
When to use:
Experimenting with full quantization (weights and activations)
Prototyping quantization strategies before ONNX export
When you need calibration but want to stay in PyTorch
Strengths:
Quantizes both weights and activations
Supports calibration algorithms (minmax, entropy)
Fine-grained per-layer control
Good for accuracy validation
Limitations:
Uses fake-quant operations in the PyTorch runtime
Limited speedups in PyTorch (the focus is on accurate calibration scales, not fast kernels)
Not fully optimized for TensorRT deployment
Requires calibration data
Typical workflow:
Load trained PyTorch model
Configure quantization with calibration data
Calibrate and quantize
Evaluate accuracy in PyTorch
Optionally export to ONNX
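As a rough illustration of this flow outside of TAO, the sketch below drives NVIDIA ModelOpt's PyTorch quantization API directly; the toy model, random calibration batches, and the INT8 default config are placeholder assumptions, and the exact API surface may differ between ModelOpt releases.

```python
# Static PTQ sketch with calibration using ModelOpt's PyTorch API.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy model and random batches stand in for a trained network and a real
# calibration set (a few hundred representative samples in practice).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
calib_batches = [torch.randn(8, 128) for _ in range(32)]

def forward_loop(m):
    # ModelOpt runs this loop so the inserted quantizers can record
    # activation ranges and compute scales.
    for batch in calib_batches:
        m(batch)

# Insert fake-quant nodes for weights and activations, then calibrate.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# Accuracy can now be validated in PyTorch before any ONNX export.
```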
3.2.3. modelopt.onnx (Recommended for TensorRT)#
Purpose: Static PTQ with calibration for ONNX models, optimized for TensorRT deployment.
When to use:
Production deployments targeting TensorRT
Maximum runtime performance is critical
When you have an ONNX-exported model
When the final deployment is on NVIDIA GPUs
Strengths:
Best TensorRT runtime speedups (up to 3-5x for FP8, 2-4x for INT8)
Optimized for TensorRT kernels and hardware acceleration
Produces TensorRT-ready quantized ONNX models
Clean deployment path: ONNX → TensorRT engine
Supports INT8 and FP8 (E4M3FN, E5M2) quantization
Full weight + activation quantization
Limitations:
Requires ONNX model as input (must export first)
Does not support mixed precision per layer (first layer’s dtype applies globally)
Only works with ONNX models (not PyTorch)
Calibration data must be numpy arrays
Why preferred for TensorRT:
When you quantize an ONNX model with modelopt.onnx and then build a TensorRT engine, you get:
Native TensorRT quantization nodes that map directly to hardware instructions
Optimized kernel fusion that combines quantized operations
Full GPU utilization with INT8/FP8 Tensor Cores
Memory bandwidth savings from smaller data types
Validated calibration scales that transfer correctly to TensorRT
This results in real measured speedups (not just theoretical speedups), typically:
FP8: 2-5x faster inference vs FP32
INT8: 2-4x faster inference vs FP32
Memory: 2-4x reduction in model size
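For concreteness, one way to build such an engine from an already-quantized ONNX file is sketched below using the TensorRT Python API (running trtexec with --onnx and --saveEngine is an equivalent command-line route); the file names are placeholders and the builder flags can vary by TensorRT version.

```python
# Build a TensorRT engine from a Q/DQ-quantized ONNX model.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_quant.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# The Q/DQ nodes in the quantized ONNX already carry calibration scales,
# so no separate TensorRT calibrator is needed for explicit quantization.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```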
Typical workflow:
Export trained PyTorch model to ONNX
Configure modelopt.onnx quantization
Provide calibration data (numpy format)
Quantize ONNX model
Build TensorRT engine from quantized ONNX
Deploy with TensorRT runtime
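The quantization step itself can be sketched as follows with ModelOpt's ONNX API; the file names, input shape, and keyword arguments are illustrative assumptions about that API and should be checked against the installed ModelOpt release (in TAO Quant the same options come from the spec file).

```python
# Quantize an exported ONNX model for TensorRT deployment.
import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration data must be numpy arrays that match the model's input shape.
calib_data = np.random.rand(64, 3, 224, 224).astype(np.float32)

quantize(
    onnx_path="model.onnx",           # FP32 model exported from PyTorch
    calibration_data=calib_data,
    calibration_method="max",
    output_path="model_quant.onnx",   # TensorRT-ready quantized ONNX
    quantize_mode="int8",             # or "fp8" on GPUs with FP8 support
)
```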
3.3. Decision Flowchart#
1. Are you deploying to TensorRT for production?
   Yes → Use modelopt.onnx (requires calibration data; export to ONNX first if needed)
   No → Continue to question 2
2. Do you need to quantize activations (not just weights)?
   Yes → Use modelopt.pytorch (requires calibration data)
   No → Use torchao (weight-only, no calibration needed)
3.4. Common Patterns#
3.4.1. Experimentation to Production#
Many users follow this progression:
Start with torchao: Quick weight-only experiments to validate quantization feasibility
Move to modelopt.pytorch: Add activation quantization and calibration
Graduate to modelopt.onnx: Export to ONNX and quantize for TensorRT deployment
3.4.2. TensorRT-First Workflow#
If you know you’re targeting TensorRT:
Train and validate model in PyTorch
Export to ONNX using TAO's export command (classification_pyt export or rtdetr export)
Verify exported ONNX model accuracy
Use modelopt.onnx to quantize the ONNX model
Build TensorRT engine and deploy
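The verification step in this workflow can be as simple as running the exported ONNX model through ONNX Runtime and comparing its outputs (or validation metrics) against the original PyTorch model; the sketch below assumes a single 224x224 RGB input and placeholder file names.

```python
# Sanity-check an exported ONNX model with ONNX Runtime before quantizing.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

(logits,) = session.run(None, {input_name: batch})
print(logits.shape)  # compare against the PyTorch model's output / metrics
```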
3.5. Examples#
Quick weight-only experiment:
quantize:
backend: "torchao"
mode: "weight_only_ptq"
# Fast, no calibration needed
Full quantization prototyping:
quantize:
backend: "modelopt.pytorch"
mode: "static_ptq"
algorithm: "minmax"
# Requires calibration data
Production TensorRT deployment:
quantize:
backend: "modelopt.onnx"
mode: "static_ptq"
algorithm: "max"
# Best for TensorRT runtime performance