3. Choosing a Backend

TAO Quant provides three quantization backends, each optimized for specific use cases and deployment targets. This guide helps you select the right backend for your needs.

3.1. Backend Comparison

Backend          | Input Format   | Output Format  | Best Use Case       | NVIDIA TensorRT™ Ready
-----------------|----------------|----------------|---------------------|-----------------------
torchao          | PyTorch (.pth) | PyTorch (.pth) | Quick experiments   | No
modelopt.pytorch | PyTorch (.pth) | PyTorch (.pth) | Prototyping         | Partial
modelopt.onnx    | ONNX (.onnx)   | ONNX (.onnx)   | Production TensorRT | Yes

3.2. Detailed Breakdown

3.2.1. torchao

Purpose: Weight-only post-training quantization for PyTorch models.

When to use:

  • Quick quantization experiments without calibration

  • Minimal setup and configuration needed

  • When you want to preserve activation precision

Strengths:

  • Simplest to configure and run

  • No calibration data required

  • Often minimal accuracy impact

  • Broad model compatibility

Limitations:

  • Weight-only (activations remain FP32)

  • Modest speedups and compression

  • Runtime gains depend on kernel support

  • Not optimized for TensorRT deployment

Typical workflow (see the sketch after this list):

  1. Load trained PyTorch model

  2. Configure weight quantization (INT8 or FP8)

  3. Quantize and save

  4. Evaluate in PyTorch runtime
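
For orientation, here is a minimal standalone sketch of this workflow using the torchao library's Python API directly, rather than the TAO spec file. The model and checkpoint path are placeholders; substitute your own trained network.

import torch
from torchao.quantization import quantize_, int8_weight_only
from torchvision.models import resnet18

# 1. Load a trained PyTorch model (placeholder; use your own).
model = resnet18(weights=None)
# model.load_state_dict(torch.load("trained_model.pth"))  # your checkpoint here
model.eval()

# 2.-3. Weight-only INT8 PTQ, applied in place. No calibration data
# is needed: only weights are converted, activations stay in FP32.
quantize_(model, int8_weight_only())

# 4. Save, then evaluate in the PyTorch runtime as usual.
torch.save(model.state_dict(), "model_int8.pth")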

3.2.2. modelopt.pytorch

Purpose: Static PTQ with calibration for PyTorch models (weights + activations).

When to use:

  • Experimenting with full quantization (weights and activations)

  • Prototyping quantization strategies before ONNX export

  • When you need calibration but want to stay in PyTorch

Strengths:

  • Quantizes both weights and activations

  • Supports calibration algorithms (minmax, entropy)

  • Fine-grained per-layer control

  • Good for accuracy validation

Limitations:

  • Uses fake-quant operations in the PyTorch runtime

  • Limited speedups in PyTorch (the focus is accurate quantization scales, not fast kernels)

  • Not fully optimized for TensorRT deployment

  • Requires calibration data

Typical workflow (see the sketch after this list):

  1. Load trained PyTorch model

  2. Configure quantization with calibration data

  3. Calibrate and quantize

  4. Evaluate accuracy in PyTorch

  5. Optionally export to ONNX
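
A minimal sketch of the same flow using the ModelOpt PyTorch API directly; the model and calibration data here are illustrative stand-ins (TAO Quant drives this through the spec file instead).

import torch
import modelopt.torch.quantization as mtq
from torchvision.models import resnet18

# 1. Load a trained model (placeholder; use your own).
model = resnet18(weights=None).eval()

# 2. Calibration data: a small, representative sample of inputs
#    (random tensors here purely for illustration).
calib_batches = [torch.randn(8, 3, 224, 224) for _ in range(16)]

def forward_loop(m):
    # ModelOpt calls this to collect activation statistics,
    # e.g. min/max ranges for the "minmax" algorithm.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

# 3. Calibrate and quantize weights + activations. Fake-quant ops
#    are inserted, so expect an accuracy signal here, not speedups.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# 4.-5. Evaluate accuracy in PyTorch; optionally export to ONNX.

3.2.3. modelopt.onnx

Purpose: Static PTQ with calibration for ONNX models, producing quantized ONNX that is ready for TensorRT.

When to use:

  • Production deployment on TensorRT

  • When the model is already in ONNX, or will be exported there anyway

Like modelopt.pytorch, it requires calibration data. See Section 3.4.2 for the end-to-end workflow.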

3.3. Decision Flowchart

  1. Are you deploying to TensorRT for production?

    • Yes → Use modelopt.onnx (requires calibration data; export to ONNX first if needed)

    • No → Continue to question 2

  2. Do you need to quantize activations (not just weights)?

    • Yes → Use modelopt.pytorch (requires calibration data)

    • No → Use torchao (weight-only, no calibration needed)
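
The same two questions, condensed into a small Python helper for reference (the function and parameter names are mine, not part of TAO Quant):

def choose_backend(deploying_to_tensorrt: bool, quantize_activations: bool) -> str:
    # Question 1: production TensorRT deployment?
    if deploying_to_tensorrt:
        return "modelopt.onnx"     # requires calibration data
    # Question 2: quantize activations, not just weights?
    if quantize_activations:
        return "modelopt.pytorch"  # requires calibration data
    return "torchao"               # weight-only, no calibration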

3.4. Common Patterns

3.4.1. Experimentation to Production

Many users follow this progression:

  1. Start with torchao: Quick weight-only experiments to validate quantization feasibility

  2. Move to modelopt.pytorch: Add activation quantization and calibration

  3. Graduate to modelopt.onnx: Export to ONNX and quantize for TensorRT deployment

3.4.2. TensorRT-First Workflow

If you know you’re targeting TensorRT:

  1. Train and validate model in PyTorch

  2. Export to ONNX using TAO’s export command (classification_pyt export or rtdetr export)

  3. Verify exported ONNX model accuracy

  4. Use modelopt.onnx to quantize the ONNX model (sketched after this list)

  5. Build TensorRT engine and deploy
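
Steps 4-5 might look like the following with the ModelOpt ONNX Python API. Parameter names follow the ModelOpt documentation as best I know it, and the input name and shape are placeholders, so verify both against your exported model.

import numpy as np
from modelopt.onnx.quantization import quantize

# Calibration inputs keyed by ONNX input name (placeholder name
# and shape; match your exported model's input signature).
calib = {"input": np.random.rand(64, 3, 224, 224).astype(np.float32)}

# Static INT8 PTQ on the exported ONNX graph; "max" matches the
# algorithm used in the spec-file example in Section 3.5 below.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    calibration_data=calib,
    calibration_method="max",
    output_path="model.quant.onnx",
)

# Step 5: build the TensorRT engine from model.quant.onnx,
# e.g. trtexec --onnx=model.quant.onnx --int8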

3.5. Examples

Quick weight-only experiment:

quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  # Fast, no calibration needed

Full quantization prototyping:

quantize:
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  # Requires calibration data

Production TensorRT deployment:

quantize:
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"
  # Best for TensorRT runtime performance