9. ModelOpt ONNX Backend (Static PTQ)#

9.1. Overview#

  • Static PTQ for ONNX model files

  • Quantizes both weights and activations (INT8/FP8)

  • Works exclusively with ONNX files specified via file path

  • Requires ModelOpt ONNX package to be installed

  • Calibration algorithm selection via the algorithm option (refer to the Supported Options section below)

9.2. Supported Options#

  • mode: static_ptq.

  • algorithm: minmax (calibration range from observed minimum/maximum values), max (range from the maximum absolute value), entropy (KL-divergence-based range selection). If unset, defaults to minmax. max is often recommended for NVIDIA TensorRT™ deployment.

  • weights.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.

  • activations.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.

  • default_layer_dtype / default_activation_dtype: Currently ignored by this backend; specify dtypes per layer.

  • skip_names: Exclude modules (ONNX nodes) from quantization; wildcard patterns such as "/head/*" are supported.

  • model_path: Path to ONNX model file (required, must have .onnx extension).

9.3. Key Differences From PyTorch Backend#

  • Input format: ONNX backend requires an ONNX file path via model_path, not a PyTorch model. Use TAO’s export command to generate the ONNX file.

  • Calibration data: Extracted from DataLoader and converted to numpy arrays for ONNX quantization.

  • Output format: Quantized model is saved as an ONNX file (quantized_model.onnx), not a PyTorch checkpoint.

  • Mixed precision ⚠️: Per-layer mixed precision is not supported. If multiple dtypes are specified, only the first layer’s dtype is applied globally, and a warning is issued.

  • Validation: API fails if a non-ONNX model file is provided (validates .onnx extension).

9.4. Calibration#

  • Provide a DataLoader via TAO’s evaluation configurations. The integration extracts data from the loader, converts it to numpy arrays, and passes it to ModelOpt ONNX. A configuration sketch follows this list.

  • Batches can be tensors, tuples (first element is input), or dicts with common keys (input, data, x, images, images_left, images_right).

  • If no calibration data is provided, ModelOpt generates dummy data, which typically reduces accuracy.
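
A minimal sketch of how calibration data might be supplied, assuming the quantize experiment spec reuses the task’s evaluation dataset schema (the dataset field names below mirror the RT-DETR evaluation configuration shown later in this section and are illustrative; the exact schema depends on the task):

quantize:
  model_path: "/path/to/model.onnx"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  # ... layer configuration as in the example config below ...
dataset:
  test_data_sources:                     # evaluation-style dataset block reused for calibration
    image_dir: "/path/to/calib/images"
    json_file: "/path/to/calib/annotations.json"
  batch_size: 8

During quantization, the integration iterates over the resulting DataLoader, converts each batch to numpy arrays, and passes them to the ModelOpt ONNX calibrator.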

9.5. Example Config#

quantize:
  model_path: "/path/to/model.onnx"  # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

Note: For ONNX models, module names refer to ONNX operator types (e.g., Conv, Gemm, MatMul) rather than PyTorch class names.

9.6. Outputs#

  • Saved artifact in results_dir named quantized_model.onnx, containing the quantized ONNX model with calibrated scales.

9.7. Important Notes and Limitations#

Warning

Critical Limitation: No Mixed-Precision Support

The ONNX backend does not support per-layer mixed precision. If you specify multiple layers with different dtypes (e.g., INT8 for some layers, FP8 for others), only the first dtype is applied globally to all quantized layers. A warning is issued, but quantization proceeds.

For fine-grained per-layer quantization control, use the modelopt.pytorch backend instead.

Input validation:

  • The backend fails if you provide a non-ONNX model file.

  • File must have a .onnx extension.

  • PyTorch models are not accepted (use TAO’s export command first).

Dtype mixing limitation:

  • Does not support mixed precision per layer.

  • If multiple layers specify different dtypes, only the first dtype is applied globally.

  • A warning is issued if mixed dtypes are detected.

  • For per-layer quantization control, use the modelopt.pytorch backend instead.

Calibration:

  • Calibration data is automatically extracted from the DataLoader and converted to numpy format.

  • If no calibration data is provided, ModelOpt generates dummy data (may reduce accuracy).

  • Calibration is critical for achieving good INT8 accuracy.

Model support:

  • Currently supports: classification_pyt and rtdetr models.

Advanced backend_kwargs: For advanced use cases, you can pass additional parameters to ModelOpt via backend_kwargs. Some common examples are listed below, followed by a configuration sketch:

  • use_external_data_format: Enable for models >2GB

  • op_types_to_quantize: List of ONNX op types to quantize (e.g., ["Conv", "Gemm"])

  • calibrate_per_node: Enable per-node calibration

  • high_precision_dtype: Keep certain operations in higher precision
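
As a sketch, these parameters could be supplied under backend_kwargs in the quantize section (the nesting and the values shown are assumptions for illustration; verify parameter names, types, and accepted values against the ModelOpt ONNX quantize API documentation):

quantize:
  model_path: "/path/to/model.onnx"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"
  backend_kwargs:
    use_external_data_format: true            # illustrative: for models larger than 2 GB
    op_types_to_quantize: ["Conv", "Gemm"]    # illustrative: restrict quantization to these op types
    calibrate_per_node: true                  # illustrative: enable per-node calibration
    high_precision_dtype: "fp16"              # illustrative: keep selected operations in higher precision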

For the complete list of available parameters and their descriptions, refer to the ModelOpt ONNX quantize API documentation.

9.8. Complete Workflow for TensorRT Deployment#

The modelopt.onnx backend is specifically designed for TensorRT deployment. Follow these steps for the complete workflow.

9.8.1. Step 1: Export PyTorch Model to ONNX#

Use TAO Toolkit’s built-in export functionality to convert your trained PyTorch model to ONNX format.

For classification models:

classification_pyt export -e export_config.yaml

For RT-DETR models:

rtdetr export -e export_config.yaml

The export command generates an ONNX file compatible with the modelopt.onnx quantization backend.

Note

TAO’s ONNX export automatically handles model-specific configurations (input/output names, dynamic axes, opset version). Refer to the TAO Toolkit export documentation for your specific task for configuration details: Classification Export or RT-DETR Export.
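
A minimal export configuration sketch, assuming commonly used TAO export fields (the field names and values below are illustrative; use the task-specific export documentation linked above as the authoritative reference):

export:
  checkpoint: "/path/to/trained_model.pth"    # trained PyTorch checkpoint (illustrative field name)
  onnx_file: "/path/to/model.onnx"            # output ONNX file; becomes model_path in the quantize config
  opset_version: 17                           # illustrative opset choice
results_dir: "/path/to/export_results"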

9.8.2. Step 2: Configure Quantization#

Create a quantization configuration file. Some key points are:

  • Use module_name: "*" for global quantization settings

  • Choose from fp8_e5m2, fp8_e4m3fn, or int8 for dtype

  • Use skip_names to exclude sensitive layers (e.g., heads, encoders, decoders)

quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                    # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"                # Global quantization
      weights:
        dtype: "fp8_e5m2"              # First dtype applies to all
      activations:
        dtype: "fp8_e5m2"
  skip_names:
    - "/head/*"                       # Skip detection/classification head
    - "/encoder/*"                    # Optional: skip if accuracy-sensitive
    - "/decoder/*"                    # Optional: skip if accuracy-sensitive

Important

The backend does not support mixing dtypes. If you specify multiple layer configurations with different dtypes, only the first dtype is used, and a warning is issued.

9.8.3. Step 3: Run Quantization#

Execute the quantization command for your task:

# For classification
classification_pyt quantize -e quantize_config.yaml

# For RT-DETR
rtdetr quantize -e quantize_config.yaml

This produces a quantized ONNX file: <results_dir>/quantized_model.onnx

9.8.4. Step 4: Build TensorRT Engine#

Use TAO Deploy’s gen_trt_engine API to convert the quantized ONNX model to a TensorRT engine:

For classification models:

classification_pyt gen_trt_engine -e gen_trt_engine_config.yaml

For RT-DETR models:

rtdetr gen_trt_engine -e gen_trt_engine_config.yaml

Sample configuration:

gen_trt_engine:
  onnx_file: "/path/to/quantized_model.onnx"
  trt_engine: "/path/to/output_engine.trt"
  tensorrt:
    data_type: FP16                # FP16/FP32 (quantization scales embedded in ONNX)
    workspace_size: 1024           # MB
    min_batch_size: 1
    opt_batch_size: 4
    max_batch_size: 8
results_dir: "/path/to/results"

Important

The gen_trt_engine API automatically detects QDQ-quantized ONNX models (generated by modelopt.onnx) and enables strongly-typed TensorRT mode, ensuring the embedded quantization scales are used correctly. No additional flags are required.

9.8.5. Step 5: Evaluate TensorRT Engine#

Use TAO Deploy’s evaluate API to measure accuracy:

# Evaluate baseline TensorRT engine (from unquantized ONNX)
rtdetr evaluate -e eval_baseline.yaml
# => mAP: 52.3%

# Evaluate quantized TensorRT engine
rtdetr evaluate -e eval_quantized.yaml
# => mAP: 51.8% (0.5% drop - acceptable!)

Sample evaluation configuration:

evaluate:
  trt_engine: "/path/to/quantized_engine.trt"
dataset:
  test_data_sources:
    image_dir: "/path/to/val/images"
    json_file: "/path/to/val/annotations.json"
  batch_size: 10
results_dir: "/path/to/results"

If there is an accuracy drop of >1-2%, iterate by adjusting skip_names to exclude more sensitive layers and rerun the quantization workflow.
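
For example, if the first iteration skipped only the head and the accuracy drop is still too large, a follow-up run might widen the exclusions (the patterns are illustrative; match them to the actual node names in your exported ONNX graph):

skip_names:
  - "/head/*"               # kept from the previous iteration
  - "/decoder/*"            # added: decoder proved accuracy-sensitive
  - "/encoder/layers.0/*"   # added: illustrative pattern for a single encoder layer

After updating skip_names, rerun Steps 3-5 and compare the results against the baseline engine.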

9.8.6. Step 6: Deploy With TensorRT#

The generated TensorRT engine is ready for production deployment. The quantized engine can be used directly in any TensorRT-compatible inference framework without additional configuration.

9.8.7. Expected Performance Gains#

With modelopt.onnx and TensorRT deployment:

  • FP8 (E5M2/E4M3FN): 2-5x speedup vs FP32, <1% accuracy drop

  • INT8: 2-4x speedup vs FP32, <2% accuracy drop with proper calibration

  • Memory: 2-4x reduction in model size

9.8.8. Typical Skip Patterns by Model Type#

Classification models:

skip_names: ["/classifier/*", "/fc"]

RT-DETR (object detection):

skip_names: ["/head/*", "/encoder/*", "/decoder/*"]

Adjust based on your accuracy requirements.