9. ModelOpt ONNX Backend (Static PTQ)#

9.1. Overview#

  • Static PTQ for ONNX model files

  • Quantizes both weights and activations (INT8/FP8)

  • Works exclusively with ONNX files specified via file path

  • Requires ModelOpt ONNX package to be installed

  • Calibration algorithm selection via the algorithm option (refer to the Supported Options section below)

9.2. Supported Options#

  • mode: static_ptq.

  • algorithm: minmax (calibration range from observed minimum/maximum values), max (range from the maximum absolute value), entropy (KL-divergence-based range selection). If unset, defaults to minmax. max is often recommended for NVIDIA TensorRT™ deployment.

  • weights.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.

  • activations.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.

  • default_layer_dtype / default_activation_dtype: Currently ignored by this backend; specify dtypes per layer.

  • skip_names: Exclude modules (ONNX nodes) from quantization; wildcard patterns such as "/head/*" are supported.

  • model_path: Path to ONNX model file (required, must have .onnx extension).

9.3. Key Differences From PyTorch Backend#

  • Input format: ONNX backend requires an ONNX file path via model_path, not a PyTorch model. Use TAO’s export command to generate the ONNX file.

  • Calibration data: Extracted from DataLoader and converted to numpy arrays for ONNX quantization.

  • Output format: Quantized model is saved as an ONNX file (quantized_model.onnx), not a PyTorch checkpoint.

  • Mixed precision ⚠️: Per-layer mixed precision is not supported. If multiple dtypes are specified, only the first layer’s dtype is applied globally, and a warning is issued.

  • Validation: API fails if a non-ONNX model file is provided (validates .onnx extension).

9.4. Calibration#

  • Provide a DataLoader via TAO’s evaluation configurations. The integration extracts data from the loader, converts it to numpy arrays, and passes it to ModelOpt ONNX. A configuration sketch follows this list.

  • Batches can be tensors, tuples (first element is input), or dicts with common keys (input, data, x, images, images_left, images_right).

  • If no calibration data is provided, ModelOpt generates dummy data, which typically reduces accuracy.
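
A minimal sketch of how calibration data might be supplied, assuming the quantize experiment spec reuses the task’s evaluation dataset schema (the dataset field names below mirror the RT-DETR evaluation configuration shown later in this section and are illustrative; the exact schema depends on the task):

quantize:
  model_path: "/path/to/model.onnx"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  # ... layer configuration as in the example config below ...
dataset:
  test_data_sources:                     # evaluation-style dataset block reused for calibration
    image_dir: "/path/to/calib/images"
    json_file: "/path/to/calib/annotations.json"
  batch_size: 8

During quantization, the integration iterates over the resulting DataLoader, converts each batch to numpy arrays, and passes them to the ModelOpt ONNX calibrator.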

9.5. Example Config#

quantize:
  model_path: "/path/to/model.onnx"  # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

Note: For ONNX models, module names refer to ONNX operator types (e.g., Conv, Gemm, MatMul) rather than PyTorch class names.

9.6. Outputs#

  • Saved artifact in results_dir named quantized_model.onnx, containing the quantized ONNX model with calibrated scales.

9.7. Important Notes and Limitations#

Warning

Critical Limitation: No Mixed-Precision Support

The ONNX backend does not support per-layer mixed precision. If you specify multiple layers with different dtypes (e.g., INT8 for some layers, FP8 for others), only the first dtype is applied globally to all quantized layers. A warning is issued, but quantization proceeds.

For fine-grained per-layer quantization control, use the modelopt.pytorch backend instead.

Input validation:

  • The backend fails if you provide a non-ONNX model file.

  • File must have a .onnx extension.

  • PyTorch models are not accepted (use TAO’s export command first).

Dtype mixing limitation:

  • Does not support mixed precision per layer.

  • If multiple layers specify different dtypes, only the first dtype is applied globally.

  • A warning is issued if mixed dtypes are detected.

  • For per-layer quantization control, use the modelopt.pytorch backend instead.

Calibration:

  • Calibration data is automatically extracted from the DataLoader and converted to numpy format.

  • If no calibration data is provided, ModelOpt generates dummy data (may reduce accuracy).

  • Calibration is critical for achieving good INT8 accuracy.

Model support:

  • Currently supports: classification_pyt and rtdetr models.

Advanced backend_kwargs: For advanced use cases, you can pass additional parameters to ModelOpt via backend_kwargs. Some common examples are listed below, followed by a configuration sketch:

  • use_external_data_format: Enable for models >2GB

  • op_types_to_quantize: List of ONNX op types to quantize (e.g., ["Conv", "Gemm"])

  • calibrate_per_node: Enable per-node calibration

  • high_precision_dtype: Keep certain operations in higher precision
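
As a sketch, these parameters could be supplied under backend_kwargs in the quantize section (the nesting and the values shown are assumptions for illustration; verify parameter names, types, and accepted values against the ModelOpt ONNX quantize API documentation):

quantize:
  model_path: "/path/to/model.onnx"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"
  backend_kwargs:
    use_external_data_format: true            # illustrative: for models larger than 2 GB
    op_types_to_quantize: ["Conv", "Gemm"]    # illustrative: restrict quantization to these op types
    calibrate_per_node: true                  # illustrative: enable per-node calibration
    high_precision_dtype: "fp16"              # illustrative: keep selected operations in higher precision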

For the complete list of available parameters and their descriptions, refer to the ModelOpt ONNX quantize API documentation.

9.8. Complete Workflow for TensorRT Deployment#

The modelopt.onnx backend is specifically designed for TensorRT deployment. Follow these steps for the complete workflow.

9.8.1. Step 1: Export PyTorch Model to ONNX#

Use TAO Toolkit’s built-in export functionality to convert your trained PyTorch model to ONNX format.

For classification models:

classification_pyt export -e export_config.yaml

For RT-DETR models:

rtdetr export -e export_config.yaml

The export command generates an ONNX file compatible with the modelopt.onnx quantization backend.

Note

TAO’s ONNX export automatically handles model-specific configurations (input/output names, dynamic axes, opset version). Refer to the TAO Toolkit export documentation for your specific task for configuration details: Classification Export or RT-DETR Export.
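
A minimal export configuration sketch, assuming commonly used TAO export fields (the field names and values below are illustrative; use the task-specific export documentation linked above as the authoritative reference):

export:
  checkpoint: "/path/to/trained_model.pth"    # trained PyTorch checkpoint (illustrative field name)
  onnx_file: "/path/to/model.onnx"            # output ONNX file; becomes model_path in the quantize config
  opset_version: 17                           # illustrative opset choice
results_dir: "/path/to/export_results"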

9.8.2. Step 2: Configure Quantization#

Create a quantization configuration file. Some key points are:

  • Use module_name: "*" for global quantization settings

  • Choose from fp8_e5m2, fp8_e4m3fn, or int8 for dtype

  • Use skip_names to exclude sensitive layers (e.g., heads, encoders, decoders)

quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                    # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"                # Global quantization
      weights:
        dtype: "fp8_e5m2"              # First dtype applies to all
      activations:
        dtype: "fp8_e5m2"
  skip_names:
    - "/head/*"                       # Skip detection/classification head
    - "/encoder/*"                    # Optional: skip if accuracy-sensitive
    - "/decoder/*"                    # Optional: skip if accuracy-sensitive

Important

The backend does not support mixing dtypes. If you specify multiple layer configurations with different dtypes, only the first dtype is used, and a warning is issued.

9.8.3. Step 3: Run Quantization#

Execute the quantization command for your task:

# For classification
classification_pyt quantize -e quantize_config.yaml

# For RT-DETR
rtdetr quantize -e quantize_config.yaml

This produces a quantized ONNX file: <results_dir>/quantized_model.onnx

9.8.4. Step 4: Build TensorRT Engine#

Use TAO Deploy’s gen_trt_engine API to convert the quantized ONNX model to a TensorRT engine:

For classification models:

classification_pyt gen_trt_engine -e gen_trt_engine_config.yaml

For RT-DETR models:

rtdetr gen_trt_engine -e gen_trt_engine_config.yaml

Sample configuration:

gen_trt_engine:
  onnx_file: "/path/to/quantized_model.onnx"
  trt_engine: "/path/to/output_engine.trt"
  tensorrt:
    data_type: FP16                # FP16/FP32 (quantization scales embedded in ONNX)
    workspace_size: 1024           # MB
    min_batch_size: 1
    opt_batch_size: 4
    max_batch_size: 8
results_dir: "/path/to/results"

Important

The gen_trt_engine API automatically detects QDQ-quantized ONNX models (generated by modelopt.onnx) and enables strongly-typed TensorRT mode, ensuring the embedded quantization scales are used correctly. No additional flags are required.

9.8.5. Step 5: Evaluate TensorRT Engine#

Use TAO Deploy’s evaluate API to measure accuracy:

# Evaluate baseline TensorRT engine (from unquantized ONNX)
rtdetr evaluate -e eval_baseline.yaml
# => mAP: 52.3%

# Evaluate quantized TensorRT engine
rtdetr evaluate -e eval_quantized.yaml
# => mAP: 51.8% (0.5% drop - acceptable!)

Sample evaluation configuration:

evaluate:
  trt_engine: "/path/to/quantized_engine.trt"
dataset:
  test_data_sources:
    image_dir: "/path/to/val/images"
    json_file: "/path/to/val/annotations.json"
  batch_size: 10
results_dir: "/path/to/results"

If there is an accuracy drop of >1-2%, iterate by adjusting skip_names to exclude more sensitive layers and rerun the quantization workflow.
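
For example, if the first iteration skipped only the head and the accuracy drop is still too large, a follow-up run might widen the exclusions (the patterns are illustrative; match them to the actual node names in your exported ONNX graph):

skip_names:
  - "/head/*"               # kept from the previous iteration
  - "/decoder/*"            # added: decoder proved accuracy-sensitive
  - "/encoder/layers.0/*"   # added: illustrative pattern for a single encoder layer

After updating skip_names, rerun Steps 3-5 and compare the results against the baseline engine.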

9.8.6. Step 6: Deploy With TensorRT#

The generated TensorRT engine is ready for production deployment. The quantized engine can be used directly in any TensorRT-compatible inference framework without additional configuration.

9.8.7. Expected Performance Gains#

With modelopt.onnx and TensorRT deployment:

  • FP8 (E5M2/E4M3FN): 2-5x speedup vs FP32, <1% accuracy drop

  • INT8: 2-4x speedup vs FP32, <2% accuracy drop with proper calibration

  • Memory: 2-4x reduction in model size

9.8.8. Typical Skip Patterns by Model Type#

Classification models:

skip_names: ["/classifier/*", "/fc"]

RT-DETR (object detection):

skip_names: ["/head/*", "/encoder/*", "/decoder/*"]

Adjust based on your accuracy requirements.