9. ModelOpt ONNX Backend (Static PTQ)

9.1. Overview

Static PTQ for ONNX model files
Quantizes both weights and activations (INT8/FP8)
Works exclusively with ONNX files specified via file path
Requires ModelOpt ONNX package to be installed
Select the calibration algorithm with the algorithm option (refer to the Supported Options section below)

9.2. Supported Options

mode: static_ptq.
algorithm: minmax (range via min/max), max (maximum range), entropy (KL divergence). If unset, defaults to minmax. max is often recommended for NVIDIA TensorRT™ deployment.
weights.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.
activations.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.
default_layer_dtype / default_activation_dtype: Currently ignored by this backend; set dtypes in the layers entries instead.
skip_names: Name patterns for modules to exclude from quantization.
model_path: Path to ONNX model file (required, must have .onnx extension).
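
The options above can be checked before a run is launched. The following is a minimal sketch, assuming the configuration has already been loaded into a plain Python dict; the function name check_quantize_config is hypothetical and is not part of TAO or ModelOpt.

```python
# Values taken from the option list above.
SUPPORTED_ALGORITHMS = {"minmax", "max", "entropy"}
SUPPORTED_DTYPES = {"int8", "fp8_e4m3fn", "fp8_e5m2", "native"}


def check_quantize_config(cfg):
    """Return a list of problems found in a quantize config dict (hypothetical helper)."""
    problems = []

    if cfg.get("mode") != "static_ptq":
        problems.append("mode must be 'static_ptq'")

    # The documented default when algorithm is unset is minmax.
    algorithm = cfg.get("algorithm") or "minmax"
    if algorithm not in SUPPORTED_ALGORITHMS:
        problems.append(f"unsupported algorithm: {algorithm!r}")

    # model_path is required and must carry the .onnx extension.
    model_path = cfg.get("model_path")
    if not model_path or not str(model_path).endswith(".onnx"):
        problems.append("model_path is required and must end in .onnx")

    # Each layer entry may set a dtype for weights and for activations.
    for layer in cfg.get("layers", []):
        for section in ("weights", "activations"):
            dtype = layer.get(section, {}).get("dtype")
            if dtype is not None and dtype not in SUPPORTED_DTYPES:
                name = layer.get("module_name")
                problems.append(f"{name}: unsupported {section} dtype {dtype!r}")

    return problems


good = {
    "mode": "static_ptq",
    "model_path": "/path/to/model.onnx",
    "layers": [{"module_name": "Conv", "weights": {"dtype": "int8"}}],
}
bad = {"mode": "static_ptq", "algorithm": "percentile", "model_path": "model.pth"}
print(check_quantize_config(good))
print(check_quantize_config(bad))
```

An empty list means the config uses only the documented values; the backend's own validation remains the authority.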

9.3. Key Differences From PyTorch Backend

Input format: ONNX backend requires an ONNX file path via model_path, not a PyTorch model. Use TAO’s export command to generate the ONNX file.
Calibration data: Extracted from DataLoader and converted to numpy arrays for ONNX quantization.
Output format: Quantized model is saved as an ONNX file (quantized_model.onnx), not a PyTorch checkpoint.
Mixed precision ⚠️: Not supported. If the layer entries specify different dtypes, only the first entry’s dtype is applied globally, and a warning is issued.
Validation: API fails if a non-ONNX model file is provided (validates .onnx extension).

9.4. Calibration

Provide a DataLoader via TAO’s evaluation configurations. The integration extracts data from the loader, converts it to numpy arrays, and passes it to ModelOpt ONNX.
Batches can be tensors, tuples (first element is input), or dicts with common keys (input, data, x, images, images_left, images_right).
If no calibration data is provided, ModelOpt generates dummy data, which may reduce accuracy.
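
The batch handling described above can be sketched as follows. This is an illustration only: extract_input is a hypothetical name, the key order is taken from the list above, and returning the first matching key is an assumption (the integration may treat paired keys such as images_left and images_right differently).

```python
# Dict keys the text lists as recognized input names, in the order given.
INPUT_KEYS = ("input", "data", "x", "images", "images_left", "images_right")


def extract_input(batch):
    """Pick the model input out of one DataLoader batch (hypothetical helper)."""
    if isinstance(batch, dict):
        for key in INPUT_KEYS:
            if key in batch:
                return batch[key]
        raise KeyError(f"no known input key in batch: {sorted(batch)}")
    if isinstance(batch, (tuple, list)):
        if not batch:
            raise ValueError("empty batch")
        # For tuple batches, the first element is the input.
        return batch[0]
    # Anything else is treated as the input tensor itself.
    return batch


print(extract_input(("image_tensor", "label_tensor")))
print(extract_input({"images": "image_tensor", "labels": "label_tensor"}))
```

In a real pipeline the returned object would be a tensor, which would then be converted to a numpy array (for example with numpy.asarray on a CPU tensor) before being handed to ModelOpt ONNX.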

9.5. Example Config

quantize:
  model_path: "/path/to/model.onnx"  # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

Note: For ONNX models, module names refer to ONNX operator types (e.g., Conv, Gemm, MatMul) rather than PyTorch class names.

9.6. Outputs

The quantized ONNX model, including its calibrated scales, is saved in results_dir as quantized_model.onnx.

9.7. Important Notes and Limitations

Warning

Critical Limitation: No Mixed-Precision Support

The ONNX backend does not support per-layer mixed precision. If you specify multiple layers with different dtypes (e.g., INT8 for some layers, FP8 for others), only the first dtype is applied globally to all quantized layers. A warning is issued, but quantization proceeds.

For fine-grained per-layer quantization control, use the modelopt.pytorch backend instead.

Input validation:

The backend fails if you provide a non-ONNX model file.
File must have a .onnx extension.
PyTorch models are not accepted (use TAO’s export command first).

Dtype mixing limitation:

Does not support mixed precision per layer.
If multiple layers specify different dtypes, only the first dtype is applied globally.
A warning is issued if mixed dtypes are detected.
For per-layer quantization control, use modelopt.pytorch backend instead.
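
The "first dtype wins" behavior can be pictured with a short sketch. It assumes the layers section has been loaded as a list of dicts; resolve_global_dtypes is a hypothetical name, and the backend's actual warning text and comparison logic are not documented here.

```python
import warnings


def resolve_global_dtypes(layers):
    """Return the (weights, activations) dtypes of the first layer entry.

    Emits a warning when a later entry asks for something different,
    mirroring the documented behavior (hypothetical helper).
    """
    if not layers:
        raise ValueError("at least one layer entry is required")

    def dtypes_of(layer):
        return (
            layer.get("weights", {}).get("dtype"),
            layer.get("activations", {}).get("dtype"),
        )

    chosen = dtypes_of(layers[0])
    for layer in layers[1:]:
        if dtypes_of(layer) != chosen:
            warnings.warn(
                "Mixed dtypes are not supported; applying the first entry's "
                f"dtypes {chosen} to all quantized layers.",
                stacklevel=2,
            )
            break
    return chosen


layers = [
    {"module_name": "Conv", "weights": {"dtype": "int8"}, "activations": {"dtype": "int8"}},
    {"module_name": "Gemm", "weights": {"dtype": "fp8_e4m3fn"}, "activations": {"dtype": "fp8_e4m3fn"}},
]
print(resolve_global_dtypes(layers))  # warns, then prints ('int8', 'int8')
```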

Calibration:

Calibration data is automatically extracted from the DataLoader and converted to numpy format.
If no calibration data is provided, ModelOpt generates dummy data (may reduce accuracy).
Calibration is critical for achieving good INT8 accuracy.
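
To see why calibration data matters, consider a simplified picture of max calibration. The sketch below assumes symmetric INT8 scaling to the range [-127, 127] and works on plain Python floats; it is an illustration of the idea, not ModelOpt's calibrator, and both function names are hypothetical.

```python
def max_calibration_scale(batches, qmax=127):
    """Scale derived from the largest absolute value seen during calibration."""
    amax = 0.0
    for batch in batches:
        for value in batch:
            amax = max(amax, abs(value))
    if amax == 0.0:
        raise ValueError("calibration data contains only zeros")
    return amax / qmax


def quantize(value, scale, qmax=127):
    """Round to the nearest integer step and clip to the INT8 range."""
    q = round(value / scale)
    return max(-qmax, min(qmax, q))


# Representative activations reach about 2.54 in magnitude.
real_scale = max_calibration_scale([[0.1, -2.54], [1.0, 0.5]])
# Dummy data confined to [0, 1] yields a scale that is too small.
dummy_scale = max_calibration_scale([[0.0, 1.0]])

for name, scale in (("real", real_scale), ("dummy", dummy_scale)):
    q = quantize(2.54, scale)
    print(name, "scale:", scale, "-> 2.54 reconstructs as", q * scale)
```

With the scale derived from representative data, 2.54 survives the round trip almost unchanged. With the dummy-data scale it is clipped to the top of the range and reconstructs as roughly 1.0, which is the kind of error that degrades accuracy.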

Model support:

Currently supports: classification-pyt and rtdetr models.

Advanced backend_kwargs: For advanced use cases, you can pass additional parameters via backend_kwargs. Some common examples are:

use_external_data_format: Enable for models >2GB
op_types_to_quantize: List of ONNX op types to quantize (e.g., ["Conv", "Gemm"])
calibrate_per_node: Enable per-node calibration
high_precision_dtype: Keep certain operations in higher precision

For the complete list of available parameters and their descriptions, refer to the ModelOpt ONNX quantize API documentation.

9.8. Complete Workflow for TensorRT Deployment

The modelopt.onnx backend is specifically designed for TensorRT deployment. Follow these steps for the complete workflow.

9.8.1. Step 1: Export PyTorch Model to ONNX

Ask the agent to run the model skill’s export action against your trained checkpoint. For example:

“Export my trained classification-pyt model to ONNX using export_config.yaml.”

(Substitute rtdetr for RT-DETR.) The export action produces an ONNX file compatible with the modelopt.onnx quantization backend.

Note

TAO’s ONNX export automatically handles model-specific configurations (input/output names, dynamic axes, opset version). Refer to the TAO Toolkit export documentation for your specific task for configuration details: Classification Export or RT-DETR Export.

9.8.2. Step 2: Configure Quantization

Create a quantization configuration file. Some key points are:

Use module_name: "*" for global quantization settings
Choose from fp8_e5m2, fp8_e4m3fn, or int8 for dtype
Use skip_names to exclude sensitive layers (e.g., heads, encoders, decoders)

quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                    # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"                # Global quantization
      weights:
        dtype: "fp8_e5m2"              # First dtype applies to all
      activations:
        dtype: "fp8_e5m2"
  skip_names:
    - "/head/*"                       # Skip detection/classification head
    - "/encoder/*"                    # Optional: skip if accuracy-sensitive
    - "/decoder/*"                    # Optional: skip if accuracy-sensitive

Important

The backend does not support mixing dtypes. If you specify multiple layer configurations with different dtypes, only the first dtype is used, and a warning is issued.

9.8.3. Step 3: Run Quantization

Ask the agent to run the model skill’s quantize action with your specification:

“Quantize my trained classification-pyt model using quantize_config.yaml.”

(Substitute rtdetr for RT-DETR.) This produces a quantized ONNX file at <results_dir>/quantized_model.onnx.

9.8.4. Step 4: Build TensorRT Engine

Run the model skill’s gen_trt_engine action against the quantized ONNX:

“Build a TensorRT engine from the quantized ONNX at <results_dir>/quantized_model.onnx using gen_trt_engine_config.yaml.”

Sample configuration:

gen_trt_engine:
  onnx_file: "/path/to/quantized_model.onnx"
  trt_engine: "/path/to/output_engine.trt"
  tensorrt:
    data_type: FP16                # FP16/FP32 (quantization scales embedded in ONNX)
    workspace_size: 1024           # MB
    min_batch_size: 1
    opt_batch_size: 4
    max_batch_size: 8
results_dir: "/path/to/results"

Important

The gen_trt_engine API automatically detects QDQ-quantized ONNX models (generated by modelopt.onnx) and enables strongly-typed TensorRT mode, ensuring the embedded quantization scales are used correctly. No additional flags are required.
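
How gen_trt_engine performs this detection is not described here. Since QDQ-quantized models contain QuantizeLinear and DequantizeLinear nodes, one plausible way to confirm that a file is QDQ-quantized before building an engine is sketched below. The helper names are hypothetical, and reading the file assumes the onnx Python package is installed.

```python
# Standard ONNX operator types that mark a QDQ-quantized graph.
QDQ_OP_TYPES = {"QuantizeLinear", "DequantizeLinear"}


def has_qdq_nodes(op_types):
    """Return True if any operator type in the iterable is a Q/DQ node."""
    return any(op in QDQ_OP_TYPES for op in op_types)


def op_types_in_file(onnx_path):
    """List the operator type of every node in an ONNX file."""
    import onnx  # imported lazily so has_qdq_nodes works without the package

    model = onnx.load(onnx_path)
    return [node.op_type for node in model.graph.node]


print(has_qdq_nodes(["Conv", "Relu", "Gemm"]))
print(has_qdq_nodes(["QuantizeLinear", "DequantizeLinear", "Conv"]))
```

For a real file, has_qdq_nodes(op_types_in_file("/path/to/quantized_model.onnx")) would be expected to return True after a successful quantization run.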

9.8.5. Step 5: Evaluate TensorRT Engine

Run the model skill’s evaluate action against each engine to compare accuracy:

“Evaluate my baseline TensorRT engine using eval_baseline.yaml.” → mAP: 52.3%

“Evaluate my quantized TensorRT engine using eval_quantized.yaml.” → mAP: 51.8% (a drop of 0.5 percentage points, acceptable).

Sample evaluation configuration:

evaluate:
  trt_engine: "/path/to/quantized_engine.trt"
dataset:
  test_data_sources:
    image_dir: "/path/to/val/images"
    json_file: "/path/to/val/annotations.json"
  batch_size: 10
results_dir: "/path/to/results"

If accuracy drops by more than 1-2%, adjust skip_names to exclude more sensitive layers and rerun the quantization workflow.
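
The accept-or-iterate decision can be written down as a small check. This sketch uses the mAP values from the example above and treats the threshold as percentage points; the function names and the default tolerance of 1.0 are assumptions chosen for illustration.

```python
def accuracy_drop(baseline_metric, quantized_metric):
    """Drop in percentage points between the baseline and quantized engines."""
    return baseline_metric - quantized_metric


def needs_more_skips(baseline_metric, quantized_metric, tolerance=1.0):
    """True when the drop exceeds the tolerance and another iteration is warranted."""
    return accuracy_drop(baseline_metric, quantized_metric) > tolerance


baseline_map, quantized_map = 52.3, 51.8  # values from the example above
print(round(accuracy_drop(baseline_map, quantized_map), 2))
print(needs_more_skips(baseline_map, quantized_map))
```

When needs_more_skips returns True, extend skip_names, rerun quantization, rebuild the engine, and evaluate again.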

9.8.6. Step 6: Deploy With TensorRT

The generated TensorRT engine is ready for production deployment. The quantized engine can be used directly in any TensorRT-compatible inference framework without additional configuration.

9.8.7. Expected Performance Gains

With modelopt.onnx and TensorRT deployment:

FP8 (E5M2/E4M3FN): 2-5x speedup vs FP32, <1% accuracy drop
INT8: 2-4x speedup vs FP32, <2% accuracy drop with proper calibration
Memory: 2-4x reduction in model size
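
The 2-4x size range is consistent with simple storage arithmetic, under the assumption that it compares 8-bit weights against an FP16 or FP32 baseline. The sketch below counts weight bytes only; it ignores quantization scale metadata and any layers left in higher precision through skip_names, so real files will differ somewhat. The helper names and the parameter count are illustrative.

```python
# Storage cost per weight value, in bytes.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1, "fp8": 1}


def weight_bytes(num_params, dtype):
    """Bytes needed to store num_params weights in the given dtype."""
    return num_params * BYTES_PER_VALUE[dtype]


def size_reduction(num_params, baseline_dtype, quantized_dtype):
    """Ratio of baseline weight storage to quantized weight storage."""
    return weight_bytes(num_params, baseline_dtype) / weight_bytes(num_params, quantized_dtype)


params = 25_000_000  # illustrative parameter count
print(size_reduction(params, "fp32", "int8"))  # 4.0
print(size_reduction(params, "fp16", "fp8"))   # 2.0
```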

9.8.8. Typical Skip Patterns by Model Type

Classification models:

skip_names: ["/classifier/*", "/fc"]

RT-DETR (object detection):

skip_names: ["/head/*", "/encoder/*", "/decoder/*"]

Adjust based on your accuracy requirements.
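
To preview which nodes a set of patterns would exclude, the patterns can be tried against node names offline. The sketch below assumes glob-style matching, as the "*" wildcards suggest; the backend's exact matching rule is not specified here, and the node names are made up for illustration.

```python
from fnmatch import fnmatchcase


def is_skipped(node_name, skip_names):
    """True if the node name matches any skip pattern (glob-style, case-sensitive)."""
    return any(fnmatchcase(node_name, pattern) for pattern in skip_names)


# Hypothetical node names in the style of an exported ONNX graph.
node_names = [
    "/backbone/conv1/Conv",
    "/encoder/layer0/MatMul",
    "/head/fc/Gemm",
]
skip_names = ["/head/*", "/encoder/*", "/decoder/*"]

quantized = [name for name in node_names if not is_skipped(name, skip_names)]
print(quantized)  # ['/backbone/conv1/Conv']
```

Under this matching rule, a pattern without a wildcard, such as "/fc", matches only a node with exactly that name.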

9.9. External Links

NVIDIA ModelOpt (TensorRT Model Optimizer): NVIDIA/TensorRT-Model-Optimizer.
ONNX Runtime: https://onnxruntime.ai/.