9. ModelOpt ONNX Backend (Static PTQ)

9.1. Overview

  • Static PTQ for ONNX model files

  • Quantizes both weights and activations (INT8/FP8)

  • Works exclusively with ONNX files specified via file path

  • Requires ModelOpt ONNX package to be installed

  • Calibration algorithm is selected via the algorithm option (refer to the Supported Options section below)

9.2. Supported Options

  • mode: static_ptq.

  • algorithm: minmax (range via min/max), max (maximum range), entropy (KL divergence). If unset, defaults to minmax. max is often recommended for NVIDIA TensorRT™ deployment.

  • weights.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.

  • activations.dtype: int8, fp8_e4m3fn, fp8_e5m2, native.

  • default_layer_dtype / default_activation_dtype: Currently ignored by this backend; set dtypes in the layers entries instead.

  • skip_names: Names or patterns of modules to exclude from quantization.

  • model_path: Path to ONNX model file (required, must have .onnx extension).
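
The options above can be checked before a run is launched. The following is a minimal sketch, not TAO's own validation logic: check_quantize_config is a hypothetical helper, and the allowed values are simply copied from the list above.

```python
from __future__ import annotations

from pathlib import Path

# Allowed values copied from the option list above.
ALGORITHMS = {"minmax", "max", "entropy"}
DTYPES = {"int8", "fp8_e4m3fn", "fp8_e5m2", "native"}


def check_quantize_config(cfg: dict) -> list[str]:
    """Return the problems found in a `quantize` config dict (hypothetical helper)."""
    problems = []
    if cfg.get("mode") != "static_ptq":
        problems.append("mode must be 'static_ptq'")
    # The algorithm defaults to minmax when it is not set.
    if cfg.get("algorithm", "minmax") not in ALGORITHMS:
        problems.append(f"unsupported algorithm: {cfg['algorithm']!r}")
    model_path = cfg.get("model_path")
    if not model_path or Path(model_path).suffix != ".onnx":
        problems.append("model_path is required and must end in .onnx")
    for layer in cfg.get("layers", []):
        for section in ("weights", "activations"):
            dtype = layer.get(section, {}).get("dtype")
            if dtype is not None and dtype not in DTYPES:
                problems.append(
                    f"{layer.get('module_name')}: unsupported {section} dtype {dtype!r}"
                )
    return problems
```

An empty list means the dict passed these checks; it does not guarantee that the backend accepts the configuration.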

9.3. Key Differences From PyTorch Backend

  • Input format: ONNX backend requires an ONNX file path via model_path, not a PyTorch model. Use TAO’s export command to generate the ONNX file.

  • Calibration data: Extracted from DataLoader and converted to numpy arrays for ONNX quantization.

  • Output format: Quantized model is saved as an ONNX file (quantized_model.onnx), not a PyTorch checkpoint.

  • Mixed precision ⚠️: Not supported. If the layers entries specify more than one dtype, only the first layer’s dtype is applied globally, and a warning is issued.

  • Validation: API fails if a non-ONNX model file is provided (validates .onnx extension).

9.4. Calibration

  • Provide a DataLoader via TAO’s evaluation configurations. The integration extracts data from the loader, converts it to numpy arrays, and passes it to ModelOpt ONNX.

  • Batches can be tensors, tuples (first element is input), or dicts with common keys (input, data, x, images, images_left, images_right).

  • If no calibration data is provided, ModelOpt generates dummy data, which can reduce accuracy.
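
The batch handling described above can be sketched as follows. This is an illustration only: extract_input is a hypothetical helper, the key lookup order is an assumption, and the conversion to numpy arrays (as well as any special handling of images_left/images_right pairs) is left out.

```python
from __future__ import annotations

from typing import Any

# Dictionary keys listed above; the lookup order is an assumption.
INPUT_KEYS = ("input", "data", "x", "images", "images_left", "images_right")


def extract_input(batch: Any) -> Any:
    """Pick the model input out of one DataLoader batch (hypothetical helper)."""
    if isinstance(batch, dict):
        for key in INPUT_KEYS:
            if key in batch:
                return batch[key]
        raise KeyError(f"no known input key in batch: {list(batch)}")
    if isinstance(batch, (tuple, list)):
        # First element is the input; lists are treated like tuples here (assumption).
        return batch[0]
    # Otherwise assume the batch is already the input tensor.
    return batch
```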

9.5. Example Config

quantize:
  model_path: "/path/to/model.onnx"  # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

Note: For ONNX models, module names refer to ONNX operator types (e.g., Conv, Gemm, MatMul) rather than PyTorch class names.

9.6. Outputs

  • Saved artifact in results_dir named quantized_model.onnx, containing the quantized ONNX model with calibrated scales.

9.7. Important Notes and Limitations

Warning

Critical Limitation: No Mixed-Precision Support

The ONNX backend does not support per-layer mixed precision. If you specify multiple layers with different dtypes (e.g., INT8 for some layers, FP8 for others), only the first dtype is applied globally to all quantized layers. A warning is issued, but quantization proceeds.

For fine-grained per-layer quantization control, use the modelopt.pytorch backend instead.
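
The "first dtype wins" behavior can be pictured with the sketch below. It is not the backend's code: resolve_global_dtype is a hypothetical helper, and reading weights before activations within a layer is an assumption.

```python
import warnings


def resolve_global_dtype(layers):
    """Return the single dtype applied to every layer (illustrative only)."""
    dtypes = []
    for layer in layers:
        for section in ("weights", "activations"):
            dtype = layer.get(section, {}).get("dtype")
            if dtype is not None and dtype not in dtypes:
                dtypes.append(dtype)  # keep first-seen order
    if not dtypes:
        return None
    if len(dtypes) > 1:
        # Mirror the documented behavior: warn, then continue with the first dtype.
        warnings.warn(f"mixed dtypes {dtypes}; using {dtypes[0]!r} for all layers")
    return dtypes[0]


layers = [
    {"module_name": "Conv", "weights": {"dtype": "int8"}, "activations": {"dtype": "int8"}},
    {"module_name": "Gemm", "weights": {"dtype": "fp8_e4m3fn"}, "activations": {"dtype": "fp8_e4m3fn"}},
]
print(resolve_global_dtype(layers))  # prints int8 and emits a warning
```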

Input validation:

  • The backend fails if you provide a non-ONNX model file.

  • File must have a .onnx extension.

  • PyTorch models are not accepted (use TAO’s export command first).

Dtype mixing limitation:

  • Does not support mixed precision per layer.

  • If multiple layers specify different dtypes, only the first dtype is applied globally.

  • A warning is issued if mixed dtypes are detected.

  • For per-layer quantization control, use modelopt.pytorch backend instead.

Calibration:

  • Calibration data is automatically extracted from the DataLoader and converted to numpy format.

  • If no calibration data is provided, ModelOpt generates dummy data (may reduce accuracy).

  • Calibration is critical for achieving good INT8 accuracy.
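
To see why representative calibration data matters, consider symmetric INT8 quantization in which the scale comes from the largest absolute value seen during calibration. This is a simplified, general illustration of max-style calibration, not ModelOpt's implementation; the function names and sample values are hypothetical.

```python
def int8_scale(calibration_values):
    """Symmetric INT8 scale from the largest absolute calibration value."""
    amax = max(abs(v) for v in calibration_values)
    return amax / 127.0


def fake_quantize(value, scale):
    """Quantize to INT8 and dequantize back, clamping to [-127, 127]."""
    q = max(-127, min(127, round(value / scale)))
    return q * scale


real_scale = int8_scale([-0.8, 0.1, 0.5, 2.0])   # range seen on representative data
dummy_scale = int8_scale([-0.05, 0.02, 0.04])    # range seen on unrepresentative data

# A real activation of 2.0 comes back close to 2.0 with the calibrated scale,
# but is clipped to about 0.05 when the scale came from unrepresentative data.
print(fake_quantize(2.0, real_scale))
print(fake_quantize(2.0, dummy_scale))
```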

Model support:

  • Currently supports: classification-pyt and rtdetr models.

Advanced backend_kwargs: For advanced use cases, you can pass additional parameters via backend_kwargs. Some common examples are:

  • use_external_data_format: Enable for models >2GB

  • op_types_to_quantize: List of ONNX op types to quantize (e.g., ["Conv", "Gemm"])

  • calibrate_per_node: Enable per-node calibration

  • high_precision_dtype: Keep certain operations in higher precision

For the complete list of available parameters and their descriptions, refer to the ModelOpt ONNX quantize API documentation.

9.8. Complete Workflow for TensorRT Deployment

The modelopt.onnx backend is specifically designed for TensorRT deployment. Follow these steps for the complete workflow.

9.8.1. Step 1: Export PyTorch Model to ONNX

Ask the agent to run the model skill’s export action against your trained checkpoint. For example:

“Export my trained classification-pyt model to ONNX using export_config.yaml.”

(Substitute rtdetr for RT-DETR.) The export action produces an ONNX file compatible with the modelopt.onnx quantization backend.

Note

TAO’s ONNX export automatically handles model-specific configurations (input/output names, dynamic axes, opset version). Refer to the TAO Toolkit export documentation for your specific task for configuration details: Classification Export or RT-DETR Export.

9.8.2. Step 2: Configure Quantization

Create a quantization configuration file. Key points:

  • Use module_name: "*" for global quantization settings

  • Choose from fp8_e5m2, fp8_e4m3fn, or int8 for dtype

  • Use skip_names to exclude sensitive layers (e.g., heads, encoders, decoders)

quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                    # or "minmax", "entropy"
  device: "cuda"
  layers:
    - module_name: "*"                # Global quantization
      weights:
        dtype: "fp8_e5m2"              # First dtype applies to all
      activations:
        dtype: "fp8_e5m2"
  skip_names:
    - "/head/*"                       # Skip detection/classification head
    - "/encoder/*"                    # Optional: skip if accuracy-sensitive
    - "/decoder/*"                    # Optional: skip if accuracy-sensitive

Important

The backend does not support mixing dtypes. If you specify multiple layer configurations with different dtypes, only the first dtype is used, and a warning is issued.

9.8.3. Step 3: Run Quantization

Ask the agent to run the model skill’s quantize action with your specification:

“Quantize my trained classification-pyt model using quantize_config.yaml.”

(Substitute rtdetr for RT-DETR.) This produces a quantized ONNX file at <results_dir>/quantized_model.onnx.

9.8.4. Step 4: Build TensorRT Engine

Run the model skill’s gen_trt_engine action against the quantized ONNX:

“Build a TensorRT engine from the quantized ONNX at <results_dir>/quantized_model.onnx using gen_trt_engine_config.yaml.”

Sample configuration:

gen_trt_engine:
  onnx_file: "/path/to/quantized_model.onnx"
  trt_engine: "/path/to/output_engine.trt"
  tensorrt:
    data_type: FP16                # FP16/FP32 (quantization scales embedded in ONNX)
    workspace_size: 1024           # MB
    min_batch_size: 1
    opt_batch_size: 4
    max_batch_size: 8
results_dir: "/path/to/results"
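
TensorRT optimization profiles require the minimum, optimal, and maximum shapes to be ordered, so the three batch sizes in the sample must satisfy min <= opt <= max. A small pre-flight check could look like the sketch below; check_batch_profile is a hypothetical helper, not part of TAO.

```python
def check_batch_profile(min_batch: int, opt_batch: int, max_batch: int) -> None:
    """Raise ValueError unless 1 <= min <= opt <= max (hypothetical helper)."""
    if not 1 <= min_batch <= opt_batch <= max_batch:
        raise ValueError(
            f"expected 1 <= min <= opt <= max, got {min_batch}, {opt_batch}, {max_batch}"
        )


# Values from the sample configuration above; this call passes silently.
check_batch_profile(min_batch=1, opt_batch=4, max_batch=8)
```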

Important

The gen_trt_engine API automatically detects QDQ-quantized ONNX models (generated by modelopt.onnx) and enables strongly-typed TensorRT mode, ensuring the embedded quantization scales are used correctly. No additional flags are required.
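
How gen_trt_engine performs this detection is not described here. One plausible check, shown as a sketch under that assumption, is to look for the standard ONNX QuantizeLinear and DequantizeLinear operators among the graph's nodes; has_qdq_nodes is a hypothetical helper.

```python
QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}


def has_qdq_nodes(op_types):
    """True if any node op type is a Quantize/Dequantize operator."""
    return any(op in QDQ_OPS for op in op_types)


# With the `onnx` package installed, the op types could be collected like this:
#   import onnx
#   model = onnx.load("/path/to/quantized_model.onnx")
#   op_types = [node.op_type for node in model.graph.node]
print(has_qdq_nodes(["Conv", "Relu", "Gemm"]))                         # False
print(has_qdq_nodes(["QuantizeLinear", "DequantizeLinear", "Conv"]))   # True
```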

9.8.5. Step 5: Evaluate TensorRT Engine

Run the model skill’s evaluate action against each engine to compare accuracy:

“Evaluate my baseline TensorRT engine using eval_baseline.yaml.” → mAP: 52.3%

“Evaluate my quantized TensorRT engine using eval_quantized.yaml.” → mAP: 51.8% (0.5-point drop, acceptable).

Sample evaluation configuration:

evaluate:
  trt_engine: "/path/to/quantized_engine.trt"
dataset:
  test_data_sources:
    image_dir: "/path/to/val/images"
    json_file: "/path/to/val/annotations.json"
  batch_size: 10
results_dir: "/path/to/results"

If accuracy drops by more than 1-2%, iterate: add the most sensitive layers to skip_names and rerun the quantization workflow.
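
The comparison can be scripted once both metrics are available. The sketch below uses the illustrative mAP values from this step; the helper name and the 1.0-point default tolerance are assumptions chosen to match the 1-2% guidance.

```python
def needs_more_skips(baseline: float, quantized: float, tolerance: float = 1.0) -> bool:
    """True if the metric dropped by more than `tolerance` percentage points."""
    return (baseline - quantized) > tolerance


print(needs_more_skips(52.3, 51.8))  # False: the drop is within tolerance
print(needs_more_skips(52.3, 50.0))  # True: widen skip_names and rerun
```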

9.8.6. Step 6: Deploy With TensorRT

The generated TensorRT engine is ready for production deployment. The quantized engine can be used directly in any TensorRT-compatible inference framework without additional configuration.

9.8.7. Expected Performance Gains

With modelopt.onnx and TensorRT deployment:

  • FP8 (E5M2/E4M3FN): 2-5x speedup vs FP32, <1% accuracy drop

  • INT8: 2-4x speedup vs FP32, <2% accuracy drop with proper calibration

  • Memory: 2-4x reduction in model size
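
One way to read the 2-4x memory range is as a ratio of bytes per stored value: 8-bit types take one byte, FP16 two, and FP32 four. The sketch below works through that arithmetic for a hypothetical 25-million-parameter model; it ignores quantization scales and any layers left unquantized through skip_names, so real savings can be smaller.

```python
# Storage size of one value, in bytes.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1, "fp8_e4m3fn": 1, "fp8_e5m2": 1}


def weight_size_mb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e6


def size_reduction(from_dtype: str, to_dtype: str) -> float:
    """Ratio of per-value storage between two dtypes."""
    return BYTES_PER_VALUE[from_dtype] / BYTES_PER_VALUE[to_dtype]


num_params = 25_000_000  # hypothetical model size
print(weight_size_mb(num_params, "fp32"))  # 100.0
print(weight_size_mb(num_params, "int8"))  # 25.0
print(size_reduction("fp32", "int8"))      # 4.0
print(size_reduction("fp16", "fp8_e4m3fn"))  # 2.0
```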

9.8.8. Typical Skip Patterns by Model Type

Classification models:

skip_names: ["/classifier/*", "/fc"]

RT-DETR (object detection):

skip_names: ["/head/*", "/encoder/*", "/decoder/*"]

Adjust based on your accuracy requirements.
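
To preview which nodes a pattern list would exclude, the patterns can be tried against node names locally. The sketch below assumes glob-style matching, which may differ from how the backend actually interprets skip_names; the node names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_skipped(node_name: str, skip_names: list) -> bool:
    """True if the node name matches any skip pattern (glob-style assumption)."""
    return any(fnmatchcase(node_name, pattern) for pattern in skip_names)


patterns = ["/classifier/*", "/fc"]
print(is_skipped("/classifier/dense/Gemm", patterns))  # True: "*" also spans "/"
print(is_skipped("/fc", patterns))                     # True: exact match
print(is_skipped("/fc/Gemm", patterns))                # False: "/fc" has no wildcard
print(is_skipped("/backbone/conv1/Conv", patterns))    # False
```

Under this assumption, a pattern without a wildcard matches only the exact name, so "/fc/*" would be needed to cover nodes nested under "/fc".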