2. Getting Started

This guide helps you run PTQ quickly with default settings.

2.1. Prerequisites

  • A trained model checkpoint (PyTorch for the torchao and modelopt.pytorch backends; ONNX for the modelopt.onnx backend)

  • TAO Toolkit installed with quantization support

  • For ModelOpt backends: representative calibration data

2.2. Minimal Specification Snippets

TorchAO (weight-only PTQ):

quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }

ModelOpt PyTorch (static PTQ):

quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

ModelOpt ONNX (static PTQ for NVIDIA TensorRT™):

quantize:
  model_path: "/path/to/model.onnx"    # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                      # Recommended for TensorRT
  device: "cuda"
  layers:
    - module_name: "*"                  # Global quantization
      weights: { dtype: "fp8_e5m2" }     # or "int8", "fp8_e4m3fn"
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]               # Skip sensitive layers

2.3. Run Quantization

  • Classification: tao classification_pyt quantize -e <specification.yaml> (see the example after this list)

  • RT-DETR: tao rtdetr quantize -e <specification.yaml>
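
For example, assuming the TorchAO specification above is saved as /workspace/specs/quantize.yaml (a hypothetical path), quantizing a classification model looks like this:

tao classification_pyt quantize -e /workspace/specs/quantize.yaml

The quantized artifact is written to the results_dir given in the specification.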

2.4. Use the Quantized Checkpoint

Set evaluate.is_quantized: true (or inference.is_quantized: true) and point to the artifact produced in results_dir (a configuration sketch follows the list):

  • TorchAO: quantized_model_torchao.pth

  • ModelOpt PyTorch: quantized_model_modelopt.pytorch.pth (state dict under model_state_dict)

  • ModelOpt ONNX: quantized_model.onnx (use with ONNX Runtime or TensorRT)
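
For example, to evaluate with the TorchAO artifact, a minimal sketch of the evaluate section could look as follows (checkpoint is an assumed field name for pointing at the quantized file; use whatever field your task's specification schema defines for the evaluation checkpoint):

evaluate:
  is_quantized: true                    # Load the checkpoint as a quantized model
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"   # assumed key name; TorchAO artifact from results_dir

An inference run follows the same pattern with inference.is_quantized: true.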