2. Getting Started#
This guide helps you run post-training quantization (PTQ) quickly with default settings.
2.1. Prerequisites#
- A trained model checkpoint (PyTorch for torchao / modelopt.pytorch; ONNX for modelopt.onnx)
- TAO Toolkit installed with quantization support
- For ModelOpt backends: representative calibration data
2.2. Minimal Specification Snippets#
TorchAO (weight-only PTQ):
quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/quantized_output"
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
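For intuition, weight-only int8 PTQ replaces a Linear layer's floating-point weights with int8 values plus a scale factor, while activations stay in floating point. The following self-contained PyTorch sketch illustrates the idea with a simple symmetric per-tensor scheme; it is a conceptual illustration only, not the TorchAO backend's actual implementation:

import torch

def quantize_weight_int8(weight: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximate float weight for inspection.
    return q.to(torch.float32) * scale

w = torch.randn(256, 512)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight(q, scale)
print((w - w_hat).abs().max())  # quantization error stays within roughly scale / 2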
ModelOpt PyTorch (static PTQ):
quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
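Static PTQ also quantizes activations, which is why the ModelOpt backends need representative calibration data: a minmax algorithm tracks the activation range observed on calibration batches and derives a scale from it. Below is a minimal, self-contained sketch of that idea, a conceptual illustration rather than ModelOpt's actual calibrator:

import torch

class MinMaxObserver:
    """Tracks the running min/max of activations seen during calibration."""
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, x: torch.Tensor):
        self.min_val = min(self.min_val, x.min().item())
        self.max_val = max(self.max_val, x.max().item())

    def int8_scale(self) -> float:
        # Symmetric scale covering the widest observed magnitude.
        return max(abs(self.min_val), abs(self.max_val)) / 127.0

obs = MinMaxObserver()
for _ in range(8):                      # stand-in for calibration batches
    obs.observe(torch.randn(32, 512))
print(obs.int8_scale())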
ModelOpt ONNX (static PTQ for NVIDIA TensorRT™):
quantize:
  model_path: "/path/to/model.onnx"   # ONNX file path (required)
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"                    # Recommended for TensorRT
  device: "cuda"
  layers:
    - module_name: "*"                # Global quantization
      weights: { dtype: "fp8_e5m2" }  # or "int8", "fp8_e4m3fn"
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]             # Skip sensitive layers
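The skip_names patterns (here /head/*) exclude matching parts of the graph from quantization. If you are unsure which names your model contains, one way to find candidates is to list the node names with the onnx Python package; the path below is a placeholder:

import onnx

model = onnx.load("/path/to/model.onnx")
for node in model.graph.node:
    # Print each node's operator type and name, e.g. "Conv /backbone/stem/conv".
    print(node.op_type, node.name)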
2.3. Run Quantization#
- Classification: tao classification_pyt quantize -e <specification.yaml>
- RT-DETR: tao rtdetr quantize -e <specification.yaml>
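For example, if the TorchAO specification above were saved as /workspace/specs/quantize_torchao.yaml (a hypothetical path), the classification command would be:

tao classification_pyt quantize -e /workspace/specs/quantize_torchao.yaml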
2.4. Use the Quantized Checkpoint#
Set evaluate.is_quantized: true or inference.is_quantized: true and point to the produced artifact in results_dir:
- TorchAO: quantized_model_torchao.pth
- ModelOpt PyTorch: quantized_model_modelopt.pytorch.pth (state dict under model_state_dict)
- ModelOpt ONNX: quantized_model.onnx (use with ONNX Runtime or TensorRT)
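As a sketch, an evaluation spec for the TorchAO artifact could look like the following; the exact key that receives the checkpoint path depends on the task spec, so checkpoint below is an assumption:

evaluate:
  is_quantized: true
  checkpoint: "/path/to/quantized_output/quantized_model_torchao.pth"  # key name is an assumption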