6. ModelOpt backend (static PTQ)
6.1. Overview
Static PTQ quantizes both weights and activations (INT8 or FP8), with an optional calibration loop over representative data. The quantization algorithm is selected via the algorithm option (see below).
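For orientation, the sketch below shows roughly what the backend does with ModelOpt's Python API: pick a config, then call mtq.quantize with a calibration forward loop (see 6.3). The INT8_DEFAULT_CFG preset and the helper name are illustrative assumptions, not the integration's actual code.

import modelopt.torch.quantization as mtq

def quantize_static_ptq(model, forward_loop):
    # mtq.quantize calibrates the model by running forward_loop, then
    # returns the model with quantizers inserted and scales computed.
    return mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)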
6.2. Supported options
- mode: static_ptq.
- algorithm: minmax (range via min/max) or entropy. If unset, defaults to minmax.
- weights.dtype: int8, fp8_e4m3fn, fp8_e5m2, or native.
- activations.dtype: int8, fp8_e4m3fn, fp8_e5m2, or native.
- default_layer_dtype / default_activation_dtype: currently ignored by this backend; specify dtypes per layer.
- skip_names: removes the named modules from quantization (see the sketch after this list).
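One plausible way skip_names could map onto ModelOpt is through its pattern-based quant_cfg, disabling the quantizers of matching modules. This is a hedged sketch; build_config and the glob-pattern mapping are assumptions, not the integration's actual behavior.

import copy
import modelopt.torch.quantization as mtq

def build_config(skip_names):
    # Start from a preset, then disable quantization for every module
    # whose name matches a skip_names entry (glob-style patterns).
    cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
    for name in skip_names:
        cfg["quant_cfg"][f"*{name}*"] = {"enable": False}
    return cfg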
6.3. Calibration
Provide a DataLoader via TAO’s evaluation configurations; the integration builds a forward loop and runs it during quantization. Batches can be tensors, tuples (the first element is the input), or dicts with the key input.
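A minimal sketch of such a forward loop, covering the three batch formats above; make_forward_loop and the CUDA device default are assumptions, not TAO's actual helper.

import torch

def make_forward_loop(dataloader, device="cuda"):
    # Returns a callable that mtq.quantize can run during calibration.
    def forward_loop(model):
        model.eval()
        with torch.no_grad():
            for batch in dataloader:
                if isinstance(batch, torch.Tensor):      # plain tensor batch
                    inputs = batch
                elif isinstance(batch, (tuple, list)):   # first element is the input
                    inputs = batch[0]
                elif isinstance(batch, dict):            # dict with key "input"
                    inputs = batch["input"]
                else:
                    raise TypeError(f"Unsupported batch type: {type(batch)!r}")
                model(inputs.to(device))
    return forward_loop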
6.4. Example config
quantize:
  backend: "modelopt"
  mode: "static_ptq"
  algorithm: "minmax"
  default_layer_dtype: "int8"
  default_activation_dtype: "int8"
  layers:
    - module_name: "Conv2d"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
6.5. Outputs
The saved artifact in results_dir is named quantized_model_modelopt.pth and contains a structured checkpoint; the model state dict is stored under the model_state_dict key.
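For example, the artifact can be inspected as follows. Only the model_state_dict key is documented above; restoring a runnable quantized model additionally requires the module to already be in its ModelOpt-quantized form, so treat this as a sketch.

import torch

# Load the structured checkpoint and pull out the state dict, which holds
# the calibrated weights and quantizer scales.
ckpt = torch.load("results_dir/quantized_model_modelopt.pth", map_location="cpu")
state_dict = ckpt["model_state_dict"]
print(list(state_dict.keys())[:10])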
6.6. Notes
At PyTorch runtime, ModelOpt inserts fake-quantization operations, so speedups may be limited; the exported checkpoint does, however, include the calibrated scales.
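Concretely, fake quantization rounds a value to the integer grid and immediately dequantizes it, so the rounding error is real but the arithmetic stays in floating point. A minimal int8 sketch (not ModelOpt's actual kernel):

import torch

def fake_quant_int8(x, scale):
    # Quantize to the symmetric int8 grid, then dequantize back to float.
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale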
6.7. External links
NVIDIA ModelOpt (TensorRT Model Optimizer): NVIDIA/TensorRT-Model-Optimizer.