8. ModelOpt PyTorch Backend (Static PTQ)

8.1. Overview

  • Static PTQ with an optional calibration loop over representative data (see the sketch after this list).

  • Quantizes both weights and activations (INT8/FP8).

  • Works with PyTorch models (torch.nn.Module).

  • Calibration algorithm selected via the algorithm option (refer to the Supported Options section below).
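
To make the flow concrete, here is a minimal sketch of a static-PTQ pass using the underlying modelopt.torch.quantization API. The toy model, the random calibration batches, and the choice of mtq.INT8_DEFAULT_CFG are illustrative assumptions; TAO drives this flow from the config shown later, not from hand-written code.

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq  # NVIDIA TensorRT Model Optimizer

# Toy model and toy "representative" data, purely for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 30 * 30, 10)
).eval()
calib_batches = [torch.randn(4, 3, 32, 32) for _ in range(8)]

def forward_loop(m):
    # ModelOpt observes activation ranges while these forwards run.
    for batch in calib_batches:
        m(batch)

# INT8_DEFAULT_CFG quantizes both weights and activations to INT8.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=forward_loop)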

8.2. Supported Options

  • mode: static_ptq.

  • algorithm: minmax (ranges from the observed minimum/maximum values), max (ranges from the maximum absolute value), entropy (ranges chosen to minimize KL divergence). Defaults to minmax if unset.

  • weights.dtype: int8, fp8_e4m3fn, fp8_e5m2, or native (leave the weights in their original precision).

  • activations.dtype: int8, fp8_e4m3fn, fp8_e5m2, or native (leave the activations in their original precision).

  • default_layer_dtype / default_activation_dtype: Currently ignored by this backend; specify dtypes per layer under layers instead.

  • skip_names: Exclude the named modules from quantization; they are left unquantized (see the mapping sketch after this list).
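
For orientation, the options above translate conceptually into a ModelOpt quantization config like the sketch below. The wildcard patterns and the exact correspondence are assumptions for illustration, not a dump of TAO's actual translation layer.

# Hypothetical mapping of the options above onto ModelOpt's config format.
quant_cfg = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},    # weights.dtype: int8
        "*input_quantizer": {"num_bits": 8, "axis": None},  # activations.dtype: int8
        # fp8_e4m3fn would use {"num_bits": (4, 3)} instead of 8.
        "*head*": {"enable": False},  # a skip_names entry, disabled by pattern
    },
    "algorithm": "max",  # driven by the `algorithm` option
}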

8.3. Calibration

  • Provide a DataLoader via TAO’s evaluation configurations. The integration builds a forward loop over it and runs that loop during quantization to collect activation statistics.

  • Each batch may be a tensor, a tuple whose first element is the input, or a dict with the key input; the sketch below handles all three formats.
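
A simplified re-creation of the batch handling described above (not the integration's actual code):

import torch

def forward_loop(model, data_loader):
    # Run calibration forwards over every supported batch format.
    for batch in data_loader:
        if isinstance(batch, torch.Tensor):
            inputs = batch               # bare tensor
        elif isinstance(batch, tuple):
            inputs = batch[0]            # first element is the input
        elif isinstance(batch, dict):
            inputs = batch["input"]      # dict keyed by "input"
        else:
            raise TypeError(f"unsupported batch type: {type(batch)}")
        model(inputs)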

8.4. Example Config

quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/quantized_output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv2d"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

8.5. Outputs

  • The quantized model is saved in results_dir as quantized_model_modelopt.pytorch.pth, a structured checkpoint whose model weights are stored under the model_state_dict key (see the loading sketch below).
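
A minimal sketch of reading the artifact back; only the file name and the model_state_dict key come from the description above, the rest is assumption:

import torch

ckpt = torch.load(
    "/path/to/quantized_output/quantized_model_modelopt.pytorch.pth",
    map_location="cpu",
)
state_dict = ckpt["model_state_dict"]  # weights plus calibrated quantizer scales

Because the state dict carries quantizer state, the target module typically needs ModelOpt's quantizers inserted (for example, by re-running quantization with the same config) before load_state_dict will match.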

8.6. Notes

  • At PyTorch runtime, ModelOpt inserts fake-quant (quantize-dequantize) operations, so speedups in eager execution may be limited; the exported checkpoint, however, includes the calibrated scales. The operation is illustrated below.
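
The fake-quant operation itself is just a quantize-dequantize round trip in floating point; here is a toy INT8 version of the idea (not ModelOpt's actual kernel):

import torch

x = torch.randn(5)
scale = x.abs().max() / 127.0  # per-tensor scale from a max-style calibration
x_fq = (x / scale).round().clamp(-128, 127) * scale  # quantize, then dequantize
# x_fq is still float, but carries INT8 rounding error, so the rest of the
# network "feels" quantization without needing integer kernels.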