4. Configuration#

Quantization is configured under the quantize section of your experiment specification. Dtype strings may be written as fp8_* or float8_* (aliases map to the same values in backends that accept FP8).
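
For instance, assuming FP8 support in the chosen backend, the following two spellings of a per-layer weights entry select the same dtype value:

weights: { dtype: "fp8_e4m3fn" }     # short form
weights: { dtype: "float8_e4m3fn" }  # float8_* alias of the same value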

4.1. Top-Level Fields#

  • backend: torchao, modelopt.pytorch, or modelopt.onnx.

  • mode: weight_only_ptq (TorchAO) or static_ptq (ModelOpt backends).

  • algorithm: Calibration/optimization algorithm for ModelOpt. Valid: minmax, entropy, max (refer to backend documentation).

  • default_layer_dtype: Deprecated; ignored by all backends. Set weights.dtype per layer instead. Valid values: int8, fp8_e4m3fn, fp8_e5m2, native.

  • default_activation_dtype: Deprecated; ignored by all backends. Set activations.dtype per layer instead (ModelOpt backends only). Same valid values as above.

  • layers: List of layer-wise configurations.

  • skip_names: List of module name patterns to exclude from quantization. Refer to Skipping layers for details.

  • model_path: Trained checkpoint path to quantize (PyTorch checkpoint for torchao/modelopt.pytorch; ONNX file path for modelopt.onnx).

  • results_dir: Directory for quantized artifacts.

  • backend_kwargs: Advanced; additional backend-specific parameters passed as a dict. Refer to Advanced: backend_kwargs below for details.

  • device: Device for quantization (cuda, cpu, or specific GPU like cuda:0). Default: cuda.
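
Putting these fields together, a minimal weight-only spec might look like the following sketch (paths are placeholders; fuller task examples appear under Examples below):

quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/output"
  backend: "torchao"
  mode: "weight_only_ptq"
  device: "cuda"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
  skip_names: ["classifier.fc"]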

4.2. Schema Reference (Auto-Generated)#

5. ModelQuantizationConfig Fields#

| Field | value_type | description | default_value | valid_options |
| --- | --- | --- | --- | --- |
| backend | categorical | The quantization backend to use | torchao | modelopt.pytorch, modelopt.onnx, torchao |
| mode | categorical | The quantization mode to use | weight_only_ptq | static_ptq, weight_only_ptq |
| algorithm | categorical | Calibration or optimization algorithm name to pass to the backend configuration. For the ModelOpt backends, this becomes the top-level ‘algorithm’ field | minmax | minmax, entropy |
| default_layer_dtype | categorical | Default data type for layers (currently ignored by backends; specify dtype per layer) | native | int8, fp8_e4m3fn, fp8_e5m2, native |
| default_activation_dtype | categorical | Default data type for activations (currently ignored by backends; specify dtype per layer) | native | int8, fp8_e4m3fn, fp8_e5m2, native |
| layers | list | List of per-module quantization configurations | [] | |
| skip_names | list | List of module or layer names or patterns to exclude from quantization | [] | |
| model_path | string | Path to the model to be quantized | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | |

5.1. Layer Entries#

Each item accepts:

  • module_name: Qualified name or wildcard pattern; also matches module types (e.g., Linear, Conv2d for PyTorch; Conv, Gemm for ONNX).

  • weights: { dtype: <int8|fp8_e4m3fn|fp8_e5m2|native> }.

  • activations: { dtype: <...> } (ModelOpt backends only; ignored by TorchAO).
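
For instance, a single entry targeting one module by its qualified name (model.backbone.conv1 is hypothetical) could look like:

layers:
  - module_name: "model.backbone.conv1"   # hypothetical qualified module name
    weights: { dtype: "fp8_e4m3fn" }
    activations: { dtype: "fp8_e4m3fn" }  # ModelOpt backends only; ignored by TorchAO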

5.2. Pattern Matching#

Patterns are matched first against the qualified module name in the graph, then against the module class name. Wildcards * and ? are supported.
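
As an illustrative sketch (the backbone prefix is hypothetical), a qualified-name wildcard and a class-name wildcard can be combined:

layers:
  - module_name: "backbone.*"   # qualified-name match: any module under a hypothetical backbone submodule
    weights: { dtype: "int8" }
  - module_name: "Conv?d"       # class-name match: Conv1d, Conv2d, Conv3d
    weights: { dtype: "int8" }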

5.3. Examples#

Weight-only INT8 for all Linear layers, skipping the classifier head:

quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
  skip_names: ["classifier.fc"]

Static PTQ INT8 for conv/linear with INT8 activations (PyTorch):

quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv2d"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

Static PTQ INT8 for ONNX model:

quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }

5.4. Advanced: backend_kwargs#

The backend_kwargs field allows you to pass advanced, backend-specific parameters that are not exposed in the main configuration schema. These parameters are merged with the backend configuration and forwarded to the underlying quantization library.

Common use cases:

  • Overriding default ModelOpt ONNX parameters

  • Enabling per-channel quantization

  • Specifying custom execution providers

  • Advanced calibration options

torchao backend_kwargs: Parameters passed directly to torchao.quantization.quantize_.

quantize:
  backend: "torchao"
  backend_kwargs:
    # Add TorchAO-specific parameters here
    # (refer to TorchAO documentation)

modelopt.pytorch backend_kwargs: Parameters passed to ModelOpt PyTorch APIs.

quantize:
  backend: "modelopt.pytorch"
  backend_kwargs:
    # Add ModelOpt PyTorch-specific parameters here

modelopt.onnx backend_kwargs: Parameters passed directly to modelopt.onnx.quantization.quantize. Some useful parameters are shown below:

quantize:
  backend: "modelopt.onnx"
  backend_kwargs:
    use_external_data_format: true             # For large models >2GB
    op_types_to_quantize: ["Conv", "Gemm"]     # Limit quantized op types
    calibrate_per_node: true                   # Per-node calibration
    high_precision_dtype: "fp16"               # Keep some ops in FP16

For a complete list of parameters, refer to the ModelOpt ONNX quantize API.

Example: Large model with selective quantization:

quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"
  layers:
    - module_name: "*"
      weights: { dtype: "fp8_e5m2" }
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]
  backend_kwargs:
    use_external_data_format: true              # For models >2GB
    op_types_to_quantize: ["Conv", "Gemm"]      # Only these ops

Note

Use backend_kwargs only when you need to override defaults or access features not exposed in the main configuration. Incorrect values can cause quantization failures.

5.5. Task-Specific Notes#

  • Classification: No extra dataset fields are needed beyond your usual evaluation/inference configurations.

  • RT-DETR + ModelOpt: Ensure a representative validation set is configured for calibration; the evaluation inputs are reused as calibration data.

  • ONNX models: Use TAO’s export command (classification_pyt export or rtdetr export) to convert your PyTorch model to ONNX, then use the modelopt.onnx backend with the exported ONNX file path.

  • For NVIDIA TensorRT™ deployment, use the modelopt.onnx backend for best performance.