4. Configuration#
Quantization is configured under the quantize section of your experiment specification. Dtype strings may be written as fp8_* or float8_* (aliases map to the same values in backends that accept FP8).
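For example, under the per-layer syntax described in the sections below, the following two entries request the same FP8 (E4M3) format. This is a minimal sketch; the module names are purely illustrative:
quantize:
  layers:
    # Both spellings map to the same FP8 value in backends that accept FP8.
    - module_name: "backbone.conv1"        # illustrative module name
      weights: { dtype: "fp8_e4m3fn" }
    - module_name: "backbone.conv2"        # illustrative module name
      weights: { dtype: "float8_e4m3fn" }  # float8_* alias of fp8_e4m3fn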
4.1. Top-Level Fields#
backend: torchao, modelopt.pytorch, or modelopt.onnx.
mode: weight_only_ptq (TorchAO) or static_ptq (ModelOpt backends).
algorithm: Calibration/optimization algorithm for ModelOpt. Valid: minmax, entropy, max (refer to the backend documentation).
default_layer_dtype: Deprecated; ignored by all backends. Set weights.dtype per layer instead. Valid values: int8, fp8_e4m3fn, fp8_e5m2, native.
default_activation_dtype: Deprecated; ignored by all backends. Set activations.dtype per layer instead (ModelOpt only). Same valid options as above.
layers: List of layer-wise configurations.
skip_names: List of module name patterns to exclude from quantization. Refer to Skipping Layers for details.
model_path: Trained checkpoint path to quantize (PyTorch checkpoint for torchao/modelopt.pytorch; ONNX file path for modelopt.onnx).
results_dir: Directory for quantized artifacts.
backend_kwargs: Advanced; additional backend-specific parameters (dict). Refer to the section below for details.
device: Device for quantization (cuda, cpu, or a specific GPU like cuda:0). Default: cuda.
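Putting the top-level fields together, a minimal specification might look like the sketch below; the paths, the layer entry, and the skipped module name are placeholders, not a validated recipe:
quantize:
  model_path: "/path/to/model.pth"   # PyTorch checkpoint (use an .onnx path with modelopt.onnx)
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  device: "cuda"                     # or "cpu", or a specific GPU such as "cuda:0"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
  skip_names: ["classifier.fc"]      # placeholder module name to exclude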
4.2. Schema Reference (Auto-Generated)#
5. ModelQuantizationConfig Fields#
| Field | value_type | description | default_value | valid_options |
|---|---|---|---|---|
| backend | categorical | The quantization backend to use | torchao | modelopt.pytorch, modelopt.onnx, torchao |
| mode | categorical | The quantization mode to use | weight_only_ptq | static_ptq, weight_only_ptq |
| algorithm | categorical | Calibration or optimization algorithm name to pass to the backend configuration. For the ModelOpt backends, this becomes the top-level 'algorithm' field | minmax | minmax, entropy |
| default_layer_dtype | categorical | Default data type for layers (currently ignored by backends; specify dtype per layer) | native | int8, fp8_e4m3fn, fp8_e5m2, native |
| default_activation_dtype | categorical | Default data type for activations (currently ignored by backends; specify dtype per layer) | native | int8, fp8_e4m3fn, fp8_e5m2, native |
| layers | list | List of per-module quantization configurations | [] | |
| skip_names | list | List of module or layer names or patterns to exclude from quantization | [] | |
| model_path | string | Path to the model to be quantized | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | |
5.1. Layer Entries#
Each item accepts:
module_name: Qualified name or wildcard pattern; also matches module types (e.g., Linear, Conv2d for PyTorch; Conv, Gemm for ONNX).
weights: { dtype: <int8|fp8_e4m3fn|fp8_e5m2|native> }.
activations: { dtype: <...> } (ModelOpt backends only; ignored by TorchAO).
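For example, a single entry that quantizes a layer's weights and activations to INT8 could look like the following sketch (the module name is illustrative):
layers:
  - module_name: "backbone.layer1.conv"   # qualified name, wildcard pattern, or module type
    weights: { dtype: "int8" }
    activations: { dtype: "int8" }        # ModelOpt backends only; ignored by TorchAO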
5.2. Pattern Matching#
Patterns are matched first against the qualified module name in the graph, then against the module class name. Wildcards * and ? are supported.
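The sketch below mixes both kinds of matches; the module names and patterns are hypothetical:
quantize:
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  layers:
    - module_name: "backbone.*"       # matches qualified module names under "backbone"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"           # no name match, so it falls back to the module class name
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
  skip_names: ["head.block?.fc"]      # "?" matches a single character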
5.3. Examples#
Weight-only INT8 for all Linear layers, skipping the classifier head:
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
  skip_names: ["classifier.fc"]
Static PTQ INT8 for conv/linear with INT8 activations (PyTorch):
quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv2d"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
Static PTQ INT8 for ONNX model:
quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
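FP8 configurations follow the same structure. The sketch below assumes the selected backend accepts the FP8 dtypes listed earlier; treat it as illustrative rather than a validated recipe:
quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Linear"
      weights: { dtype: "fp8_e4m3fn" }
      activations: { dtype: "fp8_e4m3fn" }  # assumes the backend supports FP8 activations
  skip_names: ["classifier.fc"]             # placeholder module name to exclude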
5.4. Advanced: backend_kwargs#
The backend_kwargs field allows you to pass advanced, backend-specific parameters that are not exposed in the main configuration schema. These parameters are merged with the backend configuration and forwarded to the underlying quantization library.
Common use cases:
Overriding default ModelOpt ONNX parameters
Enabling per-channel quantization
Specifying custom execution providers
Advanced calibration options
torchao backend_kwargs: Parameters passed directly to torchao.quantization.quantize_.
quantize:
  backend: "torchao"
  backend_kwargs:
    # Add TorchAO-specific parameters here
    # (refer to the TorchAO documentation)
modelopt.pytorch backend_kwargs: Parameters passed to ModelOpt PyTorch APIs.
quantize:
  backend: "modelopt.pytorch"
  backend_kwargs:
    # Add ModelOpt PyTorch-specific parameters here
modelopt.onnx backend_kwargs: Parameters passed directly to modelopt.onnx.quantization.quantize. Some useful examples are:
quantize:
  backend: "modelopt.onnx"
  backend_kwargs:
    use_external_data_format: true           # For large models >2GB
    op_types_to_quantize: ["Conv", "Gemm"]   # Limit quantized op types
    calibrate_per_node: true                 # Per-node calibration
    high_precision_dtype: "fp16"             # Keep some ops in FP16
For the complete list of parameters, refer to the ModelOpt ONNX quantize API.
Example: Large model with selective quantization:
quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"
  layers:
    - module_name: "*"
      weights: { dtype: "fp8_e5m2" }
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]
  backend_kwargs:
    use_external_data_format: true           # For models >2GB
    op_types_to_quantize: ["Conv", "Gemm"]   # Only these ops
Note
Use backend_kwargs only when you need to override defaults or access features not exposed in the main configuration. Incorrect values may cause quantization failures.
5.5. Task-Specific Notes#
Classification: No extra dataset fields are needed beyond your usual evaluation/inference configurations.
RT-DETR + ModelOpt: Ensure you have a representative validation set configured for calibration. Evaluation inputs are reused.
ONNX models: Use TAO's export command (classification_pyt export or rtdetr export) to convert your PyTorch model to ONNX, then use the modelopt.onnx backend with the exported ONNX file path.
For NVIDIA TensorRT™ deployment, use the modelopt.onnx backend for best performance.