4. Configuration#
Quantization is configured under the quantize section of your experiment specification. Dtype strings may be written as fp8_* or float8_* (aliases map to the same values in backends that accept FP8).
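For example, under the per-layer syntax described in the sections below, the following two entries request the same FP8 (E4M3) format. This is a minimal sketch; the module names are purely illustrative:
quantize:
  layers:
    # Both spellings map to the same FP8 value in backends that accept FP8.
    - module_name: "backbone.conv1"        # illustrative module name
      weights: { dtype: "fp8_e4m3fn" }
    - module_name: "backbone.conv2"        # illustrative module name
      weights: { dtype: "float8_e4m3fn" }  # float8_* alias of fp8_e4m3fn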
4.1. Top-Level Fields#
backend: torchao, modelopt.pytorch, or modelopt.onnx.
mode: weight_only_ptq (TorchAO) or static_ptq (ModelOpt backends).
algorithm: Calibration/optimization algorithm for ModelOpt. Valid: minmax, entropy, max (refer to the backend documentation).
default_layer_dtype: Deprecated; ignored by all backends. Set weights.dtype per layer instead. Valid values: int8, fp8_e4m3fn, fp8_e5m2, native.
default_activation_dtype: Deprecated; ignored by all backends. Set activations.dtype per layer instead (ModelOpt only). Same valid options as above.
layers: List of layer-wise configurations.
skip_names: List of module name patterns to exclude from quantization. Refer to Skipping Layers for details.
model_path: Trained checkpoint path to quantize (PyTorch checkpoint for torchao/modelopt.pytorch; ONNX file path for modelopt.onnx).
results_dir: Directory for quantized artifacts.
backend_kwargs: Advanced; additional backend-specific parameters (dict). Refer to the section below for details.
device: Device for quantization (cuda, cpu, or a specific GPU like cuda:0). Default: cuda.
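Putting the top-level fields together, a minimal specification might look like the sketch below; the paths, the layer entry, and the skipped module name are placeholders, not a validated recipe:
quantize:
  model_path: "/path/to/model.pth"   # PyTorch checkpoint (use an .onnx path with modelopt.onnx)
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  device: "cuda"                     # or "cpu", or a specific GPU such as "cuda:0"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
  skip_names: ["classifier.fc"]      # placeholder module name to exclude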
4.2. Schema Reference (Auto-Generated)#
5. ModelQuantizationConfig Fields#
| Field | value_type | description | default_value | valid_options |
|---|---|---|---|---|
| backend | categorical | The quantization backend to use | torchao | modelopt.pytorch, modelopt.onnx, torchao |
| mode | categorical | The quantization mode to use | weight_only_ptq | static_ptq, weight_only_ptq |
| algorithm | categorical | Calibration or optimization algorithm name to pass to the backend configuration. For the ModelOpt backends, this becomes the top-level 'algorithm' field | minmax | minmax, entropy |
| default_layer_dtype | categorical | Default data type for layers (currently ignored by backends; specify dtype per layer) | native | int8, fp8_e4m3fn, fp8_e5m2, native |
| default_activation_dtype | categorical | Default data type for activations (currently ignored by backends; specify dtype per layer) | native | int8, fp8_e4m3fn, fp8_e5m2, native |
| layers | list | List of per-module quantization configurations | [] | |
| skip_names | list | List of module or layer names or patterns to exclude from quantization | [] | |
| model_path | string | Path to the model to be quantized | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | |
5.1. Layer Entries#
Each item accepts:
module_name: Qualified name or wildcard pattern; also matches module types (e.g., Linear, Conv2d for PyTorch; Conv, Gemm for ONNX).
weights: { dtype: <int8|fp8_e4m3fn|fp8_e5m2|native> }.
activations: { dtype: <...> } (ModelOpt backends only; ignored by TorchAO).
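For example, a single entry that quantizes a layer's weights and activations to INT8 could look like the following sketch (the module name is illustrative):
layers:
  - module_name: "backbone.layer1.conv"   # qualified name, wildcard pattern, or module type
    weights: { dtype: "int8" }
    activations: { dtype: "int8" }        # ModelOpt backends only; ignored by TorchAO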
5.2. Pattern Matching#
Patterns are matched first against the qualified module name in the graph, then against the module class name. Wildcards * and ? are supported.
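The sketch below mixes both kinds of matches; the module names and patterns are hypothetical:
quantize:
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  layers:
    - module_name: "backbone.*"       # matches qualified module names under "backbone"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"           # no name match, so it falls back to the module class name
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
  skip_names: ["head.block?.fc"]      # "?" matches a single character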
5.3. Examples#
Weight-only INT8 for all Linear layers, skipping the classifier head:
quantize:
  backend: "torchao"
  mode: "weight_only_ptq"
  layers:
    - module_name: "Linear"
      weights: { dtype: "int8" }
  skip_names: ["classifier.fc"]
Static PTQ INT8 for conv/linear with INT8 activations (PyTorch):
quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv2d"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Linear"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
Static PTQ INT8 for ONNX model:
quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Conv"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
    - module_name: "Gemm"
      weights: { dtype: "int8" }
      activations: { dtype: "int8" }
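FP8 configurations follow the same structure. The sketch below assumes the selected backend accepts the FP8 dtypes listed earlier; treat it as illustrative rather than a validated recipe:
quantize:
  model_path: "/path/to/model.pth"
  results_dir: "/path/to/output"
  backend: "modelopt.pytorch"
  mode: "static_ptq"
  algorithm: "minmax"
  layers:
    - module_name: "Linear"
      weights: { dtype: "fp8_e4m3fn" }
      activations: { dtype: "fp8_e4m3fn" }  # assumes the backend supports FP8 activations
  skip_names: ["classifier.fc"]             # placeholder module name to exclude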
5.4. Advanced: backend_kwargs#
The backend_kwargs field allows you to pass advanced, backend-specific parameters that are not exposed in the main configuration schema. These parameters are merged with the backend configuration and forwarded to the underlying quantization library.
Common use cases:
Overriding default ModelOpt ONNX parameters
Enabling per-channel quantization
Specifying custom execution providers
Advanced calibration options
torchao backend_kwargs: Parameters passed directly to torchao.quantization.quantize_.
quantize:
  backend: "torchao"
  backend_kwargs:
    # Add TorchAO-specific parameters here
    # (refer to the TorchAO documentation)
modelopt.pytorch backend_kwargs: Parameters passed to ModelOpt PyTorch APIs.
quantize:
  backend: "modelopt.pytorch"
  backend_kwargs:
    # Add ModelOpt PyTorch-specific parameters here
modelopt.onnx backend_kwargs: Parameters passed directly to modelopt.onnx.quantization.quantize. Some useful examples are:
quantize:
  backend: "modelopt.onnx"
  backend_kwargs:
    use_external_data_format: true           # For large models >2GB
    op_types_to_quantize: ["Conv", "Gemm"]   # Limit quantized op types
    calibrate_per_node: true                 # Per-node calibration
    high_precision_dtype: "fp16"             # Keep some ops in FP16
For the complete list of parameters, refer to the ModelOpt ONNX quantize API.
Example: Large model with selective quantization:
quantize:
  model_path: "/path/to/model.onnx"
  results_dir: "/path/to/output"
  backend: "modelopt.onnx"
  mode: "static_ptq"
  algorithm: "max"
  layers:
    - module_name: "*"
      weights: { dtype: "fp8_e5m2" }
      activations: { dtype: "fp8_e5m2" }
  skip_names: ["/head/*"]
  backend_kwargs:
    use_external_data_format: true           # For models >2GB
    op_types_to_quantize: ["Conv", "Gemm"]   # Only these ops
Note
Use backend_kwargs only when you need to override defaults or access features not exposed in the main configuration. Incorrect values may cause quantization failures.
5.5. Task-Specific Notes#
Classification: No extra dataset fields are needed beyond your usual evaluation/inference configurations.
RT-DETR + ModelOpt: Ensure you have a representative validation set configured for calibration. Evaluation inputs are reused.
ONNX models: Use TAO's export command (classification_pyt export or rtdetr export) to convert your PyTorch model to ONNX, then use the modelopt.onnx backend with the exported ONNX file path.
For NVIDIA TensorRT™ deployment, use the modelopt.onnx backend for best performance.