6. Skipping Layers From Quantization#
6.1. Overview#
Not all layers benefit from quantization. Output heads, embeddings, and certain sensitive layers often need to remain in full precision to preserve accuracy. TAO Quant provides the skip_names field to exclude specific layers or patterns from quantization.
6.2. The skip_names Field#
Add a skip_names list to your quantization configuration to specify which modules should not be quantized:
```yaml
quantize:
  backend: "modelopt.onnx"
  mode: "static_ptq"
  layers:
    - module_name: "*"
      weights: { dtype: "fp8_e5m2" }
      activations: { dtype: "fp8_e5m2" }
  skip_names:
    - "/head/*"       # Skip all layers under head
    - "/output"       # Skip specific output layer
    - "*classifier*"  # Skip any layer with 'classifier' in name
```
6.3. Pattern Matching#
Patterns in skip_names support wildcard notation and are matched using fnmatch semantics:
- `*`: Matches zero or more characters
- `?`: Matches exactly one character
- `[seq]`: Matches any character in seq
- `[!seq]`: Matches any character not in seq
Matching order:

1. The pattern is first matched against the module's qualified name in the graph (e.g., `backbone.layer1.conv`).
2. If there is no match, the pattern is checked against the module's class name (e.g., `Conv2d`, `Linear`).
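Because matching follows Python's `fnmatch` semantics, you can preview how a pattern behaves before adding it to your config. A minimal sketch (the layer names below are illustrative, not from a real model):

```python
from fnmatch import fnmatch

# Each wildcard form from the list above, checked against a sample name:
print(fnmatch("backbone.layer1.conv", "backbone.*"))  # -> True ('*' matches zero or more chars)
print(fnmatch("Conv2d", "Conv?d"))                    # -> True ('?' matches exactly one char)
print(fnmatch("layer1", "layer[12]"))                 # -> True ('[seq]' matches any char in seq)
print(fnmatch("layer3", "layer[!12]"))                # -> True ('[!seq]' matches any char not in seq)
```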
6.4. Common Skip Patterns#
6.4.1. Output Heads#
Most models have output/classification heads that are sensitive to quantization:
```yaml
skip_names:
  - "/head/*"       # ONNX-style: skip all under /head
  - "head.*"        # PyTorch-style: skip all under head module
  - "*classifier*"  # Skip any layer with 'classifier' in name
  - "fc"            # Skip final fully connected layer
```
6.4.2. Embeddings and Norms#
Embedding and normalization (norm) layers often need full precision:
```yaml
skip_names:
  - "*embedding*"  # Skip embedding layers by name
  - "*Embedding"   # Skip embedding layers by class name
  - "*norm*"       # Skip normalization layers by name
  - "*LayerNorm"   # Skip LayerNorm layers by class name
```
6.4.3. Detection Model Components#
For object detection models like RT-DETR:
```yaml
skip_names:
  - "/head/*"     # Detection head
  - "/encoder/*"  # Transformer encoder (if sensitive)
  - "/decoder/*"  # Transformer decoder (if sensitive)
  - "*bbox*"      # Bounding box prediction layers
```
6.5. Backend-Specific Notes#
6.5.1. PyTorch Backends (torchao, modelopt.pytorch)#
- Patterns match PyTorch module names (e.g., `backbone.layer1.0.conv1`).
- Class names use PyTorch conventions (`Conv2d`, `Linear`, `BatchNorm2d`).
Example:
```yaml
skip_names:
  - "classifier.fc"      # Exact module path
  - "backbone.layer4.*"  # All layers in layer4
  - "Linear"             # All Linear layers (by class)
```
6.5.2. ONNX Backend (modelopt.onnx)#
- Patterns match ONNX node names (e.g., `/head/Conv`, `/backbone/layer1/conv/Conv`).
- ONNX node names typically use forward slashes.
- Class names use ONNX operator types (`Conv`, `Gemm`, `MatMul`).
Example:
```yaml
skip_names:
  - "/head/*"         # All nodes under /head
  - "/output"         # Specific output node
  - "*/classifier/*"  # Any classifier subgraph
```
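Note that `fnmatch` does not treat `/` specially, so `*` matches across slash-separated name segments; `/backbone/*` therefore matches nodes nested at any depth under `/backbone`. A quick check with illustrative node names:

```python
from fnmatch import fnmatch

# '*' crosses '/' boundaries in fnmatch (unlike shell path globbing)
print(fnmatch("/head/Conv", "/head/*"))                      # -> True
print(fnmatch("/backbone/layer1/conv/Conv", "/backbone/*"))  # -> True (nested two levels deep)
print(fnmatch("/model/classifier/Gemm", "*/classifier/*"))   # -> True
print(fnmatch("/output", "/output"))                         # -> True (exact match)
```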
6.6. Finding Layer Names#
To determine which layers to skip, you can:
6.6.1. PyTorch Models#
Print the model structure:
```python
for name, module in model.named_modules():
    print(f"{name}: {module.__class__.__name__}")
```
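With the printed `(name, class)` pairs in hand, you can preview which modules a candidate skip list would exclude. An illustrative helper, not a TAO API; the module list stands in for real `model.named_modules()` output:

```python
from fnmatch import fnmatch

def would_skip(name, class_name, patterns):
    """Mirror the documented order: qualified name first, then class name."""
    return any(fnmatch(name, p) or fnmatch(class_name, p) for p in patterns)

# Stand-in for (name, class) pairs gathered from model.named_modules()
modules = [
    ("backbone.layer1.0.conv1", "Conv2d"),
    ("backbone.layer4.2.conv3", "Conv2d"),
    ("classifier.fc", "Linear"),
]
patterns = ["classifier.fc", "backbone.layer4.*"]

skipped = [name for name, cls in modules if would_skip(name, cls, patterns)]
print(skipped)  # -> ['backbone.layer4.2.conv3', 'classifier.fc']
```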
6.6.2. ONNX Models#
Use Netron to visualize the ONNX model:
```shell
pip install netron
netron model.onnx
```
Or use ONNX tools:
```python
import onnx

model = onnx.load("model.onnx")
for node in model.graph.node:
    print(f"{node.name}: {node.op_type}")
```
Best practices:

- Start broad, refine iteratively: Begin by quantizing everything, measure the accuracy drop, then selectively skip sensitive layers.
- Common skip patterns:
  - Always consider skipping output heads.
  - Skip embeddings if they are critical to your task.
  - For detection models, often skip the encoder/decoder in addition to the heads.
- Verify with accuracy metrics:
  - Run evaluation on the unquantized model (baseline).
  - Run evaluation on the fully quantized model.
  - If accuracy drops significantly, add skip patterns and re-evaluate.
- Balance accuracy vs. performance:
  - More skipped layers = better accuracy, lower speedup.
  - Fewer skipped layers = lower accuracy, higher speedup.
  - Aim for a <1-2% accuracy drop with as few skips as possible.
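The <1-2% guideline compares accuracy in absolute points against the baseline. A trivial sketch of the check, with illustrative metric values and a hypothetical 2-point budget:

```python
def within_budget(baseline_acc, quantized_acc, max_drop=2.0):
    """Return (ok, drop), where drop is in accuracy points vs. the baseline."""
    drop = baseline_acc - quantized_acc
    return drop <= max_drop, drop

ok, drop = within_budget(52.3, 48.1)  # fully quantized: drop is too large
print(ok, round(drop, 1))             # -> False 4.2
ok, drop = within_budget(52.3, 51.8)  # after adding skip patterns
print(ok, round(drop, 1))             # -> True 0.5
```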
6.7. Examples by Model Type#
6.7.1. Classification (ResNet, EfficientNet)#
```yaml
skip_names:
  - "fc"            # Final classifier
  - "classifier.*"  # All classifier layers
```
6.7.2. Object Detection (RT-DETR)#
```yaml
skip_names:
  - "/head/*"     # Detection head
  - "/encoder/*"  # Can skip if accuracy-sensitive
  - "/decoder/*"  # Can skip if accuracy-sensitive
```
6.7.3. Segmentation (SegFormer)#
```yaml
skip_names:
  - "decode_head.*"  # Segmentation head
  - "*norm*"         # Normalization layers (optional)
```
6.8. Validation Workflow#
1. Baseline: Evaluate the unquantized model; record accuracy.
2. Full quantization: Quantize all layers; measure the accuracy drop.
3. If the drop exceeds 2%: Add skip patterns for sensitive layers.
4. Iterate: Re-quantize and evaluate until accuracy is acceptable.
5. Final: Measure inference speedup with the final skip configuration.
```shell
# Step 1: Baseline
tao rtdetr evaluate -e config_baseline.yaml
# => mAP: 52.3%

# Step 2: Full quantization
tao rtdetr quantize -e config_full_quant.yaml
tao rtdetr evaluate -e config_full_quant_eval.yaml
# => mAP: 48.1% (drop: 4.2% - too much!)

# Step 3: Add skip patterns
# Update config to skip /head/*, /encoder/*, /decoder/*
tao rtdetr quantize -e config_selective_quant.yaml
tao rtdetr evaluate -e config_selective_quant_eval.yaml
# => mAP: 51.8% (drop: 0.5% - acceptable!)
```