
Stable Diffusion XL Int8 Quantization#

This example shows how to use ModelOpt to calibrate and quantize the UNet part of SDXL. The UNet typically accounts for more than 95% of the end-to-end Stable Diffusion latency.

We also provide instructions for deploying and running the end-to-end SDXL pipeline with the ModelOpt-quantized int8 UNet to generate images and measure latency on target GPUs.

To get started, you need a pretrained SDXL checkpoint in NeMo format. Example training configurations are available in the Stable Diffusion examples in the NVIDIA/NeMo repository.

Calibration#

The first step is to run the quantization script with the default config. The script exports the quantized UNet to an ONNX file. Here is the default config for the SDXL quantize script in NVIDIA/NeMo:

quantize:
  exp_name: nemo
  n_steps: 20          # number of inference steps
  format: 'int8'       # only int8 quantization is supported for now
  percentile: 1.0      # controls the amax collection range: the minimum amax over (n_steps * percentile) steps is used. Recommendation: 1.0
  batch_size: 1        # batch size used when calling the SDXL inference pipeline during calibration
  calib_size: 32       # for SDXL, we recommend 32, 64, or 128
  quant_level: 2.5     # which layers to quantize. 1: CNNs, 2: CNN + FFN, 2.5: CNN + FFN + QKV, 3: CNN + Linear. Recommendation: 2, 2.5, or 3, depending on image-quality and speedup requirements
  alpha: 0.8           # SmoothQuant parameter, used for linear layers only. Recommendation: 0.8 for SDXL
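
Under the hood, the script follows ModelOpt's calibrate-then-quantize flow. The sketch below is a minimal, hypothetical outline of that flow using ModelOpt's PyTorch quantization API; `pipeline`, `calibration_prompts`, and `dummy_inputs` are placeholders rather than the actual NeMo objects, and `INT8_SMOOTHQUANT_CFG` is one of the int8 presets ModelOpt ships (the SmoothQuant alpha is configured inside the config, an assumption here).

import torch
import modelopt.torch.quantization as mtq

# Placeholders (assumptions), not the actual NeMo objects:
pipeline = load_sdxl_pipeline()          # hypothetical loader returning an SDXL pipeline
calibration_prompts = load_prompts()     # hypothetical list of text prompts
dummy_inputs = make_dummy_unet_inputs()  # hypothetical (latents, timesteps, context, y)

def forward_loop(unet):
    # Run calib_size prompts through the pipeline so every quantizer
    # observes activations across all n_steps denoising steps.
    for prompt in calibration_prompts[:32]:       # calib_size: 32
        pipeline(prompt, num_inference_steps=20)  # n_steps: 20

# Insert int8 quantizers (SmoothQuant for linear layers) and calibrate.
quantized_unet = mtq.quantize(pipeline.unet, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

# Export the calibrated UNet to ONNX so TRT can build an engine from it.
torch.onnx.export(quantized_unet, dummy_inputs, "nemo_onnx/unet.onnx")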

Important Parameters#

  • percentile: Controls the range over which quantization scaling factors (amax) are collected: the minimum amax observed over the first (n_steps * percentile) steps is used. We recommend 1.0 (see the sketch after this list).

  • alpha: A SmoothQuant parameter, used for linear layers only. We recommend 0.8 for SDXL and 1.0 for SD 1.5.

  • quant_level: Which layers to quantize. 1: CNNs, 2: CNN + FFN, 2.5: CNN + FFN + QKV, 3: CNN + Linear. We recommend 2, 2.5, or 3, depending on the requirements for image quality and speedup.

  • calib_size: For SDXL, we recommend 32, 64, or 128; for SD 1.5, set it to 512 or 1024.
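
To make the percentile semantics concrete, here is a small, self-contained sketch; the helper name and list-based bookkeeping are illustrative, not ModelOpt internals:

def select_amax(amax_per_step, n_steps, percentile):
    """Return the minimum amax observed over the first
    int(n_steps * percentile) denoising steps."""
    window = max(1, int(n_steps * percentile))
    return min(amax_per_step[:window])

# Example: per-step activation maxima recorded during calibration.
# With percentile = 1.0 the whole 4-step window is searched, so the
# smallest value (3.5) becomes the quantization scaling factor.
print(select_amax([4.1, 3.8, 3.5, 3.9], n_steps=4, percentile=1.0))  # -> 3.5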

Build the TRT Engine for the Quantized ONNX UNet#

trtexec --onnx=./nemo_onnx/unet.onnx --shapes=x:8x4x128x128,timesteps:8,context:8x80x2048,y:8x2816 --fp16 --int8 --builderOptimizationLevel=4 --saveEngine=nemo_unet_xl.plan

Important Parameters#

When building the TRT engine, you must provide the input shapes as follows (a sketch deriving the shape strings follows this list):

  • x: Input image latent shape (B * C * H * W)

  • context: Input text conditioning (B * S * hidden_dimension)

  • y: Additional embedding (B * adm_in_channels)
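
As a sanity check, the shape strings passed to trtexec above can be derived from the pipeline dimensions. The helper below is illustrative; it assumes SDXL's VAE downsamples by a factor of 8 with a 4-channel latent, and takes the sequence length 80, hidden dimension 2048, and adm_in_channels = 2816 directly from the command above:

def trtexec_shapes(batch, height=1024, width=1024, seq_len=80,
                   hidden=2048, adm_in_channels=2816):
    # Latent resolution: image resolution divided by the VAE downsample factor (8).
    latent_h, latent_w = height // 8, width // 8
    return (f"x:{batch}x4x{latent_h}x{latent_w},"
            f"timesteps:{batch},"
            f"context:{batch}x{seq_len}x{hidden},"
            f"y:{batch}x{adm_in_channels}")

print(trtexec_shapes(8))
# -> x:8x4x128x128,timesteps:8,context:8x80x2048,y:8x2816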

Build End-to-end Stable Diffusion XL Pipeline with NeMo#

We provide a script to build an end-to-end TRT inference pipeline with the NeMo backend: the SDXL export script in the NVIDIA/NeMo repository.

infer:
  out_path: sdxl_export
  width: 1024
  height: 1024
  batch_size: 2

trt:
  static_batch: False
  min_batch_size: 1
  max_batch_size: 8

Important Parameters#

  • out_path: Directory in which to save the ONNX file and TRT engine files.

  • width and height: Image resolution of the inference output.

  • batch_size: Used only for dummy input generation and the ONNX sanity check.

  • {min,max}_batch_size: The input batch size of the TRT engine along its dynamic axis (see the sketch after this list).
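
For illustration, this is roughly how {min,max}_batch_size maps onto a TensorRT optimization profile with a dynamic batch axis. It is a sketch against the plain TensorRT Python API, not the NeMo exporter itself; the non-batch dimensions follow the trtexec command above, and the opt batch size of 4 is an assumption:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("nemo_onnx/unet.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# Only the leading (batch) axis is dynamic; the rest match the UNet inputs.
for name, dims in {"x": (4, 128, 128), "context": (80, 2048), "y": (2816,)}.items():
    profile.set_shape(name,
                      min=(1, *dims),   # min_batch_size: 1
                      opt=(4, *dims),   # assumed optimal batch size
                      max=(8, *dims))   # max_batch_size: 8
profile.set_shape("timesteps", min=(1,), opt=(4,), max=(8,))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("nemo_unet_xl.plan", "wb") as f:
    f.write(engine)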

Run End-to-End Stable Diffusion XL TRT Pipeline#

The inference script is the SDXL TRT inference script in the NVIDIA/NeMo repository.

unet_xl: sdxl_export/plan/unet_xl.plan
vae: sdxl_export/plan/vae.plan
clip1: sdxl_export/plan/clip1.plan
clip2: sdxl_export/plan/clip2.plan

out_path: trt_output

To run the quantized pipeline, set unet_xl to the path of the quantized UNet engine; by default, the original engine file is loaded.
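
At inference time, each .plan file listed above is a serialized TensorRT engine. Below is a minimal sketch of loading one with the plain TensorRT Python API; the NeMo pipeline wraps this step for you:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Point this at the quantized UNet engine to run the int8 pipeline;
# otherwise the original engine is used.
with open("sdxl_export/plan/unet_xl.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()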

Inference Speedup#

TRT version: 9.3.0. GPU: H100.

TRT int8 vs Framework fp16#

| Pipeline            | Batch Size | Latency (ms) | Pipeline          | Batch Size | Latency (ms) | Speedup |
|---------------------|------------|--------------|-------------------|------------|--------------|---------|
| Framework fp16 base | 1          | 3056.01      | ModelOpt TRT Int8 | 1          | 1406.68      | 2.17    |
| Framework fp16 base | 2          | 4832.24      | ModelOpt TRT Int8 | 2          | 2403.29      | 2.01    |
| Framework fp16 base | 4          | 8433.71      | ModelOpt TRT Int8 | 4          | 4252.60      | 1.98    |

TRT int8 vs TRT fp16#

| Pipeline  | Batch Size | Latency (ms) | Pipeline      | Batch Size | Latency (ms) | Speedup |
|-----------|------------|--------------|---------------|------------|--------------|---------|
| fp16 base | 1          | 1723.97      | ModelOpt Int8 | 1          | 1406.68      | 1.23    |
| fp16 base | 2          | 3004.47      | ModelOpt Int8 | 2          | 2403.29      | 1.25    |
| fp16 base | 4          | 5657.19      | ModelOpt Int8 | 4          | 4252.60      | 1.33    |

FP16 inference vs Int8 inference#

FP16: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_fp16_1.png
Int8: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_int8_1.png

Prompt: A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat. (FP16 upper vs Int8 lower)

FP16: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_fp16_2.png
Int8: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_int8_2.png

Prompt: A cute corgi lives in a house made out of sushi. (FP16 upper vs Int8 lower)

FP16: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_fp16_3.png
Int8: https://github.com/NVIDIA/NeMo/releases/download/v1.23.0/asset-githubio-home-sdxl_trt_int8_3.png

Prompt: A high contrast portrait of a very happy fuzzy panda dressed as a chef in a high end kitchen making dough. There is a painting of flowers on the wall behind him. (FP16 upper vs Int8 lower)