Performance

We applied multiple optimizations to speed up ControlNet training throughput. The following numbers were obtained on a single A100 GPU.

Model       Batch Size   Flash Attention   Channels Last   Inductor   Samples per Second   Memory Usage   Weak Scaling
ControlNet  8            NO                NO              NO         11.68                76 GB          1.00
ControlNet  8            YES               NO              NO         16.40                33 GB          1.40
ControlNet  8            YES               YES             NO         20.24                29 GB          1.73
ControlNet  8            YES               YES             YES        21.52                29 GB          1.84
ControlNet  32           YES               YES             YES        27.20                66 GB          2.33
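
The snippet below is a minimal, hypothetical sketch (not the NeMo training code) of how these three optimizations are typically enabled in PyTorch 2.x: channels-last memory layout, the fused scaled_dot_product_attention kernel (which selects the flash-attention path on A100 for FP16 inputs), and TorchInductor compilation. DummyUNet is a stand-in for the actual ControlNet/UNet module.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DummyUNet(nn.Module):
        # Stand-in for the ControlNet/UNet module; not the NeMo implementation.
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

        def forward(self, x):
            return self.conv(x)

    model = DummyUNet().cuda()

    # Channels-last memory layout benefits convolution-heavy UNet blocks on A100.
    model = model.to(memory_format=torch.channels_last)

    # Flash attention: the fused SDPA kernel is picked automatically for
    # FP16/BF16 inputs on supported GPUs.
    q = k = v = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
    out = F.scaled_dot_product_attention(q, k, v)

    # TorchInductor: compile the module with the default "inductor" backend.
    model = torch.compile(model, backend="inductor")

    x = torch.randn(8, 4, 64, 64, device="cuda").to(memory_format=torch.channels_last)
    y = model(x)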

Below are examples of ControlNet generations. In each example, the left column shows the original input (upper) and the conditioning image (lower).

Prompt: House.
[Image: ControlNet_house.png]

Prompt: House in oil painting style.
[Image: ControlNet_painting_house.png]

Prompt: Bear.
[Image: ControlNet_bears.png]

Latency is measured from directly before text encoding (CLIP) to directly after output image decoding (VAE). For the framework numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TRT, we export the models with FP16 acceleration and use the optimized TRT engine setup present in the deployment directory, so the numbers are collected in the same environment as the framework.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: synonymous with num_images_per_prompt
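
The sketch below illustrates the timing convention described above: the clock starts right before text encoding and stops right after VAE decoding, with torch.autocast providing the FP16 (AMP) path. The pipeline object and its encode_text / denoise / decode_latents methods are hypothetical placeholders, not actual NeMo APIs.

    import time
    import torch

    def timed_generation(pipeline, prompt, cond_image,
                         num_images_per_prompt=1, steps=50):
        # Ensure prior GPU work is finished before starting the clock.
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.autocast("cuda", dtype=torch.float16):
            text_emb = pipeline.encode_text(prompt)            # CLIP text encoder
            latents = pipeline.denoise(                        # UNet + ControlNet, DDIM sampler
                text_emb, cond_image,
                num_images_per_prompt=num_images_per_prompt,
                num_inference_steps=steps,
            )
            images = pipeline.decode_latents(latents)          # VAE decoder
        torch.cuda.synchronize()
        return images, time.perf_counter() - start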

Model                  Batch Size   Sampler   Inference Steps   TRT FP16 Latency (s)   FW FP16 (AMP) Latency (s)   TRT vs FW Speedup (x)
ControlNet (Res=512)   1            DDIM      50                1.7                    6.5                         3.8
ControlNet (Res=512)   2            DDIM      50                2.6                    7.1                         2.8
ControlNet (Res=512)   4            DDIM      50                4.4                    11.1                        2.5
ControlNet (Res=512)   8            DDIM      50                8.2                    21.1                        2.6