Stable Diffusion - NVIDIA Docs

Stable Diffusion (SD) [[Paper]](https://arxiv.org/pdf/2112.10752v2.pdf) is a generative model that produces high-quality images from text descriptions. Diffusion models (DMs) decompose image formation into a sequential application of denoising autoencoders and achieve state-of-the-art synthesis results on image data and beyond. Because they typically operate directly in pixel space, however, optimizing powerful DMs is computationally expensive and can consume hundreds of GPU days. To address this, SD runs the diffusion process in the latent space of a powerful pretrained autoencoder. This allows DMs to be trained on limited computational resources while retaining their quality and flexibility, and it greatly improves visual fidelity.
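To give a rough sense of why working in latent space is cheaper, the sketch below compares the number of values in a pixel-space image with the number in its latent representation. It assumes a 512x512 RGB image, an autoencoder downsampling factor of 8, and 4 latent channels. These numbers match a commonly used SD configuration but are not stated on this page, and the function names are illustrative only.

```python
def tensor_elements(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Shape of the latent tensor for an image of the given size.

    downsample_factor and latent_channels are assumptions for illustration,
    not values taken from this page.
    """
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image size must be divisible by the downsample factor")
    return (height // downsample_factor, width // downsample_factor, latent_channels)


# Assumed example: a 512x512 RGB image.
pixel_elems = tensor_elements(512, 512, 3)
lat_shape = latent_shape(512, 512)
latent_elems = tensor_elements(*lat_shape)
reduction = pixel_elems / latent_elems

print(f"pixel space:  {pixel_elems} values")
print(f"latent space: {latent_elems} values, shape {lat_shape}")
print(f"reduction:    {reduction:.0f}x fewer values per denoising step")
```

Under these assumptions the denoising network processes 48 times fewer values per step than it would in pixel space. This is only a proxy for compute cost, since the actual savings depend on the network architecture.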

The SD model also introduces cross-attention layers into the model architecture, which turns diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes. As a result, the paper reports state-of-the-art results for image inpainting and highly competitive performance on tasks including unconditional image generation, semantic scene synthesis, and super-resolution. The SD model also requires significantly less computation than pixel-based DMs, which makes it practical for a wide range of applications.
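The cross-attention idea can be sketched in a few lines of plain Python. In this minimal illustration, the queries stand in for image-latent features and the keys and values stand in for conditioning tokens such as text embeddings. It is a simplification, not the NeMo implementation: the learned query, key, and value projections and multi-head structure of a real layer are omitted, and all names and toy inputs are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: n_q vectors of size d (stand-in for image-latent features)
    keys:    n_k vectors of size d (stand-in for conditioning tokens)
    values:  n_k vectors of size d_v (stand-in for conditioning tokens)
    Returns n_q vectors of size d_v.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every conditioning token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy inputs: two latent positions attending over two conditioning tokens.
latent_features = [[1.0, 0.0], [0.0, 1.0]]
token_keys = [[10.0, 0.0], [0.0, 10.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]

attended = cross_attention(latent_features, token_keys, token_values)
for row in attended:
    print([round(x, 4) for x in row])
```

Each latent position ends up with a mixture of the conditioning values, weighted by how well its query matches each token's key. This is how the conditioning input can influence every spatial location of the image being denoised.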

| Feature | Training | Inference |
| --- | --- | --- |
| Data parallelism | Yes | N/A |
| Tensor parallelism | No | No |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | No | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | Yes | Yes |
| AMP/BF16 | No | No |
| BF16 O2 | No | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton supported |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | Yes | N/A |
| Flash Attention | Yes | N/A |
| NHWC GroupNorm | Yes | Yes |
