Diffusion Transformer (DiT)#

The Diffusion Transformer (DiT) is a Vision Transformer backbone for diffusion models. It operates on image patches via a patchify embedding, processes tokens with a sequence of transformer blocks conditioned through adaptive layer normalization (adaLN-Zero), and reconstructs the output via an unpatchify step.

DiT was introduced in Scalable Diffusion Models with Transformers, Peebles & Xie.

DiT#