Date: 2024-01-25

Before turning to practical considerations, it is worth reviewing the foundational principles behind diffusion models, which are typically formulated as ordinary or stochastic differential equations. Much of this section draws on the paper Score-Based Generative Modeling through Stochastic Differential Equations.

To illustrate the underlying concepts, let’s consider a simplified 1D toy example that mirrors what happens with real images. Real image data resides in a high-dimensional space of pixel values; for instance, specifying a megapixel image requires a million pixel values. Here we visualize everything in 1D. Even though the toy dataset is one-dimensional, the analogy remains valid.

On the vertical axis, we represent pixel values, with different images corresponding to distinct points in this vertical dimension. The horizontal axis denotes time, and we incrementally increase the noise level over time. The core idea here is that we have a dataset comprising samples, and our objective is to train or formulate a model capable of generating additional examples that resemble the dataset. The dataset is illustrated by the bimodal density in red within this toy example.

Consider drawing a sample from this dataset, hypothetically representing an image from it. If we imagine this dataset as a collection of pictures of cats and dogs, we can use this scenario as a stepping stone toward understanding the denoising diffusion approach. Specifically, we explore what transpires when we gradually introduce noise to this image.

In this example, the process corresponds to taking a random walk within the pixel value space. The image gradually deteriorates until it transforms into pure white noise. When multiple samples are involved, they collectively converge toward this indistinguishable white noise distribution.
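The random walk described above can be simulated directly. The following is a minimal sketch, assuming a hypothetical bimodal toy dataset (two narrow modes at -1 and +1, an illustrative choice rather than the data behind the original figure) and Gaussian steps whose variance equals the step length, so that the accumulated noise variance grows linearly with time. All function names are ours.

```python
import numpy as np

def sample_toy_data(n, rng):
    """Hypothetical bimodal dataset: two narrow Gaussian modes at -1 and +1."""
    centers = rng.choice([-1.0, 1.0], size=n)
    return centers + 0.2 * rng.standard_normal(n)

def forward_random_walk(x0, t_end, n_steps, rng):
    """Random walk in pixel-value space; returns shape (n_steps + 1, n_samples)."""
    dt = t_end / n_steps
    # Each step adds independent Gaussian noise with variance dt.
    increments = np.sqrt(dt) * rng.standard_normal((n_steps, x0.shape[0]))
    noisy = x0[None, :] + np.cumsum(increments, axis=0)
    return np.concatenate([x0[None, :], noisy], axis=0)

rng = np.random.default_rng(0)
x0 = sample_toy_data(10_000, rng)
paths = forward_random_walk(x0, t_end=25.0, n_steps=200, rng=rng)

# The data variance (about 1.04 here) is swamped by the added noise variance (25).
print("variance at t = 0: ", paths[0].var())
print("variance at t = 25:", paths[-1].var())
```

Each column of `paths` is one trajectory. After enough time the two modes are no longer distinguishable and the samples look like draws from a single wide Gaussian.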

Viewed together, these trajectories trace out a density in the \(x\)-\(t\) plane.

On the far-left end of the spectrum, we have the original probability density of the data. As time progresses, this density becomes gradually diffused and blurred, ultimately leading to a state where, after a sufficient duration, all data converges to a normally distributed, effectively indistinguishable white noise.
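This blurring can be written down explicitly. As a sketch, assume the noise accumulated by time \(t\) is Gaussian with variance \(t\) (which matches the simple forward SDE used later in this section). The density at time \(t\) is then the data density convolved with a Gaussian; here \(p_{\text{data}}\) and \(y\) are our notation for the data density and a clean sample:

\[p_t(x) = \int p_{\text{data}}(y)\, \mathcal{N}(x;\, y,\, t)\, dy\]

For a hypothetical bimodal dataset modeled as an equal mixture of two Gaussians with means \(\pm\mu\) and variance \(s^2\), this gives

\[p_t(x) = \tfrac{1}{2}\, \mathcal{N}(x;\, -\mu,\, s^2 + t) + \tfrac{1}{2}\, \mathcal{N}(x;\, \mu,\, s^2 + t)\]

so both modes widen as \(t\) grows and, once \(t \gg \mu^2\), they merge into what is effectively a single Gaussian of variance close to \(t\).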

The fundamental concept behind denoising diffusion methods is that the distribution at this final stage is easy to sample from. These methods enable the reverse journey in time, retracing the path back to the original distribution from a random sample. This process effectively implements the gradual denoising of the image.

This yields a random sample from the data distribution. Similarly, when dealing with multiple samples, they collectively approach the data distribution at time zero.

We can understand this formulation in terms of Stochastic Differential Equations (SDEs). The simplest SDE that describes the forward process, where noise is added, states that over an infinitesimal time interval, the change in the image is simply random white noise:

\[d\mathbf{x} = d\omega\]

When we take small steps, a small amount of white noise is added at each step, causing the image to evolve randomly over time. Although there are more general versions of this equation, we will keep it simple for now. Importantly, there exists a reverse version of the same SDE, crucial to the entire diffusion framework:

\[d\mathbf{x} = -\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\,dt + d\omega_{\mathbf{x}}\]

This SDE keeps a stochastic component but adds a deterministic drift term given by the gradient of the log-density of the noised data at time \(t\). In essence, this term attracts the point toward the data’s location at each time step. This gradient is known as the score function, and it has the valuable property that it can be evaluated through denoising:

\[\left( D(\mathbf{x}; \sigma) - \mathbf{x} \right) / \sigma^2\]

If we have an optimal \(L_2\) denoiser \(D\) for noise level \(\sigma\), this expression lets us evaluate the score. Notably, this removes the need to access the generally intractable density function \(p\); having some form of denoiser suffices. In practice, the denoiser is often approximated using a Convolutional Neural Network (CNN), and this is where the network enters these models.
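The identity can be checked numerically in a case where the optimal denoiser is known in closed form. The sketch below assumes a small, hypothetical finite dataset corrupted by Gaussian noise of standard deviation \(\sigma\); the optimal \(L_2\) denoiser is then the posterior mean of the clean sample, a softmax-weighted average of the data points. The denoiser-based score is compared against a finite-difference derivative of the log-density. The dataset and all function names are illustrative.

```python
import numpy as np

def optimal_denoiser(x, sigma, data):
    """Optimal L2 denoiser for a finite dataset: the posterior mean E[y | x].

    Assumes x = y + sigma * noise, with y drawn uniformly from `data`.
    """
    # Log-weight of each data point given the noisy observation x.
    logits = -0.5 * (x[:, None] - data[None, :]) ** 2 / sigma**2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ data

def score_from_denoiser(x, sigma, data):
    """Score evaluated through denoising: (D(x; sigma) - x) / sigma^2."""
    return (optimal_denoiser(x, sigma, data) - x) / sigma**2

def log_density(x, sigma, data):
    """log p(x; sigma): the dataset blurred by Gaussian noise of std sigma."""
    sq = -0.5 * (x[:, None] - data[None, :]) ** 2 / sigma**2
    m = sq.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(sq - m).sum(axis=1))
    return log_sum - np.log(data.size) - 0.5 * np.log(2 * np.pi * sigma**2)

data = np.array([-1.2, -1.0, -0.8, 0.9, 1.0, 1.1])  # hypothetical toy dataset
x = np.linspace(-3.0, 3.0, 13)
sigma, eps = 0.5, 1e-5

# Central finite difference of the log-density as an independent reference.
score_fd = (log_density(x + eps, sigma, data)
            - log_density(x - eps, sigma, data)) / (2 * eps)
score_dn = score_from_denoiser(x, sigma, data)
max_err = np.max(np.abs(score_fd - score_dn))
print("max abs difference:", max_err)
```

A trained network only approximates this posterior mean, so in practice the score obtained this way is approximate as well.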

Song et al. 2021 introduced an alternative approach to formulate diffusion models as deterministic Ordinary Differential Equations (ODEs):

\[d\mathbf{x} = -\frac{1}{2} \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\,dt\]

This formulation solely consists of a score-based term and does not involve a random walk component.

Qualitatively, the evolution of the image follows a distinct pattern. Rather than randomly perturbing the image, it exhibits a gradual transition from noise to the clean image. Armed with this insight, we can overlay these ideal trajectories on top of the density. These trajectories are known as flow curves, and they represent the solutions to the ODE. In practice, the ODE is solved through discretization in time. Given an initial sample, we take finite-length time steps that aim to follow these trajectories.

Zooming in, each step determines how much the image should change, \(dx\), for a given change in time, \(dt\). These steps are repeated until time zero is reached, resulting in the generated sample.
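The stepping procedure can be sketched with plain Euler steps. The example below assumes a hypothetical 1D bimodal dataset (an equal mixture of two Gaussians) for which the score of \(p_t\) is available in closed form, so no network is needed; it also assumes the noise variance at time \(t\) equals \(t\), and it approximates the starting distribution by \(\mathcal{N}(0, T)\). The constants, function names, and the uniform time grid are illustrative choices, not the sampler recommended later in the tutorial.

```python
import numpy as np

MEANS = np.array([-1.0, 1.0])  # hypothetical bimodal toy data
DATA_VAR = 0.2**2              # variance of each mode

def score(x, t):
    """Exact score of p_t for the toy mixture (per-mode variance DATA_VAR + t)."""
    var = DATA_VAR + t
    logits = -0.5 * (x[:, None] - MEANS[None, :]) ** 2 / var
    logits -= logits.max(axis=1, keepdims=True)
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp @ MEANS - x) / var

def sample_ode_euler(n_samples, t_start, n_steps, rng):
    """Integrate dx = -0.5 * score(x, t) dt from t_start down to 0."""
    ts = np.linspace(t_start, 0.0, n_steps + 1)
    # Approximate the noisy end of the process by N(0, t_start).
    x = np.sqrt(t_start) * rng.standard_normal(n_samples)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur               # negative: we move backward in time
        dx = -0.5 * score(x, t_cur) * dt  # change in x for this change in t
        x = x + dx
    return x

rng = np.random.default_rng(0)
samples = sample_ode_euler(n_samples=5000, t_start=25.0, n_steps=1000, rng=rng)
print("fraction in the positive mode:", np.mean(samples > 0))
print("mean distance from the origin:", np.mean(np.abs(samples)))
```

Each sample follows its own deterministic trajectory: the only randomness is in the initial draw. Euler steps only approximately follow the flow curves, which is why the choice of solver and step lengths matters.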

In SDEs, noise is also injected at each of these steps, but we’ll explore that aspect later. For now, let’s focus on analyzing the ODE formulation as it provides valuable insights into the dynamics of this stepping procedure.
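For contrast, here is what a single noise-injecting step looks like. This is a minimal Euler-Maruyama sketch of the reverse SDE given earlier, under the same illustrative assumptions as the rest of the toy example: a hypothetical bimodal Gaussian mixture with a closed-form score, noise variance equal to \(t\), and a start from \(\mathcal{N}(0, T)\). Compared with an ODE step, the drift is twice as large and fresh Gaussian noise with variance \(|dt|\) is added at every step.

```python
import numpy as np

MEANS = np.array([-1.0, 1.0])  # hypothetical bimodal toy data
DATA_VAR = 0.2**2              # variance of each mode

def score(x, t):
    """Exact score of p_t for the toy mixture (per-mode variance DATA_VAR + t)."""
    var = DATA_VAR + t
    logits = -0.5 * (x[:, None] - MEANS[None, :]) ** 2 / var
    logits -= logits.max(axis=1, keepdims=True)
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    return (resp @ MEANS - x) / var

def sample_sde_euler_maruyama(n_samples, t_start, n_steps, rng):
    """Integrate dx = -score(x, t) dt + d(omega) backward from t_start to 0."""
    ts = np.linspace(t_start, 0.0, n_steps + 1)
    x = np.sqrt(t_start) * rng.standard_normal(n_samples)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur                          # negative step in time
        drift = -score(x, t_cur) * dt                # deterministic pull toward data
        noise = np.sqrt(-dt) * rng.standard_normal(n_samples)  # injected noise
        x = x + drift + noise
    return x

rng = np.random.default_rng(0)
sde_samples = sample_sde_euler_maruyama(5000, t_start=25.0, n_steps=2000, rng=rng)
print("fraction in the positive mode:", np.mean(sde_samples > 0))
print("mean distance from the origin:", np.mean(np.abs(sde_samples)))
```

Unlike the deterministic case, two runs from the same initial point end at different samples, because new noise is drawn at every step.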

This concludes, in a nutshell, the background on previous work on ODEs and SDEs for diffusion. The theory is illuminating, and it constrains how the ODE or SDE must operate to recover the correct distribution. However, it leaves many questions and design choices open regarding sampling, stochasticity, and network training: which ODE/SDE solver and step lengths to use, whether stochasticity is needed and how much noise to inject, how to scale the signals, whether to predict the signal or the noise, and how to weight the loss.

In the remainder of this tutorial, we will address these questions based on insights from the EDM paper, Elucidating the Design Space of Diffusion-Based Generative Models. The paper dissects key previous works to understand their components and design choices, with a particular focus on Score-Based Generative Modeling through Stochastic Differential Equations, Denoising Diffusion Probabilistic Models, and Improved Denoising Diffusion Probabilistic Models.

The first of these works presents the Variance Preserving (VP) and Variance Exploding (VE) methods, which differ fundamentally in architectures, stepping schedules, training methods, and more. The latter two are the DDPM and iDDPM methods. By analyzing these methods and their specific design choices related to sampling, preconditioning, training, and more, we aim to construct a comprehensive table of best practices for each of these design choices.

The ultimate goal is to develop a table where each entry represents an optimized design choice, drawing from the theory presented in the paper. It’s important to note that all these design choices are independent and can be studied in isolation.

Generation quality is assessed using the Fréchet Inception Distance (FID), a widely used metric in generative modeling. The remainder of this tutorial is divided into two parts: Part 1 covers deterministic sampling, and Part 2 explores preconditioning and training. A detailed examination of neural network architectures for diffusion models falls outside the scope of this tutorial. Let’s begin with deterministic sampling!