Denoising Diffusion Probabilistic Models (DDPM)

Generate images by reversing a noising process. The forward process gradually destroys a clean image with Gaussian noise over T steps; a neural network is trained to undo one step at a time. Sample by starting from pure noise and denoising.

TL;DR. Define a fixed forward Markov chain that turns x_0 into pure noise x_T ~ N(0, I). Train a neural net (almost always a U-Net) to predict the noise eps that was added at a randomly chosen step t. At inference, start from x_T ~ N(0, I) and iteratively subtract predicted noise from t = T down to t = 1. Stable Diffusion runs the same algorithm in a compressed VAE latent space and conditions the U-Net on text via cross-attention. State of the art for image, audio, and protein generation; USAAIO uses DDPMs for image generation under tight compute.

1. The intuition

A GAN learns to generate in one shot — push z ~ N(0, I) through a network and out comes an image. That is hard: the network has to leap from pure noise to a sharp natural image in a single function evaluation. GANs work, but training is notoriously unstable.

Diffusion breaks the leap into many small denoising steps. Each step asks only: "given this slightly-noisy image, what was the noise I added?" That's a tiny, well-posed regression problem with clean MSE targets. Stack hundreds of such steps and the cumulative effect is a clean image. The training objective collapses to a single MSE loss; there is no adversary, no balance to tune, and the model rarely fails to train.

Two costs: (1) slow sampling — 50 to 1000 forward passes per image vs 1 for a GAN; (2) high memory for high-resolution images, which Stable Diffusion fixes by running diffusion in a compressed VAE latent space.

2. The math

Forward (noising) process

Fixed Markov chain with variance schedule {beta_1, ..., beta_T} in (0, 1):

q(x_t | x_{t-1}) = N( x_t ; sqrt(1 - beta_t) * x_{t-1}, beta_t * I )

Let alpha_t = 1 - beta_t and alpha_bar_t = prod_{s=1..t} alpha_s. A key derivation: composing Gaussians gives a closed form for any step t starting from x_0, so we never simulate the chain step by step at training time:

q(x_t | x_0) = N( x_t ; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I )

x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,   eps ~ N(0, I)

As t -> T, alpha_bar_t -> 0, so the distribution of x_T approaches N(0, I). The schedule is fixed, not learned (linear beta_t from 1e-4 to 0.02 with T = 1000 in the original paper; cosine schedules are a widely used alternative).
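
The closed form can be checked numerically without sampling anything. The sketch below uses plain Python, and its helper names (linear_betas, chain_coefficients) are illustrative only. It pushes the coefficient on x_0 and the accumulated noise variance through the chain one step at a time, then compares them with sqrt(alpha_bar_T) and 1 - alpha_bar_T.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T, stored 0-indexed."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def chain_coefficients(betas):
    """Apply q(x_t | x_{t-1}) one step at a time.

    Returns (signal, var) such that x_t = signal * x_0 + noise of variance var.
    """
    signal, var = 1.0, 0.0
    for beta in betas:
        alpha = 1.0 - beta
        signal *= math.sqrt(alpha)   # the mean is scaled by sqrt(alpha_t)
        var = alpha * var + beta     # old noise shrinks, fresh noise is added
    return signal, var

betas = linear_betas(1000)
alpha_bar_T = math.prod(1.0 - b for b in betas)
signal, var = chain_coefficients(betas)

print(signal, math.sqrt(alpha_bar_T))   # the two values should agree
print(var, 1.0 - alpha_bar_T)           # likewise
```

The variance recurrence var_t = alpha_t * var_{t-1} + beta_t is what makes the total variance telescope to 1 - alpha_bar_t.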

Reverse (denoising) process

Parameterise the reverse step as a Gaussian whose mean is predicted by a network:

p_theta(x_{t-1} | x_t) = N( x_{t-1} ; mu_theta(x_t, t), Sigma_t )

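Applying Bayes' rule to the forward process gives a Gaussian posterior q(x_{t-1} | x_t, x_0) in closed form. Substituting x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t) into its mean, and replacing the unknown eps with a network prediction, gives: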

mu_theta(x_t, t) = (1 / sqrt(alpha_t)) * ( x_t - (beta_t / sqrt(1 - alpha_bar_t)) * eps_theta(x_t, t) )

where eps_theta is the network's prediction of the noise that was added. The variance Sigma_t is usually fixed to beta_t * I (or a closed form posterior variance); some variants learn it.
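
The closed form posterior variance mentioned above is commonly written beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, with alpha_bar_0 = 1. A minimal sketch comparing it with the simpler choice beta_t, assuming the linear schedule; the function name posterior_variances is illustrative.

```python
def posterior_variances(betas):
    """beta_tilde_t for t = 1..T (0-indexed list), using alpha_bar_0 = 1."""
    out, ab_prev = [], 1.0
    for beta in betas:
        ab = ab_prev * (1.0 - beta)
        out.append((1.0 - ab_prev) / (1.0 - ab) * beta)
        ab_prev = ab
    return out

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
beta_tilde = posterior_variances(betas)

print(beta_tilde[0])                # 0.0: the final reverse step adds no noise
print(beta_tilde[-1] / betas[-1])   # close to 1: both choices nearly coincide at large t
```

Since beta_tilde_t never exceeds beta_t, the two choices can be read as a lower and an upper option for the reverse-step variance.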

Training loss (simplified)

The full ELBO has many KL terms, but Ho et al. (2020) showed a dramatically simpler surrogate works better in practice:

L_simple = E_{t, x_0, eps} [ || eps - eps_theta( sqrt(alpha_bar_t)*x_0 + sqrt(1-alpha_bar_t)*eps, t ) ||^2 ]

In English: pick a random training image x_0, pick a random timestep t in {1..T}, draw fresh Gaussian noise eps, build the noisy sample x_t via the closed form, and train the network to predict eps from (x_t, t). Plain MSE.

Stable Diffusion pointer

Stable Diffusion follows the same recipe with two changes: (1) diffusion runs in a 64x64x4 VAE latent rather than 512x512x3 pixel space, which is much cheaper; (2) the U-Net is conditioned on text embeddings via cross-attention at most resolution levels (Q from image features, K and V from CLIP text features). Classifier-free guidance extrapolates from the unconditional noise prediction towards the conditional one; the guidance scale controls prompt adherence.
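
Classifier-free guidance is usually written eps = eps_uncond + w * (eps_cond - eps_uncond). The sketch below shows only that combination step. The names cfg_combine and guidance_scale are illustrative, and scalars stand in for the two noise predictions; the same arithmetic should work unchanged on tensors.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction towards the conditional one.

    guidance_scale = 0 gives the unconditional prediction, 1 the conditional
    one, and values above 1 push further in the conditional direction.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Scalars stand in for noise tensors here.
for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(0.2, 0.5, w))
```

In practice this costs two denoiser evaluations per sampling step, one with the prompt and one with an empty prompt.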

3. PyTorch reference implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, T)


class NoiseSchedule:
    """Precomputes betas / alphas / alpha_bars for fast lookup at step t."""
    def __init__(self, T=1000, device="cpu"):
        self.T          = T
        self.betas      = linear_beta_schedule(T).to(device)
        self.alphas     = 1.0 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

    def q_sample(self, x0, t, noise):
        """Forward closed form: x_t = sqrt(ab) * x0 + sqrt(1-ab) * noise."""
        ab = self.alpha_bars[t].view(-1, 1, 1, 1)        # (B, 1, 1, 1)
        return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise


class DummyDenoiser(nn.Module):
    """Stub stand-in for a U-Net; takes (x_t, t) and returns predicted noise."""
    def __init__(self, channels=1):
        super().__init__()
        # No learned time embedding in this stub: forward() feeds t in as a scaled extra channel.
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x_t, t):
        # Broadcast timestep as an extra channel (a real U-Net injects via FiLM).
        B, _, H, W = x_t.shape
        t_chan = (t.float() / 1000.0).view(B, 1, 1, 1).expand(B, 1, H, W)
        return self.net(torch.cat([x_t, t_chan], dim=1))


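def training_step(model, sched, x0, optim):
    """One optimisation step of L_simple on a batch x0 of shape (B, C, H, W)."""
    B = x0.size(0)
    # 0-indexed timesteps: index i corresponds to t = i + 1 in the math above.
    t      = torch.randint(0, sched.T, (B,), device=x0.device)
    noise  = torch.randn_like(x0)                  # eps ~ N(0, I)
    x_t    = sched.q_sample(x0, t, noise)          # noisy input via the closed form
    eps_p  = model(x_t, t)                         # predicted noise
    loss   = F.mse_loss(eps_p, noise)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()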


@torch.no_grad()
def sample(model, sched, shape, device="cpu"):
    """Ancestral sampler: x_T ~ N(0, I), iteratively denoise to x_0."""
    model.train(False)                          # equivalent to model.eval()
    x = torch.randn(shape, device=device)
    for t in reversed(range(sched.T)):
        t_b   = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_p = model(x, t_b)
        a     = sched.alphas[t]
        ab    = sched.alpha_bars[t]
        b     = sched.betas[t]
        mean  = (1.0 / torch.sqrt(a)) * (x - (b / torch.sqrt(1.0 - ab)) * eps_p)
        if t > 0:
            x = mean + torch.sqrt(b) * torch.randn_like(x)
        else:
            x = mean                            # last step: no noise added
    return x


if __name__ == "__main__":
    torch.manual_seed(0)
    sched = NoiseSchedule(T=100)
    model = DummyDenoiser(channels=1)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    x0 = torch.randn(8, 1, 16, 16)              # fake batch
    for step in range(3):
        loss = training_step(model, sched, x0, optim)
        print(step, loss)
    out = sample(model, sched, (4, 1, 16, 16))
    print(out.shape)                            # torch.Size([4, 1, 16, 16])

A production DDPM swaps DummyDenoiser for a real U-Net with attention and a timestep embedding fed in via FiLM-style scale-shift at each block. The math above is unchanged. model.train(False) is equivalent to model.eval(), the usual inference switch; it puts layers such as dropout and batch norm into inference mode.
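
The reference code uses the linear schedule. As a sketch of the cosine alternative, the function below defines alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2) and recovers the betas from consecutive ratios. The offset s = 0.008 and the clip at 0.999 are the commonly quoted defaults, taken here as assumptions. The function returns a plain Python list, so it would need wrapping in torch.tensor(...) before replacing linear_beta_schedule.

```python
import math

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from alpha_bar_t = f(t) / f(0)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}; clip to avoid a singular last step.
    return [min(1.0 - f(t) / f(t - 1), max_beta) for t in range(1, T + 1)]

def alpha_bar_at(betas, t):
    """alpha_bar_t for 1-indexed t: the product of the first t alphas."""
    return math.prod(1.0 - b for b in betas[:t])

T = 1000
cosine = cosine_beta_schedule(T)
linear = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

# Halfway through the chain the cosine schedule has destroyed far less signal.
print(alpha_bar_at(cosine, T // 2), alpha_bar_at(linear, T // 2))
```

The comparison at t = T/2 illustrates the usual motivation: the linear schedule removes most of the signal early, leaving many late steps that operate on near-pure noise.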

4. Common USAAIO / IOAI applications

5. Drills

D1 · Closed-form forward step

With T = 1000, linear schedule, what is x_t at t = 0?

Solution

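By convention alpha_bar_0 is the empty product, which equals 1, so at t = 0 the closed form gives x_t = sqrt(1) * x_0 + 0 * eps = x_0. Identity: no noise has been added yet. The first noisy sample is x_1, with alpha_bar_1 = alpha_1 = 1 - 1e-4 = 0.9999 (the 0-indexed reference code stores this value at index 0). By t = T, alpha_bar_T is about 4e-5 and x_T is essentially pure noise.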

D2 · Why predict noise, not the clean image?

The simplified loss predicts eps rather than x_0. Why does that work better empirically?

Solution

At large t, x_t is almost pure noise; the network has essentially no signal to predict x_0. But predicting the noise itself is well-posed at every t — the target is just the Gaussian draw used to build x_t. It also gives a more uniform loss scale across timesteps, stabilising training.

D3 · Number of sampling steps

You trained with T = 1000 but inference is too slow. Options?

Solution

Use a deterministic sampler (DDIM, DPM-Solver), which can reach comparable quality in 20-50 steps. Alternatively, distil the model into a few-step student (progressive distillation, consistency models). The sampler, not the training-time T, sets the step budget at inference.
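
A minimal sketch of the deterministic DDIM update (the eta = 0 case), assuming a trained eps-prediction model. The names ddim_step and strided_timesteps are illustrative. ab_t and ab_prev are alpha_bar values passed as Python floats, for example float(sched.alpha_bars[t]), with ab_prev = 1.0 on the final step.

```python
import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update from alpha_bar ab_t to ab_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - ab_t) * eps_pred) / math.sqrt(ab_t)
    # Re-noise that estimate to the lower noise level, reusing eps_pred.
    return math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps_pred

def strided_timesteps(T, num_steps):
    """Evenly strided 0-indexed timesteps, highest first (needs num_steps <= T)."""
    stride = T // num_steps
    return list(range(T - 1, -1, -stride))[:num_steps]

# Scalar sanity check: with the true noise as eps_pred, jumping straight to
# alpha_bar = 1 recovers x_0.
x0, eps, ab = 0.7, -1.3, 0.3
x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
print(ddim_step(x_t, eps, ab, 1.0))      # approximately 0.7
print(strided_timesteps(1000, 50)[:3])   # [999, 979, 959]
```

Because the coefficients are plain floats, the same function should work on tensors for x_t and eps_pred; a full sampler would loop over the strided timesteps and call the model once per step.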

D4 · Why a U-Net for the denoiser?

One sentence.

Solution

The target eps lives at the same resolution as the input; U-Net's skip connections preserve high-resolution detail while letting deep bottleneck layers reason globally — exactly the symmetric mapping needed.

D5 · Debugging mode collapse vs blur

Your DDPM produces blurry, low-contrast samples. Diagnose.

Solution

Possible causes: (1) too short training — DDPMs need many epochs; (2) noise schedule too aggressive (try cosine); (3) predicting x_0 instead of eps; (4) too few sampling steps; (5) network capacity too small. Note that diffusion models do not collapse the way GANs do — diversity is usually fine; sharpness is the typical failure.

Next step

Loop back to U-Net for the architecture of the denoiser, to VAE for the latent compressor used by Stable Diffusion, and to Transformers for the cross-attention conditioning mechanism. Then drill ELBO and noise-schedule short answers in Round 2 theory.