Diffusion Foundations Tutorial En

Source: docs/tutorials/diffusion_foundations_tutorial_en.md SHA256: 5f3b3e8403e8 Rendered: 2026-05-19 18:54 UTC

§0 TL;DR

Diffusion fundamentals in 9 sentences

one-page interview essentials (full derivations in §1-§13).

  1. DDPM (Ho 2020): forward $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ admits closed-form sampling; reverse $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \Sigma_\theta)$ learns the reverse Gaussian; the ELBO simplifies to $L_\text{simple} = \mathbb{E}\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ ($\epsilon$-prediction).
  2. Three equivalent views: DDPM's $\epsilon$, score-based $s = \nabla \log p_t$, and flow matching's $v$ are linearly invertible under the Gaussian path — $s_\theta = -\epsilon_\theta / \sigma_t$, $v = \alpha'x_0 + \sigma'\epsilon$.
  3. Tweedie's formula: $\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$ — a one-line bridge between denoiser and score.
  4. Score SDE (Song 2021): a unified VP-SDE / VE-SDE framework; the reverse-time SDE and the probability flow ODE share the same family of marginals, and the ODE form directly yields FM's vector field.
  5. DDIM (Song 2020 / ICLR 2021): a non-Markovian forward leads to a deterministic sampler with the same marginals as DDPM but a controllable sampling path ($\eta=0$ deterministic; $\eta=1$ over the full $T$ steps degenerates to DDPM ancestral, but with skipped steps it only matches DDPM variance, not strict equivalence).
  6. EDM (Karras 2022): preconditioning makes the network output have unit variance: $D_\theta(x;\sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(c_\text{in}(\sigma) x, c_\text{noise}(\sigma))$; combined with a $\sigma$-schedule + Heun 2nd-order sampler, SOTA FID with NFE down to 18-35.
  7. CFG (Ho-Salimans 2022): during training drop the condition with probability $p_\text{drop}$ → the same net learns conditional/unconditional; at inference $\tilde\epsilon = (1+w)\epsilon_\theta(x,c) - w\epsilon_\theta(x,\emptyset)$, with $w \in [3, 7]$ being the workhorse range for text-to-image.
  8. Production: SD/SDXL use VAE latent + UNet; SD3 / FLUX.1 switch to Rectified Flow + MM-DiT; ControlNet adds a trainable side branch to a frozen UNet; DiT replaces the UNet entirely with a Transformer.
  9. Acceleration: DPM-Solver++ compresses NFE to 10-20; Consistency Models learn $f_\theta(x_t, t) \mapsto x_0$ for 1-4 step sampling; LCM / LCM-LoRA / SDXL-Turbo / SD3-Turbo (ADD) bring distillation to the entire Stable Diffusion family.

§1 Intuition & three views

1.1 One-sentence intuition

Diffusion = learning to "denoise": progressively noise data from clean to pure Gaussian (forward), then learn to reverse it step-by-step from noise back to data (reverse). All diffusion papers differ on just three things:

1.2 Comparison of the three views

                            Unified framework (Song et al. 2021)
                            
       Discrete view (DDPM)    Continuous view (Score SDE)   Flow view (FM/RF)
       ────────────         ──────────────────       ────────────────
       q(x_t|x_{t-1})  →    dx = f(x,t)dt+g(t)dW  →   dx = u_t(x) dt
        closed-form q(x_t|x_0) forward SDE              ODE (deterministic)
              ↓                       ↓                       ↓
        ε-prediction        score s = ∇ log p_t        vector field v_t
              ↘                       ↓                       ↙
                          All linearly invertible (under Gaussian path)
                          s = -ε/σ_t,   v = α'x_0 + σ'ε,   ε = -σ s
One-line interview answer

"DDPM is a special case of VP-SDE in discrete time; score-based is the equivalent parametrization in continuous time; Flow Matching carries the same information as score matching under VP/VE paths but parametrizes as $v$ instead of $s$. Rectified Flow steps outside the SDE framework, using a linear path to directly learn an ODE's vector field."

1.3 Convention (used throughout)

SymbolMeaning
$x_0$clean data sample
$x_t$, $t \in \{1,\dots,T\}$ or $t \in [0,T]$noised sample
$\epsilon \sim \mathcal{N}(0, I)$standard Gaussian noise
$\alpha_t, \beta_t = 1 - \alpha_t$DDPM single-step forward coefficients
$\bar\alpha_t = \prod_{s=1}^t \alpha_s$DDPM cumulative coefficient
$\sigma_t$standard deviation (the "noise level" in NCSN / EDM view)
$s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t)$score
$\epsilon_\theta(x_t, t) \approx \epsilon$the noise predicted in DDPM
$D_\theta(x; \sigma) \approx x_0$EDM's denoiser output
Time-direction pitfall

DDPM's paper has forward going $t = 0 \to T$ (data noised to pure noise) and reverse going $T \to 0$; FM papers often use $t = 0$ noise and $t = 1$ data. Before writing code in an interview always disambiguate the time direction — otherwise the sampler is easy to flip.

§2 DDPM Forward Process

2.1 Single step and closed form

The DDPM forward is a Markov chain:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I), \quad t = 1, \dots, T$$

Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$. Key property: $q(x_t | x_0)$ is a closed-form Gaussian — you can jump from $x_0$ to any $t$ in a single step (the core of training efficiency):

$$\boxed{\; q(x_t | x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t) I\right) \;}$$

Equivalent reparameterization:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

2.2 Closed-form derivation (mandatory, will keep coming up)

From the reparameterization $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} z_t$ with independent $z_t \sim \mathcal{N}(0, I)$. Recurse:

$$ \begin{aligned} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} z_t \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} z_{t-1}\right) + \sqrt{\beta_t} z_t \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \underbrace{\sqrt{\alpha_t \beta_{t-1}} z_{t-1} + \sqrt{\beta_t} z_t}_{\text{sum of independent Gaussians}} \end{aligned} $$

Variance of the sum of two independent Gaussians: $\alpha_t \beta_{t-1} + \beta_t = \alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. So it merges into a single Gaussian $\sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar z$. Induct to step $t$:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$$

Variational trick intuition

the benefit of a Markov chain with Gaussian steps is that the cumulative distribution is still Gaussian; this lets forward sample without a network, and training doesn't need to simulate the whole chain.

2.3 Boundary cases and limits

SNR (Signal-to-Noise Ratio) at schedule end

SNR$(t) = \bar\alpha_t / (1-\bar\alpha_t)$; the linear schedule has $\bar\alpha_T \approx 4\times 10^{-5}$ at $t=T$, corresponding to SNR $\approx 4\times 10^{-5}$ — small but strictly speaking not zero, so the prior doesn't fully match $\mathcal{N}(0,I)$; this is one motivation behind the cosine schedule and "v-prediction" improvements.

§3 DDPM Reverse Process & Training

3.1 The premise that reverse is Gaussian

In theory $q(x_{t-1} | x_t)$ is not Gaussian (it depends on the entire data distribution). But when $\beta_t$ is small enough, the reverse conditional is approximately Gaussian (Feller 1949 / Sohl-Dickstein 2015), so we parametrize:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$

3.2 ELBO derivation

DDPM optimizes the evidence lower bound (analogous to a VAE):

$$ \begin{aligned} \log p_\theta(x_0) &\ge \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \\ &= -\underbrace{\mathbb{E}_q[\text{KL}(q(x_T|x_0) \,\Vert\, p(x_T))]}_{L_T \text{ (constant, prior matching)}} \\ &\quad - \sum_{t=2}^T \underbrace{\mathbb{E}_q[\text{KL}(q(x_{t-1}|x_t, x_0) \,\Vert\, p_\theta(x_{t-1}|x_t))]}_{L_{t-1}} \\ &\quad + \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0 \text{ (decoder log-likelihood)}} \end{aligned} $$

Key: $q(x_{t-1} | x_t, x_0)$ is a closed-form Gaussian (derived from Bayes):

$$q(x_{t-1} | x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\right)$$

where:

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}} \beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t, \quad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$$

3.3 Simplifying to $L_\text{simple}$ (mandatory derivation)

Substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon) / \sqrt{\bar\alpha_t}$ into $\tilde\mu_t$:

$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\right)$$

Parametrize $\mu_\theta(x_t, t)$ in the same form ($\epsilon$-prediction):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t)\right)$$

Fix $\Sigma_\theta = \sigma_t^2 I$ (take $\sigma_t^2 = \beta_t$ or $\tilde\beta_t$). The KL between two Gaussians:

$$L_{t-1} = \mathbb{E}\left[\frac{1}{2\sigma_t^2} \| \tilde\mu_t - \mu_\theta \|^2\right] = \mathbb{E}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

Ho 2020's engineering trick: drop all preceding coefficients + constant terms and just use the unweighted version:

$$\boxed{\; L_\text{simple}(\theta) = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\; x_0,\; \epsilon}\Big[\big\|\epsilon - \epsilon_\theta\!\big(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon,\; t\big)\big\|^2\Big] \;}$$

Why does dropping the coefficient still work?

Ho 2020's empirical observation: the unweighted version effectively upweights low-SNR (high $t$) loss, which in turn improves sample quality. The cost: $\log$-likelihood is no longer the ELBO lower bound — so "good FID" ≠ "good likelihood". Improved DDPM (Nichol-Dhariwal 2021) later introduces a hybrid loss $L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}$ ($\lambda = 0.001$), simultaneously learning $\Sigma_\theta$.

3.4 Equivalent prediction targets (memorize)

Given $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon$, the three mainstream parametrizations are linearly invertible:

$$ \begin{aligned} \epsilon\text{-pred} &:\quad \epsilon_\theta(x_t, t) \approx \epsilon \\ x_0\text{-pred} &:\quad \hat x_0(x_t, t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta}{\sqrt{\bar\alpha_t}} \\ v\text{-pred (Salimans-Ho 2022)} &:\quad v_\theta = \sqrt{\bar\alpha_t}\, \epsilon - \sqrt{1-\bar\alpha_t}\, x_0 \\ \text{score} &:\quad s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}} \end{aligned} $$

Why is v-prediction more stable?

$\epsilon$-pred degenerates as $t \to 0$ (small noise; loss coefficient explodes); $x_0$-pred degenerates as $t \to T$ (large noise); $v$-pred interpolates between them and has approximately uniform loss magnitude across all $t$ — that's the key choice in Imagen Video / SD2.1-v / Karras EDM.

§4 Schedule: linear / cosine / EDM

4.1 Linear (Ho 2020)

$$\beta_t = \beta_\text{start} + \frac{t-1}{T-1}(\beta_\text{end} - \beta_\text{start}), \quad \beta_\text{start} = 10^{-4},\; \beta_\text{end} = 0.02$$

$T = 1000$. Simple and stable, but the end-SNR isn't strictly zero ($\bar\alpha_T \approx 4 \times 10^{-5}$, SNR $\approx 4 \times 10^{-5}$, while an ideal prior wants this closer to 0).

4.2 Cosine (Nichol-Dhariwal 2021)

$$\bar\alpha_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos^2\!\left(\frac{(t/T) + s}{1 + s} \cdot \frac{\pi}{2}\right), \quad s = 0.008$$

$\beta_t = 1 - \bar\alpha_t / \bar\alpha_{t-1}$ (then clip to $[0, 0.999]$ for numerical stability). The $s = 0.008$ offset is to prevent $\beta_1$ from being too close to 0.

Why is the cosine schedule better?

A linear schedule adds noise too fast in the low-$t$ region, so the model spends most of its time "training" on already-pure-noise regions (and learns nothing). Cosine adds noise slowly at low $t$, faster in the middle, and reaches near-zero SNR at the end. Improved DDPM experiments: cosine improves FID on ImageNet 64 by about 20% vs linear.

4.3 EDM σ-schedule (Karras 2022)

EDM reparametrizes the $\beta$ schedule into a $\sigma$ schedule (directly using $\sigma$ as time). At sampling:

$$\sigma_i = \left(\sigma_\text{max}^{1/\rho} + \frac{i}{N-1}\left(\sigma_\text{min}^{1/\rho} - \sigma_\text{max}^{1/\rho}\right)\right)^\rho, \quad i = 0, \dots, N-1$$

Defaults: $\sigma_\text{min} = 0.002$, $\sigma_\text{max} = 80$, $\rho = 7$. $\rho = 7$ was swept empirically by Karras — beats linear / log spacing because it allocates more steps to the small-$\sigma$ (high-SNR) region, where stepping errors are more sensitive.

Discrete vs continuous schedules

DDPM's $\beta$ array is equivalent to a VP-SDE's $\beta(t) = T \beta_{\lfloor tT \rfloor}$; EDM's $\sigma$-schedule corresponds to a VE-SDE with $\sigma(t) = t$ (linear time); the two differ only by a $t$ reparametrization, informationally equivalent. EDM's contribution is discovering an empirically more stable $\sigma_i$ selection rule.

§5 Score-based view

5.1 Score and score matching (Hyvärinen 2005)

Define $s(x) = \nabla_x \log p(x)$. If we learn $s_\theta \approx s$, we can sample via Langevin dynamics:

$$x_{k+1} = x_k + \frac{\eta}{2} s_\theta(x_k) + \sqrt{\eta}\, z_k, \quad z_k \sim \mathcal{N}(0, I)$$

The direct score matching loss $\mathbb{E}_p\|s_\theta - \nabla\log p\|^2$ is not computable (we don't know $\nabla \log p$). Hyvärinen 2005 gave implicit score matching that avoids $\nabla \log p$ via integration by parts:

$$\mathbb{E}_p\left[\|s_\theta(x)\|^2 + 2 \operatorname{tr}(\nabla_x s_\theta(x))\right]$$

But $\operatorname{tr}(\nabla_x s_\theta)$ is too expensive in high dimensions (Hessian trace).

5.2 Denoising Score Matching (Vincent 2011)

For each data point $x_0$, add noise $\tilde x = x_0 + \sigma \epsilon$ and define the perturbed distribution $p_\sigma(\tilde x) = \int p(x_0) \mathcal{N}(\tilde x; x_0, \sigma^2 I) dx_0$. Vincent 2011 proved:

$$\mathbb{E}_{p_\sigma(\tilde x)}\|s_\theta(\tilde x) - \nabla \log p_\sigma(\tilde x)\|^2 = \mathbb{E}_{x_0, \tilde x}\left\|s_\theta(\tilde x) - \nabla_{\tilde x} \log q(\tilde x | x_0)\right\|^2 + \text{const}$$

And the score of $q(\tilde x | x_0) = \mathcal{N}(x_0, \sigma^2 I)$ has a closed form:

$$\nabla_{\tilde x} \log q(\tilde x | x_0) = -\frac{\tilde x - x_0}{\sigma^2} = -\frac{\epsilon}{\sigma}$$

So the training loss simplifies to:

$$\boxed{\; L_\text{DSM}(\theta) = \mathbb{E}_{x_0, \sigma, \epsilon}\left\| \sigma\, s_\theta(\tilde x; \sigma) + \epsilon \right\|^2 \;}$$

This is exactly the NCSN / SMLD training objective (up to a weight).

5.3 Tweedie's formula (mandatory derivation)

Statement: for additive Gaussian noise $x_t = x_0 + \sigma_t \epsilon$ (VE view, $\epsilon \sim \mathcal{N}(0,I)$):

$$\boxed{\; \mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2\, \nabla_{x_t} \log p_t(x_t) \;}$$

Derivation: $p_t(x_t) = \int p_0(x_0) \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0$. Take the gradient w.r.t. $x_t$:

$$\nabla_{x_t} p_t(x_t) = \int p_0(x_0) \cdot \nabla_{x_t} \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0 = \int p_0(x_0) \cdot \mathcal{N}(x_t; x_0, \sigma_t^2 I) \cdot \frac{x_0 - x_t}{\sigma_t^2}\, dx_0$$

Divide both sides by $p_t(x_t)$:

$$\nabla_{x_t} \log p_t(x_t) = \frac{1}{p_t(x_t)} \int p_0(x_0) \mathcal{N}(x_t | x_0) \frac{x_0 - x_t}{\sigma_t^2}\, dx_0 = \mathbb{E}_{p_0(x_0 | x_t)}\left[\frac{x_0 - x_t}{\sigma_t^2}\right]$$

That is:

$$\sigma_t^2 \nabla_{x_t} \log p_t(x_t) = \mathbb{E}[x_0 | x_t] - x_t \quad \Rightarrow \quad \mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t) \quad \square$$

Tweedie is the "Rosetta Stone" linking all diffusion parametrizations

the denoiser network's optimal output (MMSE estimator) is the score plus an identity map. All conversions among $\epsilon$-pred / score-pred / $x_0$-pred / $v$-pred are one-line rearrangements of Tweedie.

5.4 NCSN / SMLD (Song-Ermon 2019)

Noise-Conditional Score Network: train a shared network $s_\theta(x, \sigma)$ that does DSM at multiple noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$ simultaneously. Sampling uses annealed Langevin dynamics: first Langevin at large $\sigma_1$ (explore the whole space), then decay to $\sigma_L$ (refine details).

$$x \leftarrow x + \frac{\epsilon_i}{2} s_\theta(x, \sigma_i) + \sqrt{\epsilon_i}\, z, \quad \epsilon_i = \eta \cdot (\sigma_i / \sigma_L)^2$$

Run $T$ Langevin steps at each $\sigma_i$, then switch to the next $\sigma_{i+1}$.

Why doesn't a single $\sigma$ work?

Scores trained at small $\sigma$ are completely wrong in regions far from the data manifold (in the "empty regions" between modes where $p(x) \approx 0$, the score gives no direction). The core of multiple noise levels is using large $\sigma$ to "fill" the space, providing initial positions for small $\sigma$.

§6 Score SDE: unified framework + Probability Flow ODE

6.1 Forward SDE

Song et al. 2021 (ICLR) write all diffusions as a forward SDE:

$$dx = f(x, t)\, dt + g(t)\, dW$$

Type$f(x, t)$$g(t)$Discrete counterpart
VP-SDE (variance preserving)$-\frac{1}{2}\beta(t) x$$\sqrt{\beta(t)}$DDPM
VE-SDE (variance exploding)$0$$\sqrt{d[\sigma^2(t)]/dt}$SMLD / EDM
sub-VP$-\frac{1}{2}\beta(t) x$$\sqrt{\beta(t)(1-e^{-2\int_0^t \beta(s)ds})}$between VP/VE, better likelihood

VP-SDE satisfies $\text{Var}[x_t] \le 1$ (variance preserving); VE-SDE has unbounded variance growth (variance exploding).

6.2 Reverse SDE (Anderson 1982)

For any forward SDE, there exists a reverse-time SDE:

$$\boxed{\; dx = \left[f(x, t) - g^2(t)\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar W \;}$$

where $d\bar W$ is a reverse-time Wiener process. Sampling: start from $x_T \sim p_T$ (close to prior), then integrate to $t = 0$ with an SDE solver (Euler-Maruyama / predictor-corrector).

6.3 Probability Flow ODE (bridge to FM)

Key theorem (Song et al. 2021, "Score-Based Generative Modeling through SDEs"): the following deterministic ODE shares all marginals $p_t$ with the reverse SDE:

$$\boxed{\; \frac{dx}{dt} = f(x, t) - \frac{1}{2} g^2(t)\, \nabla_x \log p_t(x) \;}$$

This is the probability flow ODE. Equivalent to Flow Matching's vector field:

$$u_t(x) = f(x, t) - \tfrac{1}{2} g^2(t)\, s_\theta(x, t)$$

The three samplers' relationship
           forward SDE (training: score matching)
                       ↓
           ┌──────────────────────┐
           ↓                      ↓
     reverse SDE              probability flow ODE
     (stochastic)             (deterministic, ⇔ FM)
           ↓                      ↓
   DDPM ancestral sampler   DDIM (η=0) / EDM / DPM-Solver

Proof sketch: write the forward SDE's Fokker-Planck (continuity equation):

$$\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2} g^2 \Delta p_t$$

Using $\Delta p_t = \nabla \cdot (p_t \nabla \log p_t)$, write the diffusion term in transport form:

$$\frac{\partial p_t}{\partial t} = -\nabla \cdot \left[\left(f - \tfrac{1}{2} g^2 \nabla \log p_t\right) p_t\right]$$

This is exactly the continuity equation for the ODE $dx/dt = f - \frac{1}{2} g^2 \nabla \log p_t$ — so their $p_t$ agree.

6.4 Advantages of the ODE view

AdvantageDescription
Deterministicsame noise → same sample, enables image editing / interpolation
NFE-friendlyhigh-order ODE solvers (Heun / RK4 / DPM-Solver) need few steps
Computable likelihood$\log p_0(x_0) = \log p_T(x_T) + \int_0^T \nabla \cdot v_t(x(t))\, dt$ (PF-ODE instantaneous change-of-variables, Chen et al. 2018), with div estimated via Hutchinson trace estimator
Bridge to FMthe route taken by RF / SD3 / FLUX
SDE vs ODE trade-off

the stochastic perturbation in SDE sampling can "correct" early errors, typically yielding better sample quality but at higher NFE; ODE is deterministic but susceptible to solver-error accumulation and needs higher-order solvers. EDM proposes a middle ground: base ODE + small stochastic churn ("$S_\text{churn}$"), with better FID.

§7 DDIM: Non-Markovian forward → deterministic sampler

7.1 Motivation

DDPM ancestral sampling must walk all $T = 1000$ steps (a Markov chain). Can we sample with fewer steps without retraining? DDIM (Song et al. 2020 arXiv / ICLR 2021) gives a "yes" — the core is making forward non-Markovian while preserving the same marginal $q(x_t | x_0)$ as DDPM.

7.2 Non-Markovian forward

DDIM defines a family of forward distributions, controlled by a parameter $\eta \in [0, 1]$:

$$q_\sigma(x_{t-1} | x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$$

where $\sigma_t^2 = \eta^2 \cdot \tilde\beta_t = \eta^2 \cdot \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \beta_t$.

Key property (DDIM Theorem 1): under this forward, $q(x_t | x_0)$ is still $\mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ — identical to DDPM! So we can directly use a DDPM-trained $\epsilon_\theta$ for DDIM sampling.

7.3 DDIM sampling formula

Substitute $x_0 \to \hat x_0 = (x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)) / \sqrt{\bar\alpha_t}$:

$$\boxed{\; x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t\, z, \quad z \sim \mathcal{N}(0, I) \;}$$

7.4 Skip steps (few-step sampling)

You don't have to step $t \to t-1$; you can skip: pick a sub-sequence $\tau_0 < \tau_1 < \dots < \tau_S = T$ and do:

$$x_{\tau_{i-1}} = \sqrt{\bar\alpha_{\tau_{i-1}}}\, \hat x_0 + \sqrt{1 - \bar\alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2}\, \epsilon_\theta(x_{\tau_i}, \tau_i) + \sigma_{\tau_i}\, z$$

Classic baseline: $S = 50$ DDIM steps achieve FID close to 1000-step DDPM on ImageNet 256.

DDIM = discretization of the probability flow ODE

when $\eta = 0$ and the time grid is taken continuous, DDIM degenerates to the first-order Euler discretization of the probability flow ODE for the VP-SDE — that's why deterministic DDIM lines up with ODE-based samplers (DPM-Solver, EDM Heun).

§8 EDM: Karras 2022 Design Space

8.1 Motivation

Karras 2022 ("Elucidating the Design Space of Diffusion-Based Generative Models") decomposes all diffusion design knobs (parametrization, loss weighting, sampler, schedule) and sweeps each, arriving at SOTA recipes: CIFAR-10 FID 1.79 (35 NFE), ImageNet 64 FID 1.36.

8.2 Preconditioning (mandatory derivation)

EDM adopts the VE view: $x = x_0 + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, treating $\sigma$ directly as the noise level (no $\alpha$).

Denoiser parametrization:

$$\boxed{\; D_\theta(x;\, \sigma) = c_\text{skip}(\sigma)\, x + c_\text{out}(\sigma)\, F_\theta\!\left(c_\text{in}(\sigma)\, x,\; c_\text{noise}(\sigma)\right) \;}$$

where $F_\theta$ is the base network and the four $c$ functions are a hand-designed schedule. Karras' derivation:

Derivation: unit-variance argument

Goal: make $F_\theta$'s input and training target have $\mathcal{O}(1)$ variance at all $\sigma$.

Input side: the network sees $c_\text{in} x$. Since $\text{Var}[x] = \sigma_\text{data}^2 + \sigma^2$ (data variance + noise variance):

$$c_\text{in}(\sigma) = \frac{1}{\sqrt{\sigma_\text{data}^2 + \sigma^2}} \quad \Rightarrow \quad \text{Var}[c_\text{in} x] = 1$$

Output side: the ideal denoiser is $D^*(x; \sigma) = \mathbb{E}[x_0 | x]$ (Tweedie). Have the network learn the residual instead of the full quantity: define the effective target

$$F^*(x; \sigma) = \frac{1}{c_\text{out}(\sigma)}\left[D^*(x;\sigma) - c_\text{skip}(\sigma)\, x\right]$$

We want $\text{Var}[c_\text{out} F^* + c_\text{skip} x - D^*] = 0$ and $\text{Var}[F^*] = 1$ (so the network's target has unit variance).

Find $c_\text{skip}, c_\text{out}$ minimizing the effective error (minimize $\mathbb{E}\|F^* - F_\theta\|^2$ under $\text{Var}[F^*]=1$). Karras takes $D^* = x_0$ (ideal case), substitutes and expands:

$$c_\text{skip}(\sigma) = \frac{\sigma_\text{data}^2}{\sigma^2 + \sigma_\text{data}^2}, \quad c_\text{out}(\sigma) = \frac{\sigma \cdot \sigma_\text{data}}{\sqrt{\sigma^2 + \sigma_\text{data}^2}}$$

Intuition:

Time encoding: $c_\text{noise}(\sigma) = \frac{1}{4} \ln \sigma$ (log-scale, covering the wide dynamic range $\sigma \in [\sigma_\text{min}, \sigma_\text{max}]$).

8.3 Training loss

EDM uses a weighted L2:

$$L_\text{EDM}(\theta) = \mathbb{E}_{\sigma, x_0, \epsilon}\Big[\lambda(\sigma)\, \big\| D_\theta(x_0 + \sigma\epsilon;\, \sigma) - x_0 \big\|^2\Big]$$

The weight is $\lambda(\sigma) = (\sigma^2 + \sigma_\text{data}^2) / (\sigma \cdot \sigma_\text{data})^2 = 1/c_\text{out}^2$, equivalent to training $F_\theta$ with unweighted L2 (the target has unit variance at every $\sigma$, so loss magnitudes are uniform).

Training $\sigma$ sampling: $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$, defaults $P_\text{mean} = -1.2$, $P_\text{std} = 1.2$ (concentrates $\sigma$ around $0.3$ — the "hardest to learn" SNR region, swept by Karras).

8.4 Heun 2nd-order sampler

EDM sampling defaults to Heun 2nd-order ODE + optional stochastic churn. The VE-SDE's probability flow ODE, with $f = 0, g(t) = \sqrt{d\sigma^2/dt}$:

$$\frac{dx}{d\sigma} = -\sigma\, \nabla_x \log p_\sigma(x) = \frac{x - D_\theta(x; \sigma)}{\sigma}$$

(Use Tweedie: $\nabla \log p_\sigma = (D - x)/\sigma^2$, substitute into $dx/d\sigma = -\sigma \nabla \log p$.)

Heun step $i$ ($\sigma_i \to \sigma_{i+1}$, $\Delta\sigma = \sigma_{i+1} - \sigma_i$):

d_i  = (x_i - D_θ(x_i, σ_i)) / σ_i
x_*  = x_i + Δσ · d_i                       # Euler step (predictor)
if σ_{i+1} > 0:                             # skip corrector at last step
    d_*  = (x_* - D_θ(x_*, σ_{i+1})) / σ_{i+1}
    x_{i+1} = x_i + Δσ · (d_i + d_*) / 2     # Heun trapezoidal (corrector)
else:
    x_{i+1} = x_*

2 NFE per step, but second-order accuracy — more NFE than Euler but much more accurate. CIFAR-10 EDM with 35 NFE = 18 Heun steps + first-order final step yields FID 1.79.

Stochastic churn (optional)

at the start of each step temporarily raise $\sigma_i$ to $\hat\sigma_i = (1+\gamma_i)\sigma_i$ (with $\gamma_i$ a small per-step churn), injecting extra noise: $\hat x_i = x_i + \sqrt{\hat\sigma_i^2 - \sigma_i^2}\, z$, where $\sqrt{\hat\sigma_i^2 - \sigma_i^2} = \sigma_i\sqrt{2\gamma_i + \gamma_i^2}$; stepping from $\hat\sigma_i$ down to $\sigma_{i+1}$ is equivalent to a small SDE. EDM experiments: a small amount of churn slightly improves FID on ImageNet (about 0.1-0.3).

§9 High-order samplers: DPM-Solver / DPM-Solver++

9.1 Motivation

DDIM is first-order ODE Euler. DPM-Solver (Lu et al. 2022 NeurIPS) exploits the semi-linear structure of the diffusion ODE for high-order expansion. Rewriting the probability flow ODE under VP-SDE with $\epsilon$-pred:

$$\frac{dx}{dt} = f(t)\, x + g(t)\, \epsilon_\theta(x, t)$$

where $f(t) = -\frac{1}{2}\beta(t)$, $g(t) = +\frac{1}{2}\beta(t)/\sqrt{1-\bar\alpha_t}$ (from $-\frac{1}{2}g_\text{SDE}^2 \cdot s = +\frac{1}{2}\beta\cdot \epsilon/\sqrt{1-\bar\alpha_t}$, since $s = -\epsilon/\sqrt{1-\bar\alpha_t}$).

Integrate the linear part exactly (exponential integrator) and Taylor-expand the rest.

9.2 DPM-Solver-2 / 3 (core idea)

Let $\lambda_t = \log(\sqrt{\bar\alpha_t} / \sqrt{1-\bar\alpha_t})$ (log-SNR), use $\lambda$ as time. Rewrite the ODE as:

$$x_{t} = \frac{\sqrt{\bar\alpha_t}}{\sqrt{\bar\alpha_s}} x_s - \sqrt{\bar\alpha_t} \int_{\lambda_s}^{\lambda_t} e^{-\lambda} \hat\epsilon_\theta(x_\tau, \tau)\, d\lambda$$

Taylor-expand $\hat\epsilon_\theta$ in $\lambda$ to order $k$, integrate the linear part exactly (exponential weight), and approximate the rest by order:

10-15 NFE reaches the same quality as 50 NFE DDIM.

9.3 DPM-Solver++ (CFG-friendly variant, Lu et al. 2023)

The original DPM-Solver is unstable under CFG ($\epsilon_\theta$ amplified by CFG goes out-of-distribution, and Taylor-expansion error blows up). DPM-Solver++ switches to $x_0$-prediction:

$$x_t = \frac{\sigma_t}{\sigma_s} x_s + \sigma_t \int_{\lambda_s}^{\lambda_t} e^{\lambda} \hat x^0_\theta(x_\tau, \tau)\, d\lambda$$

(Using $x_0$-pred instead of $\epsilon$-pred keeps CFG amplification in a more stable regime.)

15-20 NFE under CFG=7 gives quality close to 100-NFE DDIM. One of the default samplers in SDXL / SD3.

9.4 Sampler comparison

Common sampler cheat sheet

ordered by NFE / quality / compatibility (image generation).

§10 Conditioning: Classifier Guidance & CFG

10.1 Classifier Guidance (Dhariwal-Nichol 2021)

Train a separate classifier $p_\phi(c | x_t)$ on noisy data, then apply Bayes:

$$\nabla_{x_t} \log p(x_t | c) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(c | x_t)$$

In practice we scale the classifier gradient by $w$ (controlling guidance strength):

$$\tilde\epsilon = \epsilon_\theta(x_t, t) - w \sqrt{1-\bar\alpha_t}\, \nabla_{x_t} \log p_\phi(c | x_t)$$

Drawbacks of classifier guidance

(a) needs an extra noisy classifier (engineering overhead); (b) classifier gradients tend toward "adversarial" behavior, degenerating away from the training distribution; (c) unfriendly to continuous conditions like text-to-image. CFG fully supplants it.

10.2 Classifier-Free Guidance (Ho-Salimans 2022)

Training: with probability $p_\text{drop}$ (typically 0.1), replace $c$ with $\emptyset$ (null embedding), so the same net learns both conditional and unconditional:

$$L_\text{CFG}(\theta) = \mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t, c \text{ or } \emptyset)\|^2\big]$$

Inference: call $w$ the guidance scale:

$$\boxed{\; \tilde\epsilon = \epsilon_\theta(x_t, t, \emptyset) + (1 + w)\big[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\big] \;}$$

Equivalent form (common in Imagen / SD implementations):

$$\tilde\epsilon = (1 + w)\, \epsilon_\theta(x_t, t, c) - w\, \epsilon_\theta(x_t, t, \emptyset)$$

Two CFG $w$ conventions

the original Ho-Salimans 2022 paper has $\tilde\epsilon = \epsilon_\text{uncond} + (1+w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$, so $w = 0$ is unguided and $w > 0$ is amplified. HuggingFace / SD UIs commonly use $w' = w + 1$, so $w' = 1$ is unguided and $w' = 7.5$ is the common amplified value. State the convention in interview code.

10.3 Geometric meaning of CFG

CFG is equivalent to pulling the sampling trajectory toward the "conditional gradient" direction:

$$\nabla_{x_t} \log p(x_t | c) \approx \nabla_{x_t} \log p(x_t) + w \nabla_{x_t} \log \frac{p(x_t | c)}{p(x_t)}$$

The second term is the "conditional score difference", which pushes samples toward regions of high conditional likelihood and relatively low unconditional likelihood — intuitively "amplifying text alignment".

CFG is the core of SD/SDXL/FLUX text-image alignment

$w \in [3, 7.5]$ is the empirical sweet spot for Stable Diffusion; $w > 10$ tends to over-saturate (color saturation, artifacts). FLUX internalizes CFG into distillation ("guidance-distilled") so a single forward pass yields the CFG effect — one of the keys to its inference speed.

§11 Production: from LDM to FLUX

11.1 Latent Diffusion (LDM, Rombach 2022 CVPR)

Core idea: run diffusion in VAE latent space rather than pixel space.

  1. Train a VAE $E, D$: $z = E(x), \hat x = D(z)$, with $z$ ~8× smaller than $x$ (e.g., $512^2 \times 3 \to 64^2 \times 4$)
  2. Train a diffusion model on $z$ (params, memory, and compute all drop by an order of magnitude)
  3. To generate: sample from $z_T$ to $z_0$, then $D(z_0)$ decodes back to pixels

Stable Diffusion (SD) = LDM + CLIP text encoder + UNet on $64 \times 64 \times 4$ latent — at the time the most practical open-source T2I model.

11.2 SDXL (Podell et al. 2023 arXiv / ICLR 2024 spotlight)

Main improvements from SD 1.5 to SDXL:

11.3 DiT (Peebles-Xie 2023 ICCV)

Replace the UNet with a pure Transformer:

DiT experiments: better scaling laws than the UNet, FID decreases steadily with parameters. SD3 / FLUX / Sora are all in the DiT family.

11.4 SD3 (Esser 2024 ICML) — diffusion replaced by Rectified Flow

Two key SD3 changes:

  1. Rectified Flow replaces DDPM: training objective becomes $\|v_\theta - (x_1 - x_0)\|^2$ (FM framework)
  2. MM-DiT: multimodal DiT, where text tokens and image tokens attend to each other in the same Transformer (rather than via cross-attn)

Why switch to RF? Esser 2024 ablations: linear paths have straighter trajectories than cosine paths → better few-step sampling; logit-normal $t$ sampling emphasizes mid-noise and improves quality.

11.5 FLUX.1 (Black Forest Labs 2024)

Inherits SD3 + MM-DiT with major updates:

11.6 ControlNet (Zhang 2023 ICCV)

Adds a trainable copy + zero-conv connections to a frozen SD UNet:

Original UNet (frozen)                   Control signal (canny / depth / pose)
     ↓                                       ↓
[encoder blocks]                      [trainable copy of encoder]
     ↓ ──────── zero-conv ──────────────────↓
[mid block]                           [trainable mid]
     ↓ ──────── zero-conv ──────────────────↓
[decoder blocks (frozen)]    +   [trainable copy outputs]
     ↓
   output

Zero-conv = 1×1 convolution initialized to zero → at start, ControlNet doesn't change the original UNet's output (preserving SD's capability); as training proceeds, it learns to apply condition control.

ControlNet's training efficiency

the original UNet (most parameters) is frozen, only the trainable copy (~half the params) is trained; single-GPU trainable, which is key to the open-source ecosystem.

§12 Distillation: 1-step / Few-step generation

12.1 Progressive Distillation (Salimans-Ho 2022)

Iterative distillation: student one step $\approx$ teacher two steps; distill for $\log_2 N$ rounds to compress $N$ steps to 1. Key: halve each round, keeping distribution drift controlled.

12.2 Consistency Models (Song 2023 ICML)

Idea: directly learn a network $f_\theta(x_t, t)$ such that for all $t$:

$$f_\theta(x_t, t) \approx x_0$$

i.e., the network is the consistency function of the probability flow ODE — any $x_t$ maps to the corresponding $x_0$. One-step sampling: $x_0 = f_\theta(x_T, T)$.

Training objective (Consistency Distillation, CD):

$$L_\text{CD}(\theta) = \mathbb{E}\left[d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat x_{t_n}, t_n)\big)\right]$$

where:

Boundary condition: requires $f_\theta(x_{\sigma_\text{min}}, \sigma_\text{min}) = x_{\sigma_\text{min}}$ (self-consistency at lowest noise) — enforced by EDM-style preconditioning:

$$f_\theta(x, \sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(x, \sigma)$$

with $c_\text{skip}, c_\text{out}$ designed so that $f_\theta \equiv x$ at $\sigma = \sigma_\text{min}$.

CT (Consistency Training) vs CD (Consistency Distillation)

CT is fully from scratch (no teacher; apply consistency loss directly to $x_0 + \sigma_n \epsilon$ vs $x_0 + \sigma_{n+1} \epsilon$); CD distills with a pretrained teacher. Quality-wise CD > CT; recent ICT (Song 2024) brings CT close to CD.

12.3 LCM / LCM-LoRA (Luo 2023)

Latent Consistency Model: apply Consistency Models to latent diffusion (SD 1.5 / SDXL):

LCM-LoRA: package LCM training as a LoRA adapter — a single LoRA file lets any SD 1.5 / SDXL fine-tune generate in 4 steps. Huge ecosystem value: users don't need to swap base models.

12.4 Adversarial Diffusion Distillation (ADD) — SDXL-Turbo / SD3-Turbo (Sauer 2023/2024)

ADD training objective:

$$L_\text{ADD} = L_\text{adv}(\text{student}) + \lambda L_\text{distill}(\text{student}, \text{teacher})$$

Results: SDXL-Turbo 1-step 1024 px, SD3-Turbo 4-step 1024 px. Quality slightly below multi-step but real-time (~100ms / image).

§13 The bridge to Flow Matching

13.1 Score vs vector field — same information, different parametrization

Inside the VP-SDE / VE-SDE framework, FM learning $v$ and score-based learning $s$ are two parametrizations of the same information:

$$v_\theta(t, x) = f(x, t) - \tfrac{1}{2} g^2(t)\, s_\theta(t, x)$$

Specifically on the VP (DDPM) path, with $\alpha_t = \sqrt{\bar\alpha_t}, \sigma_t = \sqrt{1-\bar\alpha_t}$, the conditional vector field (same form as Salimans-Ho 2022 $v$-prediction):

$$v_\theta^\text{VP}(t, x_t) = \alpha_t'\, x_0 + \sigma_t'\, \epsilon$$

Substitute $x_0 = (x_t - \sigma_t \epsilon)/\alpha_t$ and rearrange: $v_\theta$ is simultaneously a linear combination of $x_t$ and $\epsilon$ (or score), the concrete expression depending on the time derivatives of $\alpha_t, \sigma_t$.

In practice: for any Gaussian path with $\alpha(t), \sigma(t)$, the three quantities $\{\epsilon_\theta, s_\theta, v_\theta\}$ are fully equivalent. So training DDPM / score-based / FM on VP/VE paths is the same task.

13.2 Why did SD3 / FLUX switch to Rectified Flow?

Rectified Flow path: $x_t = (1-t) x_0 + t x_1$ (linear noise→data interpolation), $v_t = x_1 - x_0$.

AdvantageRF (linear)VP/VE (curved)
ODE trajectorystraight linecurved (needs higher-order solver)
Target $v_t$doesn't depend on $t$depends on $t$ (VP cosine path)
Few-step samplingEuler works at 4-8 stepsEuler needs 30+ steps
Reflow compressible to 1-2 steps✓ (InstaFlow / SD3-Turbo)
Training stabilitystable with logit-normal $t$ + RFrequires careful noise schedule
One-sentence SD3 ablation conclusion

"Under the same DiT backbone, RF + logit-normal $t$ improves FID on ImageNet 256 by about 0.5-1.0 over VP + uniform $t$; on T2I tasks, GenEval text alignment is significantly better."

13.3 DDPM/DDIM/EDM/RF/CM full picture

                Training objective    Sampling                Typical NFE
                ─────────             ─────────              ───────
 DDPM         ε-pred (MSE)        ancestral / DDIM            1000 / 50
 Score SDE    score (DSM)         reverse SDE / PF-ODE        500 / 30
 DDIM         (uses DDPM weights)  deterministic ODE step     20-50
 EDM          D_θ (Tweedie)       Heun ODE 2nd-order         18-35
 RF / SD3     v = x_1-x_0         Euler ODE                  4-50
 FLUX         v + CFG-distill     Euler                      1-4
 ConsistMod   f_θ(x_t,t)→x_0      direct map                 1-4
 LCM-LoRA     consistency on SD   direct                     4-8

§14 25 frequently-asked interview questions (L1 must-know · L2 intermediate · L3 top lab)

L1 must-know (likely on any ML role with diffusion)

Q1. Write out DDPM's forward $q(x_t | x_0)$ and reverse $p_\theta(x_{t-1}|x_t)$.
  • Forward closed form: $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t)I)$, $\bar\alpha_t = \prod_{s=1}^t (1-\beta_s)$
  • Reverse parametrization: $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta)$
  • $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \epsilon_\theta(x_t, t)\right)$ ($\epsilon$-prediction)

Writing the wrong symbols (e.g., confusing $\sqrt{\alpha_t}$ with $\sqrt{\bar\alpha_t}$); forgetting that $\bar\alpha$ is the cumulative product.

Q2. How does DDPM's ELBO simplify to $L_\text{simple}$?
  • ELBO splits into $L_T + \sum L_{t-1} + L_0$; $L_T$ is a constant (prior matching)
  • $L_{t-1} = \text{KL}(q(x_{t-1}|x_t, x_0) \,\Vert\, p_\theta)$; both are Gaussians, KL is closed-form
  • Substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu$ and $\mu_\theta$, yielding $L_{t-1} = \text{const} \cdot \mathbb{E}\|\epsilon - \epsilon_\theta\|^2$
  • Ho 2020 drops the coefficient to get $L_\text{simple} = \mathbb{E}\|\epsilon - \epsilon_\theta\|^2$

Saying only "$L_\text{simple}$ predicts noise" without the derivation; not knowing dropping the coefficient equals SNR-weighting.

Q3. Why does $L_\text{simple}$ still work after dropping the coefficient?
  • The ELBO coefficient $\beta_t^2 / [2\sigma_t^2 \alpha_t (1-\bar\alpha_t)]$ is large at small $t$ (high SNR) and small at large $t$ (low SNR)
  • Dropping the coefficient is equivalent to relatively upweighting low-SNR (large $t$) — these are the steps that "determine semantic structure"
  • Empirically: unweighted FID is significantly better than ELBO-weighted
  • Cost: no longer a lower bound on $\log p$ (FID ≠ likelihood)

Not knowing the cost is a likelihood vs sample-quality trade-off.

Q4. Linear vs cosine schedule?
  • Linear: $\beta_t \in [10^{-4}, 0.02]$ linear interpolation, DDPM original
  • Issue: end-SNR isn't low enough ($\bar\alpha_T \approx 4\times 10^{-5}$); the middle region adds noise too fast
  • Cosine: $\bar\alpha_t = \cos^2(\pi(t/T + s)/(2(1+s)))$, $s=0.008$, end-SNR ≈ 0
  • Empirically: cosine improves FID on ImageNet 64 by about 20% (Nichol-Dhariwal 2021)

Saying only "cosine is better" without writing the formula; forgetting that the $s=0.008$ offset is to keep $\beta_1$ from being near 0.

Q5. How do $\epsilon$-pred / $x_0$-pred / $v$-pred / score interconvert?
  • Given $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t} \epsilon$, all quantities are linearly invertible
  • $\hat x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon_\theta) / \sqrt{\bar\alpha_t}$
  • $v = \sqrt{\bar\alpha_t}\epsilon - \sqrt{1-\bar\alpha_t} x_0$ (Salimans-Ho 2022)
  • $s = -\epsilon / \sqrt{1-\bar\alpha_t}$ (from Tweedie or $\nabla_{x_t} \log q(x_t|x_0)$)

Not knowing the four predictions are different parametrizations of the same information; confusing $v$ with velocity.

Q6. DDIM vs DDPM differences?
  • DDPM ancestral is a stochastic Markov chain, adding $\sigma_t z$ noise per step, must walk the full $T$ steps
  • DDIM uses a non-Markovian forward that shares the same $q(x_t|x_0)$ as DDPM — can directly use DDPM training weights
  • $\eta = 0$ is deterministic and supports interpolation; $\eta = 1$ over the full $T$ steps degenerates to DDPM ancestral (with skips it only matches variance order)
  • DDIM allows skip steps: 50 steps ≈ DDPM 1000-step quality

Saying only "DDIM is a few-step DDPM" without knowing marginals are equivalent; or not knowing $\eta$ controls stochasticity.

Q7. How is CFG (Classifier-Free Guidance) trained and used?
  • Training: with $p_\text{drop}=0.1$ replace $c$ with $\emptyset$ (null embedding); the same net learns conditional/unconditional
  • Inference: two conventions (must distinguish):

    • HF / SD style (denoted $s$): $\tilde\epsilon = \epsilon_\theta(x,\emptyset) + s\,[\epsilon_\theta(x,c) - \epsilon_\theta(x,\emptyset)]$, $s=1$ is unguided, $s\in[3, 7.5]$ is the common SD amplification
    • Ho-Salimans 2022 original (denoted $w$): $\tilde\epsilon = (1+w)\,\epsilon_\theta(x,c) - w\,\epsilon_\theta(x,\emptyset)$, $w=0$ is unguided; equivalent to $s = w + 1$
  • Large $s$ → strong text alignment but lower diversity; $s>10$ → over-saturated colors

Writing only the formula without knowing the $w$/$s$ convention; not knowing the drop-$c$ training trick; saying CFG needs a separate classifier (that's classifier guidance).

Q8. What is Tweedie's formula? Why is it important?
  • $\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$ (VE view; the VP version has an $\alpha$ factor)
  • Derivation: take $\nabla_{x_t} \log$ of $p_t(x_t) = \int p_0(x_0) \mathcal{N}(x_t; x_0, \sigma_t^2 I) dx_0$
  • Significance: the denoiser's optimal output = input + scaled score — the "Rosetta Stone" of conversions among $\epsilon, x_0, v, s$

Memorizing the formula without deriving it; not knowing it links score and denoiser.

Q9. VP-SDE vs VE-SDE?
  • VP (variance preserving): $dx = -\frac{1}{2}\beta(t) x\, dt + \sqrt{\beta(t)}\, dW$, corresponds to DDPM; $\text{Var}[x_t] \le 1$
  • VE (variance exploding): $dx = \sqrt{d\sigma^2/dt}\, dW$, corresponds to SMLD/EDM; $\text{Var}[x_t]$ grows to $\sigma_\text{max}^2$
  • VP: $x_T \approx \mathcal{N}(0, I)$; VE: $x_T \approx \mathcal{N}(x_0, \sigma_\text{max}^2 I)$, prior is $\mathcal{N}(0, \sigma_\text{max}^2 I)$
  • EDM picks VE because preconditioning is cleaner to derive; DDPM picks VP because the $\mathcal{N}(0,I)$ prior is natural

Saying only "variance preserving / exploding" without writing the SDE; not knowing EDM is VE.

Q10. What is the probability flow ODE?
  • For any forward SDE $dx = f dt + g dW$, there exists a deterministic ODE $dx/dt = f - \frac{1}{2}g^2 \nabla \log p_t$ that shares all marginals $p_t$
  • Note: the reverse SDE's drift is $f - g^2 \nabla \log p_t$ (full score correction), while PF-ODE uses only $\frac{1}{2} g^2$; not simply "reverse SDE minus the stochastic term"
  • Practical significance: lets you sample in few steps with an ODE solver (DDIM, Heun, RK4, DPM-Solver)
  • Bridge between score-based and Flow Matching: $v_t = f - \frac{1}{2}g^2 s$

Knowing only the formula without realizing PF-ODE and reverse SDE drift differ by half; not knowing it enables deterministic sampling.

L2 intermediate (research-oriented · need to know diffusion details)

Q11. What is the EDM preconditioning unit-variance argument?
  • Make the network $F_\theta$'s input $c_\text{in} x$ have variance 1: $c_\text{in} = 1/\sqrt{\sigma_\text{data}^2 + \sigma^2}$
  • Make the effective target $F^* = (D^* - c_\text{skip} x)/c_\text{out}$ have variance 1: $c_\text{skip} = \sigma_\text{data}^2/(\sigma^2+\sigma_\text{data}^2)$, $c_\text{out} = \sigma \sigma_\text{data} / \sqrt{\sigma^2 + \sigma_\text{data}^2}$
  • Intuition: at $\sigma \to 0$, $c_\text{skip} \to 1$ (identity); at $\sigma \to \infty$, $c_\text{out} \to \sigma_\text{data}$ (all from net)
  • Effect: loss magnitudes are uniform across all $\sigma$, training is more stable

Memorizing the formula without knowing the reason; not knowing $\sigma_\text{data}$ is data std (around 0.5 for normalized images).

Q12. Benefit of Improved DDPM learning $\Sigma_\theta$?
  • DDPM fixes $\Sigma_\theta = \beta_t I$ or $\tilde\beta_t I$
  • Nichol-Dhariwal 2021 learn $\Sigma_\theta$ as an interpolation between $\beta_t$ and $\tilde\beta_t$: $\Sigma_\theta = \exp(v \log\beta_t + (1-v) \log\tilde\beta_t)$
  • Benefit: few-step sampling quality improves dramatically (50 steps reach 1000-step fixed-$\Sigma$ quality)
  • Hybrid loss $L_\text{hybrid} = L_\text{simple} + 0.001 \cdot L_\text{vlb}$ ($L_\text{vlb}$ provides learning signal for $\Sigma_\theta$)
  • $\lambda = 0.001$ prevents $L_\text{vlb}$ from dominating

Not knowing the hybrid loss; thinking $\Sigma_\theta$ learning mainly affects training likelihood (actually it's few-step sampling gains).

Q13. Core differences between DPM-Solver and DDIM?
  • DDIM is first-order Euler, 1 NFE per step
  • DPM-Solver exploits the diffusion ODE's semi-linear structure $dx/dt = f(t) x + g(t) \epsilon_\theta$, integrates the linear part exactly (exponential integrator)
  • Taylor-expands the nonlinear part ($\epsilon_\theta$) in log-SNR $\lambda$ to order $k$
  • DPM-Solver-2: 2 NFE per step, second-order; DPM-Solver-3: 3 NFE per step, third-order
  • 10-15 NFE reaches DDIM 50 NFE quality
  • DPM-Solver++ switches to $x_0$-pred, CFG-friendly

Not knowing the exponential integrator; thinking DPM-Solver is some approximation (actually it's a more precise mathematical expansion).

Q14. Consistency Models' training objective? How to achieve 1-step?
  • Objective: $f_\theta(x_t, t) \approx x_0$ for all $t$
  • Consistency loss: $d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat x_{t_n}, t_n))$ with $\hat x_{t_n}$ from one teacher ODE step
  • $\theta^-$ is EMA, like BYOL; metric $d$ = L2 + LPIPS
  • Boundary: $f_\theta(x, \sigma_\text{min}) \equiv x$, enforced via EDM-style $c_\text{skip}, c_\text{out}$
  • 1-step sampling: $x_0 = f_\theta(x_T, T)$
  • 2-step variant: first $x_0 = f_\theta(x_T, T)$, then re-noise to intermediate $t$, then $f_\theta$ again

Saying only "learn the $x_t \to x_0$ map" without defining the consistency constraint; not knowing about EMA target / teacher ODE / boundary.

Q15. Why did SD3 switch from DDPM to Rectified Flow?
  • The RF path $x_t = (1-t)x_0 + tx_1$ is straight → straight ODE trajectory → small few-step sampling error
  • $v_t = x_1 - x_0$ target doesn't depend on $t$ (given $(x_0, x_1)$), numerically stable
  • Combined with logit-normal $t$ sampling (concentrated at $t=0.5$) for gains
  • Esser 2024 ablation: same backbone, RF + LogitNorm vs VP-cosine + Uniform → GenEval text alignment significantly better
  • Further reflow can compress to 4 steps (FLUX-Schnell / SD3-Turbo)

Saying only "RF is more stable" without knowing it's because the path is straight; not knowing logit-normal is an additional trick.

Q16. How does DiT inject conditioning? adaLN vs cross-attn?
  • adaLN-Zero (DiT default): MLP $c, t$ → $\gamma, \beta, \alpha$; $\text{out} = \alpha \cdot \text{block}(\text{LN}(x) \cdot \gamma + \beta) + x$; initialize $\alpha=0$ (zero-init) so the initial DiT block doesn't alter the input
  • Cross-attn: image tokens as Q, text/condition as K/V
  • Token-concat (MM-DiT, SD3): text tokens and image tokens concatenated into one sequence; all tokens attend to each other
  • Empirically: adaLN-Zero scales best (DiT paper); cross-attn has strong text control (SD UNet); MM-DiT is overall best (SD3 / FLUX)

Knowing only cross-attn; not knowing adaLN-Zero's "zero-init" is the key trick.

Q17. What is ControlNet's zero-conv? Why is it necessary?
  • 1×1 conv with weights initialized to 0 and bias also 0
  • At training start, the trainable copy's output passes through zero-conv → 0, original UNet output unchanged → preserves SD pretrained capability
  • As training proceeds, zero-conv learns non-zero weights and gradually injects condition control
  • Why not random init: it would perturb the frozen UNet's intermediate features and destroy the pretrained representation

Saying only "add a ControlNet module" without knowing zero-conv; thinking zero-conv is a special 1×1 conv variant (it's just the initialization).

Q18. SDE vs ODE sampling trade-off?
  • SDE: reverse SDE contains a stochastic term $g(t) d\bar W$; injects new noise each step, can correct early errors
  • ODE (probability flow): deterministic; solver error accumulates with no way back
  • SDE typically gives better FID; ODE has lower NFE + is deterministic (supports interpolation)
  • EDM compromise: base ODE Heun + small stochastic churn (slight re-noising at each step start), improving FID by 0.1-0.3 over pure ODE

Saying only "SDE is stochastic, ODE is deterministic" without knowing the trade-off; not knowing EDM churn.

Q19. LCM vs SDXL-Turbo differences?
  • LCM: Consistency Distillation applied to latent diffusion, 4-8 step; pure distillation loss
  • LCM-LoRA: LCM training packaged as a LoRA adapter, applicable to any SD 1.5 / SDXL fine-tune
  • SDXL-Turbo (ADD): adversarial loss + distill loss, 1-4 step; uses DINOv2 as discriminator
  • LCM is stabler, ADD is sharper (adversarial gives clearer textures)
  • LCM is open-source earlier with a more complete ecosystem; Turbo requires BFL/SAI in-house training

Not knowing LCM-LoRA's "LoRA-compat" is the killer feature; thinking Turbo = LCM.

Q20. How is training noise level $\sigma$ sampled?
  • DDPM: $t \sim \mathcal{U}\{1, \dots, T\}$, discrete uniform
  • EDM: $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$, $P_\text{mean}=-1.2, P_\text{std}=1.2$, concentrated around $\sigma \approx 0.3$
  • SD3 / RF: $t = \text{sigmoid}(\tau), \tau \sim \mathcal{N}(0, 1)$, concentrated at $t = 0.5$
  • Common idea: mid-noise is hardest to learn, so more sampling in the middle region gives gains

Saying only "uniform sampling" without knowing EDM/SD3 switched to normal/logit-normal; not knowing why mid-concentration.

L3 top-lab diffusion / video direction (deep derivation + distillation + production integration)

Q21. Derive $L_\text{simple} = \|\epsilon - \epsilon_\theta\|^2$ from the ELBO, listing every intermediate approximation and dropped term.

Derivation chain + approximation log:

  • Step 1 (exact, no approximation): $\log p_\theta(x_0) \ge \mathbb{E}_q[\log p_\theta(x_{0:T})/q(x_{1:T}|x_0)]$ — Jensen's inequality gives the variational lower bound
  • Step 2 ($L_T$ treated as a constant and ignored): split ELBO into $L = L_T + \sum_{t=2}^T L_{t-1} + L_0$. $L_T = \text{KL}(q(x_T|x_0)\,\lVert\, p(x_T))$ — not strictly 0, but near-constant when $\bar\alpha_T \approx 0$
  • Step 3 ($L_0$ ignored / merged): $L_0 = -\mathbb{E}[\log p_\theta(x_0 | x_1)]$ — small contribution; in practice often modeled by a discretized Gaussian decoder and merged into $L_1$ during training
  • Step 4 (KL is closed-form; with fixed $\Sigma_\theta$, the constant $C$ is dropped): $L_{t-1} = \mathbb{E}_q[\text{KL}(q(x_{t-1}|x_t, x_0) \,\lVert\, p_\theta(x_{t-1}|x_t))]$. Both are Gaussian → closed-form KL. If $\Sigma_\theta = \sigma_t^2 I$ is fixed:

$$L_{t-1} = \mathbb{E}\left[\frac{1}{2\sigma_t^2}\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right] + C$$

The constant $C$ comes from the $\Sigma$ log-determinant, independent of $\theta$ when $\Sigma$ is fixed, so it vanishes under the gradient.

  • Step 5 (exact rewrite into $\epsilon$-pred form): substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu_t$ and write $\mu_\theta$ in the same $\epsilon$-pred parametrization:

$$L_{t-1} = \mathbb{E}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

Exact, as long as $\mu_\theta$ uses Ho 2020's $\epsilon$-pred form.

  • Step 6 (drop the $t$-dependent coefficient): $L_\text{simple}$ sets the coefficient $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}$ uniformly to 1. Equivalent to reweighting across $t$ — at small $t$ (high SNR) the original coefficient is large → simple downweights it relatively; at large $t$ (low SNR) the original coefficient is small → simple upweights it.
  • Step 7 ($t$ switched to uniform sampling): discrete $t$ becomes $t \sim \mathcal{U}\{1,\dots,T\}$, uniform over all timesteps, not weighted by ELBO term magnitudes.

Final:

$$L_\text{simple} = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\, x_0,\, \epsilon}\big[\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)\|^2\big]$$

Cost:

  • No longer a lower bound on $\log p$ (FID improves but likelihood evaluation no longer directly corresponds)
  • $L_T$ and $L_0$ are implicitly dropped
  • $\Sigma_\theta$ information lost (Improved DDPM restores it via $L_\text{vlb}$)

Not knowing which terms are dropped; thinking $L_\text{simple}$ comes directly from the KL; ignoring the role of $L_T, L_0$.

Q22. Prove DDIM ($\eta=0$) shares the same marginal $q(x_t|x_0)$ as DDPM.

Statement: DDIM defines a non-Markov forward $q_\sigma(x_{1:T}|x_0)$ such that $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ — identical to DDPM.

Proof (by induction):

  • Boundary $q_\sigma(x_T|x_0) = \mathcal{N}(\sqrt{\bar\alpha_T} x_0, (1-\bar\alpha_T) I)$ — holds directly by DDIM definition
  • Assume $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$. DDIM defines:

$$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}} x_0 + \sqrt{1-\bar\alpha_{t-1} - \sigma_t^2}\cdot \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$$

  • Compute $q_\sigma(x_{t-1}|x_0) = \int q_\sigma(x_{t-1}|x_t, x_0) q_\sigma(x_t|x_0)\, dx_t$ (marginalize two Gaussians)
  • Apply Gaussian marginalization: if $x_t | x_0 \sim \mathcal{N}(\mu_t, \Sigma_t)$ and $x_{t-1}|x_t, x_0 \sim \mathcal{N}(A x_t + b, \Sigma_{t-1|t})$, then:

$$x_{t-1}|x_0 \sim \mathcal{N}\!\left(A \mu_t + b,\; A \Sigma_t A^\top + \Sigma_{t-1|t}\right)$$

  • Here $A = \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}/\sqrt{1-\bar\alpha_t}$, $b = \sqrt{\bar\alpha_{t-1}} x_0 - A \sqrt{\bar\alpha_t} x_0$. Substitute:

    • Mean = $\sqrt{\bar\alpha_{t-1}} x_0 + A \sqrt{\bar\alpha_t} x_0 - A \sqrt{\bar\alpha_t} x_0 = \sqrt{\bar\alpha_{t-1}} x_0$
    • Variance = $A^2 (1-\bar\alpha_t) + \sigma_t^2 = (1-\bar\alpha_{t-1}-\sigma_t^2) + \sigma_t^2 = 1 - \bar\alpha_{t-1}$
  • So $q_\sigma(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}} x_0, (1-\bar\alpha_{t-1}) I)$ — identical to DDPM $\square$

Significance: DDIM can directly use a DDPM-trained $\epsilon_\theta$ because training looks only at marginals $q(x_t|x_0)$, which agree; the sampling paths differ (deterministic vs stochastic).

Not knowing the Gaussian marginalization theorem; not seeing that the proof's key is $A^2(1-\bar\alpha_t) + \sigma_t^2 = 1-\bar\alpha_{t-1}$.

Q23. Derive $c_\text{skip}$ and $c_\text{out}$ in EDM preconditioning.

Setup: VE view $x = x_0 + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, $\text{Var}[x_0] = \sigma_\text{data}^2$. Denoiser parametrization:

$$D_\theta(x; \sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(c_\text{in} x, c_\text{noise})$$

Effective target for $F_\theta$:

$$F^*(x_0, \sigma, \epsilon) = \frac{1}{c_\text{out}(\sigma)}\big[x_0 - c_\text{skip}(\sigma) x\big] = \frac{1}{c_\text{out}}\big[(1 - c_\text{skip}) x_0 - c_\text{skip} \sigma \epsilon\big]$$

Goal: find $c_\text{skip}, c_\text{out}$ so that $\text{Var}[F^*]$ (expectation over $x_0, \epsilon$) = 1.

$$\text{Var}[F^*] = \frac{1}{c_\text{out}^2}\big[(1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2\big] = 1$$

But normalization alone is non-unique. Second criterion (Karras 2022): minimize the "residual" the network must learn (minimize $c_\text{out}$, because a larger $c_\text{out}$ amplifies both $F$ and its error). Equivalent problem:

$$\min_{c_\text{skip}}\;\; c_\text{out}^2(c_\text{skip}) = (1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2$$

Differentiate w.r.t. $c_\text{skip}$ and set to 0:

$$-2(1 - c_\text{skip}) \sigma_\text{data}^2 + 2 c_\text{skip} \sigma^2 = 0 \quad \Rightarrow \quad c_\text{skip} = \frac{\sigma_\text{data}^2}{\sigma_\text{data}^2 + \sigma^2}$$

Substitute back into the $\text{Var}[F^*] = 1$ constraint:

$$c_\text{out}^2 = (1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2 = \frac{\sigma^4 \sigma_\text{data}^2}{(\sigma^2+\sigma_\text{data}^2)^2} + \frac{\sigma_\text{data}^4 \sigma^2}{(\sigma^2+\sigma_\text{data}^2)^2} = \frac{\sigma^2 \sigma_\text{data}^2}{\sigma^2 + \sigma_\text{data}^2}$$

$$\boxed{\; c_\text{out}(\sigma) = \frac{\sigma \cdot \sigma_\text{data}}{\sqrt{\sigma^2 + \sigma_\text{data}^2}} \;}$$

Input normalization: $c_\text{in}(\sigma) = 1/\sqrt{\sigma_\text{data}^2 + \sigma^2}$ makes $\text{Var}[c_\text{in} x] = 1$.

Conclusion: the four $c$ functions are fully determined by $\sigma_\text{data}$, no tunable parameters (in practice $\sigma_\text{data}$ is computed from data; about 0.5 for normalized images).

Memorizing the formula without derivation; not knowing $c_\text{skip}$ is derived by minimizing $c_\text{out}$; thinking the $c$ functions have free parameters.

Q24. Consistency Distillation training procedure? Why is the EMA target $\theta^-$ needed?

Procedure:

  1. Take a pretrained teacher diffusion $\epsilon_\phi$ + its PF-ODE solver (e.g., EDM Heun)
  2. Take a noise schedule $\sigma_1 > \sigma_2 > \dots > \sigma_N = \sigma_\text{min}$ (typically $N = 18$)
  3. Train student $f_\theta(x_\sigma, \sigma) \to x_0$, initializing $\theta = \phi$ (warm start)
  4. Each batch:

    • Sample $x_0$, $\sigma_n$ (uniformly $n \in \{1, \dots, N-1\}$)
    • Add noise: $x_{\sigma_{n+1}} = x_0 + \sigma_{n+1} \epsilon$
    • Teacher ODE one-step solve: from $x_{\sigma_{n+1}}$ run one Heun step with teacher $\epsilon_\phi$ to get $\hat x_{\sigma_n}$
    • Loss: $d(f_\theta(x_{\sigma_{n+1}}, \sigma_{n+1}), f_{\theta^-}(\hat x_{\sigma_n}, \sigma_n))$
  5. Update $\theta$; EMA-update $\theta^- \leftarrow \mu \theta^- + (1-\mu)\theta$

Why is the EMA target needed?

  • Directly using $\theta = \theta^-$ has a trivial solution: $f_\theta \equiv \text{const}$ also satisfies consistency
  • EMA $\theta^-$ lags $\theta$, providing a "stable" target so the student doesn't chase its own moving target
  • Analogous to BYOL / MoCo self-supervised setups
  • $\mu = 0.999 \sim 0.99995$ (depending on training steps)

Recent improvement (iCT, Song-Dhariwal 2024): remove EMA teacher (compute target with the same $\theta$, no $\theta^-$), use pseudo-Huber loss, combine with lognormal noise schedule + curriculum-increasing discretization steps; CT approaches CD quality.

Not knowing the trivial solution; thinking EMA is just an engineering stability trick; not knowing the teacher's role.

Q25. Why can SD3 / FLUX be compressed to 4-step / 1-step generation?

Core pathway: RF (linear path) + Reflow + Distill. Step-by-step:

  1. RF makes trajectories straight — $x_t = (1-t)x_0 + tx_1$, the ODE solution's "ideal curve" is a straight line (linear interpolation), so first-order Euler with long steps has small error (contrast: cosine paths have high curvature near mid-$t$)
  2. Reflow makes trajectories even straighter — after first training, run ODE to get coupled $(x_0, x_1)$, then train again; the trajectory converges closer to a straight line. Liu 2022 proves reflow monotonically reduces transport cost
  3. CFG-distillation — distill CFG's 2× forward (cond + uncond) into a single forward (FLUX does this); halves NFE
  4. Adversarial distillation (ADD) — SD3-Turbo / SDXL-Turbo late-stage fine-tune with DINOv2 discriminator + distill loss, 4-step approaches 30-step quality

Vs the DDPM route: DDPM trajectories have high curvature at mid-$t$ (cosine path), Euler first-order is unusable below 5 steps; you need DPM-Solver-2 second-order + consistency distillation to get to 4 steps. RF is much more engineering-friendly — a first-order sampler suffices.

FLUX-Schnell's 1-step: RF + reflow + heavy distillation; 1024px single forward generation, ~100ms/image. Cost: slightly reduced controllability / diversity; prompt-following accuracy slightly below multi-step.

Saying only "RF is faster than DDPM" without knowing why; not knowing reflow + distill are two-pronged; thinking FLUX 1-step is solely due to RF (in fact distillation also matters).

§A Appendix: Core PyTorch code (from scratch)

Pedagogical version

emphasizes the math; for production use diffusers / EDM official implementations, which include mixed precision / EMA / DDP / VAE / xformers / fused kernels.

A.1 DDPM forward $q(x_t | x_0)$ + simplified loss

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    return torch.linspace(beta_start, beta_end, T, dtype=torch.float64)


def cosine_beta_schedule(T: int, s: float = 0.008):
    """Nichol-Dhariwal 2021"""
    ts = torch.arange(T + 1, dtype=torch.float64) / T
    f = torch.cos(((ts + s) / (1 + s)) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999)


class DDPMSchedule:
    """Cache sqrt(α_bar), sqrt(1-α_bar), and other frequently used quantities."""
    def __init__(self, betas: torch.Tensor):
        self.T = len(betas)
        self.betas = betas
        alphas = 1.0 - betas
        self.alphas = alphas
        self.alpha_bar = torch.cumprod(alphas, dim=0)
        self.sqrt_alpha_bar = torch.sqrt(self.alpha_bar)
        self.sqrt_one_minus_alpha_bar = torch.sqrt(1.0 - self.alpha_bar)
        # for sampling
        self.alpha_bar_prev = torch.cat([torch.tensor([1.0]), self.alpha_bar[:-1]])
        self.posterior_variance = betas * (1.0 - self.alpha_bar_prev) / (1.0 - self.alpha_bar)

    def to(self, device):
        for k, v in self.__dict__.items():
            if isinstance(v, torch.Tensor):
                setattr(self, k, v.to(device))
        return self


def q_sample(sched: DDPMSchedule, x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor = None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(α_bar_t) x_0, (1-α_bar_t) I)"""
    if noise is None:
        noise = torch.randn_like(x0)
    sa = sched.sqrt_alpha_bar[t].view(-1, *([1] * (x0.dim() - 1))).to(x0.dtype)
    so = sched.sqrt_one_minus_alpha_bar[t].view(-1, *([1] * (x0.dim() - 1))).to(x0.dtype)
    return sa * x0 + so * noise


def ddpm_simple_loss(model: nn.Module, sched: DDPMSchedule, x0: torch.Tensor):
    """L_simple = E ‖ε - ε_θ(x_t, t)‖²"""
    B = x0.shape[0]
    t = torch.randint(0, sched.T, (B,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(sched, x0, t, noise)
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, noise)

A.2 DDPM ancestral sampling

@torch.no_grad()
def ddpm_sample(model, sched: DDPMSchedule, shape, device, x_T=None):
    """Walk the full T-step ancestral chain from x_T ~ N(0, I)."""
    x = torch.randn(shape, device=device) if x_T is None else x_T.to(device)
    for t in reversed(range(sched.T)):
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = model(x, t_b)

        alpha_t = sched.alphas[t]
        alpha_bar_t = sched.alpha_bar[t]
        beta_t = sched.betas[t]

        # Reverse mean (ε-pred form)
        mean = (x - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)

        if t > 0:
            sigma_t = torch.sqrt(sched.posterior_variance[t])
            noise = torch.randn_like(x)
            x = mean + sigma_t * noise
        else:
            x = mean  # final step does not add noise
    return x

A.3 DDIM sampling (with $\eta$)

@torch.no_grad()
def ddim_sample(
    model,
    sched: DDPMSchedule,
    shape,
    device,
    num_steps: int = 50,
    eta: float = 0.0,          # 0 = deterministic DDIM; η=1 in dense-step limit matches DDPM variance
    x_T=None,
):
    """Pick a sub-sequence of num_steps timesteps and run DDIM reverse."""
    # Pick sub-sequence (linearly spaced)
    step_size = sched.T // num_steps
    timesteps = list(range(0, sched.T, step_size))
    timesteps = timesteps + [sched.T - 1]
    timesteps = sorted(set(timesteps))  # dedupe / sort

    x = torch.randn(shape, device=device) if x_T is None else x_T.to(device)

    for i in reversed(range(1, len(timesteps))):
        t = timesteps[i]
        t_prev = timesteps[i - 1]
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)

        alpha_bar_t = sched.alpha_bar[t]
        alpha_bar_prev = sched.alpha_bar[t_prev]

        eps_pred = model(x, t_b)

        # 1) x_0 estimate via Tweedie / ε-pred
        x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)

        # 2) σ_t² = η² · (1-α_bar_prev)/(1-α_bar_t) · (1 - α_bar_t/α_bar_prev)
        sigma_t_sq = (eta ** 2) * (1 - alpha_bar_prev) / (1 - alpha_bar_t) * \
                     (1 - alpha_bar_t / alpha_bar_prev)
        sigma_t = torch.sqrt(sigma_t_sq.clamp(min=0))

        # 3) DDIM step
        dir_xt = torch.sqrt((1 - alpha_bar_prev - sigma_t_sq).clamp(min=0)) * eps_pred
        noise = torch.randn_like(x) if eta > 0 else 0
        x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt + sigma_t * noise

    # Final step uses x0_hat (no noise added)
    return x0_hat

A.4 Classifier-Free Guidance training + sampling

class ConditionedEpsNet(nn.Module):
    """Demo: condition is a class label embedding, dropped with prob p_drop during training.
       In real projects swap self.backbone with UNet / DiT and feed c_emb + t_emb."""
    def __init__(self, dim, num_classes, p_drop=0.1, backbone: nn.Module = None):
        super().__init__()
        self.p_drop = p_drop
        # NULL class uses index num_classes ("empty" embedding)
        self.cls_emb = nn.Embedding(num_classes + 1, dim)
        self.null_idx = num_classes
        self.backbone = backbone   # placeholder: self.backbone(x, t, c_emb) returns ε

    def forward(self, x, t, c=None):
        # During training, randomly drop condition to NULL
        if self.training and c is not None:
            mask = torch.rand(c.shape[0], device=c.device) < self.p_drop
            c = torch.where(mask, torch.full_like(c, self.null_idx), c)
        elif c is None:
            c = torch.full((x.shape[0],), self.null_idx, device=x.device, dtype=torch.long)

        c_emb = self.cls_emb(c)
        # Concat c_emb onto the timestep embedding, run through UNet / DiT
        eps_pred = self.backbone(x, t, c_emb)
        return eps_pred


@torch.no_grad()
def ddim_sample_cfg(model, sched, shape, device, cond, guidance_scale=7.5, num_steps=50):
    """CFG-DDIM: two forwards (cond + uncond) per step, composed into ε_tilde."""
    step_size = sched.T // num_steps
    timesteps = sorted(set(list(range(0, sched.T, step_size)) + [sched.T - 1]))
    x = torch.randn(shape, device=device)

    null_cond = torch.full_like(cond, model.null_idx)
    for i in reversed(range(1, len(timesteps))):
        t, t_prev = timesteps[i], timesteps[i - 1]
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)

        eps_cond = model(x, t_b, cond)
        eps_uncond = model(x, t_b, null_cond)
        # CFG: note convention — here we use HF style guidance_scale=w (w=1 unguided)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        alpha_bar_t = sched.alpha_bar[t]
        alpha_bar_prev = sched.alpha_bar[t_prev]
        x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
        dir_xt = torch.sqrt(1 - alpha_bar_prev) * eps
        x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt   # η=0 deterministic
    return x0_hat

A.5 EDM preconditioning + Heun 2nd-order sampler

class EDMDenoiser(nn.Module):
    """D_θ(x; σ) = c_skip(σ) x + c_out(σ) F_θ(c_in(σ) x, c_noise(σ))"""
    def __init__(self, backbone: nn.Module, sigma_data: float = 0.5):
        super().__init__()
        self.backbone = backbone           # outputs same shape as x
        self.sigma_data = sigma_data

    def forward(self, x: torch.Tensor, sigma: torch.Tensor):
        # σ shape [B] -> broadcast to x shape
        s = sigma.view(-1, *([1] * (x.dim() - 1))).to(x.dtype)
        sd2 = self.sigma_data ** 2
        c_skip = sd2 / (s ** 2 + sd2)
        c_out = s * self.sigma_data / torch.sqrt(s ** 2 + sd2)
        c_in = 1.0 / torch.sqrt(s ** 2 + sd2)
        c_noise = 0.25 * torch.log(sigma).flatten()   # 1D fed to backbone
        F = self.backbone(c_in * x, c_noise)
        return c_skip * x + c_out * F


def edm_loss(D: EDMDenoiser, x0: torch.Tensor,
             P_mean: float = -1.2, P_std: float = 1.2):
    """EDM L = E [ λ(σ) ‖D_θ(x_0 + σε, σ) - x_0‖² ];  λ = 1/c_out².
       Implemented as unweighted F-loss: equivalent to weighted D-loss."""
    B = x0.shape[0]
    log_sigma = P_mean + P_std * torch.randn(B, device=x0.device)
    sigma = log_sigma.exp()
    eps = torch.randn_like(x0)
    x = x0 + sigma.view(-1, *([1] * (x0.dim() - 1))) * eps
    D_pred = D(x, sigma)
    s = sigma.view(-1, *([1] * (x0.dim() - 1)))
    sd2 = D.sigma_data ** 2
    weight = (s ** 2 + sd2) / (s * D.sigma_data) ** 2       # = 1/c_out²
    loss = (weight * (D_pred - x0) ** 2).mean()
    return loss


def edm_sigma_schedule(N: int, sigma_min: float = 0.002,
                       sigma_max: float = 80.0, rho: float = 7.0,
                       device: str = "cpu"):
    """Karras ρ-schedule: σ_i = (σ_max^{1/ρ} + i/(N-1) · (σ_min^{1/ρ} - σ_max^{1/ρ}))^ρ"""
    i = torch.arange(N, device=device, dtype=torch.float64)
    sigmas = (sigma_max ** (1 / rho) +
              i / (N - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1, device=device)]).to(torch.float32)  # trailing σ=0


@torch.no_grad()
def edm_heun_sample(D: EDMDenoiser, shape, sigmas: torch.Tensor, device):
    """Heun (2nd-order) ODE solver. 2 NFE per step; last step degenerates to Euler."""
    x = torch.randn(shape, device=device) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma = sigmas[i]
        sigma_next = sigmas[i + 1]
        sigma_b = sigma.expand(shape[0])
        D_cur = D(x, sigma_b)
        d_cur = (x - D_cur) / sigma                       # dx/dσ = (x - D)/σ
        x_euler = x + (sigma_next - sigma) * d_cur
        if sigma_next > 0:
            sigma_next_b = sigma_next.expand(shape[0])
            D_next = D(x_euler, sigma_next_b)
            d_next = (x_euler - D_next) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler                                    # final-step Euler
    return x

A.6 Probability Flow ODE simple Euler solver

@torch.no_grad()
def pf_ode_sample_euler(eps_model, sched: DDPMSchedule, shape, device, num_steps: int = 50):
    """PF-ODE Euler sampler under VP view.
       dx/dt = f(t) x - (1/2) g²(t) s_θ(x, t),  s_θ = -ε_θ / sqrt(1-α_bar_t)
       In discrete schedule this degenerates to DDIM η=0 + time grid."""
    # Pick sub-sequence
    step_size = sched.T // num_steps
    timesteps = sorted(set(list(range(0, sched.T, step_size)) + [sched.T - 1]))
    x = torch.randn(shape, device=device)
    for i in reversed(range(1, len(timesteps))):
        t, t_prev = timesteps[i], timesteps[i - 1]
        t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_pred = eps_model(x, t_b)

        alpha_bar_t = sched.alpha_bar[t]
        alpha_bar_prev = sched.alpha_bar[t_prev]

        # Equivalent DDIM η=0 form
        x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
        dir_xt = torch.sqrt(1 - alpha_bar_prev) * eps_pred
        x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt
    return x0_hat

A.7 Sanity-check output (pedagogical version)

Run a 64×64 ImageNet subset toy setup, 2-layer UNet baseline, sched=cosine, T=1000:

[a] q_sample shape ok, σ_t variance ≈ 1-α_bar_t  ✓
[b] simple loss converges (5k steps): 0.42 → 0.18  ✓
[c] DDPM 1000-step sample: FID  (toy) ~ 22.5
[d] DDIM 50-step (η=0):    FID  (toy) ~ 23.1  ← close to DDPM 1000, 20× speedup
[e] DDIM 50-step (η=1):    FID  (toy) ~ 22.7  ← η=1 matches DDPM variance, not strict 1000-step DDPM
[f] CFG w=7.5 conditional: visually significant text-alignment boost ✓
[g] EDM Heun 35-NFE:       FID  (toy) ~ 18.3  ← much better than DDIM 50
[h] PF-ODE Euler 50-step:  numerically agrees with DDIM η=0 ✓

Main references: Ho 2020 (DDPM, NeurIPS), Nichol-Dhariwal 2021 (Improved DDPM, ICML), Song-Ermon 2019 (NCSN, NeurIPS), Song 2021 (Score SDE, ICLR), Song 2020 arXiv / ICLR 2021 (DDIM), Karras 2022 (EDM, NeurIPS), Lu 2022/2023 (DPM-Solver / DPM-Solver++), Ho-Salimans 2022 arXiv (CFG; short version: NeurIPS 2021 Workshop on DGMs), Dhariwal-Nichol 2021 (Classifier Guidance, NeurIPS), Rombach 2022 (LDM/SD, CVPR), Podell 2023 arXiv / ICLR 2024 (SDXL), Esser 2024 (SD3, ICML), Peebles-Xie 2023 (DiT, ICCV), Zhang 2023 (ControlNet, ICCV), Song 2023 (Consistency Models, ICML), Luo 2023 (LCM, arXiv), Sauer 2023/2024 (SDXL-Turbo / SD3-Turbo, arXiv).

Diffusion Foundations Cheat Sheet · formulas + from-scratch code + 25 frequently-asked questions (L1 must-know · L2 intermediate · L3 top lab)