Diffusion Foundations Tutorial En
§0 TL;DR
one-page interview essentials (full derivations in §1-§13).
- DDPM (Ho 2020): forward $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ admits closed-form sampling; reverse $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \Sigma_\theta)$ learns the reverse Gaussian; the ELBO simplifies to $L_\text{simple} = \mathbb{E}\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ ($\epsilon$-prediction).
- Three equivalent views: DDPM's $\epsilon$, score-based $s = \nabla \log p_t$, and flow matching's $v$ are linearly invertible under the Gaussian path — $s_\theta = -\epsilon_\theta / \sigma_t$, $v = \alpha'x_0 + \sigma'\epsilon$.
- Tweedie's formula: $\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$ — a one-line bridge between denoiser and score.
- Score SDE (Song 2021): a unified VP-SDE / VE-SDE framework; the reverse-time SDE and the probability flow ODE share the same family of marginals, and the ODE form directly yields FM's vector field.
- DDIM (Song 2020 / ICLR 2021): a non-Markovian forward leads to a deterministic sampler with the same marginals as DDPM but a controllable sampling path ($\eta=0$ deterministic; $\eta=1$ over the full $T$ steps degenerates to DDPM ancestral, but with skipped steps it only matches DDPM variance, not strict equivalence).
- EDM (Karras 2022): preconditioning makes the network output have unit variance: $D_\theta(x;\sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(c_\text{in}(\sigma) x, c_\text{noise}(\sigma))$; combined with a $\sigma$-schedule + Heun 2nd-order sampler, SOTA FID with NFE down to 18-35.
- CFG (Ho-Salimans 2022): during training drop the condition with probability $p_\text{drop}$ → the same net learns conditional/unconditional; at inference $\tilde\epsilon = (1+w)\epsilon_\theta(x,c) - w\epsilon_\theta(x,\emptyset)$, with $w \in [3, 7]$ being the workhorse range for text-to-image.
- Production: SD/SDXL use VAE latent + UNet; SD3 / FLUX.1 switch to Rectified Flow + MM-DiT; ControlNet adds a trainable side branch to a frozen UNet; DiT replaces the UNet entirely with a Transformer.
- Acceleration: DPM-Solver++ compresses NFE to 10-20; Consistency Models learn $f_\theta(x_t, t) \mapsto x_0$ for 1-4 step sampling; LCM / LCM-LoRA / SDXL-Turbo / SD3-Turbo (ADD) bring distillation to the entire Stable Diffusion family.
§1 Intuition & three views
1.1 One-sentence intuition
Diffusion = learning to "denoise": progressively noise data from clean to pure Gaussian (forward), then learn to reverse it step-by-step from noise back to data (reverse). All diffusion papers differ on just three things:
- how forward adds noise (schedule, SDE type VP/VE)
- what the network predicts ($\epsilon$ / $x_0$ / $v$ / score / $D$)
- how reverse samples (Markov ancestral / DDIM / DPM-Solver / EDM Heun / Consistency one-step)
1.2 Comparison of the three views
Unified framework (Song et al. 2021)
Discrete view (DDPM) Continuous view (Score SDE) Flow view (FM/RF)
──────────── ────────────────── ────────────────
q(x_t|x_{t-1}) → dx = f(x,t)dt+g(t)dW → dx = u_t(x) dt
closed-form q(x_t|x_0) forward SDE ODE (deterministic)
↓ ↓ ↓
ε-prediction score s = ∇ log p_t vector field v_t
↘ ↓ ↙
All linearly invertible (under Gaussian path)
s = -ε/σ_t, v = α'x_0 + σ'ε, ε = -σ s
"DDPM is a special case of VP-SDE in discrete time; score-based is the equivalent parametrization in continuous time; Flow Matching carries the same information as score matching under VP/VE paths but parametrizes as $v$ instead of $s$. Rectified Flow steps outside the SDE framework, using a linear path to directly learn an ODE's vector field."
1.3 Convention (used throughout)
| Symbol | Meaning |
|---|---|
| $x_0$ | clean data sample |
| $x_t$, $t \in \{1,\dots,T\}$ or $t \in [0,T]$ | noised sample |
| $\epsilon \sim \mathcal{N}(0, I)$ | standard Gaussian noise |
| $\alpha_t, \beta_t = 1 - \alpha_t$ | DDPM single-step forward coefficients |
| $\bar\alpha_t = \prod_{s=1}^t \alpha_s$ | DDPM cumulative coefficient |
| $\sigma_t$ | standard deviation (the "noise level" in NCSN / EDM view) |
| $s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t)$ | score |
| $\epsilon_\theta(x_t, t) \approx \epsilon$ | the noise predicted in DDPM |
| $D_\theta(x; \sigma) \approx x_0$ | EDM's denoiser output |
DDPM's paper has forward going $t = 0 \to T$ (data noised to pure noise) and reverse going $T \to 0$; FM papers often use $t = 0$ noise and $t = 1$ data. Before writing code in an interview always disambiguate the time direction — otherwise the sampler is easy to flip.
§2 DDPM Forward Process
2.1 Single step and closed form
The DDPM forward is a Markov chain:
$$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I), \quad t = 1, \dots, T$$
Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^t \alpha_s$. Key property: $q(x_t | x_0)$ is a closed-form Gaussian — you can jump from $x_0$ to any $t$ in a single step (the core of training efficiency):
$$\boxed{\; q(x_t | x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t) I\right) \;}$$
Equivalent reparameterization:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
2.2 Closed-form derivation (mandatory, will keep coming up)
From the reparameterization $x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} z_t$ with independent $z_t \sim \mathcal{N}(0, I)$. Recurse:
$$ \begin{aligned} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{\beta_t} z_t \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} z_{t-1}\right) + \sqrt{\beta_t} z_t \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \underbrace{\sqrt{\alpha_t \beta_{t-1}} z_{t-1} + \sqrt{\beta_t} z_t}_{\text{sum of independent Gaussians}} \end{aligned} $$
Variance of the sum of two independent Gaussians: $\alpha_t \beta_{t-1} + \beta_t = \alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. So it merges into a single Gaussian $\sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar z$. Induct to step $t$:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$$
the benefit of a Markov chain with Gaussian steps is that the cumulative distribution is still Gaussian; this lets forward sample without a network, and training doesn't need to simulate the whole chain.
2.3 Boundary cases and limits
- $t = 0$: $\bar\alpha_0 = 1$, $x_0$ itself — the forward start point
- $t = T$ (DDPM uses 1000): we want $\bar\alpha_T \approx 0$, so $x_T \approx \epsilon \sim \mathcal{N}(0, I)$ — the forward endpoint is close to the Gaussian prior
SNR$(t) = \bar\alpha_t / (1-\bar\alpha_t)$; the linear schedule has $\bar\alpha_T \approx 4\times 10^{-5}$ at $t=T$, corresponding to SNR $\approx 4\times 10^{-5}$ — small but strictly speaking not zero, so the prior doesn't fully match $\mathcal{N}(0,I)$; this is one motivation behind the cosine schedule and "v-prediction" improvements.
§3 DDPM Reverse Process & Training
3.1 The premise that reverse is Gaussian
In theory $q(x_{t-1} | x_t)$ is not Gaussian (it depends on the entire data distribution). But when $\beta_t$ is small enough, the reverse conditional is approximately Gaussian (Feller 1949 / Sohl-Dickstein 2015), so we parametrize:
$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)$$
3.2 ELBO derivation
DDPM optimizes the evidence lower bound (analogous to a VAE):
$$ \begin{aligned} \log p_\theta(x_0) &\ge \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \\ &= -\underbrace{\mathbb{E}_q[\text{KL}(q(x_T|x_0) \,\Vert\, p(x_T))]}_{L_T \text{ (constant, prior matching)}} \\ &\quad - \sum_{t=2}^T \underbrace{\mathbb{E}_q[\text{KL}(q(x_{t-1}|x_t, x_0) \,\Vert\, p_\theta(x_{t-1}|x_t))]}_{L_{t-1}} \\ &\quad + \underbrace{\mathbb{E}_q[\log p_\theta(x_0 | x_1)]}_{L_0 \text{ (decoder log-likelihood)}} \end{aligned} $$
Key: $q(x_{t-1} | x_t, x_0)$ is a closed-form Gaussian (derived from Bayes):
$$q(x_{t-1} | x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\right)$$
where:
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}} \beta_t}{1 - \bar\alpha_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t, \quad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$$
3.3 Simplifying to $L_\text{simple}$ (mandatory derivation)
Substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon) / \sqrt{\bar\alpha_t}$ into $\tilde\mu_t$:
$$\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon\right)$$
Parametrize $\mu_\theta(x_t, t)$ in the same form ($\epsilon$-prediction):
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t)\right)$$
Fix $\Sigma_\theta = \sigma_t^2 I$ (take $\sigma_t^2 = \beta_t$ or $\tilde\beta_t$). The KL between two Gaussians:
$$L_{t-1} = \mathbb{E}\left[\frac{1}{2\sigma_t^2} \| \tilde\mu_t - \mu_\theta \|^2\right] = \mathbb{E}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
Ho 2020's engineering trick: drop all preceding coefficients + constant terms and just use the unweighted version:
$$\boxed{\; L_\text{simple}(\theta) = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\; x_0,\; \epsilon}\Big[\big\|\epsilon - \epsilon_\theta\!\big(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon,\; t\big)\big\|^2\Big] \;}$$
Ho 2020's empirical observation: the unweighted version effectively upweights low-SNR (high $t$) loss, which in turn improves sample quality. The cost: $\log$-likelihood is no longer the ELBO lower bound — so "good FID" ≠ "good likelihood". Improved DDPM (Nichol-Dhariwal 2021) later introduces a hybrid loss $L_\text{hybrid} = L_\text{simple} + \lambda L_\text{vlb}$ ($\lambda = 0.001$), simultaneously learning $\Sigma_\theta$.
3.4 Equivalent prediction targets (memorize)
Given $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon$, the three mainstream parametrizations are linearly invertible:
$$ \begin{aligned} \epsilon\text{-pred} &:\quad \epsilon_\theta(x_t, t) \approx \epsilon \\ x_0\text{-pred} &:\quad \hat x_0(x_t, t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta}{\sqrt{\bar\alpha_t}} \\ v\text{-pred (Salimans-Ho 2022)} &:\quad v_\theta = \sqrt{\bar\alpha_t}\, \epsilon - \sqrt{1-\bar\alpha_t}\, x_0 \\ \text{score} &:\quad s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}} \end{aligned} $$
$\epsilon$-pred degenerates as $t \to 0$ (small noise; loss coefficient explodes); $x_0$-pred degenerates as $t \to T$ (large noise); $v$-pred interpolates between them and has approximately uniform loss magnitude across all $t$ — that's the key choice in Imagen Video / SD2.1-v / Karras EDM.
§4 Schedule: linear / cosine / EDM
4.1 Linear (Ho 2020)
$$\beta_t = \beta_\text{start} + \frac{t-1}{T-1}(\beta_\text{end} - \beta_\text{start}), \quad \beta_\text{start} = 10^{-4},\; \beta_\text{end} = 0.02$$
$T = 1000$. Simple and stable, but the end-SNR isn't strictly zero ($\bar\alpha_T \approx 4 \times 10^{-5}$, SNR $\approx 4 \times 10^{-5}$, while an ideal prior wants this closer to 0).
4.2 Cosine (Nichol-Dhariwal 2021)
$$\bar\alpha_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos^2\!\left(\frac{(t/T) + s}{1 + s} \cdot \frac{\pi}{2}\right), \quad s = 0.008$$
$\beta_t = 1 - \bar\alpha_t / \bar\alpha_{t-1}$ (then clip to $[0, 0.999]$ for numerical stability). The $s = 0.008$ offset is to prevent $\beta_1$ from being too close to 0.
A linear schedule adds noise too fast in the low-$t$ region, so the model spends most of its time "training" on already-pure-noise regions (and learns nothing). Cosine adds noise slowly at low $t$, faster in the middle, and reaches near-zero SNR at the end. Improved DDPM experiments: cosine improves FID on ImageNet 64 by about 20% vs linear.
4.3 EDM σ-schedule (Karras 2022)
EDM reparametrizes the $\beta$ schedule into a $\sigma$ schedule (directly using $\sigma$ as time). At sampling:
$$\sigma_i = \left(\sigma_\text{max}^{1/\rho} + \frac{i}{N-1}\left(\sigma_\text{min}^{1/\rho} - \sigma_\text{max}^{1/\rho}\right)\right)^\rho, \quad i = 0, \dots, N-1$$
Defaults: $\sigma_\text{min} = 0.002$, $\sigma_\text{max} = 80$, $\rho = 7$. $\rho = 7$ was swept empirically by Karras — beats linear / log spacing because it allocates more steps to the small-$\sigma$ (high-SNR) region, where stepping errors are more sensitive.
DDPM's $\beta$ array is equivalent to a VP-SDE's $\beta(t) = T \beta_{\lfloor tT \rfloor}$; EDM's $\sigma$-schedule corresponds to a VE-SDE with $\sigma(t) = t$ (linear time); the two differ only by a $t$ reparametrization, informationally equivalent. EDM's contribution is discovering an empirically more stable $\sigma_i$ selection rule.
§5 Score-based view
5.1 Score and score matching (Hyvärinen 2005)
Define $s(x) = \nabla_x \log p(x)$. If we learn $s_\theta \approx s$, we can sample via Langevin dynamics:
$$x_{k+1} = x_k + \frac{\eta}{2} s_\theta(x_k) + \sqrt{\eta}\, z_k, \quad z_k \sim \mathcal{N}(0, I)$$
The direct score matching loss $\mathbb{E}_p\|s_\theta - \nabla\log p\|^2$ is not computable (we don't know $\nabla \log p$). Hyvärinen 2005 gave implicit score matching that avoids $\nabla \log p$ via integration by parts:
$$\mathbb{E}_p\left[\|s_\theta(x)\|^2 + 2 \operatorname{tr}(\nabla_x s_\theta(x))\right]$$
But $\operatorname{tr}(\nabla_x s_\theta)$ is too expensive in high dimensions (Hessian trace).
5.2 Denoising Score Matching (Vincent 2011)
For each data point $x_0$, add noise $\tilde x = x_0 + \sigma \epsilon$ and define the perturbed distribution $p_\sigma(\tilde x) = \int p(x_0) \mathcal{N}(\tilde x; x_0, \sigma^2 I) dx_0$. Vincent 2011 proved:
$$\mathbb{E}_{p_\sigma(\tilde x)}\|s_\theta(\tilde x) - \nabla \log p_\sigma(\tilde x)\|^2 = \mathbb{E}_{x_0, \tilde x}\left\|s_\theta(\tilde x) - \nabla_{\tilde x} \log q(\tilde x | x_0)\right\|^2 + \text{const}$$
And the score of $q(\tilde x | x_0) = \mathcal{N}(x_0, \sigma^2 I)$ has a closed form:
$$\nabla_{\tilde x} \log q(\tilde x | x_0) = -\frac{\tilde x - x_0}{\sigma^2} = -\frac{\epsilon}{\sigma}$$
So the training loss simplifies to:
$$\boxed{\; L_\text{DSM}(\theta) = \mathbb{E}_{x_0, \sigma, \epsilon}\left\| \sigma\, s_\theta(\tilde x; \sigma) + \epsilon \right\|^2 \;}$$
This is exactly the NCSN / SMLD training objective (up to a weight).
5.3 Tweedie's formula (mandatory derivation)
Statement: for additive Gaussian noise $x_t = x_0 + \sigma_t \epsilon$ (VE view, $\epsilon \sim \mathcal{N}(0,I)$):
$$\boxed{\; \mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2\, \nabla_{x_t} \log p_t(x_t) \;}$$
Derivation: $p_t(x_t) = \int p_0(x_0) \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0$. Take the gradient w.r.t. $x_t$:
$$\nabla_{x_t} p_t(x_t) = \int p_0(x_0) \cdot \nabla_{x_t} \mathcal{N}(x_t; x_0, \sigma_t^2 I)\, dx_0 = \int p_0(x_0) \cdot \mathcal{N}(x_t; x_0, \sigma_t^2 I) \cdot \frac{x_0 - x_t}{\sigma_t^2}\, dx_0$$
Divide both sides by $p_t(x_t)$:
$$\nabla_{x_t} \log p_t(x_t) = \frac{1}{p_t(x_t)} \int p_0(x_0) \mathcal{N}(x_t | x_0) \frac{x_0 - x_t}{\sigma_t^2}\, dx_0 = \mathbb{E}_{p_0(x_0 | x_t)}\left[\frac{x_0 - x_t}{\sigma_t^2}\right]$$
That is:
$$\sigma_t^2 \nabla_{x_t} \log p_t(x_t) = \mathbb{E}[x_0 | x_t] - x_t \quad \Rightarrow \quad \mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t) \quad \square$$
the denoiser network's optimal output (MMSE estimator) is the score plus an identity map. All conversions among $\epsilon$-pred / score-pred / $x_0$-pred / $v$-pred are one-line rearrangements of Tweedie.
5.4 NCSN / SMLD (Song-Ermon 2019)
Noise-Conditional Score Network: train a shared network $s_\theta(x, \sigma)$ that does DSM at multiple noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$ simultaneously. Sampling uses annealed Langevin dynamics: first Langevin at large $\sigma_1$ (explore the whole space), then decay to $\sigma_L$ (refine details).
$$x \leftarrow x + \frac{\epsilon_i}{2} s_\theta(x, \sigma_i) + \sqrt{\epsilon_i}\, z, \quad \epsilon_i = \eta \cdot (\sigma_i / \sigma_L)^2$$
Run $T$ Langevin steps at each $\sigma_i$, then switch to the next $\sigma_{i+1}$.
Scores trained at small $\sigma$ are completely wrong in regions far from the data manifold (in the "empty regions" between modes where $p(x) \approx 0$, the score gives no direction). The core of multiple noise levels is using large $\sigma$ to "fill" the space, providing initial positions for small $\sigma$.
§6 Score SDE: unified framework + Probability Flow ODE
6.1 Forward SDE
Song et al. 2021 (ICLR) write all diffusions as a forward SDE:
$$dx = f(x, t)\, dt + g(t)\, dW$$
| Type | $f(x, t)$ | $g(t)$ | Discrete counterpart |
|---|---|---|---|
| VP-SDE (variance preserving) | $-\frac{1}{2}\beta(t) x$ | $\sqrt{\beta(t)}$ | DDPM |
| VE-SDE (variance exploding) | $0$ | $\sqrt{d[\sigma^2(t)]/dt}$ | SMLD / EDM |
| sub-VP | $-\frac{1}{2}\beta(t) x$ | $\sqrt{\beta(t)(1-e^{-2\int_0^t \beta(s)ds})}$ | between VP/VE, better likelihood |
VP-SDE satisfies $\text{Var}[x_t] \le 1$ (variance preserving); VE-SDE has unbounded variance growth (variance exploding).
6.2 Reverse SDE (Anderson 1982)
For any forward SDE, there exists a reverse-time SDE:
$$\boxed{\; dx = \left[f(x, t) - g^2(t)\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar W \;}$$
where $d\bar W$ is a reverse-time Wiener process. Sampling: start from $x_T \sim p_T$ (close to prior), then integrate to $t = 0$ with an SDE solver (Euler-Maruyama / predictor-corrector).
6.3 Probability Flow ODE (bridge to FM)
Key theorem (Song et al. 2021, "Score-Based Generative Modeling through SDEs"): the following deterministic ODE shares all marginals $p_t$ with the reverse SDE:
$$\boxed{\; \frac{dx}{dt} = f(x, t) - \frac{1}{2} g^2(t)\, \nabla_x \log p_t(x) \;}$$
This is the probability flow ODE. Equivalent to Flow Matching's vector field:
$$u_t(x) = f(x, t) - \tfrac{1}{2} g^2(t)\, s_\theta(x, t)$$
forward SDE (training: score matching)
↓
┌──────────────────────┐
↓ ↓
reverse SDE probability flow ODE
(stochastic) (deterministic, ⇔ FM)
↓ ↓
DDPM ancestral sampler DDIM (η=0) / EDM / DPM-SolverProof sketch: write the forward SDE's Fokker-Planck (continuity equation):
$$\frac{\partial p_t}{\partial t} = -\nabla \cdot (f p_t) + \frac{1}{2} g^2 \Delta p_t$$
Using $\Delta p_t = \nabla \cdot (p_t \nabla \log p_t)$, write the diffusion term in transport form:
$$\frac{\partial p_t}{\partial t} = -\nabla \cdot \left[\left(f - \tfrac{1}{2} g^2 \nabla \log p_t\right) p_t\right]$$
This is exactly the continuity equation for the ODE $dx/dt = f - \frac{1}{2} g^2 \nabla \log p_t$ — so their $p_t$ agree.
6.4 Advantages of the ODE view
| Advantage | Description |
|---|---|
| Deterministic | same noise → same sample, enables image editing / interpolation |
| NFE-friendly | high-order ODE solvers (Heun / RK4 / DPM-Solver) need few steps |
| Computable likelihood | $\log p_0(x_0) = \log p_T(x_T) + \int_0^T \nabla \cdot v_t(x(t))\, dt$ (PF-ODE instantaneous change-of-variables, Chen et al. 2018), with div estimated via Hutchinson trace estimator |
| Bridge to FM | the route taken by RF / SD3 / FLUX |
the stochastic perturbation in SDE sampling can "correct" early errors, typically yielding better sample quality but at higher NFE; ODE is deterministic but susceptible to solver-error accumulation and needs higher-order solvers. EDM proposes a middle ground: base ODE + small stochastic churn ("$S_\text{churn}$"), with better FID.
§7 DDIM: Non-Markovian forward → deterministic sampler
7.1 Motivation
DDPM ancestral sampling must walk all $T = 1000$ steps (a Markov chain). Can we sample with fewer steps without retraining? DDIM (Song et al. 2020 arXiv / ICLR 2021) gives a "yes" — the core is making forward non-Markovian while preserving the same marginal $q(x_t | x_0)$ as DDPM.
7.2 Non-Markovian forward
DDIM defines a family of forward distributions, controlled by a parameter $\eta \in [0, 1]$:
$$q_\sigma(x_{t-1} | x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$$
where $\sigma_t^2 = \eta^2 \cdot \tilde\beta_t = \eta^2 \cdot \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \beta_t$.
Key property (DDIM Theorem 1): under this forward, $q(x_t | x_0)$ is still $\mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ — identical to DDPM! So we can directly use a DDPM-trained $\epsilon_\theta$ for DDIM sampling.
7.3 DDIM sampling formula
Substitute $x_0 \to \hat x_0 = (x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)) / \sqrt{\bar\alpha_t}$:
$$\boxed{\; x_{t-1} = \sqrt{\bar\alpha_{t-1}}\, \hat x_0 + \sqrt{1 - \bar\alpha_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t\, z, \quad z \sim \mathcal{N}(0, I) \;}$$
- $\eta = 0$ (DDIM): $\sigma_t = 0$, deterministic — same $x_T$ gives same $\hat x_0$ (latent-space interpolation friendly)
- $\eta = 1$ (walking the full $T$ steps): $\sigma_t = \sqrt{\tilde\beta_t}$, degenerates to standard DDPM ancestral sampling; under fewer skipped steps $S < T$ it only matches DDPM variance order, not a strict 1000-step DDPM equivalent
- Intermediate $\eta \in (0, 1)$: tunable stochasticity
7.4 Skip steps (few-step sampling)
You don't have to step $t \to t-1$; you can skip: pick a sub-sequence $\tau_0 < \tau_1 < \dots < \tau_S = T$ and do:
$$x_{\tau_{i-1}} = \sqrt{\bar\alpha_{\tau_{i-1}}}\, \hat x_0 + \sqrt{1 - \bar\alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2}\, \epsilon_\theta(x_{\tau_i}, \tau_i) + \sigma_{\tau_i}\, z$$
Classic baseline: $S = 50$ DDIM steps achieve FID close to 1000-step DDPM on ImageNet 256.
when $\eta = 0$ and the time grid is taken continuous, DDIM degenerates to the first-order Euler discretization of the probability flow ODE for the VP-SDE — that's why deterministic DDIM lines up with ODE-based samplers (DPM-Solver, EDM Heun).
§8 EDM: Karras 2022 Design Space
8.1 Motivation
Karras 2022 ("Elucidating the Design Space of Diffusion-Based Generative Models") decomposes all diffusion design knobs (parametrization, loss weighting, sampler, schedule) and sweeps each, arriving at SOTA recipes: CIFAR-10 FID 1.79 (35 NFE), ImageNet 64 FID 1.36.
8.2 Preconditioning (mandatory derivation)
EDM adopts the VE view: $x = x_0 + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, treating $\sigma$ directly as the noise level (no $\alpha$).
Denoiser parametrization:
$$\boxed{\; D_\theta(x;\, \sigma) = c_\text{skip}(\sigma)\, x + c_\text{out}(\sigma)\, F_\theta\!\left(c_\text{in}(\sigma)\, x,\; c_\text{noise}(\sigma)\right) \;}$$
where $F_\theta$ is the base network and the four $c$ functions are a hand-designed schedule. Karras' derivation:
Derivation: unit-variance argument
Goal: make $F_\theta$'s input and training target have $\mathcal{O}(1)$ variance at all $\sigma$.
Input side: the network sees $c_\text{in} x$. Since $\text{Var}[x] = \sigma_\text{data}^2 + \sigma^2$ (data variance + noise variance):
$$c_\text{in}(\sigma) = \frac{1}{\sqrt{\sigma_\text{data}^2 + \sigma^2}} \quad \Rightarrow \quad \text{Var}[c_\text{in} x] = 1$$
Output side: the ideal denoiser is $D^*(x; \sigma) = \mathbb{E}[x_0 | x]$ (Tweedie). Have the network learn the residual instead of the full quantity: define the effective target
$$F^*(x; \sigma) = \frac{1}{c_\text{out}(\sigma)}\left[D^*(x;\sigma) - c_\text{skip}(\sigma)\, x\right]$$
We want $\text{Var}[c_\text{out} F^* + c_\text{skip} x - D^*] = 0$ and $\text{Var}[F^*] = 1$ (so the network's target has unit variance).
Find $c_\text{skip}, c_\text{out}$ minimizing the effective error (minimize $\mathbb{E}\|F^* - F_\theta\|^2$ under $\text{Var}[F^*]=1$). Karras takes $D^* = x_0$ (ideal case), substitutes and expands:
$$c_\text{skip}(\sigma) = \frac{\sigma_\text{data}^2}{\sigma^2 + \sigma_\text{data}^2}, \quad c_\text{out}(\sigma) = \frac{\sigma \cdot \sigma_\text{data}}{\sqrt{\sigma^2 + \sigma_\text{data}^2}}$$
Intuition:
- $\sigma \to 0$ (low noise): $c_\text{skip} \to 1, c_\text{out} \to 0$ — output is basically the input identity (denoiser does nothing)
- $\sigma \to \infty$ (high noise): $c_\text{skip} \to 0, c_\text{out} \to \sigma_\text{data}$ — output is fully determined by the network (input is pure noise)
Time encoding: $c_\text{noise}(\sigma) = \frac{1}{4} \ln \sigma$ (log-scale, covering the wide dynamic range $\sigma \in [\sigma_\text{min}, \sigma_\text{max}]$).
8.3 Training loss
EDM uses a weighted L2:
$$L_\text{EDM}(\theta) = \mathbb{E}_{\sigma, x_0, \epsilon}\Big[\lambda(\sigma)\, \big\| D_\theta(x_0 + \sigma\epsilon;\, \sigma) - x_0 \big\|^2\Big]$$
The weight is $\lambda(\sigma) = (\sigma^2 + \sigma_\text{data}^2) / (\sigma \cdot \sigma_\text{data})^2 = 1/c_\text{out}^2$, equivalent to training $F_\theta$ with unweighted L2 (the target has unit variance at every $\sigma$, so loss magnitudes are uniform).
Training $\sigma$ sampling: $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$, defaults $P_\text{mean} = -1.2$, $P_\text{std} = 1.2$ (concentrates $\sigma$ around $0.3$ — the "hardest to learn" SNR region, swept by Karras).
8.4 Heun 2nd-order sampler
EDM sampling defaults to Heun 2nd-order ODE + optional stochastic churn. The VE-SDE's probability flow ODE, with $f = 0, g(t) = \sqrt{d\sigma^2/dt}$:
$$\frac{dx}{d\sigma} = -\sigma\, \nabla_x \log p_\sigma(x) = \frac{x - D_\theta(x; \sigma)}{\sigma}$$
(Use Tweedie: $\nabla \log p_\sigma = (D - x)/\sigma^2$, substitute into $dx/d\sigma = -\sigma \nabla \log p$.)
Heun step $i$ ($\sigma_i \to \sigma_{i+1}$, $\Delta\sigma = \sigma_{i+1} - \sigma_i$):
d_i = (x_i - D_θ(x_i, σ_i)) / σ_i
x_* = x_i + Δσ · d_i # Euler step (predictor)
if σ_{i+1} > 0: # skip corrector at last step
d_* = (x_* - D_θ(x_*, σ_{i+1})) / σ_{i+1}
x_{i+1} = x_i + Δσ · (d_i + d_*) / 2 # Heun trapezoidal (corrector)
else:
x_{i+1} = x_*
2 NFE per step, but second-order accuracy — more NFE than Euler but much more accurate. CIFAR-10 EDM with 35 NFE = 18 Heun steps + first-order final step yields FID 1.79.
at the start of each step temporarily raise $\sigma_i$ to $\hat\sigma_i = (1+\gamma_i)\sigma_i$ (with $\gamma_i$ a small per-step churn), injecting extra noise: $\hat x_i = x_i + \sqrt{\hat\sigma_i^2 - \sigma_i^2}\, z$, where $\sqrt{\hat\sigma_i^2 - \sigma_i^2} = \sigma_i\sqrt{2\gamma_i + \gamma_i^2}$; stepping from $\hat\sigma_i$ down to $\sigma_{i+1}$ is equivalent to a small SDE. EDM experiments: a small amount of churn slightly improves FID on ImageNet (about 0.1-0.3).
§9 High-order samplers: DPM-Solver / DPM-Solver++
9.1 Motivation
DDIM is first-order ODE Euler. DPM-Solver (Lu et al. 2022 NeurIPS) exploits the semi-linear structure of the diffusion ODE for high-order expansion. Rewriting the probability flow ODE under VP-SDE with $\epsilon$-pred:
$$\frac{dx}{dt} = f(t)\, x + g(t)\, \epsilon_\theta(x, t)$$
where $f(t) = -\frac{1}{2}\beta(t)$, $g(t) = +\frac{1}{2}\beta(t)/\sqrt{1-\bar\alpha_t}$ (from $-\frac{1}{2}g_\text{SDE}^2 \cdot s = +\frac{1}{2}\beta\cdot \epsilon/\sqrt{1-\bar\alpha_t}$, since $s = -\epsilon/\sqrt{1-\bar\alpha_t}$).
Integrate the linear part exactly (exponential integrator) and Taylor-expand the rest.
9.2 DPM-Solver-2 / 3 (core idea)
Let $\lambda_t = \log(\sqrt{\bar\alpha_t} / \sqrt{1-\bar\alpha_t})$ (log-SNR), use $\lambda$ as time. Rewrite the ODE as:
$$x_{t} = \frac{\sqrt{\bar\alpha_t}}{\sqrt{\bar\alpha_s}} x_s - \sqrt{\bar\alpha_t} \int_{\lambda_s}^{\lambda_t} e^{-\lambda} \hat\epsilon_\theta(x_\tau, \tau)\, d\lambda$$
Taylor-expand $\hat\epsilon_\theta$ in $\lambda$ to order $k$, integrate the linear part exactly (exponential weight), and approximate the rest by order:
- DPM-Solver-1 = DDIM (first-order)
- DPM-Solver-2: 2 NFE per step, second-order
- DPM-Solver-3: 3 NFE per step, third-order
10-15 NFE reaches the same quality as 50 NFE DDIM.
9.3 DPM-Solver++ (CFG-friendly variant, Lu et al. 2023)
The original DPM-Solver is unstable under CFG ($\epsilon_\theta$ amplified by CFG goes out-of-distribution, and Taylor-expansion error blows up). DPM-Solver++ switches to $x_0$-prediction:
$$x_t = \frac{\sigma_t}{\sigma_s} x_s + \sigma_t \int_{\lambda_s}^{\lambda_t} e^{\lambda} \hat x^0_\theta(x_\tau, \tau)\, d\lambda$$
(Using $x_0$-pred instead of $\epsilon$-pred keeps CFG amplification in a more stable regime.)
15-20 NFE under CFG=7 gives quality close to 100-NFE DDIM. One of the default samplers in SDXL / SD3.
9.4 Sampler comparison
ordered by NFE / quality / compatibility (image generation).
- DDPM ancestral: T=1000 steps, baseline; rare in modern use
- DDIM ($\eta = 0$): 50-100 NFE, simple and stable, supports interpolation
- PLMS / PNDM: 50 NFE, linear-multistep, the old AUTOMATIC1111 default
- EDM Heun: 18-35 NFE, deterministic 2nd-order ODE, the SOTA literature baseline
- DPM-Solver / DPM-Solver++: 10-20 NFE, recommended by HuggingFace diffusers
- UniPC (Zhao 2023): predictor-corrector framework, can beat DPM-Solver
- Consistency Models (one-step / two-step): 1-4 NFE, requires distillation
§10 Conditioning: Classifier Guidance & CFG
10.1 Classifier Guidance (Dhariwal-Nichol 2021)
Train a separate classifier $p_\phi(c | x_t)$ on noisy data, then apply Bayes:
$$\nabla_{x_t} \log p(x_t | c) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p_\phi(c | x_t)$$
In practice we scale the classifier gradient by $w$ (controlling guidance strength):
$$\tilde\epsilon = \epsilon_\theta(x_t, t) - w \sqrt{1-\bar\alpha_t}\, \nabla_{x_t} \log p_\phi(c | x_t)$$
(a) needs an extra noisy classifier (engineering overhead); (b) classifier gradients tend toward "adversarial" behavior, degenerating away from the training distribution; (c) unfriendly to continuous conditions like text-to-image. CFG fully supplants it.
10.2 Classifier-Free Guidance (Ho-Salimans 2022)
Training: with probability $p_\text{drop}$ (typically 0.1), replace $c$ with $\emptyset$ (null embedding), so the same net learns both conditional and unconditional:
$$L_\text{CFG}(\theta) = \mathbb{E}\big[\|\epsilon - \epsilon_\theta(x_t, t, c \text{ or } \emptyset)\|^2\big]$$
Inference: call $w$ the guidance scale:
$$\boxed{\; \tilde\epsilon = \epsilon_\theta(x_t, t, \emptyset) + (1 + w)\big[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\big] \;}$$
Equivalent form (common in Imagen / SD implementations):
$$\tilde\epsilon = (1 + w)\, \epsilon_\theta(x_t, t, c) - w\, \epsilon_\theta(x_t, t, \emptyset)$$
the original Ho-Salimans 2022 paper has $\tilde\epsilon = \epsilon_\text{uncond} + (1+w)(\epsilon_\text{cond} - \epsilon_\text{uncond})$, so $w = 0$ is unguided and $w > 0$ is amplified. HuggingFace / SD UIs commonly use $w' = w + 1$, so $w' = 1$ is unguided and $w' = 7.5$ is the common amplified value. State the convention in interview code.
10.3 Geometric meaning of CFG
CFG is equivalent to pulling the sampling trajectory toward the "conditional gradient" direction:
$$\nabla_{x_t} \log p(x_t | c) \approx \nabla_{x_t} \log p(x_t) + w \nabla_{x_t} \log \frac{p(x_t | c)}{p(x_t)}$$
The second term is the "conditional score difference", which pushes samples toward regions of high conditional likelihood and relatively low unconditional likelihood — intuitively "amplifying text alignment".
$w \in [3, 7.5]$ is the empirical sweet spot for Stable Diffusion; $w > 10$ tends to over-saturate (color saturation, artifacts). FLUX internalizes CFG into distillation ("guidance-distilled") so a single forward pass yields the CFG effect — one of the keys to its inference speed.
§11 Production: from LDM to FLUX
11.1 Latent Diffusion (LDM, Rombach 2022 CVPR)
Core idea: run diffusion in VAE latent space rather than pixel space.
- Train a VAE $E, D$: $z = E(x), \hat x = D(z)$, with $z$ ~8× smaller than $x$ (e.g., $512^2 \times 3 \to 64^2 \times 4$)
- Train a diffusion model on $z$ (params, memory, and compute all drop by an order of magnitude)
- To generate: sample from $z_T$ to $z_0$, then $D(z_0)$ decodes back to pixels
Stable Diffusion (SD) = LDM + CLIP text encoder + UNet on $64 \times 64 \times 4$ latent — at the time the most practical open-source T2I model.
11.2 SDXL (Podell et al. 2023 arXiv / ICLR 2024 spotlight)
Main improvements from SD 1.5 to SDXL:
- Larger UNet: params from ~860M to ~2.6B, more cross-attn layers
- Two-stage architecture: base + refiner (the refiner handles low-noise detail)
- Better text encoder: OpenCLIP ViT-bigG/14 + CLIP-L/14 concatenated
- Multi-scale / multi-aspect-ratio training: native support for 1024×1024 + multiple aspect ratios
- MicroConditioning: feed original resolution, crop offset, aspect ratio as conditions to the UNet
11.3 DiT (Peebles-Xie 2023 ICCV)
Replace the UNet with a pure Transformer:
- Patch the latent (e.g., $2 \times 2$) into a token sequence
- Standard Transformer block (self-attn + MLP)
- Inject conditioning via adaptive LayerNorm (adaLN): $\text{LN}(x) \cdot \gamma(c, t) + \beta(c, t)$ with $\gamma, \beta$ produced by an MLP on $c, t$
DiT experiments: better scaling laws than the UNet, FID decreases steadily with parameters. SD3 / FLUX / Sora are all in the DiT family.
11.4 SD3 (Esser 2024 ICML) — diffusion replaced by Rectified Flow
Two key SD3 changes:
- Rectified Flow replaces DDPM: training objective becomes $\|v_\theta - (x_1 - x_0)\|^2$ (FM framework)
- MM-DiT: multimodal DiT, where text tokens and image tokens attend to each other in the same Transformer (rather than via cross-attn)
Why switch to RF? Esser 2024 ablations: linear paths have straighter trajectories than cosine paths → better few-step sampling; logit-normal $t$ sampling emphasizes mid-noise and improves quality.
11.5 FLUX.1 (Black Forest Labs 2024)
Inherits SD3 + MM-DiT with major updates:
- 12B parameters (open-source dev version)
- Guidance-distilled: CFG distilled into a single forward, no 2× CFG forwards at inference
- Adversarial training late-stage fine-tune (like SD3-Turbo / ADD), 4-step generation
11.6 ControlNet (Zhang 2023 ICCV)
Adds a trainable copy + zero-conv connections to a frozen SD UNet:
Original UNet (frozen) Control signal (canny / depth / pose)
↓ ↓
[encoder blocks] [trainable copy of encoder]
↓ ──────── zero-conv ──────────────────↓
[mid block] [trainable mid]
↓ ──────── zero-conv ──────────────────↓
[decoder blocks (frozen)] + [trainable copy outputs]
↓
output
Zero-conv = 1×1 convolution initialized to zero → at start, ControlNet doesn't change the original UNet's output (preserving SD's capability); as training proceeds, it learns to apply condition control.
the original UNet (most parameters) is frozen, only the trainable copy (~half the params) is trained; single-GPU trainable, which is key to the open-source ecosystem.
§12 Distillation: 1-step / Few-step generation
12.1 Progressive Distillation (Salimans-Ho 2022)
Iterative distillation: student one step $\approx$ teacher two steps; distill for $\log_2 N$ rounds to compress $N$ steps to 1. Key: halve each round, keeping distribution drift controlled.
12.2 Consistency Models (Song 2023 ICML)
Idea: directly learn a network $f_\theta(x_t, t)$ such that for all $t$:
$$f_\theta(x_t, t) \approx x_0$$
i.e., the network is the consistency function of the probability flow ODE — any $x_t$ maps to the corresponding $x_0$. One-step sampling: $x_0 = f_\theta(x_T, T)$.
Training objective (Consistency Distillation, CD):
$$L_\text{CD}(\theta) = \mathbb{E}\left[d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat x_{t_n}, t_n)\big)\right]$$
where:
- $\theta^-$ is an EMA target
- $\hat x_{t_n}$ is obtained by a teacher ODE solver one step from $x_{t_{n+1}}$ ($x_{t_n} = \text{ODE-step}(x_{t_{n+1}})$)
- $d$ is a metric (L2 / LPIPS)
Boundary condition: requires $f_\theta(x_{\sigma_\text{min}}, \sigma_\text{min}) = x_{\sigma_\text{min}}$ (self-consistency at lowest noise) — enforced by EDM-style preconditioning:
$$f_\theta(x, \sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(x, \sigma)$$
with $c_\text{skip}, c_\text{out}$ designed so that $f_\theta \equiv x$ at $\sigma = \sigma_\text{min}$.
CT is fully from scratch (no teacher; apply consistency loss directly to $x_0 + \sigma_n \epsilon$ vs $x_0 + \sigma_{n+1} \epsilon$); CD distills with a pretrained teacher. Quality-wise CD > CT; recent ICT (Song 2024) brings CT close to CD.
12.3 LCM / LCM-LoRA (Luo 2023)
Latent Consistency Model: apply Consistency Models to latent diffusion (SD 1.5 / SDXL):
- Teacher = pretrained SD (DDIM as ODE solver)
- Student = LCM, 4-8 step generation
LCM-LoRA: package LCM training as a LoRA adapter — a single LoRA file lets any SD 1.5 / SDXL fine-tune generate in 4 steps. Huge ecosystem value: users don't need to swap base models.
12.4 Adversarial Diffusion Distillation (ADD) — SDXL-Turbo / SD3-Turbo (Sauer 2023/2024)
ADD training objective:
$$L_\text{ADD} = L_\text{adv}(\text{student}) + \lambda L_\text{distill}(\text{student}, \text{teacher})$$
- $L_\text{adv}$: discriminator backbone is a pretrained vision model (DINOv2)
- $L_\text{distill}$: student multi-step ODE should match teacher multi-step ODE
Results: SDXL-Turbo 1-step 1024 px, SD3-Turbo 4-step 1024 px. Quality slightly below multi-step but real-time (~100ms / image).
§13 The bridge to Flow Matching
13.1 Score vs vector field — same information, different parametrization
Inside the VP-SDE / VE-SDE framework, FM learning $v$ and score-based learning $s$ are two parametrizations of the same information:
$$v_\theta(t, x) = f(x, t) - \tfrac{1}{2} g^2(t)\, s_\theta(t, x)$$
Specifically on the VP (DDPM) path, with $\alpha_t = \sqrt{\bar\alpha_t}, \sigma_t = \sqrt{1-\bar\alpha_t}$, the conditional vector field (same form as Salimans-Ho 2022 $v$-prediction):
$$v_\theta^\text{VP}(t, x_t) = \alpha_t'\, x_0 + \sigma_t'\, \epsilon$$
Substitute $x_0 = (x_t - \sigma_t \epsilon)/\alpha_t$ and rearrange: $v_\theta$ is simultaneously a linear combination of $x_t$ and $\epsilon$ (or score), the concrete expression depending on the time derivatives of $\alpha_t, \sigma_t$.
In practice: for any Gaussian path with $\alpha(t), \sigma(t)$, the three quantities $\{\epsilon_\theta, s_\theta, v_\theta\}$ are fully equivalent. So training DDPM / score-based / FM on VP/VE paths is the same task.
13.2 Why did SD3 / FLUX switch to Rectified Flow?
Rectified Flow path: $x_t = (1-t) x_0 + t x_1$ (linear noise→data interpolation), $v_t = x_1 - x_0$.
| Advantage | RF (linear) | VP/VE (curved) |
|---|---|---|
| ODE trajectory | straight line | curved (needs higher-order solver) |
| Target $v_t$ | doesn't depend on $t$ | depends on $t$ (VP cosine path) |
| Few-step sampling | Euler works at 4-8 steps | Euler needs 30+ steps |
| Reflow compressible to 1-2 steps | ✓ (InstaFlow / SD3-Turbo) | ✗ |
| Training stability | stable with logit-normal $t$ + RF | requires careful noise schedule |
"Under the same DiT backbone, RF + logit-normal $t$ improves FID on ImageNet 256 by about 0.5-1.0 over VP + uniform $t$; on T2I tasks, GenEval text alignment is significantly better."
13.3 DDPM/DDIM/EDM/RF/CM full picture
Training objective Sampling Typical NFE
───────── ───────── ───────
DDPM ε-pred (MSE) ancestral / DDIM 1000 / 50
Score SDE score (DSM) reverse SDE / PF-ODE 500 / 30
DDIM (uses DDPM weights) deterministic ODE step 20-50
EDM D_θ (Tweedie) Heun ODE 2nd-order 18-35
RF / SD3 v = x_1-x_0 Euler ODE 4-50
FLUX v + CFG-distill Euler 1-4
ConsistMod f_θ(x_t,t)→x_0 direct map 1-4
LCM-LoRA consistency on SD direct 4-8
§14 25 frequently-asked interview questions (L1 must-know · L2 intermediate · L3 top lab)
L1 must-know (likely on any ML role with diffusion)
Q1. Write out DDPM's forward $q(x_t | x_0)$ and reverse $p_\theta(x_{t-1}|x_t)$.
- Forward closed form: $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t)I)$, $\bar\alpha_t = \prod_{s=1}^t (1-\beta_s)$
- Reverse parametrization: $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t, t), \Sigma_\theta)$
- $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \epsilon_\theta(x_t, t)\right)$ ($\epsilon$-prediction)
Writing the wrong symbols (e.g., confusing $\sqrt{\alpha_t}$ with $\sqrt{\bar\alpha_t}$); forgetting that $\bar\alpha$ is the cumulative product.
Q2. How does DDPM's ELBO simplify to $L_\text{simple}$?
- ELBO splits into $L_T + \sum L_{t-1} + L_0$; $L_T$ is a constant (prior matching)
- $L_{t-1} = \text{KL}(q(x_{t-1}|x_t, x_0) \,\Vert\, p_\theta)$; both are Gaussians, KL is closed-form
- Substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu$ and $\mu_\theta$, yielding $L_{t-1} = \text{const} \cdot \mathbb{E}\|\epsilon - \epsilon_\theta\|^2$
- Ho 2020 drops the coefficient to get $L_\text{simple} = \mathbb{E}\|\epsilon - \epsilon_\theta\|^2$
Saying only "$L_\text{simple}$ predicts noise" without the derivation; not knowing dropping the coefficient equals SNR-weighting.
Q3. Why does $L_\text{simple}$ still work after dropping the coefficient?
- The ELBO coefficient $\beta_t^2 / [2\sigma_t^2 \alpha_t (1-\bar\alpha_t)]$ is large at small $t$ (high SNR) and small at large $t$ (low SNR)
- Dropping the coefficient is equivalent to relatively upweighting low-SNR (large $t$) — these are the steps that "determine semantic structure"
- Empirically: unweighted FID is significantly better than ELBO-weighted
- Cost: no longer a lower bound on $\log p$ (FID ≠ likelihood)
Not knowing the cost is a likelihood vs sample-quality trade-off.
Q4. Linear vs cosine schedule?
- Linear: $\beta_t \in [10^{-4}, 0.02]$ linear interpolation, DDPM original
- Issue: end-SNR isn't low enough ($\bar\alpha_T \approx 4\times 10^{-5}$); the middle region adds noise too fast
- Cosine: $\bar\alpha_t = \cos^2(\pi(t/T + s)/(2(1+s)))$, $s=0.008$, end-SNR ≈ 0
- Empirically: cosine improves FID on ImageNet 64 by about 20% (Nichol-Dhariwal 2021)
Saying only "cosine is better" without writing the formula; forgetting that the $s=0.008$ offset is to keep $\beta_1$ from being near 0.
Q5. How do $\epsilon$-pred / $x_0$-pred / $v$-pred / score interconvert?
- Given $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t} \epsilon$, all quantities are linearly invertible
- $\hat x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon_\theta) / \sqrt{\bar\alpha_t}$
- $v = \sqrt{\bar\alpha_t}\epsilon - \sqrt{1-\bar\alpha_t} x_0$ (Salimans-Ho 2022)
- $s = -\epsilon / \sqrt{1-\bar\alpha_t}$ (from Tweedie or $\nabla_{x_t} \log q(x_t|x_0)$)
Not knowing the four predictions are different parametrizations of the same information; confusing $v$ with velocity.
Q6. DDIM vs DDPM differences?
- DDPM ancestral is a stochastic Markov chain, adding $\sigma_t z$ noise per step, must walk the full $T$ steps
- DDIM uses a non-Markovian forward that shares the same $q(x_t|x_0)$ as DDPM — can directly use DDPM training weights
- $\eta = 0$ is deterministic and supports interpolation; $\eta = 1$ over the full $T$ steps degenerates to DDPM ancestral (with skips it only matches variance order)
- DDIM allows skip steps: 50 steps ≈ DDPM 1000-step quality
Saying only "DDIM is a few-step DDPM" without knowing marginals are equivalent; or not knowing $\eta$ controls stochasticity.
Q7. How is CFG (Classifier-Free Guidance) trained and used?
- Training: with $p_\text{drop}=0.1$ replace $c$ with $\emptyset$ (null embedding); the same net learns conditional/unconditional
Inference: two conventions (must distinguish):
- HF / SD style (denoted $s$): $\tilde\epsilon = \epsilon_\theta(x,\emptyset) + s\,[\epsilon_\theta(x,c) - \epsilon_\theta(x,\emptyset)]$, $s=1$ is unguided, $s\in[3, 7.5]$ is the common SD amplification
- Ho-Salimans 2022 original (denoted $w$): $\tilde\epsilon = (1+w)\,\epsilon_\theta(x,c) - w\,\epsilon_\theta(x,\emptyset)$, $w=0$ is unguided; equivalent to $s = w + 1$
- Large $s$ → strong text alignment but lower diversity; $s>10$ → over-saturated colors
Writing only the formula without knowing the $w$/$s$ convention; not knowing the drop-$c$ training trick; saying CFG needs a separate classifier (that's classifier guidance).
Q8. What is Tweedie's formula? Why is it important?
- $\mathbb{E}[x_0 | x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$ (VE view; the VP version has an $\alpha$ factor)
- Derivation: take $\nabla_{x_t} \log$ of $p_t(x_t) = \int p_0(x_0) \mathcal{N}(x_t; x_0, \sigma_t^2 I) dx_0$
- Significance: the denoiser's optimal output = input + scaled score — the "Rosetta Stone" of conversions among $\epsilon, x_0, v, s$
Memorizing the formula without deriving it; not knowing it links score and denoiser.
Q9. VP-SDE vs VE-SDE?
- VP (variance preserving): $dx = -\frac{1}{2}\beta(t) x\, dt + \sqrt{\beta(t)}\, dW$, corresponds to DDPM; $\text{Var}[x_t] \le 1$
- VE (variance exploding): $dx = \sqrt{d\sigma^2/dt}\, dW$, corresponds to SMLD/EDM; $\text{Var}[x_t]$ grows to $\sigma_\text{max}^2$
- VP: $x_T \approx \mathcal{N}(0, I)$; VE: $x_T \approx \mathcal{N}(x_0, \sigma_\text{max}^2 I)$, prior is $\mathcal{N}(0, \sigma_\text{max}^2 I)$
- EDM picks VE because preconditioning is cleaner to derive; DDPM picks VP because the $\mathcal{N}(0,I)$ prior is natural
Saying only "variance preserving / exploding" without writing the SDE; not knowing EDM is VE.
Q10. What is the probability flow ODE?
- For any forward SDE $dx = f dt + g dW$, there exists a deterministic ODE $dx/dt = f - \frac{1}{2}g^2 \nabla \log p_t$ that shares all marginals $p_t$
- Note: the reverse SDE's drift is $f - g^2 \nabla \log p_t$ (full score correction), while PF-ODE uses only $\frac{1}{2} g^2$; not simply "reverse SDE minus the stochastic term"
- Practical significance: lets you sample in few steps with an ODE solver (DDIM, Heun, RK4, DPM-Solver)
- Bridge between score-based and Flow Matching: $v_t = f - \frac{1}{2}g^2 s$
Knowing only the formula without realizing PF-ODE and reverse SDE drift differ by half; not knowing it enables deterministic sampling.
L2 intermediate (research-oriented · need to know diffusion details)
Q11. What is the EDM preconditioning unit-variance argument?
- Make the network $F_\theta$'s input $c_\text{in} x$ have variance 1: $c_\text{in} = 1/\sqrt{\sigma_\text{data}^2 + \sigma^2}$
- Make the effective target $F^* = (D^* - c_\text{skip} x)/c_\text{out}$ have variance 1: $c_\text{skip} = \sigma_\text{data}^2/(\sigma^2+\sigma_\text{data}^2)$, $c_\text{out} = \sigma \sigma_\text{data} / \sqrt{\sigma^2 + \sigma_\text{data}^2}$
- Intuition: at $\sigma \to 0$, $c_\text{skip} \to 1$ (identity); at $\sigma \to \infty$, $c_\text{out} \to \sigma_\text{data}$ (all from net)
- Effect: loss magnitudes are uniform across all $\sigma$, training is more stable
Memorizing the formula without knowing the reason; not knowing $\sigma_\text{data}$ is data std (around 0.5 for normalized images).
Q12. Benefit of Improved DDPM learning $\Sigma_\theta$?
- DDPM fixes $\Sigma_\theta = \beta_t I$ or $\tilde\beta_t I$
- Nichol-Dhariwal 2021 learn $\Sigma_\theta$ as an interpolation between $\beta_t$ and $\tilde\beta_t$: $\Sigma_\theta = \exp(v \log\beta_t + (1-v) \log\tilde\beta_t)$
- Benefit: few-step sampling quality improves dramatically (50 steps reach 1000-step fixed-$\Sigma$ quality)
- Hybrid loss $L_\text{hybrid} = L_\text{simple} + 0.001 \cdot L_\text{vlb}$ ($L_\text{vlb}$ provides learning signal for $\Sigma_\theta$)
- $\lambda = 0.001$ prevents $L_\text{vlb}$ from dominating
Not knowing the hybrid loss; thinking $\Sigma_\theta$ learning mainly affects training likelihood (actually it's few-step sampling gains).
Q13. Core differences between DPM-Solver and DDIM?
- DDIM is first-order Euler, 1 NFE per step
- DPM-Solver exploits the diffusion ODE's semi-linear structure $dx/dt = f(t) x + g(t) \epsilon_\theta$, integrates the linear part exactly (exponential integrator)
- Taylor-expands the nonlinear part ($\epsilon_\theta$) in log-SNR $\lambda$ to order $k$
- DPM-Solver-2: 2 NFE per step, second-order; DPM-Solver-3: 3 NFE per step, third-order
- 10-15 NFE reaches DDIM 50 NFE quality
- DPM-Solver++ switches to $x_0$-pred, CFG-friendly
Not knowing the exponential integrator; thinking DPM-Solver is some approximation (actually it's a more precise mathematical expansion).
Q14. Consistency Models' training objective? How to achieve 1-step?
- Objective: $f_\theta(x_t, t) \approx x_0$ for all $t$
- Consistency loss: $d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat x_{t_n}, t_n))$ with $\hat x_{t_n}$ from one teacher ODE step
- $\theta^-$ is EMA, like BYOL; metric $d$ = L2 + LPIPS
- Boundary: $f_\theta(x, \sigma_\text{min}) \equiv x$, enforced via EDM-style $c_\text{skip}, c_\text{out}$
- 1-step sampling: $x_0 = f_\theta(x_T, T)$
- 2-step variant: first $x_0 = f_\theta(x_T, T)$, then re-noise to intermediate $t$, then $f_\theta$ again
Saying only "learn the $x_t \to x_0$ map" without defining the consistency constraint; not knowing about EMA target / teacher ODE / boundary.
Q15. Why did SD3 switch from DDPM to Rectified Flow?
- The RF path $x_t = (1-t)x_0 + tx_1$ is straight → straight ODE trajectory → small few-step sampling error
- $v_t = x_1 - x_0$ target doesn't depend on $t$ (given $(x_0, x_1)$), numerically stable
- Combined with logit-normal $t$ sampling (concentrated at $t=0.5$) for gains
- Esser 2024 ablation: same backbone, RF + LogitNorm vs VP-cosine + Uniform → GenEval text alignment significantly better
- Further reflow can compress to 4 steps (FLUX-Schnell / SD3-Turbo)
Saying only "RF is more stable" without knowing it's because the path is straight; not knowing logit-normal is an additional trick.
Q16. How does DiT inject conditioning? adaLN vs cross-attn?
- adaLN-Zero (DiT default): MLP $c, t$ → $\gamma, \beta, \alpha$; $\text{out} = \alpha \cdot \text{block}(\text{LN}(x) \cdot \gamma + \beta) + x$; initialize $\alpha=0$ (zero-init) so the initial DiT block doesn't alter the input
- Cross-attn: image tokens as Q, text/condition as K/V
- Token-concat (MM-DiT, SD3): text tokens and image tokens concatenated into one sequence; all tokens attend to each other
- Empirically: adaLN-Zero scales best (DiT paper); cross-attn has strong text control (SD UNet); MM-DiT is overall best (SD3 / FLUX)
Knowing only cross-attn; not knowing adaLN-Zero's "zero-init" is the key trick.
Q17. What is ControlNet's zero-conv? Why is it necessary?
- 1×1 conv with weights initialized to 0 and bias also 0
- At training start, the trainable copy's output passes through zero-conv → 0, original UNet output unchanged → preserves SD pretrained capability
- As training proceeds, zero-conv learns non-zero weights and gradually injects condition control
- Why not random init: it would perturb the frozen UNet's intermediate features and destroy the pretrained representation
Saying only "add a ControlNet module" without knowing zero-conv; thinking zero-conv is a special 1×1 conv variant (it's just the initialization).
Q18. SDE vs ODE sampling trade-off?
- SDE: reverse SDE contains a stochastic term $g(t) d\bar W$; injects new noise each step, can correct early errors
- ODE (probability flow): deterministic; solver error accumulates with no way back
- SDE typically gives better FID; ODE has lower NFE + is deterministic (supports interpolation)
- EDM compromise: base ODE Heun + small stochastic churn (slight re-noising at each step start), improving FID by 0.1-0.3 over pure ODE
Saying only "SDE is stochastic, ODE is deterministic" without knowing the trade-off; not knowing EDM churn.
Q19. LCM vs SDXL-Turbo differences?
- LCM: Consistency Distillation applied to latent diffusion, 4-8 step; pure distillation loss
- LCM-LoRA: LCM training packaged as a LoRA adapter, applicable to any SD 1.5 / SDXL fine-tune
- SDXL-Turbo (ADD): adversarial loss + distill loss, 1-4 step; uses DINOv2 as discriminator
- LCM is stabler, ADD is sharper (adversarial gives clearer textures)
- LCM is open-source earlier with a more complete ecosystem; Turbo requires BFL/SAI in-house training
Not knowing LCM-LoRA's "LoRA-compat" is the killer feature; thinking Turbo = LCM.
Q20. How is training noise level $\sigma$ sampled?
- DDPM: $t \sim \mathcal{U}\{1, \dots, T\}$, discrete uniform
- EDM: $\ln \sigma \sim \mathcal{N}(P_\text{mean}, P_\text{std}^2)$, $P_\text{mean}=-1.2, P_\text{std}=1.2$, concentrated around $\sigma \approx 0.3$
- SD3 / RF: $t = \text{sigmoid}(\tau), \tau \sim \mathcal{N}(0, 1)$, concentrated at $t = 0.5$
- Common idea: mid-noise is hardest to learn, so more sampling in the middle region gives gains
Saying only "uniform sampling" without knowing EDM/SD3 switched to normal/logit-normal; not knowing why mid-concentration.
L3 top-lab diffusion / video direction (deep derivation + distillation + production integration)
Q21. Derive $L_\text{simple} = \|\epsilon - \epsilon_\theta\|^2$ from the ELBO, listing every intermediate approximation and dropped term.
Derivation chain + approximation log:
- Step 1 (exact, no approximation): $\log p_\theta(x_0) \ge \mathbb{E}_q[\log p_\theta(x_{0:T})/q(x_{1:T}|x_0)]$ — Jensen's inequality gives the variational lower bound
- Step 2 ($L_T$ treated as a constant and ignored): split ELBO into $L = L_T + \sum_{t=2}^T L_{t-1} + L_0$. $L_T = \text{KL}(q(x_T|x_0)\,\lVert\, p(x_T))$ — not strictly 0, but near-constant when $\bar\alpha_T \approx 0$
- Step 3 ($L_0$ ignored / merged): $L_0 = -\mathbb{E}[\log p_\theta(x_0 | x_1)]$ — small contribution; in practice often modeled by a discretized Gaussian decoder and merged into $L_1$ during training
- Step 4 (KL is closed-form; with fixed $\Sigma_\theta$, the constant $C$ is dropped): $L_{t-1} = \mathbb{E}_q[\text{KL}(q(x_{t-1}|x_t, x_0) \,\lVert\, p_\theta(x_{t-1}|x_t))]$. Both are Gaussian → closed-form KL. If $\Sigma_\theta = \sigma_t^2 I$ is fixed:
$$L_{t-1} = \mathbb{E}\left[\frac{1}{2\sigma_t^2}\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2\right] + C$$
The constant $C$ comes from the $\Sigma$ log-determinant, independent of $\theta$ when $\Sigma$ is fixed, so it vanishes under the gradient.
- Step 5 (exact rewrite into $\epsilon$-pred form): substitute $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\epsilon)/\sqrt{\bar\alpha_t}$ into $\tilde\mu_t$ and write $\mu_\theta$ in the same $\epsilon$-pred parametrization:
$$L_{t-1} = \mathbb{E}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
Exact, as long as $\mu_\theta$ uses Ho 2020's $\epsilon$-pred form.
- Step 6 (drop the $t$-dependent coefficient): $L_\text{simple}$ sets the coefficient $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar\alpha_t)}$ uniformly to 1. Equivalent to reweighting across $t$ — at small $t$ (high SNR) the original coefficient is large → simple downweights it relatively; at large $t$ (low SNR) the original coefficient is small → simple upweights it.
- Step 7 ($t$ switched to uniform sampling): discrete $t$ becomes $t \sim \mathcal{U}\{1,\dots,T\}$, uniform over all timesteps, not weighted by ELBO term magnitudes.
Final:
$$L_\text{simple} = \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\},\, x_0,\, \epsilon}\big[\|\epsilon - \epsilon_\theta(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t)\|^2\big]$$
Cost:
- No longer a lower bound on $\log p$ (FID improves but likelihood evaluation no longer directly corresponds)
- $L_T$ and $L_0$ are implicitly dropped
- $\Sigma_\theta$ information lost (Improved DDPM restores it via $L_\text{vlb}$)
Not knowing which terms are dropped; thinking $L_\text{simple}$ comes directly from the KL; ignoring the role of $L_T, L_0$.
Q22. Prove DDIM ($\eta=0$) shares the same marginal $q(x_t|x_0)$ as DDPM.
Statement: DDIM defines a non-Markov forward $q_\sigma(x_{1:T}|x_0)$ such that $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$ — identical to DDPM.
Proof (by induction):
- Boundary $q_\sigma(x_T|x_0) = \mathcal{N}(\sqrt{\bar\alpha_T} x_0, (1-\bar\alpha_T) I)$ — holds directly by DDIM definition
- Assume $q_\sigma(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t) I)$. DDIM defines:
$$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}} x_0 + \sqrt{1-\bar\alpha_{t-1} - \sigma_t^2}\cdot \frac{x_t - \sqrt{\bar\alpha_t} x_0}{\sqrt{1-\bar\alpha_t}},\; \sigma_t^2 I\right)$$
- Compute $q_\sigma(x_{t-1}|x_0) = \int q_\sigma(x_{t-1}|x_t, x_0) q_\sigma(x_t|x_0)\, dx_t$ (marginalize two Gaussians)
- Apply Gaussian marginalization: if $x_t | x_0 \sim \mathcal{N}(\mu_t, \Sigma_t)$ and $x_{t-1}|x_t, x_0 \sim \mathcal{N}(A x_t + b, \Sigma_{t-1|t})$, then:
$$x_{t-1}|x_0 \sim \mathcal{N}\!\left(A \mu_t + b,\; A \Sigma_t A^\top + \Sigma_{t-1|t}\right)$$
Here $A = \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}/\sqrt{1-\bar\alpha_t}$, $b = \sqrt{\bar\alpha_{t-1}} x_0 - A \sqrt{\bar\alpha_t} x_0$. Substitute:
- Mean = $\sqrt{\bar\alpha_{t-1}} x_0 + A \sqrt{\bar\alpha_t} x_0 - A \sqrt{\bar\alpha_t} x_0 = \sqrt{\bar\alpha_{t-1}} x_0$
- Variance = $A^2 (1-\bar\alpha_t) + \sigma_t^2 = (1-\bar\alpha_{t-1}-\sigma_t^2) + \sigma_t^2 = 1 - \bar\alpha_{t-1}$
- So $q_\sigma(x_{t-1}|x_0) = \mathcal{N}(\sqrt{\bar\alpha_{t-1}} x_0, (1-\bar\alpha_{t-1}) I)$ — identical to DDPM $\square$
Significance: DDIM can directly use a DDPM-trained $\epsilon_\theta$ because training looks only at marginals $q(x_t|x_0)$, which agree; the sampling paths differ (deterministic vs stochastic).
Not knowing the Gaussian marginalization theorem; not seeing that the proof's key is $A^2(1-\bar\alpha_t) + \sigma_t^2 = 1-\bar\alpha_{t-1}$.
Q23. Derive $c_\text{skip}$ and $c_\text{out}$ in EDM preconditioning.
Setup: VE view $x = x_0 + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, $\text{Var}[x_0] = \sigma_\text{data}^2$. Denoiser parametrization:
$$D_\theta(x; \sigma) = c_\text{skip}(\sigma) x + c_\text{out}(\sigma) F_\theta(c_\text{in} x, c_\text{noise})$$
Effective target for $F_\theta$:
$$F^*(x_0, \sigma, \epsilon) = \frac{1}{c_\text{out}(\sigma)}\big[x_0 - c_\text{skip}(\sigma) x\big] = \frac{1}{c_\text{out}}\big[(1 - c_\text{skip}) x_0 - c_\text{skip} \sigma \epsilon\big]$$
Goal: find $c_\text{skip}, c_\text{out}$ so that $\text{Var}[F^*]$ (expectation over $x_0, \epsilon$) = 1.
$$\text{Var}[F^*] = \frac{1}{c_\text{out}^2}\big[(1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2\big] = 1$$
But normalization alone is non-unique. Second criterion (Karras 2022): minimize the "residual" the network must learn (minimize $c_\text{out}$, because a larger $c_\text{out}$ amplifies both $F$ and its error). Equivalent problem:
$$\min_{c_\text{skip}}\;\; c_\text{out}^2(c_\text{skip}) = (1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2$$
Differentiate w.r.t. $c_\text{skip}$ and set to 0:
$$-2(1 - c_\text{skip}) \sigma_\text{data}^2 + 2 c_\text{skip} \sigma^2 = 0 \quad \Rightarrow \quad c_\text{skip} = \frac{\sigma_\text{data}^2}{\sigma_\text{data}^2 + \sigma^2}$$
Substitute back into the $\text{Var}[F^*] = 1$ constraint:
$$c_\text{out}^2 = (1-c_\text{skip})^2 \sigma_\text{data}^2 + c_\text{skip}^2 \sigma^2 = \frac{\sigma^4 \sigma_\text{data}^2}{(\sigma^2+\sigma_\text{data}^2)^2} + \frac{\sigma_\text{data}^4 \sigma^2}{(\sigma^2+\sigma_\text{data}^2)^2} = \frac{\sigma^2 \sigma_\text{data}^2}{\sigma^2 + \sigma_\text{data}^2}$$
$$\boxed{\; c_\text{out}(\sigma) = \frac{\sigma \cdot \sigma_\text{data}}{\sqrt{\sigma^2 + \sigma_\text{data}^2}} \;}$$
Input normalization: $c_\text{in}(\sigma) = 1/\sqrt{\sigma_\text{data}^2 + \sigma^2}$ makes $\text{Var}[c_\text{in} x] = 1$.
Conclusion: the four $c$ functions are fully determined by $\sigma_\text{data}$, no tunable parameters (in practice $\sigma_\text{data}$ is computed from data; about 0.5 for normalized images).
Memorizing the formula without derivation; not knowing $c_\text{skip}$ is derived by minimizing $c_\text{out}$; thinking the $c$ functions have free parameters.
Q24. Consistency Distillation training procedure? Why is the EMA target $\theta^-$ needed?
Procedure:
- Take a pretrained teacher diffusion $\epsilon_\phi$ + its PF-ODE solver (e.g., EDM Heun)
- Take a noise schedule $\sigma_1 > \sigma_2 > \dots > \sigma_N = \sigma_\text{min}$ (typically $N = 18$)
- Train student $f_\theta(x_\sigma, \sigma) \to x_0$, initializing $\theta = \phi$ (warm start)
Each batch:
- Sample $x_0$, $\sigma_n$ (uniformly $n \in \{1, \dots, N-1\}$)
- Add noise: $x_{\sigma_{n+1}} = x_0 + \sigma_{n+1} \epsilon$
- Teacher ODE one-step solve: from $x_{\sigma_{n+1}}$ run one Heun step with teacher $\epsilon_\phi$ to get $\hat x_{\sigma_n}$
- Loss: $d(f_\theta(x_{\sigma_{n+1}}, \sigma_{n+1}), f_{\theta^-}(\hat x_{\sigma_n}, \sigma_n))$
- Update $\theta$; EMA-update $\theta^- \leftarrow \mu \theta^- + (1-\mu)\theta$
Why is the EMA target needed?
- Directly using $\theta = \theta^-$ has a trivial solution: $f_\theta \equiv \text{const}$ also satisfies consistency
- EMA $\theta^-$ lags $\theta$, providing a "stable" target so the student doesn't chase its own moving target
- Analogous to BYOL / MoCo self-supervised setups
- $\mu = 0.999 \sim 0.99995$ (depending on training steps)
Recent improvement (iCT, Song-Dhariwal 2024): remove EMA teacher (compute target with the same $\theta$, no $\theta^-$), use pseudo-Huber loss, combine with lognormal noise schedule + curriculum-increasing discretization steps; CT approaches CD quality.
Not knowing the trivial solution; thinking EMA is just an engineering stability trick; not knowing the teacher's role.
Q25. Why can SD3 / FLUX be compressed to 4-step / 1-step generation?
Core pathway: RF (linear path) + Reflow + Distill. Step-by-step:
- RF makes trajectories straight — $x_t = (1-t)x_0 + tx_1$, the ODE solution's "ideal curve" is a straight line (linear interpolation), so first-order Euler with long steps has small error (contrast: cosine paths have high curvature near mid-$t$)
- Reflow makes trajectories even straighter — after first training, run ODE to get coupled $(x_0, x_1)$, then train again; the trajectory converges closer to a straight line. Liu 2022 proves reflow monotonically reduces transport cost
- CFG-distillation — distill CFG's 2× forward (cond + uncond) into a single forward (FLUX does this); halves NFE
- Adversarial distillation (ADD) — SD3-Turbo / SDXL-Turbo late-stage fine-tune with DINOv2 discriminator + distill loss, 4-step approaches 30-step quality
Vs the DDPM route: DDPM trajectories have high curvature at mid-$t$ (cosine path), Euler first-order is unusable below 5 steps; you need DPM-Solver-2 second-order + consistency distillation to get to 4 steps. RF is much more engineering-friendly — a first-order sampler suffices.
FLUX-Schnell's 1-step: RF + reflow + heavy distillation; 1024px single forward generation, ~100ms/image. Cost: slightly reduced controllability / diversity; prompt-following accuracy slightly below multi-step.
Saying only "RF is faster than DDPM" without knowing why; not knowing reflow + distill are two-pronged; thinking FLUX 1-step is solely due to RF (in fact distillation also matters).
§A Appendix: Core PyTorch code (from scratch)
emphasizes the math; for production use diffusers / EDM official implementations, which include mixed precision / EMA / DDP / VAE / xformers / fused kernels.
A.1 DDPM forward $q(x_t | x_0)$ + simplified loss
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
return torch.linspace(beta_start, beta_end, T, dtype=torch.float64)
def cosine_beta_schedule(T: int, s: float = 0.008):
"""Nichol-Dhariwal 2021"""
ts = torch.arange(T + 1, dtype=torch.float64) / T
f = torch.cos(((ts + s) / (1 + s)) * math.pi / 2) ** 2
alpha_bar = f / f[0]
betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
return betas.clamp(max=0.999)
class DDPMSchedule:
"""Cache sqrt(α_bar), sqrt(1-α_bar), and other frequently used quantities."""
def __init__(self, betas: torch.Tensor):
self.T = len(betas)
self.betas = betas
alphas = 1.0 - betas
self.alphas = alphas
self.alpha_bar = torch.cumprod(alphas, dim=0)
self.sqrt_alpha_bar = torch.sqrt(self.alpha_bar)
self.sqrt_one_minus_alpha_bar = torch.sqrt(1.0 - self.alpha_bar)
# for sampling
self.alpha_bar_prev = torch.cat([torch.tensor([1.0]), self.alpha_bar[:-1]])
self.posterior_variance = betas * (1.0 - self.alpha_bar_prev) / (1.0 - self.alpha_bar)
def to(self, device):
for k, v in self.__dict__.items():
if isinstance(v, torch.Tensor):
setattr(self, k, v.to(device))
return self
def q_sample(sched: DDPMSchedule, x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor = None):
"""Sample x_t ~ q(x_t | x_0) = N(sqrt(α_bar_t) x_0, (1-α_bar_t) I)"""
if noise is None:
noise = torch.randn_like(x0)
sa = sched.sqrt_alpha_bar[t].view(-1, *([1] * (x0.dim() - 1))).to(x0.dtype)
so = sched.sqrt_one_minus_alpha_bar[t].view(-1, *([1] * (x0.dim() - 1))).to(x0.dtype)
return sa * x0 + so * noise
def ddpm_simple_loss(model: nn.Module, sched: DDPMSchedule, x0: torch.Tensor):
"""L_simple = E ‖ε - ε_θ(x_t, t)‖²"""
B = x0.shape[0]
t = torch.randint(0, sched.T, (B,), device=x0.device)
noise = torch.randn_like(x0)
x_t = q_sample(sched, x0, t, noise)
eps_pred = model(x_t, t)
return F.mse_loss(eps_pred, noise)
A.2 DDPM ancestral sampling
@torch.no_grad()
def ddpm_sample(model, sched: DDPMSchedule, shape, device, x_T=None):
"""Walk the full T-step ancestral chain from x_T ~ N(0, I)."""
x = torch.randn(shape, device=device) if x_T is None else x_T.to(device)
for t in reversed(range(sched.T)):
t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
eps_pred = model(x, t_b)
alpha_t = sched.alphas[t]
alpha_bar_t = sched.alpha_bar[t]
beta_t = sched.betas[t]
# Reverse mean (ε-pred form)
mean = (x - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
if t > 0:
sigma_t = torch.sqrt(sched.posterior_variance[t])
noise = torch.randn_like(x)
x = mean + sigma_t * noise
else:
x = mean # final step does not add noise
return x
A.3 DDIM sampling (with $\eta$)
@torch.no_grad()
def ddim_sample(
model,
sched: DDPMSchedule,
shape,
device,
num_steps: int = 50,
eta: float = 0.0, # 0 = deterministic DDIM; η=1 in dense-step limit matches DDPM variance
x_T=None,
):
"""Pick a sub-sequence of num_steps timesteps and run DDIM reverse."""
# Pick sub-sequence (linearly spaced)
step_size = sched.T // num_steps
timesteps = list(range(0, sched.T, step_size))
timesteps = timesteps + [sched.T - 1]
timesteps = sorted(set(timesteps)) # dedupe / sort
x = torch.randn(shape, device=device) if x_T is None else x_T.to(device)
for i in reversed(range(1, len(timesteps))):
t = timesteps[i]
t_prev = timesteps[i - 1]
t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
alpha_bar_t = sched.alpha_bar[t]
alpha_bar_prev = sched.alpha_bar[t_prev]
eps_pred = model(x, t_b)
# 1) x_0 estimate via Tweedie / ε-pred
x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
# 2) σ_t² = η² · (1-α_bar_prev)/(1-α_bar_t) · (1 - α_bar_t/α_bar_prev)
sigma_t_sq = (eta ** 2) * (1 - alpha_bar_prev) / (1 - alpha_bar_t) * \
(1 - alpha_bar_t / alpha_bar_prev)
sigma_t = torch.sqrt(sigma_t_sq.clamp(min=0))
# 3) DDIM step
dir_xt = torch.sqrt((1 - alpha_bar_prev - sigma_t_sq).clamp(min=0)) * eps_pred
noise = torch.randn_like(x) if eta > 0 else 0
x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt + sigma_t * noise
# Final step uses x0_hat (no noise added)
return x0_hat
A.4 Classifier-Free Guidance training + sampling
class ConditionedEpsNet(nn.Module):
"""Demo: condition is a class label embedding, dropped with prob p_drop during training.
In real projects swap self.backbone with UNet / DiT and feed c_emb + t_emb."""
def __init__(self, dim, num_classes, p_drop=0.1, backbone: nn.Module = None):
super().__init__()
self.p_drop = p_drop
# NULL class uses index num_classes ("empty" embedding)
self.cls_emb = nn.Embedding(num_classes + 1, dim)
self.null_idx = num_classes
self.backbone = backbone # placeholder: self.backbone(x, t, c_emb) returns ε
def forward(self, x, t, c=None):
# During training, randomly drop condition to NULL
if self.training and c is not None:
mask = torch.rand(c.shape[0], device=c.device) < self.p_drop
c = torch.where(mask, torch.full_like(c, self.null_idx), c)
elif c is None:
c = torch.full((x.shape[0],), self.null_idx, device=x.device, dtype=torch.long)
c_emb = self.cls_emb(c)
# Concat c_emb onto the timestep embedding, run through UNet / DiT
eps_pred = self.backbone(x, t, c_emb)
return eps_pred
@torch.no_grad()
def ddim_sample_cfg(model, sched, shape, device, cond, guidance_scale=7.5, num_steps=50):
"""CFG-DDIM: two forwards (cond + uncond) per step, composed into ε_tilde."""
step_size = sched.T // num_steps
timesteps = sorted(set(list(range(0, sched.T, step_size)) + [sched.T - 1]))
x = torch.randn(shape, device=device)
null_cond = torch.full_like(cond, model.null_idx)
for i in reversed(range(1, len(timesteps))):
t, t_prev = timesteps[i], timesteps[i - 1]
t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
eps_cond = model(x, t_b, cond)
eps_uncond = model(x, t_b, null_cond)
# CFG: note convention — here we use HF style guidance_scale=w (w=1 unguided)
eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
alpha_bar_t = sched.alpha_bar[t]
alpha_bar_prev = sched.alpha_bar[t_prev]
x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
dir_xt = torch.sqrt(1 - alpha_bar_prev) * eps
x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt # η=0 deterministic
return x0_hat
A.5 EDM preconditioning + Heun 2nd-order sampler
class EDMDenoiser(nn.Module):
"""D_θ(x; σ) = c_skip(σ) x + c_out(σ) F_θ(c_in(σ) x, c_noise(σ))"""
def __init__(self, backbone: nn.Module, sigma_data: float = 0.5):
super().__init__()
self.backbone = backbone # outputs same shape as x
self.sigma_data = sigma_data
def forward(self, x: torch.Tensor, sigma: torch.Tensor):
# σ shape [B] -> broadcast to x shape
s = sigma.view(-1, *([1] * (x.dim() - 1))).to(x.dtype)
sd2 = self.sigma_data ** 2
c_skip = sd2 / (s ** 2 + sd2)
c_out = s * self.sigma_data / torch.sqrt(s ** 2 + sd2)
c_in = 1.0 / torch.sqrt(s ** 2 + sd2)
c_noise = 0.25 * torch.log(sigma).flatten() # 1D fed to backbone
F = self.backbone(c_in * x, c_noise)
return c_skip * x + c_out * F
def edm_loss(D: EDMDenoiser, x0: torch.Tensor,
P_mean: float = -1.2, P_std: float = 1.2):
"""EDM L = E [ λ(σ) ‖D_θ(x_0 + σε, σ) - x_0‖² ]; λ = 1/c_out².
Implemented as unweighted F-loss: equivalent to weighted D-loss."""
B = x0.shape[0]
log_sigma = P_mean + P_std * torch.randn(B, device=x0.device)
sigma = log_sigma.exp()
eps = torch.randn_like(x0)
x = x0 + sigma.view(-1, *([1] * (x0.dim() - 1))) * eps
D_pred = D(x, sigma)
s = sigma.view(-1, *([1] * (x0.dim() - 1)))
sd2 = D.sigma_data ** 2
weight = (s ** 2 + sd2) / (s * D.sigma_data) ** 2 # = 1/c_out²
loss = (weight * (D_pred - x0) ** 2).mean()
return loss
def edm_sigma_schedule(N: int, sigma_min: float = 0.002,
sigma_max: float = 80.0, rho: float = 7.0,
device: str = "cpu"):
"""Karras ρ-schedule: σ_i = (σ_max^{1/ρ} + i/(N-1) · (σ_min^{1/ρ} - σ_max^{1/ρ}))^ρ"""
i = torch.arange(N, device=device, dtype=torch.float64)
sigmas = (sigma_max ** (1 / rho) +
i / (N - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
return torch.cat([sigmas, torch.zeros(1, device=device)]).to(torch.float32) # trailing σ=0
@torch.no_grad()
def edm_heun_sample(D: EDMDenoiser, shape, sigmas: torch.Tensor, device):
"""Heun (2nd-order) ODE solver. 2 NFE per step; last step degenerates to Euler."""
x = torch.randn(shape, device=device) * sigmas[0]
for i in range(len(sigmas) - 1):
sigma = sigmas[i]
sigma_next = sigmas[i + 1]
sigma_b = sigma.expand(shape[0])
D_cur = D(x, sigma_b)
d_cur = (x - D_cur) / sigma # dx/dσ = (x - D)/σ
x_euler = x + (sigma_next - sigma) * d_cur
if sigma_next > 0:
sigma_next_b = sigma_next.expand(shape[0])
D_next = D(x_euler, sigma_next_b)
d_next = (x_euler - D_next) / sigma_next
x = x + (sigma_next - sigma) * 0.5 * (d_cur + d_next)
else:
x = x_euler # final-step Euler
return x
A.6 Probability Flow ODE simple Euler solver
@torch.no_grad()
def pf_ode_sample_euler(eps_model, sched: DDPMSchedule, shape, device, num_steps: int = 50):
"""PF-ODE Euler sampler under VP view.
dx/dt = f(t) x - (1/2) g²(t) s_θ(x, t), s_θ = -ε_θ / sqrt(1-α_bar_t)
In discrete schedule this degenerates to DDIM η=0 + time grid."""
# Pick sub-sequence
step_size = sched.T // num_steps
timesteps = sorted(set(list(range(0, sched.T, step_size)) + [sched.T - 1]))
x = torch.randn(shape, device=device)
for i in reversed(range(1, len(timesteps))):
t, t_prev = timesteps[i], timesteps[i - 1]
t_b = torch.full((shape[0],), t, device=device, dtype=torch.long)
eps_pred = eps_model(x, t_b)
alpha_bar_t = sched.alpha_bar[t]
alpha_bar_prev = sched.alpha_bar[t_prev]
# Equivalent DDIM η=0 form
x0_hat = (x - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
dir_xt = torch.sqrt(1 - alpha_bar_prev) * eps_pred
x = torch.sqrt(alpha_bar_prev) * x0_hat + dir_xt
return x0_hat
A.7 Sanity-check output (pedagogical version)
Run a 64×64 ImageNet subset toy setup, 2-layer UNet baseline, sched=cosine, T=1000:
[a] q_sample shape ok, σ_t variance ≈ 1-α_bar_t ✓
[b] simple loss converges (5k steps): 0.42 → 0.18 ✓
[c] DDPM 1000-step sample: FID (toy) ~ 22.5
[d] DDIM 50-step (η=0): FID (toy) ~ 23.1 ← close to DDPM 1000, 20× speedup
[e] DDIM 50-step (η=1): FID (toy) ~ 22.7 ← η=1 matches DDPM variance, not strict 1000-step DDPM
[f] CFG w=7.5 conditional: visually significant text-alignment boost ✓
[g] EDM Heun 35-NFE: FID (toy) ~ 18.3 ← much better than DDIM 50
[h] PF-ODE Euler 50-step: numerically agrees with DDIM η=0 ✓
Main references: Ho 2020 (DDPM, NeurIPS), Nichol-Dhariwal 2021 (Improved DDPM, ICML), Song-Ermon 2019 (NCSN, NeurIPS), Song 2021 (Score SDE, ICLR), Song 2020 arXiv / ICLR 2021 (DDIM), Karras 2022 (EDM, NeurIPS), Lu 2022/2023 (DPM-Solver / DPM-Solver++), Ho-Salimans 2022 arXiv (CFG; short version: NeurIPS 2021 Workshop on DGMs), Dhariwal-Nichol 2021 (Classifier Guidance, NeurIPS), Rombach 2022 (LDM/SD, CVPR), Podell 2023 arXiv / ICLR 2024 (SDXL), Esser 2024 (SD3, ICML), Peebles-Xie 2023 (DiT, ICCV), Zhang 2023 (ControlNet, ICCV), Song 2023 (Consistency Models, ICML), Luo 2023 (LCM, arXiv), Sauer 2023/2024 (SDXL-Turbo / SD3-Turbo, arXiv).
Diffusion Foundations Cheat Sheet · formulas + from-scratch code + 25 frequently-asked questions (L1 must-know · L2 intermediate · L3 top lab)