Diffusion Post Training Tutorial En

Source: docs/tutorials/diffusion_post_training_tutorial_en.md SHA256: 6143ed98abc2 Rendered: 2026-05-19 18:54 UTC

§0 TL;DR

Diffusion Post-Training in 9 sentences

one page covering the RL/DPO/Flow-RL family (see §1–§10 for derivations).

  1. Why it's hard: diffusion generation runs $T$ steps (typically 20–50), with reward given only once at the terminal state $x_0$ — so sparse terminal reward + long denoising trajectory + credit assignment combine, giving one extra "trajectory integration" dimension beyond LLM RLHF.
  2. Three main lines: (i) RL on denoising MDP (DDPO / DPOK — treat $T$-step denoising as an MDP); (ii) Direct reward backprop (DRaFT / AlignProp / ReFL — treat reward as a differentiable loss and backprop through the $T$-step sampler); (iii) Preference optimization (Diffusion-DPO / D3PO / SPO / Diffusion-KTO / MaPO — port the LLM DPO family to diffusion).
  3. DDPO (Black et al. 2024 ICLR, arXiv 2305.13301): view denoising as a $T$-step MDP, state $= (x_t, t, c)$, action $= x_{t-1}$, per-trajectory reward $R(x_0, c)$; use REINFORCE or PPO-clip to update $\log p_\theta(x_{t-1} \mid x_t, c)$.
  4. AlignProp (Prabhudesai et al. 2024 ICLR, arXiv 2310.03739) & DRaFT (Clark et al. 2024 ICLR, arXiv 2309.17400): when reward $R$ is differentiable in $x_0$, directly backprop $R(x_0)$ through the $T$-step sampler into $\theta$. The key engineering problem: memory is $\mathcal{O}(T)$; DRaFT-K / AlignProp only backprop the final $K$ steps (typically $K \in \{1, 5\}$), and combined with gradient checkpointing the memory drops to $\mathcal{O}(K)$.
  5. Diffusion-DPO (Wallace et al. 2024 CVPR, arXiv 2311.12908): replace the LLM DPO's $\log\pi/\pi_\text{ref}$ with diffusion's per-step ELBO surrogate — specifically, use $-\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ as one lower-bound term of $\log p_\theta(x_0)$, and assemble it into a DPO contrastive loss over the pair $(y_w, y_l)$.
  6. D3PO (Yang et al. 2024 CVPR, arXiv 2311.13231): completely RM-free — directly plug human thumbs up/down feedback on generated images into the KL-regularized optimal solution's implicit reward; the derivation parallels DPO but operates on the diffusion per-step Markov chain.
  7. SPO (Liang et al. 2024, arXiv 2406.04314): observes that different denoising steps have different preferences (high-noise steps learn composition, low-noise steps learn detail), and extends DPO to be step-aware — for each $t$, independently sample an in-step pair $(x_{t-1}^w, x_{t-1}^l)$, and weight the loss along the step dimension.
  8. Flow-GRPO (Liu et al. 2025, arXiv 2505.05470): the first work bringing GRPO to Flow Matching. Two key tricks: an ODE→SDE equivalent conversion that turns the deterministic flow into an explorable stochastic process; and denoising reduction — fewer steps for training, full steps for inference. RL-tuned SD3.5-M raised GenEval from 63% to 95%.
  9. Reward hacking is the real boss: over-saturated colors, monotonous composition, style convergence, high PickScore but ugly to humans — mitigations include reward ensembles (HPSv2 + PickScore + ImageReward + CLIP-Score), KL anchor (Diffusion-DPO's $\beta$), and early stopping on reward plateau. SD3 / FLUX barely disclose post-training details, but the community consensus is that SD3.5 Turbo and FLUX.1 dev use a DPO + distillation hybrid.
vs LLM RLHF in one sentence

LLM RLHF cares about "token-level credit assignment + KL anchor"; diffusion post-training cares about "denoising-step credit assignment + memory blow-up (backprop) or sample blow-up (RL)". Same problem in essence — sparse reward + long trajectory — only the physical meaning of "trajectory" changes.

§1 Intuition: why diffusion post-training is hard

1.1 Single-step vs multi-step generation: the essential difference

LLM rewards are generally also sequence-level, but tokens are discrete, trajectory length $L \sim 10^3$, vocabulary moderate. The diffusion "trajectory" is $T$ steps of denoising, each operating on a continuous high-dimensional tensor $x_t \in \mathbb{R}^{C \times H \times W}$ (SDXL latent is $4\times128\times128 = 65536$ dimensions), with $T$ typically 20–50.

DimensionLLM RLHFDiffusion Post-Training
Trajectory length$L$ (response token count)$T$ (denoising step count, typically 20–50)
Single-step actiondiscrete tokencontinuous vector in $\mathbb{R}^d$ ($d \sim 10^4$–$10^5$)
Reward frequencyusually only at the terminalusually only at terminal $x_0$
Explorationsampling temperature / top-pDDIM is deterministic; "noise injection" needed to explore (DDPO uses stochastic DDPM; Flow-GRPO uses ODE→SDE)
Reward sourcetrained RM (BT) / ruletrained image RM (ImageReward / HPSv2 / PickScore) / rule (object count / OCR)
Memory bottleneck4 copies (policy + ref + RM + V)1 copy UNet/DiT, but direct backprop must store $T$-step activations

1.2 The three main lines

                              preference / reward signal
                                    │
            ┌───────────────────────┼───────────────────────┐
            │                       │                       │
       reward non-diff           reward diff             preference pairs (offline)
            │                       │                       │
        Line A: RL              Line B: backprop         Line C: DPO family
        DDPO, DPOK,             DRaFT, AlignProp,        Diffusion-DPO,
        Flow-GRPO               ReFL                     D3PO, SPO, KTO, MaPO
            │                       │                       │
        most general            memory-friendly (K-step)   off-policy fast
        but sampling expensive  but requires diff reward   needs preference data

1.3 One-line intuitions

Core intuitions

1.4 Convention (used throughout)

SymbolMeaning
$x_0$clean image (latent or pixel)
$x_t,\; t = 0, \dots, T$noised sample; $x_T \approx \mathcal{N}(0, I)$
$\epsilon_\theta(x_t, t, c)$noise predicted by UNet/DiT (DDPM parameterization)
$v_\theta(t, x, c)$Flow Matching vector field
$c$condition (text embedding / class)
$p_\theta(x_{t-1} \mid x_t, c)$reverse process single-step conditional distribution
$R(x_0, c)$terminal reward (scalar, from RM or rule)
$\pi_\text{ref}$ / $p_\text{ref}$reference model (usually the SFT-trained base)
$\beta$KL/temperature hyperparameter (same meaning as in LLM DPO)

§2 RL for Diffusion: DDPO and DPOK

2.1 Treating denoising as an MDP (DDPO viewpoint)

Black et al. 2024 ICLR Training Diffusion Models with Reinforcement Learning (arXiv 2305.13301) key observation: the DDPM/DDIM reverse process is itself a finite-horizon MDP.

Definitions:

Policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\!\left[\sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\, R(x_0, c)\right]$$

Key: $\log p_\theta(x_{t-1} \mid x_t, c)$ is Gaussian in DDPM, so the log-prob is analytical and the gradient can be computed directly.

2.2 DDPO-SF (Score Function) algorithm

The simplest variant (DDPO-SF, score function estimator):

  1. Sampling phase — starting from prompt $c$, run $T$ DDPM reverse steps to get a trajectory $\tau = (x_T, x_{T-1}, \dots, x_0)$; compute $R(x_0, c)$.
  2. Update phase — REINFORCE-style gradient estimator:

$$\hat{g} = \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1}^{(n)} \mid x_t^{(n)}, c)\, (R^{(n)} - b)$$

where $b$ is a baseline (typically the batch-mean reward).

2.3 DDPO-IS (Importance Sampling, PPO-style)

The practical recommended DDPO variant uses PPO-clip with per-step importance ratios:

$$\rho_t = \frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_\text{old}}(x_{t-1} \mid x_t, c)}, \qquad L^\text{CLIP}_t = \min\!\big(\rho_t R, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) R\big)$$

Per-step ratio, not trajectory ratio

diffusion PPO uses per-step importance ratios, not the product over $T$ steps. The full-trajectory ratio is the product of $T$ ratios — variance explodes; clipping per-step independently is the only way to stay stable.

2.4 DDPO's two reward experiments

Black et al. run DDPO + SD-1.5 on four rewards:

Reward typeExampleSignal nature
CompressibilityJPEG file sizerule-based scalar
AestheticLAION aesthetic predictortrained MLP
Prompt-image alignmentCLIP-Score / LLaVA judgeVLM-based
Object presenceDETR / OWL-ViT countrule-based

Across all four rewards, DDPO yields large gains over reward-weighted regression (RWR baseline), and can shift styles (e.g. emoji → oil painting) — proof that RL is genuinely exploring, not just mode-seeking.

2.5 DPOK: KL-regularized RL for diffusion

Fan et al. 2023 NeurIPS DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models (arXiv 2305.16381) is contemporaneous with DDPO; the difference is an explicit KL anchor:

$$\boxed{\;\max_\theta\; \mathbb{E}_{\tau \sim p_\theta}\!\left[R(x_0, c)\right] - \beta\, \mathbb{E}_c\!\left[\text{KL}\!\big(p_\theta(\cdot \mid c) \,\big\Vert\, p_\text{ref}(\cdot \mid c)\big)\right]\;}$$

The KL term expanded to per-step:

$$\text{KL}(p_\theta \Vert p_\text{ref}) = \sum_{t=1}^{T} \mathbb{E}\!\left[\text{KL}\!\big(p_\theta(x_{t-1} \mid x_t, c) \,\Vert\, p_\text{ref}(x_{t-1} \mid x_t, c)\big)\right]$$

Since the DDPM $p_\theta(x_{t-1} \mid x_t, c)$ is Gaussian, the KL between two Gaussians has a closed form, so per-step KL is directly computable. DPOK uses policy gradient + this KL penalty, the diffusion analog of RLHF's "$\beta \log(\pi/\pi_\text{ref})$ added to reward".

DPOK vs DDPO

2.6 DDPO failure modes and mitigations

SymptomCauseMitigation
Reward up but FID crashesover-optimization on RM blind spotsKL penalty / LoRA fine-tune (prevent base drift)
Same prompt converges to one compositionmode collapse, policy finds RM high-score modereward ensemble / early stop
High reward but ugly to humansRM scale not aligned with humansweighted multi-RM + human eval calibration
Training unstableper-step ratio accumulates over $T$ stepsper-step PPO-clip $\epsilon = 0.1$ is more stable than the LLM $0.2$

§3 Direct Reward Fine-Tuning: DRaFT / AlignProp / ReFL

3.1 Core idea: treat reward as a differentiable loss

If the reward $R(x_0, c)$ is differentiable in $x_0$ (true for typical CNN/ViT RMs) and the diffusion sampler is differentiable, then we can directly backprop to $\theta$:

$$\theta \leftarrow \theta + \eta\, \nabla_\theta R\!\big(x_0(\theta), c\big), \quad x_0(\theta) = \text{Sample}_\theta^T(c)$$

where $\text{Sample}_\theta^T(c)$ denotes running $T$ reverse steps from $x_T \sim \mathcal{N}(0,I)$ to obtain $x_0$. This treats the entire reverse trajectory as one giant differentiable computation graph and optimizes reward end-to-end.

Advantage

no sampling variance; the gradient signal has far lower variance than REINFORCE-style RL.

Cost

must store $T$ steps of activations; vanilla implementation uses $\mathcal{O}(T \cdot M_\text{UNet})$ memory, which for SDXL UNet is ≈ hundreds of GB, completely untrainable.

3.2 DRaFT (Clark et al. 2024 ICLR, arXiv 2309.17400)

Directly Fine-Tuning Diffusion Models on Differentiable Rewards. Two core tricks:

Trick 1: DRaFT-K, only backprop the last $K$ steps.

The full chain rule (denoise step $\epsilon_\theta(x_t,t)$ both influences the next step $x_{t-1}$ and directly depends on $\theta$):

$$\nabla_\theta R(x_0) = \frac{\partial R}{\partial x_0} \cdot \sum_{t=1}^{K}\left(\prod_{s=1}^{t-1} \frac{\partial x_{s-1}}{\partial x_s}\right) \cdot \frac{\partial x_{t-1}}{\partial \theta}\bigg|_{\text{direct}}$$

where $\partial x_{t-1}/\partial\theta|_\text{direct}$ is the partial derivative of step-$t$ on $\theta$ via $\epsilon_\theta(x_t,t)$ directly (not through the indirect $x_t \to x_t$ path), and $\prod_s$ is the Jacobian product propagation during backward. The first $T-K$ steps run under torch.no_grad(); only the last $K$ steps retain the graph. K=1 (DRaFT-1) already gives a very strong signal — the last step has the largest direct effect on $x_0$. Autograd automatically sums the $\partial/\partial\theta|_\text{direct}$ across all $K$ steps, so the code only needs loss.backward().

Trick 2: LoRA fine-tune + high learning rate.

Only train a LoRA adapter (~1% parameters), with the base UNet frozen. Combined with gradient checkpointing, memory fits within 24 GB on a single GPU.

Pseudo-code:

# DRaFT-K one-step training
x_t = torch.randn(B, C, H, W).to(device)  # x_T
with torch.no_grad():
    for t in range(T-1, K, -1):           # first T-K steps without grad
        x_t = ddim_step(unet_lora, x_t, t, cond)
for t in range(K, 0, -1):                  # last K steps with grad
    x_t = ddim_step(unet_lora, x_t, t, cond)
x_0 = x_t
reward = image_rm(x_0, prompt)             # ImageReward / HPSv2
loss = -reward.mean()                       # note negative sign — we maximize reward
loss.backward()                             # memory O(K)
optimizer.step()
Is DRaFT-1 equivalent to REINFORCE?

No. DRaFT-1 is a true reparameterized gradient (pathwise estimator); REINFORCE is a score-function estimator. The former has very low variance but requires reward to be differentiable; the latter is general but high variance. $\nabla \log p$ vs $\partial x / \partial \theta$ are two different gradient estimator classes.

3.3 AlignProp (Prabhudesai et al. arXiv 2310.03739, 2023-10; ICLR 2024 venue; the arXiv version was later superseded/withdrawn)

Aligning Text-to-Image Diffusion Models with Reward Backpropagation — almost parallel to DRaFT (both appeared on arXiv in late 2023), with the same core idea: reward backprop through denoising. Differences:

DimensionDRaFTAlignProp
TruncationDRaFT-K, last $K$ steps retain gradrandomly selected $K$ steps retain grad (randomized truncated BPTT)
Memory optimizationgradient checkpointinggradient checkpointing + LoRA
Recommended rewardHPSv1, PickScore, AestheticImageReward, HPSv2, PickScore
Mode collapse mitigationsimple KL anchorLoRA scale annealing + early stop

AlignProp's key contribution is theorizing "why reward backprop works" — proving that under fixed-point assumptions, the truncated BPTT gradient is a biased but low-variance estimator of the true gradient.

3.4 ReFL (Xu et al. 2023 NeurIPS, arXiv 2304.05977)

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. This is the original ImageReward paper, which also proposes the ReFL (Reward Feedback Learning) algorithm.

ReFL is close to DRaFT in spirit but predates it. It treats reward directly as a loss and supervises only at one randomly selected intermediate step, an early implementation of the DRaFT idea:

$$\mathcal{L}_\text{ReFL} = \mathcal{L}_\text{simple} - \lambda \cdot \mathbb{E}_{t' \sim [t_\text{min}, t_\text{max}]}\!\big[R\big(\hat{x}_0(x_{t'}, t')\big)\big]$$

where $\hat{x}_0(x_{t'}, t') = (x_{t'} - \sqrt{1-\bar\alpha_{t'}}\epsilon_\theta(x_{t'}, t'))/\sqrt{\bar\alpha_{t'}}$ is the $x_0$ estimate from a single-step $\epsilon$-prediction (one-step Tweedie unfolding).

Key difference: ReFL is "single-step backprop + $L_\text{simple}$ trained jointly", while DRaFT/AlignProp is "multi-step backprop + pure reward loss". ReFL is more stable but reward gains are smaller, because it only sees the one-step $x_0$ estimate rather than the true sampling trajectory.

3.5 Three-way comparison

AlgorithmBackprop strategyMemoryReward gainStability
ReFL (Xu 2023)single-step $\hat{x}_0$ + $L_\text{simple}$ hybrid$\mathcal{O}(1)$smallhigh
DRaFT-K (Clark 2024)last $K$ steps BPTT$\mathcal{O}(K)$largemedium (large $K$ prone to over-optimization)
AlignProp (Prabhudesai 2024)random $K$ steps BPTT$\mathcal{O}(K)$largemedium
Memory back-of-the-envelope

SDXL UNet ~2.6B parameters, one forward in fp16 needs ~6–8 GB activations; $K = 5$ ~ 30–40 GB; $K = T = 50$ > 300 GB, only feasible with multi-machine sharding. This is why $K=1$ is the most common in practice — negligible precision loss, engineering-friendly.

3.6 Reward hacking in direct reward backprop

Direct backprop is more prone to hacking than RL because the gradient signal is "too precise":

SymptomExample
Over-saturationHPSv2 prefers high contrast → trained images become hyper-saturated
Style monotonicityImageReward training data is biased → all prompts produce the same style
Trypophobia patternssome RMs prefer "dense textures", and the model learns trypophobic patterns
Mode collapsemultiple samples from the same prompt are nearly identical

Mitigations: reward ensemble (mean or min of HPSv2 + PickScore + ImageReward), KL anchor, early stop, small LoRA scale.

§4 Preference Optimization: the Diffusion-DPO family

4.1 Diffusion-DPO (Wallace et al. 2024 CVPR, arXiv 2311.12908)

Diffusion Model Alignment Using Direct Preference Optimization. The key challenge in porting LLM DPO to diffusion: diffusion's $\log p_\theta(x_0 \mid c)$ has no closed form — must use an ELBO substitute.

4.1.1 Derivation (key steps)

The LLM DPO core is the KL-regularized RL optimal solution:

$$\pi^*(y \mid x) \propto \pi_\text{ref}(y \mid x) \exp\!\big(r(x, y) / \beta\big)$$

Solving for $r = \beta \log(\pi^*/\pi_\text{ref}) + \beta \log Z$ and substituting into Bradley-Terry, the $\log Z$ cancels.

For diffusion, replace "sample $y$" with "sample trajectory $(x_T, \dots, x_0)$"; the optimal form is the same, but $\log p$ is the full-trajectory likelihood:

$$\log p_\theta(x_{0:T} \mid c) = \log p(x_T) + \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t, c)$$

This trajectory log-likelihood is analytical (each term is a Gaussian log-prob). However: if every update step requires a full trajectory run, the compute cost explodes.

Wallace et al.'s trick: use an ELBO surrogate.

DDPM's $L_\text{simple}$ is one of the (negative) ELBO terms of $\log p_\theta(x_0)$; specifically:

$$-\log p_\theta(x_0 \mid c) \le L_\text{simple}(x_0, c, \theta) = \mathbb{E}_{t, \epsilon}\!\left[\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2\right] + \text{const}$$

Use $-L_\text{simple}$ as a single-sample estimate of $\log p_\theta(x_0 \mid c)$ (Jensen's inequality strictly gives a lower bound, but numerically it is usable as a DPO implicit-reward proxy) and substitute into the DPO framework:

$$\boxed{\;\mathcal{L}_\text{Diff-DPO}(\theta) = -\mathbb{E}_{(x_0^w, x_0^l, c, t, \epsilon)}\log\sigma\!\left(-\beta T\!\left[\|\epsilon^w - \epsilon_\theta(x_t^w, t, c)\|^2 - \|\epsilon^w - \epsilon_\text{ref}(x_t^w, t, c)\|^2 - \|\epsilon^l - \epsilon_\theta(x_t^l, t, c)\|^2 + \|\epsilon^l - \epsilon_\text{ref}(x_t^l, t, c)\|^2\right]\right)\;}$$

Intuitive reading

the inside of the sigmoid is "for $y_w$, policy denoises better than ref" minus "for $y_l$, policy denoises better than ref". If the policy is more accurate on $y_w$ and less accurate on $y_l$, the difference is positive and the loss decreases.

4.1.2 Implementation details

Crucial: shared noise

the paper emphasizes that $x_t^w$ and $x_t^l$ must use the same $\epsilon$ (paired noise); otherwise $\beta$ becomes incomparable and loss variance explodes. This is the most common Diffusion-DPO footgun.

4.1.3 Results (on SDXL)

4.2 D3PO (Yang et al. 2024 CVPR, arXiv 2311.13231)

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model. Posted to arXiv around the same time as Diffusion-DPO (2023-11); the difference is in derivation path:

The final D3PO loss form is nearly the same as Diffusion-DPO's:

$$\mathcal{L}_\text{D3PO}(\theta) = -\mathbb{E}_{(\tau^w, \tau^l)} \log\sigma\!\left(\beta \sum_{t=1}^{T}\!\left[\log\frac{p_\theta(x_{t-1}^w \mid x_t^w, c)}{p_\text{ref}(x_{t-1}^w \mid x_t^w, c)} - \log\frac{p_\theta(x_{t-1}^l \mid x_t^l, c)}{p_\text{ref}(x_{t-1}^l \mid x_t^l, c)}\right]\right)$$

What's needed is a full trajectory pair $(\tau^w, \tau^l)$; if the preference pair only has the final image $(x_0^w, x_0^l)$, reconstruct the trajectory using $q(x_{1:T} \mid x_0)$ (with DDPM's forward q-sample).

Diffusion-DPO vs D3PO in practice

the two are mathematically equivalent (under the ELBO surrogate, D3PO's trajectory log-ratio reduces to Diffusion-DPO's single-step $\epsilon$-distance difference). In practice:

4.3 SPO (Liang et al. 2024, arXiv 2406.04314)

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step. Key observation: different denoising steps are responsible for different aspects of the image:

Diffusion-DPO's "single-$t$ sampling" treats all steps equally — but human preferences have different "importance" across steps.

4.3.1 The two SPO modifications

Modification 1: In-step preference — for the same $x_t$, independently sample two $x_{t-1}^w, x_{t-1}^l$, and use a step-wise reward model to judge "which $x_{t-1}$ is better at step $t$".

Modification 2: Step-aware weighting — the SPO loss is weighted along the step dimension:

$$\mathcal{L}_\text{SPO}(\theta) = -\mathbb{E}_{t \sim w(t),\; x_t}\!\left[\log\sigma\!\left(\beta\!\log\frac{p_\theta(x_{t-1}^w \mid x_t, c)}{p_\text{ref}(x_{t-1}^w \mid x_t, c)} - \beta\!\log\frac{p_\theta(x_{t-1}^l \mid x_t, c)}{p_\text{ref}(x_{t-1}^l \mid x_t, c)}\right)\right]$$

where $w(t)$ is the step sampling distribution (typically uniform or biased toward middle $t$).

4.3.2 The in-step reward model

To obtain the in-step preference $(x_{t-1}^w, x_{t-1}^l)$, SPO trains a step-wise reward model $R(x_{t-1}, x_t, c, t)$ that judges "given $x_t$, is $x_{t-1}$ a good transition at step $t$?" It does not directly score pixel quality of $x_{t-1}$ but rather estimates whether this transition at step $t$ leads to a high-quality $x_0$.

The key SPO win

given the same preference data, SPO's effective signal is $\times T$ (each prompt yields a pair at every one of $T$ steps). Empirically SPO outperforms Diffusion-DPO on PickScore / HPSv2 by 1–3 points.

4.4 Diffusion-KTO (Li et al. 2024 NeurIPS, arXiv 2404.04465)

Aligning Diffusion Models by Optimizing Human Utility. The diffusion version of LLM KTO (Ethayarajh 2024, arXiv 2402.01306).

LLM KTO core idea: replace the BT preference model with Kahneman-Tversky prospect theory, requiring only per-sample binary feedback (thumbs up/down), no pairs needed:

$$L_\text{KTO} = \mathbb{E}_{x, y}\!\left[\lambda_y v\!\big(\beta \log\frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)} - z_0(x)\big)\right]$$

where $v(\cdot)$ is the prospect-theoretic value function (thumbs up uses $1 - \sigma(\cdot)$, thumbs down uses $\sigma(\cdot)$), and $z_0$ is the reference utility.

Diffusion-KTO replaces $\log(\pi_\theta/\pi_\text{ref})$ with the Diffusion-DPO $\epsilon$-distance ELBO surrogate:

$$L_\text{Diff-KTO} = \mathbb{E}_{x_0, c, \text{label}}\!\left[\lambda_\text{label}\, v\!\left(\beta T \left[\|\epsilon - \epsilon_\text{ref}\|^2 - \|\epsilon - \epsilon_\theta\|^2\right] - z_0(c)\right)\right]$$

Practical value of Diffusion-KTO

in industrial settings, binary feedback (like/dislike) vastly outnumbers paired comparisons; KTO lets you use that data directly.

4.5 MaPO (Hong et al. 2024, arXiv 2406.06424)

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference. Core idea: drop the reference model entirely — similar in spirit to LLM SimPO.

MaPO loss simultaneously optimizes two things:

  1. Likelihood margin: $\log p_\theta(x_0^w) - \log p_\theta(x_0^l)$ (estimated via the ELBO surrogate $\|\epsilon - \epsilon_\theta\|^2$).
  2. Likelihood of preferred: $\log p_\theta(x_0^w)$ itself must be high (to prevent "both sides dropping").

$$\mathcal{L}_\text{MaPO}(\theta) = -\mathbb{E}\!\left[\log\sigma\!\big(\beta(\hat{\ell}_w - \hat{\ell}_l) - \gamma\big) + \alpha \hat{\ell}_w\right]$$

where $\hat{\ell} = -\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2$ is the likelihood surrogate, $\gamma$ is the margin, and $\alpha$ is the likelihood-term weight.

Advantages:

4.6 DPO family overview

MethodNeeds ref?Preference typeMemoryWhen to use
Diffusion-DPO (Wallace 2024)pairedgeneral alignment
D3PO (Yang 2024)paired or thumbswhen no RM available
SPO (Liang 2024)✅ + step-RMper-step paired2× + small step-RMextracting maximum step-level signal
Diffusion-KTO (Li 2024)unpaired binarylarge volume of thumbs data
MaPO (Hong 2024)pairedstyle fine-tune / memory-constrained

§5 Flow-GRPO: RL for Flow Matching

5.1 Why Flow Matching also needs post-training

SD3 / FLUX / Lumina all moved to Flow Matching (Rectified Flow); post-training needs are the same:

But Flow Matching is a deterministic ODE ($\dot x_t = v_\theta(t, x, c)$) — the stochastic transition assumed by DDPO/DPOK does not exist. Directly plugging it into the RL framework fails.

5.2 Flow-GRPO's two core tricks

Liu et al. 2025 Flow-GRPO: Training Flow Matching Models via Online RL (arXiv 2505.05470) solves two fundamental Flow + RL problems:

Trick 1: ODE → SDE equivalent conversion

For the Rectified Flow ODE $\dot x_t = v_\theta(t, x_t, c)$, construct an equivalent SDE:

$$dx_t = \big[v_\theta(t, x_t, c) + \tfrac{1}{2}\sigma(t)^2 \nabla_x \log p_t(x_t)\big]\,dt + \sigma(t)\,dW_t$$

Key property (Song et al. 2021 score SDE framework): this SDE has exactly the same marginal $p_t$ as the original ODE. The difference is that the SDE provides stochastic exploration (the $dW_t$ noise), allowing RL to sample different trajectories.

For Flow Matching, $\nabla_x \log p_t = -\epsilon/\sigma_t$ (under a Gaussian path), and the score can be derived from $v_\theta$. Setting $\sigma(t)$ as a schedule (typically $\sigma(t) = \sqrt{1-t}$) gives the SDE sampler used by Flow-GRPO during training.

Physical meaning

adding $\sigma\,dW$ lets the particle "jitter" out multiple trajectories while keeping the marginal unchanged, so $G$ samples from the same prompt actually differ → group statistics for GRPO are well-defined.

Trick 2: Denoising reduction

GRPO needs a group of $G$ trajectories per sample, typically $G = 16$–$32$. Flow Matching inference normally uses 25–50 steps, so a single training sample needs $\approx 25G$ forwards — too expensive.

Flow-GRPO uses fewer steps during training (e.g. 10 steps); inference still uses 25–50 steps. The SDE is more uniform on the schedule, so the "exploration quality" of few-step training is enough:

$$\text{Training}: T_\text{train} = 10, \quad \text{Inference}: T_\text{infer} = 28$$

Empirically, no quality drop on GenEval / OCR / Aesthetic.

5.3 Flow-GRPO advantage computation

Completely parallel to GRPO for LLM — for the same prompt $c$, sample $G$ final images $\{x_0^{(1)}, \dots, x_0^{(G)}\}$, score each with reward $r_i$, and normalize within the group:

$$\hat{A}_i = \frac{r_i - \text{mean}_{j}(r_j)}{\text{std}_j(r_j) + \epsilon}$$

All steps within a trajectory share the same $\hat{A}_i$ (same as per-token sharing in LLM GRPO).

5.4 Flow-GRPO loss

Let the SDE Euler step transition log-prob be $\log p_\theta(x_{t-1} \mid x_t, c)$ (Gaussian), with importance ratio $\rho_{i,t} = p_\theta / p_{\theta_\text{old}}$. PPO-clip:

$$L^\text{Flow-GRPO} = \mathbb{E}\!\left[\frac{1}{G}\sum_i\!\frac{1}{T_\text{train}}\!\sum_t \min\!\big(\rho_{i,t}\hat{A}_i, \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\hat{A}_i\big) - \beta\, \text{KL}_{i,t}(p_\theta \Vert p_\text{ref})\right]$$

KL still uses the K3 estimator (same as GRPO for LLM).

5.5 Geometric meaning of advantage on the vector field

L3 understanding

5.6 Flow-GRPO empirical results

The paper reports for SD3.5-M:

BenchmarkSD3.5-M baseFlow-GRPO
GenEval overall63%95%
Visual text rendering59%92%
Aesthetic (Schuhmann)5.86.1
GenEval 95% looks too perfect

the paper does claim this number, but note that GenEval measures rule-verifiable object count / color / spatial-relation tasks, which are inherently RL-friendly (highly regularized rewards). On more subjective benchmarks like PartiPrompt / DPG, gains are 5–10 points, which is more realistic.

§6 Code Patterns (readable pseudo-code)

6.1 DDPO REINFORCE-style update

import torch
import torch.nn.functional as F

def ddpo_step(unet, ref_unet, scheduler, prompts, reward_fn,
              T=20, B=4, lr=1e-5, beta=0.0):
    """
    DDPO-SF one-step training (REINFORCE + optional KL anchor).
    prompts: list of B text prompts
    reward_fn: callable (x0_batch, prompts) -> [B] scalar
    """
    # ── 1. Rollout: sample G=B trajectories ──
    x = torch.randn(B, 4, 64, 64, device=device)
    traj_log_probs = []
    with torch.set_grad_enabled(False):                   # rollout needs no grad
        x_t = x
        for t in reversed(range(T)):
            # predict noise + sample x_{t-1}
            eps_pred = unet(x_t, t, prompts)
            mean, std = scheduler.step_mean_std(x_t, eps_pred, t)
            x_tm1 = mean + std * torch.randn_like(mean)   # stochastic transition
            traj_log_probs.append((mean.detach(), std.detach(), x_tm1.detach()))
            x_t = x_tm1
        x_0 = x_t

    # ── 2. Reward ──
    R = reward_fn(x_0, prompts)                            # [B]
    A = (R - R.mean()) / (R.std() + 1e-8)                  # batch baseline

    # ── 3. Policy gradient: re-forward to get log p_θ ──
    x_t = x.detach()
    loss_pg = 0.0
    for t, (mean_old, std_old, x_tm1) in zip(reversed(range(T)), traj_log_probs):
        eps_pred = unet(x_t, t, prompts)                   # grad needed
        mean, std = scheduler.step_mean_std(x_t, eps_pred, t)
        # Gaussian log-prob
        log_p = -0.5 * (((x_tm1 - mean) / std) ** 2).sum([1, 2, 3])
        log_p -= std.log().sum([1, 2, 3])
        loss_pg = loss_pg - (log_p * A).mean()             # REINFORCE
        if beta > 0:
            ref_eps = ref_unet(x_t, t, prompts).detach()
            mean_ref, std_ref = scheduler.step_mean_std(x_t, ref_eps, t)
            # Gaussian-Gaussian KL closed form
            kl = ((mean - mean_ref) ** 2 / (2 * std_ref ** 2)
                  + (std / std_ref) ** 2 / 2
                  - 0.5 - (std / std_ref).log()).sum([1, 2, 3])
            loss_pg = loss_pg + beta * kl.mean()
        x_t = x_tm1.detach()
    return loss_pg
DDPO implementation footguns

6.2 Diffusion-DPO loss

def diffusion_dpo_loss(unet, ref_unet, scheduler,
                       x0_w, x0_l, prompt_embeds, beta=2000.0):
    """
    Diffusion-DPO (Wallace 2024) one-step training.
    x0_w, x0_l: [B, 4, H, W]  preferred / dispreferred latents
    beta: paper uses 2000~5000 (note: this is the combined β·T coefficient, larger than LLM DPO)
    """
    B = x0_w.shape[0]
    t = torch.randint(0, scheduler.num_train_timesteps, (B,), device=x0_w.device)
    noise = torch.randn_like(x0_w)                          # paired noise!

    xt_w = scheduler.add_noise(x0_w, noise, t)
    xt_l = scheduler.add_noise(x0_l, noise, t)

    # ── policy ε-prediction ──
    eps_w = unet(xt_w, t, prompt_embeds)
    eps_l = unet(xt_l, t, prompt_embeds)

    # ── reference ε-prediction (frozen) ──
    with torch.no_grad():
        ref_eps_w = ref_unet(xt_w, t, prompt_embeds)
        ref_eps_l = ref_unet(xt_l, t, prompt_embeds)

    # ── ELBO surrogate: -‖ε - ε_θ‖² is a proxy for log p_θ ──
    err_w_pol = ((noise - eps_w) ** 2).mean([1, 2, 3])      # [B]
    err_w_ref = ((noise - ref_eps_w) ** 2).mean([1, 2, 3])
    err_l_pol = ((noise - eps_l) ** 2).mean([1, 2, 3])
    err_l_ref = ((noise - ref_eps_l) ** 2).mean([1, 2, 3])

    # DPO log-ratio: smaller err = better likelihood
    #   log(π_θ/π_ref)(y_w) ≈ -(err_w_pol - err_w_ref)
    diff_w = -(err_w_pol - err_w_ref)
    diff_l = -(err_l_pol - err_l_ref)

    inner = beta * (diff_w - diff_l)
    loss = -F.logsigmoid(inner).mean()

    with torch.no_grad():
        margin = inner.mean()
        accuracy = (inner > 0).float().mean()
    return loss, {"margin": margin.item(), "acc": accuracy.item()}
β magnitude note

Diffusion-DPO's $\beta$ is several orders of magnitude larger than LLM DPO's, because it absorbs the $T$-fold accumulation ($\beta T$ is the true "temperature"). Paper uses $\beta \in [2000, 5000]$; LLM DPO uses $\beta \in [0.05, 0.5]$.

6.3 AlignProp / DRaFT backprop with checkpointing

def alignprop_step(unet_lora, scheduler, prompts, reward_fn,
                   T=50, K=1, B=4, lr=1e-5):
    """
    DRaFT-K / AlignProp one-step training.
    K: last K steps retain gradient
    Memory O(K); K=1 is single-forward magnitude on SDXL
    """
    x = torch.randn(B, 4, 64, 64, device=device)

    # ── First T - K steps without grad ──
    with torch.no_grad():
        for t in reversed(range(K, T)):
            eps = unet_lora(x, t, prompts)
            x = scheduler.step_ddim(x, eps, t)             # deterministic DDIM

    # ── Last K steps with grad ──
    for t in reversed(range(K)):
        eps = unet_lora(x, t, prompts)                     # gradient ON
        x = scheduler.step_ddim(x, eps, t)

    x_0 = x
    # ── Backprop ──
    reward = reward_fn(x_0, prompts)                       # [B]
    loss = -reward.mean()                                  # max reward = min -reward
    return loss

# Memory analysis:
#   K=1:  ~24 GB on SDXL (single forward + grad)
#   K=5:  ~60 GB
#   K=10: ~120 GB (needs multi-GPU)
#   K=T=50: ~600 GB (completely infeasible)
Is K=1 enough?

Yes. Intuition: the last step $x_1 \to x_0$ has the largest effect on $x_0$ (the variance of the previous 49 steps is compressed), so the 1-step backprop signal already dominates. Clark 2024 empirically confirms $K=1$ and $K=5$ have negligible difference.

6.4 SPO step-aware preference loss

def spo_loss(unet, ref_unet, scheduler, step_rm,
             x_t, t, prompt_embeds, beta=500.0):
    """
    SPO (Liang 2024) in-step preference.
    Given x_t and t, independently sample two x_{t-1}, let step_rm judge winner.
    """
    # ── 1. Use policy to sample two x_{t-1} candidates (must no_grad,
    #      otherwise DPO log-prob backprop would flow back into the sampler) ──
    with torch.no_grad():
        eps_sample = unet(x_t, t, prompt_embeds)
        mean_s, std_s = scheduler.step_mean_std(x_t, eps_sample, t)
        noise_a, noise_b = torch.randn_like(mean_s), torch.randn_like(mean_s)
        x_a = mean_s + std_s * noise_a
        x_b = mean_s + std_s * noise_b

    # ── 2. step-wise reward model judges winner ──
    with torch.no_grad():
        r_a = step_rm(x_a, x_t, t, prompt_embeds)          # [B]
        r_b = step_rm(x_b, x_t, t, prompt_embeds)
        winner = (r_a > r_b).long()                        # [B], 1 if a wins
    x_w = torch.where(winner.bool()[:, None, None, None], x_a, x_b).detach()
    x_l = torch.where(winner.bool()[:, None, None, None], x_b, x_a).detach()

    # ── 3. compute log p_θ / log p_ref for x_w, x_l (grad-aware forward) ──
    eps = unet(x_t, t, prompt_embeds)
    mean, std = scheduler.step_mean_std(x_t, eps, t)
    log_p_w = -0.5 * ((x_w - mean) / std).pow(2).sum([1, 2, 3])
    log_p_l = -0.5 * ((x_l - mean) / std).pow(2).sum([1, 2, 3])
    with torch.no_grad():
        ref_eps = ref_unet(x_t, t, prompt_embeds)
        ref_mean, ref_std = scheduler.step_mean_std(x_t, ref_eps, t)
        log_pref_w = -0.5 * ((x_w - ref_mean) / ref_std).pow(2).sum([1, 2, 3])
        log_pref_l = -0.5 * ((x_l - ref_mean) / ref_std).pow(2).sum([1, 2, 3])

    inner = beta * ((log_p_w - log_pref_w) - (log_p_l - log_pref_l))
    return -F.logsigmoid(inner).mean()

6.5 Flow-GRPO group-relative advantage

def flow_grpo_step(flow_net, ref_flow, prompts, reward_fn,
                   G=16, T_train=10, sigma_fn=lambda t: (1 - t) ** 0.5,
                   eps_clip=0.2, beta=0.04):
    """
    Flow-GRPO one-step training.
    G: sample G trajectories per prompt.
    T_train: SDE step count for training (inference uses 28-50).
    """
    P = len(prompts)
    # Repeat each prompt G times
    prompts_rep = sum([[p] * G for p in prompts], [])      # [P*G]

    # ── 1. SDE rollout: ODE→SDE equivalent conversion ──
    x_t = torch.randn(P * G, 4, 64, 64, device=device)
    log_probs_old = []                                     # for PPO importance ratio
    trajectory = [x_t.clone()]
    with torch.no_grad():
        for i in range(T_train):
            t_now = 1.0 - i / T_train
            t_next = 1.0 - (i + 1) / T_train
            dt = t_next - t_now
            sigma = sigma_fn(t_now)
            v = flow_net(x_t, t_now, prompts_rep)
            # SDE Euler: drift = v + 0.5 σ² ∇log p (PF-ODE → SDE conversion, Song 2021)
            # !!! IMPORTANT: the following drift is a simplified pedagogical placeholder.
            #     Production implementations follow Flow-GRPO paper Eq.(6) and derive score
            #     correctly from score = (data_pred - x_t)/σ_t² with the specific
            #     Rectified Flow / EDM schedule. Refer to paper + official repo for deployment;
            #     the -v/σ here is only illustrative.
            drift = v + 0.5 * sigma ** 2 * (-v / (sigma + 1e-6))  # placeholder, see paper Eq.(6)
            noise = torch.randn_like(x_t)
            x_next = x_t + drift * dt + sigma * noise * abs(dt) ** 0.5
            # Gaussian log-prob (transition)
            mean = x_t + drift * dt
            std = sigma * abs(dt) ** 0.5
            log_p = -0.5 * ((x_next - mean) / std).pow(2).sum([1, 2, 3])
            log_probs_old.append(log_p)
            x_t = x_next
            trajectory.append(x_t.clone())
        x_0 = x_t

    # ── 2. Group-relative advantage ──
    R = reward_fn(x_0, prompts_rep)                        # [P*G]
    R = R.view(P, G)
    mean_R = R.mean(dim=1, keepdim=True)
    std_R = R.std(dim=1, keepdim=True) + 1e-8
    A = ((R - mean_R) / std_R).view(P * G)                 # [P*G]

    # ── 3. PPO-clip loss with KL ──
    loss = 0.0
    x_t = trajectory[0]
    for i in range(T_train):
        t_now = 1.0 - i / T_train
        v = flow_net(x_t, t_now, prompts_rep)              # grad ON
        sigma = sigma_fn(t_now)
        drift = v + 0.5 * sigma ** 2 * (-v / (sigma + 1e-6))
        dt = -1.0 / T_train
        mean = x_t + drift * dt
        std = sigma * abs(dt) ** 0.5
        log_p_new = -0.5 * ((trajectory[i+1] - mean) / std).pow(2).sum([1, 2, 3])
        ratio = (log_p_new - log_probs_old[i]).exp()
        surr1 = ratio * A
        surr2 = ratio.clamp(1 - eps_clip, 1 + eps_clip) * A
        loss = loss - torch.min(surr1, surr2).mean()

        # K3 KL estimator
        with torch.no_grad():
            v_ref = ref_flow(x_t, t_now, prompts_rep)
            drift_ref = v_ref + 0.5 * sigma ** 2 * (-v_ref / (sigma + 1e-6))
            mean_ref = x_t + drift_ref * dt
            log_p_ref = -0.5 * ((trajectory[i+1] - mean_ref) / std).pow(2).sum([1,2,3])
        delta = log_p_ref - log_p_new
        kl_k3 = (delta.exp() - delta - 1)
        loss = loss + beta * kl_k3.mean()

        x_t = trajectory[i + 1].detach()
    return loss

6.6 Combined reward signal

def combined_reward(images, prompts, weights=None):
    """
    Weighted multi-reward combination — mitigates single-RM hacking.
    """
    weights = weights or {"image_reward": 0.4, "hps_v2": 0.3,
                          "pickscore": 0.2, "clip_score": 0.1}
    rewards = {}
    rewards["image_reward"] = image_reward_model(images, prompts)        # [-1, 4]
    rewards["hps_v2"] = hps_v2(images, prompts)                          # [0, 1]
    rewards["pickscore"] = pickscore(images, prompts)                    # logits
    rewards["clip_score"] = clip_cosine(images, prompts)                 # [-1, 1]

    # ── z-score each individually (different rewards have very different scales) ──
    normed = {k: (v - v.mean()) / (v.std() + 1e-8) for k, v in rewards.items()}

    # ── weighted + length / safety penalty ──
    R = sum(weights[k] * normed[k] for k in weights)

    # NSFW penalty (rule-based)
    nsfw_score = nsfw_detector(images)                                   # [0, 1]
    R = R - 5.0 * nsfw_score

    return R
Multi-reward practical experience

§7 Reward Design & failure modes

7.1 Reward model choices

RMSourceDataScalePreference characteristics
CLIP-ScoreOpenAI/LAION4B image-text pairs$[-1, 1]$ cosineweak text-image alignment signal; prefers literal caption matching
ImageReward (Xu 2023 NeurIPS)137K human pairsreal prompts$[-1, 4]$aesthetic + alignment combined; prefers high contrast
HPSv2 (Wu 2023 arXiv 2306.09341)798K human pairsDiffusionDB-style$[0, 1]$comprehensive human preference; prefers saturated colors
PickScore (Kirstain 2023 NeurIPS)Pick-a-Pic 1M pairsreal userslogitscomprehensive; biased toward trained-on-SDXL style
PiCaR (rule)OpenAIcounting/OCRbinaryrule-based, hack-proof
SymptomVisual featureCause
Over-saturationcolor saturation > 100%HPSv2 / aesthetic prefers vivid
Center biassubject always centeredRM training data is centered-subject heavy
Monotone compositiondifferent prompts use the same compositionmode collapse to RM high-score mode
Tryphobia-like patternsdense dot / hole texturessome RMs prefer "texture richness"
Watermark hallucinationfake watermarks appear in cornersRM training data contains watermarks → learned "watermark = real photo"
Cartoon shiftphoto-realistic prompts output animeRM annotators prefer anime
Lighting overcookedHDR post-processing exaggeratedaesthetic predictor prefers heavy post-processing

7.3 Step-level vs trajectory-level reward

DimensionTrajectory-levelStep-level
Signal locationonly at $x_0$at every $t$
Data acquisitioneasy (one image)hard (need step-wise RM or rollout)
Learning efficiencylow (sparse)high (dense)
RepresentativeDDPO, Diffusion-DPOSPO (Liang 2024)
Engineering difficultylowhigh (either train a step-RM or PRM-shepherd-style rollout)
How to train a step-RM

SPO uses a binary RM "given $x_t$ at step $t$, is $x_{t-1}$ a good transition?". Training data: run multiple trajectories from the base UNet, and assign per-step rewards by backing out from final-$x_0$ rewards (similar to Math-Shepherd's rollout-based PRM).

7.4 Core mechanisms to mitigate reward hacking

  1. Reward ensemble: take min or mean across multiple RMs (HPSv2 + PickScore + ImageReward is the mainstream combo).
  2. KL anchor: DPO's $\beta$, DPOK's explicit KL, Flow-GRPO's K3 KL term.
  3. LoRA scale: full fine-tune drifts fast; LoRA scale caps reward-hacking ceiling.
  4. Early stop on reward plateau: reward up + FID up = hacking signal.
  5. Composite reward: rule-based (object count, OCR) + neural RM (aesthetic, alignment) weighted.
  6. Adversarial RM: include hacking samples as negatives when training RM.

§8 Production landscape: what SD3 / FLUX actually use

8.1 What the public papers / reports actually say

ModelPost-training?Disclosed content
SD 1.5partial community DPO / DDPO LoRAbase is pure LDM; many community fine-tunes
SDXLStability AI did not explicitly describe post-trainingbase + refiner; Pick-a-Pic + Diffusion-DPO community LoRAs are popular
SDXL Turbo / ADDdistillation-focusedAdversarial Diffusion Distillation (2311.17042); essentially 1-step distill, not RL post-training
SD3 (Stable Diffusion 3, Esser et al. 2024 ICML)base uses Rectified Flow + MM-DiTpaper does not disclose post-training; community suspects internal DPO
SD3.5 / SD3.5 Turbodistill is present; post-training not disclosedspeculated DPO + distill hybrid
FLUX.1 dev / pro (Black Forest Labs 2024)undisclosedcommunity speculates DPO + distill; pro API is closed-source
DALL-E 3 (OpenAI 2023)"recaptioning + RLHF"publicly emphasizes prompt-faithful RLHF
Imagen 3 (Google 2024)undisclosedinternal alignment pipeline
DeepFloyd IFno post-trainingacademic base model

8.2 Did SD3 / FLUX use post-training?

Honest answer

public papers / tech reports do not explicitly say RL / DPO post-training was used. But there is indirect evidence:

Industry consensus (from HuggingFace community + Reddit r/StableDiffusion): closed-source large models (FLUX pro, DALL-E 3, Midjourney v6+) have reward-based fine-tuning but the specifics are undisclosed; open-source bases (SD3.5 base, FLUX dev base) have public training pipelines without RL, but Stability AI's internal dev branches may have it.

8.3 An industrial-grade pipeline hypothesis

                    ┌─────────────────┐
                    │  LDM Pretrain   │  several 100M / B images, $L_simple$
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  SFT on Curated │  high-quality prompt-image pairs
                    │     dataset     │  (Aesthetic > 6.0, no watermark)
                    └────────┬────────┘
                             │
            ┌────────────────┼────────────────┐
            │                                  │
   ┌────────▼────────┐              ┌─────────▼────────┐
   │ Diffusion-DPO   │              │  DRaFT / AlignProp│
   │ on Pick-a-Pic    │              │   on multi-RM    │
   └────────┬────────┘              └─────────┬────────┘
            │                                  │
            └────────────────┬─────────────────┘
                             │
                    ┌────────▼────────┐
                    │   Distillation  │  4-step / 1-step turbo
                    │   (ADD / LCM)   │
                    └────────┬────────┘
                             │
                       Production
Conclusion

post-training most likely happens between "SFT → distill"; in terms of method, industry leans Diffusion-DPO (offline, stable, no sampling needed); DDPO/DPOK have large academic impact but limited production deployment.

§9 vs LLM RLHF comparison

9.1 The full table

DimensionLLM RLHF (RLHF + DPO + GRPO)Diffusion Post-Training
Trajectory$L$ tokens$T$ denoising steps
Action spacediscrete vocabcontinuous $\mathbb{R}^d$
Reward sourcetrained BT-RM / rule (math, code)trained image RM (HPSv2/PickScore/ImageReward) / rule (count, OCR)
Reward sparsityterminal only (end of response)terminal only ($x_0$)
On-policy cost$L$ forwards$T$ forwards + image RM forward
Offline methodsDPO / IPO / KTO / SimPO / ORPODiffusion-DPO / D3PO / SPO / KTO / MaPO
On-policy methodsPPO / GRPO / RLOODDPO / DPOK / Flow-GRPO
Direct reward backprop❌ (tokens not differentiable)✅ (DRaFT / AlignProp / ReFL)
Memory bottleneck4 copies (policy + ref + RM + V)1 copy (DPO) / $O(K)$ (DRaFT-K) / $O(T)$ (vanilla backprop)
Typical $\beta$$0.05 \sim 0.5$$2000 \sim 5000$ (absorbs $T$-fold coefficient)
Typical trajectory length$L \sim 10^3$ tokens$T \sim 20$–$50$ steps
Mode collapse severitymedium (vocab is large)high (continuous space easily falls into local modes)
Reward hacking difficultymedium (depends on RM quality)high (visual RMs are easier to hack than BT-RMs)

9.2 Shared lessons

  1. KL anchor is mandatory: pure RL without ref policy always reward-hacks (LLMs produce vacuous fluff, diffusion produces over-saturated monotonous compositions).
  2. DPO family >> on-policy RL (engineering-wise): no sampling, no value model, offline — true for both LLM and diffusion.
  3. Reward ensemble counters hacking: min-of-K RMs is the common mitigation across domains.
  4. Group-based advantage: LLM's GRPO/RLOO and diffusion's Flow-GRPO both bypass the value model via group statistics.

9.3 Diffusion-specific differences

§10 25 frequently-asked interview questions

Three difficulty tiers: L1 = multimodal/diffusion-role basics; L2 = research/alignment-focused; L3 = top-lab hardcore.

L1 must-know (10 questions)

Q1. Why does a diffusion model need post-training? Isn't SFT enough?
  • SFT can only imitate positive examples ("what good looks like"), and cannot learn the contrastive signal (A is better than B).
  • Post-training provides contrastive signals via reward / preference, lifting alignment, aesthetic, prompt-faithfulness all at once.
  • Empirically Diffusion-DPO yields +5-10 points on PickScore, far exceeding continued SFT.

Saying only "improves image quality" is shallow; must articulate the "contrastive vs imitative signal" distinction.

Q2. What MDP does DDPO treat diffusion as? State/action/reward definitions?
  • State: $s_t = (x_t, t, c)$
  • Action: $a_t = x_{t-1}$ (sampled from $p_\theta(\cdot \mid x_t, c)$)
  • Transition: deterministic $s_{t-1} = (x_{t-1}, t-1, c)$
  • Reward: $r_t = 0$ for $t > 1$, $r_1 = R(x_0, c)$ (terminal-only)

Saying "per-step reward" (wrong, only at the terminal); or not knowing the transition is deterministic (noise is built into the action).

Q3. What does Diffusion-DPO substitute for $\log \pi_\theta(y|x)$?
  • Uses an ELBO surrogate: $-\|\epsilon - \epsilon_\theta(x_t, t, c)\|^2$ (DDPM's $L_\text{simple}$) as a proxy for $\log p_\theta(x_0)$.
  • This is a (negative) lower-bound term of $\log p_\theta$ — directionally correct.
  • Requires paired noise $\epsilon$ ($y_w$ and $y_l$ share the same $\epsilon$) for stability.

Saying the closed-form $\log p(x_0)$ is used (wrong, diffusion has no closed form); or forgetting paired noise.

Q4. AlignProp and DRaFT core idea? Why is memory $\mathcal{O}(K)$?
  • Core: reward $R(x_0)$ is differentiable → directly backprop to $\theta$, skipping RL.
  • The $T$-step sampler is a differentiable computation graph; vanilla backprop must store $T$-step activations → $\mathcal{O}(T)$.
  • DRaFT-K / AlignProp: the first $T-K$ steps use no_grad, only the last $K$ steps retain gradient — memory compressed to $\mathcal{O}(K)$.
  • Typically $K=1$ already gives a strong signal.

Saying K=1 equals REINFORCE (wrong, K=1 is reparameterized gradient with far lower variance than REINFORCE).

Q5. Why is Diffusion-DPO's $\beta$ 1000× larger than LLM DPO's?
  • LLM DPO: $\beta \in [0.05, 0.5]$
  • Diffusion-DPO: $\beta \in [2000, 5000]$
  • Reason: the diffusion "trajectory log-likelihood" is the sum of $T$ Gaussian log-probs; the single-step $\epsilon$-distance difference absorbs the $T$-fold coefficient. The effective temperature is $\beta T$.
  • Some implementations explicitly factor out $T$, in which case $\beta$ looks the same order as LLM.

Saying "diffusion has more noise so β is larger" (wrong, it's the cumulative effect of trajectory length).

Q6. Difference between ImageReward / HPSv2 / PickScore?
ImageRewardHPSv2PickScore
Data scale137K pairs798K pairs1M pairs (Pick-a-Pic)
BackboneBLIP fine-tunedCLIP fine-tunedCLIP fine-tuned
Scale$[-1, 4]$$[0, 1]$logits
Preferenceaesthetic + alignmenthigh contrast + alignmentSDXL style

Industrial best practice: use an ensemble (at least two).

Saying they're all the same (wrong, scales and preferences differ substantially).

Q7. Does DDPO sample with DDPM or DDIM? Why?
  • DDPM (or DDIM-eta=1) — stochastic transition needed.
  • DDIM-eta=0 is deterministic with no noise term, so no action to optimize → policy gradient is zero.
  • Analogy: LLM RL must use sampling (temperature > 0), cannot use greedy.

Saying DDIM is fine (wrong, eta > 0 is needed); or not knowing stochasticity is a prerequisite for RL.

Q8. Typical reward-hacking symptoms in diffusion?
  • Over-saturation (color saturation explodes) — HPSv2/aesthetic prefers vivid.
  • Center bias (subject always centered) — RM training data is centered.
  • Monotone composition (same composition across prompts) — mode collapse.
  • Watermark hallucination (fake watermarks in corners) — RM training data contains watermarks.
  • Cartoon shift (photo-realistic prompts produce anime) — RM annotator bias.

Saying just "over-optimization" is non-specific; you must name at least 3 specific visual symptoms.

Q9. Why is Flow-GRPO's ODE→SDE conversion necessary?
  • The Flow Matching ODE $\dot x = v_\theta$ is deterministic — given $x_T$, $x_0$ is uniquely determined.
  • RL needs a stochastic policy to explore; the ODE has no sampling dimension.
  • ODE→SDE adds the $\sigma\, dW$ noise term; the marginal $p_t$ is unchanged (Anderson 1982), but each sample path is different → enables exploration.

Not knowing the marginal is preserved (wrong, may think SDE changes the distribution).

Q10. How are $y_w$ and $y_l$ noises handled in Diffusion-DPO training?
  • Paired noise: $x_t^w$ and $x_t^l$ use the same $\epsilon$ (i.e. $x_t^w = \sqrt{\bar\alpha_t}x_0^w + \sqrt{1-\bar\alpha_t}\epsilon$, with $x_t^l$ similarly using the same $\epsilon$).
  • Without pairing, the $\beta$ scale becomes incomparable, loss variance rises sharply, and training is unstable.
  • This is the most overlooked implementation detail of Diffusion-DPO.

Not knowing about paired noise (wrong); or thinking $\epsilon$ refers to $\epsilon_\theta$'s prediction (wrong — here $\epsilon$ is the q-sample noise).

L2 advanced (10 questions)

Q11. Derive Diffusion-DPO loss (starting from the KL-regularized optimal solution).
  1. KL-regularized objective: $\max_p \mathbb{E}[R] - \beta\, \text{KL}(p \Vert p_\text{ref})$, optimal $p^* \propto p_\text{ref} \exp(R/\beta)$.
  2. Solve for implicit reward: $R(x_0, c) = \beta \log(p^*/p_\text{ref}) + \beta \log Z(c)$.
  3. Plug into BT: $P(y_w \succ y_l) = \sigma(R_w - R_l)$; $\log Z$ cancels in the difference.
  4. Replace $p^* \to p_\theta$; substitute the ELBO surrogate for $\log p_\theta$: $-L_\text{simple} = -\|\epsilon - \epsilon_\theta\|^2$ (at $x_t = q\text{-sample}(x_0, t, \epsilon)$).
  5. Expectation over $t \sim U(1, T)$; the loss becomes $-\log\sigma(\beta T [\Delta_w - \Delta_l])$, $\Delta_y = \|\epsilon - \epsilon_\text{ref}\|^2 - \|\epsilon - \epsilon_\theta\|^2$.

Reciting the final formula but unable to explain why $\log Z$ cancels; or not knowing the source of the ELBO surrogate.

Q12. How does AlignProp's $K$-step backprop trade memory against performance?
  • Memory: $\mathcal{O}(K \cdot M_\text{UNet})$. SDXL single forward ~ 8 GB activation; $K=1 \to 24$ GB (incl. weights + grad); $K=5 \to 60$ GB; $K=10 \to 120$ GB.
  • Performance: $K=1$ empirically reaches 95% of $K=5$; $K \ge 5$ shows no significant gain on most rewards.
  • Intuition: the last step $x_1 \to x_0$ has the largest effect on the terminal; the variance from the first 49 steps is compressed.
  • Industrial standard is $K=1$ (24 GB, single-GPU trainable).

Saying "the larger K the better" (wrong, performance curve saturates); or not knowing the memory magnitude.

Q13. Difference between REINFORCE and PPO in DDPO? Which is preferred in production?
  • DDPO-SF (REINFORCE): $\hat g = \sum_t \nabla \log p_\theta \cdot (R - b)$, simple but high variance.
  • DDPO-IS (PPO-clip): importance ratio $\rho_t = p_\theta/p_{\theta_\text{old}}$, multiple updates per batch, clip $\rho_t$.
  • Per-step ratio, not trajectory ratio (avoids variance blow-up from multiplying $T$ ratios).
  • Production prefers PPO: more stable, better sample efficiency.

Saying "trajectory-level ratio" (wrong, per-step); or not knowing both are DDPO variants.

Q14. How does SPO obtain the in-step preference pair? Why is a step-RM needed?
  • In-step: given $x_t$, independently sample two $x_{t-1}^a, x_{t-1}^b$ (use the policy's stochastic transition $p_\theta(\cdot \mid x_t)$ twice).
  • Use a step-wise reward model $R_\text{step}(x_{t-1}, x_t, c, t)$ to choose the winner.
  • step-RM training data: roll out many trajectories from the base UNet; the per-step reward is back-derived from the terminal reward (similar to Math-Shepherd's rollout-based PRM).
  • Could use a terminal RM instead, but then rolling out to $x_0$ for scoring is $T$× more expensive.

Not knowing how to obtain the in-step pair (wrong, must sample twice); or not knowing step-RM is SPO-specific.

Q15. Substantive difference between Diffusion-DPO and D3PO?
  • Derivation path: Diffusion-DPO uses the ELBO surrogate (single-step $\epsilon$-distance); D3PO uses the full trajectory log-ratio.
  • Mathematical equivalence: under the ELBO lower bound and expectation over $t$, D3PO's trajectory form reduces to Diffusion-DPO's single-step form.
  • Practical differences:

    • Diffusion-DPO computes UNet forward once per step (policy + ref), cheap.
    • D3PO strictly requires forward over the full trajectory $T$ times.
  • Production mostly uses Diffusion-DPO (cheap + stable).

Saying "completely different" (wrong, they are theoretically equivalent); or not knowing D3PO is also a DPO-family method.

Q16. What is Flow-GRPO's denoising reduction? Why doesn't it degrade?
  • During training, the SDE uses few steps ($T_\text{train} = 10$); inference still uses full steps ($T_\text{infer} = 28$–$50$).
  • Empirically little degradation, because (note: this is empirical observation + approximation, not strict equivalence):

    • The SDE's continuous marginal $p_t$ is independent of step count; but the discrete sampler's actual distribution depends on step count — fewer steps yields a higher-discretization-error approximation. Strictly, "same marginal" only holds in the continuous limit.
    • RL learns the direction correction of $v_\theta$; the direction signal is loosely coupled to step count (this is empirical observation).
    • No degradation on rule-based rewards like GenEval/OCR; slight but acceptable degradation on more subjective rewards.
  • Saves sampling cost: each prompt needs $G \cdot T_\text{train}$ forwards → 1/3 the cost.

Not knowing the marginal is invariant (wrong); or thinking train/inference step counts must match.

Q17. How does MaPO remove the reference model? What does the loss look like?
  • Does not use $\log(p_\theta/p_\text{ref})$; uses an absolute likelihood margin directly:

$$\mathcal{L}_\text{MaPO} = -\log\sigma\!\big(\beta(\hat\ell_w - \hat\ell_l) - \gamma\big) + \alpha \hat\ell_w$$

  • $\hat\ell = -\|\epsilon - \epsilon_\theta\|^2$ is the likelihood surrogate.
  • $\gamma$ is the margin (similar to SimPO); the $\alpha\hat\ell_w$ term prevents "both sides dropping".
  • Saves half the memory (no ref UNet), trains 15% faster, and solves the reference-mismatch problem (stable when fine-tuning to a stylistically distant target).

Saying just "remove the ref" (wrong — need the likelihood term to prevent degeneration); not knowing about reference mismatch.

Q18. Why is reward-ensemble min better than mean?
  • mean: can still be hacked when one high-scoring RM dominates.
  • min: requires all RMs to agree on "good" before the reward is high → hacking must simultaneously fool all RMs, exponentially harder.
  • Equivalent to conservative aggregation (Coste 2024 ICLR for LLM); same idea applies to diffusion.
  • Cost: reward becomes conservative, gains are smaller.
  • Industry often uses R = mean - k * std (includes an uncertainty penalty) as a compromise.

Saying only "prevents hacking" without explaining why min; or not knowing this is also an LLM ensemble strategy.

Q19. What unique advantage does Diffusion-KTO have over Diffusion-DPO?
  • Only needs per-image binary feedback (thumbs up/down), no paired comparison required.
  • Industrial scenarios collect far more user reactions (likes/dislikes) than paired comparisons → KTO makes this data usable.
  • The prospect-theoretic value function $v(\cdot)$ is asymmetric over positive vs negative feedback (loss aversion).
  • No "which is better" annotation cost.

Not knowing KTO is unpaired (wrong — that's the whole point of KTO); or not knowing the prospect-theory origin.

Q20. Why is diffusion's KL anchor trajectory-level rather than token-level?
  • LLM's KL is per-token: $\sum_t \log(\pi_\theta(y_t)/\pi_\text{ref}(y_t))$.
  • Diffusion's KL is per-step (per-denoising-step), not per-pixel: $\sum_t \text{KL}(p_\theta(\cdot \mid x_t) \Vert p_\text{ref}(\cdot \mid x_t))$.
  • Two-Gaussian KL has a closed form: $\text{KL} = \frac{1}{2}\big[(\mu_\theta - \mu_\text{ref})^2/\sigma^2 + (\sigma_\theta/\sigma_\text{ref})^2 - 1 - 2\log(\sigma_\theta/\sigma_\text{ref})\big]$.
  • Pixels are not independent (convolution/attention), so the KL is per-image-level rather than per-pixel.

Saying "per-pixel KL" (wrong, per-step); not knowing the closed-form Gaussian KL.

L3 top-lab questions (5 questions)

Q21. Derive Diffusion-DPO loss from the reverse ELBO, and explain why the ELBO surrogate works.
  1. DDPM ELBO: $\log p_\theta(x_0) \ge -\sum_{t=2}^T \text{KL}(q(x_{t-1}|x_t,x_0) \Vert p_\theta(x_{t-1}|x_t)) + \log p_\theta(x_0|x_1) - \text{KL}(q(x_T|x_0) \Vert p(x_T))$
  2. Simplify (Ho 2020): $-\log p_\theta(x_0) \le L_\text{simple} + C$, $L_\text{simple} = \mathbb{E}_{t,\epsilon}\|\epsilon - \epsilon_\theta(x_t,t)\|^2$.
  3. KL-regularized optimum $p^* \propto p_\text{ref}\exp(R/\beta)$; solve for $R = \beta\log(p^*/p_\text{ref}) + \beta\log Z$.
  4. Plug into BT; $\log Z$ cancels.
  5. Use the ELBO surrogate for $\log p$: $\log p_\theta(x_0) \approx -L_\text{simple}$ (note: this is the negation of the upper bound, used as a single-sample estimate — strictly, it is one term of the lower bound, not $\log p$ itself, but it works numerically as a DPO implicit-reward proxy).
  6. Take expectation over $t$ to obtain the final loss.

Deeper reason the ELBO surrogate works: DPO's implicit reward is $\beta\log(p_\theta/p_\text{ref})$ — it only depends on relative likelihood. The ELBO surrogate's constant term ($C$) cancels between $p_\theta$ and $p_\text{ref}$ (both use the same architecture); only the $-\|\epsilon - \epsilon_\theta\|^2$ difference remains. So even though ELBO isn't a tight bound for $\log p$, the difference is cancellable.

Only able to write the final formula but cannot trace the derivation chain; or not knowing constant cancellation is the key.

Q22. Is AlignProp's $\mathcal{O}(K)$ memory truly unavoidable for $K$-step backprop?

In theory yes, but expensive in practice:

  1. Gradient checkpointing: trade activation storage for recompute. Each forward step does not store activations; on backward, redo forward to compute gradient.

    • Memory: from $\mathcal{O}(K \cdot M)$ down to $\mathcal{O}(\sqrt{K} \cdot M)$ + $\mathcal{O}(K \cdot \text{state})$.
    • Cost: backward is 2-3× slower.
  2. Reversible ResNet: if the UNet has a reversible architecture (i-RevNet style), backward can derive input from output without storing activations.

    • But Stable Diffusion / SDXL UNet is not reversible.
  3. Implicit gradient: use the implicit function theorem under a fixed-point assumption.

    • Requires the sampler to converge to a fixed point, which diffusion does not satisfy.
  4. Truncated backprop with control variates: DRaFT-K is already in this direction; theoretically, adding control variates can further reduce variance but not memory.

Practical answer: $K=1$ + gradient checkpointing + LoRA is the engineering optimum. The fundamental reason $\mathcal{O}(K)$ is unavoidable is that the sampler is not reversible computation.

Only saying gradient checkpointing is incomplete; not knowing about the reversibility assumption.

Q23. Geometric meaning of advantage on the vector field $v_\theta$ in Flow-GRPO?

GRPO's advantage acts on the vector field space as follows:

  1. Group statistics: for the same prompt $c$, sample $G$ SDE trajectories; each yields a different $x_0^{(i)}$; the reward $r_i$ gives the entire trajectory a single advantage $\hat A_i = (r_i - \bar r)/\sigma_r$.
  2. Gradient along trajectory: $\nabla_\theta L = \sum_t \nabla_\theta \log p_\theta(x_{t-1}^{(i)} \mid x_t^{(i)}) \cdot \hat A_i$. Under a Gaussian transition, $\log p \propto -(x_{t-1} - \mu_\theta)^2/(2\sigma^2)$, so $\nabla_\theta \log p \propto (x_{t-1} - \mu_\theta)\nabla_\theta \mu_\theta / \sigma^2$.
  3. Physical meaning of $\mu_\theta$: under the Flow Matching SDE, $\mu_\theta = x_t + (v_\theta + \frac{1}{2}\sigma^2 s_\theta) dt$; $\nabla_\theta \mu_\theta \approx dt \cdot \nabla_\theta v_\theta$ (ignoring the score term).
  4. Geometric meaning: when $\hat A_i > 0$, push $v_\theta(t, x_t^{(i)})$ toward the direction $(x_{t-1}^{(i)} - x_t^{(i)})/dt$ (the direction the trajectory actually took); when advantage $< 0$, push the opposite way.
  5. vs ODE perspective: equivalent to "group-relative direction reweighting" in vector field space — good trajectories make $v_\theta$ point along their direction at that $(t, x_t)$, bad ones the opposite.

This is "reward-weighted importance sampling" in vector field space: each SDE trajectory is one "proposal direction" for $v_\theta$, and the advantage decides whether to follow it.

Failing to articulate any geometry is 0 points; saying "reweighting" without specifying the space is half credit.

Q24. Did SD3 / FLUX really use RL post-training? How can you tell?

Honest answer: public papers / tech reports do not explicitly say RL / DPO was used. But there are clues:

  1. SD3 paper (arXiv 2403.03206): discusses Rectified Flow + MM-DiT + reflow only; no reward fine-tuning mentioned.
  2. FLUX: no paper at all; the model card just says "trained on a large image-text dataset".
  3. DALL-E 3 (OpenAI 2023): explicitly says caption-faithful RLHF (rewrite caption + RM) is used.
  4. Industry consensus: closed-source large models (FLUX pro, DALL-E 3, Midjourney v6+) almost certainly have reward-based fine-tuning, but the method is undisclosed.

How to test (black-box):

  • Generate 100 images for the same prompt; low FID-100 / multi-mode diversity → likely RL/DPO (mode collapse signal).
  • GenEval high but portrait style monolithic → reward over-optimization signal.
  • Excessive response to "vibrant"/"colorful" prompts → traces of HPSv2/aesthetic RM.

Conclusion: FLUX likely has internal DPO + distill hybrid; SD3.5 likely has SFT + possibly DPO. But no public evidence — the key to this question is to answer "undisclosed but with indirect evidence", and avoid making up technical details.

Saying directly "SD3 uses Diffusion-DPO" is wrong (paper says no such thing); the answer should be "undisclosed but community-inferred with evidence list".

Q25. If you were to design a diffusion post-training pipeline from scratch, how would you choose?

Depends on constraints. A generic recommendation:

Phase 1: preference data collection

  • Collect paired preference (Pick-a-Pic style): expensive but DPO-ready.
  • Collect binary feedback (thumbs up/down): cheap, use Diffusion-KTO.
  • Collect rule-based ground truth (GenEval-style prompts + automated verifier): cheap, use Flow-GRPO.

Phase 2: algorithm choice

  • Default Diffusion-DPO: offline, stable, cheap, mature community code (HuggingFace diffusers supports it directly).
  • If the base is Flow Matching (SD3/FLUX): use Flow-GRPO, with rule-based rewards prioritized.
  • If fine-tuning to a new style / memory constrained: use MaPO (no ref, saves half the memory).
  • If reward is differentiable and you want to extract maximum signal: DRaFT-1 + LoRA, paired with HPSv2 + PickScore ensemble.
  • NOT preferred: DDPO: on-policy sampling is too expensive, engineering complexity is high, no clear advantage over DPO.

Phase 3: reward design

  • Multi-RM ensemble (min or mean - k·std): HPSv2 + PickScore + ImageReward.
  • Add rule-based safety: NSFW detector hard penalty.
  • Add rule-based alignment: GenEval automated verifier (object count, OCR).
  • Z-score each RM separately.

Phase 4: monitoring and early stop

  • Compute reward + FID-100k every N steps; reward up + FID up = hacking signal.
  • KL budget monitoring: stop if $\text{KL}(p_\theta \Vert p_\text{ref}) > K_\text{target}$.
  • Human blind A/B (base vs RL) every 1000 steps.

Phase 5: distillation handoff

  • After post-training, do ADD / LCM distillation to 4-step / 1-step.
  • Note: distillation can eliminate some RL gains; consider distill-aware post-training.

Just answering "use Diffusion-DPO" is shallow; the answer should articulate "phase decomposition + multi-reward + monitoring + distill handoff".

§A Appendix

A.1 Key papers (with arXiv IDs)

PaperOne-line summaryarXivVenue
DDPOtreat diffusion as an MDP, train with REINFORCE/PPO2305.13301ICLR 2024
DPOKKL-regularized RL for diffusion2305.16381NeurIPS 2023
DRaFTdirect reward backprop, last $K$ steps2309.17400ICLR 2024
AlignPropreward backprop with randomized truncation2310.03739ICLR 2024
ImageReward / ReFL137K human pair RM + single-step reward fine-tune2304.05977NeurIPS 2023
HPSv2798K human pair RM2306.09341arXiv 2023
PickScore (Pick-a-Pic)1M user pairs, CLIP RM2305.01569NeurIPS 2023
Diffusion-DPOELBO surrogate + DPO loss2311.12908CVPR 2024
D3POtrajectory-level DPO for diffusion2311.13231CVPR 2024
SPOstep-aware preference + step-RM2406.04314arXiv 2024
Diffusion-KTOunpaired binary feedback (KTO for diffusion)2404.04465NeurIPS 2024
MaPOmargin-aware, no ref2406.06424arXiv 2024
Flow-GRPOGRPO for Flow Matching via ODE→SDE2505.05470arXiv 2025
SD3 (Rectified Flow + MM-DiT)base model2403.03206ICML 2024
Constitutional AI (origin of RLAIF)AI feedback replaces human2212.08073arXiv 2022
KTO (LLM)prospect theory alignment2402.01306arXiv 2024

A.2 Common reward model resources

A.3 Open-source training code

A.4 Engineering footgun checklist

FootgunFix
Diffusion-DPO without paired noise$\epsilon$ for $x_t^w$ and $x_t^l$ must be shared
DDPO with DDIM-eta=0must use eta>0 or DDPM, otherwise gradient is zero
AlignProp memory explosion$K=1$ + gradient checkpoint + LoRA
Reward scales not normalizedz-score each RM separately
FID crashes after RLadd KL anchor or reward ensemble
$\beta$ won't moveDiffusion-DPO uses $\beta \in [2000, 5000]$, not the LLM 0.1
Flow-GRPO trains slowlyuse denoising reduction ($T_\text{train} < T_\text{infer}$)
MaPO blows upthe $\alpha\hat\ell_w$ term must be large enough to prevent likelihood collapse
Step-RM won't trainuse rollout-based auto-labeling (Math-Shepherd-style)
Reward hacking undetectedsimultaneously monitor reward + FID + human blind A/B

A.5 Mapping back to §0 TL;DR

TL;DR pointSee section
1. Why it's hard§1
2. Three main lines§1.2
3. DDPO§2.1–2.4
4. DRaFT / AlignProp§3.2–3.3
5. Diffusion-DPO§4.1
6. D3PO§4.2
7. SPO§4.3
8. Flow-GRPO§5
9. Reward hacking§3.6 + §7
Learning checkpoint