Theoretical Perspectives on Flow Matching & Diffusion Models (3)
Preface: In the previous two articles, we defined the forward processes of the flow matching model and the diffusion model and constructed their training targets. Starting from this article, we will work from Gaussian paths to derive the actual loss functions, and then train an unconditional flow model and an unconditional diffusion model.
Training a Generative Model (Part 1)
Flow model
Recall the definition of the flow matching model:
$$
\begin{align}
X_0 \sim p_{init} , \quad dX_t = u^{\theta}_t(X_t)dt
\end{align}
$$
Intuitively, we should define the loss function:
$$ \begin{align} \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t\sim Unif,x \sim p_t}\left[\| u^{\theta}_t(x) - u^{target}_t(x)\|^2\right] \end{align} $$
where $p_t(x) = \int p_t(x|z)p_{data}(z)dz$. This loss does a few things:
- 1. The timestep $t$ is drawn uniformly from $[0,1]$.
- 2. We sample $z$ from $p_{data}$, add noise to obtain $x$, and compute $u^{\theta}_t(x)$.
- 3. We measure the squared distance between $u^{\theta}_t(x)$ and $u^{target}_t(x)$.
As discussed previously, we cannot compute $u^{target}_t(x)$ directly, because:
$$ \begin{align} u^{target}_t(x) = \int u^{target}_t(x|z) \frac{p_t(x|z)p_{data}(z)}{p_t(x)} \mathrm{d}z \end{align} $$
and we have no way to obtain $p_t(x)$. So we turn our attention to the conditional probability path and define the conditional flow matching loss:
$$ \begin{align} \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t\sim Unif,z \sim p_{data}, x\sim p_t(\cdot | z)}\left[ \| u^{\theta}_t(x) - u^{target}_t(x|z)\|^2\right] \end{align} $$Although $\mathcal{L}_{CFM}(\theta)$ is computable, is minimizing it actually useful? After all, our ultimate goal is to minimize $\mathcal{L}_{FM}(\theta)$. The answer is yes: the two objectives are equivalent up to a constant. Let us state the conclusion first:
$$ \begin{align} \mathcal{L}_{FM}(\theta) = \mathcal{L}_{CFM}(\theta) + C \end{align} $$where $C$ is a constant independent of $\theta$. The proof follows:
$$ \begin{align} \mathcal{L}_{FM}(\theta) &= \mathbb{E}_{t\sim Unif,x \sim p_t}\left[\|u^{\theta}_t(x) - u^{target}_t(x)\|^2 \right]\\ &= \mathbb{E}_{t\sim Unif,x \sim p_t}\left[\| u^{\theta}_t(x)\|^2 - 2u^{\theta}_t(x)^Tu^{target}_t(x) + \|u^{target}_t(x)\|^2\right]\\ &=\mathbb{E}_{t\sim Unif,x \sim p_t}\left[\| u^{\theta}_t(x)\|^2\right] - 2\mathbb{E}_{t\sim Unif,x \sim p_t}\left[u^{\theta}_t(x)^Tu^{target}_t(x)\right] + \underbrace{\mathbb{E}_{t\sim Unif,x \sim p_t}\left[\|u^{target}_t(x)\|^2\right]}_\text{C1}\\ &= \mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[\| u^{\theta}_t(x)\|^2\right] - 2\mathbb{E}_{t\sim Unif,x \sim p_t}\left[u^{\theta}_t(x)^Tu^{target}_t(x)\right] + C1 \end{align} $$
Here $\mathbb{E}_{t\sim Unif,x \sim p_t}\left[\|u^{target}_t(x)\|^2\right]$ depends only on the target vector field, not on the neural network we are training, so we can collapse it into a constant $C1$. That leaves the cross term $\mathbb{E}_{t\sim Unif,x \sim p_t}\left[u^{\theta}_t(x)^Tu^{target}_t(x)\right]$ to handle:
$$ \begin{align} \mathbb{E}_{t\sim Unif,x \sim p_t}\left[u^{\theta}_t(x)^Tu^{target}_t(x)\right] &= \int^1_0\int p_t(x)u^{\theta}_t(x)^Tu^{target}_t(x)\mathrm{d}x\mathrm{d}t\\ &= \int^1_0\int p_t(x)u^{\theta}_t(x)^T\left[\int u^{target}_t(x|z) \frac{p_t(x|z)p_{data}(z)}{p_t(x)} \mathrm{d}z\right]\mathrm{d}x\mathrm{d}t\\ &= \int^1_0\int\int u^{\theta}_t(x)^Tu^{target}_t(x|z)p_t(x|z)p_{data}(z)\mathrm{d}z\mathrm{d}x\mathrm{d}t\\ &= \mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[u^{\theta}_t(x)^Tu^{target}_t(x|z)\right] \end{align} $$Putting this term back into $\mathcal{L}_{FM}(\theta)$ and rearranging:
$$ \begin{align} \mathcal{L}_{FM}(\theta) &= \mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[\| u^{\theta}_t(x)\|^2\right] - 2\mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[u^{\theta}_t(x)^Tu^{target}_t(x|z)\right] + C1\\ &= \mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[\| u^{\theta}_t(x)\|^2 -2u^{\theta}_t(x)^Tu^{target}_t(x|z)\right] + C1\\ &= \mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[\| u^{\theta}_t(x)\|^2 -2u^{\theta}_t(x)^Tu^{target}_t(x|z) + \|u^{target}_t(x|z)\|^2 - \|u^{target}_t(x|z)\|^2\right] + C1\\ &= \mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[\| u^{\theta}_t(x) - u^{target}_t(x|z)\|^2\right] - \underbrace{\mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[ \|u^{target}_t(x|z)\|^2\right]}_\text{C2} + C1\\ &= \mathcal{L}_{CFM}(\theta) + C1 - C2 \end{align} $$Similarly, $\mathbb{E}_{t\sim Unif,z\sim p_{data},x \sim p_t(\cdot|z)}\left[ \|u^{target}_t(x|z)\|^2\right]$ is independent of our neural network: for a given target distribution, the conditional vector field is fixed, so we collapse it into a constant $C2$, and the constant in the theorem is $C = C1 - C2$.
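Before specializing the path, it is worth seeing what one training step of $\mathcal{L}_{CFM}$ looks like in code. Below is a minimal sketch, where `sample_cond_path` and `u_target_cond` are hypothetical stand-ins for whichever conditional probability path we pick (we instantiate them with the Gaussian path next):

```python
import torch

def cfm_step(model, z, sample_cond_path, u_target_cond):
    """One Monte Carlo estimate of L_CFM on a batch z ~ p_data.

    Hypothetical helpers, determined by the chosen conditional path:
      sample_cond_path(z, t) -> x drawn from p_t(.|z)
      u_target_cond(x, z, t) -> u_t^target(x|z)
    """
    t = torch.rand(z.shape[0])         # t ~ Unif[0, 1]
    x = sample_cond_path(z, t)         # x ~ p_t(.|z)
    pred = model(x, t)                 # u_t^theta(x)
    target = u_target_cond(x, z, t)    # conditional vector field target
    return ((pred - target) ** 2).flatten(1).sum(dim=-1).mean()
```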
OK, we have successfully obtained a loss function that can actually be computed. Now let us apply it under the Gaussian path and work the expression into a more explicit form:
Let $\epsilon \sim \mathcal{N}(0,I_d)$, so that $x_t = \alpha_t z + \beta_t \epsilon \sim \mathcal{N}(\alpha_t z , \beta^2_t I_d) = p_t(\cdot|z)$. From the training target we constructed, $u^{target}_t(x|z) = \left(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \frac{\dot{\beta}_t}{\beta_t}x$. Substituting $x = \alpha_t z + \beta_t \epsilon$ collapses the target to $u^{target}_t(x_t|z) = \dot{\alpha}_t z + \dot{\beta}_t \epsilon$, so the loss function becomes:
$$ \begin{align} \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t\sim Unif,z\sim p_{data},\epsilon\sim\mathcal{N}(0,I_d)}\left[\| u^{\theta}_t(\alpha_t z + \beta_t \epsilon) - (\dot{\alpha}_t z + \dot{\beta}_t \epsilon)\|^2\right] \end{align} $$
For the hyperparameters $\alpha_t$ and $\beta_t$, we only need to satisfy:
- $\alpha_0 = 0$, $\alpha_1 = 1$
- $\beta_0 = 1$, $\beta_1 = 0$
A natural choice is:
- $\alpha_t = t$
- $\beta_t = 1 - t$
Then $\dot{\alpha}_t = 1$ and $\dot{\beta}_t = -1$. Substituting into the loss function:
$$ \begin{align} \mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t\sim Unif,z\sim p_{data},\epsilon\sim\mathcal{N}(0,I_d)}\left[\| u^{\theta}_t(tz + (1-t)\epsilon) - (z - \epsilon)\|^2\right] \end{align} $$
This is a remarkably clean form. Reading it off: we sample $z$ from the target distribution, corrupt it with noise, feed the result to the neural network, and train the network to output the difference between the sample and the added noise, $z - \epsilon$.
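In code this objective is only a few lines. Below is a minimal PyTorch sketch; the function name and the `model(x, t)` calling convention are our own assumptions, not a library API:

```python
import torch

def cfm_loss_gaussian(model, z):
    """L_CFM under alpha_t = t, beta_t = 1 - t: regress u_theta onto z - eps."""
    shape = [z.shape[0]] + [1] * (z.dim() - 1)
    t = torch.rand(shape)              # t ~ Unif[0, 1], broadcastable over z
    eps = torch.randn_like(z)          # eps ~ N(0, I_d)
    x_t = t * z + (1 - t) * eps        # x_t ~ p_t(.|z) = N(t z, (1 - t)^2 I_d)
    pred = model(x_t, t.flatten())     # u_t^theta(x_t); t is fed explicitly
    return ((pred - (z - eps)) ** 2).flatten(1).sum(dim=-1).mean()
```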
OK, that's enough math. Before rushing straight into SDEs, it seems more invigorating to first train a simple flow model, so let's build one now.
```python
import torch
```
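Below is a minimal self-contained sketch of the kind of training loop we mean, reusing `cfm_loss_gaussian` from the sketch above. The toy 2D dataset, the MLP, and all hyperparameters are illustrative assumptions; our actual run, per the notes below, used a small UNet on image data:

```python
import math
import torch
import torch.nn as nn

# Toy p_data: a ring of 8 Gaussians (a stand-in dataset for illustration).
def sample_data(n):
    k = torch.randint(0, 8, (n,)).float()
    angles = k * (2 * math.pi / 8)
    centers = 4.0 * torch.stack([angles.cos(), angles.sin()], dim=-1)
    return centers + 0.1 * torch.randn(n, 2)

# u_t^theta(x): a small MLP over the concatenation [x, t].
class VectorField(nn.Module):
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t.unsqueeze(-1)], dim=-1))

model = VectorField()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5000):
    z = sample_data(512)                  # z ~ p_data
    loss = cfm_loss_gaussian(model, z)    # CFM loss from the sketch above
    opt.zero_grad()
    loss.backward()
    opt.step()
```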
The figure below shows the results we obtained after a quick 5 epochs of training:
A few things worth noting:
- We need to feed the timestep $t$ to the model explicitly. Our $u^{target}_t(x)$ reflects this as well: it is a function of both $x$ and $t$, which is perhaps more intuitive written as $u^{target}(x,t)$.
- An MLP cannot effectively model image data (in our experiments, an MLP not only converged slowly but also produced poor results), so we use a simple UNet architecture instead.
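Once the model is trained, generating samples just means integrating the learned ODE $dX_t = u^{\theta}_t(X_t)\mathrm{d}t$ from $t=0$ to $t=1$. A minimal Euler-integration sketch, assuming the same `model(x, t)` interface as the training sketch above (the step count of 100 is an arbitrary choice):

```python
import torch

@torch.no_grad()
def sample(model, n, dim=2, steps=100):
    """Euler integration of dX_t = u_t^theta(X_t) dt from t = 0 to t = 1."""
    x = torch.randn(n, dim)           # X_0 ~ p_init = N(0, I_d)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n,), i * dt)  # current timestep, fed explicitly
        x = x + model(x, t) * dt      # one Euler step along the vector field
    return x
```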


