ML Theory has gone completely mad!!!!!
In calculus class we learned that, for a \(\mathscr C^2\) function \(f\), Taylor's theorem with the Peano remainder gives
\[f(\bm w') = f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w\rangle + o(|\bm w' - \bm w|) \]
This says that as long as \(\bm w'\) is close enough to \(\bm w\), we can approximate \(f(\bm w')\) by \(f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle\).
Now we want a quantitative description of this approximation, so that we can establish the convergence of gradient descent (GD) and its rate. We therefore look at the Lagrange remainder form
\[f(\bm w') = f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w\rangle + \frac{g(\bm \xi)}2 |\bm w' - \bm w|^2 \]
and describe the properties of \(g\) quantitatively.
Since gradient descent runs for many rounds, it visits many different \(\bm w, \bm w'\), so the property has to hold at every location.
Definition (Smoothness assumption). \(\exists L, \text{ s.t. } \forall \bm w, \bm w',\ |g(\bm \xi)| \leq L\). In other words,
\[|f(\bm w') - f(\bm w) - \langle \nabla f(\bm w), \bm w' - \bm w\rangle| \leq \frac{L}2 |\bm w' - \bm w|^2 \]
This assumption is quite natural, and only slightly stronger than being \(\mathscr C^2\); on a bounded closed set the two are equivalent.
Smoothness says that at every point the function is bounded by a quadratic. From the gradient's point of view this is equivalent to the gradient being Lipschitz continuous, and from the Hessian's point of view it is equivalent to the norm of the Hessian being bounded.
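As a sanity check, here is a minimal numpy sketch (my own illustration, not from the original post) that verifies all three views on a toy logistic loss: the quadratic bound on \(f\), the Lipschitz bound on \(\nabla f\), and the Hessian-norm bound that supplies \(L\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic loss (my choice for illustration; any C^2 function would do).
X = rng.normal(size=(100, 5))
y = rng.choice([-1.0, 1.0], size=100)
N = len(y)

def f(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))   # mean log(1 + e^{-y_i <x_i, w>})

def grad(w):
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))              # sigma(-y_i <x_i, w>)
    return X.T @ (-y * p) / N

# Hessian view: nabla^2 f = (1/N) X^T diag(s(1-s)) X with s(1-s) <= 1/4,
# so L = ||X^T X||_2 / (4N) is a global smoothness constant.
L = np.linalg.norm(X.T @ X, 2) / (4 * N)

# Function view and gradient view of L-smoothness, checked on random pairs.
for _ in range(5):
    w, w2 = rng.normal(size=5), rng.normal(size=5)
    quad_gap = abs(f(w2) - f(w) - grad(w) @ (w2 - w))
    lip_gap = np.linalg.norm(grad(w2) - grad(w))
    assert quad_gap <= L / 2 * np.linalg.norm(w2 - w) ** 2 + 1e-9
    assert lip_gap <= L * np.linalg.norm(w2 - w) + 1e-9
print("both inequalities hold, L =", round(L, 4))
```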
Proposition. The gradient being \(L\)-Lipschitz continuous is equivalent to \(|\nabla^2 f(\bm x)| \leq L\), where \(|\nabla^2 f(\bm x)|\) denotes the Euclidean (spectral) norm of the Hessian, i.e. \(\max_{|\bm x| = 1} |H\bm x| = |\lambda|_{\max}\). The gradient being \(L\)-Lipschitz continuous means
\[|\nabla f(\bm w) - \nabla f(\bm w')| \leq L |\bm w - \bm w'| \]
Proof. First suppose \(|\nabla^2 f| \leq L\) everywhere. By the fundamental theorem of calculus,
\[\begin{aligned} |\nabla f(\bm w') - \nabla f(\bm w)| &= \left|\int^1_0 \nabla^2 f(\bm w + \tau(\bm w' - \bm w))(\bm w' - \bm w) \mathrm d \tau\right| \\ &= \left|\int^1_0 \nabla^2 f(\bm w + \tau(\bm w' - \bm w)) \mathrm d \tau \cdot (\bm w' - \bm w)\right| \\ &\leq \left|\int^1_0 \nabla^2 f(\bm w + \tau(\bm w' - \bm w))\mathrm d \tau\right| \cdot |\bm w' - \bm w| \\ &\leq \int^1_0 |\nabla^2 f(\bm w + \tau(\bm w' - \bm w))|\mathrm d \tau \cdot |\bm w' - \bm w| \\ &\leq L|\bm w' - \bm w| \end{aligned}\]
For the converse, fix any unit vector \(\bm v\). Since \(f\) is \(\mathscr C^2\),
\[|\nabla^2 f(\bm w)\bm v| = \lim_{\alpha \to 0^+}\frac{|\nabla f(\bm w + \alpha \bm v) - \nabla f(\bm w)|}{\alpha} \leq L, \]
and taking the maximum over \(|\bm v| = 1\) gives \(|\nabla^2 f(\bm w)| \leq L\).
Proposition. \(L\)-smoothness is equivalent to the gradient being \(L\)-Lipschitz continuous.
Proof. Suppose first that the gradient is \(L\)-Lipschitz. By the fundamental theorem of calculus,
\[\begin{aligned} f(\bm w') &= f(\bm w) + \int^1_0 \langle \nabla f(\bm w + \tau(\bm w' - \bm w)), \bm w' - \bm w \rangle \mathrm d \tau \\ &= f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \int^1_0 \langle \nabla f(\bm w + \tau(\bm w' - \bm w)) - \nabla f(\bm w), \bm w' - \bm w \rangle \mathrm d \tau \\ &\leq f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \int^1_0 |\nabla f(\bm w + \tau(\bm w' - \bm w)) - \nabla f(\bm w)| \cdot |\bm w' - \bm w | \mathrm d \tau && \text{Cauchy-Schwarz} \\ &\leq f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \int^1_0 L\tau|\bm w' - \bm w| \cdot |\bm w' - \bm w | \mathrm d \tau \\ &= f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \frac L2|\bm w' - \bm w|^2 \end{aligned}\]
The lower bound follows in exactly the same way (Cauchy-Schwarz also bounds the inner product from below by \(-L\tau|\bm w' - \bm w|^2\)), so gradient Lipschitz continuity implies \(L\)-smoothness. Conversely, suppose \(f\) is \(L\)-smooth. By Taylor's theorem with the Lagrange remainder, there is some \(\bm \xi\) on the segment between \(\bm w\) and \(\bm w'\) such that
\[f(\bm w') = f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \frac 12 \langle\nabla^2 f(\bm \xi)(\bm w' - \bm w), \bm w' - \bm w \rangle \]
\[|f(\bm w')- f(\bm w) - \langle \nabla f(\bm w), \bm w' - \bm w\rangle| = \frac 12 |\langle \nabla^2 f(\bm \xi)(\bm w' - \bm w), \bm w' - \bm w \rangle|\leq \frac L2|\bm w' - \bm w|^2 \]
Set \(\bm w' = \bm w + t \bm v\) with \(|\bm v| = 1\); dividing both sides by \(\frac{t^2}2\) gives
\[|\langle \nabla^2 f(\bm \xi_t) \bm v, \bm v\rangle| \leq L \]
where \(\bm \xi_t\) lies on the segment between \(\bm w\) and \(\bm w + t\bm v\). Letting \(t \to 0^+\) and using that \(f\) is \(\mathscr C^2\) (so \(\nabla^2 f\) is continuous), we get
\[|\langle \nabla^2 f(\bm w) \bm v, \bm v\rangle| \leq L \]
Note that \(\nabla^2 f(\bm w)\) is a self-adjoint matrix, so
\[\max_{|\bm v| = 1} |\nabla^2 f(\bm w)\bm v| = \max_{|\bm v| = 1} |\langle \nabla^2 f(\bm w) \bm v, \bm v\rangle| = |\lambda|_{\max} \leq L \]
By the previous proposition, the claim follows.
Back to gradient descent. For a smooth \(f\), we have
\[\begin{cases} f(\bm w') \leq f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \frac L2 | \bm w' - \bm w |^2 \\ f(\bm w') \geq f(\bm w) + \langle \nabla f(\bm w), \bm w' - \bm w \rangle - \frac L2 | \bm w' - \bm w |^2 \\ \end{cases}\]
This sandwiches \(f\) between an upper and a lower quadratic as we move from \(\bm w\) to some \(\bm w'\), like so (crude hand-drawn figure).
The lower bound is not important; what we care about is the upper bound. When \(\bm w\) and \(\bm w'\) are close enough, \(f\) always decreases. Quantitatively, suppose gradient descent uses learning rate \(\eta\), i.e. \(\bm w' = \bm w - \eta \nabla f(\bm w)\). Then
\[\begin{aligned} f(\bm w') - f(\bm w) &\leq \langle \nabla f(\bm w), \bm w' - \bm w \rangle + \frac L2 |\bm w' - \bm w|^2 \\ &= \langle \nabla f(\bm w), - \eta \nabla f(\bm w)\rangle + \frac{L\eta^2}2 |\nabla f(\bm w)|^2 \\ &= -\eta\left(1 - \frac{L\eta}2\right) |\nabla f(\bm w)|^2 \end{aligned}\]
Hence when \(\eta < \frac 2L\), this quantity is negative (whenever \(\nabla f(\bm w) \neq 0\)), which guarantees that every gradient step makes progress.
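A minimal numerical sketch of this guarantee (my own toy example, not from the original post): on a quadratic whose smoothness constant is exactly the largest Hessian eigenvalue, every step with \(\eta < \frac 2L\) decreases \(f\) by at least \(\eta(1 - \frac{L\eta}2)|\nabla f(\bm w)|^2\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy quadratic f(w) = 0.5 w^T A w - b^T w; its smoothness constant is L = lambda_max(A).
M = rng.normal(size=(10, 10))
A = M @ M.T + np.eye(10)           # symmetric positive definite
b = rng.normal(size=10)
L = np.linalg.eigvalsh(A).max()

f = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b

eta = 1.9 / L                       # any eta < 2/L guarantees progress
w = rng.normal(size=10)
for _ in range(10):
    g = grad(w)
    w_next = w - eta * g
    drop = f(w_next) - f(w)
    # Descent lemma: the drop is at most -eta * (1 - L*eta/2) * |grad|^2 < 0.
    assert drop <= -eta * (1 - L * eta / 2) * (g @ g) + 1e-9
    w = w_next
print("f decreased every step; final f =", round(f(w), 4))
```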
But this assumption alone is not enough. First, GD may fall into a local optimum; second, although every step makes progress, there is no guarantee on the global convergence rate. Consider \(f(x) = \mathrm{sigmoid}(x)\): starting from a very large \(x\) and descending toward the middle, the gradient is exponentially small, so progress is exponentially slow. This forces us to ask for more global structure on the function.
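A quick sketch of that sigmoid plateau (again my own illustration): GD with a perfectly safe step size barely moves when started far out on the flat part.

```python
import numpy as np

# GD on f(x) = sigmoid(x), starting far out on the flat right plateau.
# The gradient there is roughly e^{-x}, so each step moves by about eta * e^{-x}:
# per-step progress is guaranteed, but it is exponentially small.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
grad = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # f'(x)

x, eta = 20.0, 1.0                                 # eta = 1 is safe since |f''| << 2
for t in range(1, 10_001):
    x -= eta * grad(x)
    if t in (1, 10, 100, 1000, 10_000):
        print(f"t={t:>6}  x={x:.12f}")             # x barely moves away from 20
```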
Definition. A function \(f\) is convex if \(f(t\bm x_1 + (1 - t)\bm x_2) \leq tf(\bm x_1) + (1 - t)f(\bm x_2)\) for all \(\bm x_1, \bm x_2\) and \(t \in [0, 1]\).
It has several equivalent characterizations, all covered in calculus class; the one we will use below is the first-order condition \(f(\bm x_2) \geq f(\bm x_1) + \langle \nabla f(\bm x_1), \bm x_2 - \bm x_1\rangle\) for differentiable \(f\).
Proposition. If \(f\) is a \(\mathscr C^2\) function, then convexity is equivalent to \(\nabla^2 f(\bm w)\) being positive semi-definite.
In other words, smoothness bounds \(|\lambda|_{\max}\) of the Hessian, while convexity pins down the sign of \(\lambda_{\min}\).
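A tiny eigenvalue check of this picture (my own examples, not from the post): one convex smooth function and one smooth but non-convex saddle.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two toy functions of w in R^2:
#   f1(w) = 0.5|w|^2 + log(1 + e^{w_0})  -- convex and smooth
#   f2(w) = w_0^2 - w_1^2                -- smooth but not convex (saddle)
def hess_f1(w):
    s = 1.0 / (1.0 + np.exp(-w[0]))
    H = np.eye(len(w))
    H[0, 0] += s * (1 - s)
    return H

def hess_f2(w):
    return np.diag([2.0, -2.0])

for name, hess in [("f1", hess_f1), ("f2", hess_f2)]:
    w = rng.normal(size=2)
    eig = np.linalg.eigvalsh(hess(w))
    print(name, "lambda_min =", eig.min().round(3),
          "|lambda|_max =", np.abs(eig).max().round(3))
# f1: lambda_min >= 0 (convex) and |lambda|_max <= 1.25 (smooth);
# f2: |lambda|_max = 2 (smooth) but lambda_min = -2 < 0 (not convex).
```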
Convexity is what buys us a convergence rate.
Proposition. Let \(f\) be convex and \(L\)-smooth, and \(\bm w^* = \operatorname{argmin}_{\bm w} f(\bm w)\). Running \(t\) rounds of gradient descent with learning rate \(\eta \leq \frac 1L\) gives
\[f(\bm w_t) \leq f(\bm w^*) + \frac 1{2\eta t}|\bm w_0 - \bm w^*|^2 \]
Proof. We use a telescoping argument.
\[\begin{aligned} f(\bm w_{i+1}) &\leq f(\bm w_i) - \eta\left(1 - \frac{L\eta}2\right) |\nabla f(\bm w_i)|^2 && \text{Smoothness}\\ &\leq f(\bm w_i) - \frac \eta 2|\nabla f(\bm w_i)|^2 && \eta \leq \frac 1L \\ &\leq f(\bm w^*) + \langle \nabla f(\bm w_i), \bm w_i - \bm w^*\rangle - \frac \eta 2 |\nabla f(\bm w_i)|^2 && \text{Convexity} \\ &= f(\bm w^*) - \frac 1{\eta} \langle \bm w_{i+1} - \bm w_i, \bm w_i - \bm w^*\rangle - \frac 1{2\eta} |\bm w_i - \bm w_{i+1}|^2 && \text{GD update} \\ &= f(\bm w^*) + \frac 1{2 \eta} |\bm w_i - \bm w^*|^2 - \frac 1{2\eta}\left(|\bm w_i - \bm w^*|^2 - 2 \langle \bm w_i - \bm w_{i+1}, \bm w_i - \bm w^* \rangle + |\bm w_i - \bm w_{i+1}|^2\right) && \text{complete the square} \\ &= f(\bm w^*) + \frac 1{2 \eta} |\bm w_i - \bm w^*|^2 - \frac 1{2\eta} |(\bm w_i - \bm w_{i+1}) - (\bm w_i - \bm w^*)|^2 \\ &= f(\bm w^*) + \frac 1{2 \eta} \left(|\bm w_i - \bm w^*|^2 - |\bm w_{i+1} - \bm w^*|^2\right) \end{aligned}\]
Summing over \(i = 0, \dots, t-1\), the right-hand side telescopes:
\[\sum_{i=0}^{t - 1} \left(f(\bm w_{i+1}) - f(\bm w^*)\right) \leq \frac 1{2\eta} \left(|\bm w_0 - \bm w^*|^2 - |\bm w_t - \bm w^*|^2\right) \leq \frac 1{2\eta} |\bm w_0 - \bm w^*|^2 \]
Since \(f(\bm w_i)\) is non-increasing,
\[f(\bm w_t) \leq f(\bm w^*) + \frac 1{2\eta t}|\bm w_0 - \bm w^*|^2 \]
Taking \(\eta = \frac 1L\) and a total number of training rounds
\[T = \frac{L|\bm w_0 - \bm w^*|^2}{2 \epsilon} \]
we obtain \(f(\bm w_T) \leq f(\bm w^*) + \epsilon\).
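Here is a small numpy sketch checking the \(\frac 1{2\eta t}\) bound on a least-squares problem (my own example; the closed-form minimizer makes \(f(\bm w^*)\) easy to compute).

```python
import numpy as np

rng = np.random.default_rng(3)

# Least squares f(w) = |Xw - y|^2 / (2N): convex, L-smooth with L = ||X^T X||_2 / N,
# and w* has a closed form, so the bound can be checked directly.
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
N = len(y)

f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y) / N
L = np.linalg.norm(X.T @ X, 2) / N

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
f_star = f(w_star)

eta = 1.0 / L
w0 = w = rng.normal(size=10)
for t in range(1, 201):
    w = w - eta * grad(w)
    bound = np.linalg.norm(w0 - w_star) ** 2 / (2 * eta * t)
    assert f(w) - f_star <= bound + 1e-12      # the O(1/t) guarantee holds every round
print("suboptimality", f(w) - f_star, "<= bound", bound)
```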
Next, consider a very common trick: stochastic gradient descent (SGD). If each step computes the gradient only on a small batch of data, we have to revisit convergence.
\[\bm w_{t+1} = \bm w_t - \eta \bm G_t \]
\[\mathbb E[\bm G_t] = \nabla f(\bm w_t) \]
where
\[\nabla f(\bm w, \bm X, \bm Y) = \frac 1N\sum_i \nabla l(\bm w, x_i, y_i) \]
\[\bm G_t = \frac 1{|S|} \sum_{i \in S} \nabla l(\bm w_t, x_i, y_i) \]
If \(S\) is chosen at random, we no longer need to track where \(\bm G_t\) comes from; we can simply treat it as a random variable.
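A minimal sketch of this viewpoint (my own toy setup with a squared per-example loss): the mini-batch gradient is an unbiased estimate of the full gradient, and its variance is what \(\sigma^2\) will bound below.

```python
import numpy as np

rng = np.random.default_rng(4)

# Per-example loss l(w, x_i, y_i) = (x_i . w - y_i)^2 / 2; S drawn uniformly at random.
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)
N, batch = len(y), 32

full_grad = lambda w: X.T @ (X @ w - y) / N

def minibatch_grad(w):
    S = rng.choice(N, size=batch, replace=False)
    return X[S].T @ (X[S] @ w - y[S]) / batch

w = rng.normal(size=5)
samples = np.array([minibatch_grad(w) for _ in range(20_000)])
print("E[G_t] (empirical):", samples.mean(axis=0).round(3))    # matches the full gradient
print("grad f(w)         :", full_grad(w).round(3))
print("Var(G_t) estimate :", ((samples - full_grad(w)) ** 2).sum(axis=1).mean().round(3))
```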
Proposition. Let \(f\) be a convex \(L\)-smooth function and \(\bm w^* = \operatorname{argmin}_{\bm w} f(\bm w)\). Running \(t\) rounds of SGD with learning rate \(\eta \leq \frac 1L\) and with \(\mathrm{Var}(\bm G_t) \leq \sigma^2\) gives
\[\mathbb E[f(\overline{\bm w_t})] \leq f(\bm w^*) + \frac{|\bm w_0 - \bm w^*|^2}{2t\eta} + \eta \sigma^2 \]
where \(\overline{\bm w_t} = \frac 1t \sum_{i=1}^t \bm w_i\).
Proof. The idea is to reduce everything to the same form as in GD: handle the first-order term with linearity of expectation, and the second-order term with the variance \(\mathrm{Var}(\bm G_t) = \mathbb E|\bm G_t|^2 - |\mathbb E \bm G_t|^2 = \mathbb E|\bm G_t|^2 - |\nabla f(\bm w_t)|^2\). We repeatedly convert between \(\bm G_i\) and \(\nabla f(\bm w_i)\), separating the deterministic part from the stochastic part. (Expectations below are over \(\bm G_i\), conditioned on \(\bm w_i\).)
\[\begin{aligned} \mathbb E[f(\bm w_{i+1})] &\leq f(\bm w_i) + \mathbb E\langle \nabla f(\bm w_i), \bm w_{i+1} - \bm w_i \rangle + \frac L2\mathbb E|\bm w_{i+1} - \bm w_i|^2 && \text{Smoothness} \\ &= f(\bm w_i) + \langle \nabla f(\bm w_i), \mathbb E(\bm w_{i+1} - \bm w_i) \rangle + \frac L2\mathbb E|\bm w_{i+1} - \bm w_i|^2 && \text{linearity of expectation} \\ &= f(\bm w_i) - \eta \langle \nabla f(\bm w_i), \nabla f(\bm w_i) \rangle + \frac{L\eta^2}2 \mathbb E|\bm G_i|^2 \\ &= f(\bm w_i) - \eta \langle \nabla f(\bm w_i), \nabla f(\bm w_i) \rangle + \frac{L\eta^2}2\left(|\nabla f(\bm w_i) |^2 + \mathrm{Var}(\bm G_i)\right) && \mathrm{Var}(\bm G_i) = \mathbb E|\bm G_i|^2 - |\nabla f(\bm w_i)|^2 \\ &\leq f(\bm w_i) - \eta \left(1 - \frac{L \eta}2\right) |\nabla f(\bm w_i)|^2 + \frac{L\eta^2}2 \sigma^2 \\ &\leq f(\bm w_i) - \frac \eta 2 |\nabla f(\bm w_i)|^2 + \frac \eta 2 \sigma^2 && \eta \leq \frac 1L \\ &\leq f(\bm w^*) + \langle \nabla f(\bm w_i), \bm w_i - \bm w^* \rangle - \frac \eta 2|\nabla f(\bm w_i)|^2 + \frac \eta 2\sigma^2 && \text{Convexity} \\ &\leq f(\bm w^*) - \frac 1\eta\mathbb E\langle \bm w_{i+1} - \bm w_i, \bm w_i - \bm w^* \rangle - \frac \eta 2\mathbb E|\bm G_i|^2 + \eta\sigma^2 && |\nabla f(\bm w_i)|^2 = \mathbb E|\bm G_i|^2 - \mathrm{Var}(\bm G_i) \\ &= f(\bm w^*) + \frac 1{2\eta} \mathbb E\left(|\bm w_i - \bm w^*|^2 - |\bm w_{i+1} - \bm w^*|^2\right) + \eta \sigma^2 && \text{same as GD} \end{aligned}\]
\[\begin{aligned} \mathbb E[f(\overline{\bm w_t})] - f(\bm w^*) &= \mathbb E f\left(\frac 1t\sum_{i=1}^t \bm w_i\right) - f(\bm w^*) \\ &\leq \frac 1t\mathbb E\left(\sum_{i=1}^t f(\bm w_i)\right) - f(\bm w^*) && \text{Jensen's Ineq} \\ &\leq \frac 1t\sum_{i=0}^{t - 1} \left(\mathbb E f(\bm w_{i+1}) - f(\bm w^*)\right) \\ &\leq \frac 1{2\eta t}\left(|\bm w_0 - \bm w^*|^2 - \mathbb E|\bm w_t - \bm w^*|^2\right) + \eta \sigma^2 \\ &\leq \frac 1{2\eta t}|\bm w_0 - \bm w^*|^2 + \eta \sigma^2 \end{aligned}\]
Setting (assuming \(\epsilon\) is small enough that \(\eta \leq \frac 1L\))
\[T = \frac{2 |\bm w_0 - \bm w^*|^2 \sigma^2}{\epsilon^2}, \eta = \frac \epsilon {2\sigma^2} \]
we obtain \(\mathbb E f(\overline{\bm w_T}) \leq f(\bm w^*) + \epsilon\).
In other words, the error term \(\eta\sigma^2\) does not shrink with \(t\), so it can only be reduced by shrinking the learning rate. This is why GD enjoys a \(\frac 1T\) convergence rate while SGD only gets \(\frac 1{\sqrt T}\).
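To see the \(\eta\sigma^2\) floor concretely, here is a last numpy sketch (my own example, not from the post): on a noisy least-squares problem, SGD with a fixed step size plateaus, and shrinking \(\eta\) lowers the plateau at the cost of slower progress.

```python
import numpy as np

rng = np.random.default_rng(5)

# GD vs SGD on the same least-squares problem: with a fixed step size, SGD's
# suboptimality plateaus at a level on the order of eta * sigma^2.
X = rng.normal(size=(500, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=500)        # noisy labels
N = len(y)

f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
full_grad = lambda w: X.T @ (X @ w - y) / N
L = np.linalg.norm(X.T @ X, 2) / N
f_star = f(np.linalg.lstsq(X, y, rcond=None)[0])

def run(eta, stochastic, steps=3000, batch=8):
    w, tail = np.zeros(10), []
    for s in range(steps):
        if stochastic:
            S = rng.choice(N, size=batch, replace=False)
            g = X[S].T @ (X[S] @ w - y[S]) / batch
        else:
            g = full_grad(w)
        w -= eta * g
        if s >= steps - 500:                       # average the tail to smooth the noise
            tail.append(f(w) - f_star)
    return np.mean(tail)

print("GD,  eta=1/L      :", run(1 / L, stochastic=False))
print("SGD, eta=1/L      :", run(1 / L, stochastic=True))
print("SGD, eta=1/(10 L) :", run(0.1 / L, stochastic=True))   # lower plateau, slower start
```

In this run the SGD plateau shrinks roughly in proportion to \(\eta\), matching the \(\eta\sigma^2\) term in the bound.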