These are my notes and summary from studying convex optimization, posted here mainly for my own reference. Of course, if they happen to help anyone who stumbles in, all the better.

This post covers the convergence analysis of gradient descent for unconstrained smooth convex optimization.

Previous post: convex functions, Lipschitz smooth, strongly convex

Gradient descent

Gradient descent is a class of highly effective first-order optimization algorithms. An unconstrained convex optimization problem can be written as

$$\min \, f(x)$$

The most basic gradient descent iteration is:

$$x_{t+1} = x_t - \gamma \nabla f(x_t)$$

where $\gamma$ is the step size and $\nabla f(x_t)$ is the gradient. This iteration can be interpreted as minimizing a quadratic approximation of $f(x)$:

$$f(x) \approx f(x_t) + \langle \nabla f(x_t), x-x_t \rangle + \frac{1}{2\gamma} \|x-x_t\|^2$$

Setting the derivative of the right-hand side to zero gives

$$\nabla f(x_t) + \frac{1}{\gamma}(x-x_t)=0,$$

which is exactly the gradient descent update.
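The update above can be sketched numerically. This is a minimal illustration of my own (function and variable names are mine, not from the post), minimizing a simple smooth convex quadratic with step size $\gamma = 1/L$:

```python
import numpy as np

def gradient_descent(grad, x0, gamma, steps):
    """Plain gradient descent: x_{t+1} = x_t - gamma * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - gamma * grad(x)
    return x

# Example: f(x) = 0.5 * x^T A x with A positive definite, so
# grad f(x) = A x, the minimizer is x_star = 0, and L = lambda_max(A).
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
L = 10.0  # largest eigenvalue of A

x = gradient_descent(grad, x0=[1.0, 1.0], gamma=1.0 / L, steps=500)
print(np.linalg.norm(x))  # close to 0: the iterates approach x_star
```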

Convergence analysis

For convenience of derivation, adopt the following shorthand:

$f_t := f(x_t)$

$f'_t := \nabla f(x_t)$

$r_t := \| x_t-x_\star \|$

$\Delta_t := f_t - f_\star$

where $x_\star := \arg\min \, f(x)$.

First, the table below summarizes the convergence rates of plain gradient descent.

| | $L$-smooth convex | $L$-smooth $\mu$-strongly convex |
|---|---|---|
| $\Delta_t \leq$ | $\frac{2L}{t} r_0^2$ | $\frac{L}{2}\big(\frac{k-1}{k+1}\big)^{2t} r_0^2$ |
| steps for $\Delta_t \leq \epsilon$ | $O\big(\frac{1}{\epsilon}\big)$, sub-linear | $O\big(k \log \frac{1}{\epsilon}\big)$, linear |

where $k:=\frac{L}{\mu}$ is called the condition number.

L-smooth convex function

We first prove the following three results:

  1. $r_{t+1}^2 \leq r_t^2 - \gamma\big(\frac{2}{L} - \gamma\big) \| f'_t \|^2$
  2. $\Delta_{t+1} - \Delta_t \leq - \gamma \big(1-\frac{L\gamma}{2}\big) \| f'_t\|^2$
  3. $\Delta_t \leq r_t \| f'_t\|$

Proof.

(1)

$$\begin{aligned} r_{t+1}^2 &= \| x_{t+1}-x_\star \|^2 \\ \text{(GD iteration)} \quad &= \| x_t - \gamma f'_t - x_\star \|^2 \\ &= r_t^2 - 2\gamma \langle f'_t, x_t- x_\star \rangle + \gamma^2 \| f'_t\|^2 \\ \text{(co-coercivity)} \quad &\leq r_t^2 - \frac{2\gamma}{L}\| f'_t\|^2 + \gamma^2 \| f'_t\|^2 \\ &= r_t^2 - \gamma\Big(\frac{2}{L} - \gamma\Big) \| f'_t \|^2 \end{aligned}$$

If $\gamma \in [0, \frac{2}{L}]$, then $r_{t+1}^2 \leq r_t^2$. Clearly, the optimal step size is $\gamma = \frac{1}{L}$.

(2)

$$\begin{aligned} \text{(quadratic upper bound)} \quad f_{t+1} &\leq f_t + \langle f'_t, x_{t+1} - x_t \rangle + \frac{L}{2} \| x_{t+1} - x_t \|^2 \\ \text{(GD iteration)} \quad &= f_t - \gamma \| f'_t\|^2 + \frac{L\gamma^2}{2} \| f'_t\|^2 \\ &= f_t - \gamma \Big(1-\frac{L\gamma}{2}\Big) \| f'_t\|^2 \\ \implies \quad \Delta_{t+1} - \Delta_t &\leq - \gamma \Big(1-\frac{L\gamma}{2}\Big) \| f'_t\|^2 \end{aligned}$$

(3)

$$\begin{aligned} \text{(first-order condition)} \quad \Delta_t = f_t - f_\star &\leq \langle f'_t, x_t - x_\star \rangle \\ \text{(Cauchy-Schwarz)} \quad &\leq r_t \| f'_t\| \end{aligned}$$

Let $h=\gamma \big(1-\frac{L\gamma}{2}\big)$. If $\gamma \in [0, \frac{2}{L}]$, then $h \geq 0$; $h$ attains its maximum $\frac{1}{2L}$ at $\gamma = \frac{1}{L}$.

We can now analyze the convergence of gradient descent in the $L$-smooth convex case.

$$\begin{aligned} \text{(since 2)} \quad \Delta_{t+1} - \Delta_t &\leq - h \| f'_t\|^2 \\ \text{(since 3)} \quad &\leq -\frac{h}{r_t^2} \Delta_t^2 \\ \text{(since 1)} \quad &\leq -\frac{h}{r_0^2} \Delta_t^2 \\ \text{(divide by } \Delta_t \Delta_{t+1}\text{)} \quad \frac{1}{\Delta_{t}} - \frac{1}{\Delta_{t+1}} &\leq -\frac{h}{r_0^2} \frac{\Delta_t}{\Delta_{t+1}} \\ \text{(since 2, } \Delta_{t+1} \leq \Delta_t\text{)} \quad &\leq -\frac{h}{r_0^2} \end{aligned}$$

Summing these inequalities over iterations, the middle terms telescope, giving

$$\begin{aligned} \frac{1}{\Delta_t} &\geq \frac{1}{\Delta_0} + \frac{ht}{r_0^2} \geq \frac{ht}{r_0^2} \\ \implies \Delta_t &\leq \frac{r_0^2}{ht} \\ \text{(use max } h=\tfrac{1}{2L}\text{)}\quad &= \frac{2L r_0^2}{t} \end{aligned}$$

$$f_t - f_\star \leq \frac{2L}{t} \|x_0 - x_\star \|^2$$

Hence reaching $f_t - f_\star \leq \epsilon$ requires $O\big(\frac{2L}{\epsilon} \|x_0 - x_\star \|^2\big) = O\big(\frac{1}{\epsilon}\big)$ iterations. This is sub-linear convergence.
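The $O(1/t)$ bound can be sanity-checked numerically. In this sketch (my own example, not from the post), gradient descent with $\gamma = 1/L$ runs on a smooth convex least-squares objective, and $\Delta_t \leq \frac{2L}{t} r_0^2$ is verified at every step:

```python
import numpy as np

# f(x) = 0.5 * ||A x - b||^2 is L-smooth convex with L = lambda_max(A^T A).
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
b = np.array([1.0, -1.0, 2.0])
L = np.linalg.eigvalsh(A.T @ A).max()

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizer of f
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.array([5.0, -5.0])
r0_sq = np.sum((x - x_star) ** 2)
f_star = f(x_star)

ok = True
for t in range(1, 201):
    x = x - grad(x) / L                       # step size gamma = 1/L
    ok &= f(x) - f_star <= 2 * L * r0_sq / t + 1e-12
print(ok)  # True: Delta_t <= (2L/t) * r_0^2 holds along the trajectory
```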

L-smooth and μ-strongly convex function

We first prove

Proposition 1. Suppose $f$ is $L$-smooth and $\mu$-strongly convex. Then we have

$$\langle f'_x -f'_y, x-y \rangle \geq \frac{\mu L}{\mu+L} \| x-y\|^2 + \frac{1}{\mu+L} \| f'_x -f'_y\|^2.$$

Proof.

(1) $\mu = L$. Add the co-coercivity of $L$-smoothness to the coercivity of $\mu$-strong convexity and divide by two.

(2) $\mu < L$.

First let $\phi(x) = f(x) - \frac{\mu}{2} \| x\|^2$; it is easy to show that $\phi$ is convex and $(L-\mu)$-smooth.

Applying co-coercivity to $\phi$,

$$\begin{aligned} \langle \phi'_x - \phi'_y, x-y\rangle &\geq \frac{1}{L-\mu} \| \phi'_x - \phi'_y\|^2 \\ \langle f'_x - \mu x - f'_y + \mu y, x-y\rangle &\geq \frac{1}{L-\mu} \| f'_x - \mu x - f'_y + \mu y\|^2 \\ \langle f'_x - f'_y , x-y\rangle - \mu \| x-y\|^2 &\geq \frac{1}{L-\mu} \| f'_x - f'_y \|^2 -\frac{2\mu}{L-\mu}\langle f'_x - f'_y , x-y\rangle + \frac{\mu^2}{L-\mu} \| x-y\|^2 \\ \langle f'_x -f'_y, x-y \rangle &\geq \frac{\mu L}{\mu+L} \| x-y\|^2 + \frac{1}{\mu+L} \| f'_x -f'_y\|^2 \end{aligned}$$
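Proposition 1 can be sanity-checked numerically on a quadratic (a sketch with my own example values): for $f(x) = \frac{1}{2} x^\top A x$ with $A$ symmetric positive definite, $\nabla f(x) = Ax$, and $\mu$, $L$ are the extreme eigenvalues of $A$:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = 0.5 * x^T A x with A symmetric positive definite:
# grad f(x) = A x, mu = lambda_min(A), L = lambda_max(A).
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)
eigs = np.linalg.eigvalsh(A)
mu, L = eigs[0], eigs[-1]

ok = True
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = (A @ x - A @ y) @ (x - y)
    rhs = (mu * L / (mu + L)) * np.sum((x - y) ** 2) \
        + (1 / (mu + L)) * np.sum((A @ x - A @ y) ** 2)
    ok &= lhs >= rhs - 1e-9
print(ok)  # True: the inequality of Proposition 1 holds at random points
```

(For a quadratic the inequality reduces, in the eigenbasis of $A$, to $(\lambda-\mu)(\lambda-L) \leq 0$ for each eigenvalue $\lambda \in [\mu, L]$, which is why the check passes exactly.)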

Next we analyze the convergence rate of gradient descent for $L$-smooth and $\mu$-strongly convex functions.

$$\begin{aligned} \|x_{t+1} - x_\star\|^2 &= \| x_t - \gamma f'_t - x_\star\|^2 \\ \text{(since } f'_\star = 0\text{)} \quad &= \| x_t - x_\star\|^2 - 2\gamma \langle f'_t - f'_\star, x_t-x_\star \rangle + \gamma^2\| f'_t\|^2 \\ \text{(Proposition 1)}\quad &\leq \| x_t - x_\star\|^2 - \frac{2\gamma \mu L}{\mu+L}\| x_t - x_\star\|^2 - \frac{2\gamma}{\mu+L} \| f'_t\|^2 + \gamma^2\| f'_t\|^2 \\ \text{(let } \gamma=\tfrac{2}{\mu+L}\text{)}\quad &= \bigg(\frac{L-\mu}{L+\mu}\bigg)^2 \| x_t - x_\star\|^2 \\ \text{(let } k=\tfrac{L}{\mu}\text{)}\quad &= \bigg(\frac{k-1}{k+1}\bigg)^2 \| x_t - x_\star\|^2 \end{aligned}$$

$$\| x_t - x_\star\|^2 \leq \bigg(\frac{k-1}{k+1}\bigg)^{2t} \| x_0 - x_\star\|^2$$

By the quadratic upper bound from $L$-smoothness,

$$f_t - f_\star \leq \frac{L}{2} \| x_t - x_\star\|^2 \leq \frac{L}{2} \bigg(\frac{k-1}{k+1}\bigg)^{2t} \| x_0 - x_\star\|^2$$

$k$ is called the condition number; it is the ratio of the largest to the smallest eigenvalue of the Hessian of $f$. Since $\frac{k-1}{k+1} = 1 - \frac{2}{k+1} \leq 1 - \frac{1}{k}$ for $k \geq 1$, together with the inequality $1-a \leq e^{-a}$ we get

$$f_t - f_\star \leq \frac{L}{2} \bigg(1- \frac{1}{k}\bigg)^{2t}\| x_0 - x_\star\|^2 \leq \frac{L}{2} e^{-\frac{2t}{k}} \| x_0 - x_\star\|^2$$

Hence reaching $f_t - f_\star \leq \epsilon$ requires $O\big(k \log (\frac{1}{\epsilon})\big)$ iterations. This is linear convergence, far faster than sub-linear convergence. It also shows that the larger the condition number $k$, the worse the convergence rate.
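To see the linear rate in action, here is a small experiment of my own (example values are mine): gradient descent with $\gamma = \frac{2}{\mu+L}$ on a quadratic, checking the per-step contraction $\|x_t - x_\star\| \leq \big(\frac{k-1}{k+1}\big)^t \|x_0 - x_\star\|$:

```python
import numpy as np

# f(x) = 0.5 * x^T A x: mu-strongly convex and L-smooth with
# mu = lambda_min(A), L = lambda_max(A); the minimizer is x_star = 0.
A = np.diag([1.0, 4.0, 25.0])
mu, L = 1.0, 25.0
k = L / mu
gamma = 2.0 / (mu + L)
rho = (k - 1) / (k + 1)          # per-step contraction factor

x = np.array([1.0, 1.0, 1.0])
r0 = np.linalg.norm(x)

ok = True
for t in range(1, 101):
    x = x - gamma * (A @ x)
    ok &= np.linalg.norm(x) <= rho ** t * r0 + 1e-12
print(ok)  # True: ||x_t - x_star|| <= rho^t * ||x_0 - x_star||
```

With these eigenvalues the slowest coordinates contract at exactly $\rho = \frac{k-1}{k+1} = \frac{12}{13}$, so the bound is tight; increasing the spread of eigenvalues (larger $k$) pushes $\rho$ toward 1 and slows convergence.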