Optimization Methods Speedrun
Smoothness and Strong Convexity of Functions
If a function \(f(x)\) is \(L\)-smooth, its gradient is \(L\)-Lipschitz continuous: for any \(x, y\) in the domain,
\[||\nabla f(x) - \nabla f(y)|| \le L ||x - y||\]
An equivalent second-order (quadratic upper bound) form:
\[f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}||y - x||^2\]
Another equivalent condition is that \(\frac{L}{2}||x||^2 - f(x)\) is convex.
If a function \(f(x)\) is \(\mu\)-strongly convex, it is not only convex but at least as "curved" as a quadratic: for any \(x, y\),
\[f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle + \frac{\mu}{2}||y-x||^2\]
or, in terms of monotonicity of the gradient,
\[\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \mu ||x - y||^2\]
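As a quick numerical sanity check, both inequalities can be verified on a quadratic, where \(\mu\) and \(L\) are the extreme eigenvalues of the Hessian (a minimal sketch; the matrix and test points are illustrative):

```python
import numpy as np

# For f(x) = 1/2 x^T A x with A symmetric positive definite, grad f(x) = A x,
# and f is L-smooth / mu-strongly convex with L = lambda_max(A), mu = lambda_min(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                 # symmetric positive definite
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(5), rng.standard_normal(5)

# L-smoothness: the gradient is L-Lipschitz
smooth_ok = np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
# mu-strong convexity: the quadratic lower bound
strong_ok = f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(y - x) ** 2 - 1e-9

print(smooth_ok, strong_ok)
```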
The Bregman divergence is a generalized measure of the "distance" between two points \(x, y\):
\[D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle\]
\(\mu\)-strong convexity: provides a lower bound.
\(L\)-smoothness: provides an upper bound.
Convexity combined with \(L\)-smoothness (extremely common!):
\[\frac{1}{L}||\nabla f(x) - \nabla f(y)||^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle\]
If \(f\) is both \(\mu\)-strongly convex and \(L\)-smooth, the gradient difference is not only bounded by the distance but also satisfies the mixed bound
\[\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{\mu L}{\mu + L}||x - y||^2 + \frac{1}{\mu + L}||\nabla f(x) - \nabla f(y)||^2\]
Stochastic Gradient Descent (SGD) and Variance Decomposition
With massive datasets, computing the full gradient \(\nabla f(x)\) is too slow, so we use a small batch of data to form a "noisy" stochastic gradient \(g(x, \xi)\). This gradient must satisfy:
- Unbiasedness: \(\mathbb{E}[g^k \mid x^k] = \nabla f(x^k)\)
- Bounded variance: the noise must be controlled, \(\mathbb{E}[||\xi||^2] \le \sigma^2\).
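A minimal simulation of such an oracle, checking both conditions empirically (the quadratic, noise scale, and test point are illustrative):

```python
import numpy as np

# Stochastic gradient oracle g(x, xi) = grad f(x) + xi with zero-mean noise
# satisfying E||xi||^2 = sigma^2; here f(x) = 1/2 x^T A x, so grad f(x) = A x.
rng = np.random.default_rng(1)
A = np.diag([1.0, 4.0])
sigma = 0.5

x = np.array([1.0, -2.0])
# each coordinate has variance sigma^2 / 2, so E||xi||^2 = sigma^2
noise = rng.normal(0.0, sigma / np.sqrt(2), size=(200_000, 2))
samples = A @ x + noise                            # 200k draws of g(x, xi)

emp_mean = samples.mean(axis=0)                    # ~ grad f(x)   (unbiasedness)
emp_noise = (noise ** 2).sum(axis=1).mean()        # ~ sigma^2     (bounded variance)
print(emp_mean, emp_noise)
```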
Expanding the squared single-step distance for the update \(x^{k+1} = x^k - \gamma g^k\) and taking the conditional expectation:
\[\mathbb{E}_k\big[||x^{k+1} - x^*||^2\big] = ||x^k - x^*||^2 - 2\gamma \langle \nabla f(x^k), x^k - x^* \rangle + \gamma^2\, \mathbb{E}_k\big[||g^k||^2\big]\]
The AC Assumption:
There exist constants \(A \ge 0\) and \(C \ge 0\) such that for all \(k \ge 0\),
\[\mathbb{E}_k\big[||g^k - \nabla f(x^*)||^2\big] \le 2A\, D_f(x^k, x^*) + C\]
The Convergence Theorem of SGD:
If the stepsize satisfies \(0 < \gamma \le 1/A\), then
\[\mathbb{E}\big[||x^{k}-x^{*}||^{2}\big] \le (1-\gamma\mu)^{k}\,||x^{0}-x^{*}||^{2} + \frac{\gamma C}{\mu}\]
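The qualitative behavior of this theorem, linear decay down to a noise floor proportional to \(\gamma\), can be seen on a strongly convex quadratic (a sketch; all constants are illustrative):

```python
import numpy as np

# SGD on f(x) = 1/2 x^T A x (minimizer x* = 0) with additive gradient noise:
# E||x^k - x*||^2 decays linearly, then plateaus near a gamma-dependent floor.
rng = np.random.default_rng(2)
mu, L, sigma, gamma = 1.0, 10.0, 1.0, 0.01
A = np.diag([mu, L])

def run_sgd(steps):
    x = np.array([5.0, 5.0])
    for _ in range(steps):
        g = A @ x + rng.normal(0.0, sigma, size=2)   # unbiased noisy gradient
        x = x - gamma * g
    return np.linalg.norm(x) ** 2

early = np.mean([run_sgd(200) for _ in range(50)])
late = np.mean([run_sgd(3000) for _ in range(50)])
print(early, late)   # `late` sits near the noise floor, well below `early`
```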
Proximal Operators and Non-expansiveness
The proximal operator is an excellent tool for handling non-smooth convex functions. The standard definition: for a convex function \(R(x)\) and a parameter \(\gamma > 0\), the proximal operator at a point \(v\) is
\[\text{prox}_{\gamma R}(v) = \arg\min_{x} \Big\{ R(x) + \frac{1}{2\gamma}||x - v||^2 \Big\}\]
The proximal operator \(\text{prox}_{\gamma R}\) is non-expansive: for any \(u, v\), \(||\text{prox}_{\gamma R}(u) - \text{prox}_{\gamma R}(v)|| \le ||u - v||\).
A special property: if \(z=\text{prox}_{\gamma R}(v)\), then
\[v \in z + \gamma \partial R(z)\]
Conversely, if \(v \in z + \gamma \partial R(z)\), then \(z=\text{prox}_{\gamma R}(v)\).
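For \(R(x) = \lambda||x||_1\) the proximal operator has the well-known soft-thresholding closed form; a minimal sketch checking non-expansiveness numerically (the points are random illustrative data):

```python
import numpy as np

# prox_{t ||.||_1}(v) = sign(v) * max(|v| - t, 0): soft-thresholding, with
# t = gamma * lambda folded into a single threshold parameter.
def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
u, v = rng.standard_normal(10), rng.standard_normal(10)
t = 0.3

lhs = np.linalg.norm(prox_l1(u, t) - prox_l1(v, t))
rhs = np.linalg.norm(u - v)
print(lhs <= rhs + 1e-12)   # non-expansiveness
```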
Proximal gradient descent solves the composite problem \(\min_{x \in \mathbb{R}^d} F(x) = f(x) + R(x)\).
Update rule: \(x^{k+1} = \text{prox}_{\gamma R}(x^k - \gamma \nabla f(x^k))\).
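A minimal sketch of this iteration on a small lasso-type problem (\(f(x)=\frac{1}{2}||Bx-b||^2\), \(R=\lambda||\cdot||_1\); the data are illustrative), checking that the objective decreases monotonically when \(\gamma \le 1/L\):

```python
import numpy as np

# Proximal gradient descent x_{k+1} = prox_{gamma R}(x_k - gamma grad f(x_k))
# with f(x) = 1/2 ||B x - b||^2 (grad f(x) = B^T (B x - b), L = ||B^T B||_2)
# and R(x) = lam ||x||_1 (prox = soft-thresholding).
rng = np.random.default_rng(4)
B = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
L = np.linalg.norm(B.T @ B, 2)     # spectral norm = smoothness constant of f
gamma = 1.0 / L

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

F = lambda x: 0.5 * np.linalg.norm(B @ x - b) ** 2 + lam * np.abs(x).sum()

x = np.zeros(5)
vals = [F(x)]
for _ in range(300):
    x = prox_l1(x - gamma * B.T @ (B @ x - b), gamma * lam)
    vals.append(F(x))
print(vals[0], vals[-1])   # the objective never increases for gamma <= 1/L
```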
Using the standard properties of \(L\)-smoothness and \(\mu\)-strong convexity:
By Spring, 2025, C2 we have
\[\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{\mu L}{\mu + L}||x - y||^2 + \frac{1}{\mu + L}||\nabla f(x) - \nabla f(y)||^2\]
NAG
Goal: solve \(\min_{x} F(x) = f(x) + R(x)\), where \(f\) is convex and \(L\)-smooth and \(R\) is closed and convex.
The hallmark of NAG is that it anticipates the future position: it first looks one step ahead along the momentum direction and evaluates the gradient at the extrapolated point \(x_k + \beta(x_k - x_{k-1})\).
The iteration is:
\[y_k = x_k + \beta_k (x_k - x_{k-1}), \qquad x_{k+1} = \text{prox}_{\gamma R}\big(y_k - \gamma \nabla f(y_k)\big)\]
NAG can equivalently be written as the following three-step update:
Interpolation step: \(y_k = (1 - \alpha_{k+1})x_k + \alpha_{k+1}z_k\)
Gradient step: \(x_{k+1} = \text{prox}_{\gamma R}(y_k - \gamma \nabla f(y_k))\)
Auxiliary step: \(z_{k+1} = x_k + \frac{1}{\alpha_{k+1}}(x_{k+1} - x_k)\)
Here the momentum coefficient is expressed through the sequence \(\alpha_k\) as \(\beta_k = \frac{\alpha_{k+1}(1-\alpha_k)}{\alpha_k}\).
By the proximal definition of \(x_{k+1}\), we have \(y_k-\gamma \nabla f(y_k) \in x_{k+1} + \gamma \partial R(x_{k+1})\). Combined with the convexity of \(R\), for any \(x\),
\[R(x) \ge R(x_{k+1}) + \frac{1}{\gamma}\langle y_k - \gamma \nabla f(y_k) - x_{k+1},\, x - x_{k+1} \rangle\]
that is,
\[R(x_{k+1}) \le R(x) + \langle \nabla f(y_k), x - x_{k+1} \rangle + \frac{1}{\gamma}\langle y_k - x_{k+1},\, x_{k+1} - x \rangle\]
Meanwhile, since \(f\) is \(L\)-smooth and convex, taking the stepsize \(\gamma \le 1/L\),
\[f(x_{k+1}) \le f(y_k) + \langle \nabla f(y_k), x_{k+1} - y_k \rangle + \frac{1}{2\gamma}||x_{k+1} - y_k||^2 \le f(x) + \langle \nabla f(y_k), x_{k+1} - x \rangle + \frac{1}{2\gamma}||x_{k+1} - y_k||^2\]
Adding the two inequalities (the gradient terms cancel):
\[F(x_{k+1}) \le F(x) + \frac{1}{\gamma}\langle y_k - x_{k+1},\, x_{k+1} - x \rangle + \frac{1}{2\gamma}||x_{k+1} - y_k||^2\]
For the vectors \(a = y_k - x_{k+1}\) and \(b = x - x_{k+1}\), using the identity
\[\langle a, b \rangle = \frac{1}{2}\big(||a||^2 + ||b||^2 - ||a - b||^2\big)\]
we obtain
\[\langle y_k - x_{k+1},\, x_{k+1} - x \rangle = \frac{1}{2}\big(||y_k - x||^2 - ||y_k - x_{k+1}||^2 - ||x_{k+1} - x||^2\big)\]
Substituting back:
\[F(x_{k+1}) \le F(x) + \frac{1}{2\gamma}\big(||y_k - x||^2 - ||x_{k+1} - x||^2\big)\]
Now substitute \(x = (1 - \alpha_{k+1})x_k + \alpha_{k+1}x^\star\). By convexity,
\[F(x) \le (1 - \alpha_{k+1})F(x_k) + \alpha_{k+1}F(x^\star)\]
and from the interpolation and auxiliary steps, \(y_k - x = \alpha_{k+1}(z_k - x^\star)\) and \(x_{k+1} - x = \alpha_{k+1}(z_{k+1} - x^\star)\). Substituting everything back:
\[F(x_{k+1}) - F(x^\star) \le (1 - \alpha_{k+1})\big(F(x_k) - F(x^\star)\big) + \frac{\alpha_{k+1}^2}{2\gamma}\big(||z_k - x^\star||^2 - ||z_{k+1} - x^\star||^2\big)\]
Dividing both sides by \(\alpha_{k+1}^2\) and using \(\frac{1-\alpha_{k+1}}{\alpha_{k+1}^2} = \frac{1}{\alpha_k^2}\):
\[\frac{1}{\alpha_{k+1}^2}\big(F(x_{k+1}) - F(x^\star)\big) + \frac{1}{2\gamma}||z_{k+1} - x^\star||^2 \le \frac{1}{\alpha_k^2}\big(F(x_k) - F(x^\star)\big) + \frac{1}{2\gamma}||z_k - x^\star||^2\]
Summing (telescoping) gives
\[\frac{1}{\alpha_{K}^2}\big(F(x_{K}) - F(x^\star)\big) + \frac{1}{2\gamma}||z_{K} - x^\star||^2 \le \frac{1-\alpha_1}{\alpha_1^2}\big(F(x_0) - F(x^\star)\big) + \frac{1}{2\gamma}||z_0 - x^\star||^2\]
Setting \(\alpha_1 = 1\) (so the first term on the right vanishes) and \(z_0 = x_0\):
\[\frac{1}{\alpha_{K}^2}\big(F(x_{K}) - F(x^\star)\big) \le \frac{1}{2\gamma}||x_0 - x^\star||^2\]
On the other hand, an easy induction (base case \(\alpha_1 = 1\)) shows \(\alpha_k \le \frac{2}{k+1}\).
Finally we obtain:
\[F(x_K) - F(x^\star) \le \frac{\alpha_K^2}{2\gamma}||x_0 - x^\star||^2 \le \frac{2\,||x_0 - x^\star||^2}{\gamma (K+1)^2}\]
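The three-step scheme and the standard \(O(1/K^2)\) guarantee can be checked numerically; a sketch with \(R=0\) (so the prox is the identity) on an illustrative least-squares problem:

```python
import numpy as np

# NAG in the three-step form, with the alpha recursion
# (1 - alpha_{k+1}) / alpha_{k+1}^2 = 1 / alpha_k^2 and alpha_1 = 1.
rng = np.random.default_rng(5)
A = rng.standard_normal((30, 8))
Q = A.T @ A                                  # f(x) = 1/2 x^T Q x - c^T x
c = rng.standard_normal(8)
f = lambda x: 0.5 * x @ Q @ x - c @ x
grad = lambda x: Q @ x - c
gamma = 1.0 / np.linalg.eigvalsh(Q).max()    # gamma = 1/L
xstar = np.linalg.solve(Q, c)

def nag(steps):
    x = z = np.zeros(8)
    alpha = None
    for _ in range(steps):
        if alpha is None:
            alpha = 1.0                                    # alpha_1 = 1
        else:                                              # solve (1-a)/a^2 = 1/alpha^2
            alpha = (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2) / 2
        y = (1 - alpha) * x + alpha * z                    # interpolation step
        x_new = y - gamma * grad(y)                        # gradient step (prox = id)
        z = x + (x_new - x) / alpha                        # auxiliary step
        x = x_new
    return f(x) - f(xstar)

K = 100
gap = nag(K)
bound = 2 * np.linalg.norm(xstar) ** 2 / (gamma * (K + 1) ** 2)
print(gap, bound)   # the gap respects the O(1/K^2) guarantee
```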
Example (Spring, 2025, C2). Suppose \(f(x)\) is continuously differentiable, \(L\)-smooth, and \(\mu\)-strongly convex.
- Prove that \(g(x) = f(x) - \frac{\mu}{2}||x||^2\) is continuously differentiable, convex, and \((L-\mu)\)-smooth.
- Using the property \(\frac{1}{L-\mu}||\nabla g(x)-\nabla g(y)||^2 \le \langle \nabla g(x)-\nabla g(y), x-y \rangle\), prove the following inequality:
\[\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{\mu L}{\mu + L}||x - y||^2 + \frac{1}{\mu + L}||\nabla f(x) - \nabla f(y)||^2\]
Solution: \((1)\): First compute the gradient of \(g(x)\):
\[\nabla g(x) = \nabla f(x) - \mu x\]
Since \(f(x)\) is continuously differentiable and \(\mu x\) is continuously differentiable, so is \(g(x)\).
To show that \(g(x)\) is convex it suffices to show \(\langle \nabla g(x) - \nabla g(y), x - y \rangle \ge 0\), and indeed
\[\langle \nabla g(x) - \nabla g(y), x - y \rangle = \langle \nabla f(x) - \nabla f(y), x - y \rangle - \mu ||x - y||^2 \ge 0\]
where the last step uses the \(\mu\)-strong convexity of \(f(x)\).
Next we show that \(g(x)\) is \((L-\mu)\)-smooth. It suffices to show that \(\frac{L-\mu}{2}||x||^2-g(x)\) is convex, and
\[\frac{L-\mu}{2}||x||^2 - g(x) = \frac{L}{2}||x||^2 - f(x)\]
which is convex because \(f(x)\) is \(L\)-smooth.
\((2)\): Pure computation: substitute \(\nabla g(x) = \nabla f(x) - \mu x\) into the given inequality and rearrange.
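Both claims of this exercise can be spot-checked numerically on a quadratic instance (illustrative: for \(f(x)=\frac{1}{2}x^{T}Ax\), \(g\) has Hessian \(A-\mu I\) with eigenvalues in \([0, L-\mu]\)):

```python
import numpy as np

# Spot-check: for f(x) = 1/2 x^T A x, grad g(x) = (A - mu I) x, so g is
# convex and (L - mu)-smooth, hence co-coercive with constant L - mu.
rng = np.random.default_rng(6)
M = rng.standard_normal((6, 6))
A = M @ M.T + 0.5 * np.eye(6)
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()

grad_g = lambda x: A @ x - mu * x

x, y = rng.standard_normal(6), rng.standard_normal(6)
d = grad_g(x) - grad_g(y)
inner = d @ (x - y)

convex_ok = inner >= -1e-9                       # monotone gradient
cocoercive_ok = (d @ d) / (L - mu) <= inner + 1e-9
print(convex_ok, cocoercive_ok)
```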
Example (Spring, 2025, C3). For \(g(x,\xi)=\nabla f(x) + \xi\), where \(\xi\) is zero-mean noise with \(\mathbb{E}[||\xi||^2]\le \sigma^2\), the SGD update is \(x^{k+1} = \text{prox}_{\gamma R}(x^k - \gamma g(x^k, \xi^k))\). Using the non-expansiveness of the proximal operator and the optimality condition of \(\text{prox}_{\gamma R}\), derive a recursion for \(||x^{k+1}-x^*||^2\).
Solution: The minimizer \(x^*\) satisfies the first-order optimality condition of \(\min_x f(x) + R(x)\):
\[0 \in \nabla f(x^*) + \partial R(x^*)\]
which is equivalent to the fixed-point form \(x^* = \text{prox}_{\gamma R}(x^* - \gamma \nabla f(x^*))\). Hence, by non-expansiveness,
\[||x^{k+1}-x^*|| = ||\text{prox}_{\gamma R}(x^k - \gamma g(x^k,\xi^k)) - \text{prox}_{\gamma R}(x^* - \gamma \nabla f(x^*))|| \le ||x^k - x^* - \gamma\big(g(x^k,\xi^k) - \nabla f(x^*)\big)||\]
Example (Spring, 2025, C4). Explain what variance decomposition is. Show how to apply it to \(||x^{k+1}-x^*||^2\) and derive the inequality
\[\mathbb{E}_k\big[||x^{k+1}-x^*||^2\big] \le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\sigma^2\]
where \(\mathbb{E}_k[\cdot]\) denotes the conditional expectation given \(\xi^{k-1}, ..., \xi^0\) (i.e., given \(x^k\)).
Solution: Variance decomposition is a basic identity in probability: for any random vector \(Z\), its second moment decomposes as
\[\mathbb{E}[||Z||^2] = ||\mathbb{E}[Z]||^2 + \mathbb{E}[||Z - \mathbb{E}[Z]||^2]\]
In our case, after applying non-expansiveness (C3), set \(Z = x^k - x^* - \gamma(g(x^k,\xi^k) - \nabla f(x^*))\). Then \(\mathbb{E}_k[Z] = x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))\) and \(Z - \mathbb{E}_k[Z] = -\gamma\xi^k\), so
\[\mathbb{E}_k\big[||x^{k+1}-x^*||^2\big] \le \mathbb{E}_k[||Z||^2] = ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\,\mathbb{E}_k[||\xi^k||^2] \le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\sigma^2\]
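The decomposition itself is easy to confirm empirically (any distribution works; the Gaussian here is illustrative), since it holds exactly for empirical moments centered at the empirical mean:

```python
import numpy as np

# E||Z||^2 = ||E Z||^2 + E||Z - E Z||^2, checked on simulated 2-D Gaussians.
rng = np.random.default_rng(7)
Z = rng.normal(loc=[1.0, -2.0], scale=0.7, size=(100_000, 2))

second_moment = (Z ** 2).sum(axis=1).mean()
mean_part = (Z.mean(axis=0) ** 2).sum()
var_part = ((Z - Z.mean(axis=0)) ** 2).sum(axis=1).mean()

print(second_moment, mean_part + var_part)   # equal up to rounding
```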
Example (Spring, 2025, C5). Using the tower property of expectation, show that with the stepsize \(\gamma=\frac{2}{\mu+L}\), the recursion of C.4 simplifies to
\[\mathbb{E}\big[||x^{k+1}-x^*||^2\big] \le (1-\rho)\,\mathbb{E}\big[||x^{k}-x^*||^2\big] + \gamma^2\sigma^2\]
where \(\rho=\frac{4\mu L}{(\mu+L)^2}\).
Solution: By C4 we have \(\mathbb{E}_k\big[||x^{k+1}-x^*||^2\big] \le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\sigma^2\). Expanding the square and applying the mixed bound from C2,
\[||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 \le \Big(1 - \frac{2\gamma\mu L}{\mu+L}\Big)||x^k - x^*||^2 + \Big(\gamma^2 - \frac{2\gamma}{\mu+L}\Big)||\nabla f(x^k) - \nabla f(x^*)||^2\]
When \(\gamma = \frac{2}{\mu + L}\), the second coefficient vanishes and \(1 - \frac{2\gamma\mu L}{\mu+L} = 1 - \frac{4\mu L}{(\mu + L)^2} = 1 - \rho\). Taking the full expectation via the tower property \(\mathbb{E}[\cdot] = \mathbb{E}[\mathbb{E}_k[\cdot]]\) gives the result.
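The contraction factor can be verified exactly on a quadratic (illustrative eigenvalues): with \(\gamma = 2/(\mu+L)\), the gradient map \(x \mapsto x - \gamma\nabla f(x)\) has operator norm \((L-\mu)/(L+\mu)\), whose square is \(1-\rho\):

```python
import numpy as np

# For f(x) = 1/2 x^T A x with spectrum in [mu, L] and gamma = 2/(mu + L),
# the map I - gamma*A contracts by max |1 - gamma*lambda| = (L - mu)/(L + mu).
mu, L = 1.0, 9.0
A = np.diag([mu, 4.0, L])
gamma = 2.0 / (mu + L)
rho = 4 * mu * L / (mu + L) ** 2

contraction = np.linalg.norm(np.eye(3) - gamma * A, 2)   # operator norm
print(contraction ** 2, 1 - rho)                         # both 0.64 here
```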
Example (Spring, 2025, C6). Show that for any accuracy \(\epsilon>0\), with the stepsize \(\gamma = \frac{2}{\mu+L}\), once
\[k \ge \frac{L/\mu + 3}{4}\log\frac{1}{\epsilon}\]
we have
\[\mathbb{E}\big[||x^{k}-x^*||^2\big] \le \epsilon\,||x^{0}-x^*||^2 + \frac{\gamma\sigma^2}{\mu}\]
Solution: Unrolling the C5 recursion:
\[\mathbb{E}\big[||x^{k}-x^*||^2\big] \le (1-\rho)^k ||x^{0}-x^*||^2 + \gamma^2\sigma^2\sum_{i=0}^{k-1}(1-\rho)^i \le (1-\rho)^k ||x^{0}-x^*||^2 + \frac{\gamma^2\sigma^2}{\rho}\]
On the one hand, \(\frac{\gamma^2\sigma^2}{\rho} = \frac{\sigma^2}{\mu L} \le \frac{2\sigma^2}{\mu(\mu+L)} = \frac{\gamma\sigma^2}{\mu}\) (using \(\mu \le L\)). On the other hand, to get \((1-\rho)^k \le \epsilon\) it suffices that \(k \ge \frac{\log(1/\epsilon)}{\log(1/(1-\rho))}\); moreover \(\frac{\log(1/\epsilon)}{\log(1/(1-\rho))} \le \frac{1}{\rho}\log\frac{1}{\epsilon}\), and \(\frac{1}{\rho} = \frac{(\mu + L)^2}{4\mu L} = \frac{1}{4}\big(\frac{L}{\mu} + 2 + \frac{\mu}{L}\big) \le \frac{L/\mu + 3}{4}\).
(Autumn, 2025, C1) Consider the regularized finite-sum problem
\[\min_{x\in\mathbb{R}^{d}} F(x) := f(x) + R(x), \qquad f(x) := \frac{1}{n}\sum_{i=1}^{n} f_{i}(x)\]
where each \(f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}\) is convex and \(L_{i}\)-smooth, the aggregate \(f\) is \(L\)-smooth and \(\mu\)-strongly convex (\(\mu>0\)), and \(R:\mathbb{R}^{d}\rightarrow(-\infty,+\infty]\) is proper, closed, and convex. For a probability vector \(q=(q_{1},...,q_{n})\) with \(q_{i}>0\) and \(\sum_{i}q_{i}=1\), define a categorical random variable \(s\in\{1,...,n\}\) with \(\mathbb{P}(s=i)=q_{i}\). Fix a minibatch size \(\tau\in\{1,2,...\}\) and draw i.i.d. copies \(s_{1},...,s_{\tau}\) of \(s\). Define the multisampling gradient estimator
\[g(x) := \frac{1}{\tau}\sum_{t=1}^{\tau} \frac{1}{n q_{s_{t}}} \nabla f_{s_{t}}(x)\]
and the proximal SGD iteration
\[x^{k+1} = \text{prox}_{\gamma R}\big(x^{k} - \gamma\, g(x^{k})\big)\]
We denote the Bregman divergence of \(f\) by
\[D_{f}(x,y) := f(x) - f(y) - \langle \nabla f(y), x - y \rangle\]
[3 pts.] Basic Definitions: Give the definitions of \(L\)-smoothness and \(\mu\)-strong convexity.
(Autumn, 2025, C2) [2 pts.] Uniqueness of the Minimizer: Show that \(F\) admits a unique minimizer \(x^{*}\).
Solution: Since \(f\) is \(\mu\)-strongly convex and \(R\) is convex, \(F(x) = f(x) + R(x)\) is also \(\mu\)-strongly convex.
Existence: \(F\) is proper, closed, and coercive (strong convexity forces \(F(x)\rightarrow+\infty\) as \(||x||\rightarrow\infty\)), so a minimizer exists.
Uniqueness: suppose there were two distinct minimizers \(x_1^*\) and \(x_2^*\). By \(\mu\)-strong convexity of \(F\),
\[F\Big(\frac{x_1^*+x_2^*}{2}\Big) \le \frac{1}{2}F(x_1^*) + \frac{1}{2}F(x_2^*) - \frac{\mu}{8}||x_1^*-x_2^*||^2 < F^*\]
a contradiction.
(Autumn, 2025, C3) [5 pts.] Unbiasedness of the Gradient Estimator: Prove that \(g(x)\) is unbiased, i.e., \(\mathbb{E}[g(x)] = \nabla f(x)\).
Solution:
\[\mathbb{E}[g(x)] = \frac{1}{\tau}\sum_{t=1}^{\tau}\mathbb{E}\Big[\frac{1}{n q_{s_{t}}}\nabla f_{s_{t}}(x)\Big] = \mathbb{E}\Big[\frac{1}{n q_{s}}\nabla f_{s}(x)\Big] = \sum_{i=1}^{n} q_{i}\,\frac{1}{n q_{i}}\nabla f_{i}(x) = \frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x) = \nabla f(x)\]
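Unbiasedness holds for any positive \(q\); a quick empirical check on scalar quadratics \(f_i(x)=\frac{L_i}{2}(x-c_i)^2\) (all values illustrative):

```python
import numpy as np

# Multisampling estimator g(x) = (1/tau) sum_t grad f_{s_t}(x) / (n * q_{s_t})
# for f_i(x) = L_i/2 (x - c_i)^2, so grad f_i(x) = L_i (x - c_i).
rng = np.random.default_rng(8)
n, tau = 4, 3
Ls = np.array([1.0, 2.0, 5.0, 10.0])
cs = np.array([0.0, 1.0, -1.0, 0.5])
q = Ls / Ls.sum()                  # non-uniform probabilities (any q_i > 0 works)

x = 2.0
grads = Ls * (x - cs)              # per-component gradients at x
full_grad = grads.mean()           # grad f(x) = (1/n) sum_i grad f_i(x)

S = rng.choice(n, size=(200_000, tau), p=q)     # all minibatch indices at once
est = (grads[S] / (n * q[S])).mean(axis=1)      # one estimate per minibatch
emp = est.mean()
print(emp, full_grad)              # the empirical mean matches grad f(x)
```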
(Autumn, 2025, C4) [8 pts.] Expected Smoothness Bound: Assuming each \(f_{i}\) is \(L_{i}\)-smooth and convex, and \(f\) is \(L\)-smooth, show that for all \(x, y \in \mathbb{R}^{d}\),
\[\mathbb{E}[||g(x)-g(y)||^{2}] \le 2A''(\tau,q)D_{f}(x,y)\]
where the expected-smoothness constant \(A''\) (depending on \(\tau\) and \(q\)) is
\[A''(\tau,q) = \frac{1}{\tau}\max_{i}\frac{L_{i}}{nq_{i}} + \Big(1-\frac{1}{\tau}\Big)L\]
Hint: Expand the square, separate diagonal and cross terms using independence, and use smoothness to bound gradient differences via Bregman divergences.
Solution: We first prove that \(\| \nabla f(x)-\nabla f(y) \|^2 \le 2L D_f(x,y)\).
Let \(h(z)=f(z)-f(y)-\langle \nabla f(y), z-y \rangle\); then \(h(z)\ge 0\), \(h(y)=0\), and \(\nabla h(z) = \nabla f(z) - \nabla f(y)\). Since \(h(z)\) is \(L\)-smooth, for any \(z\),
\[0 \le h(z) \le h(x) + \langle \nabla h(x), z - x \rangle + \frac{L}{2}||z - x||^2\]
Taking \(z^*=x-\frac{1}{L}\nabla h(x)\) (which minimizes the right-hand side) gives
\[0 \le h(x) - \frac{1}{2L}||\nabla h(x)||^2\]
Therefore \(||\nabla f(x)-\nabla f(y)||^2 = ||\nabla h(x)||^2 \le 2L h(x) = 2L D_f(x,y)\).
Back to the problem. Let \(Z_t = \frac{1}{n q_{s_{t}}}\nabla f_{s_{t}}(x) - \frac{1}{n q_{s_{t}}}\nabla f_{s_{t}}(y)\), so \(g(x)-g(y)=\frac{1}{\tau}\sum_{t=1}^{\tau} Z_t\). Since the \(Z_t\) are i.i.d. with mean \(\nabla f(x) - \nabla f(y)\),
\[\mathbb{E}\Big[\Big|\Big|\frac{1}{\tau}\sum_{t=1}^{\tau} Z_t\Big|\Big|^2\Big] = \frac{1}{\tau}\,\mathbb{E}[||Z_1||^2] + \Big(1-\frac{1}{\tau}\Big)||\nabla f(x)-\nabla f(y)||^2\]
For the first term, applying the lemma to each \(f_i\) (which is \(L_i\)-smooth and convex),
\[\mathbb{E}[||Z_1||^2] = \sum_{i=1}^{n}\frac{1}{n^2 q_i}||\nabla f_i(x)-\nabla f_i(y)||^2 \le \sum_{i=1}^{n}\frac{1}{n^2 q_i}\cdot 2L_i D_{f_i}(x,y) \le 2\max_{i}\frac{L_{i}}{nq_{i}}\cdot\frac{1}{n}\sum_{i=1}^{n} D_{f_i}(x,y) = 2\max_{i}\frac{L_{i}}{nq_{i}}\, D_f(x,y)\]
For the second term, \(||\nabla f(x)-\nabla f(y)||^2 \le 2L D_f(x,y)\). Combining the two yields \(\mathbb{E}[||g(x)-g(y)||^2] \le 2A''(\tau,q) D_f(x,y)\).
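The bound can be checked with exact (closed-form) expectations in a scalar instance (illustrative values; the expectation over \(s\) is just a finite sum):

```python
import numpy as np

# Exact check of E||g(x)-g(y)||^2 <= 2 A'' D_f(x,y) for scalar quadratics
# f_i(x) = L_i/2 (x - c_i)^2 with uniform q; note grad f_i(x) - grad f_i(y)
# = L_i (x - y), independent of c_i.
n, tau = 4, 2
Ls = np.array([1.0, 2.0, 5.0, 10.0])
q = np.full(n, 0.25)

x, y = 2.0, -0.7
Z = (Ls * (x - y)) / (n * q)               # single-sample difference, outcome i
mean_Z = (q * Z).sum()                     # = grad f(x) - grad f(y)
second = (q * Z ** 2).sum()                # E ||Z_1||^2
lhs = second / tau + (1 - 1 / tau) * mean_Z ** 2   # E||g(x)-g(y)||^2, tau i.i.d. draws

L = Ls.mean()                              # f'' = mean(L_i) in the scalar case
D_f = 0.5 * L * (x - y) ** 2               # Bregman divergence of the quadratic f
A2 = (1 / tau) * (Ls / (n * q)).max() + (1 - 1 / tau) * L

print(lhs, 2 * A2 * D_f)                   # lhs stays below the bound
```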
(Autumn, 2025, C5) [4 pts.] Extremes, Monotonicity, and Interpolation:
(a) Evaluate \(A''(\tau,q)\) at \(\tau=1\) and as \(\tau\rightarrow+\infty\). Identify the limiting algorithms (SGD-NS vs. GD) and the corresponding constants.
(b) Show that \(A''(\tau,q)\) is nonincreasing in \(\tau\), and interpret how the minibatch size interpolates between SGD-NS and GD.
Hint: Use \(\frac{1}{n}\sum_{i=1}^{n}L_{i} \ge L\).
Solution: \((a)\) When \(\tau=1\), \(A''(1,q) = \max_{i}\frac{L_{i}}{nq_{i}} + (1-1)L = \max_{i}\frac{L_{i}}{nq_{i}}\), which corresponds to SGD-NS (SGD with non-uniform sampling). As \(\tau\rightarrow+\infty\), \(A''(\tau,q) \rightarrow L\), which corresponds to GD (full-batch gradient descent).
\((b)\): It suffices to show that \(\frac{1}{\tau}\max_{i}\frac{L_{i}}{nq_{i}} + (1-\frac{1}{\tau})L\) is nonincreasing in \(\tau\). Treating \(\tau\) as continuous and differentiating:
\[\frac{\partial A''}{\partial \tau} = \frac{1}{\tau^2}\Big(L - \max_{i}\frac{L_{i}}{nq_{i}}\Big)\]
and
\[\max_{i}\frac{L_{i}}{nq_{i}} \ge \sum_{i=1}^{n} q_i\,\frac{L_{i}}{nq_{i}} = \frac{1}{n}\sum_{i=1}^{n}L_{i} \ge L\]
so the derivative is nonpositive and \(A''(\tau,q)\) is nonincreasing in \(\tau\). As \(\tau\) grows from \(1\) to infinity, the algorithm transitions from SGD-NS to GD, and the expected smoothness constant decreases from \(\max_{i}\frac{L_{i}}{nq_{i}}\) to \(L\).
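A quick tabulation of \(A''(\tau,q)\) (illustrative \(L_i\), uniform \(q\), taking \(L=\frac{1}{n}\sum_i L_i\), the largest value consistent with the hint):

```python
import numpy as np

# A''(tau, q) = (1/tau) max_i L_i/(n q_i) + (1 - 1/tau) L, tabulated over tau.
Ls = np.array([1.0, 3.0, 8.0])
n = len(Ls)
q = np.full(n, 1.0 / n)
L = Ls.mean()                      # illustrative choice; any L <= (1/n) sum L_i works

def A2(tau):
    return (1 / tau) * (Ls / (n * q)).max() + (1 - 1 / tau) * L

vals = [A2(t) for t in (1, 2, 4, 8, 1_000_000)]
print(vals)                        # decreases from max_i L_i = 8 toward L = 4
```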
(Autumn, 2025, C6) [4 pts.] Design of Importance Sampling \(q\): For a fixed \(\tau\), minimize the first term of \(A''(\tau,q)\), i.e.,
\[\min_{q}\ \max_{i}\frac{L_{i}}{nq_{i}} \quad \text{s.t.}\ \sum_{i=1}^{n} q_i = 1,\ q_i > 0\]
Derive the optimal \(q^{*}\) and the attained value of the max. Compare with uniform sampling \(q_{i}^{uni}=\frac{1}{n}\).
Hint: Use \(\frac{1}{n}\sum_{i=1}^{n}L_{i} \le \max_{i}L_{i}\).
Solution: The minimum is clearly attained when all ratios are equal, \(\frac{L_i}{n q_i}=C\) (otherwise mass could be shifted toward the largest ratio to decrease it). From \(\sum_{i=1}^{n} q_i = 1\) we get \(q_i = \frac{L_i}{nC}\); substituting back,
\[1 = \sum_{i=1}^{n}\frac{L_i}{nC} \implies C = \frac{1}{n}\sum_{i=1}^{n} L_i\]
Therefore the optimal sampling distribution is \(q_i^* = \frac{L_i}{\sum_{j=1}^{n} L_j}\), attaining \(\max_{i}\frac{L_{i}}{nq_{i}^*} = \frac{1}{n}\sum_{i=1}^{n} L_i\). Under uniform sampling \(q_i^{uni} = \frac{1}{n}\), the corresponding constant is \(\max_{i}\frac{L_i}{n q_i^{uni}} = \max_{i} L_i\). Since \(\frac{1}{n}\sum_{i=1}^{n} L_i \le \max_{i} L_i\), importance sampling yields a smaller expected smoothness constant and can therefore accelerate convergence.
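The gap between the two constants can be large when the \(L_i\) are heterogeneous (illustrative values):

```python
import numpy as np

# Uniform sampling pays max_i L_i; sampling q_i proportional to L_i pays mean(L_i).
Ls = np.array([1.0, 1.0, 1.0, 97.0])
n = len(Ls)

uniform = (Ls / (n * np.full(n, 1.0 / n))).max()   # = max_i L_i
q_opt = Ls / Ls.sum()
optimal = (Ls / (n * q_opt)).max()                 # = (1/n) sum_i L_i

print(uniform, optimal)                            # 97 vs 25 here
```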
(Autumn, 2025, C7) [3 pts.] Variance at the Optimum and Minibatch Scaling: Let \(\xi(x) := g(x) - \nabla f(x)\). Show that
\[\text{Var}(\xi(x^{*})) = \frac{1}{\tau}\Big(\sum_{i=1}^{n}\frac{1}{n^{2}q_{i}}||\nabla f_{i}(x^{*})||^{2} - ||\nabla f(x^{*})||^{2}\Big)\]
i.e., the variance at the optimum scales as \(1/\tau\). Express your answer in terms of \(q\) and \(\{\nabla f_{i}(x^{*})\}_{i=1}^{n}\).
Solution: Note that \(\text{Var}(\xi(x^*))=\mathbb{E}[||\xi(x^*)||^2]-||\mathbb{E}[\xi(x^*)]||^2=\mathbb{E}[||\xi(x^*)||^2]\) because \(\xi(x^*)\) has zero mean. Moreover \(\xi(x^*) = g(x^*) - \nabla f(x^*) = \frac{1}{\tau}\sum_{t=1}^{\tau} \left( \frac{1}{n q_{s_t}} \nabla f_{s_t}(x^*) - \nabla f(x^*) \right)\) is an average of \(\tau\) i.i.d. zero-mean terms, so
\[\text{Var}(\xi(x^{*})) = \frac{1}{\tau}\,\mathbb{E}\Big[\Big|\Big|\frac{1}{n q_{s}}\nabla f_{s}(x^{*}) - \nabla f(x^{*})\Big|\Big|^{2}\Big] = \frac{1}{\tau}\Big(\sum_{i=1}^{n}\frac{1}{n^{2}q_{i}}||\nabla f_{i}(x^{*})||^{2} - ||\nabla f(x^{*})||^{2}\Big)\]
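The \(1/\tau\) scaling can be confirmed with exact (closed-form) variances; the \(\nabla f_i(x^*)\) values below are illustrative stand-ins:

```python
import numpy as np

# Exact variance of the estimator at a point, for scalar "gradients" g_i:
# Var = (1/tau) * ( sum_i g_i^2 / (n^2 q_i) - (mean g)^2 ).
gstar = np.array([2.0, -1.0, 0.5, -1.5])   # stand-ins for grad f_i(x*)
n = len(gstar)
q = np.full(n, 1.0 / n)
mean = gstar.mean()                        # grad f(x*)

def variance(tau):
    single = (gstar ** 2 / (n ** 2 * q)).sum() - mean ** 2
    return single / tau

print(variance(1), variance(4))            # the ratio is exactly 4
```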
(Autumn, 2025, C8) [2 pts.] AC inequality and computing (A, C): Consider the following classical results. AC inequality: There exist constants \(A \ge 0\) and \(C \ge 0\) such that for all \(k \ge 0\),
\[\mathbb{E}\big[||g(x^{k})-\nabla f(x^{*})||^{2}\big] \le 2A\,D_{f}(x^{k},x^{*}) + C\]
Implication from expected smoothness: If \(g(x)\) is an unbiased estimator of \(\nabla f(x)\) and, for all \(x, y\),
\[\mathbb{E}[||g(x)-g(y)||^{2}] \le 2A''D_{f}(x,y) + C''(y)\]
then for \(G(x,y) := \mathbb{E}||g(x)-\nabla f(y)||^{2}\) one has the AC inequality \(G(x,y) \le 2AD_{f}(x,y) + C\), where \(A=2A''\) and \(C=2(\text{Var}[g(y)] + C''(y))\). Specialize the above implication to obtain the AC constants \((A, C)\), and write an explicit formula for \(\text{Var}[g(x^{*})]\).
Solution: In C4 we proved \(\mathbb{E}[||g(x)-g(y)||^2] \le 2A'' D_f(x,y)\) with \(A'' = A''(\tau,q)\), i.e., the expected-smoothness bound holds with \(C''(y) = 0\). In C7 we computed \(\text{Var}[g(x^*)] = \frac{1}{\tau}\Big(\sum_{i=1}^n \frac{1}{n^2 q_i} ||\nabla f_i(x^*)||^2 - ||\nabla f(x^*)||^2\Big)\). Hence, by the implication above,
\[A = 2A''(\tau,q) = \frac{2}{\tau}\max_{i}\frac{L_{i}}{nq_{i}} + 2\Big(1-\frac{1}{\tau}\Big)L, \qquad C = 2\,\text{Var}[g(x^{*})]\]
(Autumn, 2025, C9) [2 pts.] Stepsize and convergence: Given the classical convergence theorem of SGD: Assume \(f\) is \(\mu\)-convex, \(g^{k}\) is unbiased, and the AC inequality holds with constants \((A,C)\). Then for any stepsize \(0 < \gamma \le \frac{1}{A}\) the iterates satisfy
\[\mathbb{E}||x^{k}-x^{*}||^{2} \le (1-\gamma\mu)^{k}||x^{0}-x^{*}||^{2} + \frac{\gamma C}{\mu}\]
Using the \((A, C)\) obtained in question 8, compute: the admissible stepsize range; the explicit convergence bound with the noise floor written in closed form.
Solution: The theorem requires \(\gamma \le \frac{1}{A}\), so the admissible range is
\[0 < \gamma \le \frac{1}{2A''(\tau,q)} = \frac{1}{\frac{2}{\tau}\max_{i}\frac{L_{i}}{nq_{i}} + 2\big(1-\frac{1}{\tau}\big)L}\]
Substituting the computed \(C\) into the theorem's bound \(\mathbb{E}||x^{k}-x^{*}||^{2} \le (1-\gamma\mu)^{k}||x^{0}-x^{*}||^{2} + \frac{\gamma C}{\mu}\):
\[\mathbb{E}||x^{k}-x^{*}||^{2} \le (1-\gamma\mu)^{k}||x^{0}-x^{*}||^{2} + \frac{2\gamma}{\mu\tau}\Big(\sum_{i=1}^{n}\frac{1}{n^{2}q_{i}}||\nabla f_{i}(x^{*})||^{2} - ||\nabla f(x^{*})||^{2}\Big)\]