Optimization Methods Speedrun


Smoothness and Strong Convexity


A function \(f(x)\) is \(L\)-smooth if its gradient is \(L\)-Lipschitz continuous: for any \(x, y\) in the domain,

\[ ||\nabla f(x) - \nabla f(y)|| \le L ||x - y|| \]

An equivalent second-order Taylor upper-bound form:

\[ f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2}||y-x||^2 \]

Another equivalent condition is that \(\frac{L}{2}||x||^2 - f(x)\) is convex.
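As a quick numerical sanity check (a sketch on a made-up quadratic, not part of the notes), both characterizations can be verified for \(f(x) = \frac{1}{2}x^\top A x\), whose sharpest smoothness constant is \(L = \lambda_{\max}(A)\):

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 x^T A x with symmetric PSD A has gradient Ax,
# and its best Lipschitz constant is L = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                      # symmetric positive semidefinite
L = np.linalg.eigvalsh(A)[-1]    # largest eigenvalue = smoothness constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # gradient Lipschitz condition
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # quadratic upper bound (descent lemma)
    assert f(y) <= f(x) + grad(x) @ (y - x) + L / 2 * np.linalg.norm(y - x) ** 2 + 1e-9
print("both characterizations hold")
```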


A function \(f(x)\) is \(\mu\)-strongly convex if it is not merely convex but at least as "curved" as a quadratic: for any \(x, y\),

\[ f(y) \ge f(x) + \langle \nabla f(x), y-x \rangle + \frac{\mu}{2}||y-x||^2 \]

or, in terms of gradient monotonicity,

\[ \langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \mu ||x - y||^2 \]


The Bregman divergence is a generalized "distance" between two points \(x, y\):

\[ D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle \]

\(\mu\)-strong convexity (\(\mu\)-Strong Convexity): provides a lower bound.

\[ \frac{\mu}{2} ||x - y||^2 \le D_f(x, y) \le \frac{1}{2\mu} ||\nabla f(x) - \nabla f(y)||^2 \]

\(L\)-smoothness (\(L\)-Smoothness): provides an upper bound.

\[ D_f(x, y) \le \frac{L}{2} ||x - y||^2 \]

Convexity together with \(L\)-smoothness (used constantly!):

\[ \frac{1}{2L} ||\nabla f(x) - \nabla f(y)||^2 \le D_f(x, y) \]

If \(f\) is both \(\mu\)-strongly convex and \(L\)-smooth, the gradient difference is controlled not only by the distance but also through a mixed bound:

\[ \mu ||x - y||^2 + \frac{1}{L} ||\nabla f(x) - \nabla f(y)||^2 \le \left(1 + \frac{\mu}{L}\right) \langle \nabla f(x) - \nabla f(y), x - y \rangle \]
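All of these Bregman bounds can be checked numerically on a strongly convex quadratic, where everything is available in closed form (an illustrative sketch; the matrix and points are synthetic):

```python
import numpy as np

# Check of the Bregman sandwich bounds on f(x) = 0.5 x^T A x with A positive
# definite, where mu = lambda_min(A) and L = lambda_max(A).
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 0.5 * np.eye(4)          # make A positive definite
mu, L = np.linalg.eigvalsh(A)[0], np.linalg.eigvalsh(A)[-1]

def bregman(x, y):
    # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y> = 0.5 (x-y)^T A (x-y)
    d = x - y
    return 0.5 * d @ A @ d

for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    D = bregman(x, y)
    g = np.linalg.norm(A @ x - A @ y)   # ||grad f(x) - grad f(y)||
    r = np.linalg.norm(x - y)
    assert mu / 2 * r**2 - 1e-9 <= D <= 1 / (2 * mu) * g**2 + 1e-9  # strong convexity
    assert g**2 / (2 * L) - 1e-9 <= D <= L / 2 * r**2 + 1e-9        # smoothness
print("Bregman bounds verified")
```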

Stochastic Gradient Descent (SGD) and Variance Decomposition


When the dataset is huge, computing the full gradient \(\nabla f(x)\) is too slow, so we use a small batch of data to form a "noisy" stochastic gradient \(g(x, \xi)\). This gradient must satisfy:

  • Unbiasedness: \(\mathbb{E}[g^k \mid x^k] = \nabla f(x^k)\)

  • Bounded variance: the noise must be controlled: \(\mathbb{E}[||\xi||^2] \le \sigma^2\)


Expanding the squared one-step distance and taking the conditional expectation:

\[ \mathbb{E}[||x^{k+1} - x^*||^2 \mid x^k] \le ||x^k - x^* - \gamma \nabla f(x^k)||^2 + \gamma^2 \mathbb{E}[||\xi^k||^2 \mid x^k] \]

The AC Assumption

There exist constants \(A \ge 0\) and \(C \ge 0\) such that for all \(k \ge 0\):

\[ \mathbb{E}[||g^k - \nabla f(x^*)||^2 \mid x^k] \le 2A D_f(x^k, x^*) + C \]

The Convergence Theorem of SGD

If the stepsize satisfies \(0 < \gamma \le 1/A\), then:

\[ \mathbb{E}||x^k - x^*||^2 \le (1 - \gamma\mu)^k ||x^0 - x^*||^2 + \frac{\gamma C}{\mu} \]
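A tiny simulation illustrates the theorem's two regimes: geometric contraction followed by a noise floor of order \(\gamma C/\mu\). This is a sketch with made-up values of \(\mu, L, \sigma\); the stochastic gradient is the exact gradient plus Gaussian noise, so \(A = L\) and \(C = \mathbb{E}||\xi||^2 = \sigma^2 d\):

```python
import numpy as np

# Synthetic illustration: f is a diagonal quadratic with Hessian spectrum in
# [mu, L]; the stochastic gradient is the true gradient plus Gaussian noise.
rng = np.random.default_rng(2)
mu, L, sigma, d = 0.5, 4.0, 0.1, 10
hess_diag = np.linspace(mu, L, d)        # eigenvalues of the Hessian
x_star = rng.standard_normal(d)          # the minimizer
gamma = 1.0 / L                          # admissible stepsize (gamma <= 1/A)

x = x_star + 5.0                         # start far from the optimum
for k in range(2000):
    noisy_grad = hess_diag * (x - x_star) + sigma * rng.standard_normal(d)
    x = x - gamma * noisy_grad

err = np.linalg.norm(x - x_star) ** 2
print(f"final squared error {err:.4f}, noise floor scale {gamma * sigma**2 * d / mu:.4f}")
```

With a constant stepsize the error does not go to zero: it hovers around the noise floor, exactly as the \(\frac{\gamma C}{\mu}\) term predicts.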

Proximal Operators and Non-expansiveness


The proximal operator is an excellent tool for handling non-smooth convex functions. The standard definition: for a convex function \(R(x)\) and a parameter \(\gamma > 0\), the proximal operator at a point \(v\) is

\[ \text{prox}_{\gamma R}(v) = \arg\min_x \left( R(x) + \frac{1}{2\gamma} ||x - v||^2 \right) \]

The proximal operator \(\text{prox}_{\gamma R}\) is non-expansive: for any \(u, v\), \(||\text{prox}_{\gamma R}(u) - \text{prox}_{\gamma R}(v)|| \le ||u - v||\).

A key property: if \(z=\text{prox}_{\gamma R}(v)\), then

\[ 0\in \partial R(z) + \frac{1}{\gamma}(z - v) \Longleftrightarrow v\in z + \gamma \partial R(z) \]

Conversely, if \(v\in z + \gamma \partial R(z)\), then \(z=\text{prox}_{\gamma R}(v)\).
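For concreteness, \(R(x) = \lambda||x||_1\) has the closed-form prox \(\operatorname{prox}_{\gamma R}(v)_i = \operatorname{sign}(v_i)\max(|v_i| - \gamma\lambda, 0)\) (soft-thresholding), and non-expansiveness is easy to check numerically (the values of \(\lambda\) and \(\gamma\) below are illustrative):

```python
import numpy as np

# prox of R(x) = lam * ||x||_1 is componentwise soft-thresholding.
rng = np.random.default_rng(3)
lam, gamma = 0.3, 0.5

def prox_l1(v):
    return np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)

for _ in range(100):
    u, v = rng.standard_normal(8), rng.standard_normal(8)
    # non-expansiveness: ||prox(u) - prox(v)|| <= ||u - v||
    assert np.linalg.norm(prox_l1(u) - prox_l1(v)) <= np.linalg.norm(u - v) + 1e-12
print("non-expansiveness holds for soft-thresholding")
```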


Proximal gradient descent solves the composite optimization problem \(\min_{x \in \mathbb{R}^d} F(x) = f(x) + R(x)\).

Update rule: \(x^{k+1} = \text{prox}_{\gamma R}(x^k - \gamma \nabla f(x^k))\)

Using the standard consequences of \(L\)-smoothness and \(\mu\)-strong convexity:

\[ ||x^{k+1} - x^*||^2 \le (1 - \gamma\mu)||x^k - x^*||^2 - 2\gamma(1 - \gamma L)D_f(x^k, x^*) \]

By the result of Spring 2025, C2:

\[ ||x^{k+1} - x^*||^2 \le \left(1 - \frac{2\gamma\mu L}{L + \mu}\right)||x^k - x^*||^2 - \gamma\left(\frac{2}{L + \mu} - \gamma\right)||\nabla f(x^k) - \nabla f(x^*)||^2 \]
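The update rule above can be put into code directly. Below is a minimal proximal-gradient (ISTA-style) sketch for the lasso instance \(F(x) = \frac{1}{2}||Ax-b||^2 + \lambda||x||_1\), with synthetic data and \(\gamma = 1/L\) where \(L = \lambda_{\max}(A^\top A)\):

```python
import numpy as np

# Proximal gradient descent on F(x) = 0.5||Ax - b||^2 + lam*||x||_1 (synthetic data).
rng = np.random.default_rng(4)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
L = np.linalg.eigvalsh(A.T @ A)[-1]      # smoothness constant of the quadratic part
gamma = 1.0 / L

def F(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)

x = np.zeros(5)
for k in range(500):
    v = x - gamma * (A.T @ (A @ x - b))                        # gradient step on f
    x = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # prox step on R (soft-threshold)
print("F(x_final) =", F(x))
```

The objective decreases monotonically from \(F(0)\); the prox step is exactly the soft-thresholding operator from the previous section.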

NAG


Goal: solve \(\min_{x} F(x) = f(x) + R(x)\), where \(f\) is convex and \(L\)-smooth and \(R\) is closed and convex.

NAG's defining feature is that it anticipates the future position: it first looks one step ahead along the momentum direction and evaluates the gradient at the extrapolated point \(x_k + \beta(x_k - x_{k-1})\).

The iteration is:

\[ v_{k+1} = \beta(x_k - x_{k-1}) - \gamma \nabla f(x_k + \beta(x_k - x_{k-1})) \]
\[ x_{k+1} = \text{prox}_{\gamma R}(x_k + v_{k+1}) \]

NAG can be written equivalently as the following three-step update:

Interpolation step: \(y_k = (1 - \alpha_{k+1})x_k + \alpha_{k+1}z_k\)

Gradient step: \(x_{k+1} = \text{prox}_{\gamma R}(y_k - \gamma \nabla f(y_k))\)

Auxiliary step: \(z_{k+1} = x_k + \frac{1}{\alpha_{k+1}}(x_{k+1} - x_k)\)

Here \(\alpha_{k}\) is related to the momentum coefficient through \(\beta_k = \frac{\alpha_{k+1}(1-\alpha_k)}{\alpha_k}\).
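The three-step scheme can be sketched directly. This is an illustrative run with \(R = 0\) (so the prox is the identity) on a synthetic quadratic, generating \(\alpha_{k+1}\) from the recursion \(\frac{1-\alpha_{k+1}}{\alpha_{k+1}^2} = \frac{1}{\alpha_k^2}\) with \(\alpha_1 = 1\), which is used in the proof below:

```python
import numpy as np

# Three-step accelerated scheme with R = 0 on f(x) = 0.5 x^T A x (synthetic A).
rng = np.random.default_rng(5)
M = rng.standard_normal((30, 10))
A = M.T @ M                          # positive definite Hessian, minimizer x* = 0
L = np.linalg.eigvalsh(A)[-1]
gamma = 1.0 / L

x = rng.standard_normal(10)
z = x.copy()
alpha = 1.0                          # alpha_1 = 1
for k in range(300):
    y = (1 - alpha) * x + alpha * z              # interpolation step
    x_new = y - gamma * (A @ y)                  # gradient step (prox = identity)
    z = x + (1 / alpha) * (x_new - x)            # auxiliary step
    x = x_new
    # next alpha solves (1 - a')/a'^2 = 1/alpha^2, i.e. a'^2 + alpha^2 a' - alpha^2 = 0
    alpha = alpha * (np.sqrt(alpha ** 2 + 4) - alpha) / 2
print("f(x_K) =", 0.5 * x @ A @ x)   # near the optimal value 0
```

After \(K = 300\) iterations the theory below guarantees \(f(x_K) \le \frac{2}{\gamma(K+1)^2}||x_0||^2\), and the observed value is typically far smaller.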


\(x_{k+1}\) 的 Proximal 定义,我们有 \(y_k-\gamma \nabla f(y_k) \in x_{k+1} + \gamma \partial R(x_{k+1})\)。结合 \(R\) 的凸性,我们可以得到以下不等式:

\[ \gamma R(x) \ge \gamma R(x_{k+1}) + \langle y_k - \gamma \nabla f(y_k) - x_{k+1}, x - x_{k+1} \rangle \]

\[ R(x_{k+1}) \le R(x) - \frac{1}{\gamma} \langle y_k - \gamma \nabla f(y_k) - x_{k+1}, x - x_{k+1} \rangle \]

Meanwhile, since \(f\) is \(L\)-smooth, choosing the stepsize \(\gamma \le 1/L\) gives:

\[ f(x_{k+1}) \le f(y_k) + \langle \nabla f(y_k), x_{k+1} - y_k \rangle + \frac{1}{2\gamma}||x_{k+1} - y_k||^2 \]

Adding the two inequalities:

\[ \begin{aligned} F(x_{k+1}) \le& f(y_k) + \left(\langle \nabla f(y_k), x_{k+1} - y_k \rangle+ \langle \nabla f(y_k), x - x_{k+1} \rangle\right) + \frac{1}{2\gamma} ||x_{k+1} - y_k||^2 \\ & + R(x) - \frac{1}{\gamma} \langle y_k - x_{k+1}, x - x_{k+1} \rangle \\ =& \left[ f(y_k) + \langle \nabla f(y_k), x - y_k \rangle \right] + R(x) + \frac{1}{2\gamma} ||x_{k+1} - y_k||^2 - \frac{1}{\gamma} \langle y_k - x_{k+1}, x - x_{k+1} \rangle\\ \le & f(x)+ R(x) + \frac{1}{2\gamma} ||x_{k+1} - y_k||^2 - \frac{1}{\gamma} \langle y_k - x_{k+1}, x - x_{k+1} \rangle\\ =& F(x) + \frac{1}{2\gamma} ||x_{k+1} - y_k||^2 - \frac{1}{\gamma} \langle y_k - x_{k+1}, x - x_{k+1} \rangle \end{aligned} \]

Applying the following identity to the vectors \(a = y_k - x_{k+1}\) and \(b = x - x_{k+1}\):

\[ 2\langle a, b \rangle = ||a||^2 + ||b||^2 - ||a - b||^2 \]

we obtain:

\[ -\frac{1}{\gamma} \langle y_k - x_{k+1}, x - x_{k+1} \rangle = -\frac{1}{2\gamma} \left( ||y_k - x_{k+1}||^2 + ||x - x_{k+1}||^2 - ||y_k - x||^2 \right) \]

Substituting back:

\[ F(x_{k+1}) \le F(x) + \frac{1}{2\gamma} ||y_k - x||^2 - \frac{1}{2\gamma} ||x_{k+1} - x||^2 \]

Now substitute \(x = (1 - \alpha_{k+1})x_k + \alpha_{k+1}x^\star\). By convexity we have

\[ F(x) = F((1 - \alpha_{k+1})x_k + \alpha_{k+1}x^\star) \le (1 - \alpha_{k+1})F(x_k) + \alpha_{k+1}F(x^\star) \]
\[ y_k - x = \alpha_{k+1}(z_k - x^\star) \]
\[ x_{k+1} - x = \alpha_{k+1}(z_{k+1} - x^\star) \]

Substituting all of these back:

\[ F(x_{k+1}) - F(x^\star) \le (1 - \alpha_{k+1})(F(x_k) - F(x^\star)) + \frac{\alpha_{k+1}^2}{2\gamma} (||z_k - x^\star||^2 - ||z_{k+1} - x^\star||^2) \]

Dividing both sides by \(\alpha_{k+1}^2\) and imposing \(\frac{1-\alpha_{k+1}}{\alpha_{k+1}^2} = \frac{1}{\alpha_k^2}\), we get:

\[ \frac{F(x_{k+1}) - F(x^\star)}{\alpha_{k+1}^2} \le \frac{F(x_k) - F(x^\star)}{\alpha_k^2} + \frac{1}{2\gamma} ||z_k - x^\star||^2 - \frac{1}{2\gamma} ||z_{k+1} - x^\star||^2 \]

Summing (telescoping) this inequality yields

\[ \frac{F(x_K) - F^\star}{\alpha_K^2} \le \frac{1 - \alpha_1}{\alpha_1^2}(F(x_0) - F^\star) + \frac{1}{2\gamma} ||z_0 - x^\star||^2 - \frac{1}{2\gamma} ||z_K - x^\star||^2 \]

Setting \(\alpha_1 = 1\) (and \(z_0 = x_0\)), we get:

\[ F(x_K) - F(x^\star) \le \frac{\alpha_K^2}{2\gamma} ||x_0 - x^\star||^2 \]

On the other hand, an easy induction shows

\[ \alpha_K \le \frac{2}{K+1} \]

Putting these together, we finally obtain:

\[ F(x_K) - F(x^\star) \le \frac{2}{\gamma(K+1)^2} ||x_0 - x^\star||^2 \]

Exercise (Spring, 2025, C2). Let \(f(x)\) be a continuously differentiable, \(L\)-smooth, \(\mu\)-strongly convex function.

  • Show that \(g(x) = f(x) - \frac{\mu}{2}||x||^2\) is continuously differentiable, convex, and \((L-\mu)\)-smooth.

  • Using the property \(\frac{1}{L-\mu}||\nabla g(x)-\nabla g(y)||^2 \le \langle \nabla g(x)-\nabla g(y), x-y \rangle\), prove that the following inequality holds:

\[ \mu||x-y||^2 + \frac{1}{L}||\nabla f(x)-\nabla f(y)||^2 \le (1+\frac{\mu}{L})\langle \nabla f(x)-\nabla f(y), x-y \rangle \]

Solution. (1): First compute the gradient of \(g(x)\):

\[ \nabla g(x) = \nabla f(x) - \mu x \]

Since \(f(x)\) is continuously differentiable and so is \(\mu x\), \(g(x)\) is continuously differentiable.

To show that \(g(x)\) is convex it suffices to check \(\langle \nabla g(x) - \nabla g(y), x - y \rangle \ge 0\). Indeed,

\[ \begin{aligned} \langle \nabla g(x) - \nabla g(y), x - y \rangle &= \langle \nabla f(x) - \nabla f(y) - \mu (x - y), x - y \rangle\\ &= \langle \nabla f(x) - \nabla f(y), x - y \rangle - \mu ||x - y||^2\\ &\ge 0 \end{aligned} \]

where the last step uses the \(\mu\)-strong convexity of \(f(x)\).

Next we show that \(g(x)\) is \((L-\mu)\)-smooth. It suffices to show that \(\frac{L-\mu}{2}||x||^2-g(x)\) is convex. Indeed,

\[ \begin{aligned} \frac{L-\mu}{2}||x||^2 - g(x) &= \frac{L-\mu}{2}||x||^2 - f(x) + \frac{\mu}{2}||x||^2\\ &= \frac{L}{2}||x||^2 - f(x) \end{aligned} \]

Since \(f(x)\) is \(L\)-smooth, \(\frac{L}{2}||x||^2 - f(x)\) is convex.

(2): Direct computation. Write \(d = \nabla f(x)-\nabla f(y)\) and \(s = x-y\), so that \(\nabla g(x)-\nabla g(y) = d - \mu s\). Substituting into the given property, multiplying through by \(L-\mu\), and expanding,

\[ ||d||^2 - 2\mu\langle d, s\rangle + \mu^2||s||^2 \le (L-\mu)\langle d, s\rangle - \mu(L-\mu)||s||^2 \]

Rearranging gives \(||d||^2 + \mu L||s||^2 \le (L+\mu)\langle d, s\rangle\); dividing by \(L\) yields the claimed inequality.


Exercise (Spring, 2025, C3). Let \(g(x,\xi)=\nabla f(x) + \xi\), where \(\xi\) is zero-mean noise with \(\mathbb{E}[||\xi||^2]\le \sigma^2\), and let SGD update via \(x^{k+1} = \text{prox}_{\gamma R}(x^k - \gamma g(x^k, \xi^k))\). Using the non-expansiveness of the proximal operator and the optimality condition of \(\text{prox}_{\gamma R}\), derive the recursion for \(||x^{k+1}-x^*||^2\):

\[ ||x^{k+1}-x^*||^2 \le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*) + \xi^k)||^2 \]

Solution. The minimizer \(x^*\) satisfies the first-order optimality condition of \(\min_x f(x) + R(x)\):

\[ 0 \in \nabla f(x^*) + \partial R(x^*)\implies x^* - \gamma \nabla f(x^*) \in x^* + \gamma \partial R(x^*) \]

This is equivalent to the fixed-point form \(x^* = \text{prox}_{\gamma R}(x^* - \gamma \nabla f(x^*))\). Hence:

\[ \begin{aligned} ||x^{k+1} - x^*|| &= ||\text{prox}_{\gamma R}(x^k - \gamma g(x^k, \xi^k)) - \text{prox}_{\gamma R}(x^* - \gamma \nabla f(x^*))||\\ &\le ||(x^k - \gamma g(x^k, \xi^k)) - (x^* - \gamma \nabla f(x^*))||\\ &= ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*) + \xi^k)||\\ \end{aligned} \]

Exercise (Spring, 2025, C4). What is variance decomposition? Explain how to apply it to \(||x^{k+1}-x^*||^2\) and derive the following inequality:

\[ \mathbb{E}_k[||x^{k+1}-x^*||^2] \le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\sigma^2 \]

where \(\mathbb{E}_k[\cdot]\) denotes the conditional expectation given \(\xi^{k-1}, ..., \xi^0\) (equivalently, given \(x^k\)).


Solution. Variance decomposition is a basic identity from probability theory: for any random vector \(Z\), the second moment decomposes as

\[ \mathbb{E}[||Z||^2] = ||\mathbb{E}[Z]||^2 + \mathbb{E}[||Z - \mathbb{E}[Z]||^2] \]

In our setting, apply the identity with \(Z = x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*) + \xi^k)\), noting \(\mathbb{E}_k[Z] = x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))\), and combine it with the bound from C3:

\[ \begin{aligned} \mathbb{E}_k[||x^{k+1}-x^*||^2] &\le \mathbb{E}_k[||Z||^2] = ||\mathbb{E}_k[Z]||^2 + \mathbb{E}_k[||Z - \mathbb{E}_k[Z]||^2]\\ &= ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \mathbb{E}_k[||\gamma \xi^k||^2]\\ &= ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2 \mathbb{E}_k[||\xi^k||^2]\\ &\le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\sigma^2 \end{aligned} \]

Exercise (Spring, 2025, C5). Using the tower property of expectation, show that with stepsize \(\gamma=\frac{2}{\mu+L}\), the recursion of C4 simplifies to:

\[ \mathbb{E}[||x^{k+1}-x^*||^2] \le (1-\rho)\mathbb{E}[||x^k-x^*||^2] + \gamma^2\sigma^2 \]

where \(\rho=\frac{4\mu L}{(\mu+L)^2}\).


解答:\(C4\) 我们有 \(LHS\le ||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2 + \gamma^2\sigma^2\)。而

\[ \begin{aligned} &||x^k - x^* - \gamma(\nabla f(x^k) - \nabla f(x^*))||^2\\ =& ||x^k - x^*||^2 - 2\gamma \langle \nabla f(x^k) - \nabla f(x^*), x^k - x^* \rangle + \gamma^2 ||\nabla f(x^k) - \nabla f(x^*)||^2\\ \le& ||x^k - x^*||^2 - 2\gamma \langle \nabla f(x^k) - \nabla f(x^*), x^k - x^* \rangle + \gamma^2 L \langle \nabla f(x^k) - \nabla f(x^*), x^k - x^* \rangle\\ =& ||x^k - x^*||^2 - \gamma(2 - \gamma L) \langle \nabla f(x^k) - \nabla f(x^*), x^k - x^* \rangle\\ \le& ||x^k - x^*||^2 - \gamma(2 - \gamma L) \mu ||x^k - x^*||^2\\ =& (1 - \gamma(2 - \gamma L) \mu) ||x^k - x^*||^2 \end{aligned} \]

\(\gamma = \frac{2}{\mu + L}\) 时,\(1 - \gamma(2 - \gamma L) \mu = 1 - \frac{4\mu L}{(\mu + L)^2} = 1 - \rho\),代入根据全期望公式(塔式性质)\(\mathbb{E}[\cdot] = \mathbb{E}[\mathbb{E}_k[\cdot]]\) 即得。


Exercise (Spring, 2025, C6). Show that for any accuracy \(\epsilon>0\), there is a stepsize such that whenever

\[ k \ge \frac{L/\mu+3}{4}\log\frac{1}{\epsilon} \]

the iterates satisfy:

\[ \mathbb{E}[||x^k-x^*||^2] \le \epsilon||x^0-x^*||^2 + \frac{\gamma\sigma^2}{\mu} \]

解答:\(C5\) 递推可得:

\[ \begin{aligned} \mathbb{E}[||x^k-x^*||^2] &\le (1-\rho)^k \mathbb{E}[||x^0-x^*||^2] + \frac{\gamma^2\sigma^2}{\rho}\\ &\le \epsilon||x^0-x^*||^2 + \frac{\gamma^2\sigma^2}{\rho} \end{aligned} \]

On one hand, \(\frac{\gamma^2\sigma^2}{\rho} = \frac{4}{(\mu+L)^2}\cdot\frac{(\mu+L)^2}{4\mu L}\sigma^2 = \frac{\sigma^2}{\mu L} \le \frac{2\sigma^2}{\mu(\mu+L)} = \frac{\gamma\sigma^2}{\mu}\), using \(\mu \le L\). On the other hand, for \((1-\rho)^k \le \epsilon\) it suffices that \(k \ge \frac{\log(1/\epsilon)}{\log(1/(1-\rho))}\); since \(\frac{\log(1/\epsilon)}{\log(1/(1-\rho))} \le \frac{1}{\rho}\log\frac{1}{\epsilon}\) and \(\frac{1}{\rho} = \frac{(\mu + L)^2}{4\mu L} = \frac{1}{4}\left(\frac{L}{\mu} + 2 + \frac{\mu}{L}\right) \le \frac{L/\mu + 3}{4}\), the stated \(k\) suffices.


(Autumn, 2025, C1) Consider the regularized finite-sum problem

\[ \min_{x\in\mathbb{R}^{d}}F(x) := f(x) + R(x) \]
\[ f(x) := \frac{1}{n}\sum_{i=1}^{n}f_{i}(x) \]

where each \(f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}\) is convex and \(L_{i}\)-smooth, the aggregate \(f\) is \(L\)-smooth and \(\mu\)-strongly convex (\(\mu>0\)), and \(R:\mathbb{R}^{d}\rightarrow(-\infty,+\infty]\) is proper, closed, and convex. For a probability vector \(q=(q_{1},...,q_{n})\) with \(q_{i}>0\) and \(\sum_{i}q_{i}=1\), define a categorical random variable \(s\in\{1,...,n\}\) with \(\mathbb{P}(s=i)=q_{i}\). Fix a minibatch size \(\tau\in\{1,2,...\}\) and draw i.i.d. copies \(s_{1},...,s_{\tau}\) of \(s\). Define the multisampling gradient estimator

\[ g(x) := \frac{1}{\tau}\sum_{t=1}^{\tau}\frac{1}{n q_{s_{t}}}\nabla f_{s_{t}}(x) \]

and the proximal SGD iteration

\[ x^{k+1} = \text{prox}_{\gamma R}(x^{k} - \gamma g^{k}) \quad \text{, } g^{k} := g(x^{k}) \]

We denote the Bregman divergence of \(f\) by

\[ D_{f}(x,y) := f(x) - f(y) - \langle\nabla f(y), x - y\rangle \]

[3 pts.] Basic Definitions: Give the definitions of \(L\)-smoothness and \(\mu\)-strong convexity.


(Autumn, 2025, C2) [2 pts.] Uniqueness of the Minimizer: Show that \(F\) admits a unique minimizer \(x^{*}\).


Solution. Since \(f\) is \(\mu\)-strongly convex and \(R\) is convex, \(F(x) = f(x) + R(x)\) is also \(\mu\)-strongly convex.

Existence: \(F\) is proper and closed, and strong convexity makes it coercive (\(F(x)\to\infty\) as \(||x||\to\infty\)), so a minimizer exists.

Uniqueness: suppose there were two distinct minimizers \(x_1^*\) and \(x_2^*\). Then:

\[ F\left(\frac{x_1^* + x_2^*}{2}\right) \le \frac{F(x_1^*) + F(x_2^*)}{2} - \frac{\mu}{8}||x_1^* - x_2^*||^2 < F(x_1^*) = F(x_2^*) \]

a contradiction.


(Autumn, 2025, C3) [5 pts.] Unbiasedness of the Gradient Estimator: Prove that \(g(x)\) is unbiased, i.e., \(\mathbb{E}[g(x)] = \nabla f(x)\).


Solution:

\[ \begin{aligned} \mathbb{E}[g(x)] &= \mathbb{E}\left[\frac{1}{\tau}\sum_{t=1}^{\tau}\frac{1}{n q_{s_{t}}}\nabla f_{s_{t}}(x)\right]\\ &= \frac{1}{\tau}\sum_{t=1}^{\tau}\mathbb{E}_{s_t}\left[\frac{1}{n q_{s_{t}}}\nabla f_{s_{t}}(x)\right]\\ &= \frac{1}{\tau}\sum_{t=1}^{\tau}\sum_{i=1}^{n} q_i \cdot \frac{1}{n q_i} \nabla f_i(x)\\ &= \frac{1}{\tau}\sum_{t=1}^{\tau}\frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x)\\ &= \nabla f(x) \end{aligned} \]
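A quick Monte Carlo check of this computation (synthetic setup: the role of each \(\nabla f_i(x)\) is played by a fixed random vector \(G_i\), and \(q\) is an arbitrary probability vector):

```python
import numpy as np

# Monte Carlo check that g(x) = (1/tau) sum_t grad f_{s_t}(x) / (n q_{s_t})
# is unbiased for grad f(x) = (1/n) sum_i grad f_i(x), with P(s = i) = q_i.
rng = np.random.default_rng(6)
n, d, tau = 8, 3, 4
G = rng.standard_normal((n, d))                 # G[i] stands in for grad f_i(x)
full_grad = G.mean(axis=0)                      # grad f(x)
q = rng.random(n)
q /= q.sum()                                    # probability vector with q_i > 0

S = rng.choice(n, size=(200_000, tau), p=q)     # i.i.d. minibatch index draws
estimates = (G[S] / (n * q[S][..., None])).mean(axis=1)
print(np.abs(estimates.mean(axis=0) - full_grad).max())  # small Monte Carlo error
```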

(Autumn, 2025, C4) [8 pts.] Expected Smoothness Bound: Assuming each \(f_{i}\) is \(L_{i}\)-smooth and convex, and \(f\) is \(L\)-smooth, show that for all \(x, y \in \mathbb{R}^{d}\),

\[ \mathbb{E}[||g(x)-g(y)||^{2}] \le 2A''(\tau,q)D_{f}(x,y) \]

where the expected-smoothness constant \(A''\) (depending on \(\tau\) and \(q\)) is

\[ A''(\tau,q) := \frac{1}{\tau}(\max_{i}\frac{L_{i}}{nq_{i}}) + (1-\frac{1}{\tau})L \]

Hint: Expand the square, separate diagonal and cross terms using independence, and use smoothness to bound gradient differences via Bregman divergences.


Solution. We first prove that \(\| \nabla f(x)-\nabla f(y) \|^2 \le 2L D_f(x,y)\).

\(h(z)=f(z)-f(y)-\langle \nabla f(y), z-y \rangle\),则 \(h(z)\ge 0\)\(h(y)=0\)。又 \(h(z)\)\(L\)-平滑的,故

\[ h(z)\le h(x)+\langle \nabla h(x), z-x \rangle + \frac{L}{2}||z-x||^2 \]

\(z^*=x-\frac{1}{L}\nabla h(x)\)(使右边最小),得

\[ 0\le h(z^*)\le h(x) - \frac{1}{2L}||\nabla h(x)||^2 \]

Hence \(||\nabla f(x)-\nabla f(y)||^2 = ||\nabla h(x)||^2 \le 2L h(x) = 2L D_f(x,y)\). The same argument applied to each \(f_i\) gives \(||\nabla f_i(x)-\nabla f_i(y)||^2 \le 2L_i D_{f_i}(x,y)\).

Back to the problem. Let \(Z_t = \frac{1}{n q_{s_{t}}}(\nabla f_{s_{t}}(x) - \nabla f_{s_{t}}(y))\), so that \(g(x)-g(y)=\frac{1}{\tau}\sum_{t=1}^{\tau} Z_t\). Since the \(Z_t\) are i.i.d. with \(\mathbb{E}[Z_1] = \nabla f(x)-\nabla f(y)\), we have:

\[ \begin{aligned} \mathbb{E}[||g(x)-g(y)||^2] &= \mathbb{E}\left[\left\|\frac{1}{\tau}\sum_{t=1}^{\tau} Z_t\right\|^2\right]\\ &= \frac{1}{\tau^2}\sum_{t=1}^{\tau} \mathbb{E}[||Z_t||^2] + \frac{1}{\tau^2}\sum_{t\neq t'} \langle \mathbb{E}[Z_t], \mathbb{E}[Z_{t'}] \rangle\\ &= \frac{1}{\tau} \mathbb{E}[||Z_1||^2] + \left(1-\frac{1}{\tau}\right) ||\nabla f(x)-\nabla f(y)||^2\\ &= \frac{1}{\tau} \sum_{i=1}^{n} q_i \left\|\frac{1}{n q_i}(\nabla f_i(x) - \nabla f_i(y))\right\|^2 + \left(1-\frac{1}{\tau}\right) ||\nabla f(x)-\nabla f(y)||^2\\ &\le \frac{1}{\tau} \sum_{i=1}^{n} \frac{2L_i}{n^2 q_i} D_{f_i}(x,y) + \left(1-\frac{1}{\tau}\right) 2L D_f(x,y)\\ &\le \frac{2}{\tau} \left(\max_{i}\frac{L_{i}}{nq_{i}}\right) \frac{1}{n}\sum_{i=1}^{n} D_{f_i}(x,y) + \left(1-\frac{1}{\tau}\right) 2L D_f(x,y)\\ &= 2\left(\frac{1}{\tau}\left(\max_{i}\frac{L_{i}}{nq_{i}}\right) + \left(1-\frac{1}{\tau}\right)L\right) D_f(x,y) = 2A''(\tau,q) D_f(x,y) \end{aligned} \]

where \(D_{f_i}\) denotes the Bregman divergence of \(f_i\), and \(\frac{1}{n}\sum_{i} D_{f_i}(x,y) = D_f(x,y)\) by linearity.

(Autumn, 2025, C5) [4 pts.] Extremes, Monotonicity, and Interpolation:

(a) Evaluate \(A''(\tau,q)\) at \(\tau=1\) and as \(\tau\rightarrow+\infty\). Identify the limiting algorithms (SGD-NS vs. GD) and the corresponding constants .

(b) Show that \(A''(\tau,q)\) is nonincreasing in \(\tau\), and interpret how the minibatch size interpolates between SGD-NS and GD .

Hint: Use \(\frac{1}{n}\sum_{i=1}^{n}L_{i} \ge L\).


解答\((a)\)\(\tau=1\) 时,\(A''(1,q) = \max_{i}\frac{L_{i}}{nq_{i}} + (1-1)L = \max_{i}\frac{L_{i}}{nq_{i}}\),对应 SGD-NS(非均匀采样的随机梯度下降)。当 \(\tau\rightarrow+\infty\) 时,\(A''(\tau,q) \rightarrow L\),对应 GD(全批梯度下降)。

(b): It suffices to show that \(\frac{1}{\tau}(\max_{i}\frac{L_{i}}{nq_{i}}) + (1-\frac{1}{\tau})L\) is nonincreasing in \(\tau\). Treating \(\tau\) as continuous and differentiating:

\[ \frac{d}{d\tau} \left( \frac{1}{\tau}(\max_{i}\frac{L_{i}}{nq_{i}}) + (1-\frac{1}{\tau})L \right) = \frac{1}{\tau^2}(L - \max_{i}\frac{L_{i}}{nq_{i}}) \]

Since

\[ \max_{i}\frac{L_{i}}{nq_{i}} \ge \sum_{i=1}^{n} q_i \cdot \frac{L_i}{n q_i} = \frac{1}{n}\sum_{i=1}^{n} L_i \ge L \]

the derivative is nonpositive, so \(A''(\tau,q)\) is nonincreasing in \(\tau\). As \(\tau\) grows from \(1\) to infinity, the algorithm interpolates from SGD-NS to GD, and the expected-smoothness constant decreases from \(\max_{i}\frac{L_{i}}{nq_{i}}\) down to \(L\).


(Autumn, 2025, C6) [4 pts.] Design of Importance Sampling \(q\): For a fixed \(\tau\), minimize the first term of \(A''(\tau,q)\), i.e.,

\[ \min_{q\in\Delta_{n}}\max_{i}\frac{L_{i}}{nq_{i}} \quad \text{where } \Delta_{n}=\{q\in\mathbb{R}_{++}^{n}:\sum_{i}q_{i}=1\} \]

Derive the optimal \(q^{*}\) and the attained value of \(\max\). Compare with uniform sampling \(q_{i}^{uni}=\frac{1}{n}\) .

Hint: Use \(\frac{1}{n}\sum_{i=1}^{n}L_{i} \le \max_{i}L_{i}\).


Solution. The maximum is minimized when all ratios are equal, \(\frac{L_i}{n q_i}=C\) for every \(i\) (otherwise shifting probability mass toward the index attaining the maximum strictly lowers it). From \(\sum_{i=1}^{n} q_i = 1\) we get \(q_i = \frac{L_i}{nC}\); solving for \(C\):

\[ \sum_{i=1}^{n} q_i = \sum_{i=1}^{n} \frac{L_i}{nC} = \frac{1}{nC} \sum_{i=1}^{n} L_i = 1 \implies C = \frac{1}{n}\sum_{i=1}^{n} L_i \]

Hence the optimal sampling distribution is \(q_i^* = \frac{L_i}{\sum_{j=1}^{n} L_j}\), attaining \(\max_{i}\frac{L_{i}}{nq_{i}^*} = \frac{1}{n}\sum_{i=1}^{n} L_i\). Under uniform sampling \(q_i^{uni} = \frac{1}{n}\), the corresponding constant is \(\max_{i}\frac{L_i}{n q_i^{uni}} = \max_{i} L_i\). Since \(\frac{1}{n}\sum_{i=1}^{n} L_i \le \max_{i} L_i\), importance sampling yields a smaller expected-smoothness constant and thus potentially faster convergence.
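A small numeric comparison of the two sampling schemes (with made-up smoothness constants \(L_i\)):

```python
import numpy as np

# Compare the factor max_i L_i / (n q_i) under uniform vs importance sampling
# q_i* = L_i / sum_j L_j, for synthetic per-component smoothness constants.
rng = np.random.default_rng(7)
n = 10
L_i = rng.uniform(1.0, 100.0, size=n)

q_uni = np.full(n, 1.0 / n)
q_opt = L_i / L_i.sum()

factor_uni = np.max(L_i / (n * q_uni))   # equals max_i L_i
factor_opt = np.max(L_i / (n * q_opt))   # equals (1/n) sum_i L_i
print(f"uniform: {factor_uni:.2f}, importance: {factor_opt:.2f}")
assert np.isclose(factor_opt, L_i.mean())
assert factor_opt <= factor_uni
```

The gap between the two factors grows with the spread of the \(L_i\): the more heterogeneous the components, the more importance sampling helps.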


(Autumn, 2025, C7) [3 pts.] Variance at the Optimum and Minibatch Scaling: Let \(\xi(x) := g(x) - \nabla f(x)\). Show that

\[ \mathbb{E}[||\xi(x^{*})||^{2}] = \frac{1}{\tau}\text{Var}\left(\frac{1}{nq_{s}}\nabla f_{s}(x^{*})\right) \]

i.e., the variance at the optimum scales as \(1/\tau\). Express your answer in terms of \(q\) and \(\{\nabla f_{i}(x^{*})\}_{i=1}^{n}\).


Solution. Note that \(\text{Var}(\xi(x^*))=\mathbb{E}[||\xi(x^*)||^2]-||\mathbb{E}[\xi(x^*)]||^2=\mathbb{E}[||\xi(x^*)||^2]\), because \(\xi(x^*)\) has zero mean. Also \(\xi(x^*) = g(x^*) - \nabla f(x^*) = \frac{1}{\tau}\sum_{t=1}^{\tau} \left( \frac{1}{n q_{s_t}} \nabla f_{s_t}(x^*) - \nabla f(x^*) \right)\), and since the \(s_t\) are i.i.d.:

\[ \begin{aligned} \mathbb{E}[||\xi(x^*)||^2] &= \text{Var}\left(\frac{1}{\tau}\sum_{t=1}^{\tau} \left( \frac{1}{n q_{s_t}} \nabla f_{s_t}(x^*) - \nabla f(x^*) \right)\right)\\ &= \frac{1}{\tau^2} \sum_{t=1}^{\tau} \text{Var}\left( \frac{1}{n q_{s_t}} \nabla f_{s_t}(x^*) - \nabla f(x^*) \right)\\ &= \frac{1}{\tau} \text{Var}\left( \frac{1}{n q_{s}} \nabla f_{s}(x^*) \right)\\ &= \frac{1}{\tau} \mathbb{E}_s \left[ \left\| \frac{1}{n q_s}\nabla f_s(x^*) - \nabla f(x^*) \right\|^2 \right] \\ &= \frac{1}{\tau} \sum_{i=1}^n q_i \left\| \frac{1}{n q_i}\nabla f_i(x^*) - \nabla f(x^*) \right\|^2 \end{aligned} \]

where \(\nabla f(x^*) = \frac{1}{n}\sum_{i}\nabla f_i(x^*)\) (note that \(\nabla f(x^*)\) need not vanish when \(R \neq 0\)).

(Autumn, 2025, C8) [2 pts.] AC inequality and computing (A, C): Consider the following classical results. AC inequality: there exist constants \(A \ge 0\) and \(C \ge 0\) such that for all \(k \ge 0\),

\[ \mathbb{E}[||g^{k}-\nabla f(x^{*})||^{2}|x^{k}] \le 2AD_{f}(x^{k},x^{*}) + C \]

Implication from expected smoothness: If \(g(x)\) is an unbiased estimator of \(\nabla f(x)\) and, for all \(x, y\),

\[ \mathbb{E}[||g(x)-g(y)||^{2}] \le 2A''D_{f}(x,y) + C''(y) \]

then for \(G(x,y) := \mathbb{E}||g(x)-\nabla f(y)||^{2}\) one has the AC inequality \(G(x,y) \le 2AD_{f}(x,y) + C\), where \(A=2A''\) and \(C=2(\text{Var}[g(y)] + C''(y))\) . Specialize the above implication to obtain the AC constants \((A, C)\), and write an explicit formula for \(\text{Var}[g(x^{*})]\).


Solution. In C4 we showed \(\mathbb{E}[||g(x)-g(y)||^2] \le 2A'' D_f(x,y)\) with \(A'' = A''(\tau,q)\) (so \(C''(y) = 0\)), and in C7 we computed \(\text{Var}[g(x^*)] = \frac{1}{\tau}\sum_{i=1}^n q_i \left\| \frac{1}{n q_i}\nabla f_i(x^*) - \nabla f(x^*) \right\|^2\). The implication therefore yields:

\[ A = 2A''(\tau,q) = 2\left(\frac{1}{\tau}\left(\max_{i}\frac{L_{i}}{nq_{i}}\right) + \left(1-\frac{1}{\tau}\right)L\right) \]
\[ C = 2 \text{Var}[g(x^*)] = \frac{2}{\tau} \sum_{i=1}^n q_i \left\| \frac{1}{n q_i}\nabla f_i(x^*) - \nabla f(x^*) \right\|^2 \]

(Autumn, 2025, C9) [2 pts.] Stepsize and convergence: Given the classical convergence theorem of SGD: assume \(f\) is \(\mu\)-strongly convex, \(g^{k}\) is unbiased, and the AC inequality holds with constants \((A,C)\). Then for any stepsize \(0 < \gamma \le \frac{1}{A}\) the iterates satisfy

\[ \mathbb{E}||x^{k}-x^{*}||^{2} \le (1-\gamma\mu)^{k}||x^{0}-x^{*}||^{2} + \frac{\gamma C}{\mu} \]

Using the \((A, C)\) obtained in question 8, compute: the admissible stepsize range; the explicit convergence bound with the noise floor written in closed form.


Solution. The theorem requires \(\gamma \le \frac{1}{A}\), i.e.:

\[ \gamma \le \frac{1}{\frac{2}{\tau}(\max_{i}\frac{L_{i}}{nq_{i}}) + 2(1-\frac{1}{\tau})L} \]

Substituting the computed \(C\) into the theorem's bound \(\mathbb{E}||x^{k}-x^{*}||^{2} \le (1-\gamma\mu)^{k}||x^{0}-x^{*}||^{2} + \frac{\gamma C}{\mu}\):

\[ \mathbb{E}||x^{k}-x^{*}||^{2} \le (1-\gamma\mu)^{k}||x^{0}-x^{*}||^{2} + \frac{2\gamma}{\mu \tau} \sum_{i=1}^n q_i \left\| \frac{1}{n q_i}\nabla f_i(x^*) - \nabla f(x^*) \right\|^2 \]