DSC243 Note (2): From Local Convergence to Global Convergence
Probabilistic Definition of Solution
What is a solution?
- Problem: $\min_{x\in A}f(x)$
- Solution: $|f(x)-f(x^*)|\le \epsilon$, with probability $1-\delta$
This is a probabilistic definition because we usually use stochastic algorithms. As the algorithm runs longer, it asymptotically converges ($\epsilon$ and $\delta$ go to 0).
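To make the $(\epsilon, \delta)$ language concrete, here is a minimal sketch (my own illustration, not from the lecture) that estimates the success probability empirically for SGD on the toy problem $f(x)=x^2$; the step size, noise level, and starting point are arbitrary choices.

```python
# Empirically estimating P(|f(x_T) - f(x*)| <= eps) for SGD on f(x) = x^2.
# All constants here are illustrative, not from the lecture.
import numpy as np

def run_sgd(T, lr=0.05, noise=1.0, rng=None):
    rng = rng or np.random.default_rng()
    x = 5.0                                        # arbitrary starting point
    for _ in range(T):
        g = 2 * x + noise * rng.standard_normal()  # stochastic gradient
        x -= lr * g
    return x

eps = 0.1
for T in [10, 100, 1000]:
    rng = np.random.default_rng(0)
    xs = [run_sgd(T, rng=rng) for _ in range(500)]
    # x* = 0 and f(x*) = 0, so the suboptimality is just x^2.
    p = np.mean([x ** 2 <= eps for x in xs])
    print(f"T = {T:4d}:  P(|f(x) - f(x*)| <= {eps}) ~ {p:.2f}")
```

As $T$ grows, the estimated failure probability $\delta$ shrinks, matching the asymptotic picture above.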
Local Solution
Since finding a global optimum is difficult (as discussed in the last lecture), we can instead talk about local solutions:
Imagine we are using gradient descent; it stops when it reaches a local optimum.
First-order stationary point (FOSP): $\nabla f(x)=0$
This condition alone does not specify the stability of the solution.
Second-order stationary point: $\nabla f(x)=0$ and $\nabla^2 f(x)\succ 0$. (First-order stationary + the Hessian matrix is positive definite.)
A more popular definition of a second-order stationary point only requires the Hessian to be positive semi-definite, but then stability cannot be guaranteed.
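As a small illustration of these definitions (my own sketch, not from the lecture), the two toy functions below both have a stationary point at the origin, but only the first satisfies the second-order condition:

```python
# Classifying a stationary point via the gradient and the Hessian eigenvalues.
import numpy as np

def classify(grad, hess, x):
    if np.linalg.norm(grad(x)) > 1e-8:
        return "not stationary"
    eigs = np.linalg.eigvalsh(hess(x))
    if np.all(eigs > 0):
        return "SOSP (Hessian positive definite)"
    return "FOSP only (a saddle if some eigenvalue is negative)"

# f1(x) = x1^2 + x2^2: the origin is a strict local minimum.
print(classify(lambda x: 2 * x, lambda x: 2 * np.eye(2), np.zeros(2)))
# f2(x) = x1^2 - x2^2: the origin is first-order stationary but a saddle.
print(classify(lambda x: np.array([2 * x[0], -2 * x[1]]),
               lambda x: np.diag([2.0, -2.0]), np.zeros(2)))
```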
Around a FOSP $x^*$, we can apply a Taylor expansion (approximated to second order):
\[f(x)\approx f(x^*) + <\nabla f(x^*), x-x^*>+\frac{1}{2}(x-x^*)^T\nabla^2f(x^*)(x-x^*) = \phi(x)\]
The gradient term is zero, so we only need to worry about the second-order term. How would gradient descent behave on $\phi$?
Let $H:=\nabla^2 f(x^*)$ and let $h$ be the learning rate.
\[x_{t+1}=x_t - h\nabla \phi(x_t)=x_t - hH(x_t - x^*)\]
\[x_{t+1}-x^* = (I-hH)(x_t - x^*)\]
Quick review: for symmetric $A$, $\frac{1}{2}x^TAx = \frac{1}{2}\sum_{i\ne j}a_{ij}x_ix_j + \frac{1}{2}\sum_{i}a_{ii}x_i^2$. The partial derivative with respect to $x_i$ is $\sum_j a_{ij}x_j$, which is the $i$-th element of $Ax$.
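A quick numerical check of the quadratic-form gradient in the review above (my own sketch; $A$ and $x$ are random placeholders): the finite-difference gradient of $\frac{1}{2}x^TAx$ matches $Ax$ when $A$ is symmetric.

```python
# Finite-difference check that the gradient of 1/2 x^T A x is A x (A symmetric).
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # symmetrize a random matrix
x = rng.standard_normal(4)

f = lambda z: 0.5 * z @ A @ z
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
print(np.allclose(num_grad, A @ x, atol=1e-6))   # expect True
```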
Let $T$ be the number of optimization steps.
Eigendecomposition: $H=u\Lambda u^T$, where $\Lambda$ is a diagonal matrix of the eigenvalues and $u$ is orthogonal.
\[x_T - x^* = (I-hH)^T(x_0-x^*)=u(I-h\Lambda)^Tu^T(x_0-x^*)\]
- Case 1: If one of the eigenvalues $\lambda_i$ is negative, then $1-h\lambda_i$ is larger than 1, so the component along that eigenvector explodes. The FOSP is a saddle point.
- Case 2: $\lambda_1 \ge \dots \ge \lambda_d > 0$.
\[\begin{align*} ||x_{T}-x^*||_2 &= ||u(I-h\Lambda)^T u^T (x_0-x^*)||_2 \\ &\le ||u(I-h\Lambda)^T u^T ||_2 \ ||x_0-x^*||_2 \end{align*}\]
Quick review: the operator norm of a matrix is \(||A||_p=\max_{||x||_p\le 1} ||Ax||_p\). Geometrically, imagine the $p$-norm unit ball and apply the linear map to it; it becomes a distorted convex shape, and the norm measures the longest “radius” of that shape. From the definition, we have $||Ax||\le ||A||\ ||x||$.
We need $h\le \frac{1}{\lambda_1}$ so that every factor $1-h\lambda_i$ stays in $[0,1)$ (if $h$ is too large, the iterates explode). (*) Choosing $h=\frac{1}{\lambda_1}$, where $\lambda_1$ is the largest eigenvalue, the last direction (eigenvalue $\lambda_d$) is the slowest, because $1-\frac{\lambda_d}{\lambda_1}$ is close to one.
\[||x_T-x^*||_2 \le \left(1-\frac{\lambda_d}{\lambda_1}\right)^T||x_0 - x^*||_2 \le \epsilon\]
It takes $T=\frac{\lambda_1}{\lambda_d}\log \frac{||x_0 - x^*||_2}{\epsilon}$ steps to reach accuracy $\epsilon$. Here, $\frac{\lambda_1}{\lambda_d}$ is the condition number.
(*) Note that only first-order information is visible to GD. We use the Hessian here only in the analysis, to determine the best that GD can achieve.
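To see this analysis numerically, here is a small sketch (my own illustration, not from the lecture) of GD on the quadratic $\phi$ with a diagonal Hessian (taking $u=I$ for simplicity) and $h=1/\lambda_1$: when all eigenvalues are positive the error shrinks roughly like $(1-\lambda_d/\lambda_1)^T$, and a single negative eigenvalue (a saddle) makes one component blow up. The eigenvalues below are arbitrary.

```python
# GD on phi(x) = 1/2 (x - x*)^T H (x - x*) with x* = 0 and H = diag(eigvals).
import numpy as np

def gd_on_quadratic(eigvals, T, x0):
    H = np.diag(eigvals)
    h = 1.0 / eigvals.max()            # learning rate h = 1 / lambda_1
    x = x0.copy()
    for _ in range(T):
        x = x - h * (H @ x)            # GD step: x - h * grad phi(x)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)

# Case 2: all eigenvalues positive; condition number kappa = 10.
eigvals = np.array([10.0, 3.0, 1.0])
for T in [10, 50, 100]:
    err = np.linalg.norm(gd_on_quadratic(eigvals, T, x0))
    bound = (1 - eigvals.min() / eigvals.max()) ** T * np.linalg.norm(x0)
    print(f"T = {T:3d}:  ||x_T - x*|| = {err:.2e}  (bound {bound:.2e})")

# Case 1: one negative eigenvalue; the saddle direction explodes.
print(np.linalg.norm(gd_on_quadratic(np.array([10.0, 3.0, -1.0]), 100, x0)))
```

Increasing the condition number $\frac{\lambda_1}{\lambda_d}$ visibly slows the decay, in line with the estimate $T=\frac{\lambda_1}{\lambda_d}\log \frac{||x_0 - x^*||_2}{\epsilon}$.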
Towards a global solution
We need stronger regularity conditions on $f$.
Convexity
- Second-order condition: \(\nabla^2f(x)\succeq 0\)
- First-order condition: predicting the function value with the tangent, the actual function always lies above this prediction.
\[<\nabla f(x), y-x> \le f(y) - f(x)\]
Local information tells a lot about global information: you can ignore a direction if the directional derivative along it is positive, because the function never decreases along that direction.
- Zero-order condition: the chord between two points always lies above the actual function.
\[f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda)f(y)\]
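All three conditions can be checked numerically on a concrete convex function; below is a small sketch (my own, not from the lecture) using $f(x)=||Ax-b||^2$ with random $A$, $b$ as placeholders.

```python
# Sanity-checking the zero-, first-, and second-order convexity conditions
# on the convex function f(x) = ||A x - b||^2.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)

f = lambda x: np.sum((A @ x - b) ** 2)
grad = lambda x: 2 * A.T @ (A @ x - b)
hess = 2 * A.T @ A                      # constant Hessian of a quadratic

x, y, lam = rng.standard_normal(3), rng.standard_normal(3), 0.3

# Zero-order: the chord lies above the function.
print(f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y))
# First-order: the tangent prediction lies below the function.
print(grad(x) @ (y - x) <= f(y) - f(x))
# Second-order: the Hessian is positive semi-definite.
print(np.all(np.linalg.eigvalsh(hess) >= -1e-10))
```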
Proof
Hint: think about the geometry. Going from a lower-order condition to a higher-order one requires taking derivatives; going the other way requires integration.
- 0 -> 1:
\[f(y+\lambda(x-y)) \le \lambda f(x) + (1-\lambda)f(y)\]
Subtract $f(y)$ from both sides and divide by $\lambda$:
\[\frac{f(y+\lambda(x-y)) - f(y)}{\lambda} \le f(x) - f(y)\]
Quick review: think of the limit when $\lambda \rightarrow 0$, and ignore the higher-order terms in the Taylor expansion.
Letting $\lambda \rightarrow 0$ (equivalently, taking the derivative at $\lambda=0$), the left-hand side becomes the directional derivative:
\[<\nabla f(y), x-y> \le f(x) - f(y)\]
- 1 -> 2:
(a) Start with the simple case where $d=1$.
\[f(y) \ge f(x) + f'(x) (y-x)\]
\[f(x) \ge f(y) + f'(y) (x-y)\]
From these two inequalities, we get:
\[f'(y)(x-y) \le f(x)-f(y) \le f'(x)(x-y)\]
Assuming $x-y>0$,
\[\frac{f'(x)-f'(y)}{x-y}\ge 0\]
Letting $y \rightarrow x$, this gives $f''(x) \ge 0$.
(b) For arbitrary dimension, we reduce to the 1-d case by restricting $f$ to a line:
\[g(\alpha) = f(x+\alpha v)\]
We can show $g$ is also convex by expanding the definition. So, applying the 1-d conclusion, we have:
\[g''(\alpha)\ge 0\]
\[v^T\nabla^2 f(x+\alpha v)\,v \ge 0\]
Taking $\alpha=0$ and noting that $v$ is arbitrary, this is exactly the definition of a positive semi-definite matrix: $\nabla^2 f(x)\succeq 0$.
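As a quick sanity check of step (b) (my own sketch, reusing the same convex $f$ as above), restrict $f$ to a line and compare a finite-difference estimate of $g''(0)$ with $v^T\nabla^2 f(x)\,v$:

```python
# Restrict f to a line, g(alpha) = f(x + alpha v), and check g''(0) >= 0.
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
f = lambda z: np.sum((A @ z - b) ** 2)   # a convex quadratic
hess = 2 * A.T @ A

x, v = rng.standard_normal(3), rng.standard_normal(3)
g = lambda a: f(x + a * v)

eps = 1e-4
g2 = (g(eps) - 2 * g(0.0) + g(-eps)) / eps ** 2   # central difference
print(g2, v @ hess @ v)                           # should agree and be >= 0
```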
Strongly convex
Intuition: $x^2$ is the prototypical convex function. A function is strongly convex if it is “more convex” than the quadratic $\frac{m}{2}||x||^2$:
Definition of $m$-strong convexity: \(f(x) - \frac{m}{2} ||x||^2 \text{ is convex.}\)
- Second-order: \(\nabla^2f(x)\succeq m\cdot I\), equivalently \(\lambda_{\min}(\nabla^2f(x))\ge m\)
- First-order: \(f(y)-f(x)\ge <\nabla f(x), y-x> + \frac{m}{2} ||y-x||^2\)
- Zero-order: Skipped. (simply apply the definition)
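A minimal numerical check of the second-order condition (my own sketch, not from the lecture), using a ridge-style objective $f(x)=||Ax-b||^2+\frac{m}{2}||x||^2$ as an illustrative example:

```python
# Second-order check of m-strong convexity: lambda_min(Hessian) >= m.
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
m = 0.5

# f(x) = ||A x - b||^2 + (m/2) ||x||^2  =>  Hessian = 2 A^T A + m I.
H = 2 * A.T @ A + m * np.eye(3)
print(np.linalg.eigvalsh(H).min() >= m)          # second-order condition

# Equivalently, f(x) - (m/2)||x||^2 has Hessian 2 A^T A, which is PSD,
# matching the definition that f(x) - (m/2)||x||^2 is convex.
print(np.linalg.eigvalsh(2 * A.T @ A).min() >= -1e-10)
```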
In the next class, we will show how to use these conditions to derive global convergence.