DSC243 Note (2): From Local Convergence to Global Convergence
What is a solution?

Problem: $\min_{x\in A}f(x)$

Solution: $f(x)-f(x^*)\le \epsilon$, with probability $1-\delta$
This is a probabilistic definition, because we usually use stochastic algorithms. As the algorithm runs longer, it asymptotically converges ($\epsilon$ and $\delta$ go to 0).
Local Solution
Finding a global optimum is difficult (as discussed in the last lecture), so we instead talk about local solutions:
Imagine we are using gradient descent; it will stop when it reaches a local optimum.
First order stationary point (FOSP): $\nabla f(x)=0$
It doesn’t specify the stability of the solution.
Second order stationary point: $\nabla^2 f(x)\succ 0$ and $\nabla f(x)=0$. (The Hessian matrix is positive definite + first order stationary.)
A more popular definition of second-order stationary point actually only requires the Hessian to be positive semidefinite, but then you cannot guarantee the solution is stable.
Around a FOSP, we can apply a Taylor expansion (to second order):
\[f(x)\approx f(x^*) + \langle\nabla f(x^*), x-x^*\rangle+\frac{1}{2}(x-x^*)^T\nabla^2f(x^*)(x-x^*) = \phi(x)\]The gradient is zero, so we only need to worry about the second-order term. How would gradient descent behave?
Let $H:=\nabla^2 f(x^*)$, and let $h$ be the learning rate.
\[x_{t+1}=x_t - h\nabla \phi(x_t)=x_t - hH(x_t - x^*)\]\[x_{t+1}-x^* = (I-hH)(x_t - x^*)\]Quick review: for symmetric $A$, $\frac{1}{2}x^TAx = \frac{1}{2}\sum_{i< j}2a_{ij}x_ix_j + \frac{1}{2}\sum_{i}a_{ii}x_i^2$. Taking the derivative with respect to $x_i$ gives $\sum_j a_{ij}x_j$, which is the $i$-th element of $Ax$.
$N$ is the number of optimization steps.
$H=U\Lambda U^T$, where $\Lambda$ is a diagonal matrix composed of all eigenvalues and $U$ is orthogonal.
\[x_N - x^* = (I-hH)^N(x_0-x^*)=U(I-h\Lambda)^NU^T(x_0-x^*)\] Case 1: If one of the eigenvalues $\lambda_i$ is negative, then $1-h\lambda_i$ is larger than 1, so the corresponding component explodes. This is a saddle point.
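To make Case 1 concrete, here is a small numeric sketch; the 2-d quadratic with eigenvalues $1$ and $-1$, the learning rate, and the starting point are all illustrative choices, not from the lecture:

```python
import numpy as np

# Illustrative 2-d quadratic f(x) = 1/2 x^T H x with H = diag(1, -1):
# x* = 0 is a FOSP but a saddle point (one negative eigenvalue).
H = np.diag([1.0, -1.0])
h = 0.1                       # learning rate
x = np.array([0.5, 1e-3])     # start slightly off the stable direction

for _ in range(200):
    x = x - h * H @ x         # gradient descent: x_{t+1} = (I - hH) x_t

# The lambda = 1 component shrinks (|1 - h| < 1),
# while the lambda = -1 component grows (|1 + h| > 1).
print(abs(x[0]) < 1e-3, abs(x[1]) > 1.0)   # True True
```

Even a tiny perturbation along the negative-eigenvalue direction is amplified geometrically, which is why plain GD eventually escapes (or diverges from) strict saddles.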
 Case 2: $\lambda_1 > \dots > \lambda_d > 0$.
\[\begin{align*} \|x_{N}-x^*\|_2 &= \|U(I-h\Lambda)^N U^T (x_0-x^*)\|_2 \\ &\le \|U(I-h\Lambda)^N U^T\|_2 \, \|x_0-x^*\|_2 \end{align*}\]Quick review: the operator norm of a matrix: \(\|A\|_p=\max_{\|x\|_p\le 1} \|Ax\|_p\). Geometrically speaking, one can imagine a $p$-norm unit ball, then apply the linear map to the ball. It would end up becoming a distorted convex shape, and the $p$-norm measures the longest “radius” of that shape. According to the definition, we have $\|Ax\|\le \|A\|\,\|x\|$.
We need to have $h\le \frac{1}{\lambda_1}$, otherwise it will explode. (*) Choose $h = \frac{1}{\lambda_1}$, where $\lambda_1$ is the maximum eigenvalue. Then the last direction ($\lambda_d$) is the slowest one, because $1-\frac{\lambda_d}{\lambda_1}$ is close to one.
\[\|x_N-x^*\|_2 \le \left(1-\frac{\lambda_d}{\lambda_1}\right)^N\|x_0 - x^*\|_2 \le \epsilon\]It takes $N=\frac{\lambda_1}{\lambda_d}\log \frac{\|x_0 - x^*\|_2}{\epsilon}$ steps to converge. Here, $\frac{\lambda_1}{\lambda_d}$ is the condition number.
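The linear dependence on the condition number can be checked numerically; the diagonal quadratic, tolerance, and learning rate $h=1/\lambda_1$ below are illustrative choices:

```python
import numpy as np

def gd_steps(lams, eps=1e-6):
    """GD on f(x) = 1/2 * sum_i lams[i] * x_i^2 with h = 1/max(lams);
    returns the number of steps until ||x_t - x*||_2 <= eps (x* = 0)."""
    lams = np.asarray(lams)
    h = 1.0 / lams.max()
    x = np.ones_like(lams)            # x_0
    t = 0
    while np.linalg.norm(x) > eps:
        x = x - h * lams * x          # diagonal H: x <- (I - h*Lambda) x
        t += 1
    return t

t_small = gd_steps([10.0, 1.0])       # condition number kappa = 10
t_large = gd_steps([20.0, 1.0])       # condition number kappa = 20
print(t_small, t_large)               # doubling kappa roughly doubles the count
```

The slow direction ($\lambda_d$) dominates: its contraction factor $1-\lambda_d/\lambda_1$ approaches 1 as the condition number grows.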
(*) Note that only first-order information is visible to GD. We make use of the Hessian matrix only to analyze the best GD can achieve.
Towards global solution
We need stronger regularity assumptions.
Convexity

Second-order condition: \(\nabla^2f(x)\succeq 0\)

First-order condition: predicting the function value with the tangent, the actual function always lies above this linear approximation.
\[\langle\nabla f(x), y-x\rangle \le f(y) - f(x)\]Local information tells a lot about global information: you can ignore a direction if the directional derivative along it is nonnegative, because you know the function will never decrease in that direction.

Zero-order condition: the chord between any two points on the graph always lies above the function
\[f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda)f(y)\]
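As a numeric sanity check (not a proof), all three conditions can be verified on sample points for a known convex function; $f(x)=e^x$ and the sampling ranges below are my choices:

```python
import numpy as np

# Check the three convexity conditions on random samples for f(x) = exp(x),
# whose first and second derivatives are also exp(x).
f, fp, fpp = np.exp, np.exp, np.exp
rng = np.random.default_rng(0)
xs, ys = rng.uniform(-2, 2, 100), rng.uniform(-2, 2, 100)
lam = rng.uniform(0, 1, 100)

second = bool(np.all(fpp(xs) >= 0))                               # f'' >= 0
first = bool(np.all(fp(xs) * (ys - xs) <= f(ys) - f(xs) + 1e-12)) # tangent below
zero = bool(np.all(f(lam * xs + (1 - lam) * ys)
                   <= lam * f(xs) + (1 - lam) * f(ys) + 1e-12))   # chord above
print(second, first, zero)   # True True True
```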
Proof
Hint: think about the geometry. Going from a lower-order condition to a higher-order one requires taking derivatives; the reverse direction requires integration.

0 → 1:
\[f(y+\lambda(x-y)) \le \lambda f(x) + (1-\lambda)f(y)\]Taking the derivative with respect to $\lambda$:
Quick review: think of the limit as $\lambda \rightarrow 0$, and ignore higher-order terms in the Taylor expansion.
Setting $\lambda =0$:
\[\langle\nabla f(y), x-y\rangle \le f(x) - f(y)\] 
1 → 2:
(a) Starting with the simple case where $d=1$.
\[f(y) \ge f(x) + f'(x) (y-x)\] \[f(x) \ge f(y) + f'(y) (x-y)\]From these two inequalities, we get:
\[f'(y)(x-y) \le f(x)-f(y) \le f'(x)(x-y)\]Assuming $x-y>0$,
\[\frac{f'(x)-f'(y)}{x-y}\ge 0\]Taking the limit $y \to x$ gives $f''(x)\ge 0$. (b) For arbitrary dimension, we want to reduce to the 1-d case:
\[g(\alpha) = f(x+\alpha v)\]We can show $g$ is also a convex function by expanding the definition. So, applying the 1-d conclusion, we have:
\[g''(\alpha)\ge 0\] \[v^T\nabla^2 f(x+\alpha v)v \ge 0\]Taking $\alpha=0$, we get the definition of a positive semidefinite matrix.
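The reduction step can be illustrated numerically: the second derivative of the restriction $g(\alpha)=f(x+\alpha v)$ matches $v^T\nabla^2 f\, v$. The test function $f(x)=\|x\|_2^4$ and the points below are illustrative choices:

```python
import numpy as np

# g(a) = f(x + a*v) has g''(a) = v^T Hessian(f)(x + a*v) v.
# Check by finite differences for the convex function f(x) = ||x||_2^4.
def f(x):
    return np.dot(x, x) ** 2

x = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
g = lambda a: f(x + a * v)

eps = 1e-4
g2_fd = (g(eps) - 2 * g(0) + g(-eps)) / eps**2   # finite-difference g''(0)

# Analytic Hessian of ||x||^4: 4*||x||^2 * I + 8 * x x^T
H = 4 * np.dot(x, x) * np.eye(2) + 8 * np.outer(x, x)
print(abs(g2_fd - v @ H @ v) < 1e-4, v @ H @ v >= 0)
```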
Strongly convex
Intuition: $x^2$ is the standard convex function. If a function is “more convex” than $\frac{m}{2}x^2$…
Definition of $m$-strongly convex: \(f(x) - \frac{m}{2} \|x\|^2 \text{ is convex.}\)
 Second-order: \(\nabla^2f(x)\succeq m\cdot I\), i.e. \(\lambda_{\min}(\nabla^2f(x))\ge m\)
 First-order: \(f(y)-f(x)\ge \langle\nabla f(x), y-x\rangle + \frac{m}{2} \|y-x\|^2\)
 Zero-order: Skipped. (Simply apply the definition.)
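A small sketch tying the conditions together for a quadratic (the matrix $A$ below is an arbitrary example, not from the lecture): the largest valid $m$ is $\lambda_{\min}$ of the Hessian, and the first-order inequality then holds at every sampled pair of points.

```python
import numpy as np

# f(x) = 1/2 x^T A x has constant Hessian A, so f is m-strongly convex
# exactly when lambda_min(A) >= m (second-order condition).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # symmetric positive definite example
m = np.linalg.eigvalsh(A).min()      # largest m for which f is m-strongly convex

# First-order check at random points:
# f(y) - f(x) >= <grad f(x), y - x> + m/2 * ||y - x||^2
rng = np.random.default_rng(1)
ok = True
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = 0.5 * y @ A @ y - 0.5 * x @ A @ x
    rhs = (A @ x) @ (y - x) + 0.5 * m * np.dot(y - x, y - x)
    ok &= lhs >= rhs - 1e-9
print(m > 0, ok)
```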
In the next class, we will show how to use these conditions to derive global convergence.