DSC243 Note (2): From Local Convergence to Global Convergence
Probabilistic Definition of Solution
What is a solution?
- Problem: $\min_{x\in A}f(x)$
- Solution: $|f(x)-f(x^*)|\le \epsilon$, with probability $1-\delta$
This is a probabilistic definition because we usually use stochastic algorithms. As the algorithm runs longer, it asymptotically converges ($\epsilon$ and $\delta$ go to 0).
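To make the $(\epsilon, \delta)$ language concrete, here is a minimal sketch (my own illustration, not from the lecture) that estimates the success probability empirically for SGD on the toy problem $f(x)=x^2$; the step size, noise level, and starting point are arbitrary choices.

```python
# Empirically estimating P(|f(x_T) - f(x*)| <= eps) for SGD on f(x) = x^2.
# All constants here are illustrative, not from the lecture.
import numpy as np

def run_sgd(T, lr=0.05, noise=1.0, rng=None):
    rng = rng or np.random.default_rng()
    x = 5.0                                        # arbitrary starting point
    for _ in range(T):
        g = 2 * x + noise * rng.standard_normal()  # stochastic gradient
        x -= lr * g
    return x

eps = 0.1
for T in [10, 100, 1000]:
    rng = np.random.default_rng(0)
    xs = [run_sgd(T, rng=rng) for _ in range(500)]
    # x* = 0 and f(x*) = 0, so the suboptimality is just x^2.
    p = np.mean([x ** 2 <= eps for x in xs])
    print(f"T = {T:4d}:  P(|f(x) - f(x*)| <= {eps}) ~ {p:.2f}")
```

As $T$ grows, the estimated failure probability $\delta$ shrinks, matching the asymptotic picture above.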
Local Solution
Since finding a global optimum is difficult (as discussed in the last lecture), we can instead talk about local solutions:
Imagine we are using gradient descent; it stops when it reaches a local optimum.
First-order stationary point (FOSP): $\nabla f(x)=0$
This condition alone does not specify the stability of the solution.
Second-order stationary point: $\nabla f(x)=0$ and $\nabla^2 f(x)\succ 0$. (First-order stationary + the Hessian matrix is positive definite.)
A more popular definition of a second-order stationary point only requires the Hessian to be positive semi-definite, but then stability cannot be guaranteed.
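As a small illustration of these definitions (my own sketch, not from the lecture), the two toy functions below both have a stationary point at the origin, but only the first satisfies the second-order condition:

```python
# Classifying a stationary point via the gradient and the Hessian eigenvalues.
import numpy as np

def classify(grad, hess, x):
    if np.linalg.norm(grad(x)) > 1e-8:
        return "not stationary"
    eigs = np.linalg.eigvalsh(hess(x))
    if np.all(eigs > 0):
        return "SOSP (Hessian positive definite)"
    return "FOSP only (a saddle if some eigenvalue is negative)"

# f1(x) = x1^2 + x2^2: the origin is a strict local minimum.
print(classify(lambda x: 2 * x, lambda x: 2 * np.eye(2), np.zeros(2)))
# f2(x) = x1^2 - x2^2: the origin is first-order stationary but a saddle.
print(classify(lambda x: np.array([2 * x[0], -2 * x[1]]),
               lambda x: np.diag([2.0, -2.0]), np.zeros(2)))
```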
Around a FOSP $x^*$, we can apply a Taylor expansion (approximated to second order):
\[f(x)\approx f(x^*) + <\nabla f(x^*), x-x^*>+\frac{1}{2}(x-x^*)^T\nabla^2f(x^*)(x-x^*) = \phi(x)\]
The gradient term is zero, so we only need to worry about the second-order term. How would gradient descent behave on $\phi$?
Let $H:=\nabla^2 f(x^*)$ and let $h$ be the learning rate.
\[x_{t+1}=x_t - h\nabla \phi(x_t)=x_t - hH(x_t - x^*)\]
\[x_{t+1}-x^* = (I-hH)(x_t - x^*)\]
Quick review: for symmetric $A$, $\frac{1}{2}x^TAx = \frac{1}{2}\sum_{i\ne j}a_{ij}x_ix_j + \frac{1}{2}\sum_{i}a_{ii}x_i^2$. The partial derivative with respect to $x_i$ is $\sum_j a_{ij}x_j$, which is the $i$-th element of $Ax$.
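A quick numerical check of the quadratic-form gradient in the review above (my own sketch; $A$ and $x$ are random placeholders): the finite-difference gradient of $\frac{1}{2}x^TAx$ matches $Ax$ when $A$ is symmetric.

```python
# Finite-difference check that the gradient of 1/2 x^T A x is A x (A symmetric).
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # symmetrize a random matrix
x = rng.standard_normal(4)

f = lambda z: 0.5 * z @ A @ z
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])
print(np.allclose(num_grad, A @ x, atol=1e-6))   # expect True
```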
Let $T$ be the number of optimization steps.
Eigendecomposition: $H=u\Lambda u^T$, where $\Lambda$ is a diagonal matrix of the eigenvalues and $u$ is orthogonal.
\[x_T - x^* = (I-hH)^T(x_0-x^*)=u(I-h\Lambda)^Tu^T(x_0-x^*)\]
- Case 1: If one of the eigenvalues $\lambda_i$ is negative, then $1-h\lambda_i$ is larger than 1, so the component along that eigenvector explodes. The FOSP is a saddle point.
- Case 2: $\lambda_1 \ge \dots \ge \lambda_d > 0$.
\[\begin{align*} ||x_{T}-x^*||_2 &= ||u(I-h\Lambda)^T u^T (x_0-x^*)||_2 \\ &\le ||u(I-h\Lambda)^T u^T ||_2 \ ||x_0-x^*||_2 \end{align*}\]
Quick review: the operator norm of a matrix is \(||A||_p=\max_{||x||_p\le 1} ||Ax||_p\). Geometrically, imagine the $p$-norm unit ball and apply the linear map to it; it becomes a distorted convex shape, and the norm measures the longest “radius” of that shape. From the definition, we have $||Ax||\le ||A||\ ||x||$.
We need $h\le \frac{1}{\lambda_1}$ so that every factor $1-h\lambda_i$ stays in $[0,1)$ (if $h$ is too large, the iterates explode). (*) Choosing $h=\frac{1}{\lambda_1}$, where $\lambda_1$ is the largest eigenvalue, the last direction (eigenvalue $\lambda_d$) is the slowest, because $1-\frac{\lambda_d}{\lambda_1}$ is close to one.
\[||x_T-x^*||_2 \le \left(1-\frac{\lambda_d}{\lambda_1}\right)^T||x_0 - x^*||_2 \le \epsilon\]
It takes $T=\frac{\lambda_1}{\lambda_d}\log \frac{||x_0 - x^*||_2}{\epsilon}$ steps to reach accuracy $\epsilon$. Here, $\frac{\lambda_1}{\lambda_d}$ is the condition number.
(*) Note that only first-order information is visible to GD. We use the Hessian here only in the analysis, to determine the best that GD can achieve.
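To see this analysis numerically, here is a small sketch (my own illustration, not from the lecture) of GD on the quadratic $\phi$ with a diagonal Hessian (taking $u=I$ for simplicity) and $h=1/\lambda_1$: when all eigenvalues are positive the error shrinks roughly like $(1-\lambda_d/\lambda_1)^T$, and a single negative eigenvalue (a saddle) makes one component blow up. The eigenvalues below are arbitrary.

```python
# GD on phi(x) = 1/2 (x - x*)^T H (x - x*) with x* = 0 and H = diag(eigvals).
import numpy as np

def gd_on_quadratic(eigvals, T, x0):
    H = np.diag(eigvals)
    h = 1.0 / eigvals.max()            # learning rate h = 1 / lambda_1
    x = x0.copy()
    for _ in range(T):
        x = x - h * (H @ x)            # GD step: x - h * grad phi(x)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)

# Case 2: all eigenvalues positive; condition number kappa = 10.
eigvals = np.array([10.0, 3.0, 1.0])
for T in [10, 50, 100]:
    err = np.linalg.norm(gd_on_quadratic(eigvals, T, x0))
    bound = (1 - eigvals.min() / eigvals.max()) ** T * np.linalg.norm(x0)
    print(f"T = {T:3d}:  ||x_T - x*|| = {err:.2e}  (bound {bound:.2e})")

# Case 1: one negative eigenvalue; the saddle direction explodes.
print(np.linalg.norm(gd_on_quadratic(np.array([10.0, 3.0, -1.0]), 100, x0)))
```

Increasing the condition number $\frac{\lambda_1}{\lambda_d}$ visibly slows the decay, in line with the estimate $T=\frac{\lambda_1}{\lambda_d}\log \frac{||x_0 - x^*||_2}{\epsilon}$.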
Towards a global solution
We need stronger regularity conditions on $f$.
Convexity
- Second-order condition: \(\nabla^2f(x)\succeq 0\)
- First-order condition: predicting the function value with the tangent, the actual function always lies above this prediction.
\[<\nabla f(x), y-x> \le f(y) - f(x)\]
Local information tells a lot about global information: you can ignore a direction if the directional derivative along it is positive, because the function never decreases along that direction.
- Zero-order condition: the chord between two points always lies above the actual function.
\[f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda)f(y)\]
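All three conditions can be checked numerically on a concrete convex function; below is a small sketch (my own, not from the lecture) using $f(x)=||Ax-b||^2$ with random $A$, $b$ as placeholders.

```python
# Sanity-checking the zero-, first-, and second-order convexity conditions
# on the convex function f(x) = ||A x - b||^2.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)

f = lambda x: np.sum((A @ x - b) ** 2)
grad = lambda x: 2 * A.T @ (A @ x - b)
hess = 2 * A.T @ A                      # constant Hessian of a quadratic

x, y, lam = rng.standard_normal(3), rng.standard_normal(3), 0.3

# Zero-order: the chord lies above the function.
print(f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y))
# First-order: the tangent prediction lies below the function.
print(grad(x) @ (y - x) <= f(y) - f(x))
# Second-order: the Hessian is positive semi-definite.
print(np.all(np.linalg.eigvalsh(hess) >= -1e-10))
```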
Proof
Hint: think about the geometry. Going from a lower-order condition to a higher-order one requires taking derivatives; going the other way requires integration.
- 0 -> 1:
\[f(y+\lambda(x-y)) \le \lambda f(x) + (1-\lambda)f(y)\]
Subtract $f(y)$ from both sides and divide by $\lambda$:
\[\frac{f(y+\lambda(x-y)) - f(y)}{\lambda} \le f(x) - f(y)\]
Quick review: think of the limit when $\lambda \rightarrow 0$, and ignore the higher-order terms in the Taylor expansion.
Letting $\lambda \rightarrow 0$ (equivalently, taking the derivative at $\lambda=0$), the left-hand side becomes the directional derivative:
\[<\nabla f(y), x-y> \le f(x) - f(y)\]
- 1 -> 2:
(a) Start with the simple case where $d=1$.
\[f(y) \ge f(x) + f'(x) (y-x)\]
\[f(x) \ge f(y) + f'(y) (x-y)\]
From these two inequalities, we get:
\[f'(y)(x-y) \le f(x)-f(y) \le f'(x)(x-y)\]
Assuming $x-y>0$,
\[\frac{f'(x)-f'(y)}{x-y}\ge 0\]
Letting $y \rightarrow x$, this gives $f''(x) \ge 0$.
(b) For arbitrary dimension, we reduce to the 1-d case by restricting $f$ to a line:
\[g(\alpha) = f(x+\alpha v)\]
We can show $g$ is also convex by expanding the definition. So, applying the 1-d conclusion, we have:
\[g''(\alpha)\ge 0\]
\[v^T\nabla^2 f(x+\alpha v)\,v \ge 0\]
Taking $\alpha=0$ and noting that $v$ is arbitrary, this is exactly the definition of a positive semi-definite matrix: $\nabla^2 f(x)\succeq 0$.
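As a quick sanity check of step (b) (my own sketch, reusing the same convex $f$ as above), restrict $f$ to a line and compare a finite-difference estimate of $g''(0)$ with $v^T\nabla^2 f(x)\,v$:

```python
# Restrict f to a line, g(alpha) = f(x + alpha v), and check g''(0) >= 0.
import numpy as np

rng = np.random.default_rng(1)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
f = lambda z: np.sum((A @ z - b) ** 2)   # a convex quadratic
hess = 2 * A.T @ A

x, v = rng.standard_normal(3), rng.standard_normal(3)
g = lambda a: f(x + a * v)

eps = 1e-4
g2 = (g(eps) - 2 * g(0.0) + g(-eps)) / eps ** 2   # central difference
print(g2, v @ hess @ v)                           # should agree and be >= 0
```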
Strongly convex
Intuition: $x^2$ is the prototypical convex function. A function is strongly convex if it is “more convex” than the quadratic $\frac{m}{2}||x||^2$:
Definition of $m$-strong convexity: \(f(x) - \frac{m}{2} ||x||^2 \text{ is convex.}\)
- Second-order: \(\nabla^2f(x)\succeq m\cdot I\), equivalently \(\lambda_{\min}(\nabla^2f(x))\ge m\)
- First-order: \(f(y)-f(x)\ge <\nabla f(x), y-x> + \frac{m}{2} ||y-x||^2\)
- Zero-order: Skipped. (simply apply the definition)
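A minimal numerical check of the second-order condition (my own sketch, not from the lecture), using a ridge-style objective $f(x)=||Ax-b||^2+\frac{m}{2}||x||^2$ as an illustrative example:

```python
# Second-order check of m-strong convexity: lambda_min(Hessian) >= m.
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)
m = 0.5

# f(x) = ||A x - b||^2 + (m/2) ||x||^2  =>  Hessian = 2 A^T A + m I.
H = 2 * A.T @ A + m * np.eye(3)
print(np.linalg.eigvalsh(H).min() >= m)          # second-order condition

# Equivalently, f(x) - (m/2)||x||^2 has Hessian 2 A^T A, which is PSD,
# matching the definition that f(x) - (m/2)||x||^2 is convex.
print(np.linalg.eigvalsh(2 * A.T @ A).min() >= -1e-10)
```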
In the next class, we will show how to use these conditions to derive global convergence.