Recall Gradient Descent

$$ \text{Update Rule} \\ w_{i+1} = w_i - \alpha \nabla L(w_i) $$
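As a minimal sketch of this update, assuming a toy quadratic loss $L(w) = \|w\|^2$ and using only numpy (the function and parameter names here are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha=0.1, steps=100):
    """Plain gradient descent: w_{i+1} = w_i - alpha * grad L(w_i)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad_fn(w)  # step against the gradient
    return w

# Example: minimize L(w) = ||w||^2, whose gradient is 2w.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0], alpha=0.1)
print(w_star)  # approaches [0, 0]
```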

Pitfalls of naive GD

Improvements

Newton’s Method → will lead us to Momentum


Momentum

Main idea: if successive gradients point in different directions, averaging them cancels out the oscillation and points us in a more consistent direction

$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha g_i \\ \text{where } g_i = \nabla_w L(w_i) + \mu g_{i-1} \quad \text{(blend in the previous direction)} $$
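A minimal sketch of this update on the same toy quadratic loss, using only numpy (`grad_fn`, `alpha`, and `mu` are illustrative names, not from the notes):

```python
import numpy as np

def momentum_descent(grad_fn, w0, alpha=0.1, mu=0.9, steps=100):
    """Gradient descent with momentum: g_i = grad L(w_i) + mu * g_{i-1}."""
    w = np.asarray(w0, dtype=float)
    g = np.zeros_like(w)            # g_{-1} = 0: no previous direction yet
    for _ in range(steps):
        g = grad_fn(w) + mu * g     # blend in the previous direction
        w = w - alpha * g           # step along the blended direction
    return w

w_star = momentum_descent(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)  # approaches [0, 0]
```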

RMSProp

Main idea: the sign of the gradient tells us which way to go, but its magnitude is less useful; the overall magnitude can also change over time, which makes the learning rate hard to tune → normalize out the magnitude along each dimension

$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha \frac{\nabla_w L(w_i)}{\sqrt{s_i}} \\ \text{where } s_i = \beta s_{i-1} + (1-\beta) \left(\nabla_w L(w_i)\right)^2 \\ \text{i.e. a running average of the squared gradient in each dimension} $$
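A minimal sketch of this update, again on a toy quadratic loss with numpy only; the small `eps` added inside the square root is a common numerical-stability term and is not part of the formula above:

```python
import numpy as np

def rmsprop(grad_fn, w0, alpha=0.01, beta=0.9, eps=1e-8, steps=500):
    """RMSProp: divide each gradient component by the root of its running
    average of squared magnitudes, so every dimension moves at a similar scale."""
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)                        # running average of squared gradients
    for _ in range(steps):
        grad = grad_fn(w)
        s = beta * s + (1 - beta) * grad ** 2   # per-dimension squared magnitude
        w = w - alpha * grad / (np.sqrt(s) + eps)
    return w

w_star = rmsprop(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)  # settles near [0, 0]
```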

AdaGrad

Main idea: estimate the cumulative magnitude per dimension; instead of keeping a running average, just sum the squared gradients together over time
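A minimal sketch of that change relative to the RMSProp sketch above (numpy only; only the accumulation line differs, the `eps` term is again an assumed stability constant):

```python
import numpy as np

def adagrad(grad_fn, w0, alpha=0.1, eps=1e-8, steps=500):
    """AdaGrad: accumulate the sum of squared gradients per dimension
    (no decay), then scale each step by 1 / sqrt(that cumulative sum)."""
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)              # cumulative sum of squared gradients
    for _ in range(steps):
        grad = grad_fn(w)
        s = s + grad ** 2             # a sum, not a running average
        w = w - alpha * grad / (np.sqrt(s) + eps)
    return w

w_star = adagrad(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)  # settles near [0, 0]
```

Because the sum only grows, the effective step size shrinks over time; this is the design choice that separates AdaGrad from RMSProp's decayed average.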