Recall Gradient Descent

$$ \text{Update Rule} \\ w_{i+1} = w_i - \alpha \nabla L(w_i) $$
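As a minimal sketch of this update, assuming a toy quadratic loss $L(w) = \|w\|^2$ and using only numpy (the function and parameter names here are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha=0.1, steps=100):
    """Plain gradient descent: w_{i+1} = w_i - alpha * grad L(w_i)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad_fn(w)  # step against the gradient
    return w

# Example: minimize L(w) = ||w||^2, whose gradient is 2w.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0], alpha=0.1)
print(w_star)  # approaches [0, 0]
```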

Pitfalls of naive GD

Improvements

Newton’s Method → will lead us to Momentum


Momentum

Main idea: if successive gradients point in different directions, averaging them cancels out the oscillation and points us in a more consistent direction

$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha g_i \\ \text{where } g_i = \nabla_w L(w_i) + \mu g_{i-1} \quad \text{(blend in the previous direction)} $$
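A minimal sketch of this update on the same toy quadratic loss, using only numpy (`grad_fn`, `alpha`, and `mu` are illustrative names, not from the notes):

```python
import numpy as np

def momentum_descent(grad_fn, w0, alpha=0.1, mu=0.9, steps=100):
    """Gradient descent with momentum: g_i = grad L(w_i) + mu * g_{i-1}."""
    w = np.asarray(w0, dtype=float)
    g = np.zeros_like(w)            # g_{-1} = 0: no previous direction yet
    for _ in range(steps):
        g = grad_fn(w) + mu * g     # blend in the previous direction
        w = w - alpha * g           # step along the blended direction
    return w

w_star = momentum_descent(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)  # approaches [0, 0]
```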

RMSProp

Main idea: the sign of the gradient tells us which way to go, but its magnitude is less useful; the overall magnitude can also change over time, which makes the learning rate hard to tune → normalize out the magnitude along each dimension

$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha \frac{\nabla_w L(w_i)}{\sqrt{s_i}} \\ \text{where } s_i = \beta s_{i-1} + (1-\beta) \left(\nabla_w L(w_i)\right)^2 \\ \text{i.e. a running average of the squared gradient in each dimension} $$
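A minimal sketch of this update, again on a toy quadratic loss with numpy only; the small `eps` added inside the square root is a common numerical-stability term and is not part of the formula above:

```python
import numpy as np

def rmsprop(grad_fn, w0, alpha=0.01, beta=0.9, eps=1e-8, steps=500):
    """RMSProp: divide each gradient component by the root of its running
    average of squared magnitudes, so every dimension moves at a similar scale."""
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)                        # running average of squared gradients
    for _ in range(steps):
        grad = grad_fn(w)
        s = beta * s + (1 - beta) * grad ** 2   # per-dimension squared magnitude
        w = w - alpha * grad / (np.sqrt(s) + eps)
    return w

w_star = rmsprop(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)  # settles near [0, 0]
```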

AdaGrad

Main idea: estimate the cumulative magnitude per dimension; instead of keeping a running average, just sum the squared gradients together over time
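A minimal sketch of that change relative to the RMSProp sketch above (numpy only; only the accumulation line differs, the `eps` term is again an assumed stability constant):

```python
import numpy as np

def adagrad(grad_fn, w0, alpha=0.1, eps=1e-8, steps=500):
    """AdaGrad: accumulate the sum of squared gradients per dimension
    (no decay), then scale each step by 1 / sqrt(that cumulative sum)."""
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)              # cumulative sum of squared gradients
    for _ in range(steps):
        grad = grad_fn(w)
        s = s + grad ** 2             # a sum, not a running average
        w = w - alpha * grad / (np.sqrt(s) + eps)
    return w

w_star = adagrad(lambda w: 2 * w, w0=[3.0, -2.0])
print(w_star)  # settles near [0, 0]
```

Because the sum only grows, the effective step size shrinks over time; this is the design choice that separates AdaGrad from RMSProp's decayed average.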