$$ \text{Update Rule} \\ w_{i+1} = w_i - \alpha \nabla_w L(w_i) $$
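As a concrete illustration, a minimal sketch of this update rule in plain NumPy. The loss is a hypothetical quadratic $L(w) = \|w\|^2 / 2$ (so its gradient is just $w$); `grad_L`, `alpha`, and the starting point are stand-ins, not from the notes:

```python
import numpy as np

# Hypothetical stand-in for the real gradient: L(w) = ||w||^2 / 2, so grad L(w) = w.
def grad_L(w):
    return w

alpha = 0.1                      # learning rate (arbitrary choice)
w = np.array([1.0, -2.0])        # arbitrary starting point

for i in range(100):
    w = w - alpha * grad_L(w)    # w_{i+1} = w_i - alpha * grad_w L(w_i)

print(w)                         # approaches the minimum at the origin
```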
Newton’s Method → will lead us to Momentum
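For reference, the Newton update the note points at (standard form, not written out in this section) rescales each step by the inverse Hessian:

$$ \text{Newton Update} \\ w_{i+1} = w_i - H_i^{-1} \nabla_w L(w_i) \\ \text{where } H_i = \nabla_w^2 L(w_i) \text{, the Hessian} $$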
Main idea: when successive gradients point in opposing directions, those components should cancel out; averaging gradients over time keeps the direction they consistently agree on
$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha g_i \\ \text{where } g_i = \nabla_w L(w_i) + \mu g_{i-1} \quad \text{(blend in the previous direction)} $$
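A minimal sketch of the momentum update under the same hypothetical quadratic loss as above; `mu = 0.9` is an arbitrary but common choice, and `g` starts at zero since there is no previous direction yet:

```python
import numpy as np

# Same hypothetical stand-in gradient: grad of L(w) = ||w||^2 / 2.
def grad_L(w):
    return w

alpha, mu = 0.1, 0.9             # learning rate and momentum coefficient (arbitrary)
w = np.array([1.0, -2.0])
g = np.zeros_like(w)             # g_{-1} = 0: no previous direction yet

for i in range(100):
    g = grad_L(w) + mu * g       # g_i = grad_w L(w_i) + mu * g_{i-1}
    w = w - alpha * g            # w_{i+1} = w_i - alpha * g_i

print(w)
```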
Main Idea: the sign of each gradient component tells us which way to go, but its magnitude is not a reliable step size; the overall magnitude can change over time, making learning rates hard to tune → normalize out the magnitude along each dimension
$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha \frac{\nabla_w L(w_i)}{\sqrt{s_i}} \\ \text{where } s_i = \beta s_{i-1} + (1-\beta) (\nabla_w L(w_i))^2 \\ \text{aka a running average of the squared gradient along each dimension} $$
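A minimal sketch of this rule, this time with a hypothetical badly scaled gradient so the per-dimension normalization has something to fix; the small `eps` in the denominator is a standard numerical-stability guard that the formula above omits:

```python
import numpy as np

# Hypothetical badly scaled gradient: one dimension far steeper than the other.
def grad_L(w):
    return w * np.array([1.0, 100.0])

alpha, beta = 0.01, 0.9          # arbitrary learning rate and decay factor
eps = 1e-8                       # stability guard, not part of the formula above
w = np.array([1.0, -2.0])
s = np.zeros_like(w)

for i in range(500):
    g = grad_L(w)
    s = beta * s + (1 - beta) * g**2        # s_i = beta * s_{i-1} + (1-beta) * g^2
    w = w - alpha * g / (np.sqrt(s) + eps)  # per-dimension normalized step

print(w)
```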
Main Idea: estimate the cumulative magnitude per dimension; instead of keeping a running average, just sum the squared gradients together
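The notes stop short of writing this rule out; the standard AdaGrad form implied by the description keeps the normalized step from above but replaces the running average with a plain sum:

$$ \text{New Update Rule} \\ w_{i+1} = w_i - \alpha \frac{\nabla_w L(w_i)}{\sqrt{s_i}} \\ \text{where } s_i = s_{i-1} + (\nabla_w L(w_i))^2 $$

Since $s_i$ only ever grows, the effective per-dimension learning rate shrinks monotonically over training. A minimal sketch under the same hypothetical quadratic gradient as before:

```python
import numpy as np

# Same hypothetical stand-in gradient as the earlier sketches.
def grad_L(w):
    return w

alpha, eps = 0.5, 1e-8           # arbitrary learning rate; eps for stability
w = np.array([1.0, -2.0])
s = np.zeros_like(w)

for i in range(200):
    g = grad_L(w)
    s = s + g**2                             # cumulative sum of squares, no decay
    w = w - alpha * g / (np.sqrt(s) + eps)   # step shrinks as s grows

print(w)
```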