L2: makes each weight smaller in proportion to its size, so big weights shrink more
L1: makes weights smaller at a constant rate, the same amount for every weight
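A toy comparison of the two (not from the notes; the weights, lr, and lam are made-up values):

```python
# One decay step under an L2 vs. an L1 penalty.
import numpy as np

w = np.array([0.01, 0.5, 5.0])   # small, medium, and large weight
lr, lam = 0.1, 0.5

# L2 penalty lam/2 * w^2 -> gradient lam * w: shrinkage proportional to size
w_l2 = w - lr * lam * w
# L1 penalty lam * |w| -> gradient lam * sign(w): same shrinkage for every weight
w_l1 = w - lr * lam * np.sign(w)

print(w_l2)  # [0.0095 0.475 4.75]: the big weight lost the most
print(w_l1)  # [-0.04 0.45 4.95]: each lost 0.05 (in practice the small
             # weight would be clipped at zero rather than flip sign)
```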
Rumelhart’s idea (weight elimination): use a penalty that drives small weights to zero while leaving large weights almost untouched
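The notes don't spell out the penalty; assuming it is the classic weight-elimination term w^2 / (w0^2 + w^2) from Weigend, Rumelhart & Huberman, a sketch of its gradient:

```python
# Weight-elimination penalty: acts like L2 for |w| << w0, saturates for |w| >> w0.
import numpy as np

def weight_elim_grad(w, w0=1.0, lam=0.5):
    """Gradient of lam * w^2/(w0^2 + w^2) with respect to w."""
    return lam * 2 * w * w0**2 / (w0**2 + w**2)**2

w = np.array([0.1, 1.0, 10.0])
print(weight_elim_grad(w))  # ~0.098, 0.25, ~0.001: small weights are decayed
                            # toward zero, large weights barely feel the penalty
```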
Improve Generalization
- add noise to the input/model/output (stochastic: drawn from a Gaussian distribution; see the sketch after this list)
- complexity minimization (e.g., the weight penalties above)
- dropout
- early stopping
- more data
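A minimal sketch of the first item, Gaussian noise on the inputs (sigma and the batch shape are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_batch(x, sigma=0.1):
    """Return the inputs corrupted with zero-mean Gaussian noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

x = np.ones((4, 3))        # toy batch: 4 examples, 3 features
x_train = noisy_batch(x)   # each epoch sees a slightly different version
```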
SGD
- shuffling each epoch exposes the network to a different ordering of the examples, so the sequence of weight updates varies between epochs
- shuffling has no impact on the weight changes unless it is done before the updates are computed (see the sketch below)
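A sketch of per-epoch shuffling in SGD; the linear model, squared-error loss, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, lr = np.zeros(3), 0.01

for epoch in range(5):
    order = rng.permutation(len(X))   # new order each epoch, BEFORE any update
    for i in order:                   # per-example updates: order matters here
        err = X[i] @ w - y[i]         # error for one example
        w -= lr * err * X[i]          # squared-error gradient step
```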
Mini-Batch
- shuffle the training set before splitting it into mini-batches; within a batch the gradient is averaged, so the order inside the batch does not matter (sketch below)
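A sketch of the shuffle-then-split pattern, with the same illustrative linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, lr, batch_size = np.zeros(3), 0.1, 10

perm = rng.permutation(len(X))             # shuffle BEFORE forming the batches
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / len(idx) # averaged over the batch:
    w -= lr * grad                         # reordering idx gives the same grad
```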
What's wrong with all-positive inputs?
- if all inputs are positive, the weighted sum feeding a unit's activation function can become very large, pushing the activation into saturation
- all weight updates are tied together: the gradient for each incoming weight is delta * x_i, and with every x_i positive all gradients share the sign of delta, so the weights move up or down in unison, zig-zagging toward the solution (see the sketch below)
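A toy check of the unison effect on a single linear unit (all numbers are made up):

```python
import numpy as np

x = np.array([0.2, 0.9, 0.4])   # all-positive inputs
w = np.array([0.5, -0.3, 0.1])
delta = (x @ w) - 1.0           # error signal at the unit (negative here)

grad = delta * x                # dE/dw_i = delta * x_i
print(np.sign(grad))            # [-1 -1 -1]: every gradient shares the sign
                                # of delta, so all weights move together
```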