L2: makes each weight smaller in proportion to its size, so big weights shrink more
L1: makes weights smaller at a constant rate, the same amount for every weight
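A toy comparison of the two (not from the notes; the weights, lr, and lam are made-up values):

```python
# One decay step under an L2 vs. an L1 penalty.
import numpy as np

w = np.array([0.01, 0.5, 5.0])   # small, medium, and large weight
lr, lam = 0.1, 0.5

# L2 penalty lam/2 * w^2 -> gradient lam * w: shrinkage proportional to size
w_l2 = w - lr * lam * w
# L1 penalty lam * |w| -> gradient lam * sign(w): same shrinkage for every weight
w_l1 = w - lr * lam * np.sign(w)

print(w_l2)  # [0.0095 0.475 4.75]: the big weight lost the most
print(w_l1)  # [-0.04 0.45 4.95]: each lost 0.05 (in practice the small
             # weight would be clipped at zero rather than flip sign)
```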
Rumelhart’s idea (weight elimination): use a penalty that drives small weights to zero while leaving large weights almost untouched
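The notes don't spell out the penalty; assuming it is the classic weight-elimination term w^2 / (w0^2 + w^2) from Weigend, Rumelhart & Huberman, a sketch of its gradient:

```python
# Weight-elimination penalty: acts like L2 for |w| << w0, saturates for |w| >> w0.
import numpy as np

def weight_elim_grad(w, w0=1.0, lam=0.5):
    """Gradient of lam * w^2/(w0^2 + w^2) with respect to w."""
    return lam * 2 * w * w0**2 / (w0**2 + w**2)**2

w = np.array([0.1, 1.0, 10.0])
print(weight_elim_grad(w))  # ~0.098, 0.25, ~0.001: small weights are decayed
                            # toward zero, large weights barely feel the penalty
```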
Improve Generalization
- add noise to the input/model/output (stochastic: drawn from a Gaussian distribution; see the sketch after this list)
- complexity minimization (e.g., the weight penalties above)
- dropout
- early stopping
- more data
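A minimal sketch of the first item, Gaussian noise on the inputs (sigma and the batch shape are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_batch(x, sigma=0.1):
    """Return the inputs corrupted with zero-mean Gaussian noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

x = np.ones((4, 3))        # toy batch: 4 examples, 3 features
x_train = noisy_batch(x)   # each epoch sees a slightly different version
```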
SGD
- shuffling each epoch exposes the network to a different ordering of the examples, so the sequence of weight updates varies between epochs
- shuffling has no impact on the weight changes unless it is done before the updates are computed (see the sketch below)
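A sketch of per-epoch shuffling in SGD; the linear model, squared-error loss, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, lr = np.zeros(3), 0.01

for epoch in range(5):
    order = rng.permutation(len(X))   # new order each epoch, BEFORE any update
    for i in order:                   # per-example updates: order matters here
        err = X[i] @ w - y[i]         # error for one example
        w -= lr * err * X[i]          # squared-error gradient step
```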
Mini-Batch
- shuffle the training set before splitting it into mini-batches; within a batch the gradient is averaged, so the order inside the batch does not matter (sketch below)
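A sketch of the shuffle-then-split pattern, with the same illustrative linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, lr, batch_size = np.zeros(3), 0.1, 10

perm = rng.permutation(len(X))             # shuffle BEFORE forming the batches
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / len(idx) # averaged over the batch:
    w -= lr * grad                         # reordering idx gives the same grad
```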
What's wrong with all-positive inputs?
- if all inputs are positive, the weighted sum feeding a unit's activation function can become very large, pushing the activation into saturation
- all weight updates are tied together: the gradient for each incoming weight is delta * x_i, and with every x_i positive all gradients share the sign of delta, so the weights move up or down in unison, zig-zagging toward the solution (see the sketch below)
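A toy check of the unison effect on a single linear unit (all numbers are made up):

```python
import numpy as np

x = np.array([0.2, 0.9, 0.4])   # all-positive inputs
w = np.array([0.5, -0.3, 0.1])
delta = (x @ w) - 1.0           # error signal at the unit (negative here)

grad = delta * x                # dE/dw_i = delta * x_i
print(np.sign(grad))            # [-1 -1 -1]: every gradient shares the sign
                                # of delta, so all weights move together
```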