Consistency vs. Simplicity
Fundamental tradeoff: bias vs. variance
By default, learning algorithms prefer consistency: they fit the training data as closely as they can (why?)
Several ways to operationalize “simplicity”
- Reduce the hypothesis space
- Assume more: e.g. independence assumptions, as in naïve Bayes (see the sketch after this list)
- Have fewer, better features / attributes: feature selection
- Other structural limitations (decision lists vs trees)
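A minimal sketch of the "assume more" idea, assuming binary features, two classes, and NumPy arrays; the function names are illustrative, not from the lecture. The independence assumption means we only estimate one number per feature per class instead of the full joint over all feature combinations, which is a much smaller (simpler) hypothesis space:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """X: (m, n) binary feature matrix; y: (m,) labels in {0, 1}."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            # One number per feature: P(f_j = 1 | y = c) -- the "assume more" part.
            "p_feat": Xc.mean(axis=0),
        }
    return params

def class_scores(params, x):
    """Score a new example x by summing per-feature log-likelihoods
    (only valid under the independence assumption)."""
    scores = {}
    for c, p in params.items():
        log_lik = np.sum(np.log(np.where(x == 1, p["p_feat"], 1.0 - p["p_feat"])))
        scores[c] = np.log(p["prior"]) + log_lik
    return scores
```

Note that a feature never observed "on" in some class gives log(0) here; that is exactly the small-count problem the smoothing bullet below addresses.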
Regularization
- Smoothing: be cautious with small counts (e.g., Laplace smoothing; sketched after this list)
- Many other generalization parameters (e.g., pruning cutoffs for decision trees)
- Hypothesis space stays big, but harder to get to the outskirts
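A minimal sketch of add-k (Laplace) smoothing for the per-feature estimates in the naïve Bayes sketch above; the function name is illustrative and k = 1 is an illustrative default, not a value from the lecture. The hypothesis space stays the same, but estimates near 0 or 1 (the "outskirts") now require a lot of evidence:

```python
import numpy as np

def smoothed_feature_probs(Xc, k=1.0):
    """Xc: (m_c, n) binary features for one class; returns estimates of P(f_j = 1 | y = c)."""
    counts_on = Xc.sum(axis=0)                # how often feature j was "on" in this class
    total = len(Xc)                           # number of examples in this class
    return (counts_on + k) / (total + 2 * k)  # pretend each of the 2 outcomes was seen k extra times
```

Here k acts as a generalization parameter: larger k pulls every estimate toward 1/2, smaller k trusts the raw counts more.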
Other Optimizers
- Second-order methods: use curvature (second-derivative) information, as in Newton's method
- Momentum: keep a decaying running average of past gradients (see the sketch below)
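A minimal sketch of the classic momentum update (not a particular library's API); the learning rate and beta values are illustrative. The velocity accumulates past gradients, which damps oscillations and speeds up progress along directions where gradients consistently agree:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update; lr and beta are illustrative hyperparameters."""
    v = beta * v - lr * grad   # velocity: decayed sum of past (scaled) gradients
    return w + v, v            # move the parameters along the velocity

# usage: start with v = np.zeros_like(w), then once per minibatch gradient:
# w, v = momentum_step(w, v, grad)
```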

Adaptive Learning Rates
Key idea: a separate learning rate for each parameter
- Make larger or smaller updates depending on how frequently a feature occurs
- Small updates for frequent features; large updates for rare features (as in AdaGrad; see the sketch below)
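A minimal AdaGrad-style sketch of the per-parameter learning-rate idea; the function name and default values are illustrative. A parameter tied to a frequent feature accumulates a large squared-gradient history and therefore takes small steps, while a rare feature's parameter keeps a large effective learning rate:

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.1, eps=1e-8):
    """One update with per-parameter step sizes; lr and eps are illustrative values."""
    hist = hist + grad ** 2                    # per-parameter sum of squared gradients
    w = w - lr * grad / (np.sqrt(hist) + eps)  # large history => small step, and vice versa
    return w, hist

# usage: start with hist = np.zeros_like(w), then once per minibatch gradient:
# w, hist = adagrad_step(w, grad, hist)
```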