Early Stopping
Dropout
L2 regularization
Random Weight initialization
If all weights in a network are initialized to the same value (e.g., 0), every neuron in a layer computes the same output and receives the same gradient update, so the neurons learn identical features and the layer effectively behaves like a single neuron. Random initialization breaks this symmetry so that each neuron can learn distinct features.
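To make the symmetry problem concrete, here is a minimal NumPy sketch (the layer sizes, constant value 0.5, and squared-error loss are illustrative assumptions, not from the text): with every weight set to the same constant, both hidden neurons produce the same activation and receive the same gradient, so they remain clones of each other.

```python
import numpy as np

# Tiny 3 -> 2 -> 1 network with all weights set to the same constant.
rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input example with 3 features
y = 1.0                           # scalar regression target

W1 = np.full((2, 3), 0.5)         # hidden layer: identical weights everywhere
W2 = np.full((1, 2), 0.5)         # output layer: identical weights everywhere

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass
h = sigmoid(W1 @ x)               # both entries of h are identical
y_hat = (W2 @ h)[0]

# Backward pass for squared-error loss L = 0.5 * (y_hat - y)**2
d_yhat = y_hat - y
d_h = d_yhat * W2[0]                                  # gradient reaching each hidden neuron
dW1 = (d_h * h * (1.0 - h))[:, None] * x[None, :]     # gradient w.r.t. W1

print(dW1)  # both rows are identical, so both neurons update identically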
Uniform or Normal Distribution: Weights are drawn randomly from a uniform or normal distribution (e.g., $\mathcal{U}(-0.1, 0.1)$ or $\mathcal{N}(0, 0.01)$); a sketch of both draws appears after the Xavier formula below.
Xavier Initialization (Glorot): Scales the random draw by the layer's fan-in and fan-out so that the variance of activations (and gradients) stays roughly constant across layers.
$$ W \sim \mathcal{U}\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right) $$

where $n_{in}$ and $n_{out}$ are the number of inputs and outputs (fan-in and fan-out) of the layer.
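As a concrete illustration of both schemes above, here is a short NumPy sketch. The function names, default standard deviation, and the 784/256 layer sizes are illustrative assumptions; the Xavier limit follows the formula above.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Draw weights from U(-limit, limit) with limit = sqrt(6) / sqrt(n_in + n_out)."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def scaled_normal(n_in, n_out, std=0.01, rng=None):
    """Simple alternative: draw weights from a zero-mean normal distribution."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, std, size=(n_out, n_in))

# Example: a layer mapping 784 inputs to 256 outputs (sizes chosen for illustration).
W = xavier_uniform(784, 256)
print(W.shape, W.min(), W.max())  # all entries lie within +/- sqrt(6 / 1040)
```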
Other Optimizers