In large networks, the gradient can blow up, especially when different features are measured in different units.
This can slow convergence to the global minimum, even with a great architecture and a correct implementation of your network
Main Idea: we can normalize / standardize inputs to activation functions to have a mean of 0 and standard deviation of 1
Main Idea: We want the overall scale of activations in the network not to be too big or too small for our initial (randomized) weights, so that the gradients propagate well
Basic initialization methods: ensure that activations are on a reasonable scale, and that this scale doesn't grow or shrink from layer to layer as the network gets deeper (see the sketch below)
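A minimal NumPy sketch of both points: standardize the inputs, then watch the activation scale collapse when the weight scale is chosen badly. The layer width, the tanh nonlinearity, and the deliberately small weight scale (0.01) are illustrative assumptions, not prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 samples, 50 features measured in very different units.
X = rng.normal(loc=5.0, scale=[0.01, 100.0] * 25, size=(1000, 50))

# Standardize each feature to mean 0 and standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Push the standardized inputs through a few layers with a poorly chosen
# weight scale and watch the activations shrink layer by layer.
h = X
for layer in range(5):
    W = rng.normal(0.0, 0.01, size=(h.shape[1], 50))  # scale chosen too small
    h = np.tanh(h @ W)
    print(f"layer {layer}: activation std = {h.std():.2e}")
```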
If all weights in a network are initialized to the same value (e.g., 0), the neurons in each layer compute identical outputs and receive identical gradients, so they learn identical features, preventing the model from converging effectively. Random initialization breaks this symmetry so that each neuron can learn distinct features.
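A small sketch of the symmetry problem, using an arbitrary toy layer: with a constant weight matrix every hidden unit produces the same column of activations, while a random one gives each unit its own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # toy batch: 8 samples, 3 features

# Constant initialization: every hidden unit computes exactly the same output
# (and would receive exactly the same gradient), so they never differentiate.
W_const = np.full((3, 4), 0.5)
print(np.unique(np.tanh(X @ W_const), axis=1).shape)  # only 1 distinct column

# Random initialization breaks the symmetry: each hidden unit starts different.
W_rand = rng.normal(0.0, 0.01, size=(3, 4))
print(np.unique(np.tanh(X @ W_rand), axis=1).shape)   # 4 distinct columns
```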
Uniform or Normal Distribution: Weights are drawn randomly from a uniform or normal distribution (e.g., $\mathcal{U}(-0.1, 0.1)$ or $\mathcal{N}(0, 0.01)$).
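A minimal sketch of these two naive draws; the layer shape is arbitrary, and $\mathcal{N}(0, 0.01)$ is read here as standard deviation 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out = 256, 128                                   # arbitrary layer shape

W_uniform = rng.uniform(-0.1, 0.1, size=(D_in, D_out))   # U(-0.1, 0.1)
W_normal = rng.normal(0.0, 0.01, size=(D_in, D_out))     # N(0, 0.01), std = 0.01
```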
Scale the weights so that the standard deviation of each entry $W_{ij}$ of the weight matrix is $1/\sqrt{D_a}$, where $D_a$ is the number of inputs to the layer (its fan-in)
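Assuming $D_a$ is the fan-in, a short sketch showing why this scaling keeps the pre-activation scale roughly constant from layer to layer:

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.normal(size=(1000, 50))   # standardized inputs: mean 0, std 1
for layer in range(5):
    D_a = h.shape[1]              # fan-in of this layer
    W = rng.normal(0.0, 1.0 / np.sqrt(D_a), size=(D_a, 50))
    h = h @ W                     # pre-activations only, for clarity
    print(f"layer {layer}: pre-activation std = {h.std():.3f}")

# Each entry of h @ W sums D_a terms of variance 1/D_a, so its variance
# stays close to 1 instead of exploding or collapsing with depth.
```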
Xavier Initialization: Ensures that the variance of activations is consistent across layers.
$$ W_{ij} \sim \mathcal{U}\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right)\\
\text{where $n_{in}$ and $n_{out}$ are the number of inputs and outputs for the layer.} $$
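A sketch of that formula as code; the helper name `xavier_uniform` and the 784 → 256 layer size are illustrative choices.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Draw an (n_in, n_out) weight matrix from
    U(-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out))."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Example: initialize weights for a 784 -> 256 layer and check the scale.
W = xavier_uniform(784, 256, rng=np.random.default_rng(0))
print(W.std(), np.sqrt(2.0 / (784 + 256)))  # empirical std vs. target std
```

PyTorch provides the same scheme as `torch.nn.init.xavier_uniform_`.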
ReLU: with zero-mean pre-activations, about half of the units are inactive (“dead”) on average at initialization, so initialize the bias to a small non-zero value (slightly positive) close to zero
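A minimal sketch of that bias choice; the 0.01 value and the layer shape are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out = 256, 128                     # arbitrary layer shape

W = rng.normal(0.0, 1.0 / np.sqrt(D_in), size=(D_in, D_out))
h = rng.normal(size=(1000, D_in))          # standardized inputs

def relu(z):
    return np.maximum(z, 0.0)

# With a zero bias, roughly half of the ReLU outputs are zero at initialization;
# a small positive bias nudges that fraction down slightly.
print("zero activations, b = 0:   ", (relu(h @ W) == 0).mean())
print("zero activations, b = 0.01:", (relu(h @ W + 0.01) == 0).mean())
```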