In large networks, the gradient can blow up, especially when different features are measured in different units.
This can slow convergence to the global minimum, even with a great architecture and a correct implementation of your network
Main Idea: we can normalize / standardize inputs to activation functions to have a mean of 0 and standard deviation of 1
Main Idea: We want the overall scale of activations in the network not to be too big or too small for our initial (randomized) weights, so that the gradients propagate well
Basic initialization methods: ensure that activations are on a reasonable scale, and that this scale doesn't grow or shrink from layer to layer as the network gets deeper (see the sketch below)
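A minimal NumPy sketch of both points: standardize the inputs, then watch the activation scale collapse when the weight scale is chosen badly. The layer width, the tanh nonlinearity, and the deliberately small weight scale (0.01) are illustrative assumptions, not prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 samples, 50 features measured in very different units.
X = rng.normal(loc=5.0, scale=[0.01, 100.0] * 25, size=(1000, 50))

# Standardize each feature to mean 0 and standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Push the standardized inputs through a few layers with a poorly chosen
# weight scale and watch the activations shrink layer by layer.
h = X
for layer in range(5):
    W = rng.normal(0.0, 0.01, size=(h.shape[1], 50))  # scale chosen too small
    h = np.tanh(h @ W)
    print(f"layer {layer}: activation std = {h.std():.2e}")
```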
If all weights in a network are initialized to the same value (e.g., 0), the neurons in each layer compute identical outputs and receive identical gradients, so they learn identical features, preventing the model from converging effectively. Random initialization breaks this symmetry so that each neuron can learn distinct features.
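A small sketch of the symmetry problem, using an arbitrary toy layer: with a constant weight matrix every hidden unit produces the same column of activations, while a random one gives each unit its own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # toy batch: 8 samples, 3 features

# Constant initialization: every hidden unit computes exactly the same output
# (and would receive exactly the same gradient), so they never differentiate.
W_const = np.full((3, 4), 0.5)
print(np.unique(np.tanh(X @ W_const), axis=1).shape)  # only 1 distinct column

# Random initialization breaks the symmetry: each hidden unit starts different.
W_rand = rng.normal(0.0, 0.01, size=(3, 4))
print(np.unique(np.tanh(X @ W_rand), axis=1).shape)   # 4 distinct columns
```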
Uniform or Normal Distribution: Weights are drawn randomly from a uniform or normal distribution (e.g., $\mathcal{U}(-0.1, 0.1)$ or $\mathcal{N}(0, 0.01)$).
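A minimal sketch of these two naive draws; the layer shape is arbitrary, and $\mathcal{N}(0, 0.01)$ is read here as standard deviation 0.01.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out = 256, 128                                   # arbitrary layer shape

W_uniform = rng.uniform(-0.1, 0.1, size=(D_in, D_out))   # U(-0.1, 0.1)
W_normal = rng.normal(0.0, 0.01, size=(D_in, D_out))     # N(0, 0.01), std = 0.01
```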
Scale the weights so that the standard deviation of each entry $W_{ij}$ of the weight matrix is $1/\sqrt{D_a}$, where $D_a$ is the number of inputs to the layer (its fan-in)
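Assuming $D_a$ is the fan-in, a short sketch showing why this scaling keeps the pre-activation scale roughly constant from layer to layer:

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.normal(size=(1000, 50))   # standardized inputs: mean 0, std 1
for layer in range(5):
    D_a = h.shape[1]              # fan-in of this layer
    W = rng.normal(0.0, 1.0 / np.sqrt(D_a), size=(D_a, 50))
    h = h @ W                     # pre-activations only, for clarity
    print(f"layer {layer}: pre-activation std = {h.std():.3f}")

# Each entry of h @ W sums D_a terms of variance 1/D_a, so its variance
# stays close to 1 instead of exploding or collapsing with depth.
```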
Xavier Initialization: Ensures that the variance of activations is consistent across layers.
$$ W_{ij} \sim \mathcal{U}\left(-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right)\\
\text{where $n_{in}$ and $n_{out}$ are the number of inputs and outputs for the layer.} $$
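A sketch of that formula as code; the helper name `xavier_uniform` and the 784 → 256 layer size are illustrative choices.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Draw an (n_in, n_out) weight matrix from
    U(-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out))."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Example: initialize weights for a 784 -> 256 layer and check the scale.
W = xavier_uniform(784, 256, rng=np.random.default_rng(0))
print(W.std(), np.sqrt(2.0 / (784 + 256)))  # empirical std vs. target std
```

PyTorch provides the same scheme as `torch.nn.init.xavier_uniform_`.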
ReLU: with zero-mean pre-activations, about half of the units are inactive (“dead”) on average at initialization, so initialize the bias to a small non-zero value (slightly positive) close to zero
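A minimal sketch of that bias choice; the 0.01 value and the layer shape are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out = 256, 128                     # arbitrary layer shape

W = rng.normal(0.0, 1.0 / np.sqrt(D_in), size=(D_in, D_out))
h = rng.normal(size=(1000, D_in))          # standardized inputs

def relu(z):
    return np.maximum(z, 0.0)

# With a zero bias, roughly half of the ReLU outputs are zero at initialization;
# a small positive bias nudges that fraction down slightly.
print("zero activations, b = 0:   ", (relu(h @ W) == 0).mean())
print("zero activations, b = 0.01:", (relu(h @ W + 0.01) == 0).mean())
```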