
In large networks, the gradient can blow up, especially when input features are measured on very different scales (different units).


This can slow convergence to the global minimum, even with a well-designed and well-implemented network.

Batch Normalization

Main Idea: we can normalize / standardize the inputs to activation functions so that they have a mean of 0 and a standard deviation of 1 (computed per mini-batch).
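
A minimal NumPy sketch of this idea (the function name, the gamma/beta scale-and-shift parameters, and the eps value are illustrative, not from the notes; in real batch norm, gamma and beta are learned):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize a mini-batch of pre-activations to mean 0 / std 1 per feature,
    then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardized activations
    return gamma * x_hat + beta               # lets the network rescale if useful

# Example: 4 samples with 3 features on very different scales
x = np.array([[1.0, 200.0, 0.001],
              [2.0, 180.0, 0.002],
              [3.0, 220.0, 0.003],
              [4.0, 210.0, 0.004]])
print(batch_norm(x).mean(axis=0))  # ~0 for every feature
print(batch_norm(x).std(axis=0))   # ~1 for every feature
```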

Weight Initialization

Main Idea: with our initial (random) weights, we want the overall scale of activations in the network to be neither too big nor too small, so that gradients propagate well.
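
The notes don't name a specific scheme; two standard ones that implement this idea are He (Kaiming) and Xavier (Glorot) initialization. A NumPy sketch (function names and layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """He initialization: variance 2/fan_in, commonly used with ReLU
    so activation scale stays roughly constant across layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out):
    """Xavier initialization: variance 2/(fan_in + fan_out),
    a common choice for tanh/sigmoid activations."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

W1 = he_init(784, 256)   # e.g. first layer of an MNIST-sized MLP
print(W1.std())          # ~sqrt(2/784) ≈ 0.05
```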

Gradient Clipping
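
The notes end here without elaborating; a common form of this technique rescales the gradient whenever its global L2 norm exceeds a threshold, so one bad step can't blow up training. A NumPy sketch (the max_norm value is an illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at
    most max_norm; leave them unchanged if they are already small enough."""
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]     # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum((g ** 2).sum() for g in clipped)))  # ≈ 5.0
```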