Activation Functions
- must be non-linear (otherwise stacked linear layers collapse into a single linear map, as in the sketch below)
- with enough layers / nodes we can approximate any function
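A minimal sketch (assuming NumPy; the shapes are arbitrary) of why non-linearity matters: two linear layers with no activation in between are equivalent to one linear map, while adding a non-linearity breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Two stacked linear layers with no activation: still one linear map (W2 @ W1).
two_linear = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(two_linear, collapsed))  # True

# With a non-linearity (ReLU here) between the layers, no single matrix reproduces the map.
with_relu = W2 @ np.maximum(0, W1 @ x)
```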
Chain Rule
Main Idea: Use the chain rule to compute the gradient of the loss with respect to each weight in the network

- walk backwards starting from the loss L(z), computing its partial derivative w/ respect to whichever weight (the weight paired with a given feature) you are optimizing; see the sketch below
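A hedged sketch of walking backwards with the chain rule for a single weight. The specific neuron, sigmoid activation, and squared loss are illustrative assumptions, not from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.5, 1.0   # one input feature and its target
w, b = 0.3, -0.1  # the weight/bias pair being optimized

# Forward pass: z -> a -> L
z = w * x + b
a = sigmoid(z)
L = (a - y) ** 2

# Walk backwards from L, multiplying local derivatives (chain rule).
dL_da = 2 * (a - y)
da_dz = a * (1 - a)
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Sanity check against a finite-difference estimate.
eps = 1e-6
L_eps = (sigmoid((w + eps) * x + b) - y) ** 2
print(dL_dw, (L_eps - L) / eps)  # should be nearly equal
```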
Backpropagation
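A minimal backpropagation sketch for a one-hidden-layer network. The architecture, ReLU hidden layer, and squared-error loss are assumptions chosen for illustration; the point is that each layer's local derivative is computed once and reused while walking backwards.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)   # input
y = rng.normal(size=2)   # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass
z1 = W1 @ x + b1
h = np.maximum(0, z1)            # ReLU activation
z2 = W2 @ h + b2                 # linear output
L = 0.5 * np.sum((z2 - y) ** 2)  # squared-error loss

# Backward pass: chain rule, layer by layer
dL_dz2 = z2 - y
dL_dW2 = np.outer(dL_dz2, h)
dL_db2 = dL_dz2
dL_dh = W2.T @ dL_dz2
dL_dz1 = dL_dh * (z1 > 0)        # derivative of ReLU
dL_dW1 = np.outer(dL_dz1, x)
dL_db1 = dL_dz1
```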

In Practice
We should consider how wide each layer is, how many layers there are, and which activation functions are used
- in addition to weight vectors, we add a bias term to each layer so the result after the activation function is not forced to a fixed value (e.g. 0) whenever the weighted input sums to 0; see the linear layer example below
Example: Linear Layer
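A sketch of a linear (fully connected) layer with a bias term; the dimensions and initialization scale are arbitrary assumptions. The bias lets the output move away from zero even when the input is all zeros.

```python
import numpy as np

class Linear:
    """Linear (fully connected) layer: y = W @ x + b."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))
        self.b = rng.normal(scale=0.1, size=out_dim)  # bias (often initialized to 0 and learned)

    def forward(self, x):
        return self.W @ x + self.b

layer = Linear(3, 2)
print(layer.forward(np.zeros(3)))  # equals b: not forced to zero for an all-zero input
```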

Example: Sigmoid
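A sketch of the sigmoid activation and its derivative; the derivative form s * (1 - s) is what the backward pass reuses.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))       # [~0.12, 0.5, ~0.88]
print(sigmoid_grad(z))  # largest at z = 0, vanishes for large |z|
```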

1/16 - Tricks of the Trade
1/18 - Optimization and Training