Activation Functions
- must be non-linear (otherwise stacked linear layers collapse into a single linear map, as in the sketch below)
- with enough layers / nodes we can approximate any function
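A minimal sketch (assuming NumPy; the shapes are arbitrary) of why non-linearity matters: two linear layers with no activation in between are equivalent to one linear map, while adding a non-linearity breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

# Two stacked linear layers with no activation: still one linear map (W2 @ W1).
two_linear = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(two_linear, collapsed))  # True

# With a non-linearity (ReLU here) between the layers, no single matrix reproduces the map.
with_relu = W2 @ np.maximum(0, W1 @ x)
```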
Chain Rule
Main Idea: Use the chain rule to compute the gradient of the loss with respect to each weight in the network

- walk backwards starting from the loss L(z), computing its partial derivative w/ respect to whichever weight (the weight paired with a given feature) you are optimizing; see the sketch below
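A hedged sketch of walking backwards with the chain rule for a single weight. The specific neuron, sigmoid activation, and squared loss are illustrative assumptions, not from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.5, 1.0   # one input feature and its target
w, b = 0.3, -0.1  # the weight/bias pair being optimized

# Forward pass: z -> a -> L
z = w * x + b
a = sigmoid(z)
L = (a - y) ** 2

# Walk backwards from L, multiplying local derivatives (chain rule).
dL_da = 2 * (a - y)
da_dz = a * (1 - a)
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

# Sanity check against a finite-difference estimate.
eps = 1e-6
L_eps = (sigmoid((w + eps) * x + b) - y) ** 2
print(dL_dw, (L_eps - L) / eps)  # should be nearly equal
```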
Backpropagation
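A minimal backpropagation sketch for a one-hidden-layer network. The architecture, ReLU hidden layer, and squared-error loss are assumptions chosen for illustration; the point is that each layer's local derivative is computed once and reused while walking backwards.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)   # input
y = rng.normal(size=2)   # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass
z1 = W1 @ x + b1
h = np.maximum(0, z1)            # ReLU activation
z2 = W2 @ h + b2                 # linear output
L = 0.5 * np.sum((z2 - y) ** 2)  # squared-error loss

# Backward pass: chain rule, layer by layer
dL_dz2 = z2 - y
dL_dW2 = np.outer(dL_dz2, h)
dL_db2 = dL_dz2
dL_dh = W2.T @ dL_dz2
dL_dz1 = dL_dh * (z1 > 0)        # derivative of ReLU
dL_dW1 = np.outer(dL_dz1, x)
dL_db1 = dL_dz1
```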

In Practice
We should consider how wide each layer is, how many layers there are, and which activation functions are used
- in addition to weight vectors, we add a bias term to each layer so the result after the activation function is not forced to a fixed value (e.g. 0) whenever the weighted input sums to 0; see the linear layer example below
Example: Linear Layer
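A sketch of a linear (fully connected) layer with a bias term; the dimensions and initialization scale are arbitrary assumptions. The bias lets the output move away from zero even when the input is all zeros.

```python
import numpy as np

class Linear:
    """Linear (fully connected) layer: y = W @ x + b."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))
        self.b = rng.normal(scale=0.1, size=out_dim)  # bias (often initialized to 0 and learned)

    def forward(self, x):
        return self.W @ x + self.b

layer = Linear(3, 2)
print(layer.forward(np.zeros(3)))  # equals b: not forced to zero for an all-zero input
```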

Example: Sigmoid
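A sketch of the sigmoid activation and its derivative; the derivative form s * (1 - s) is what the backward pass reuses.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))       # [~0.12, 0.5, ~0.88]
print(sigmoid_grad(z))  # largest at z = 0, vanishes for large |z|
```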

1/16 - Tricks of the Trade
1/18 - Optimization and Training