Derivation
- Least squares fits the model by minimizing the sum of the squared residuals, i.e. the squared vertical distances from the data points to the fitted hyperplane. Squaring is not the only option; we could instead minimize the sum of absolute residuals, but the squared loss is differentiable everywhere, which makes the minimization below straightforward.
- Let w be the (d+1)-dimensional weight vector (including a bias term)
- Let X be the n × (d+1) design matrix of training points
- Let y be the vector of ground-truth values
$$
\min_w ||Xw - y||^2
$$
Minimize by expanding the squared norm and setting the gradient with respect to w to zero:
$$
||Xw - y||^2 = (Xw - y)^T(Xw - y) = w^TX^TXw - 2y^TXw + y^Ty\\
\nabla_w ||Xw - y||^2 = 2X^TXw - 2X^Ty\\
0 = 2X^TXw - 2X^Ty
$$
We arrive at the normal equations; when X^TX is invertible, solving them gives the least squares solution
$$
X^TXw = X^Ty \quad\Rightarrow\quad w = (X^TX)^{-1}X^Ty
$$
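As a quick sanity check, here is a minimal NumPy sketch that builds a small synthetic design matrix (the data, true weights, and noise level are made up for illustration), solves the normal equations directly, and compares against `np.linalg.lstsq`, which avoids forming X^TX explicitly and is numerically safer.

```python
import numpy as np

# Minimal sketch: solving the normal equations X^T X w = X^T y on synthetic data.
rng = np.random.default_rng(0)
n, d = 100, 3

X_raw = rng.normal(size=(n, d))
X = np.hstack([X_raw, np.ones((n, 1))])     # append a bias column -> (n, d+1) design matrix
true_w = np.array([2.0, -1.0, 0.5, 3.0])    # hypothetical ground-truth weights
y = X @ true_w + 0.1 * rng.normal(size=n)   # noisy targets

# Solve X^T X w = X^T y directly (fine when X^T X is well conditioned)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer alternative that never forms X^T X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(w_lstsq)   # both should recover roughly the true weights
```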
Assumptions
- linearity: there is a linear relationship between the dependent and independent variables
- independence: the observations (and therefore the residuals) are independent of one another
- no multicollinearity: no independent variable is an exact (or near-exact) linear combination of the others; strongly correlated predictors make the coefficient estimates unstable and the predictions less reliable
- normality: the residuals should be approximately normally distributed (bell-shaped and symmetric around zero); a rough diagnostic sketch follows this list
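Below is a rough diagnostic sketch for the multicollinearity and normality assumptions, using synthetic data and SciPy's `scipy.stats.normaltest`; the dataset and the way the outputs are read are illustrative assumptions, not a substitute for proper residual analysis.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Fit least squares and compute residuals.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w

# Multicollinearity: strongly correlated columns / a large condition number are warning signs.
print("feature correlation matrix:\n", np.corrcoef(X, rowvar=False))
print("condition number of X^T X:", np.linalg.cond(X.T @ X))

# Normality: a very small p-value suggests the residuals are not approximately normal.
print("normality test p-value:", stats.normaltest(residuals).pvalue)
```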
Ridge Regression
Adds an L2-norm penalty to the least squares loss function.
It shrinks the coefficients toward zero but does not force any of them to be exactly zero, so all features remain in the model.
$$
L_2(\mathbf{w}) = \sum_i(y_i-\mathbf{w}^T\mathbf{x}_i)^2 + \lambda \sum_{j=1}^{p} w_j^2
$$
- Fully differentiable, easier to optimize.
- Shrinks the weights but keeps every feature, since none are driven exactly to zero
- Prevents individual weights from becoming too large; see the closed-form sketch below
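A minimal sketch of the closed-form ridge solution, using the same design-matrix convention as above; `ridge_fit` and the lambda values are hypothetical choices for illustration. Replacing X^TXw = X^Ty with (X^TX + λI)w = X^Ty is what shrinks the weights.

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data for illustration only.
rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:>6}: ||w|| = {np.linalg.norm(w):.4f}")  # norm shrinks as lambda grows
```

As the bullets above suggest, increasing lambda shrinks the overall weight norm, but no coefficient is driven exactly to zero.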