Derivation
- Least squares fits the model by minimizing the sum of the squared distances (residuals) from the data points to the fitted plane. We don't strictly need to square the distances; we could minimize the sum of their absolute values instead, but squaring makes the objective differentiable and admits a closed-form solution
- Let w be the (d+1)-dimensional weight vector (including the bias term)
- Let X be the n × (d+1) design matrix of training points
- Let y be the n-vector of ground truth values
$$
\min_w \|Xw - y\|^2
$$
Minimize by expanding the squared norm and setting the gradient with respect to w to zero:
$$
\|Xw - y\|^2 = w^TX^TXw - 2y^TXw + y^Ty\\
\nabla_w \|Xw - y\|^2 = 2X^TXw - 2X^Ty\\
0 = 2X^TXw - 2X^Ty
$$
We arrive at the normal equations, whose solution is the least squares estimate:
$$
X^TXw = X^Ty\\
w = (X^TX)^{-1}X^Ty \quad \text{(when } X^TX \text{ is invertible)}
$$
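In code, the normal equations can be solved directly. Below is a minimal numpy sketch with made-up data (the variable names, shapes, and numbers are my own illustration, not from these notes); in practice `np.linalg.lstsq` is usually preferred over explicitly forming X^TX because it is more numerically stable.

```python
import numpy as np

# Synthetic data: X_raw is an (n x d) feature matrix, y is an (n,) target vector.
rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X_raw @ true_w + 4.0 + rng.normal(scale=0.1, size=n)

# Build the n x (d+1) design matrix by prepending a column of ones for the bias term.
X = np.hstack([np.ones((n, 1)), X_raw])

# Solve the normal equations X^T X w = X^T y directly.
w = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares; more robust when X^T X is ill-conditioned.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)        # ~ [4.0, 2.0, -1.0, 0.5] (bias first, then feature weights)
print(w_lstsq)  # essentially the same values
```

Forming X^TX squares the condition number of the problem, which is why the `lstsq` route is the safer default for ill-conditioned or rank-deficient design matrices.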
Assumptions
- Linearity: there is a linear relationship between the dependent variable and the independent variables
- Independence: the residuals are independent of one another; one observation's error does not influence another's (no autocorrelation)
- No multicollinearity: the independent variables should not be highly correlated with one another, and there must be no perfect linear relationship among them; otherwise X^TX is (nearly) singular and the coefficients cannot be estimated reliably
- Normality: the distribution of the residuals should be bell-shaped and symmetrical, i.e. the errors are normally distributed; this matters mainly for inference on the coefficients (confidence intervals and hypothesis tests)
- Homoscedasticity: equal variance of residuals across different values of X (see the diagnostic sketch after this list)
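The last two assumptions are commonly eyeballed from residual plots. A rough sketch on synthetic data (my own example, not part of the notes), assuming numpy and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fit a least squares model on synthetic data, then inspect the residuals.
rng = np.random.default_rng(0)
n, d = 200, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ w
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Homoscedasticity / independence: residuals vs. fitted values should look like
# structureless noise with roughly constant spread around zero.
ax1.scatter(fitted, residuals, s=10)
ax1.axhline(0.0, color="red", linewidth=1)
ax1.set_xlabel("fitted values")
ax1.set_ylabel("residuals")

# Normality: the histogram of residuals should look roughly bell-shaped and symmetric.
ax2.hist(residuals, bins=25)
ax2.set_xlabel("residuals")

plt.tight_layout()
plt.show()
```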
Interpreting Coefficients
- Positive: as the independent variable increases by one unit, the dependent variable is expected to increase by the coefficient's value
- Negative: vice versa
- coefficients are interpreted as "the expected change in the outcome following a one-unit change in the predictor, with all other variables held constant"
- the bias term (intercept) is simply the predicted value of the dependent variable when all independent variables (features) are zero
Example: imagine you have a regression model predicting housing prices. One feature is "number of bathrooms," and its coefficient is -20,000. Taken literally, each additional bathroom is associated with a $20,000 decrease in the predicted price, holding all other features constant; a counterintuitive sign like this often points to multicollinearity (e.g., with square footage).
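As a hedged illustration of reading coefficients off a fitted model, here is a small scikit-learn sketch with entirely made-up housing data (the feature names and numbers are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy housing data: columns are [square_feet, num_bathrooms].
X_house = np.array([
    [1500, 1],
    [1800, 2],
    [2400, 2],
    [3000, 3],
    [3500, 4],
], dtype=float)
prices = np.array([300_000, 340_000, 420_000, 500_000, 560_000], dtype=float)

model = LinearRegression().fit(X_house, prices)

# model.coef_[1] is the expected change in price for one additional bathroom,
# holding square footage constant; model.intercept_ is the bias term
# (the predicted price when all features are zero, often not meaningful on its own).
print(model.coef_, model.intercept_)
```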