Derivation
$$
\begin{aligned}
\text{Probability of success: } & P(y=1 \mid x) = p \\
\text{Probability of failure: } & P(y=0 \mid x) = 1 - p \\
\text{Odds} &= \frac{p}{1-p} \\
\text{log-odds} &= \log\!\left(\frac{p}{1-p}\right) = w^T x + b
\end{aligned}
$$
Solving the log-odds equation for p recovers the sigmoid function, which maps the linear score back into a probability:
$$
p = \frac{1}{1 + e^{-(w^T x + b)}}
$$
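To make the mapping concrete, here is a minimal Python/NumPy sketch that turns a linear score into a probability; the weights and feature values are hypothetical, chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    # Map a real-valued log-odds score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and a single feature vector (hypothetical values).
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 3.0])

z = w @ x + b    # log-odds = w^T x + b
p = sigmoid(z)   # probability of the positive class
print(f"log-odds = {z:.3f}, P(y=1|x) = {p:.3f}")
```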
The goal is to find w and b by maximizing the log-likelihood (equivalently, minimizing the negative log-likelihood). The likelihood function represents the probability of observing the data:
$$
L(w, b) = \prod_{i=1}^n P(y_i|x_i)
$$
Substituting the sigmoid output for each observation's probability:
$$
L(w, b) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}
$$
Taking the logarithm gives the log-likelihood, which is what is maximized in practice:
$$
\ell(w, b) = \sum_{i=1}^n \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]
$$
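As a rough illustration of fitting by maximizing this log-likelihood, the sketch below runs batch gradient descent on the average negative log-likelihood; the synthetic data, learning rate, and iteration count are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    # Minimize the average negative log-likelihood with batch gradient descent.
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)       # predicted P(y=1|x) for every row
        grad_w = X.T @ (p - y) / n   # gradient of the NLL w.r.t. w
        grad_b = np.mean(p - y)      # gradient of the NLL w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (sigmoid(X @ true_w + true_b) > rng.uniform(size=200)).astype(float)

w_hat, b_hat = fit_logistic(X, y)
print("estimated w:", w_hat, "estimated b:", b_hat)
```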
Assumptions
- Independent Observations:
- Each observation in the dataset is assumed to be independent of the others.
- No Multicollinearity:
  - Features (independent variables) should not be highly correlated with each other (a quick check is sketched after this list).
- Predictor Variables Are Correctly Specified:
- Assumes that all relevant variables are included and no irrelevant ones are added.
- Large Sample Size:
- Logistic regression works best with a large sample size for accurate parameter estimation.
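One common way to check the no-multicollinearity assumption is the variance inflation factor (VIF). The sketch below uses pandas and statsmodels; the feature names and data are hypothetical, and the thresholds mentioned are only rules of thumb:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; "income_k" is nearly a copy of "income",
# so it should surface with a large VIF.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50, 10, 500),
})
df["income_k"] = df["income"] + rng.normal(0, 0.1, 500)

X = sm.add_constant(df)  # intercept column for the auxiliary regressions
vifs = pd.Series(
    [variance_inflation_factor(X.to_numpy(), i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vifs)  # values far above ~5-10 suggest problematic multicollinearity
```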
Interpretability: High
- Why it’s interpretable:
  - Logistic regression provides a coefficient for each feature, which can be read directly as the change in the log-odds of the outcome per unit increase in that feature (see the odds-ratio sketch after this list).
- For example, a positive coefficient increases the likelihood of the positive class, while a negative coefficient decreases it.
- If standardized, coefficients can be compared to evaluate the relative importance of features.
- Its simplicity and linearity make it easy to explain to non-technical stakeholders.
- Limitations:
- Assumes a linear relationship between features and log-odds. Complex, non-linear relationships cannot be captured.
- Interaction effects between features must be explicitly modeled (e.g., by adding interaction terms).
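A short sketch of reading the coefficients as odds ratios, using scikit-learn on standardized features; the data and feature names are hypothetical and the model settings are defaults:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features and a binary outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)

X_std = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_std, y)

coefs = pd.Series(model.coef_[0], index=["feature_1", "feature_2"])
print(coefs)          # change in log-odds per one-standard-deviation increase
print(np.exp(coefs))  # odds ratios: >1 raises the odds of y=1, <1 lowers them
```

Exponentiating a coefficient converts it from a log-odds change to a multiplicative change in the odds, which is often the easier quantity to explain to non-technical stakeholders.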