Derivation
$$
\begin{aligned}
\text{Probability of success: } & P(y=1 \mid x) = p \\
\text{Probability of failure: } & P(y=0 \mid x) = 1 - p \\
\text{Odds} &= \frac{p}{1-p} \\
\text{log-odds} &= \log\!\left(\frac{p}{1-p}\right) = w^T x + b
\end{aligned}
$$
Solving the log-odds equation for p recovers the sigmoid function, which maps the linear score back into a probability:
$$
p = \frac{1}{1 + e^{-(w^T x + b)}}
$$
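To make the mapping concrete, here is a minimal Python/NumPy sketch that turns a linear score into a probability; the weights and feature values are hypothetical, chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    # Map a real-valued log-odds score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, and a single feature vector (hypothetical values).
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 3.0])

z = w @ x + b    # log-odds = w^T x + b
p = sigmoid(z)   # probability of the positive class
print(f"log-odds = {z:.3f}, P(y=1|x) = {p:.3f}")
```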
The goal is to find w and b by maximizing the log-likelihood (equivalently, minimizing the negative log-likelihood). The likelihood function represents the probability of observing the data:
$$
L(w, b) = \prod_{i=1}^n P(y_i|x_i)
$$
Substituting the sigmoid output for each observation's probability:
$$
L(w, b) = \prod_{i=1}^n p_i^{y_i} (1-p_i)^{1-y_i}
$$
Taking the logarithm gives the log-likelihood, which is what is maximized in practice:
$$
\ell(w, b) = \sum_{i=1}^n \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]
$$
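As a rough illustration of fitting by maximizing this log-likelihood, the sketch below runs batch gradient descent on the average negative log-likelihood; the synthetic data, learning rate, and iteration count are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    # Minimize the average negative log-likelihood with batch gradient descent.
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)       # predicted P(y=1|x) for every row
        grad_w = X.T @ (p - y) / n   # gradient of the NLL w.r.t. w
        grad_b = np.mean(p - y)      # gradient of the NLL w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (sigmoid(X @ true_w + true_b) > rng.uniform(size=200)).astype(float)

w_hat, b_hat = fit_logistic(X, y)
print("estimated w:", w_hat, "estimated b:", b_hat)
```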
Assumptions
- Independent Observations:
- Each observation in the dataset is assumed to be independent of the others.
- No Multicollinearity:
  - Features (independent variables) should not be highly correlated with each other (a quick check is sketched after this list).
- Predictor Variables Are Correctly Specified:
- Assumes that all relevant variables are included and no irrelevant ones are added.
- Large Sample Size:
- Logistic regression works best with a large sample size for accurate parameter estimation.
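One common way to check the no-multicollinearity assumption is the variance inflation factor (VIF). The sketch below uses pandas and statsmodels; the feature names and data are hypothetical, and the thresholds mentioned are only rules of thumb:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; "income_k" is nearly a copy of "income",
# so it should surface with a large VIF.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50, 10, 500),
})
df["income_k"] = df["income"] + rng.normal(0, 0.1, 500)

X = sm.add_constant(df)  # intercept column for the auxiliary regressions
vifs = pd.Series(
    [variance_inflation_factor(X.to_numpy(), i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vifs)  # values far above ~5-10 suggest problematic multicollinearity
```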
Interpretability: High
- Why it’s interpretable:
  - Logistic regression provides a coefficient for each feature, which can be read directly as the change in the log-odds of the outcome per unit increase in that feature (see the odds-ratio sketch after this list).
- For example, a positive coefficient increases the likelihood of the positive class, while a negative coefficient decreases it.
- If standardized, coefficients can be compared to evaluate the relative importance of features.
- Its simplicity and linearity make it easy to explain to non-technical stakeholders.
- Limitations:
- Assumes a linear relationship between features and log-odds. Complex, non-linear relationships cannot be captured.
- Interaction effects between features must be explicitly modeled (e.g., by adding interaction terms).
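A short sketch of reading the coefficients as odds ratios, using scikit-learn on standardized features; the data and feature names are hypothetical and the model settings are defaults:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features and a binary outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)

X_std = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_std, y)

coefs = pd.Series(model.coef_[0], index=["feature_1", "feature_2"])
print(coefs)          # change in log-odds per one-standard-deviation increase
print(np.exp(coefs))  # odds ratios: >1 raises the odds of y=1, <1 lowers them
```

Exponentiating a coefficient converts it from a log-odds change to a multiplicative change in the odds, which is often the easier quantity to explain to non-technical stakeholders.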