**Definition**:
Logistic regression is a statistical model used for binary classification. It models the probability of a binary outcome as a function of input features using the logistic (sigmoid) function.
The probability that an observation belongs to the positive class (y=1) is modeled as:
$$P(y = 1 \mid x; \beta) = \sigma(x^\top \beta) = \frac{1}{1 + e^{-x^\top \beta}}$$
where:
- $x \in \mathbb{R}^n$ is the vector of features (including an intercept term).
- $\beta \in \mathbb{R}^n$ is the vector of model parameters.
- $\sigma(z)$ is the sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
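A minimal NumPy sketch of this model (the helper names `sigmoid` and `predict_proba` are illustrative, and `X` is assumed to already carry an intercept column of ones):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    """P(y=1 | x; beta) = sigma(x^T beta), computed row-wise over X."""
    return sigmoid(X @ beta)
```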
**Log-Likelihood Function**:
Given $m$ training examples $\{(x_i, y_i)\}_{i=1}^m$ where $y_i \in \{0, 1\}$, the likelihood of the data is:
$$L(\beta) = \prod_{i=1}^m \sigma(x_i^\top \beta)^{y_i} \left(1 - \sigma(x_i^\top \beta)\right)^{1 - y_i}$$
Taking the logarithm gives the log-likelihood:
$$\ell(\beta) = \sum_{i=1}^m \left[ y_i \log \sigma(x_i^\top \beta) + (1 - y_i) \log\left(1 - \sigma(x_i^\top \beta)\right) \right]$$
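Reusing the sketch above, the log-likelihood can be evaluated directly (the clipping constant is an illustrative numerical safeguard, not part of the definition):

```python
def log_likelihood(beta, X, y):
    """Log-likelihood: sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    p = sigmoid(X @ beta)
    # Clip probabilities away from 0 and 1 so log() stays finite.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```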
**Optimization Problem**:
The logistic regression parameters $\beta$ are obtained by maximizing the log-likelihood:
$$\hat{\beta} = \arg\max_\beta \ell(\beta)$$
Equivalently, minimizing the negative log-likelihood:
$$\hat{\beta} = \arg\min_\beta \left[ -\sum_{i=1}^m \left( y_i \log \sigma(x_i^\top \beta) + (1 - y_i) \log\left(1 - \sigma(x_i^\top \beta)\right) \right) \right]$$
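One practical way to obtain $\hat{\beta}$, sketched here with `scipy.optimize.minimize` and the `log_likelihood` helper above (the choice of BFGS is illustrative):

```python
from scipy.optimize import minimize

def fit_mle(X, y):
    """Fit beta by minimizing the negative log-likelihood."""
    neg_ll = lambda beta: -log_likelihood(beta, X, y)
    beta0 = np.zeros(X.shape[1])
    return minimize(neg_ll, beta0, method="BFGS").x
```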
**Gradient Descent**:
The gradient of the log-likelihood with respect to $\beta$ is:
$$\nabla_\beta \ell(\beta) = \sum_{i=1}^m \left( y_i - \sigma(x_i^\top \beta) \right) x_i$$
This gradient drives iterative optimizers such as gradient ascent on $\ell(\beta)$ (equivalently, gradient descent on the negative log-likelihood) or Newton's method.
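The gradient translates directly into code; the gradient-ascent loop below is a sketch in which the step size and iteration count are illustrative choices, not prescribed here:

```python
def gradient(beta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(x_i^T beta)) x_i."""
    return X.T @ (y - sigmoid(X @ beta))

def fit_gradient_ascent(X, y, lr=0.5, n_iter=5000):
    """Maximize the log-likelihood with gradient ascent.

    The update uses the average (per-example) gradient for numerical
    stability; lr and n_iter are illustrative values.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + lr * gradient(beta, X, y) / len(y)
    return beta
```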
**Decision Rule**:
The predicted probability for the positive class is:
$$\hat{P}(y = 1 \mid x) = \sigma(x^\top \hat{\beta})$$
The decision rule for classification is:
$$\hat{y} = \begin{cases}
1 & \text{if } \hat{P}(y=1 \mid x) \geq 0.5 \\
0 & \text{otherwise}
\end{cases}$$
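Putting the pieces together, a sketch of the decision rule with the 0.5 threshold exposed as a parameter, plus a toy run on synthetic data (illustrative only):

```python
def predict(X, beta, threshold=0.5):
    """Classify as 1 when the predicted probability meets the threshold."""
    return (predict_proba(X, beta) >= threshold).astype(int)

# Toy usage on synthetic data generated from a known logistic model.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept + one feature
y = (rng.random(200) < sigmoid(2.0 * X[:, 1])).astype(int)  # Bernoulli labels
beta_hat = fit_gradient_ascent(X, y)
y_hat = predict(X, beta_hat)
```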
**Assumptions**:
1. The relationship between the log-odds of the outcome and the features is linear (see the derivation after this list):
$$\log \frac{P(y=1 | x)}{P(y=0 | x)} = x^\top \beta$$
2. Independence of observations.
3. Features are not highly collinear (to ensure stable estimation).
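The linearity in assumption 1 follows from the sigmoid form of the model; in practice, the assumption is that this linear log-odds relationship is adequate for the data:
$$\frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \frac{\sigma(x^\top \beta)}{1 - \sigma(x^\top \beta)} = e^{x^\top \beta} \quad\Longrightarrow\quad \log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = x^\top \beta$$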
**Extensions**:
1. **Multiclass Logistic Regression** ([[Softmax Regression]]):
Generalizes logistic regression to classify among $k$ classes using the [[Softmax Function]].
2. **Regularized Logistic Regression**:
Adds $L_1$ or $L_2$ regularization to prevent overfitting (see the sketch after this list):
- Lasso ($L_1$):
$$\ell_\text{reg}(\beta) = \ell(\beta) - \lambda \|\beta\|_1$$
- Ridge ($L_2$):
$$\ell_\text{reg}(\beta) = \ell(\beta) - \lambda \|\beta\|_2^2$$
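For the ridge ($L_2$) case, the penalized objective and its gradient can be sketched by extending the helpers above; leaving the intercept unpenalized and the value of $\lambda$ (`lam`) are conventional, illustrative choices, not prescribed here:

```python
def reg_log_likelihood(beta, X, y, lam=1.0):
    """L2-penalized objective: l(beta) - lam * ||beta[1:]||^2 (intercept unpenalized)."""
    return log_likelihood(beta, X, y) - lam * np.sum(beta[1:] ** 2)

def reg_gradient(beta, X, y, lam=1.0):
    """Gradient of the penalized objective: grad l(beta) minus 2 * lam * beta (non-intercept terms)."""
    g = gradient(beta, X, y)
    g[1:] -= 2.0 * lam * beta[1:]
    return g
```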
**Applications**:
- Binary classification problems: spam detection, medical diagnosis, etc.
- Estimating probabilities of binary events.
- Feature importance analysis through parameter interpretation.
**Limitations**:
- Assumes linearity in the log-odds, which may not hold for complex relationships.
- Not inherently robust to outliers or highly imbalanced datasets.