- categories: Data Science, Method
Ridge Regression
Definition:
Ridge regression, also called L2-regularized Linear Regression, modifies the ordinary least squares (OLS) objective by adding a Regularization term that penalizes large model coefficients. It addresses multicollinearity and helps prevent overfitting in linear regression.
Objective Function:
Ridge regression minimizes:
$$ \min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 $$
where:
- $\|y - X\beta\|_2^2$ is the residual sum of squares,
- $\|\beta\|_2^2$ is the squared $\ell_2$ norm of the coefficients,
- $\lambda \geq 0$ is the regularization parameter that controls the trade-off between fitting the data and keeping the coefficients small.
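Written out in code, the objective is straightforward to evaluate. A minimal NumPy sketch (the function name and array conventions are assumptions for illustration, not part of the original note):

```python
import numpy as np

def ridge_loss(beta, X, y, lam):
    """Residual sum of squares plus the squared L2 penalty on the coefficients."""
    residuals = y - X @ beta
    return residuals @ residuals + lam * (beta @ beta)
```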
Closed-Form Solution:
The ridge regression solution is derived from the normal equations:
$$ \hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y $$
where $I$ is the identity matrix.
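A minimal NumPy sketch of this closed-form solution (the helper name is my own, not from the original note):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for the ridge coefficients."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Solving the regularized normal equations with `np.linalg.solve` avoids forming an explicit matrix inverse, which is usually the more numerically stable choice.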
Intuition:
- The term $\lambda \|\beta\|_2^2$ penalizes large coefficients, effectively shrinking them towards zero.
- For $\lambda = 0$, ridge regression reduces to ordinary least squares.
- For $\lambda \to \infty$, $\hat{\beta}^{\text{ridge}} \to 0$ (shrinking coefficients completely). A small numerical check of this behaviour follows this list.
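The check below uses made-up data (the design matrix, coefficients, and noise level are all hypothetical) and evaluates the closed-form solution for increasing $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # hypothetical design matrix
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

for lam in [0.0, 1.0, 10.0, 1000.0]:
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"lambda = {lam:7.1f}   ||beta||_2 = {np.linalg.norm(beta):.4f}")
# lambda = 0 reproduces the OLS fit; larger lambda shrinks the coefficients toward zero.
```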
Key Properties:
- Regularization Strength:
  - Larger $\lambda$ increases the penalty, leading to smaller coefficients and potentially underfitting.
  - Smaller $\lambda$ reduces the penalty, approaching OLS and potentially overfitting.
- Bias-Variance Trade-Off:
  - Ridge regression increases bias but reduces variance, improving the generalization of the model.
- No Feature Elimination:
  - Unlike Lasso regression, ridge regression does not perform variable selection; all coefficients are shrunk but not set to zero.
- Stabilizes Inversion:
  - The addition of $\lambda I$ ensures $X^\top X + \lambda I$ is invertible even if $X^\top X$ is singular (e.g., when features are highly collinear); a short sketch illustrating this follows the list.
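A brief NumPy sketch of the invertibility point, using a deliberately collinear (hypothetical) design matrix:

```python
import numpy as np

# Second column is exactly twice the first, so X^T X is rank-deficient (singular).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
lam = 0.1

gram = X.T @ X
print(np.linalg.matrix_rank(gram))                    # 1: singular, cannot be inverted
print(np.linalg.matrix_rank(gram + lam * np.eye(2)))  # 2: invertible after adding lam * I
```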
Gradient Descent Formulation:
The gradient of the ridge loss function is:
$$ \nabla_\beta L(\beta) = -2 X^\top (y - X\beta) + 2\lambda \beta $$
This can be used in iterative optimization methods when $p$ (the number of features) is large.
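A minimal gradient-descent sketch built on this gradient (the learning rate, iteration count, and function name are illustrative assumptions):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, lr=1e-3, n_iters=5000):
    """Minimize ||y - X beta||^2 + lam * ||beta||^2 by plain gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ beta) + 2 * lam * beta  # gradient of the ridge loss
        beta -= lr * grad                                   # step size may need tuning
    return beta
```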
Example:
Given a design matrix $X$, a response vector $y$, and a regularization parameter $\lambda$, the closed-form solution proceeds as follows:
- Compute $X^\top X$.
- Add $\lambda I$ to form $X^\top X + \lambda I$.
- Compute $X^\top y$.
- Solve $(X^\top X + \lambda I)\,\hat{\beta} = X^\top y$ for $\hat{\beta}$.
Result: $\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$.
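A worked numerical sketch of these steps in NumPy; the matrix, response, and $\lambda$ below are made-up illustrative values, not taken from any particular dataset:

```python
import numpy as np

# Hypothetical data, chosen only to walk through the steps.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])
lam = 1.0

xtx = X.T @ X                              # step 1: X^T X
regularized = xtx + lam * np.eye(2)        # step 2: add lambda * I
xty = X.T @ y                              # step 3: X^T y
beta = np.linalg.solve(regularized, xty)   # step 4: solve for beta
print(beta)                                # approximately [0.375, 0.583]
```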
Applications:
- Addressing multicollinearity in linear regression.
- Regularizing models with a large number of features.
- Situations where interpretability (non-zero coefficients) is desired over sparsity.