Linear Regression & Regularization
From ordinary least squares to Ridge and Lasso — a foundation of supervised learning
Introduction
Linear regression is arguably the most fundamental tool in machine learning and scientific data analysis. Despite its simplicity, it forms the backbone of nearly every predictive model in the sciences — from calibrating instruments to fitting physical laws. Understanding its derivation, assumptions, and failure modes is essential before moving to more complex models.
In this chapter, we derive the ordinary least squares (OLS) solution from first principles, explore the geometric interpretation of projection, and then motivate regularization techniques (Ridge and Lasso) as principled ways to handle overfitting and multicollinearity.
Key Topics
- 1. The OLS Objective and Its Derivation
- 2. Normal Equations and Matrix Calculus
- 3. Geometric Interpretation: Projection onto Column Space
- 4. Ridge Regression (L2 Regularization)
- 5. Lasso Regression (L1 Regularization)
- 6. Bias-Variance Tradeoff
1. Ordinary Least Squares Derivation
We have $n$ observations of $d$ features collected in a design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ and a target vector $\mathbf{y} \in \mathbb{R}^n$. We seek a weight vector $\mathbf{w} \in \mathbb{R}^d$ such that $\mathbf{Xw} \approx \mathbf{y}$.
The OLS Objective
We define the residual sum of squares (RSS) as:

$$\mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$

Expanding this quadratic form:

$$\mathcal{L}(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}$$
Deriving the Normal Equations
To minimize $\mathcal{L}(\mathbf{w})$, we take the gradient with respect to $\mathbf{w}$ and set it to zero. We use the matrix calculus identities:
- $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{a}) = \mathbf{a}$ for constant vector $\mathbf{a}$
- $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{w}$ for constant matrix $\mathbf{A}$
Since $\mathbf{X}^T\mathbf{X}$ is symmetric, we get:

$$\nabla_{\mathbf{w}}\mathcal{L} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{0}$$

Rearranging gives the normal equations:

$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

If $\mathbf{X}^T\mathbf{X}$ is invertible (i.e., $\mathbf{X}$ has full column rank), the unique solution is:

$$\mathbf{w}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
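The closed-form solution can be checked numerically. A minimal NumPy sketch (the data sizes, seed, and noise level are illustrative assumptions): it solves the normal equations directly and compares against `np.linalg.lstsq`, which is the numerically preferable route in practice because it avoids forming $\mathbf{X}^T\mathbf{X}$.

```python
import numpy as np

# Synthetic data (sizes and noise level are illustrative assumptions)
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y directly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: QR/SVD-based least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes agree on well-conditioned problems; they diverge only when $\mathbf{X}^T\mathbf{X}$ is ill-conditioned, which is exactly the regime regularization addresses later.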
Verifying the Minimum
The Hessian (second derivative) of the loss is:

$$\nabla^2_{\mathbf{w}}\mathcal{L} = 2\mathbf{X}^T\mathbf{X}$$

Since $\mathbf{X}^T\mathbf{X}$ is positive semi-definite (and positive definite when $\mathbf{X}$ has full column rank), this confirms the critical point is a global minimum. The loss surface is a convex paraboloid with no local minima.
2. Geometric Interpretation
The OLS solution has a beautiful geometric interpretation. The prediction $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}_{\text{OLS}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.
The Hat Matrix
Define the projection (or "hat") matrix:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
Then $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. This matrix "puts the hat on $\mathbf{y}$". It satisfies the projection properties:
- Idempotent: $\mathbf{H}^2 = \mathbf{H}$ (projecting twice does nothing new)
- Symmetric: $\mathbf{H}^T = \mathbf{H}$ (orthogonal projection)
- Rank: $\text{rank}(\mathbf{H}) = d$ (the number of features)
- Trace: $\text{tr}(\mathbf{H}) = d$ (sum of leverages equals dimension)
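All four projection properties can be verified numerically. A short sketch (the design matrix here is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.normal(size=(n, d))

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)

idempotent = np.allclose(H @ H, H)       # H^2 = H
symmetric = np.allclose(H, H.T)          # H^T = H
rank_ok = np.linalg.matrix_rank(H) == d  # rank(H) = d
trace_ok = np.isclose(np.trace(H), d)    # tr(H) = d
```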
Orthogonality of Residuals
The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to the column space of $\mathbf{X}$:

$$\mathbf{X}^T\mathbf{e} = \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\mathbf{w}_{\text{OLS}} = \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{y} = \mathbf{0}$$
This is the Pythagorean theorem in high-dimensional space: since $\hat{\mathbf{y}} \perp \mathbf{e}$, we have $\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$. The $R^2$ statistic measures the fraction of the total variation explained by the projection (conventionally computed after centering $\mathbf{y}$, when the model includes an intercept).
3. Statistical Properties of OLS
Under the standard linear model assumptions $\mathbf{y} = \mathbf{X}\mathbf{w}^* + \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, the OLS estimator has remarkable properties.
Gauss-Markov Theorem
The OLS estimator is the Best Linear Unbiased Estimator (BLUE):
- Unbiased: $\mathbb{E}[\mathbf{w}_{\text{OLS}}] = \mathbf{w}^*$
- Minimum variance: Among all linear unbiased estimators, OLS has the smallest variance
Proof of unbiasedness:

$$\mathbb{E}[\mathbf{w}_{\text{OLS}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}[\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\mathbf{w}^* = \mathbf{w}^*$$

Covariance of the estimator:

$$\text{Cov}(\mathbf{w}_{\text{OLS}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
This tells us that directions in feature space with low data variance yield high-variance estimates — a key motivation for regularization.
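Both properties can be checked by Monte Carlo: refit OLS on many noise realizations of the same fixed design and compare the empirical mean and covariance of the estimates to the formulas. A sketch (design size, noise level, and replication count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 2, 0.5
X = rng.normal(size=(n, d))             # one fixed design
w_star = np.array([1.0, -2.0])
XtX = X.T @ X

# Refit OLS on 5000 independent noise realizations
estimates = np.array([
    np.linalg.solve(XtX, X.T @ (X @ w_star + sigma * rng.normal(size=n)))
    for _ in range(5000)
])

bias = estimates.mean(axis=0) - w_star      # should be ~ 0 (unbiasedness)
cov_empirical = np.cov(estimates.T)
cov_theory = sigma**2 * np.linalg.inv(XtX)  # sigma^2 (X^T X)^{-1}
```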
Maximum Likelihood Interpretation
Under Gaussian noise, the likelihood is:

$$p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^T\mathbf{w})^2}{2\sigma^2}\right)$$

Taking the negative log-likelihood:

$$-\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$
Minimizing this with respect to $\mathbf{w}$ is equivalent to minimizing the OLS objective. Thus, OLS = MLE under Gaussian noise.
4. Ridge Regression (L2 Regularization)
When $\mathbf{X}^T\mathbf{X}$ is ill-conditioned (nearly singular), the OLS solution becomes unstable — small changes in data produce wildly different weights. Ridge regression adds an L2 penalty to stabilize the solution.
Ridge Objective

$$\mathcal{L}_{\text{ridge}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$

where $\lambda > 0$ is the regularization strength. Taking the gradient and setting to zero:

$$\nabla_{\mathbf{w}}\mathcal{L}_{\text{ridge}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} + 2\lambda\mathbf{w} = \mathbf{0} \quad\Longrightarrow\quad \mathbf{w}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
The matrix $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible for $\lambda > 0$, since its smallest eigenvalue is at least $\lambda$.
SVD Interpretation
Using the singular value decomposition $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, the Ridge solution becomes:

$$\mathbf{w}_{\text{ridge}} = \sum_{j=1}^{d} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \cdot \frac{\mathbf{u}_j^T\mathbf{y}}{\sigma_j}\,\mathbf{v}_j$$

where the OLS solution corresponds to the same sum with each shrinkage factor replaced by $1$.
The factor $\sigma_j^2/(\sigma_j^2 + \lambda)$ shrinks coefficients along directions with small singular values. This is spectral shrinkage: directions poorly supported by data are suppressed most.
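The closed-form and SVD views of the Ridge solution are algebraically identical, which is easy to confirm numerically. A sketch (random data, arbitrary $\lambda$; all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.0

# Closed form: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# SVD form: each OLS component u_j^T y / sigma_j is shrunk
# by the spectral factor sigma_j^2 / (sigma_j^2 + lam)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# OLS for comparison: Ridge always has smaller norm
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```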
Bayesian Interpretation
Ridge regression corresponds to the maximum a posteriori (MAP) estimate with a Gaussian prior on the weights:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid\mathbf{0}, \tau^2\mathbf{I}), \qquad \lambda = \frac{\sigma^2}{\tau^2}$$
The posterior is $p(\mathbf{w}|\mathbf{y},\mathbf{X}) \propto p(\mathbf{y}|\mathbf{X},\mathbf{w})p(\mathbf{w})$, and its mode (MAP) coincides with the Ridge solution.
5. Lasso Regression (L1 Regularization)
While Ridge shrinks all coefficients uniformly, the Lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996) can drive coefficients to exactly zero, performing automatic feature selection.
Lasso Objective

$$\mathcal{L}_{\text{lasso}}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$$

(The factor of $\tfrac{1}{2}$ on the RSS is a convention that simplifies the thresholding constants below.) Unlike Ridge, there is no closed-form solution because the L1 norm is not differentiable at zero. However, for orthonormal features ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$), the solution has the elegant soft-thresholding form:

$$w_j^{\text{lasso}} = S(w_j^{\text{OLS}}, \lambda) = \text{sign}(w_j^{\text{OLS}})\max(|w_j^{\text{OLS}}| - \lambda, 0)$$
Coordinate Descent for Lasso
For general (non-orthogonal) features, we solve the Lasso using coordinate descent. At each step, we optimize over a single coordinate $w_j$ while holding all others fixed.
Define the partial residual excluding feature $j$:

$$\mathbf{r}^{(j)} = \mathbf{y} - \sum_{k \neq j} \mathbf{x}_k w_k$$

Then the update for $w_j$ is:

$$w_j \leftarrow \frac{S(\mathbf{x}_j^T\mathbf{r}^{(j)}, \lambda)}{\|\mathbf{x}_j\|_2^2}$$

where $S(z, \gamma) = \text{sign}(z)\max(|z| - \gamma, 0)$ is the soft-thresholding operator.
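The coordinate-descent loop can be implemented in a few lines. A minimal sketch (the synthetic data, sweep count, and $\lambda$ are illustrative assumptions); it minimizes $\tfrac{1}{2}\|\mathbf{y}-\mathbf{Xw}\|^2 + \lambda\|\mathbf{w}\|_1$ and recovers exact zeros on the irrelevant features:

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual excluding feature j
            r_j = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

# Sparse ground truth: only the first two features matter
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=60)
w_hat = lasso_cd(X, y, lam=20.0)
```

The irrelevant coefficients come out exactly zero (not merely small), while the active ones are shrunk slightly toward zero by roughly $\lambda/\|\mathbf{x}_j\|^2$.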
Elastic Net: Combining L1 and L2
The Elastic Net (Zou & Hastie, 2005) combines both penalties:

$$\mathcal{L}_{\text{EN}}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$$
This addresses the Lasso's limitation of selecting at most $n$ features when $d > n$, and handles correlated features more gracefully.
6. The Bias-Variance Tradeoff
Regularization introduces bias (the model no longer recovers the true parameters on average) but reduces variance (the model is less sensitive to the particular training set). The optimal model balances these.
Bias-Variance Decomposition
For a new test point $\mathbf{x}_0$, the expected prediction error decomposes as:

$$\mathbb{E}\left[(y_0 - \hat{f}(\mathbf{x}_0))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(\mathbf{x}_0)] - f(\mathbf{x}_0)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(\mathbf{x}_0) - \mathbb{E}[\hat{f}(\mathbf{x}_0)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

where:
- $f(\mathbf{x}_0) = \mathbf{x}_0^T\mathbf{w}^*$ is the true regression function
- $\hat{f}(\mathbf{x}_0) = \mathbf{x}_0^T\hat{\mathbf{w}}$ is the fitted prediction, random through the training set
- $\sigma^2$ is the noise variance
Ridge Bias-Variance
For Ridge regression with true parameter $\mathbf{w}^*$:

$$\text{Bias}(\mathbf{w}_{\text{ridge}}) = \mathbb{E}[\mathbf{w}_{\text{ridge}}] - \mathbf{w}^* = -\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{w}^*$$

$$\text{Cov}(\mathbf{w}_{\text{ridge}}) = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
As $\lambda \to 0$, bias vanishes but variance is high. As $\lambda \to \infty$, variance vanishes but bias dominates. The optimal $\lambda$ minimizes the total MSE.
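The tradeoff can be observed directly by Monte Carlo: estimate the Ridge bias and variance over many noise realizations at several values of $\lambda$. A sketch (problem sizes, noise level, and the $\lambda$ grid are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 40, 10, 1.0
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# For each lambda, refit on 2000 noise realizations of the same design
results = {}
for lam in [0.01, 1.0, 100.0]:
    W = np.array([ridge(X, X @ w_star + sigma * rng.normal(size=n), lam)
                  for _ in range(2000)])
    bias_sq = np.sum((W.mean(axis=0) - w_star) ** 2)  # squared bias
    variance = W.var(axis=0).sum()                    # total variance
    results[lam] = (bias_sq, variance)
```

As expected, squared bias grows and variance shrinks monotonically as $\lambda$ increases.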
7. Python Simulation: OLS, Ridge, and Lasso
The following simulation generates noisy polynomial data and compares OLS, Ridge, and coordinate-descent Lasso fits, demonstrating overfitting and regularization effects.
Linear Regression & Regularization Comparison
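The original interactive code block did not survive extraction; the sketch below is a reconstruction of the described comparison under stated assumptions (the ground-truth cubic, noise level, polynomial degree, and penalty strengths are all illustrative choices, not the original settings). It fits OLS, Ridge, and coordinate-descent Lasso to standardized polynomial features and reports train/test error and sparsity:

```python
import numpy as np

rng = np.random.default_rng(6)

# Noisy polynomial ground truth (illustrative assumption)
def f(x):
    return 0.5 * x**3 - x

x_train = rng.uniform(-2, 2, size=30)
y_train = f(x_train) + 0.3 * rng.normal(size=30)
x_test = rng.uniform(-2, 2, size=200)
y_test = f(x_test) + 0.3 * rng.normal(size=200)

# Degree-12 polynomial features, standardized with TRAINING statistics
degree = 12
Xtr_raw = np.vander(x_train, degree + 1, increasing=True)[:, 1:]
mu, sd = Xtr_raw.mean(axis=0), Xtr_raw.std(axis=0)
Xtr = (Xtr_raw - mu) / sd
Xte = (np.vander(x_test, degree + 1, increasing=True)[:, 1:] - mu) / sd
y_mean = y_train.mean()
ytr = y_train - y_mean                      # intercept handled by centering

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

w_ols, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
w_ridge = np.linalg.solve(Xtr.T @ Xtr + 1.0 * np.eye(degree), Xtr.T @ ytr)
w_lasso = lasso_cd(Xtr, ytr, lam=10.0)

def mse(w, X, y):
    return float(np.mean((y - (X @ w + y_mean)) ** 2))

for name, w in [("OLS", w_ols), ("Ridge", w_ridge), ("Lasso", w_lasso)]:
    print(f"{name:6s} train MSE {mse(w, Xtr, y_train):.3f}  "
          f"test MSE {mse(w, Xte, y_test):.3f}  "
          f"nonzeros {int(np.sum(np.abs(w) > 1e-8))}")
```

OLS achieves the lowest training error by construction; Ridge and Lasso trade a little training error for stability, and Lasso additionally zeroes out most of the spurious high-order terms.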
8. Regularization Paths
Ridge Path
As $\lambda$ increases from 0 to $\infty$, the Ridge solution traces a continuous path from $\mathbf{w}_{\text{OLS}}$ to $\mathbf{0}$. Using the SVD:

$$\mathbf{w}_{\text{ridge}}(\lambda) = \sum_{j=1}^{d} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \cdot \frac{\mathbf{u}_j^T\mathbf{y}}{\sigma_j}\,\mathbf{v}_j$$
Each coefficient is continuously shrunk toward zero, with coefficients in low-variance directions shrinking fastest.
Lasso Path
The Lasso path is piecewise linear: coefficients enter or leave the active set at critical values of $\lambda$. The LARS (Least Angle Regression) algorithm computes the entire path in $O(d^3 + d^2 n)$ time by solving for the exact $\lambda$ values at which the active set changes.
At $\lambda_{\max} = \|\mathbf{X}^T\mathbf{y}\|_\infty$, all coefficients are zero. As $\lambda$ decreases, features enter one at a time, in order of their correlation with the residual: feature $j$ joins the active set when

$$|\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\mathbf{w}(\lambda))| = \lambda$$
The first feature to enter is the one most correlated with the target.
Degrees of Freedom
The effective degrees of freedom of Ridge regression is:

$$\text{df}(\lambda) = \text{tr}(\mathbf{H}(\lambda)) = \sum_{j=1}^{d} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
This ranges from $d$ (at $\lambda = 0$) to $0$ (as $\lambda \to \infty$), providing a continuous measure of model complexity. For Lasso, the degrees of freedom equals the number of nonzero coefficients (approximately).
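The trace and singular-value expressions for $\text{df}(\lambda)$ agree, which a few lines confirm (random design and $\lambda$ are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 6
X = rng.normal(size=(n, d))
lam = 5.0

# df(lambda) via the trace of the ridge hat matrix ...
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
df_trace = np.trace(H)

# ... and via the singular values of X
s = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(s**2 / (s**2 + lam))
```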
9. Selecting $\lambda$: Cross-Validation
In practice, we select the regularization parameter $\lambda$ using k-fold cross-validation:
- Split data into $k$ equal folds
- For each fold, train on $k-1$ folds and evaluate on the held-out fold
- Average the $k$ validation errors
- Choose $\lambda$ that minimizes this average error
Leave-One-Out Cross-Validation (LOOCV)
For Ridge regression, LOOCV has an efficient closed form:

$$\text{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i(\lambda)}{1 - H_{ii}(\lambda)}\right)^2$$

where $H_{ii}(\lambda)$ is the $i$-th diagonal element of the hat matrix $\mathbf{H}(\lambda) = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$. This allows evaluating all $n$ leave-one-out models without refitting.
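The shortcut is exact (a Sherman-Morrison identity), so it must match brute-force refitting to machine precision. A sketch (data sizes and $\lambda$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 25, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
lam = 1.0

# Shortcut: one fit, then rescale residuals by 1 - H_ii
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
loocv_shortcut = np.mean(((y - H @ y) / (1.0 - np.diag(H))) ** 2)

# Brute force: n separate refits, each holding out one observation
errs = []
for i in range(n):
    mask = np.arange(n) != i
    A = X[mask].T @ X[mask] + lam * np.eye(d)
    w_i = np.linalg.solve(A, X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ w_i) ** 2)
loocv_brute = np.mean(errs)
```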
10. Multi-Collinearity and Condition Number
The Condition Number
The condition number of $\mathbf{X}^T\mathbf{X}$ measures the sensitivity of the OLS solution to perturbations in the data:

$$\kappa(\mathbf{X}^T\mathbf{X}) = \left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^2$$
where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $\mathbf{X}$. A large condition number means near-singularity.
- $\kappa \approx 1$: Well-conditioned; stable solution
- $\kappa \sim 10^4$: Some digits of accuracy lost
- $\kappa \sim 10^{16}$: Essentially singular; solution meaningless
Ridge as Conditioning Fix
Ridge regression directly improves the condition number:

$$\kappa(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}) = \frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}$$
Even a small $\lambda$ can dramatically improve conditioning when $\sigma_{\min}$ is tiny. This is the numerical analysis perspective on why regularization helps — it stabilizes the linear system.
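The effect is dramatic on a nearly collinear design. A sketch (the near-duplicate column and $\lambda = 1$ are illustrative assumptions):

```python
import numpy as np

# Nearly collinear design: column 2 almost duplicates column 1
rng = np.random.default_rng(9)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 + 1e-4 * rng.normal(size=100),
                     rng.normal(size=100)])

s = np.linalg.svd(X, compute_uv=False)
lam = 1.0
kappa_ols = s.max()**2 / s.min()**2                    # cond(X^T X)
kappa_ridge = (s.max()**2 + lam) / (s.min()**2 + lam)  # cond(X^T X + lam I)
```

Here $\sigma_{\min}^2$ is nearly zero, so $\kappa(\mathbf{X}^T\mathbf{X})$ is astronomically large while the regularized system is well-conditioned.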
Variance Inflation Factor
The VIF for feature $j$ measures how much the variance of $w_j$ is inflated by multicollinearity:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing feature $j$ on all other features. VIF > 10 typically indicates problematic collinearity. If feature $j$ is nearly a linear combination of other features, $R_j^2 \approx 1$ and VIF diverges.
11. Applications in Science
Spectroscopy
In chemometrics, spectra have thousands of wavelengths but few samples. Ridge/Lasso regression is essential for predicting chemical concentrations from highly correlated spectral features.
Genomics
Genome-wide association studies (GWAS) have millions of SNPs but thousands of individuals. Lasso and Elastic Net identify which genetic variants are associated with disease.
Climate Science
Multicollinear climate variables (temperature, humidity, pressure at many locations) require regularized regression for stable predictions.
Materials Science
Predicting material properties from composition vectors. Lasso identifies which elements/descriptors matter most for a target property.
Summary
| Method | Penalty | Solution | Sparsity |
|---|---|---|---|
| OLS | None | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | No |
| Ridge | $\lambda\|\mathbf{w}\|_2^2$ | $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ | No (shrinkage only) |
| Lasso | $\lambda\|\mathbf{w}\|_1$ | Coordinate descent | Yes (exact zeros) |
| Elastic Net | $\lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$ | Coordinate descent | Yes (grouped) |