Linear Regression & Regularization
From ordinary least squares to Ridge and Lasso — a foundation of supervised learning
Introduction
Linear regression is arguably the most fundamental tool in machine learning and scientific data analysis. Despite its simplicity, it forms the backbone of nearly every predictive model in the sciences — from calibrating instruments to fitting physical laws. Understanding its derivation, assumptions, and failure modes is essential before moving to more complex models.
In this chapter, we derive the ordinary least squares (OLS) solution from first principles, explore the geometric interpretation of projection, and then motivate regularization techniques (Ridge and Lasso) as principled ways to handle overfitting and multicollinearity.
Key Topics
- 1. The OLS Objective and Its Derivation
- 2. Normal Equations and Matrix Calculus
- 3. Geometric Interpretation: Projection onto Column Space
- 4. Ridge Regression (L2 Regularization)
- 5. Lasso Regression (L1 Regularization)
- 6. Bias-Variance Tradeoff
1. Ordinary Least Squares Derivation
We have $n$ observations of $d$ features collected in a design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ and a target vector $\mathbf{y} \in \mathbb{R}^n$. We seek a weight vector $\mathbf{w} \in \mathbb{R}^d$ such that $\mathbf{Xw} \approx \mathbf{y}$.
The OLS Objective
We define the residual sum of squares (RSS) as:

$$\mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$

Expanding this quadratic form:

$$\mathcal{L}(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}$$
Deriving the Normal Equations
To minimize $\mathcal{L}(\mathbf{w})$, we take the gradient with respect to $\mathbf{w}$ and set it to zero. We use the matrix calculus identities:
- $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{a}) = \mathbf{a}$ for constant vector $\mathbf{a}$
- $\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{w}$ for constant matrix $\mathbf{A}$
Since $\mathbf{X}^T\mathbf{X}$ is symmetric, we get:

$$\nabla_{\mathbf{w}}\mathcal{L} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{0}$$

Rearranging gives the normal equations:

$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

If $\mathbf{X}^T\mathbf{X}$ is invertible (i.e., $\mathbf{X}$ has full column rank), the unique solution is:

$$\mathbf{w}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
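The closed-form solution can be checked numerically. A minimal NumPy sketch (the data sizes, seed, and noise level are illustrative assumptions): it solves the normal equations directly and compares against `np.linalg.lstsq`, which is the numerically preferable route in practice because it avoids forming $\mathbf{X}^T\mathbf{X}$.

```python
import numpy as np

# Synthetic data (sizes and noise level are illustrative assumptions)
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y directly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: QR/SVD-based least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes agree on well-conditioned problems; they diverge only when $\mathbf{X}^T\mathbf{X}$ is ill-conditioned, which is exactly the regime regularization addresses later.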
Verifying the Minimum
The Hessian (second derivative) of the loss is:

$$\nabla^2_{\mathbf{w}}\mathcal{L} = 2\mathbf{X}^T\mathbf{X}$$

Since $\mathbf{X}^T\mathbf{X}$ is positive semi-definite (and positive definite when $\mathbf{X}$ has full column rank), this confirms the critical point is a global minimum. The loss surface is a convex paraboloid with no local minima.
2. Geometric Interpretation
The OLS solution has a beautiful geometric interpretation. The prediction $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}_{\text{OLS}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.
The Hat Matrix
Define the projection (or "hat") matrix:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
Then $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. This matrix "puts the hat on $\mathbf{y}$". It satisfies the projection properties:
- Idempotent: $\mathbf{H}^2 = \mathbf{H}$ (projecting twice does nothing new)
- Symmetric: $\mathbf{H}^T = \mathbf{H}$ (orthogonal projection)
- Rank: $\text{rank}(\mathbf{H}) = d$ (the number of features)
- Trace: $\text{tr}(\mathbf{H}) = d$ (sum of leverages equals dimension)
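All four projection properties can be verified numerically. A short sketch (the design matrix here is an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.normal(size=(n, d))

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)

idempotent = np.allclose(H @ H, H)       # H^2 = H
symmetric = np.allclose(H, H.T)          # H^T = H
rank_ok = np.linalg.matrix_rank(H) == d  # rank(H) = d
trace_ok = np.isclose(np.trace(H), d)    # tr(H) = d
```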
Orthogonality of Residuals
The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to the column space of $\mathbf{X}$:

$$\mathbf{X}^T\mathbf{e} = \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\mathbf{w}_{\text{OLS}} = \mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{y} = \mathbf{0}$$
This is the Pythagorean theorem in high-dimensional space: since $\hat{\mathbf{y}} \perp \mathbf{e}$, we have $\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$. The $R^2$ statistic measures the fraction of the total variation explained by the projection (conventionally computed after centering $\mathbf{y}$, when the model includes an intercept).
3. Statistical Properties of OLS
Under the standard linear model assumptions $\mathbf{y} = \mathbf{X}\mathbf{w}^* + \boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, the OLS estimator has remarkable properties.
Gauss-Markov Theorem
The OLS estimator is the Best Linear Unbiased Estimator (BLUE):
- Unbiased: $\mathbb{E}[\mathbf{w}_{\text{OLS}}] = \mathbf{w}^*$
- Minimum variance: Among all linear unbiased estimators, OLS has the smallest variance
Proof of unbiasedness:

$$\mathbb{E}[\mathbf{w}_{\text{OLS}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}[\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\mathbf{w}^* = \mathbf{w}^*$$

Covariance of the estimator:

$$\text{Cov}(\mathbf{w}_{\text{OLS}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
This tells us that directions in feature space with low data variance yield high-variance estimates — a key motivation for regularization.
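Both properties can be checked by Monte Carlo: refit OLS on many noise realizations of the same fixed design and compare the empirical mean and covariance of the estimates to the formulas. A sketch (design size, noise level, and replication count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 2, 0.5
X = rng.normal(size=(n, d))             # one fixed design
w_star = np.array([1.0, -2.0])
XtX = X.T @ X

# Refit OLS on 5000 independent noise realizations
estimates = np.array([
    np.linalg.solve(XtX, X.T @ (X @ w_star + sigma * rng.normal(size=n)))
    for _ in range(5000)
])

bias = estimates.mean(axis=0) - w_star      # should be ~ 0 (unbiasedness)
cov_empirical = np.cov(estimates.T)
cov_theory = sigma**2 * np.linalg.inv(XtX)  # sigma^2 (X^T X)^{-1}
```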
Maximum Likelihood Interpretation
Under Gaussian noise, the likelihood is:

$$p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^T\mathbf{w})^2}{2\sigma^2}\right)$$

Taking the negative log-likelihood:

$$-\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$
Minimizing this with respect to $\mathbf{w}$ is equivalent to minimizing the OLS objective. Thus, OLS = MLE under Gaussian noise.
4. Ridge Regression (L2 Regularization)
When $\mathbf{X}^T\mathbf{X}$ is ill-conditioned (nearly singular), the OLS solution becomes unstable — small changes in data produce wildly different weights. Ridge regression adds an L2 penalty to stabilize the solution.
Ridge Objective

$$\mathcal{L}_{\text{ridge}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$

where $\lambda > 0$ is the regularization strength. Taking the gradient and setting to zero:

$$\nabla_{\mathbf{w}}\mathcal{L}_{\text{ridge}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} + 2\lambda\mathbf{w} = \mathbf{0} \quad\Longrightarrow\quad \mathbf{w}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
The matrix $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible for $\lambda > 0$, since its smallest eigenvalue is at least $\lambda$.
SVD Interpretation
Using the singular value decomposition $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, the Ridge solution becomes:

$$\mathbf{w}_{\text{ridge}} = \sum_{j=1}^{d} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \cdot \frac{\mathbf{u}_j^T\mathbf{y}}{\sigma_j}\,\mathbf{v}_j$$

where the OLS solution corresponds to the same sum with each shrinkage factor replaced by $1$.
The factor $\sigma_j^2/(\sigma_j^2 + \lambda)$ shrinks coefficients along directions with small singular values. This is spectral shrinkage: directions poorly supported by data are suppressed most.
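The closed-form and SVD views of the Ridge solution are algebraically identical, which is easy to confirm numerically. A sketch (random data, arbitrary $\lambda$; all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 2.0

# Closed form: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# SVD form: each OLS component u_j^T y / sigma_j is shrunk
# by the spectral factor sigma_j^2 / (sigma_j^2 + lam)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# OLS for comparison: Ridge always has smaller norm
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```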
Bayesian Interpretation
Ridge regression corresponds to the maximum a posteriori (MAP) estimate with a Gaussian prior on the weights:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}\mid\mathbf{0}, \tau^2\mathbf{I}), \qquad \lambda = \frac{\sigma^2}{\tau^2}$$
The posterior is $p(\mathbf{w}|\mathbf{y},\mathbf{X}) \propto p(\mathbf{y}|\mathbf{X},\mathbf{w})p(\mathbf{w})$, and its mode (MAP) coincides with the Ridge solution.
5. Lasso Regression (L1 Regularization)
While Ridge shrinks all coefficients uniformly, the Lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996) can drive coefficients to exactly zero, performing automatic feature selection.
Lasso Objective

$$\mathcal{L}_{\text{lasso}}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$$

(The factor of $\tfrac{1}{2}$ on the RSS is a convention that simplifies the thresholding constants below.) Unlike Ridge, there is no closed-form solution because the L1 norm is not differentiable at zero. However, for orthonormal features ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$), the solution has the elegant soft-thresholding form:

$$w_j^{\text{lasso}} = S(w_j^{\text{OLS}}, \lambda) = \text{sign}(w_j^{\text{OLS}})\max(|w_j^{\text{OLS}}| - \lambda, 0)$$
Coordinate Descent for Lasso
For general (non-orthogonal) features, we solve the Lasso using coordinate descent. At each step, we optimize over a single coordinate $w_j$ while holding all others fixed.
Define the partial residual excluding feature $j$:

$$\mathbf{r}^{(j)} = \mathbf{y} - \sum_{k \neq j} \mathbf{x}_k w_k$$

Then the update for $w_j$ is:

$$w_j \leftarrow \frac{S(\mathbf{x}_j^T\mathbf{r}^{(j)}, \lambda)}{\|\mathbf{x}_j\|_2^2}$$

where $S(z, \gamma) = \text{sign}(z)\max(|z| - \gamma, 0)$ is the soft-thresholding operator.
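The coordinate-descent loop can be implemented in a few lines. A minimal sketch (the synthetic data, sweep count, and $\lambda$ are illustrative assumptions); it minimizes $\tfrac{1}{2}\|\mathbf{y}-\mathbf{Xw}\|^2 + \lambda\|\mathbf{w}\|_1$ and recovers exact zeros on the irrelevant features:

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual excluding feature j
            r_j = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

# Sparse ground truth: only the first two features matter
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=60)
w_hat = lasso_cd(X, y, lam=20.0)
```

The irrelevant coefficients come out exactly zero (not merely small), while the active ones are shrunk slightly toward zero by roughly $\lambda/\|\mathbf{x}_j\|^2$.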
Elastic Net: Combining L1 and L2
The Elastic Net (Zou & Hastie, 2005) combines both penalties:

$$\mathcal{L}_{\text{EN}}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$$
This addresses the Lasso's limitation of selecting at most $n$ features when $d > n$, and handles correlated features more gracefully.
6. The Bias-Variance Tradeoff
Regularization introduces bias (the model no longer recovers the true parameters on average) but reduces variance (the model is less sensitive to the particular training set). The optimal model balances these.
Bias-Variance Decomposition
For a new test point $\mathbf{x}_0$, the expected prediction error decomposes as:

$$\mathbb{E}\left[(y_0 - \hat{f}(\mathbf{x}_0))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(\mathbf{x}_0)] - f(\mathbf{x}_0)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(\mathbf{x}_0) - \mathbb{E}[\hat{f}(\mathbf{x}_0)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

where:
- $f(\mathbf{x}_0) = \mathbf{x}_0^T\mathbf{w}^*$ is the true regression function
- $\hat{f}(\mathbf{x}_0) = \mathbf{x}_0^T\hat{\mathbf{w}}$ is the fitted prediction, random through the training set
- $\sigma^2$ is the noise variance
Ridge Bias-Variance
For Ridge regression with true parameter $\mathbf{w}^*$:

$$\text{Bias}(\mathbf{w}_{\text{ridge}}) = \mathbb{E}[\mathbf{w}_{\text{ridge}}] - \mathbf{w}^* = -\lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{w}^*$$

$$\text{Cov}(\mathbf{w}_{\text{ridge}}) = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
As $\lambda \to 0$, bias vanishes but variance is high. As $\lambda \to \infty$, variance vanishes but bias dominates. The optimal $\lambda$ minimizes the total MSE.
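The tradeoff can be observed directly by Monte Carlo: estimate the Ridge bias and variance over many noise realizations at several values of $\lambda$. A sketch (problem sizes, noise level, and the $\lambda$ grid are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma = 40, 10, 1.0
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# For each lambda, refit on 2000 noise realizations of the same design
results = {}
for lam in [0.01, 1.0, 100.0]:
    W = np.array([ridge(X, X @ w_star + sigma * rng.normal(size=n), lam)
                  for _ in range(2000)])
    bias_sq = np.sum((W.mean(axis=0) - w_star) ** 2)  # squared bias
    variance = W.var(axis=0).sum()                    # total variance
    results[lam] = (bias_sq, variance)
```

As expected, squared bias grows and variance shrinks monotonically as $\lambda$ increases.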
7. Python Simulation: OLS, Ridge, and Lasso
The following simulation generates noisy polynomial data and compares OLS, Ridge, and coordinate-descent Lasso fits, demonstrating overfitting and regularization effects.
Linear Regression & Regularization Comparison
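The original interactive code block did not survive extraction; the sketch below is a reconstruction of the described comparison under stated assumptions (the ground-truth cubic, noise level, polynomial degree, and penalty strengths are all illustrative choices, not the original settings). It fits OLS, Ridge, and coordinate-descent Lasso to standardized polynomial features and reports train/test error and sparsity:

```python
import numpy as np

rng = np.random.default_rng(6)

# Noisy polynomial ground truth (illustrative assumption)
def f(x):
    return 0.5 * x**3 - x

x_train = rng.uniform(-2, 2, size=30)
y_train = f(x_train) + 0.3 * rng.normal(size=30)
x_test = rng.uniform(-2, 2, size=200)
y_test = f(x_test) + 0.3 * rng.normal(size=200)

# Degree-12 polynomial features, standardized with TRAINING statistics
degree = 12
Xtr_raw = np.vander(x_train, degree + 1, increasing=True)[:, 1:]
mu, sd = Xtr_raw.mean(axis=0), Xtr_raw.std(axis=0)
Xtr = (Xtr_raw - mu) / sd
Xte = (np.vander(x_test, degree + 1, increasing=True)[:, 1:] - mu) / sd
y_mean = y_train.mean()
ytr = y_train - y_mean                      # intercept handled by centering

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    w = np.zeros(X.shape[1])
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

w_ols, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
w_ridge = np.linalg.solve(Xtr.T @ Xtr + 1.0 * np.eye(degree), Xtr.T @ ytr)
w_lasso = lasso_cd(Xtr, ytr, lam=10.0)

def mse(w, X, y):
    return float(np.mean((y - (X @ w + y_mean)) ** 2))

for name, w in [("OLS", w_ols), ("Ridge", w_ridge), ("Lasso", w_lasso)]:
    print(f"{name:6s} train MSE {mse(w, Xtr, y_train):.3f}  "
          f"test MSE {mse(w, Xte, y_test):.3f}  "
          f"nonzeros {int(np.sum(np.abs(w) > 1e-8))}")
```

OLS achieves the lowest training error by construction; Ridge and Lasso trade a little training error for stability, and Lasso additionally zeroes out most of the spurious high-order terms.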
8. Regularization Paths
Ridge Path
As $\lambda$ increases from 0 to $\infty$, the Ridge solution traces a continuous path from $\mathbf{w}_{\text{OLS}}$ to $\mathbf{0}$. Using the SVD:

$$\mathbf{w}_{\text{ridge}}(\lambda) = \sum_{j=1}^{d} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \cdot \frac{\mathbf{u}_j^T\mathbf{y}}{\sigma_j}\,\mathbf{v}_j$$
Each coefficient is continuously shrunk toward zero, with coefficients in low-variance directions shrinking fastest.
Lasso Path
The Lasso path is piecewise linear: coefficients enter or leave the active set at critical values of $\lambda$. The LARS (Least Angle Regression) algorithm computes the entire path in $O(d^3 + d^2 n)$ time by solving for the exact $\lambda$ values at which the active set changes.
At $\lambda_{\max} = \|\mathbf{X}^T\mathbf{y}\|_\infty$, all coefficients are zero. As $\lambda$ decreases, features enter one at a time, in order of their correlation with the residual: feature $j$ joins the active set when

$$|\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}\mathbf{w}(\lambda))| = \lambda$$
The first feature to enter is the one most correlated with the target.
Degrees of Freedom
The effective degrees of freedom of Ridge regression is:

$$\text{df}(\lambda) = \text{tr}(\mathbf{H}(\lambda)) = \sum_{j=1}^{d} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
This ranges from $d$ (at $\lambda = 0$) to $0$ (as $\lambda \to \infty$), providing a continuous measure of model complexity. For Lasso, the degrees of freedom equals the number of nonzero coefficients (approximately).
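The trace and singular-value expressions for $\text{df}(\lambda)$ agree, which a few lines confirm (random design and $\lambda$ are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 40, 6
X = rng.normal(size=(n, d))
lam = 5.0

# df(lambda) via the trace of the ridge hat matrix ...
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
df_trace = np.trace(H)

# ... and via the singular values of X
s = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(s**2 / (s**2 + lam))
```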
9. Selecting $\lambda$: Cross-Validation
In practice, we select the regularization parameter $\lambda$ using k-fold cross-validation:
- Split data into $k$ equal folds
- For each fold, train on $k-1$ folds and evaluate on the held-out fold
- Average the $k$ validation errors
- Choose $\lambda$ that minimizes this average error
Leave-One-Out Cross-Validation (LOOCV)
For Ridge regression, LOOCV has an efficient closed form:

$$\text{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i(\lambda)}{1 - H_{ii}(\lambda)}\right)^2$$

where $H_{ii}(\lambda)$ is the $i$-th diagonal element of the hat matrix $\mathbf{H}(\lambda) = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$. This allows evaluating all $n$ leave-one-out models without refitting.
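The shortcut is exact (a Sherman-Morrison identity), so it must match brute-force refitting to machine precision. A sketch (data sizes and $\lambda$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 25, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
lam = 1.0

# Shortcut: one fit, then rescale residuals by 1 - H_ii
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
loocv_shortcut = np.mean(((y - H @ y) / (1.0 - np.diag(H))) ** 2)

# Brute force: n separate refits, each holding out one observation
errs = []
for i in range(n):
    mask = np.arange(n) != i
    A = X[mask].T @ X[mask] + lam * np.eye(d)
    w_i = np.linalg.solve(A, X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ w_i) ** 2)
loocv_brute = np.mean(errs)
```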
10. Multi-Collinearity and Condition Number
The Condition Number
The condition number of $\mathbf{X}^T\mathbf{X}$ measures the sensitivity of the OLS solution to perturbations in the data:

$$\kappa(\mathbf{X}^T\mathbf{X}) = \left(\frac{\sigma_{\max}}{\sigma_{\min}}\right)^2$$
where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $\mathbf{X}$. A large condition number means near-singularity.
- $\kappa \approx 1$: Well-conditioned; stable solution
- $\kappa \sim 10^4$: Some digits of accuracy lost
- $\kappa \sim 10^{16}$: Essentially singular; solution meaningless
Ridge as Conditioning Fix
Ridge regression directly improves the condition number:

$$\kappa(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}) = \frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}$$
Even a small $\lambda$ can dramatically improve conditioning when $\sigma_{\min}$ is tiny. This is the numerical analysis perspective on why regularization helps — it stabilizes the linear system.
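The effect is dramatic on a nearly collinear design. A sketch (the near-duplicate column and $\lambda = 1$ are illustrative assumptions):

```python
import numpy as np

# Nearly collinear design: column 2 almost duplicates column 1
rng = np.random.default_rng(9)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 + 1e-4 * rng.normal(size=100),
                     rng.normal(size=100)])

s = np.linalg.svd(X, compute_uv=False)
lam = 1.0
kappa_ols = s.max()**2 / s.min()**2                    # cond(X^T X)
kappa_ridge = (s.max()**2 + lam) / (s.min()**2 + lam)  # cond(X^T X + lam I)
```

Here $\sigma_{\min}^2$ is nearly zero, so $\kappa(\mathbf{X}^T\mathbf{X})$ is astronomically large while the regularized system is well-conditioned.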
Variance Inflation Factor
The VIF for feature $j$ measures how much the variance of $w_j$ is inflated by multicollinearity:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing feature $j$ on all other features. VIF > 10 typically indicates problematic collinearity. If feature $j$ is nearly a linear combination of other features, $R_j^2 \approx 1$ and VIF diverges.
11. Applications in Science
Spectroscopy
In chemometrics, spectra have thousands of wavelengths but few samples. Ridge/Lasso regression is essential for predicting chemical concentrations from highly correlated spectral features.
Genomics
Genome-wide association studies (GWAS) have millions of SNPs but thousands of individuals. Lasso and Elastic Net identify which genetic variants are associated with disease.
Climate Science
Multicollinear climate variables (temperature, humidity, pressure at many locations) require regularized regression for stable predictions.
Materials Science
Predicting material properties from composition vectors. Lasso identifies which elements/descriptors matter most for a target property.
Summary
| Method | Penalty | Solution | Sparsity |
|---|---|---|---|
| OLS | None | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ | No |
| Ridge | $\lambda\|\mathbf{w}\|_2^2$ | $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ | No (shrinkage only) |
| Lasso | $\lambda\|\mathbf{w}\|_1$ | Coordinate descent | Yes (exact zeros) |
| Elastic Net | $\lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$ | Coordinate descent | Yes (grouped) |