Minimum Description Length
Rissanen's MDL principle: the best model is the one that compresses the data most. A rigorous information-theoretic foundation for Occam's razor and model selection.
11.1 The MDL Principle
Given data \(D\) and a class of models \(\mathcal{M}\), the Minimum Description Length principle selects the model \(M^*\) that minimizes the total description length:
\[ M^* = \arg\min_{M\in\mathcal{M}}\bigl[L(M) + L(D \mid M)\bigr] \]
- \(L(M)\): length of the model description (complexity, number of parameters)
- \(L(D\mid M)\): length of the data given the model (goodness-of-fit, residuals)
Simple models have short \(L(M)\) but poor fit (large \(L(D\mid M)\)); complex models fit well but have large \(L(M)\). MDL finds the optimal tradeoff: Occam's razor, quantified in bits.
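The tradeoff can be seen in a toy sketch (not from the text): a zero-parameter uniform code is compared against a one-parameter Bernoulli model whose parameter cost is taken to be \(\frac{1}{2}\log_2 n\) bits, as in the two-part codes below. On strongly biased data the parameter pays for itself; on balanced data it does not.

```python
import math

def bernoulli_mdl(bits):
    """Two-part MDL score (in bits) for a one-parameter Bernoulli model."""
    n, ones = len(bits), sum(bits)
    p = ones / n
    # L(M): one real parameter at precision 1/sqrt(n) -> (1/2) log2 n bits
    l_model = 0.5 * math.log2(n)
    # L(D|M): n * H(p_hat) bits under the Shannon code for Bernoulli(p_hat)
    h = 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return l_model + n * h

def uniform_mdl(bits):
    """Zero-parameter model: every n-bit sequence costs exactly n bits."""
    return float(len(bits))

biased = [1] * 90 + [0] * 10   # strongly biased data
fair = [1, 0] * 50             # balanced data
for name, data in [("biased", biased), ("fair", fair)]:
    scores = {"uniform": uniform_mdl(data), "bernoulli": bernoulli_mdl(data)}
    best = min(scores, key=scores.get)
    print(name, {k: round(v, 1) for k, v in scores.items()}, "->", best)
```

On the biased sequence the Bernoulli model wins (about 50 bits vs 100); on the fair one its extra \(\frac{1}{2}\log_2 100 \approx 3.3\) bits of parameter cost buy nothing.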
11.2 Two-Part Codes
The simplest form of MDL uses a two-part code:
- Encode the model: \(L(M)\) bits. For a parametric model with \(k\) parameters, Rissanen showed \(L(M) \approx \frac{k}{2}\log n\) bits (the BIC cost).
- Encode the data given the model: \(L(D\mid M)\) bits. For a Gaussian noise model with variance \(\hat\sigma^2\): \(L(D\mid M) = \frac{n}{2}\log(2\pi e\,\hat\sigma^2)\).
For polynomial regression of degree \(d\) with \(k = d+1\) parameters:
\[ \text{MDL}(d) = \underbrace{\frac{d+1}{2}\log_2 n}_{L(M)} + \underbrace{\frac{n}{2}\log_2(2\pi e\,\hat\sigma^2)}_{L(D\mid M)} \]
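Plugging assumed residual variances into this formula shows the selection mechanism directly (the \(\hat\sigma^2\) values below are invented for illustration; in practice they come from least-squares fits):

```python
import math

def poly_mdl(d, sigma2_hat, n):
    """MDL(d) in bits: (d+1)/2 * log2(n) + n/2 * log2(2*pi*e*sigma2_hat)."""
    return (d + 1) / 2 * math.log2(n) + n / 2 * math.log2(2 * math.pi * math.e * sigma2_hat)

# Hypothetical variances: the fit improves sharply up to d = 2, then barely at all
n = 40
sigma2 = {1: 1.20, 2: 0.16, 3: 0.155, 4: 0.154}
scores = {d: poly_mdl(d, s, n) for d, s in sigma2.items()}
print(min(scores, key=scores.get))   # -> 2
```

Each extra coefficient costs \(\frac{1}{2}\log_2 40 \approx 2.7\) bits of \(L(M)\), so degrees above 2 are rejected because their tiny variance reductions save fewer bits in \(L(D\mid M)\).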
11.3 Connections: BIC, AIC, and Bayesian Inference
- MDL (two-part): \(\frac{k}{2}\log n - \log\hat{L}\). Information-theoretic; consistent; asymptotically equivalent to BIC.
- BIC (Schwarz): \(k\log n - 2\log\hat{L}\). Approximates the log marginal likelihood; consistent.
- AIC (Akaike): \(2k - 2\log\hat{L}\). Minimizes expected prediction error; not consistent.
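The three criteria can disagree: AIC's constant per-parameter cost is lighter than the \(\log n\) penalties of MDL and BIC once \(n > e^2\). A minimal sketch with invented log-likelihoods (the numbers are hypothetical, chosen only to exhibit a disagreement):

```python
import math

def criteria(loglik, k, n):
    """The three penalized scores above, in natural-log units (lower is better)."""
    mdl = 0.5 * k * math.log(n) - loglik   # two-part MDL
    bic = k * math.log(n) - 2 * loglik     # = 2 * MDL, so the same argmin
    aic = 2 * k - 2 * loglik
    return mdl, bic, aic

# Hypothetical fits on n = 100 points: a 3-parameter vs. a 10-parameter model
for k, loglik in [(3, -120.0), (10, -112.0)]:
    mdl, bic, aic = criteria(loglik, k, n=100)
    print(f"k={k:2d}  MDL={mdl:7.2f}  BIC={bic:7.2f}  AIC={aic:7.2f}")
```

Here MDL and BIC prefer the 3-parameter model while AIC prefers the 10-parameter one: the 8 nats of extra log-likelihood outweigh AIC's penalty of 2 per parameter but not BIC's \(\log 100 \approx 4.6\).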
MDL and Bayesian inference are deeply connected. If \(p(M)\) is a prior and \(p(D\mid M)\) a likelihood, then \(-\log p(M) - \log p(D\mid M)\) is the description length of \((M, D)\) under the Shannon code built from the prior \(p(M)\), so minimizing description length is exactly MAP model selection under that prior. Refined MDL corresponds asymptotically to Bayesian inference with the Jeffreys (reference) prior.
11.4 Normalized Maximum Likelihood (NML)
Rissanenβs refined MDL uses the Normalized Maximum Likelihood (NML) distribution, which provides a single-part code that is minimax optimal:
\[ p_{\text{NML}}(D\mid\mathcal{M}) = \frac{p(D\mid\hat\theta(D), \mathcal{M})}{\sum_{D'} p(D'\mid\hat\theta(D'), \mathcal{M})} \]
The denominator, the parametric complexity of the model class \(\mathcal{M}\), measures how many distinct data patterns the model class can fit well. For regular parametric models its logarithm is \(\frac{k}{2}\log\frac{n}{2\pi} + O(1)\), recovering the BIC/MDL penalty.
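For the Bernoulli model class the NML normalizer can be computed exactly by grouping the \(2^n\) binary sequences by their number of heads (a sketch; the \(\frac{1}{2}\log_2 n\) growth for \(k = 1\) is visible in the output):

```python
import math

def bernoulli_nml_complexity(n):
    """Exact parametric complexity of the Bernoulli class on n trials:
    COMP(n) = sum_k C(n, k) * (k/n)^k * ((n-k)/n)^(n-k),
    i.e. the NML denominator with sequences grouped by head count."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        # maximized likelihood (k/n)^k * (1 - k/n)^(n-k), with 0^0 = 1
        ml = (p ** k) * ((1 - p) ** (n - k))
        total += math.comb(n, k) * ml
    return total

for n in [10, 100, 1000]:
    comp = bernoulli_nml_complexity(n)
    print(n, round(math.log2(comp), 3))   # grows like (1/2) log2 n + const
```

At \(n = 1\) the complexity is exactly 2: both one-flip sequences get maximized likelihood 1, so the model class "fits" every dataset perfectly and the normalizer counts that flexibility.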
Python: MDL Model Selection
Generates 40 noisy points from a true degree-2 polynomial. Fits polynomials of degrees 1-10. Computes two-part MDL, BIC, and AIC for each. Four plots: the data with selected fits, the MDL decomposition into \(L(M)\) and \(L(D\mid M)\), the MDL vs BIC vs AIC comparison, and the bias-variance tradeoff via RSS and total MDL.
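A minimal non-plotting sketch of that experiment (the polynomial coefficients, noise level, and random seed are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.linspace(-1, 1, n)
y = 0.5 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.4, n)   # true degree is 2

results = {}
for d in range(1, 11):
    k = d + 1                                    # number of polynomial coefficients
    coeffs = np.polyfit(x, y, d)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    sigma2 = rss / n                             # ML noise-variance estimate
    loglik = -n / 2 * (np.log(2 * np.pi * sigma2) + 1)
    mdl = k / 2 * np.log(n) - loglik             # two-part MDL (nats)
    bic = k * np.log(n) - 2 * loglik
    aic = 2 * k - 2 * loglik
    results[d] = (mdl, bic, aic, rss)

for i, name in enumerate(["MDL", "BIC", "AIC"]):
    best = min(results, key=lambda d: results[d][i])
    print(f"{name} selects degree {best}")
```

The argmin of each criterion is unchanged by the choice of log base, so working in nats here is equivalent to the bit counts used elsewhere in the chapter.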