Minimum Description Length
Rissanen's MDL principle: the best model is the one that compresses the data most. A rigorous information-theoretic foundation for Occam's razor and model selection.
11.1 The MDL Principle
Given data \(D\) and a class of models \(\mathcal{M}\), the Minimum Description Length principle selects the model \(M^*\) that minimizes the total description length:
\[ M^* = \arg\min_{M\in\mathcal{M}}\bigl[L(M) + L(D \mid M)\bigr] \]
- \(L(M)\): length of the model description (complexity, number of parameters)
- \(L(D\mid M)\): length of the data given the model (goodness-of-fit, residuals)
Simple models have short \(L(M)\) but poor fit (large \(L(D\mid M)\)); complex models fit well but have large \(L(M)\). MDL finds the optimal tradeoff: Occam's razor, quantified in bits.
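The tradeoff can be seen in a toy sketch (not from the text): a zero-parameter uniform code is compared against a one-parameter Bernoulli model whose parameter cost is taken to be \(\frac{1}{2}\log_2 n\) bits, as in the two-part codes below. On strongly biased data the parameter pays for itself; on balanced data it does not.

```python
import math

def bernoulli_mdl(bits):
    """Two-part MDL score (in bits) for a one-parameter Bernoulli model."""
    n, ones = len(bits), sum(bits)
    p = ones / n
    # L(M): one real parameter at precision 1/sqrt(n) -> (1/2) log2 n bits
    l_model = 0.5 * math.log2(n)
    # L(D|M): n * H(p_hat) bits under the Shannon code for Bernoulli(p_hat)
    h = 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return l_model + n * h

def uniform_mdl(bits):
    """Zero-parameter model: every n-bit sequence costs exactly n bits."""
    return float(len(bits))

biased = [1] * 90 + [0] * 10   # strongly biased data
fair = [1, 0] * 50             # balanced data
for name, data in [("biased", biased), ("fair", fair)]:
    scores = {"uniform": uniform_mdl(data), "bernoulli": bernoulli_mdl(data)}
    best = min(scores, key=scores.get)
    print(name, {k: round(v, 1) for k, v in scores.items()}, "->", best)
```

On the biased sequence the Bernoulli model wins (about 50 bits vs 100); on the fair one its extra \(\frac{1}{2}\log_2 100 \approx 3.3\) bits of parameter cost buy nothing.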
11.2 Two-Part Codes
The simplest form of MDL uses a two-part code:
- Encode the model: \(L(M)\) bits. For a parametric model with \(k\) parameters, Rissanen showed \(L(M) \approx \frac{k}{2}\log n\) bits (the BIC cost).
- Encode the data given the model: \(L(D\mid M)\) bits. For a Gaussian noise model with variance \(\hat\sigma^2\): \(L(D\mid M) = \frac{n}{2}\log(2\pi e\,\hat\sigma^2)\).
For polynomial regression of degree \(d\) with \(k = d+1\) parameters:
\[ \text{MDL}(d) = \underbrace{\frac{d+1}{2}\log_2 n}_{L(M)} + \underbrace{\frac{n}{2}\log_2(2\pi e\,\hat\sigma^2)}_{L(D\mid M)} \]
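Plugging assumed residual variances into this formula shows the selection mechanism directly (the \(\hat\sigma^2\) values below are invented for illustration; in practice they come from least-squares fits):

```python
import math

def poly_mdl(d, sigma2_hat, n):
    """MDL(d) in bits: (d+1)/2 * log2(n) + n/2 * log2(2*pi*e*sigma2_hat)."""
    return (d + 1) / 2 * math.log2(n) + n / 2 * math.log2(2 * math.pi * math.e * sigma2_hat)

# Hypothetical variances: the fit improves sharply up to d = 2, then barely at all
n = 40
sigma2 = {1: 1.20, 2: 0.16, 3: 0.155, 4: 0.154}
scores = {d: poly_mdl(d, s, n) for d, s in sigma2.items()}
print(min(scores, key=scores.get))   # -> 2
```

Each extra coefficient costs \(\frac{1}{2}\log_2 40 \approx 2.7\) bits of \(L(M)\), so degrees above 2 are rejected because their tiny variance reductions save fewer bits in \(L(D\mid M)\).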
11.3 Connections: BIC, AIC, and Bayesian Inference
- MDL (two-part): \(\frac{k}{2}\log n - \log\hat{L}\). Information-theoretic; consistent; asymptotically equivalent to BIC.
- BIC (Schwarz): \(k\log n - 2\log\hat{L}\). Approximates the log marginal likelihood; consistent.
- AIC (Akaike): \(2k - 2\log\hat{L}\). Minimizes expected prediction error; not consistent.
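The three criteria can disagree: AIC's constant per-parameter cost is lighter than the \(\log n\) penalties of MDL and BIC once \(n > e^2\). A minimal sketch with invented log-likelihoods (the numbers are hypothetical, chosen only to exhibit a disagreement):

```python
import math

def criteria(loglik, k, n):
    """The three penalized scores above, in natural-log units (lower is better)."""
    mdl = 0.5 * k * math.log(n) - loglik   # two-part MDL
    bic = k * math.log(n) - 2 * loglik     # = 2 * MDL, so the same argmin
    aic = 2 * k - 2 * loglik
    return mdl, bic, aic

# Hypothetical fits on n = 100 points: a 3-parameter vs. a 10-parameter model
for k, loglik in [(3, -120.0), (10, -112.0)]:
    mdl, bic, aic = criteria(loglik, k, n=100)
    print(f"k={k:2d}  MDL={mdl:7.2f}  BIC={bic:7.2f}  AIC={aic:7.2f}")
```

Here MDL and BIC prefer the 3-parameter model while AIC prefers the 10-parameter one: the 8 nats of extra log-likelihood outweigh AIC's penalty of 2 per parameter but not BIC's \(\log 100 \approx 4.6\).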
MDL and Bayesian inference are deeply connected. If \(p(M)\) is a prior and \(p(D\mid M)\) a likelihood, then \(-\log p(M) - \log p(D\mid M)\) is the description length of \((M, D)\) under the Shannon code built from the prior \(p(M)\), so minimizing description length is exactly MAP model selection under that prior. Refined MDL corresponds asymptotically to Bayesian inference with the Jeffreys (reference) prior.
11.4 Normalized Maximum Likelihood (NML)
Rissanenβs refined MDL uses the Normalized Maximum Likelihood (NML) distribution, which provides a single-part code that is minimax optimal:
\[ p_{\text{NML}}(D\mid\mathcal{M}) = \frac{p(D\mid\hat\theta(D), \mathcal{M})}{\sum_{D'} p(D'\mid\hat\theta(D'), \mathcal{M})} \]
The denominator, the parametric complexity of the model class \(\mathcal{M}\), measures how many distinct data patterns the model class can fit well. For regular parametric models its logarithm is \(\frac{k}{2}\log\frac{n}{2\pi} + O(1)\), recovering the BIC/MDL penalty.
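For the Bernoulli model class the NML normalizer can be computed exactly by grouping the \(2^n\) binary sequences by their number of heads (a sketch; the \(\frac{1}{2}\log_2 n\) growth for \(k = 1\) is visible in the output):

```python
import math

def bernoulli_nml_complexity(n):
    """Exact parametric complexity of the Bernoulli class on n trials:
    COMP(n) = sum_k C(n, k) * (k/n)^k * ((n-k)/n)^(n-k),
    i.e. the NML denominator with sequences grouped by head count."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        # maximized likelihood (k/n)^k * (1 - k/n)^(n-k), with 0^0 = 1
        ml = (p ** k) * ((1 - p) ** (n - k))
        total += math.comb(n, k) * ml
    return total

for n in [10, 100, 1000]:
    comp = bernoulli_nml_complexity(n)
    print(n, round(math.log2(comp), 3))   # grows like (1/2) log2 n + const
```

At \(n = 1\) the complexity is exactly 2: both one-flip sequences get maximized likelihood 1, so the model class "fits" every dataset perfectly and the normalizer counts that flexibility.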
Python: MDL Model Selection
Generates 40 noisy points from a true degree-2 polynomial. Fits polynomials of degrees 1-10. Computes two-part MDL, BIC, and AIC for each. Four plots: the data with selected fits, the MDL decomposition into \(L(M)\) and \(L(D\mid M)\), the MDL vs BIC vs AIC comparison, and the bias-variance tradeoff via RSS and total MDL.
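A minimal non-plotting sketch of that experiment (the polynomial coefficients, noise level, and random seed are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.linspace(-1, 1, n)
y = 0.5 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.4, n)   # true degree is 2

results = {}
for d in range(1, 11):
    k = d + 1                                    # number of polynomial coefficients
    coeffs = np.polyfit(x, y, d)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    sigma2 = rss / n                             # ML noise-variance estimate
    loglik = -n / 2 * (np.log(2 * np.pi * sigma2) + 1)
    mdl = k / 2 * np.log(n) - loglik             # two-part MDL (nats)
    bic = k * np.log(n) - 2 * loglik
    aic = 2 * k - 2 * loglik
    results[d] = (mdl, bic, aic, rss)

for i, name in enumerate(["MDL", "BIC", "AIC"]):
    best = min(results, key=lambda d: results[d][i])
    print(f"{name} selects degree {best}")
```

The argmin of each criterion is unchanged by the choice of log base, so working in nats here is equivalent to the bit counts used elsewhere in the chapter.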