Source Coding Theorem
Entropy is not just an abstract measure of uncertainty; it is the precise minimum rate at which a source can be compressed. Shannon's first theorem makes this exact.
1. Lossless Compression: The Fundamental Question
Lossless compression asks: given a source producing symbols \(X_1, X_2, \ldots\) drawn i.i.d. from a distribution \(p(x)\), what is the shortest binary description we can produce such that the original can be recovered exactly?
The naïve approach is a fixed-length code: assign every symbol the same number of bits. An alphabet of size \(n\) needs \(\lceil \log_2 n \rceil\) bits per symbol. This ignores the probability structure entirely and is wasteful whenever the distribution is non-uniform.
A variable-length code assigns shorter codewords to more probable symbols. Morse code is a classic example: “E” (the most common letter) is a single dot, while “Z” is dash-dash-dot-dot. Shannon asked: how short can we make the average codeword?
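The savings are easy to quantify. The sketch below compares a fixed-length code against a hand-built prefix-free code for a hypothetical four-symbol source (the distribution and codewords are illustrative, not from the text):

```python
# Average bits per symbol: fixed-length vs. a hand-built prefix-free code,
# for a hypothetical 4-symbol source.
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Fixed-length: ceil(log2 n) bits for every symbol, probabilities ignored.
fixed = math.ceil(math.log2(len(p)))            # 2 bits

# Variable-length prefix-free code: shorter words for likelier symbols.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
variable = sum(p[s] * len(code[s]) for s in p)  # 0.5*1 + 0.25*2 + 2*(0.125*3)

print(fixed, variable)  # 2 bits vs. 1.75 bits per symbol on average
```

Note that no codeword above is a prefix of another, which is exactly the property the next section formalizes.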
2. The Kraft Inequality
A convenient sufficient condition for a code to be uniquely decodable without delimiters is that it be prefix-free: no codeword is a prefix of another. Prefix-free codes can be decoded instantaneously, symbol by symbol. Kraft proved (and McMillan later extended to all uniquely decodable codes):
\[ \sum_{i=1}^{n} 2^{-\ell_i} \leq 1 \]
where \(\ell_i\) is the length of the codeword for symbol \(i\).
The Kraft inequality is both necessary and sufficient: a set of codeword lengths \(\{\ell_i\}\) corresponds to a valid prefix-free code if and only if the Kraft sum is at most 1.
Geometrically, each codeword of length \(\ell\) “uses up” a fraction \(2^{-\ell}\) of the unit interval on the binary tree, and the prefix-free constraint prevents these intervals from overlapping.
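The Kraft sum is a one-line computation; a minimal sketch (the length sets are made up for illustration):

```python
# Kraft sum for a set of codeword lengths: a prefix-free code with these
# lengths exists iff the sum is at most 1.
def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a complete prefix-free code exists
print(kraft_sum([1, 1, 2]))     # 1.25 -> no prefix-free code is possible
```

A sum of exactly 1 means the code is complete: every leaf of the binary tree is used and no codeword can be shortened.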
3. Shannon's Source Coding Theorem
Theorem (Shannon, 1948)
Let \(X\) be a discrete random variable with entropy \(H(X)\). For any uniquely decodable code with average length \(\bar{\ell} = \sum_i p_i \ell_i\):
\[ H(X) \leq \bar{\ell} \]
Furthermore, there exists a prefix-free code achieving:
\[ H(X) \leq \bar{\ell} < H(X) + 1 \]
The lower bound follows from the Kraft inequality via Lagrange optimization: minimizing \(\bar{\ell} = \sum p_i \ell_i\) subject to \(\sum 2^{-\ell_i} \leq 1\) yields the ideal lengths \(\ell_i^* = -\log_2 p_i\), giving \(\bar{\ell}^* = H(X)\).
The problem is that \(-\log_2 p_i\) is generally not an integer. Rounding up to \(\ell_i = \lceil -\log_2 p_i \rceil\) gives integer codeword lengths that satisfy Kraft (since \(\sum 2^{-\lceil -\log p_i \rceil} \leq \sum p_i = 1\)) and achieve average length strictly less than \(H(X) + 1\).
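This rounding scheme (the Shannon code) can be checked directly; the distribution below is a hypothetical example:

```python
# Shannon code lengths: round the ideal lengths -log2(p_i) up to integers,
# then verify the Kraft inequality and the bound H <= avg < H + 1.
import math

def shannon_lengths(probs):
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]          # hypothetical source distribution
lengths = shannon_lengths(probs)       # [2, 2, 3, 4]
H = -sum(p * math.log2(p) for p in probs)
avg = sum(p * l for p, l in zip(probs, lengths))

assert sum(2.0 ** -l for l in lengths) <= 1.0   # Kraft is satisfied
assert H <= avg < H + 1                         # Shannon's bound holds
print(lengths, round(H, 3), round(avg, 3))
```

The gap between \(\bar{\ell}\) and \(H(X)\) here comes entirely from rounding; it vanishes when every \(-\log_2 p_i\) is already an integer.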
For block coding over \(n\) symbols at a time, the bound tightens to \(H(X) \leq \bar{\ell}/n < H(X) + 1/n\), converging to entropy as \(n \to \infty\).
4. Typical Sequences and the AEP
Shannon's deeper proof of the source coding theorem uses the Asymptotic Equipartition Property (AEP), the information-theoretic law of large numbers.
AEP
\[ -\frac{1}{n} \log_2 p(X_1, \ldots, X_n) \xrightarrow{P} H(X) \]
For large \(n\), most sequences have probability close to \(2^{-nH(X)}\). The set of such typical sequences has probability approaching 1 and contains roughly \(2^{nH(X)}\) elements.
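The AEP convergence can be observed numerically. The sketch below uses a hypothetical three-symbol source; the sample value of \(-\tfrac{1}{n}\log_2 p(X_1,\ldots,X_n)\) drifts toward \(H(X)\) as \(n\) grows:

```python
# Empirical check of the AEP: -(1/n) log2 p(x_1..x_n) -> H(X) in probability.
import math
import random

random.seed(0)
probs = {"a": 0.7, "b": 0.2, "c": 0.1}   # hypothetical source
H = -sum(p * math.log2(p) for p in probs.values())   # ~1.157 bits

def empirical_rate(n):
    # Draw an i.i.d. length-n sequence and compute its per-symbol log-loss.
    seq = random.choices(list(probs), weights=list(probs.values()), k=n)
    return -sum(math.log2(probs[s]) for s in seq) / n

for n in (10, 100, 10000):
    print(n, round(empirical_rate(n), 3))  # approaches H as n grows
```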
This gives an elegant compression scheme:
- Enumerate the \(\approx 2^{nH(X)}\) typical sequences.
- Assign each a binary index of length \(\lceil nH(X) \rceil + 1\) bits.
- Encode typical sequences with this index; atypical sequences with a flag + raw bits.
- As \(n \to \infty\), atypical sequences occur with vanishing probability, so the average rate approaches \(H(X)\).
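For a small Bernoulli source the typical set can be enumerated by brute force. The sketch below (parameters chosen for illustration; \(n\) is far too small for the probability to be near 1 yet) counts the weakly typical sequences and their total probability:

```python
# Enumerate weakly typical length-n binary sequences for a Bernoulli(p)
# source: those with -(1/n) log2 P(x^n) within eps of H. Count ~ 2^{nH}.
import math
from itertools import product

p, n, eps = 0.2, 16, 0.2
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # ~0.722 bits

typical, total_prob = 0, 0.0
for seq in product([0, 1], repeat=n):
    k = sum(seq)                                        # number of 1s
    logp = k * math.log2(p) + (n - k) * math.log2(1 - p)
    if abs(-logp / n - H) <= eps:
        typical += 1
        total_prob += 2.0 ** logp

print(typical, round(2 ** (n * H)), round(total_prob, 3))
```

Even at \(n = 16\) the typical set (2,500 sequences) is a tiny fraction of all \(2^{16} = 65{,}536\) sequences while already carrying most of the probability mass; both effects sharpen as \(n\) grows.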
No compression scheme can do better: any code with rate below \(H(X)\) must cause reconstruction errors on a positive fraction of sequences.
5. Preview: Rate-Distortion Theory
The source coding theorem applies to lossless compression. What if we are willing to accept some distortion (as in JPEG images or MP3 audio)?
The rate-distortion function \(R(D)\) gives the minimum rate needed to describe the source with average distortion at most \(D\):
\[ R(D) = \min_{p(\hat{x}|x) : \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X}) \]
At \(D = 0\) (lossless), \(R(0) = H(X)\), recovering the source coding theorem. For a Gaussian source with squared-error distortion, \(R(D) = \max(0,\, \tfrac{1}{2}\log(\sigma^2/D))\); for multi-dimensional Gaussian sources this generalizes to the famous “reverse water-filling” solution. We will study rate-distortion theory in depth in Part III.
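The Gaussian rate-distortion formula is simple enough to evaluate directly; a minimal sketch with \(\sigma^2 = 1\):

```python
# Gaussian rate-distortion function: R(D) = max(0, 0.5 * log2(sigma^2 / D)).
import math

def R(D, sigma2=1.0):
    if D <= 0:
        return float("inf")          # lossless description of a continuous source
    return max(0.0, 0.5 * math.log2(sigma2 / D))

for D in (0.125, 0.25, 0.5, 1.0, 2.0):
    print(D, R(D))                   # rate falls to 0 once D >= sigma^2
```

Each extra bit of rate cuts the allowed distortion by a factor of 4, and once \(D \geq \sigma^2\) no bits are needed at all (just output the mean).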
Fixed-Length vs Variable-Length Coding
For a dyadic distribution (one in which every probability \(p_i\) is a power of 2), Huffman coding achieves the entropy bound exactly, because all \(-\log_2 p_i\) are integers.
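A compact way to see this is to run Huffman's algorithm on a dyadic example and compare the average length with the entropy (the implementation below tracks only codeword lengths, not the codewords themselves):

```python
# Huffman codeword lengths via repeated merging of the two least probable
# subtrees; for a dyadic distribution the average length equals the entropy.
import heapq
import math

def huffman_lengths(probs):
    # Heap entries: (subtree probability, tiebreak id, symbol indices inside).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:        # merging pushes every contained symbol
            lengths[i] += 1      # one level deeper in the code tree
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]               # dyadic: all powers of 2
lengths = huffman_lengths(probs)                 # [1, 2, 3, 3] = -log2(p_i)
H = -sum(p * math.log2(p) for p in probs)
avg = sum(p * l for p, l in zip(probs, lengths))
print(lengths, avg, H)                           # avg == H == 1.75 exactly
```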
Python Simulation: Source Coding in Action
Four panels: (1) average code length converging to entropy with block size, (2) Kraft inequality for various code designs, (3) Huffman vs ideal lengths, (4) compression efficiency across code types.