Chapter 4: Huffman & Arithmetic Codes
Source coding answers a precise question: how short can we make our messages? Shannon's entropy \( H(X) \) is the ultimate lower bound. Huffman codes come within one bit of it per symbol; arithmetic coding closes the remaining gap by encoding an entire sequence as a single interval.
Prefix-Free Codes & Kraft's Inequality
A prefix-free (instantaneous) code assigns codewords such that no codeword is a prefix of another. This allows unique decodability without a separator: the decoder recognises each codeword as it arrives.
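As a minimal sketch, assuming a hypothetical codebook {a: 0, b: 10, c: 110, d: 111} (which satisfies the prefix condition), decoding is a single left-to-right pass:

```python
# Hypothetical prefix-free codebook for illustration.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {v: k for k, v in CODE.items()}

def decode(bits):
    """Emit a symbol the moment the buffered bits match a codeword;
    since no codeword prefixes another, the first match is unambiguous."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

encoded = "".join(CODE[s] for s in "abcad")
print(encoded, "->", decode(encoded))  # 0101100111 -> abcad
```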
Kraft's Inequality
A binary prefix-free code with codeword lengths \( \ell_1, \ell_2, \ldots, \ell_m \) exists if and only if:
\( \displaystyle\sum_{i=1}^{m} 2^{-\ell_i} \leq 1 \)
Equality holds for complete codes (no wasted leaves in the code tree). Shannon showed that choosing \( \ell_i = \lceil -\log_2 p_i \rceil \) satisfies Kraft's inequality, giving average length \( L < H(X) + 1 \).
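A quick numeric check of Shannon's construction, assuming a hypothetical five-symbol distribution:

```python
import math

p = {"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05}  # hypothetical source

lengths = {s: math.ceil(-math.log2(q)) for s, q in p.items()}  # Shannon lengths
kraft = sum(2 ** -l for l in lengths.values())
H = -sum(q * math.log2(q) for q in p.values())
L = sum(p[s] * lengths[s] for s in p)

print(lengths)   # {'a': 2, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
print(kraft)     # 0.71875 <= 1, so a prefix-free code with these lengths exists
print(H, L)      # H <= L < H + 1
```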
Source Coding Theorem (Tight)
For any uniquely decodable code with average length \( L \):
\( H(X) \leq L < H(X) + 1 \)
Huffman codes achieve the lower bound exactly when probabilities are dyadic (\( p_i = 2^{-k} \)), and are always within 1 bit.
Block Coding
Coding \( n \)-symbol blocks reduces the per-symbol overhead:
\( H(X) \leq L_n/n < H(X) + 1/n \)
As \( n \to \infty \), the average length \( L_n/n \to H(X) \). Arithmetic coding achieves this without exponential-size codebooks.
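To see the \( 1/n \) overhead shrink, one can compute Huffman average lengths on product alphabets. The sketch below assumes a hypothetical skewed binary source and uses the fact that a Huffman code's average length equals the sum of all merged-node probabilities:

```python
import heapq
import itertools
import math

def huffman_avg_length(probs):
    """Average Huffman codeword length: each merge adds one bit to every
    leaf below it, so summing merged-node probabilities gives avg depth."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        x, y = heapq.heappop(heap), heapq.heappop(heap)
        total += x + y
        heapq.heappush(heap, x + y)
    return total

p = {"0": 0.9, "1": 0.1}  # hypothetical skewed binary source
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4):
    blocks = [math.prod(p[s] for s in word)
              for word in itertools.product(p, repeat=n)]
    print(n, huffman_avg_length(blocks) / n)  # falls toward H ~ 0.469
```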
Huffman Tree Example: alphabet a(0.4), b(0.25), c(0.2), d(0.1), e(0.05)
The Huffman Algorithm
David Huffman (1952) proved that a greedy bottom-up merging procedure produces an optimal prefix-free code. The key insight is that the two least-probable symbols should be siblings at the deepest level of the code tree.
Algorithm (Min-Heap Version)
- Create a leaf node for each symbol with its probability; insert all into a min-heap (priority queue ordered by frequency).
- While the heap has more than one node:
- Pop the two nodes \( x, y \) with smallest frequencies.
- Create an internal node \( z \) with \( p(z) = p(x) + p(y) \).
- Make \( x \) left child (bit 0) and \( y \) right child (bit 1).
- Push \( z \) back into the heap.
- The remaining node is the root. Traverse the tree to read off codewords.
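The steps above can be sketched directly with Python's `heapq` (a hypothetical five-symbol alphabet; the integer tie-breaker only keeps tuple comparisons well-defined when frequencies tie):

```python
import heapq

def huffman_code(freqs):
    """Build a prefix-free codebook from a {symbol: probability} map."""
    # Heap entries: (probability, tie-breaker, tree). A tree is either
    # a bare symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)                 # two least-probable nodes
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, tick, (x, y)))  # merge them as siblings
        tick += 1
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")    # left child: bit 0
            walk(node[1], prefix + "1")    # right child: bit 1
        else:
            code[node] = prefix or "0"     # 1-symbol alphabet edge case
    walk(heap[0][2], "")
    return code

code = huffman_code({"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05})
print(code)  # 'a' gets the shortest codeword; 'd' and 'e' the longest
```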
Optimality Proof Sketch
Lemma 1: In any optimal code, the two symbols with the lowest probabilities have the longest codewords and are siblings.
Lemma 2: Merging \( x \) and \( y \) into a single symbol \( z \) with \( p_z = p_x + p_y \) reduces average length by exactly \( p_z \):
\( L_{\text{original}} = L_{\text{reduced}} + p_x + p_y \)
By induction: if the merged code is optimal for the reduced alphabet, the expanded code is optimal for the original alphabet. Complexity: \( O(m \log m) \) for \( m \) symbols.
Limitations of Huffman Coding
- Codeword lengths must be integers, so each symbol can waste up to 1 bit; for highly skewed distributions the overhead is severe.
- Symbol probabilities must be known (or estimated) in advance, and the codebook must be rebuilt when they change.
- Block coding shrinks the per-symbol overhead, but the codebook grows exponentially in the block length.
Arithmetic Coding
Arithmetic coding (Rissanen & Langdon, 1979) sidesteps the integer codeword constraint entirely. It encodes an entire message as a single number in \( [0, 1) \), subdividing the interval according to cumulative probabilities. The encoded number requires only \( \lceil -\log_2 P(\text{message}) \rceil + 1 \) bits.
Encoding Algorithm
Maintain interval \( [l, h) \), initially \( [0, 1) \). For each symbol \( x_t \):
\( l' = l + (h-l)\,F(x_t) \)
\( h' = l + (h-l)\,F(x_t + 1) \)
where \( F(x) = \sum_{y < x} p(y) \) is the cumulative distribution. Output any binary fraction inside the final interval.
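A sketch of the encoder, assuming a hypothetical three-symbol model and using exact `Fraction` arithmetic to sidestep precision issues. It emits \( \lceil -\log_2 w \rceil + 1 \) bits of the final interval's midpoint, where \( w = h - l \); a dyadic interval that narrow is guaranteed to fit inside \( [l, h) \):

```python
from fractions import Fraction
import math

# Hypothetical 3-symbol model for illustration.
p = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
F, acc = {}, Fraction(0)            # F(x): cumulative mass before x
for s in p:
    F[s], acc = acc, acc + p[s]

def encode(msg):
    l, h = Fraction(0), Fraction(1)
    for x in msg:                   # shrink [l, h) to the symbol's slice
        w = h - l
        l, h = l + w * F[x], l + w * (F[x] + p[x])
    # ceil(-log2 w) + 1 bits of the midpoint pin down a dyadic interval
    # [t, t + 2^-k) that lies entirely inside [l, h).
    k = math.ceil(-math.log2(h - l)) + 1
    m, bits = (l + h) / 2, ""
    for _ in range(k):
        m *= 2
        if m >= 1:
            bits, m = bits + "1", m - 1
        else:
            bits += "0"
    return bits

msg = "aabac"
bits = encode(msg)
P = math.prod(p[x] for x in msg)                       # P(message) = 1/128
print(bits, len(bits), math.ceil(-math.log2(P)) + 1)   # 8 bits, within the bound
```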
Performance Guarantee
For a message \( x_1^n \) with probability \( P(x_1^n) \), the code length satisfies:
\( -\log_2 P(x_1^n) \leq L(x_1^n) \leq -\log_2 P(x_1^n) + 2 \)
Per symbol: \( H(X) \leq L_n/n < H(X) + 2/n \to H(X) \). Arithmetic coding approaches entropy with only a constant 2-bit overhead for any block length.
Comparison: Huffman vs. Arithmetic
| Property | Huffman | Arithmetic |
|---|---|---|
| Granularity | Per symbol | Per message |
| Overhead | Up to 1 bit/symbol | 2 bits/message |
| Adaptivity | Adaptive variant exists | Naturally adaptive |
| Complexity | \( O(m \log m) \) build | \( O(n) \) encode/decode |
| Precision | Not needed | Requires careful finite arithmetic |
| Real-world use | DEFLATE, JPEG, MP3 | LZMA, HEVC (CABAC), JPEG 2000 |
Python: Huffman Coding in Practice
Build a Huffman tree for a sample text string, print the complete codebook, compute average length versus entropy, and visualise symbol frequencies, code lengths, and an arithmetic coding interval diagram.
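One possible sketch of such a script (the sample string is a placeholder; plots are omitted, though any plotting library could render `freq` and the code lengths):

```python
import heapq
import math
from collections import Counter

text = "this is an example of a huffman tree"   # placeholder sample text
freq = Counter(text)
n = len(text)

# Build the Huffman tree: leaves are characters, internal nodes are pairs.
heap = [(c, i, s) for i, (s, c) in enumerate(freq.items())]
heapq.heapify(heap)
tick = len(heap)
while len(heap) > 1:
    cx, _, x = heapq.heappop(heap)
    cy, _, y = heapq.heappop(heap)
    heapq.heappush(heap, (cx + cy, tick, (x, y)))
    tick += 1

code = {}
def walk(node, prefix=""):
    if isinstance(node, tuple):
        walk(node[0], prefix + "0")
        walk(node[1], prefix + "1")
    else:
        code[node] = prefix or "0"
walk(heap[0][2])

H = -sum(c / n * math.log2(c / n) for c in freq.values())
L = sum(freq[s] / n * len(code[s]) for s in freq)

for s in sorted(code, key=lambda s: (len(code[s]), code[s])):
    print(repr(s), freq[s], code[s])
print(f"entropy H = {H:.3f} bits/symbol, Huffman average L = {L:.3f}")
```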