Chapter 4: Huffman & Arithmetic Codes

Source coding answers a precise question: how short can we make our messages? Shannon's entropy \( H(X) \) is the ultimate lower bound. Huffman codes come within one bit of it per symbol; arithmetic coding closes the gap arbitrarily by encoding a long sequence as a single interval.

Prefix-Free Codes & Kraft's Inequality

A prefix-free (instantaneous) code assigns codewords such that no codeword is a prefix of another. This allows unique decodability without a separator: the decoder recognises each codeword as it arrives.

Kraft's Inequality

A binary prefix-free code with codeword lengths \( \ell_1, \ell_2, \ldots, \ell_m \) exists if and only if:

\( \displaystyle\sum_{i=1}^{m} 2^{-\ell_i} \leq 1 \)

Equality holds for complete codes (no wasted leaves in the code tree). Shannon showed that choosing \( \ell_i = \lceil -\log_2 p_i \rceil \) satisfies Kraft's inequality, giving average length \( L < H(X) + 1 \).
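Both facts are easy to check numerically. A minimal sketch, using an illustrative five-symbol distribution:

```python
import math

# Illustrative source distribution (not prescribed by the text)
p = [0.4, 0.25, 0.2, 0.1, 0.05]

# Shannon code lengths: l_i = ceil(-log2 p_i)
lengths = [math.ceil(-math.log2(pi)) for pi in p]

# Kraft sum must be <= 1 for a prefix-free code with these lengths to exist
kraft = sum(2 ** -l for l in lengths)

# Average length lies within 1 bit of the entropy
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(f"lengths = {lengths}, Kraft sum = {kraft}, H = {H:.4f}, L = {L:.4f}")
```

Note that the Kraft sum comes out strictly below 1 here: Shannon's lengths reach equality only when the probabilities are already powers of two.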

Source Coding Theorem (Tight)

For any uniquely decodable code with average length \( L \):

\( H(X) \leq L < H(X) + 1 \)

Huffman codes achieve the lower bound exactly when the probabilities are dyadic (\( p_i = 2^{-k_i} \) for integers \( k_i \)), and are always within 1 bit of it.

Block Coding

Coding \( n \)-symbol blocks reduces the per-symbol overhead:

\( H(X) \leq L_n/n < H(X) + 1/n \)

As \( n \to \infty \), the average length \( L_n/n \to H(X) \). Arithmetic coding achieves this without exponential-size codebooks.
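The shrinking \( 1/n \) overhead can be seen numerically using Shannon-style block codes (length \( \lceil -\log_2 P(\text{block}) \rceil \) per block, a stand-in for block Huffman, which can only do better); the i.i.d. distribution below is illustrative:

```python
import math
from itertools import product

# Illustrative i.i.d. source (not prescribed by the text)
p = {"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05}
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4):
    Ln = 0.0
    for block in product(p, repeat=n):
        P = math.prod(p[s] for s in block)  # block probability (i.i.d. source)
        Ln += P * math.ceil(-math.log2(P))  # Shannon code length for this block
    print(f"n={n}: L_n/n = {Ln / n:.4f}  (H = {H:.4f}, H + 1/n = {H + 1 / n:.4f})")
```

The per-symbol rate \( L_n/n \) visibly tightens toward \( H \) as the block length grows, while the codebook size (\( 5^n \) entries) grows exponentially, which is exactly the cost arithmetic coding avoids.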

Huffman Tree Example — Alphabet: a(0.4), b(0.25), c(0.2), d(0.1), e(0.05)

[Figure: Huffman code tree built by successive merges — e(0.05) + d(0.1) → 0.15; 0.15 + c(0.2) → 0.35; b(0.25) + 0.35 → 0.60; a(0.4) + 0.60 → 1.00]

Codewords: a → 1, b → 01, c → 001, d → 0000, e → 0001
H = 2.04 bits, L = 2.10 bits, efficiency = 97.2%

The Huffman Algorithm

David Huffman (1952) proved that a greedy bottom-up merging procedure produces an optimal prefix-free code. The key insight is that the two least-probable symbols should be siblings at the deepest level of the code tree.

Algorithm (Min-Heap Version)

  1. Create a leaf node for each symbol with its probability; insert all into a min-heap (priority queue ordered by frequency).
  2. While the heap has more than one node:
    • Pop the two nodes \( x, y \) with smallest frequencies.
    • Create an internal node \( z \) with \( p(z) = p(x) + p(y) \).
    • Make \( x \) left child (bit 0) and \( y \) right child (bit 1).
    • Push \( z \) back into the heap.
  3. The remaining node is the root. Traverse the tree to read off codewords.
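The three steps above can be sketched with Python's standard heapq module (the insertion counter is only a tie-breaker, so heap comparisons never reach the tree nodes):

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free codebook from {symbol: probability} by greedy merging."""
    # Heap entries: (probability, tie-breaker, tree); a tree is a symbol or a (left, right) pair
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Pop the two least-probable nodes and merge them under a new internal node
        p_x, _, x = heapq.heappop(heap)
        p_y, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (p_x + p_y, count, (x, y)))
        count += 1
    # Traverse the tree: left edge appends '0', right edge appends '1'
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix or "0"  # degenerate one-symbol alphabet
        return code
    return walk(heap[0][2], "")

codebook = huffman_code({"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05})
print(codebook)
```

The exact bit patterns depend on tie-breaking and left/right orientation, but the codeword lengths (1, 2, 3, 4, 4 for this alphabet) are the same for every optimal code.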

Optimality Proof Sketch

Lemma 1: In any optimal code, the two symbols with the lowest probabilities have the longest codewords and are siblings.

Lemma 2: Merging \( x \) and \( y \) into a single symbol \( z \) with \( p_z = p_x + p_y \) reduces average length by exactly \( p_z \):

\( L_{\text{original}} = L_{\text{reduced}} + p_x + p_y \)

By induction: if the merged code is optimal for the reduced alphabet, the expanded code is optimal for the original alphabet. Complexity: \( O(m \log m) \) for \( m \) symbols.

Limitations of Huffman Coding

Integer constraint: Each symbol gets an integer number of bits, so up to 1 bit per symbol can be wasted; the loss is worst for highly skewed distributions (a symbol with \( p \approx 1 \) still costs a full bit).
Known probabilities: Requires full knowledge of the source distribution. Adaptive Huffman addresses unknown distributions online.
Block overhead: Block Huffman needs to transmit the codebook, which grows exponentially with block size.

Arithmetic Coding

Arithmetic coding (Rissanen & Langdon, 1979) sidesteps the integer codeword constraint entirely. It encodes an entire message as a single number in \( [0, 1) \), subdividing the interval according to cumulative probabilities. The encoded number requires only \( \lceil -\log_2 P(\text{message}) \rceil + 1 \) bits.

Encoding Algorithm

Maintain interval \( [l, h) \), initially \( [0, 1) \). For each symbol \( x_t \):

\( l' = l + (h-l)\,F(x_t) \)
\( h' = l + (h-l)\,\bigl(F(x_t) + p(x_t)\bigr) \)

where \( F(x) = \sum_{y < x} p(y) \) is the cumulative distribution, so \( F(x_t) + p(x_t) \) is the cumulative total up to and including \( x_t \). Output any binary fraction inside the final interval.
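A minimal sketch of this loop, using exact fractions to sidestep precision issues (production coders instead renormalise fixed-width integer intervals); the final step emits the shortest binary fraction lying inside the interval:

```python
import math
from fractions import Fraction

def arithmetic_encode(message, probs):
    """Narrow [l, h) by each symbol's cumulative sub-interval, then emit a binary fraction inside it."""
    # F(x) = sum of p(y) over symbols y that sort before x (the cumulative distribution)
    F, cum = {}, Fraction(0)
    for s in sorted(probs):
        F[s] = cum
        cum += Fraction(probs[s])
    l, h = Fraction(0), Fraction(1)
    for x in message:
        w = h - l
        l, h = l + w * F[x], l + w * (F[x] + Fraction(probs[x]))
    # Shortest binary fraction v = 0.b1...bk with l <= v < h
    k = 0
    while True:
        k += 1
        v = Fraction(math.ceil(l * 2 ** k), 2 ** k)  # smallest k-bit fraction >= l
        if v < h:
            return format(int(v * 2 ** k), f"0{k}b")

bits = arithmetic_encode("aab", {"a": 0.5, "b": 0.25, "c": 0.25})
print(bits)  # message probability 1/16, so at most ceil(4) + 1 = 5 bits are needed
```

For this message the interval is \( [1/8, 3/16) \), and three bits suffice, already below the \( \lceil -\log_2 P \rceil + 1 \) guarantee.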

Performance Guarantee

For a message \( x_1^n \) with probability \( P(x_1^n) \), the code length satisfies:

\( -\log_2 P(x_1^n) \leq L(x_1^n) \leq -\log_2 P(x_1^n) + 2 \)

Per symbol: \( H(X) \leq L_n/n < H(X) + 2/n \to H(X) \). Arithmetic coding approaches entropy with only a constant 2-bit overhead for any block length.

Comparison: Huffman vs. Arithmetic

Property         Huffman                    Arithmetic
---------------  -------------------------  ----------------------------------
Granularity      Per symbol                 Per message
Overhead         Up to 1 bit/symbol         2 bits/message
Adaptivity       Adaptive variant exists    Naturally adaptive
Complexity       \( O(m \log m) \) build    \( O(n) \) encode/decode
Precision        Not needed                 Requires careful finite arithmetic
Real-world use   DEFLATE, JPEG, MP3         LZMA, HEVC, FLAC

Python: Huffman Coding in Practice

Build a Huffman tree for a sample text string, print the complete codebook, compute average length versus entropy, and visualise symbol frequencies, code lengths, and an arithmetic coding interval diagram.
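The script itself is not reproduced here, but the entropy-versus-average-length part of the comparison takes only a few lines; the codeword lengths below are those of the optimal Huffman code for this chapter's example alphabet:

```python
import math

# Example alphabet from this chapter, with its optimal Huffman codeword lengths
probs = {"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05}
lengths = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 4}

H = -sum(p * math.log2(p) for p in probs.values())  # entropy lower bound
L = sum(probs[s] * lengths[s] for s in probs)       # Huffman average length

# Matches the figure: H = 2.04 bits, L = 2.10 bits, efficiency ~ 97.2%
print(f"H = {H:.2f} bits, L = {L:.2f} bits, efficiency = {H / L:.1%}")
```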
