Chapter 4: Huffman & Arithmetic Codes

Source coding answers a precise question: how short can we make our messages? Shannon's entropy \( H(X) \) is the ultimate lower bound. Huffman codes come within one bit of it per symbol; arithmetic coding closes the gap arbitrarily by encoding a long sequence as a single interval.

Prefix-Free Codes & Kraft's Inequality

A prefix-free (instantaneous) code assigns codewords such that no codeword is a prefix of another. This allows unique decodability without a separator: the decoder recognises each codeword as it arrives.

Kraft's Inequality

A binary prefix-free code with codeword lengths \( \ell_1, \ell_2, \ldots, \ell_m \) exists if and only if:

\( \displaystyle\sum_{i=1}^{m} 2^{-\ell_i} \leq 1 \)

Equality holds for complete codes (no wasted leaves in the code tree). Shannon showed that choosing \( \ell_i = \lceil -\log_2 p_i \rceil \) satisfies Kraft's inequality, giving average length \( L < H(X) + 1 \).
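Both facts are easy to check numerically. A minimal sketch, using an illustrative five-symbol distribution:

```python
import math

# Illustrative source distribution (not prescribed by the text)
p = [0.4, 0.25, 0.2, 0.1, 0.05]

# Shannon code lengths: l_i = ceil(-log2 p_i)
lengths = [math.ceil(-math.log2(pi)) for pi in p]

# Kraft sum must be <= 1 for a prefix-free code with these lengths to exist
kraft = sum(2 ** -l for l in lengths)

# Average length lies within 1 bit of the entropy
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(f"lengths = {lengths}, Kraft sum = {kraft}, H = {H:.4f}, L = {L:.4f}")
```

Note that the Kraft sum comes out strictly below 1 here: Shannon's lengths reach equality only when the probabilities are already powers of two.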

Source Coding Theorem (Tight)

For any uniquely decodable code with average length \( L \):

\( H(X) \leq L < H(X) + 1 \)

Huffman codes achieve the lower bound exactly when the probabilities are dyadic (\( p_i = 2^{-k_i} \) for integers \( k_i \)), and are always within 1 bit of it.

Block Coding

Coding \( n \)-symbol blocks reduces the per-symbol overhead:

\( H(X) \leq L_n/n < H(X) + 1/n \)

As \( n \to \infty \), the average length \( L_n/n \to H(X) \). Arithmetic coding achieves this without exponential-size codebooks.
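The shrinking \( 1/n \) overhead can be seen numerically using Shannon-style block codes (length \( \lceil -\log_2 P(\text{block}) \rceil \) per block, a stand-in for block Huffman, which can only do better); the i.i.d. distribution below is illustrative:

```python
import math
from itertools import product

# Illustrative i.i.d. source (not prescribed by the text)
p = {"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05}
H = -sum(q * math.log2(q) for q in p.values())

for n in (1, 2, 4):
    Ln = 0.0
    for block in product(p, repeat=n):
        P = math.prod(p[s] for s in block)  # block probability (i.i.d. source)
        Ln += P * math.ceil(-math.log2(P))  # Shannon code length for this block
    print(f"n={n}: L_n/n = {Ln / n:.4f}  (H = {H:.4f}, H + 1/n = {H + 1 / n:.4f})")
```

The per-symbol rate \( L_n/n \) visibly tightens toward \( H \) as the block length grows, while the codebook size (\( 5^n \) entries) grows exponentially, which is exactly the cost arithmetic coding avoids.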

Huffman Tree Example — Alphabet: a(0.4), b(0.25), c(0.2), d(0.1), e(0.05)

[Figure: Huffman code tree built by successive merges — e(0.05) + d(0.1) → 0.15; 0.15 + c(0.2) → 0.35; b(0.25) + 0.35 → 0.60; a(0.4) + 0.60 → 1.00]

Codewords: a → 1, b → 01, c → 001, d → 0000, e → 0001
H = 2.04 bits, L = 2.10 bits, efficiency = 97.2%

The Huffman Algorithm

David Huffman (1952) proved that a greedy bottom-up merging procedure produces an optimal prefix-free code. The key insight is that the two least-probable symbols should be siblings at the deepest level of the code tree.

Algorithm (Min-Heap Version)

  1. Create a leaf node for each symbol with its probability; insert all into a min-heap (priority queue ordered by frequency).
  2. While the heap has more than one node:
    • Pop the two nodes \( x, y \) with smallest frequencies.
    • Create an internal node \( z \) with \( p(z) = p(x) + p(y) \).
    • Make \( x \) left child (bit 0) and \( y \) right child (bit 1).
    • Push \( z \) back into the heap.
  3. The remaining node is the root. Traverse the tree to read off codewords.
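The three steps above can be sketched with Python's standard heapq module (the insertion counter is only a tie-breaker, so heap comparisons never reach the tree nodes):

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free codebook from {symbol: probability} by greedy merging."""
    # Heap entries: (probability, tie-breaker, tree); a tree is a symbol or a (left, right) pair
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Pop the two least-probable nodes and merge them under a new internal node
        p_x, _, x = heapq.heappop(heap)
        p_y, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (p_x + p_y, count, (x, y)))
        count += 1
    # Traverse the tree: left edge appends '0', right edge appends '1'
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix or "0"  # degenerate one-symbol alphabet
        return code
    return walk(heap[0][2], "")

codebook = huffman_code({"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05})
print(codebook)
```

The exact bit patterns depend on tie-breaking and left/right orientation, but the codeword lengths (1, 2, 3, 4, 4 for this alphabet) are the same for every optimal code.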

Optimality Proof Sketch

Lemma 1: In any optimal code, the two symbols with the lowest probabilities have the longest codewords and are siblings.

Lemma 2: Merging \( x \) and \( y \) into a single symbol \( z \) with \( p_z = p_x + p_y \) reduces average length by exactly \( p_z \):

\( L_{\text{original}} = L_{\text{reduced}} + p_x + p_y \)

By induction: if the merged code is optimal for the reduced alphabet, the expanded code is optimal for the original alphabet. Complexity: \( O(m \log m) \) for \( m \) symbols.

Limitations of Huffman Coding

Integer constraint: Each symbol gets an integer number of bits, so up to 1 bit per symbol can be wasted; the loss is worst for highly skewed distributions (a symbol with \( p \approx 1 \) still costs a full bit).
Known probabilities: Requires full knowledge of the source distribution. Adaptive Huffman addresses unknown distributions online.
Block overhead: Block Huffman needs to transmit the codebook, which grows exponentially with block size.

Arithmetic Coding

Arithmetic coding (Rissanen & Langdon, 1979) sidesteps the integer codeword constraint entirely. It encodes an entire message as a single number in \( [0, 1) \), subdividing the interval according to cumulative probabilities. The encoded number requires only \( \lceil -\log_2 P(\text{message}) \rceil + 1 \) bits.

Encoding Algorithm

Maintain interval \( [l, h) \), initially \( [0, 1) \). For each symbol \( x_t \):

\( l' = l + (h-l)\,F(x_t) \)
\( h' = l + (h-l)\,\bigl(F(x_t) + p(x_t)\bigr) \)

where \( F(x) = \sum_{y < x} p(y) \) is the cumulative distribution, so \( F(x_t) + p(x_t) \) is the cumulative total up to and including \( x_t \). Output any binary fraction inside the final interval.
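A minimal sketch of this loop, using exact fractions to sidestep precision issues (production coders instead renormalise fixed-width integer intervals); the final step emits the shortest binary fraction lying inside the interval:

```python
import math
from fractions import Fraction

def arithmetic_encode(message, probs):
    """Narrow [l, h) by each symbol's cumulative sub-interval, then emit a binary fraction inside it."""
    # F(x) = sum of p(y) over symbols y that sort before x (the cumulative distribution)
    F, cum = {}, Fraction(0)
    for s in sorted(probs):
        F[s] = cum
        cum += Fraction(probs[s])
    l, h = Fraction(0), Fraction(1)
    for x in message:
        w = h - l
        l, h = l + w * F[x], l + w * (F[x] + Fraction(probs[x]))
    # Shortest binary fraction v = 0.b1...bk with l <= v < h
    k = 0
    while True:
        k += 1
        v = Fraction(math.ceil(l * 2 ** k), 2 ** k)  # smallest k-bit fraction >= l
        if v < h:
            return format(int(v * 2 ** k), f"0{k}b")

bits = arithmetic_encode("aab", {"a": 0.5, "b": 0.25, "c": 0.25})
print(bits)  # message probability 1/16, so at most ceil(4) + 1 = 5 bits are needed
```

For this message the interval is \( [1/8, 3/16) \), and three bits suffice, already below the \( \lceil -\log_2 P \rceil + 1 \) guarantee.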

Performance Guarantee

For a message \( x_1^n \) with probability \( P(x_1^n) \), the code length satisfies:

\( -\log_2 P(x_1^n) \leq L(x_1^n) \leq -\log_2 P(x_1^n) + 2 \)

Per symbol: \( H(X) \leq L_n/n < H(X) + 2/n \to H(X) \). Arithmetic coding approaches entropy with only a constant 2-bit overhead for any block length.

Comparison: Huffman vs. Arithmetic

Property         Huffman                    Arithmetic
---------------  -------------------------  ----------------------------------
Granularity      Per symbol                 Per message
Overhead         Up to 1 bit/symbol         2 bits/message
Adaptivity       Adaptive variant exists    Naturally adaptive
Complexity       \( O(m \log m) \) build    \( O(n) \) encode/decode
Precision        Not needed                 Requires careful finite arithmetic
Real-world use   DEFLATE, JPEG, MP3         LZMA, HEVC, FLAC

Python: Huffman Coding in Practice

Build a Huffman tree for a sample text string, print the complete codebook, compute average length versus entropy, and visualise symbol frequencies, code lengths, and an arithmetic coding interval diagram.
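The script itself is not reproduced here, but the entropy-versus-average-length part of the comparison takes only a few lines; the codeword lengths below are those of the optimal Huffman code for this chapter's example alphabet:

```python
import math

# Example alphabet from this chapter, with its optimal Huffman codeword lengths
probs = {"a": 0.4, "b": 0.25, "c": 0.2, "d": 0.1, "e": 0.05}
lengths = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 4}

H = -sum(p * math.log2(p) for p in probs.values())  # entropy lower bound
L = sum(probs[s] * lengths[s] for s in probs)       # Huffman average length

# Matches the figure: H = 2.04 bits, L = 2.10 bits, efficiency ~ 97.2%
print(f"H = {H:.2f} bits, L = {L:.2f} bits, efficiency = {H / L:.1%}")
```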
