Source Coding Theorem
Entropy is not just an abstract measure of uncertainty; it is the precise minimum rate at which a source can be compressed. Shannon's first theorem makes this exact.
1. Lossless Compression: The Fundamental Question
Lossless compression asks: given a source producing symbols \(X_1, X_2, \ldots\) drawn i.i.d. from a distribution \(p(x)\), what is the shortest binary description we can produce such that the original can be recovered exactly?
The naïve approach is a fixed-length code: assign every symbol the same number of bits. An alphabet of size \(n\) needs \(\lceil \log_2 n \rceil\) bits per symbol. This ignores the probability structure entirely and is wasteful whenever the distribution is non-uniform.
A variable-length code assigns shorter codewords to more probable symbols. Morse code is a classic example: “E” (the most common letter) is a single dot, while “Z” is dash-dash-dot-dot. Shannon asked: how short can we make the average codeword?
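The savings are easy to quantify. The sketch below compares a fixed-length code against a hand-built prefix-free code for a hypothetical four-symbol source (the distribution and codewords are illustrative, not from the text):

```python
# Average bits per symbol: fixed-length vs. a hand-built prefix-free code,
# for a hypothetical 4-symbol source.
import math

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Fixed-length: ceil(log2 n) bits for every symbol, probabilities ignored.
fixed = math.ceil(math.log2(len(p)))            # 2 bits

# Variable-length prefix-free code: shorter words for likelier symbols.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
variable = sum(p[s] * len(code[s]) for s in p)  # 0.5*1 + 0.25*2 + 2*(0.125*3)

print(fixed, variable)  # 2 bits vs. 1.75 bits per symbol on average
```

Note that no codeword above is a prefix of another, which is exactly the property the next section formalizes.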
2. The Kraft Inequality
A convenient sufficient condition for a code to be uniquely decodable without delimiters is that it be prefix-free: no codeword is a prefix of another. Prefix-free codes can be decoded instantaneously, symbol by symbol. Kraft proved (and McMillan later extended to all uniquely decodable codes):
\[ \sum_{i=1}^{n} 2^{-\ell_i} \leq 1 \]
where \(\ell_i\) is the length of the codeword for symbol \(i\).
The Kraft inequality is both necessary and sufficient: a set of codeword lengths \(\{\ell_i\}\) corresponds to a valid prefix-free code if and only if the Kraft sum is at most 1.
Geometrically, each codeword of length \(\ell\) “uses up” a fraction \(2^{-\ell}\) of the unit interval on the binary tree, and the prefix-free constraint prevents these intervals from overlapping.
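The Kraft sum is a one-line computation; a minimal sketch (the length sets are made up for illustration):

```python
# Kraft sum for a set of codeword lengths: a prefix-free code with these
# lengths exists iff the sum is at most 1.
def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> a complete prefix-free code exists
print(kraft_sum([1, 1, 2]))     # 1.25 -> no prefix-free code is possible
```

A sum of exactly 1 means the code is complete: every leaf of the binary tree is used and no codeword can be shortened.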
3. Shannon's Source Coding Theorem
Theorem (Shannon, 1948)
Let \(X\) be a discrete random variable with entropy \(H(X)\). For any uniquely decodable code with average length \(\bar{\ell} = \sum_i p_i \ell_i\):
\[ H(X) \leq \bar{\ell} \]
Furthermore, there exists a prefix-free code achieving:
\[ H(X) \leq \bar{\ell} < H(X) + 1 \]
The lower bound follows from the Kraft inequality via Lagrange optimization: minimizing \(\bar{\ell} = \sum p_i \ell_i\) subject to \(\sum 2^{-\ell_i} \leq 1\) yields the ideal lengths \(\ell_i^* = -\log_2 p_i\), giving \(\bar{\ell}^* = H(X)\).
The problem is that \(-\log_2 p_i\) is generally not an integer. Rounding up to \(\ell_i = \lceil -\log_2 p_i \rceil\) gives integer codeword lengths that satisfy Kraft (since \(\sum 2^{-\lceil -\log p_i \rceil} \leq \sum p_i = 1\)) and achieve average length strictly less than \(H(X) + 1\).
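This rounding scheme (the Shannon code) can be checked directly; the distribution below is a hypothetical example:

```python
# Shannon code lengths: round the ideal lengths -log2(p_i) up to integers,
# then verify the Kraft inequality and the bound H <= avg < H + 1.
import math

def shannon_lengths(probs):
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]          # hypothetical source distribution
lengths = shannon_lengths(probs)       # [2, 2, 3, 4]
H = -sum(p * math.log2(p) for p in probs)
avg = sum(p * l for p, l in zip(probs, lengths))

assert sum(2.0 ** -l for l in lengths) <= 1.0   # Kraft is satisfied
assert H <= avg < H + 1                         # Shannon's bound holds
print(lengths, round(H, 3), round(avg, 3))
```

The gap between \(\bar{\ell}\) and \(H(X)\) here comes entirely from rounding; it vanishes when every \(-\log_2 p_i\) is already an integer.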
For block coding over \(n\) symbols at a time, the bound tightens to \(H(X) \leq \bar{\ell}/n < H(X) + 1/n\), converging to entropy as \(n \to \infty\).
4. Typical Sequences and the AEP
Shannon's deeper proof of the source coding theorem uses the Asymptotic Equipartition Property (AEP), the information-theoretic law of large numbers.
AEP
\[ -\frac{1}{n} \log_2 p(X_1, \ldots, X_n) \xrightarrow{P} H(X) \]
For large \(n\), most sequences have probability close to \(2^{-nH(X)}\). The set of such typical sequences has probability approaching 1 and contains roughly \(2^{nH(X)}\) elements.
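The AEP convergence can be observed numerically. The sketch below uses a hypothetical three-symbol source; the sample value of \(-\tfrac{1}{n}\log_2 p(X_1,\ldots,X_n)\) drifts toward \(H(X)\) as \(n\) grows:

```python
# Empirical check of the AEP: -(1/n) log2 p(x_1..x_n) -> H(X) in probability.
import math
import random

random.seed(0)
probs = {"a": 0.7, "b": 0.2, "c": 0.1}   # hypothetical source
H = -sum(p * math.log2(p) for p in probs.values())   # ~1.157 bits

def empirical_rate(n):
    # Draw an i.i.d. length-n sequence and compute its per-symbol log-loss.
    seq = random.choices(list(probs), weights=list(probs.values()), k=n)
    return -sum(math.log2(probs[s]) for s in seq) / n

for n in (10, 100, 10000):
    print(n, round(empirical_rate(n), 3))  # approaches H as n grows
```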
This gives an elegant compression scheme:
- Enumerate the \(\approx 2^{nH(X)}\) typical sequences.
- Assign each a binary index of length \(\lceil nH(X) \rceil + 1\) bits.
- Encode typical sequences with this index; atypical sequences with a flag + raw bits.
- As \(n \to \infty\), atypical sequences occur with vanishing probability, so the average rate approaches \(H(X)\).
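For a small Bernoulli source the typical set can be enumerated by brute force. The sketch below (parameters chosen for illustration; \(n\) is far too small for the probability to be near 1 yet) counts the weakly typical sequences and their total probability:

```python
# Enumerate weakly typical length-n binary sequences for a Bernoulli(p)
# source: those with -(1/n) log2 P(x^n) within eps of H. Count ~ 2^{nH}.
import math
from itertools import product

p, n, eps = 0.2, 16, 0.2
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # ~0.722 bits

typical, total_prob = 0, 0.0
for seq in product([0, 1], repeat=n):
    k = sum(seq)                                        # number of 1s
    logp = k * math.log2(p) + (n - k) * math.log2(1 - p)
    if abs(-logp / n - H) <= eps:
        typical += 1
        total_prob += 2.0 ** logp

print(typical, round(2 ** (n * H)), round(total_prob, 3))
```

Even at \(n = 16\) the typical set (2,500 sequences) is a tiny fraction of all \(2^{16} = 65{,}536\) sequences while already carrying most of the probability mass; both effects sharpen as \(n\) grows.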
No compression scheme can do better: any code with rate below \(H(X)\) must cause reconstruction errors on a positive fraction of sequences.
5. Preview: Rate-Distortion Theory
The source coding theorem applies to lossless compression. What if we are willing to accept some distortion (as in JPEG images or MP3 audio)?
The rate-distortion function \(R(D)\) gives the minimum rate needed to describe the source with average distortion at most \(D\):
\[ R(D) = \min_{p(\hat{x}|x) : \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X}) \]
At \(D = 0\) (lossless), \(R(0) = H(X)\), recovering the source coding theorem. For a Gaussian source with squared-error distortion, \(R(D) = \max(0,\, \tfrac{1}{2}\log(\sigma^2/D))\); for multi-dimensional Gaussian sources this generalizes to the famous “reverse water-filling” solution. We will study rate-distortion theory in depth in Part III.
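The Gaussian rate-distortion formula is simple enough to evaluate directly; a minimal sketch with \(\sigma^2 = 1\):

```python
# Gaussian rate-distortion function: R(D) = max(0, 0.5 * log2(sigma^2 / D)).
import math

def R(D, sigma2=1.0):
    if D <= 0:
        return float("inf")          # lossless description of a continuous source
    return max(0.0, 0.5 * math.log2(sigma2 / D))

for D in (0.125, 0.25, 0.5, 1.0, 2.0):
    print(D, R(D))                   # rate falls to 0 once D >= sigma^2
```

Each extra bit of rate cuts the allowed distortion by a factor of 4, and once \(D \geq \sigma^2\) no bits are needed at all (just output the mean).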
Fixed-Length vs Variable-Length Coding
For a dyadic distribution (one in which every probability \(p_i\) is a power of 2), Huffman coding achieves the entropy bound exactly, because all \(-\log_2 p_i\) are integers.
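A compact way to see this is to run Huffman's algorithm on a dyadic example and compare the average length with the entropy (the implementation below tracks only codeword lengths, not the codewords themselves):

```python
# Huffman codeword lengths via repeated merging of the two least probable
# subtrees; for a dyadic distribution the average length equals the entropy.
import heapq
import math

def huffman_lengths(probs):
    # Heap entries: (subtree probability, tiebreak id, symbol indices inside).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:        # merging pushes every contained symbol
            lengths[i] += 1      # one level deeper in the code tree
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]               # dyadic: all powers of 2
lengths = huffman_lengths(probs)                 # [1, 2, 3, 3] = -log2(p_i)
H = -sum(p * math.log2(p) for p in probs)
avg = sum(p * l for p, l in zip(probs, lengths))
print(lengths, avg, H)                           # avg == H == 1.75 exactly
```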
Python Simulation: Source Coding in Action
Four panels: (1) average code length converging to entropy with block size, (2) Kraft inequality for various code designs, (3) Huffman vs ideal lengths, (4) compression efficiency across code types.