Part I · Chapter 1

Shannon Entropy

The quantitative measure of uncertainty — the single number that determines how compressible a source is and how much information each message carries.

1. Shannon's 1948 Paper: Background and Motivation

Before 1948, the word “information” had no precise mathematical meaning. Engineers designed communication systems by intuition and experience. Shannon's paper changed this permanently. Working at Bell Labs on telegraphy and telephony, he asked: is there a way to quantify how much “information” a message contains, independently of its meaning?

His starting point was probability. A message source produces symbols drawn from some probability distribution. The more surprising a symbol — the lower its probability — the more information it carries. The letter “E” in an English text is unsurprising and tells us little; the letter “Z” is rare and informative.

Shannon's insight was that any reasonable notion of “information” should be a function only of probabilities, additive for independent events, and continuous. These three axioms uniquely determine a formula.

2. Information Content of an Event

The self-information (or surprisal) of an event\(x\) with probability \(p(x)\) is defined as:

\[ I(x) = -\log_2 p(x) \quad \text{[bits]} \]

The key properties follow directly from this definition:

Certain events carry no information. If \(p(x)=1\), then \(I(x) = -\log_2 1 = 0\). Knowing that the sun rose today tells us nothing new.
Rarer events carry more information. As \(p(x) \to 0\),\(I(x) \to \infty\). Winning the lottery is highly informative precisely because it is rare.
Independent events add. If \(x\) and \(y\) are independent, \(I(x,y) = I(x) + I(y)\), because \(-\log p(x)p(y) = -\log p(x) - \log p(y)\).

The choice of base-2 logarithm gives information in bits. A fair coin flip carries exactly \(-\log_2(1/2) = 1\) bit of information. Using natural logarithm gives nats; base-10 gives hartleys.

3. Shannon Entropy: Derivation from Axioms

The Shannon entropy of a discrete random variable \(X\) with probability mass function \(\{p_1, p_2, \ldots, p_n\}\) is the expected self-information:

\[ H(X) = \mathbb{E}[I(X)] = -\sum_{i=1}^{n} p_i \log_2 p_i \]

Shannon proved that \(H\) is the unique function (up to a positive multiplicative constant) satisfying all three of:

Axiom:

Continuity. \(H(p_1, \ldots, p_n)\) is continuous in every \(p_i\).

Axiom:

Maximum at uniformity. For fixed \(n\), \(H\) is maximized when all \(p_i = 1/n\) (maximum uncertainty for uniform distribution).

Axiom:

Additivity. If a choice is broken into successive stages, \(H\) adds consistently: \(H(pq, p(1-q), (1-p)) = H(p, 1-p) + p\,H(q, 1-q)\).

The proof proceeds by showing that additivity forces \(H\) to be a sum of logarithms, continuity fixes the form to \(-\sum p_i \log p_i\), and the maximum axiom fixes the sign. This is the sense in which the entropy formula is the only consistent way to measure uncertainty.

4. Properties of Shannon Entropy

Non-negativity

Since \(0 \leq p_i \leq 1\) implies \(\log_2 p_i \leq 0\), every term \(-p_i \log_2 p_i \geq 0\). Therefore:

\( H(X) \geq 0 \)

Equality holds if and only if \(X\) is deterministic (one \(p_i = 1\), the rest zero). We adopt the convention \(0 \log 0 = 0\).

Maximum for Uniform Distribution

By Jensen's inequality applied to the concave function \(-\log\):

\( H(X) \leq \log_2 n \)

with equality iff all \(p_i = 1/n\). The uniform distribution over \(n\) symbols has the maximum possible entropy \(\log_2 n\) bits.

Concavity

\(H\) is a concave function of the probability vector \((p_1, \ldots, p_n)\): mixing two distributions always increases (or maintains) entropy.

\( H(\lambda \mathbf{p} + (1-\lambda)\mathbf{q}) \geq \lambda H(\mathbf{p}) + (1-\lambda) H(\mathbf{q}) \)

5. Joint and Conditional Entropy

For a pair of random variables \((X, Y)\), the joint entropy is:

\[ H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y) \]

The conditional entropy \(H(Y|X)\) measures the remaining uncertainty in \(Y\) once \(X\) is known:

\[ H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = \mathbb{E}_x[H(Y|X=x)] \]

Two key inequalities follow immediately:

Conditioning reduces entropy: \(H(Y|X) \leq H(Y)\), with equality iff \(X\) and \(Y\) are independent.
Subadditivity: \(H(X,Y) \leq H(X) + H(Y)\), with equality iff \(X \perp Y\).

6. The Chain Rule

The chain rule for entropy decomposes joint entropy into a sequence of conditional entropies:

\[ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \]

This extends naturally to any number of variables:

\[ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) \]

The chain rule is the foundation of data compression: to encode a sequence of symbols optimally, encode each symbol given all previous ones. This is precisely what arithmetic coding does, achieving entropy to within 2 bits of the optimal rate.

Binary Entropy Curve

The binary entropy function \(H_b(p)\) peaks at \(p = 0.5\) (maximum uncertainty) and equals zero at the deterministic extremes.

Python Simulation: Entropy in Action

Four panels: (1) the binary entropy curve, (2) entropy of English vs uniform and skewed distributions, (3) geometric distributions, (4) Zipf distributions.

Python

script.py138 lines

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from collections import Counter

# ── Figure setup ──────────────────────────────────────────────────────────────
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.patch.set_facecolor('#0a0a1a')

# ── Panel 1: Binary entropy function H(p) ─────────────────────────────────────
ax1 = axes[0, 0]
p = np.linspace(1e-9, 1 - 1e-9, 2000)
Hb = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

ax1.plot(p, Hb, color='#818cf8', linewidth=2.5, label=r'H(p)')
ax1.axvline(x=0.5, color='#f59e0b', linestyle='--', alpha=0.7, linewidth=1.5,
            label='Max at p=0.5: H=1 bit')
ax1.fill_between(p, 0, Hb, alpha=0.15, color='#6366f1')

ax1.set_xlabel('p', fontsize=12, color='white')
ax1.set_ylabel('H(p)  [bits]', fontsize=12, color='white')
ax1.set_title('Binary Entropy Function H(p) = -p log₂p - (1-p) log₂(1-p)',
              fontsize=11, color='white', fontweight='bold')
ax1.legend(fontsize=10, facecolor='#1a1a2e', edgecolor='#818cf8', labelcolor='white')
ax1.set_facecolor('#0a0a1a')
ax1.tick_params(colors='white')
ax1.grid(True, alpha=0.2, color='#818cf8')
ax1.set_xlim(0, 1); ax1.set_ylim(0, 1.05)
for spine in ax1.spines.values():
    spine.set_color('#818cf8')

# ── Panel 2: Entropy of English text vs random text ───────────────────────────
ax2 = axes[0, 1]

# Sample English text (gettysburg-style frequency profile)
english_freq = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.77, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07
}
letters = list(english_freq.keys())
freq_vals = np.array(list(english_freq.values()))
eng_probs = freq_vals / freq_vals.sum()
H_english = -np.sum(eng_probs * np.log2(eng_probs))

# Uniform 26-letter distribution (maximum entropy)
uniform_26 = np.ones(26) / 26
H_uniform26 = np.log2(26)

# 4-symbol distributions for comparison
uniform_4 = np.ones(4) / 4
H_uniform4 = np.log2(4)
skewed_4 = np.array([0.7, 0.15, 0.10, 0.05])
H_skewed4 = -np.sum(skewed_4 * np.log2(skewed_4))

labels = ['English\n26-letter', 'Uniform\n26-letter', 'Uniform\n4-symbol', 'Skewed\n4-symbol']
H_vals = [H_english, H_uniform26, H_uniform4, H_skewed4]
bar_colors = ['#818cf8', '#22d3ee', '#34d399', '#f59e0b']

bars = ax2.bar(labels, H_vals, color=bar_colors, alpha=0.85, edgecolor='white',
               linewidth=0.5, width=0.55)
for bar, hv in zip(bars, H_vals):
    ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.05,
             f'{hv:.2f} bits', ha='center', va='bottom', fontsize=10,
             color='white', fontweight='bold')

ax2.set_ylabel('Entropy H  [bits/symbol]', fontsize=12, color='white')
ax2.set_title('Entropy: English Text vs Various Distributions',
              fontsize=11, color='white', fontweight='bold')
ax2.set_facecolor('#0a0a1a')
ax2.tick_params(colors='white')
ax2.grid(True, alpha=0.2, color='#818cf8', axis='y')
for spine in ax2.spines.values():
    spine.set_color('#818cf8')
print(f"H(English 26-letter): {H_english:.4f} bits  (max possible = {H_uniform26:.4f} bits)")
print(f"Redundancy of English: {1 - H_english / H_uniform26:.1%}")

# ── Panel 3: Geometric distribution entropy ────────────────────────────────────
ax3 = axes[1, 0]
max_k = 30
k = np.arange(1, max_k + 1)

for q_val, col, lbl in [(0.3, '#818cf8', 'q=0.3'), (0.5, '#22d3ee', 'q=0.5'), (0.7, '#f59e0b', 'q=0.7')]:
    geo_probs = (1 - q_val) ** (k - 1) * q_val
    geo_probs /= geo_probs.sum()          # renormalise to finite support
    H_geo = -np.sum(geo_probs * np.log2(geo_probs + 1e-15))
    H_theory = (-( (1-q_val)*np.log2(1-q_val) + q_val*np.log2(q_val) ) / q_val) if q_val < 1 else 0
    ax3.bar(k, geo_probs, alpha=0.6, color=col, width=0.8,
            label=f'{lbl}  H={H_geo:.2f} bits')

ax3.set_xlabel('k', fontsize=12, color='white')
ax3.set_ylabel('P(X = k)', fontsize=12, color='white')
ax3.set_title('Geometric Distribution — Higher Spread → More Entropy',
              fontsize=11, color='white', fontweight='bold')
ax3.legend(fontsize=10, facecolor='#1a1a2e', edgecolor='#818cf8', labelcolor='white')
ax3.set_facecolor('#0a0a1a')
ax3.tick_params(colors='white')
ax3.grid(True, alpha=0.2, color='#818cf8')
for spine in ax3.spines.values():
    spine.set_color('#818cf8')

# ── Panel 4: Zipf distribution entropy ────────────────────────────────────────
ax4 = axes[1, 1]
n_words = np.arange(1, 51)

for s_val, col, lbl in [(1.0, '#818cf8', 's=1.0'), (1.5, '#22d3ee', 's=1.5'), (2.0, '#f59e0b', 's=2.0')]:
    zipf_probs = n_words.astype(float) ** (-s_val)
    zipf_probs /= zipf_probs.sum()
    H_zipf = -np.sum(zipf_probs * np.log2(zipf_probs + 1e-15))
    ax4.plot(n_words, zipf_probs, color=col, linewidth=2.0, marker='o',
             markersize=3, label=f'Zipf {lbl}  H={H_zipf:.2f} bits')

uniform_50 = np.ones(50) / 50
H_u50 = np.log2(50)
ax4.axhline(y=1/50, color='#34d399', linestyle='--', alpha=0.6,
            label=f'Uniform(50)  H={H_u50:.2f} bits')

ax4.set_xlabel('Rank', fontsize=12, color='white')
ax4.set_ylabel('P(word)', fontsize=12, color='white')
ax4.set_title('Zipf Distribution — Entropy vs. Exponent s',
              fontsize=11, color='white', fontweight='bold')
ax4.legend(fontsize=9, facecolor='#1a1a2e', edgecolor='#818cf8', labelcolor='white')
ax4.set_facecolor('#0a0a1a')
ax4.tick_params(colors='white')
ax4.grid(True, alpha=0.2, color='#818cf8')
for spine in ax4.spines.values():
    spine.set_color('#818cf8')

plt.tight_layout(pad=2.0)
plt.savefig('output.png', dpi=150, bbox_inches='tight', facecolor='#0a0a1a')
print("\n--- Summary ---")
print("Binary entropy: peaks at p=0.5 (maximum uncertainty), zero at p=0 or p=1")
print("Uniform distributions always achieve maximum entropy for fixed alphabet size")
print(f"Zipf(s=1.0) top-50 entropy: {-np.sum((np.arange(1,51)**(-1.0)/np.sum(np.arange(1,51)**(-1.0)))*np.log2(np.arange(1,51)**(-1.0)/np.sum(np.arange(1,51)**(-1.0))+1e-15)):.2f} bits")

Click Run to execute the Python code

Code will be executed with Python 3 on the server

← Course Overview Next: Source Coding →

Share:X Reddit LinkedIn