Part I Β· Chapter 1

Shannon Entropy

The quantitative measure of uncertainty β€” the single number that determines how compressible a source is and how much information each message carries.

1. Shannon's 1948 Paper: Background and Motivation

Before 1948, the word β€œinformation” had no precise mathematical meaning. Engineers designed communication systems by intuition and experience. Shannon's paper changed this permanently. Working at Bell Labs on telegraphy and telephony, he asked: is there a way to quantify how much β€œinformation” a message contains, independently of its meaning?

His starting point was probability. A message source produces symbols drawn from some probability distribution. The more surprising a symbol β€” the lower its probability β€” the more information it carries. The letter β€œE” in an English text is unsurprising and tells us little; the letter β€œZ” is rare and informative.

Shannon's insight was that any reasonable notion of β€œinformation” should be a function only of probabilities, additive for independent events, and continuous. These three axioms uniquely determine a formula.

2. Information Content of an Event

The self-information (or surprisal) of an event \(x\) with probability \(p(x)\) is defined as:

\[ I(x) = -\log_2 p(x) \quad \text{[bits]} \]

The key properties follow directly from this definition:

  • Certain events carry no information. If \(p(x)=1\), then \(I(x) = -\log_2 1 = 0\). Knowing that the sun rose today tells us nothing new.
  • Rarer events carry more information. As \(p(x) \to 0\), \(I(x) \to \infty\). Winning the lottery is highly informative precisely because it is rare.
  • Independent events add. If \(x\) and \(y\) are independent, \(I(x,y) = I(x) + I(y)\), because \(-\log p(x)p(y) = -\log p(x) - \log p(y)\).

The choice of base-2 logarithm gives information in bits. A fair coin flip carries exactly \(-\log_2(1/2) = 1\) bit of information. Using natural logarithm gives nats; base-10 gives hartleys.
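These properties are easy to check numerically. A minimal sketch (the function name `surprisal` is our own, chosen for illustration):

```python
import math

def surprisal(p: float) -> float:
    """Self-information I(x) = -log2 p(x), in bits."""
    return -math.log2(p)

print(surprisal(0.5))    # fair coin flip: 1.0 bit
print(surprisal(0.125))  # rarer event: 3.0 bits
# Independent events add: I(x, y) = I(x) + I(y)
print(surprisal(0.5 * 0.125) == surprisal(0.5) + surprisal(0.125))  # True
```

Exact equality holds here because the probabilities are powers of two; in general the additivity check should use a floating-point tolerance.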

3. Shannon Entropy: Derivation from Axioms

The Shannon entropy of a discrete random variable \(X\) with probability mass function \(\{p_1, p_2, \ldots, p_n\}\) is the expected self-information:

\[ H(X) = \mathbb{E}[I(X)] = -\sum_{i=1}^{n} p_i \log_2 p_i \]

Shannon proved that \(H\) is the unique function (up to a positive multiplicative constant) satisfying all three of:

Axiom 1 (Continuity). \(H(p_1, \ldots, p_n)\) is continuous in every \(p_i\).

Axiom 2 (Monotonicity). For equally likely outcomes, uncertainty grows with their number: \(H(1/n, \ldots, 1/n)\) is monotonically increasing in \(n\).

Axiom 3 (Additivity). If a choice is broken into successive stages, \(H\) adds consistently: \(H(pq, p(1-q), 1-p) = H(p, 1-p) + p\,H(q, 1-q)\).

The proof proceeds by showing that additivity forces \(H(1/n, \ldots, 1/n)\) to be proportional to \(\log n\), continuity extends the form to \(-\sum p_i \log p_i\) for arbitrary distributions, and monotonicity fixes the sign of the constant. This is the sense in which the entropy formula is the only consistent way to measure uncertainty.
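A direct implementation of the formula, following the convention \(0 \log 0 = 0\) (a sketch; the name `entropy` is ours):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i log2 p_i, in bits (0 log 0 := 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: about 0.47 bits
print(entropy([1.0, 0.0]))  # deterministic: 0.0 bits
```

Skipping the zero-probability terms in the sum is exactly the \(0 \log 0 = 0\) convention.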

4. Properties of Shannon Entropy

Non-negativity

Since \(0 \leq p_i \leq 1\) implies \(\log_2 p_i \leq 0\), every term \(-p_i \log_2 p_i \geq 0\). Therefore:

\( H(X) \geq 0 \)

Equality holds if and only if \(X\) is deterministic (one \(p_i = 1\), the rest zero). We adopt the convention \(0 \log 0 = 0\).

Maximum for Uniform Distribution

By Jensen's inequality applied to the concave function \(-\log\):

\( H(X) \leq \log_2 n \)

with equality iff all \(p_i = 1/n\). The uniform distribution over \(n\) symbols has the maximum possible entropy \(\log_2 n\) bits.
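A quick numeric check of the bound (the helper names and the choice of \(n\) are ours):

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

n = 8
uniform = [1.0 / n] * n
print(entropy(uniform), math.log2(n))  # both 3.0 bits

# Any non-uniform distribution over the same n symbols falls below log2(n).
random.seed(0)
weights = [random.random() for _ in range(n)]
skewed = [w / sum(weights) for w in weights]
print(entropy(skewed) < math.log2(n))  # True
```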

Concavity

\(H\) is a concave function of the probability vector \((p_1, \ldots, p_n)\): the entropy of a mixture of two distributions is at least the corresponding average of their entropies.

\( H(\lambda \mathbf{p} + (1-\lambda)\mathbf{q}) \geq \lambda H(\mathbf{p}) + (1-\lambda) H(\mathbf{q}) \)
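The inequality can be spot-checked for a concrete pair of distributions (an arbitrary illustration, not from the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

p = [0.9, 0.1]
q = [0.1, 0.9]
lam = 0.5
mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]  # the uniform [0.5, 0.5]

lhs = entropy(mix)                               # 1.0 bit
rhs = lam * entropy(p) + (1 - lam) * entropy(q)  # about 0.47 bits
print(lhs >= rhs)  # True: mixing moved the distribution toward uniform
```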

5. Joint and Conditional Entropy

For a pair of random variables \((X, Y)\), the joint entropy is:

\[ H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y) \]

The conditional entropy \(H(Y|X)\) measures the remaining uncertainty in \(Y\) once \(X\) is known:

\[ H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = \mathbb{E}_x[H(Y|X=x)] \]

Two key inequalities follow immediately:

  • Conditioning reduces entropy: \(H(Y|X) \leq H(Y)\), with equality iff \(X\) and \(Y\) are independent.
  • Subadditivity: \(H(X,Y) \leq H(X) + H(Y)\), with equality iff \(X \perp Y\).
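Both inequalities can be verified on a small joint distribution (the joint pmf below is an arbitrary illustration, chosen to be correlated):

```python
import math

def H(probs):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Joint pmf p(x, y) for binary X and Y; X and Y are correlated by construction.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)
H_y_given_x = sum(-p * math.log2(p / px[x]) for (x, y), p in joint.items())

print(H_y_given_x <= H(py.values()))                         # conditioning reduces entropy
print(H(joint.values()) <= H(px.values()) + H(py.values()))  # subadditivity
```

Because \(X\) and \(Y\) are correlated here, both inequalities are strict.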

6. The Chain Rule

The chain rule for entropy decomposes joint entropy into a sequence of conditional entropies:

\[ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \]

This extends naturally to any number of variables:

\[ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) \]

The chain rule is the foundation of data compression: to encode a sequence of symbols optimally, encode each symbol given all previous ones. This is precisely what arithmetic coding does, coming within 2 bits of the sequence's total entropy regardless of its length.
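The two-variable chain rule is easy to confirm numerically (the joint distribution is an arbitrary example of ours):

```python
import math

def H(probs):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}

# H(Y|X) computed directly from the conditional probabilities p(y|x)
H_y_given_x = sum(-p * math.log2(p / px[x]) for (x, y), p in joint.items())

lhs = H(joint.values())              # H(X, Y)
rhs = H(px.values()) + H_y_given_x   # H(X) + H(Y|X)
print(abs(lhs - rhs) < 1e-9)         # True: the chain rule holds
```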

Binary Entropy Curve

[Figure: the binary entropy function \(H_b(p)\) plotted against \(p \in [0,1]\); \(H_b(0.5) = 1\) bit, and \(H_b(0) = H_b(1) = 0\).]

The binary entropy function \(H_b(p)\) peaks at \(p = 0.5\) (maximum uncertainty) and equals zero at the deterministic extremes.

Python Simulation: Entropy in Action

Four panels: (1) the binary entropy curve, (2) entropy of English vs uniform and skewed distributions, (3) geometric distributions, (4) Zipf distributions.
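The original 139-line script is not reproduced here; the sketch below condenses the same four computations with the plotting omitted and with the English-letter frequencies (not listed in the text) replaced by the uniform benchmark. All distribution parameters are our own choices:

```python
import math

def entropy(probs):
    """Shannon entropy in bits, with the 0 log 0 = 0 convention."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Panel 1: the binary entropy curve H_b(p), sampled on a grid
curve = [(k / 100, entropy([k / 100, 1 - k / 100])) for k in range(1, 100)]
print(max(curve, key=lambda t: t[1]))  # peak at p = 0.5 with H = 1.0 bit

# Panel 2: uniform distribution over 26 symbols (maximum-entropy benchmark)
n = 26
print(entropy([1 / n] * n))            # log2(26), about 4.70 bits

# Panel 3: truncated geometric distribution, p(k) proportional to r**k
r = 0.5
Z = sum(r**k for k in range(50))
geometric = [r**k / Z for k in range(50)]
print(entropy(geometric))              # about 2 bits for r = 0.5

# Panel 4: Zipf distribution, p(k) proportional to 1/k
Z = sum(1 / k for k in range(1, n + 1))
zipf = [1 / (k * Z) for k in range(1, n + 1)]
print(entropy(zipf))                   # below the uniform's 4.70 bits
```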
