Shannon Entropy
The quantitative measure of uncertainty: the single number that determines how compressible a source is and how much information each message carries.
1. Shannon's 1948 Paper: Background and Motivation
Before 1948, the word "information" had no precise mathematical meaning. Engineers designed communication systems by intuition and experience. Shannon's paper changed this permanently. Working at Bell Labs on telegraphy and telephony, he asked: is there a way to quantify how much "information" a message contains, independently of its meaning?
His starting point was probability. A message source produces symbols drawn from some probability distribution. The more surprising a symbol (the lower its probability), the more information it carries. The letter "E" in an English text is unsurprising and tells us little; the letter "Z" is rare and informative.
Shannon's insight was that any reasonable notion of "information" should be a function only of probabilities, additive for independent events, and continuous. These three axioms uniquely determine a formula.
2. Information Content of an Event
The self-information (or surprisal) of an event \(x\) with probability \(p(x)\) is defined as:
\[ I(x) = -\log_2 p(x) \quad \text{[bits]} \]
The key properties follow directly from this definition:
- Certain events carry no information. If \(p(x)=1\), then \(I(x) = -\log_2 1 = 0\). Knowing that the sun rose today tells us nothing new.
- Rarer events carry more information. As \(p(x) \to 0\), \(I(x) \to \infty\). Winning the lottery is highly informative precisely because it is rare.
- Independent events add. If \(x\) and \(y\) are independent, \(I(x,y) = I(x) + I(y)\), because \(-\log p(x)p(y) = -\log p(x) - \log p(y)\).
The choice of base-2 logarithm gives information in bits. A fair coin flip carries exactly \(-\log_2(1/2) = 1\) bit of information. Using natural logarithm gives nats; base-10 gives hartleys.
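These properties are easy to check numerically. The sketch below is illustrative (the helper name `self_information` is mine, not from the text):

```python
import math

def self_information(p, base=2.0):
    """Surprisal I(x) = -log_base p(x); base 2 gives bits, base e gives nats."""
    return -math.log(p, base)

# A fair coin flip carries exactly 1 bit.
print(self_information(0.5))                           # 1.0

# Independent events add: I(x, y) = I(x) + I(y).
print(self_information(0.5 * 0.25))                    # ~3.0
print(self_information(0.5) + self_information(0.25))  # ~3.0

# The same coin flip measured in nats instead of bits.
print(self_information(0.5, base=math.e))              # ~0.693
```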
3. Shannon Entropy: Derivation from Axioms
The Shannon entropy of a discrete random variable \(X\) with probability mass function \(\{p_1, p_2, \ldots, p_n\}\) is the expected self-information:
\[ H(X) = \mathbb{E}[I(X)] = -\sum_{i=1}^{n} p_i \log_2 p_i \]
Shannon proved that \(H\) is the unique function (up to a positive multiplicative constant) satisfying all three of:
- Continuity: \(H\) is a continuous function of the probabilities \(p_i\).
- Monotonicity: for equally likely outcomes, \(H(1/n, \ldots, 1/n)\) increases with \(n\); more choices mean more uncertainty.
- Additivity (grouping): if a choice is decomposed into successive sub-choices, \(H\) equals the entropy of the first choice plus the probability-weighted entropies of the sub-choices.

The proof proceeds by showing that monotonicity and additivity force \(H(1/n, \ldots, 1/n) = K \log n\), the grouping axiom extends this to \(-K \sum p_i \log p_i\) for rational probabilities, and continuity extends it to all distributions. This is the sense in which the entropy formula is the only consistent way to measure uncertainty.
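A direct implementation of the formula makes the definition concrete (a minimal sketch; the function name and the filtering that implements the \(0 \log 0 = 0\) convention are mine):

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
print(entropy([1.0, 0.0]))   # deterministic source: no uncertainty
```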
4. Properties of Shannon Entropy
Non-negativity
Since \(0 \leq p_i \leq 1\) implies \(\log_2 p_i \leq 0\), every term \(-p_i \log_2 p_i \geq 0\). Therefore:
\( H(X) \geq 0 \)
Equality holds if and only if \(X\) is deterministic (one \(p_i = 1\), the rest zero). We adopt the convention \(0 \log 0 = 0\).
Maximum for Uniform Distribution
By Jensen's inequality applied to the concave function \(\log\) (write \(H(X) = \mathbb{E}[\log_2(1/p(X))] \leq \log_2 \mathbb{E}[1/p(X)]\) and note \(\mathbb{E}[1/p(X)] = n\)):
\( H(X) \leq \log_2 n \)
with equality iff all \(p_i = 1/n\). The uniform distribution over \(n\) symbols has the maximum possible entropy \(\log_2 n\) bits.
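A quick numerical check of this bound (the random distributions are generated ad hoc for illustration):

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 4
print(entropy([1.0 / n] * n), math.log2(n))   # both 2.0: uniform attains the bound

# No random distribution over n symbols exceeds log2(n) bits.
random.seed(0)
for _ in range(1000):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]
    assert entropy(p) <= math.log2(n) + 1e-12
```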
Concavity
\(H\) is a concave function of the probability vector \((p_1, \ldots, p_n)\): the entropy of a mixture of two distributions is at least the corresponding mixture of their entropies.
\( H(\lambda \mathbf{p} + (1-\lambda)\mathbf{q}) \geq \lambda H(\mathbf{p}) + (1-\lambda) H(\mathbf{q}) \)
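The inequality can likewise be spot-checked on random mixtures (the test harness below is my own sketch, not part of the original):

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_dist(k):
    weights = [random.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(1)
for _ in range(1000):
    p, q = random_dist(3), random_dist(3)
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    # Concavity: entropy of the mixture dominates the mixture of entropies.
    assert entropy(mix) >= lam * entropy(p) + (1 - lam) * entropy(q) - 1e-12
print("concavity holds on 1000 random mixtures")
```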
5. Joint and Conditional Entropy
For a pair of random variables \((X, Y)\), the joint entropy is:
\[ H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y) \]
The conditional entropy \(H(Y|X)\) measures the remaining uncertainty in \(Y\) once \(X\) is known:
\[ H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = \mathbb{E}_x[H(Y|X=x)] \]
Two key inequalities follow immediately:
- Conditioning reduces entropy: \(H(Y|X) \leq H(Y)\), with equality iff \(X\) and \(Y\) are independent.
- Subadditivity: \(H(X,Y) \leq H(X) + H(Y)\), with equality iff \(X \perp Y\).
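Both inequalities are easy to verify on a small joint distribution (the joint table below is a made-up example):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

H_xy = H(joint.values())
H_x, H_y = H(px.values()), H(py.values())
# H(Y|X) computed directly from p(y|x) = p(x, y) / p(x).
H_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items() if p > 0)

print(H_y_given_x, H_y)   # ~0.722 <= 1.0: conditioning reduces entropy
print(H_xy, H_x + H_y)    # ~1.722 <= 2.0: subadditivity
```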
6. The Chain Rule
The chain rule for entropy decomposes joint entropy into a sequence of conditional entropies:
\[ H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \]
This extends naturally to any number of variables:
\[ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) \]
The chain rule is the foundation of data compression: to encode a sequence of symbols optimally, encode each symbol given all previous ones. This is precisely what arithmetic coding does, coming within 2 bits of the total entropy of the entire sequence.
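The per-symbol decomposition can be seen directly on a toy Markov source (the transition table is invented for illustration; this shows the bookkeeping behind arithmetic coding, not an actual coder):

```python
import math

# Hypothetical first-order Markov source.
init = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}

seq = "aabab"

# Chain rule: -log2 p(x_1..x_n) = sum_i -log2 p(x_i | x_{i-1}).
bits = -math.log2(init[seq[0]])
for prev, cur in zip(seq, seq[1:]):
    bits -= math.log2(trans[prev][cur])

# Same total from the joint probability of the whole sequence.
p_seq = init[seq[0]]
for prev, cur in zip(seq, seq[1:]):
    p_seq *= trans[prev][cur]

print(bits, -math.log2(p_seq))   # identical: ~8.80 bits either way
```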
Binary Entropy Curve
The binary entropy function \(H_b(p) = -p \log_2 p - (1-p) \log_2(1-p)\) peaks at \(p = 0.5\) (maximum uncertainty) and equals zero at the deterministic extremes.
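The curve itself is a two-line function (a sketch; the name `binary_entropy` is mine):

```python
import math

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1 - p) log2 (1 - p), with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # peak: 1.0 bit
print(binary_entropy(0.1))   # ~0.469 bits
print(binary_entropy(1.0))   # deterministic extreme: 0.0
```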
Python Simulation: Entropy in Action
Four panels: (1) the binary entropy curve, (2) entropy of English vs uniform and skewed distributions, (3) geometric distributions, (4) Zipf distributions.
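Without the interactive runner, the non-plotting core of panels (1)–(4) can be sketched as follows (alphabet size and distribution parameters chosen arbitrarily for illustration):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

n = 26  # alphabet size
uniform = [1.0 / n] * n
geometric = normalize([0.8 ** k for k in range(n)])  # mass decays geometrically
zipf = normalize([1.0 / (k + 1) for k in range(n)])  # Zipf, exponent 1

print(H(uniform))    # log2(26) ~ 4.70 bits: the maximum
print(H(geometric))  # lower: probability concentrated on early symbols
print(H(zipf))       # also below the maximum
```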