Part III β€Ί Chapter 7

Differential Entropy

Extending Shannon entropy to continuous random variables β€” with the surprising twist that the entropy can be negative, and that the Gaussian is uniquely entropy-maximizing.

7.1 Definition

For a continuous random variable \(X\) with probability density function \(f(x)\), the differential entropy is:

\[ h(X) = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx \]

The integral is taken over the support of \(f\). Like discrete entropy, it measures β€œspread” or β€œuncertainty” of a distribution β€” but with a critical difference: differential entropy can be negative, and its value depends on the units of measurement.

The non-negativity of discrete entropy relied on \(0 \le p_i \le 1\) implying \(-\log p_i \ge 0\). For a continuous pdf, \(f(x)\) can exceed 1, making \(-\log f(x)\) negative, so \(h(X)\) can be negative.
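A quick numerical check makes this concrete. The sketch below (the interval \([0, 1/2]\) is an illustrative choice) evaluates \(-\int f\log f\,dx\) for a uniform density whose value exceeds 1, and recovers a negative entropy:

```python
import numpy as np
from scipy.integrate import quad

# Uniform on [a, b] has h(X) = log(b - a); for b - a < 1 this is negative.
a, b = 0.0, 0.5
f = 1.0 / (b - a)   # density value, here 2 > 1, so -log f(x) < 0

# Numerical evaluation of -integral f(x) log f(x) dx over the support
h_numeric, _ = quad(lambda x: -f * np.log(f), a, b)
h_exact = np.log(b - a)   # closed form: log(1/2), about -0.693 nats

print(f"numeric h = {h_numeric:.4f}, exact h = {h_exact:.4f}")
```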

7.2 Closed-Form Results

| Distribution | Parameters | Differential entropy \(h(X)\) |
|---|---|---|
| Uniform | \([a,b]\) | \(\log(b-a)\) |
| Gaussian | \(\mathcal{N}(\mu,\sigma^2)\) | \(\frac{1}{2}\log(2\pi e\,\sigma^2)\) |
| Exponential | \(\text{Exp}(\lambda)\) | \(1 - \log\lambda\) |
| Laplace | \(\text{Lap}(\mu,b)\) | \(1 + \log(2b)\) |
| Multivariate Gaussian | \(\mathcal{N}(\boldsymbol\mu,\Sigma)\) | \(\frac{n}{2}\log(2\pi e) + \frac{1}{2}\log\det\Sigma\) |
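The closed forms in the table can be cross-checked against SciPy, whose continuous distributions expose an `entropy()` method returning differential entropy in nats (parameter values below are arbitrary; note SciPy parametrizes the exponential by scale \(1/\lambda\)):

```python
import numpy as np
from scipy import stats

sigma, lam, b_lap = 1.5, 2.0, 0.7   # Gaussian std, exponential rate, Laplace scale
a, b = 1.0, 4.0                     # uniform support

checks = {
    "Uniform":     (stats.uniform(loc=a, scale=b - a).entropy(), np.log(b - a)),
    "Gaussian":    (stats.norm(scale=sigma).entropy(),
                    0.5 * np.log(2 * np.pi * np.e * sigma**2)),
    "Exponential": (stats.expon(scale=1 / lam).entropy(), 1 - np.log(lam)),
    "Laplace":     (stats.laplace(scale=b_lap).entropy(), 1 + np.log(2 * b_lap)),
}
for name, (numeric, formula) in checks.items():
    print(f"{name:12s} scipy = {numeric:.4f}, table formula = {formula:.4f}")
```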

7.3 Maximum Entropy Distributions

Subject to a variance constraint \(\operatorname{Var}(X)=\sigma^2\), the Gaussian maximizes differential entropy:

\[ h(X) \;\le\; \tfrac{1}{2}\log(2\pi e\,\sigma^2) \]

with equality iff \(X\) is Gaussian with variance \(\sigma^2\).

Proof sketch (relative entropy): For any \(f\) with variance \(\sigma^2\), let \(g=\mathcal{N}(0,\sigma^2)\). By non-negativity of KL divergence:

\[ 0 \;\le\; D_{\mathrm{KL}}(f\|g) = -h(X) - \int f\log g\,dx \]

Since \(\log g(x) = -\frac{x^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\), the integral \(\int f\log g\,dx\) depends only on the second moment of \(f\), which equals \(\sigma^2\) by assumption. Hence \(\int f\log g\,dx = \int g\log g\,dx = -h(g)\), and the inequality becomes \(0 \le -h(X) + h(g)\), i.e. \(h(X) \le h(g)\).

This result is why Gaussian noise is the worst case for channel capacity problems β€” it is the β€œmost uncertain” noise for a given power level.
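The gap can be seen directly from the closed forms in the table: matching the variance of a Laplace and a uniform distribution to a Gaussian's, the Gaussian's entropy comes out largest (a sketch; \(\sigma = 1\) is an arbitrary choice):

```python
import numpy as np

sigma = 1.0   # common standard deviation for all three distributions

# Entropies (in nats) of three zero-mean distributions with variance sigma^2,
# using the closed forms from the table above:
h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
h_laplace = 1 + np.log(2 * (sigma / np.sqrt(2)))   # Laplace: Var = 2b^2, so b = sigma/sqrt(2)
h_uniform = np.log(sigma * np.sqrt(12))            # Uniform: Var = width^2/12, so width = sigma*sqrt(12)

print(f"Gaussian {h_gauss:.4f} > Laplace {h_laplace:.4f} > Uniform {h_uniform:.4f}")
```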

Other Maximum Entropy Results

  • β€£Support [a, b] constraint: Uniform distribution maximizes \(h(X)\)
  • β€£Mean constraint \(E[X]=\mu\), support \([0,\infty)\): Exponential distribution
  • β€£No constraint (only normalization): \(h(X)\to\infty\) β€” no maximum exists

7.4 Properties of Differential Entropy

Scaling

\[ h(aX) = h(X) + \log|a| \]

Stretching a distribution (\(|a|>1\)) adds \(\log|a|\) to the entropy; compressing (\(|a|<1\)) subtracts it.

Translation Invariance

\[ h(X + c) = h(X) \]

Shifting does not change the shape of the distribution.
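Both properties are easy to verify numerically for a Gaussian, since \(aX + c \sim \mathcal{N}(c, a^2\sigma^2)\) (parameter values below are arbitrary):

```python
import numpy as np
from scipy import stats

sigma, a, c = 1.3, 2.5, 7.0   # arbitrary std, scale factor, and shift

h_X  = stats.norm(loc=0, scale=sigma).entropy()    # h(X) for X ~ N(0, sigma^2)
h_aX = stats.norm(scale=abs(a) * sigma).entropy()  # aX ~ N(0, a^2 sigma^2)
h_Xc = stats.norm(loc=c, scale=sigma).entropy()    # X + c ~ N(c, sigma^2)

print(f"h(aX) - h(X) = {h_aX - h_X:.4f}  (log|a| = {np.log(abs(a)):.4f})")
print(f"h(X + c) - h(X) = {h_Xc - h_X:.4f}")
```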

Chain Rule

\[ h(X,Y) = h(X) + h(Y|X) \]

Same form as the discrete chain rule; conditioning reduces entropy: \(h(Y|X) \le h(Y)\).

Mutual Information

\[ I(X;Y) = h(X) - h(X|Y) \ge 0 \]

Always non-negative, even though the individual differential entropies can be negative.

Relation to Discrete Entropy

Quantizing \(X\) to bins of width \(\Delta\) gives a discrete variable \(X^\Delta\) with:

\[ H(X^\Delta) \approx h(X) - \log\Delta \]

As \(\Delta\to 0\) (finer quantization), \(H(X^\Delta)\to\infty\). The divergence rate is \(-\log\Delta\) and the β€œremainder” is \(h(X)\). This is why \(h(X)\) can be negative but \(H\) is always non-negative.
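The approximation \(H(X^\Delta) \approx h(X) - \log\Delta\) can be checked by quantizing a Gaussian into bins of width \(\Delta\) and computing the discrete entropy of the bin probabilities (the values \(\sigma = 1\), \(\Delta = 0.01\) are illustrative):

```python
import numpy as np
from scipy import stats

sigma, delta = 1.0, 0.01   # Gaussian std and quantization bin width

# Bin probabilities p_i = P(X in bin i) via differences of the Gaussian CDF;
# truncating at +/- 8 sigma leaves negligible tail mass
edges = np.arange(-8 * sigma, 8 * sigma + delta, delta)
p = np.diff(stats.norm.cdf(edges, scale=sigma))
p = p[p > 0]

H_discrete = -np.sum(p * np.log(p))                  # H(X^Delta), in nats
h_diff = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # h(X) closed form

print(f"H(X^Delta) = {H_discrete:.4f}, h(X) - log(delta) = {h_diff - np.log(delta):.4f}")
```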

Python: Computing & Visualizing Differential Entropy

Compare analytical formulas against numerical estimates from histograms. Six panels: entropy vs parameter, PDFs, bar comparison, Gaussian maximality, negativity region, and the discrete-to-continuous relation.
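The original script is not reproduced here; the following is a minimal sketch of its core step, a histogram-based entropy estimate, with the plotting panels omitted. It uses the discrete-to-continuous relation above: the binned entropy plus \(\log\Delta\) recovers \(h(X)\) as the bins shrink.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_hist(samples, bins=100):
    """Estimate differential entropy (nats) from samples via a histogram."""
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)            # equal bin widths Delta
    p = counts / counts.sum()          # bin probabilities
    mask = p > 0
    # h(X) ~ H(X^Delta) + log(Delta) = -sum p log(p / Delta)
    return -np.sum(p[mask] * np.log(p[mask])) + np.sum(p[mask] * np.log(widths[mask]))

sigma = 2.0
x = rng.normal(0, sigma, 200_000)
h_hat = entropy_hist(x)
h_true = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(f"histogram estimate = {h_hat:.3f}, analytic = {h_true:.3f}")
```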
