Chapter 13: Bayesian Inference

Bayesian inference is a principled framework for learning from data: instead of a single parameter estimate, we maintain a full probability distribution over hypotheses, updating it as evidence arrives. Every prior belief, every observation, and every prediction is part of a coherent calculus of uncertainty.

1. Bayes' Theorem as a Learning Rule

Let \(\theta\) be a parameter (e.g. the bias of a coin) and \(\mathcal{D} = \{x_1,\ldots,x_n\}\) be observed data. Bayes' theorem states:

\[ P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} \]
| Term | Name | Meaning |
|---|---|---|
| \(P(\theta \mid \mathcal{D})\) | Posterior | Updated belief about \(\theta\) after seeing the data |
| \(P(\mathcal{D} \mid \theta)\) | Likelihood | How probable the data are under parameter \(\theta\) |
| \(P(\theta)\) | Prior | Initial belief about \(\theta\) before any data |
| \(P(\mathcal{D})\) | Evidence | Normalising constant: \(\int P(\mathcal{D}\mid\theta)\,P(\theta)\,d\theta\) |

The evidence \(P(\mathcal{D})\) is typically intractable for continuous \(\theta\) — this is why approximations like MCMC and variational inference are needed. For conjugate priors, the normaliser cancels analytically.

2. Conjugate Priors

A prior \(P(\theta)\) is conjugate to a likelihood \(P(\mathcal{D}|\theta)\) if the posterior is in the same distributional family as the prior. Conjugacy makes Bayesian updating a simple parameter update.

2.1 Beta-Bernoulli (Coin Flipping)

For coin flips \(x_i \in \{0,1\}\) with unknown bias \(p\), the likelihood of \(h\) heads in \(n\) flips is:

\[ P(\mathcal{D} \mid p) = p^h (1-p)^{n-h} \]

Choose a Beta prior: \(P(p) = \mathrm{Beta}(\alpha_0, \beta_0) \propto p^{\alpha_0-1}(1-p)^{\beta_0-1}\). Then:

\[ P(p \mid \mathcal{D}) \propto p^h (1-p)^{n-h} \cdot p^{\alpha_0-1}(1-p)^{\beta_0-1} = p^{(\alpha_0+h)-1}(1-p)^{(\beta_0+n-h)-1} \]

\[ \Rightarrow \quad P(p \mid \mathcal{D}) = \mathrm{Beta}(\alpha_0 + h,\; \beta_0 + (n-h)) \]

The hyper-parameters \(\alpha_0, \beta_0\) act as pseudo-counts: \(\alpha_0\) is like having seen \(\alpha_0-1\) extra heads before any real data. The posterior mean is \(\hat{p} = \frac{\alpha_0+h}{\alpha_0+\beta_0+n}\), a weighted average between the prior mean \(\frac{\alpha_0}{\alpha_0+\beta_0}\) and the MLE \(\frac{h}{n}\).
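To make the pseudo-count interpretation concrete, here is a minimal sketch of the conjugate update. The numbers (7 heads in 10 flips, a uniform Beta(1, 1) prior) are illustrative choices, not data from the chapter:

```python
# Illustrative data and prior: 7 heads in 10 flips, uniform Beta(1, 1) prior
alpha0, beta0 = 1.0, 1.0
n, h = 10, 7

# Conjugate update: add observed heads/tails to the pseudo-counts
alpha_n = alpha0 + h
beta_n = beta0 + (n - h)

prior_mean = alpha0 / (alpha0 + beta0)    # 0.5
mle = h / n                               # 0.7
post_mean = alpha_n / (alpha_n + beta_n)  # weighted average of the two

print(f"prior mean {prior_mean:.3f}, MLE {mle:.3f}, posterior mean {post_mean:.3f}")
```

With only 10 flips the posterior mean (\(8/12 \approx 0.667\)) is pulled noticeably toward the prior; as \(n\) grows, the MLE term dominates.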

2.2 Normal-Normal (Known Variance)

Suppose observations \(x_i \sim \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\), and prior \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\). The likelihood for \(n\) observations is:

\[ \log P(\mathcal{D} \mid \mu) = -\frac{n}{2\sigma^2}\left(\bar{x} - \mu\right)^2 + \text{const} \]

Adding the log-prior \(-\frac{1}{2\tau_0^2}(\mu-\mu_0)^2\) and completing the square in \(\mu\):

\[ \log P(\mu \mid \mathcal{D}) \propto -\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\right)\left(\mu - \mu_n\right)^2 \]

\[ \mu_n = \frac{\frac{n}{\sigma^2}\bar{x} + \frac{1}{\tau_0^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}}, \qquad \frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \]

So the posterior is \(\mu \mid \mathcal{D} \sim \mathcal{N}(\mu_n, \tau_n^2)\). The posterior mean \(\mu_n\) is a precision-weighted average of the prior mean and the MLE \(\bar{x}\). As \(n \to \infty\), the data dominates and \(\mu_n \to \bar{x}\).
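As a quick numeric check of these formulas, a sketch of the precision-weighted update. All the numbers here (\(\sigma^2\), \(\mu_0\), \(\tau_0^2\), \(n\), \(\bar{x}\)) are made-up illustrative values:

```python
import math

sigma2 = 4.0             # known observation variance (illustrative)
mu0, tau0_2 = 0.0, 1.0   # prior mean and variance (illustrative)
n, xbar = 25, 1.8        # sample size and sample mean (illustrative)

prec_data = n / sigma2        # precision contributed by the data
prec_prior = 1.0 / tau0_2     # precision contributed by the prior

tau_n2 = 1.0 / (prec_data + prec_prior)                # posterior variance
mu_n = tau_n2 * (prec_data * xbar + prec_prior * mu0)  # precision-weighted mean

print(f"posterior: N({mu_n:.3f}, {math.sqrt(tau_n2):.3f}^2)")
```

Because the data precision \(n/\sigma^2 = 6.25\) exceeds the prior precision \(1\), the posterior mean lands much closer to \(\bar{x}\) than to \(\mu_0\).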

3. Bayesian Updating Diagram

[Diagram: the Bayesian updating cycle. The prior P(θ) (initial beliefs before data) combines with the likelihood P(D|θ) of observed data D to yield the posterior P(θ|D) (updated beliefs after data), via Posterior ∝ Likelihood × Prior. The posterior then becomes the prior for the next update.]

4. Predictive Distribution

Rather than plugging in a point estimate of \(\theta\), Bayesian prediction integrates over the posterior — this accounts for parameter uncertainty:

\[ P(x_{\rm new} \mid \mathcal{D}) = \int P(x_{\rm new} \mid \theta)\, P(\theta \mid \mathcal{D})\, d\theta \]

For the Beta-Bernoulli model with posterior \(\mathrm{Beta}(\alpha_n, \beta_n)\), the probability of the next flip being heads is simply \(\mathbb{E}[p \mid \mathcal{D}] = \frac{\alpha_n}{\alpha_n+\beta_n}\). For predicting \(k\) heads in the next \(m\) flips, the result is the Beta-Binomial distribution:

\[ P(k \mid m, \mathcal{D}) = \binom{m}{k} \frac{B(\alpha_n+k,\, \beta_n+m-k)}{B(\alpha_n,\beta_n)} \]
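The Beta function in this formula can be evaluated stably through log-gamma. A sketch using only the standard library (the posterior Beta(8, 4) is an illustrative choice, e.g. a uniform prior plus 7 heads and 3 tails):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log B(a, b) = log Γ(a) + log Γ(b) - log Γ(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, m, alpha_n, beta_n):
    # P(k heads in next m flips | posterior Beta(alpha_n, beta_n))
    return comb(m, k) * exp(
        log_beta(alpha_n + k, beta_n + m - k) - log_beta(alpha_n, beta_n)
    )

alpha_n, beta_n = 8.0, 4.0  # illustrative posterior
m = 5
pmf = [beta_binomial_pmf(k, m, alpha_n, beta_n) for k in range(m + 1)]
print([f"{p:.3f}" for p in pmf])
```

A useful sanity check: the PMF sums to 1 and its mean is \(m\,\alpha_n/(\alpha_n+\beta_n)\), the posterior-mean heads probability times the number of flips.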

5. MCMC: Metropolis-Hastings

For complex posteriors where conjugacy fails, Markov Chain Monte Carlo (MCMC) draws samples from \(P(\theta|\mathcal{D})\) without needing to compute the normaliser. The Metropolis-Hastings algorithm works as follows:

  1. Start at some \(\theta^{(0)}\).
  2. At step \(t\), propose \(\theta^* \sim q(\theta^* \mid \theta^{(t-1)})\) (e.g., Gaussian random walk).
  3. Compute the acceptance ratio:
    \[ \alpha = \min\!\left(1,\; \frac{P(\theta^* \mid \mathcal{D})\, q(\theta^{(t-1)} \mid \theta^*)}{P(\theta^{(t-1)} \mid \mathcal{D})\, q(\theta^* \mid \theta^{(t-1)})}\right) \]
    Note: since \(P(\theta^*|\mathcal{D}) \propto P(\mathcal{D}|\theta^*)P(\theta^*)\), the normaliser cancels in the ratio.
  4. Accept \(\theta^{(t)} = \theta^*\) with probability \(\alpha\); otherwise stay: \(\theta^{(t)} = \theta^{(t-1)}\).
  5. Discard the burn-in period. The remaining samples are approximately distributed according to the posterior; note that successive draws are correlated, not i.i.d., which is why diagnostics such as effective sample size (or thinning) are used in practice.
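The steps above can be sketched for the coin-flip posterior, using only its unnormalised log density (the data — 7 heads in 10 flips with a uniform prior — and the step size are illustrative choices):

```python
import math
import random

random.seed(0)

def log_post_unnorm(p, h=7, n=10, a0=1.0, b0=1.0):
    # Unnormalised log posterior of a Beta(a0+h, b0+n-h); -inf outside (0, 1)
    if not 0.0 < p < 1.0:
        return float("-inf")
    return (h + a0 - 1) * math.log(p) + (n - h + b0 - 1) * math.log(1 - p)

def metropolis_hastings(n_steps=20000, step=0.1, init=0.5):
    samples, theta = [], init
    lp = log_post_unnorm(theta)
    for _ in range(n_steps):
        # Gaussian random walk proposal: symmetric, so q cancels in the ratio
        prop = theta + random.gauss(0.0, step)
        lp_prop = log_post_unnorm(prop)
        # Accept with probability min(1, posterior ratio)
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples[n_steps // 4:]  # discard burn-in

samples = metropolis_hastings()
mh_mean = sum(samples) / len(samples)
print(f"MH estimate {mh_mean:.3f} vs exact posterior mean {8/12:.3f}")
```

Working in log space avoids underflow, and because the proposal is symmetric the \(q\) terms drop out, leaving only the (unnormalised) posterior ratio — the normaliser never appears.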

Why does MH converge to the correct distribution?

The chain satisfies detailed balance: \(\pi(\theta)\,T(\theta^*|\theta) = \pi(\theta^*)\,T(\theta|\theta^*)\), where \(\pi = P(\theta|\mathcal{D})\) is the target and \(T\) is the transition kernel. Detailed balance ensures \(\pi\) is the stationary distribution of the chain.

5.1 Gibbs Sampling

When \(\theta = (\theta_1,\ldots,\theta_K)\) is multivariate but each full conditional \(P(\theta_j | \theta_{-j}, \mathcal{D})\) is tractable, Gibbs sampling cycles through sampling each variable from its full conditional. It is a special case of MH with acceptance ratio always 1.
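A standard toy example where both full conditionals are exactly Gaussian is the bivariate normal; the sketch below (correlation 0.8 is an illustrative choice) samples each coordinate from its full conditional in turn, with no accept/reject step:

```python
import math
import random

random.seed(1)

def gibbs_bivariate_normal(rho=0.8, n_steps=20000):
    # Target: standard bivariate normal with correlation rho.
    # Full conditionals: x | y ~ N(rho*y, 1 - rho^2), symmetrically for y | x.
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0
    samples = []
    for _ in range(n_steps):
        x = random.gauss(rho * y, sd)  # sample x from its full conditional
        y = random.gauss(rho * x, sd)  # then y, conditioning on the fresh x
        samples.append((x, y))
    return samples[n_steps // 4:]  # discard burn-in

samples = gibbs_bivariate_normal()
# E[xy] estimates the correlation, since both marginals are standard normal
corr = sum(x * y for x, y in samples) / len(samples)
print(f"estimated correlation {corr:.3f}")
```

The stronger the correlation between \(\theta_1\) and \(\theta_2\), the more slowly this coordinate-wise scheme explores the target — a well-known limitation of Gibbs sampling.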

Cross-Disciplinary Connection

This Bayesian framework is the same one used to model musical expectations in the Music & Mathematics: Neuroscience & Perception chapter. The brain maintains a prior over upcoming notes and updates it with each heard sound — exactly the Beta-Bernoulli update derived above, with musical context playing the role of \(\alpha_0, \beta_0\).

6. Python Simulation: Bayesian Coin Flip

We simulate a coin with true bias \(p=0.65\) and watch how three different Beta priors converge to the true value as evidence accumulates. We also implement Metropolis-Hastings and compare its histogram to the exact Beta posterior, and plot the Beta-Binomial predictive distribution.

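A minimal, plot-free sketch of the first part of this experiment — three Beta priors updated on the same simulated flips (the specific priors and the sample size of 500 are illustrative choices):

```python
import random

random.seed(42)

TRUE_P = 0.65
flips = [1 if random.random() < TRUE_P else 0 for _ in range(500)]

# Three illustrative priors: flat, tails-leaning, heads-leaning
priors = {
    "flat Beta(1,1)": (1, 1),
    "tails-leaning Beta(2,8)": (2, 8),
    "heads-leaning Beta(8,2)": (8, 2),
}

h, n = sum(flips), len(flips)
results = {}
for name, (a0, b0) in priors.items():
    # Conjugate update: posterior mean (a0 + h) / (a0 + b0 + n)
    results[name] = (a0 + h) / (a0 + b0 + n)
    print(f"{name}: posterior mean after {n} flips = {results[name]:.3f}")
```

After 500 flips all three posterior means cluster near the true bias 0.65: with enough evidence, the data swamps the prior's pseudo-counts.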