Chapter 13: Bayesian Inference

Bayesian inference is a principled framework for learning from data: instead of a single parameter estimate, we maintain a full probability distribution over hypotheses, updating it as evidence arrives. Every prior belief, every observation, and every prediction is part of a coherent calculus of uncertainty.

1. Bayes' Theorem as a Learning Rule

Let \(\theta\) be a parameter (e.g. the bias of a coin) and \(\mathcal{D} = \{x_1,\ldots,x_n\}\) be observed data. Bayes' theorem states:

\[ P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} \]
| Term | Name | Meaning |
|---|---|---|
| \(P(\theta \mid \mathcal{D})\) | Posterior | Updated belief about \(\theta\) after seeing the data |
| \(P(\mathcal{D} \mid \theta)\) | Likelihood | How probable the data are under parameter \(\theta\) |
| \(P(\theta)\) | Prior | Initial belief about \(\theta\) before any data |
| \(P(\mathcal{D})\) | Evidence | Normalising constant: \(\int P(\mathcal{D}\mid\theta)\,P(\theta)\,d\theta\) |

The evidence \(P(\mathcal{D})\) is typically intractable for continuous \(\theta\) — this is why approximations like MCMC and variational inference are needed. For conjugate priors, the normaliser cancels analytically.

2. Conjugate Priors

A prior \(P(\theta)\) is conjugate to a likelihood \(P(\mathcal{D}|\theta)\) if the posterior is in the same distributional family as the prior. Conjugacy makes Bayesian updating a simple parameter update.

2.1 Beta-Bernoulli (Coin Flipping)

For coin flips \(x_i \in \{0,1\}\) with unknown bias \(p\), the likelihood of \(h\) heads in \(n\) flips is:

\[ P(\mathcal{D} \mid p) = p^h (1-p)^{n-h} \]

Choose a Beta prior: \(P(p) = \mathrm{Beta}(\alpha_0, \beta_0) \propto p^{\alpha_0-1}(1-p)^{\beta_0-1}\). Then:

\[ P(p \mid \mathcal{D}) \propto p^h (1-p)^{n-h} \cdot p^{\alpha_0-1}(1-p)^{\beta_0-1} = p^{(\alpha_0+h)-1}(1-p)^{(\beta_0+n-h)-1} \]

\[ \Rightarrow \quad P(p \mid \mathcal{D}) = \mathrm{Beta}(\alpha_0 + h,\; \beta_0 + (n-h)) \]

The hyper-parameters \(\alpha_0, \beta_0\) act as pseudo-counts: \(\alpha_0\) is like having seen \(\alpha_0-1\) extra heads before any real data. The posterior mean is \(\hat{p} = \frac{\alpha_0+h}{\alpha_0+\beta_0+n}\), a weighted average between the prior mean \(\frac{\alpha_0}{\alpha_0+\beta_0}\) and the MLE \(\frac{h}{n}\).
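To make the pseudo-count interpretation concrete, here is a minimal sketch of the conjugate update. The numbers (7 heads in 10 flips, a uniform Beta(1, 1) prior) are illustrative choices, not data from the chapter:

```python
# Illustrative data and prior: 7 heads in 10 flips, uniform Beta(1, 1) prior
alpha0, beta0 = 1.0, 1.0
n, h = 10, 7

# Conjugate update: add observed heads/tails to the pseudo-counts
alpha_n = alpha0 + h
beta_n = beta0 + (n - h)

prior_mean = alpha0 / (alpha0 + beta0)    # 0.5
mle = h / n                               # 0.7
post_mean = alpha_n / (alpha_n + beta_n)  # weighted average of the two

print(f"prior mean {prior_mean:.3f}, MLE {mle:.3f}, posterior mean {post_mean:.3f}")
```

With only 10 flips the posterior mean (\(8/12 \approx 0.667\)) is pulled noticeably toward the prior; as \(n\) grows, the MLE term dominates.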

2.2 Normal-Normal (Known Variance)

Suppose observations \(x_i \sim \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\), and prior \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\). The likelihood for \(n\) observations is:

\[ \log P(\mathcal{D} \mid \mu) = -\frac{n}{2\sigma^2}\left(\bar{x} - \mu\right)^2 + \text{const} \]

Adding the log-prior \(-\frac{1}{2\tau_0^2}(\mu-\mu_0)^2\) and completing the square in \(\mu\):

\[ \log P(\mu \mid \mathcal{D}) \propto -\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\right)\left(\mu - \mu_n\right)^2 \]

\[ \mu_n = \frac{\frac{n}{\sigma^2}\bar{x} + \frac{1}{\tau_0^2}\mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}}, \qquad \frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \]

So the posterior is \(\mu \mid \mathcal{D} \sim \mathcal{N}(\mu_n, \tau_n^2)\). The posterior mean \(\mu_n\) is a precision-weighted average of the prior mean and the MLE \(\bar{x}\). As \(n \to \infty\), the data dominates and \(\mu_n \to \bar{x}\).
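As a quick numeric check of these formulas, a sketch of the precision-weighted update. All the numbers here (\(\sigma^2\), \(\mu_0\), \(\tau_0^2\), \(n\), \(\bar{x}\)) are made-up illustrative values:

```python
import math

sigma2 = 4.0             # known observation variance (illustrative)
mu0, tau0_2 = 0.0, 1.0   # prior mean and variance (illustrative)
n, xbar = 25, 1.8        # sample size and sample mean (illustrative)

prec_data = n / sigma2        # precision contributed by the data
prec_prior = 1.0 / tau0_2     # precision contributed by the prior

tau_n2 = 1.0 / (prec_data + prec_prior)                # posterior variance
mu_n = tau_n2 * (prec_data * xbar + prec_prior * mu0)  # precision-weighted mean

print(f"posterior: N({mu_n:.3f}, {math.sqrt(tau_n2):.3f}^2)")
```

Because the data precision \(n/\sigma^2 = 6.25\) exceeds the prior precision \(1\), the posterior mean lands much closer to \(\bar{x}\) than to \(\mu_0\).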

3. Bayesian Updating Diagram

[Diagram: the Bayesian updating cycle. The prior P(θ) (initial beliefs before data) combines with the likelihood P(D|θ) of observed data D to yield the posterior P(θ|D) (updated beliefs after data), via Posterior ∝ Likelihood × Prior. The posterior then becomes the prior for the next update.]

4. Predictive Distribution

Rather than plugging in a point estimate of \(\theta\), Bayesian prediction integrates over the posterior — this accounts for parameter uncertainty:

\[ P(x_{\rm new} \mid \mathcal{D}) = \int P(x_{\rm new} \mid \theta)\, P(\theta \mid \mathcal{D})\, d\theta \]

For the Beta-Bernoulli model with posterior \(\mathrm{Beta}(\alpha_n, \beta_n)\), the probability of the next flip being heads is simply \(\mathbb{E}[p \mid \mathcal{D}] = \frac{\alpha_n}{\alpha_n+\beta_n}\). For predicting \(k\) heads in the next \(m\) flips, the result is the Beta-Binomial distribution:

\[ P(k \mid m, \mathcal{D}) = \binom{m}{k} \frac{B(\alpha_n+k,\, \beta_n+m-k)}{B(\alpha_n,\beta_n)} \]
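The Beta function in this formula can be evaluated stably through log-gamma. A sketch using only the standard library (the posterior Beta(8, 4) is an illustrative choice, e.g. a uniform prior plus 7 heads and 3 tails):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log B(a, b) = log Γ(a) + log Γ(b) - log Γ(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, m, alpha_n, beta_n):
    # P(k heads in next m flips | posterior Beta(alpha_n, beta_n))
    return comb(m, k) * exp(
        log_beta(alpha_n + k, beta_n + m - k) - log_beta(alpha_n, beta_n)
    )

alpha_n, beta_n = 8.0, 4.0  # illustrative posterior
m = 5
pmf = [beta_binomial_pmf(k, m, alpha_n, beta_n) for k in range(m + 1)]
print([f"{p:.3f}" for p in pmf])
```

A useful sanity check: the PMF sums to 1 and its mean is \(m\,\alpha_n/(\alpha_n+\beta_n)\), the posterior-mean heads probability times the number of flips.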

5. MCMC: Metropolis-Hastings

For complex posteriors where conjugacy fails, Markov Chain Monte Carlo (MCMC) draws samples from \(P(\theta|\mathcal{D})\) without needing to compute the normaliser. The Metropolis-Hastings algorithm works as follows:

  1. Start at some \(\theta^{(0)}\).
  2. At step \(t\), propose \(\theta^* \sim q(\theta^* \mid \theta^{(t-1)})\) (e.g., Gaussian random walk).
  3. Compute the acceptance ratio:
    \[ \alpha = \min\!\left(1,\; \frac{P(\theta^* \mid \mathcal{D})\, q(\theta^{(t-1)} \mid \theta^*)}{P(\theta^{(t-1)} \mid \mathcal{D})\, q(\theta^* \mid \theta^{(t-1)})}\right) \]
    Note: since \(P(\theta^*|\mathcal{D}) \propto P(\mathcal{D}|\theta^*)P(\theta^*)\), the normaliser cancels in the ratio.
  4. Accept \(\theta^{(t)} = \theta^*\) with probability \(\alpha\); otherwise stay: \(\theta^{(t)} = \theta^{(t-1)}\).
  5. Discard the burn-in period. The remaining samples are approximately distributed according to the posterior; note that successive draws are correlated, not i.i.d., which is why diagnostics such as effective sample size (or thinning) are used in practice.
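The steps above can be sketched for the coin-flip posterior, using only its unnormalised log density (the data — 7 heads in 10 flips with a uniform prior — and the step size are illustrative choices):

```python
import math
import random

random.seed(0)

def log_post_unnorm(p, h=7, n=10, a0=1.0, b0=1.0):
    # Unnormalised log posterior of a Beta(a0+h, b0+n-h); -inf outside (0, 1)
    if not 0.0 < p < 1.0:
        return float("-inf")
    return (h + a0 - 1) * math.log(p) + (n - h + b0 - 1) * math.log(1 - p)

def metropolis_hastings(n_steps=20000, step=0.1, init=0.5):
    samples, theta = [], init
    lp = log_post_unnorm(theta)
    for _ in range(n_steps):
        # Gaussian random walk proposal: symmetric, so q cancels in the ratio
        prop = theta + random.gauss(0.0, step)
        lp_prop = log_post_unnorm(prop)
        # Accept with probability min(1, posterior ratio)
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples[n_steps // 4:]  # discard burn-in

samples = metropolis_hastings()
mh_mean = sum(samples) / len(samples)
print(f"MH estimate {mh_mean:.3f} vs exact posterior mean {8/12:.3f}")
```

Working in log space avoids underflow, and because the proposal is symmetric the \(q\) terms drop out, leaving only the (unnormalised) posterior ratio — the normaliser never appears.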

Why does MH converge to the correct distribution?

The chain satisfies detailed balance: \(\pi(\theta)\,T(\theta^*|\theta) = \pi(\theta^*)\,T(\theta|\theta^*)\), where \(\pi = P(\theta|\mathcal{D})\) is the target and \(T\) is the transition kernel. Detailed balance ensures \(\pi\) is the stationary distribution of the chain.

5.1 Gibbs Sampling

When \(\theta = (\theta_1,\ldots,\theta_K)\) is multivariate but each full conditional \(P(\theta_j | \theta_{-j}, \mathcal{D})\) is tractable, Gibbs sampling cycles through sampling each variable from its full conditional. It is a special case of MH with acceptance ratio always 1.
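A standard toy example where both full conditionals are exactly Gaussian is the bivariate normal; the sketch below (correlation 0.8 is an illustrative choice) samples each coordinate from its full conditional in turn, with no accept/reject step:

```python
import math
import random

random.seed(1)

def gibbs_bivariate_normal(rho=0.8, n_steps=20000):
    # Target: standard bivariate normal with correlation rho.
    # Full conditionals: x | y ~ N(rho*y, 1 - rho^2), symmetrically for y | x.
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0
    samples = []
    for _ in range(n_steps):
        x = random.gauss(rho * y, sd)  # sample x from its full conditional
        y = random.gauss(rho * x, sd)  # then y, conditioning on the fresh x
        samples.append((x, y))
    return samples[n_steps // 4:]  # discard burn-in

samples = gibbs_bivariate_normal()
# E[xy] estimates the correlation, since both marginals are standard normal
corr = sum(x * y for x, y in samples) / len(samples)
print(f"estimated correlation {corr:.3f}")
```

The stronger the correlation between \(\theta_1\) and \(\theta_2\), the more slowly this coordinate-wise scheme explores the target — a well-known limitation of Gibbs sampling.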

Cross-Disciplinary Connection

This Bayesian framework is the same one used to model musical expectations in the Music & Mathematics: Neuroscience & Perception chapter. The brain maintains a prior over upcoming notes and updates it with each heard sound — exactly the Beta-Bernoulli update derived above, with musical context playing the role of \(\alpha_0, \beta_0\).

6. Python Simulation: Bayesian Coin Flip

We simulate a coin with true bias \(p=0.65\) and watch how three different Beta priors converge to the true value as evidence accumulates. We also implement Metropolis-Hastings and compare its histogram to the exact Beta posterior, and plot the Beta-Binomial predictive distribution.

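A minimal, plot-free sketch of the first part of this experiment — three Beta priors updated on the same simulated flips (the specific priors and the sample size of 500 are illustrative choices):

```python
import random

random.seed(42)

TRUE_P = 0.65
flips = [1 if random.random() < TRUE_P else 0 for _ in range(500)]

# Three illustrative priors: flat, tails-leaning, heads-leaning
priors = {
    "flat Beta(1,1)": (1, 1),
    "tails-leaning Beta(2,8)": (2, 8),
    "heads-leaning Beta(8,2)": (8, 2),
}

h, n = sum(flips), len(flips)
results = {}
for name, (a0, b0) in priors.items():
    # Conjugate update: posterior mean (a0 + h) / (a0 + b0 + n)
    results[name] = (a0 + h) / (a0 + b0 + n)
    print(f"{name}: posterior mean after {n} flips = {results[name]:.3f}")
```

After 500 flips all three posterior means cluster near the true bias 0.65: with enough evidence, the data swamps the prior's pseudo-counts.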