Chapter 15: Variational Inference

When the posterior \(P(\mathbf{z}|\mathbf{x})\) is intractable — as in almost all deep latent variable models — we replace exact inference with an optimisation problem: find the member of a tractable family \(q(\mathbf{z})\) closest to the true posterior in KL divergence.

1. The Intractability Problem

In a latent variable model with observations \(\mathbf{x}\) and latent variables \(\mathbf{z}\), we want the posterior:

\[ P(\mathbf{z} \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \mathbf{z})\,P(\mathbf{z})}{P(\mathbf{x})}, \qquad P(\mathbf{x}) = \int P(\mathbf{x} \mid \mathbf{z})\,P(\mathbf{z})\,d\mathbf{z} \]

The marginal \(P(\mathbf{x})\) requires integrating over all configurations of \(\mathbf{z}\): a sum that grows exponentially with dimension for discrete latents, and an integral with no closed form in general for continuous ones. Variational inference turns this intractable integration into optimisation.

2. The ELBO: Full Derivation

Introduce a variational distribution \(q(\mathbf{z})\). Multiply and divide the integrand by \(q(\mathbf{z})\) so the integral becomes an expectation under \(q\):

\[ \log P(\mathbf{x}) = \log \int P(\mathbf{x}, \mathbf{z})\,d\mathbf{z} = \log \int q(\mathbf{z})\frac{P(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\,d\mathbf{z} \]

Apply Jensen's inequality (\(\log\) is concave, so \(\log \mathbb{E}[X] \geq \mathbb{E}[\log X]\)):

\[ \log P(\mathbf{x}) \;\geq\; \underbrace{\mathbb{E}_{q(\mathbf{z})}\!\left[\log \frac{P(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\right]}_{\mathcal{L}(q) \;=\; \text{ELBO}} \]

To see the gap, expand using the KL divergence:

\begin{align} \log P(\mathbf{x}) &= \mathbb{E}_{q}\!\left[\log \frac{P(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\right] + \mathbb{E}_{q}\!\left[\log \frac{q(\mathbf{z})}{P(\mathbf{z} \mid \mathbf{x})}\right] \\ &= \mathcal{L}(q) + \mathrm{KL}(q(\mathbf{z}) \| P(\mathbf{z} \mid \mathbf{x})) \end{align}

Since \(\mathrm{KL} \geq 0\) with equality iff \(q = P(\mathbf{z}|\mathbf{x})\), maximising the ELBO is equivalent to minimising the KL to the true posterior. Rewrite the ELBO:

\[ \mathcal{L}(q) = \underbrace{\mathbb{E}_{q(\mathbf{z})}\!\left[\log P(\mathbf{x} \mid \mathbf{z})\right]}_{\text{reconstruction}} - \underbrace{\mathrm{KL}(q(\mathbf{z}) \| P(\mathbf{z}))}_{\text{regularisation}} \]

This decomposition is the VAE objective from Chapter 12: the first term rewards fitting the data; the second penalises the variational distribution for deviating from the prior.
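The identity \(\log P(\mathbf{x}) = \mathcal{L}(q) + \mathrm{KL}(q \| P(\mathbf{z}|\mathbf{x}))\) can be checked numerically in a fully conjugate toy model where every term has a closed form. A sketch, assuming the illustrative model \(z \sim \mathcal{N}(0,1)\), \(x|z \sim \mathcal{N}(z,1)\), whose posterior is \(\mathcal{N}(x/2, 1/2)\) and whose evidence is \(\mathcal{N}(x; 0, 2)\):

```python
import numpy as np

x = 2.0                      # single observation
m, s2 = 0.3, 0.7             # arbitrary variational parameters: q(z) = N(m, s2)

# Model: z ~ N(0, 1), x | z ~ N(z, 1)  =>  posterior N(x/2, 1/2), evidence N(x; 0, 2)
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

# ELBO = E_q[log p(x|z)] - KL(q || prior), both available in closed form here
expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m)**2 + s2)
kl_q_prior = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
elbo = expected_loglik - kl_q_prior

# KL(q || posterior) for two univariate Gaussians
mp, sp2 = x / 2.0, 0.5
kl_q_post = 0.5 * (np.log(sp2 / s2) + (s2 + (m - mp)**2) / sp2 - 1.0)

print(elbo + kl_q_post, log_px)   # the two numbers coincide for any (m, s2)
```

Changing \(m\) or \(s_2\) moves the ELBO and the KL in opposite directions while their sum stays pinned at \(\log P(x)\), which is exactly why maximising one minimises the other.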

3. ELBO Decomposition Diagram

[Diagram: \(\log P(\mathbf{x}) = \mathcal{L}(q) + \mathrm{KL}(q \| p(\mathbf{z}|\mathbf{x}))\); the ELBO splits into a reconstruction term \(\mathbb{E}_q[\log p(\mathbf{x}|\mathbf{z})]\) minus a regularisation term \(\mathrm{KL}(q(\mathbf{z}) \| p(\mathbf{z}))\), so maximising the ELBO is equivalent to minimising the KL to the posterior.]

4. Mean-Field Approximation & Coordinate Ascent VI

The mean-field family assumes the latent variables factorise:

\[ q(\mathbf{z}) = \prod_{j=1}^{J} q_j(z_j) \]

To find the optimal \(q_j^*(z_j)\) holding all others fixed, take the functional derivative of the ELBO with respect to \(q_j\) and set it to zero. Because \(\log q(\mathbf{z}) = \sum_j \log q_j(z_j)\), the expectation splits across factors and the stationarity condition gives:

\[ \log q_j^*(z_j) = \mathbb{E}_{q_{-j}}\!\left[\log P(\mathbf{x}, \mathbf{z})\right] + \text{const} \]

This is the expectation of the log-joint over all variables except \(z_j\). The normalising constant is determined by requiring \(\int q_j^*(z_j)\,dz_j = 1\). Coordinate Ascent VI (CAVI) cycles through all \(j\), updating each \(q_j\) while holding others fixed — guaranteed to increase the ELBO at each step.

4.1 Application: Bayesian Linear Regression

With prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}I)\) and likelihood \(\mathbf{y} \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \beta^{-1}I)\), the log-joint is:

\[ \log P(\mathbf{y}, \mathbf{w}) = -\frac{\beta}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 - \frac{\alpha}{2}\|\mathbf{w}\|^2 + \text{const} \]

This is quadratic in \(\mathbf{w}\), so the optimal \(q^*(\mathbf{w}) = \mathcal{N}(\mathbf{m}_N, \mathbf{S}_N)\) with:

\[ \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X}, \qquad \mathbf{m}_N = \beta\mathbf{S}_N\mathbf{X}^\top\mathbf{y} \]

This matches the exact posterior — a special property of Gaussian models. For non-Gaussian models, VI provides an approximation.

5. Amortised Inference & VAEs

Classical VI optimises \(q(\mathbf{z})\) separately for each data point — \(O(N)\) separate optimisations. Amortised inference trains a neural network \(q_\phi(\mathbf{z}|\mathbf{x})\) (the encoder) to predict variational parameters directly from \(\mathbf{x}\):

\[ \mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})\right) \]

This is exactly the VAE objective from Chapter 12. The encoder \(q_\phi\) and decoder \(p_\theta\) are trained jointly by maximising the ELBO via the reparameterisation trick: sample \(\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}\), \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)\), so that gradients propagate through the sampling step.
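A minimal NumPy sketch of the reparameterised gradient estimator, for the toy objective \(\mathbb{E}_{q}[z^2]\) with \(q = \mathcal{N}(\mu, \sigma^2)\) (chosen because its exact gradients, \(2\mu\) and \(2\sigma\), are known):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.5
eps = rng.standard_normal(200_000)    # noise is sampled once, held fixed

# Reparameterise z = mu + sigma * eps; for f(z) = z^2 the chain rule gives
#   d f / d mu    = 2 z
#   d f / d sigma = 2 z * eps
z = mu + sigma * eps
grad_mu = np.mean(2 * z)              # Monte Carlo estimate of 2*mu    = 1.0
grad_sigma = np.mean(2 * z * eps)     # Monte Carlo estimate of 2*sigma = 3.0
print(grad_mu, grad_sigma)
```

The same pattern, with \(f\) replaced by the ELBO integrand and the derivative taken by autodiff, is what a VAE framework computes at every training step.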

5.1 Expectation Propagation (brief overview)

EP minimises \(\mathrm{KL}(P(\mathbf{z}|\mathbf{x}) \| q(\mathbf{z}))\) (locally, one factor at a time): the reverse of the VI objective. This inclusive KL pushes \(q\) to cover all regions where the posterior has mass, whereas VI's exclusive KL is mode-seeking and tends to collapse onto a single mode. EP is therefore often preferred when broad coverage of the posterior matters, though with a unimodal \(q\) the inclusive objective can place mass between well-separated modes.
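The two KL directions can be compared directly by fitting a single Gaussian to a bimodal target with a grid search. A sketch, with an illustrative equal mixture of \(\mathcal{N}(-2, 0.5^2)\) and \(\mathcal{N}(2, 0.5^2)\) as the "posterior":

```python
import numpy as np

zs = np.linspace(-8, 8, 2001)
dz = zs[1] - zs[0]

def normal(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def kl(a, b):
    # grid estimate of KL(a || b); 0 log 0 is treated as 0, tiny b is floored
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / np.maximum(b[mask], 1e-300))) * dz

# Bimodal "posterior": equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)
p = 0.5 * normal(zs, -2, 0.5) + 0.5 * normal(zs, 2, 0.5)

best_rev = (np.inf, 0.0, 0.0)
best_fwd = (np.inf, 0.0, 0.0)
for m in np.linspace(-3, 3, 61):
    for s in np.linspace(0.3, 3, 28):
        q = normal(zs, m, s)
        best_rev = min(best_rev, (kl(q, p), m, s))  # exclusive KL: VI
        best_fwd = min(best_fwd, (kl(p, q), m, s))  # inclusive KL: EP-style

print("reverse KL (VI) picks m=%.1f, s=%.1f" % best_rev[1:])  # one mode
print("forward KL (EP) picks m=%.1f, s=%.1f" % best_fwd[1:])  # spans both
```

The exclusive fit sits tightly on one mode; the inclusive fit centres between the modes with a large variance (moment matching), putting mass where the target has almost none.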

6. Python Simulation: VI vs Exact Posterior

We implement CAVI for Bayesian linear regression with a quadratic feature basis and compare the variational posterior against the exact Gaussian posterior. Treating \(\mathbf{w}\) as a single Gaussian block, the update recovers the exact posterior; a fully factorised (mean-field) \(q\) still recovers the exact posterior mean, though it underestimates the marginal variances when the features are correlated. Either way, the example illustrates that VI is exact when the variational family contains the true posterior.
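A compact sketch of the simulation, with illustrative data and precision values \(\alpha, \beta\). Coordinate-wise CAVI on the factorised \(q\) amounts to Gauss-Seidel sweeps on \(\mathbf{S}_N^{-1}\mathbf{m} = \beta\mathbf{X}^\top\mathbf{y}\), so the variational means converge to the exact posterior mean \(\mathbf{m}_N\):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 1.0, 25.0                       # prior / noise precisions

# Data from a quadratic, with features Phi = [1, x, x^2]
x = rng.uniform(-1, 1, 60)
y = 0.5 - x + 2 * x**2 + rng.normal(0, 0.2, x.size)
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Exact posterior: S_N^{-1} = alpha I + beta Phi^T Phi, m_N = beta S_N Phi^T y
A = alpha * np.eye(3) + beta * Phi.T @ Phi
b = beta * Phi.T @ y
m_exact = np.linalg.solve(A, b)

# Mean-field CAVI: each coordinate update takes E_{-j}[log joint], a Gaussian
# with precision A[j,j]; the mean updates are Gauss-Seidel sweeps on A m = b
m = np.zeros(3)
for _ in range(200):
    for j in range(3):
        m[j] = (b[j] - A[j] @ m + A[j, j] * m[j]) / A[j, j]

print("CAVI mean: ", m)
print("exact mean:", m_exact)                 # the means coincide
print("CAVI vars:", 1 / np.diag(A), "vs exact:", np.diag(np.linalg.inv(A)))
```

The printed variances show the familiar mean-field shrinkage: \(1/A_{jj}\) is never larger than the exact marginal variance \((A^{-1})_{jj}\).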
