Chapter 14: Gaussian Processes

A Gaussian Process (GP) is a distribution over functions: instead of a finite parameter vector, it places a prior over the entire function space. With a Gaussian noise model, GP regression yields an exact Bayesian posterior with analytic uncertainty estimates, making it the gold standard for small-data, high-uncertainty settings.

1. Multivariate Gaussian Conditioning

The GP posterior derivation rests on conditioning a multivariate Gaussian. Suppose:

\[ \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \]

Then the conditional distribution is:

\[ \mathbf{x}_1 \mid \mathbf{x}_2 \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_{1|2},\; \Sigma_{1|2}\right) \]

\[ \boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \]

\[ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \]

Derivation sketch: Complete the square in the joint Gaussian density with respect to \(\mathbf{x}_1\) while treating \(\mathbf{x}_2\) as fixed. The resulting quadratic form identifies both the conditional mean (linear in \(\mathbf{x}_2\)) and the conditional covariance (independent of \(\mathbf{x}_2\)).
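These formulas are easy to sanity-check numerically. The sketch below (illustrative constants, not from the text) compares the analytic conditional mean and variance against a Monte Carlo estimate, obtained by drawing from the joint Gaussian and keeping only samples whose \(x_2\) falls in a narrow window around the conditioning value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint Gaussian over (x1, x2); both scalar here for clarity.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Conditioning formulas: x1 | x2 = b
b = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (b - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Monte Carlo check: sample the joint, keep draws with x2 close to b.
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 1] - b) < 0.01, 0]

print(f"analytic:  mean={mu_cond:.3f}, var={var_cond:.3f}")
print(f"empirical: mean={near.mean():.3f}, var={near.var():.3f}")
```

The empirical moments should match the analytic ones to within Monte Carlo error, and the conditional variance is the same whatever value \(b\) takes, as the derivation sketch predicts.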

2. Gaussian Process Definition

A Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. It is fully specified by a mean function and a covariance (kernel) function:

\[ f \sim \mathcal{GP}(m(\mathbf{x}),\; k(\mathbf{x}, \mathbf{x}')) \]

For any finite set of inputs \(\mathbf{X} = \{x_1,\ldots,x_n\}\), the function values are jointly Gaussian: \(\mathbf{f} \sim \mathcal{N}(\mathbf{m},\mathbf{K})\) where \(K_{ij} = k(x_i, x_j)\).

2.1 Kernel Functions

RBF (Squared Exponential) Kernel
\[ k_{\rm RBF}(x, x') = \sigma^2 \exp\!\left(-\frac{\|x-x'\|^2}{2\ell^2}\right) \]

Infinitely differentiable, so it models very smooth functions. \(\ell\) controls the length scale (how quickly the correlation decays); \(\sigma^2\) controls the output variance.

MatΓ©rn-3/2 Kernel
\[ k_{3/2}(r) = \sigma^2\!\left(1 + \frac{\sqrt{3}\,r}{\ell}\right)\exp\!\left(-\frac{\sqrt{3}\,r}{\ell}\right), \quad r = \|x - x'\| \]

Once (mean-square) differentiable, so it is rougher than the RBF and often more realistic for physical processes and spatial data.

Periodic Kernel
\[ k_{\rm per}(x, x') = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right) \]

Models periodic functions with period \(p\); useful for time series with daily or seasonal patterns.
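The three kernels above take only a few lines of NumPy each. The sketch below (helper names are my own) also checks the defining property of a valid kernel, namely that every Gram matrix is symmetric positive semi-definite:

```python
import numpy as np

def rbf(x, xp, sigma=1.0, ell=1.0):
    """Squared-exponential kernel on 1-D inputs (column-vs-row broadcast)."""
    d2 = (x[:, None] - xp[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2 * ell**2))

def matern32(x, xp, sigma=1.0, ell=1.0):
    """Matern-3/2 kernel: once-differentiable sample paths."""
    r = np.abs(x[:, None] - xp[None, :])
    a = np.sqrt(3) * r / ell
    return sigma**2 * (1 + a) * np.exp(-a)

def periodic(x, xp, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel with period p."""
    r = np.abs(x[:, None] - xp[None, :])
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / p) ** 2 / ell**2)

x = np.linspace(0, 5, 50)
for name, k in [("RBF", rbf), ("Matern-3/2", matern32), ("Periodic", periodic)]:
    K = k(x, x)
    # A valid kernel must give a symmetric PSD Gram matrix, so the
    # smallest eigenvalue should be nonnegative up to rounding error.
    eigmin = np.linalg.eigvalsh(K).min()
    print(f"{name:10s} min eigenvalue = {eigmin:.2e}")
```

Any nonnegative combination of valid kernels (sums, products) is again valid, which is how structured priors (e.g. periodic-plus-trend) are built in practice.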

3. GP Regression: Full Posterior Derivation

Given training data \(\mathcal{D} = \{(\mathbf{X}, \mathbf{y})\}\) with noise model \(y_i = f(x_i) + \varepsilon_i\), \(\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)\), we want the posterior over function values \(\mathbf{f}_* = f(\mathbf{X}_*)\) at test points \(\mathbf{X}_*\).

The joint distribution of the noisy training observations \(\mathbf{y}\) and the test function values \(\mathbf{f}_*\) under the prior is:

\[ \begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix} K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I & K(\mathbf{X},\mathbf{X}_*) \\ K(\mathbf{X}_*,\mathbf{X}) & K(\mathbf{X}_*,\mathbf{X}_*) \end{pmatrix}\right) \]

Applying the conditional Gaussian formulas with \(\Sigma_{22} = K(\mathbf{X},\mathbf{X})+\sigma_n^2 I\), \(\Sigma_{12} = K(\mathbf{X}_*,\mathbf{X})\):

\[ \mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y} \;\sim\; \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*) \]

\[ \boldsymbol{\mu}_* = K(\mathbf{X}_*,\mathbf{X})\underbrace{\left[K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I\right]^{-1} \mathbf{y}}_{\boldsymbol{\alpha}} \]

\[ \boldsymbol{\Sigma}_* = K(\mathbf{X}_*,\mathbf{X}_*) - K(\mathbf{X}_*,\mathbf{X})\left[K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I\right]^{-1}K(\mathbf{X},\mathbf{X}_*) \]

The vector \(\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}\) is computed once. The posterior mean \(\mu_*(x_*) = \mathbf{k}(x_*,\mathbf{X})\boldsymbol{\alpha}\) is a kernel-weighted sum over training points: a non-parametric function that passes close to the data.

Computational note

Naive inversion is \(O(n^3)\). In practice use the Cholesky factorisation \(K + \sigma_n^2 I = LL^\top\): solve \(L\mathbf{z} = \mathbf{y}\) and then \(L^\top\boldsymbol{\alpha} = \mathbf{z}\), each a triangular solve costing \(O(n^2)\). The factorisation itself is still \(O(n^3)\) but is numerically stable. Prediction then costs \(O(n)\) per test point for the mean and \(O(n^2)\) for the variance. Memory is \(O(n^2)\).
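A minimal Cholesky-based implementation of the posterior equations might look like the following sketch; the RBF kernel, the data-generating function, and the hyperparameter values are illustrative assumptions, and `gp_posterior` is a name of my own choosing:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(a, b, sigma=1.0, ell=0.7):
    """Squared-exponential kernel on 1-D inputs."""
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def gp_posterior(X, y, Xs, sigma_n=0.1):
    """Posterior mean and covariance at test inputs Xs via Cholesky."""
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    Ks = rbf(X, Xs)                        # K(X, X*)
    Kss = rbf(Xs, Xs)                      # K(X*, X*)
    L = cholesky(K, lower=True)            # K + sigma_n^2 I = L L^T
    # Two triangular solves give alpha = (K + sigma_n^2 I)^{-1} y.
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    mu = Ks.T @ alpha                      # posterior mean
    V = solve_triangular(L, Ks, lower=True)
    cov = Kss - V.T @ V                    # posterior covariance
    return mu, cov

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, 9)
y = np.sin(X) + 0.1 * rng.standard_normal(9)
Xs = np.linspace(0, 5, 100)
mu, cov = gp_posterior(X, y, Xs)
print("max |posterior mean - sin|:", np.abs(mu - np.sin(Xs)).max())
```

Note that `cov` is obtained without ever forming the inverse explicitly: the intermediate `V` satisfies \(V^\top V = K_*^\top (K+\sigma_n^2 I)^{-1} K_*\).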

3.1 Marginal Likelihood for Hyperparameter Optimisation

The log marginal likelihood (integrating out \(\mathbf{f}\)) has a closed form:

\[ \log P(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top(K+\sigma_n^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log|K+\sigma_n^2 I| - \frac{n}{2}\log 2\pi \]

The first term rewards data fit; the second penalises model complexity (the log-determinant grows with model flexibility). Maximising this w.r.t. the kernel hyperparameters \(\boldsymbol{\theta} = (\sigma^2, \ell, \sigma_n^2)\) by gradient ascent (in practice, minimising the negative log marginal likelihood) automatically balances fit and complexity; no cross-validation is needed.
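As a sketch of hyperparameter selection, the closed-form expression can simply be evaluated on a grid over \(\ell\) (a stand-in for the gradient-based optimisation described above; the data and constants here are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(a, b, sigma=1.0, ell=1.0):
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def log_marginal_likelihood(X, y, ell, sigma=1.0, sigma_n=0.1):
    n = len(X)
    K = rbf(X, X, sigma, ell) + sigma_n**2 * np.eye(n)
    c, low = cho_factor(K, lower=True)
    alpha = cho_solve((c, low), y)          # (K + sigma_n^2 I)^{-1} y
    # log|K + sigma_n^2 I| = 2 * sum(log diag(L)) from the Cholesky factor
    logdet = 2 * np.sum(np.log(np.diag(c)))
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)

# Grid search over the length scale in place of gradient-based optimisation.
ells = np.logspace(-1, 1, 25)
lmls = np.array([log_marginal_likelihood(X, y, ell) for ell in ells])
best = ells[int(np.argmax(lmls))]
print(f"best length scale ~ {best:.2f}")
```

Both extremes should score poorly: a tiny \(\ell\) overfits and is punished by the complexity term, while a huge \(\ell\) underfits and is punished by the data-fit term, so the maximiser lies in between.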

4. GP Prior and Posterior Diagram

[Figure: left panel shows the GP prior \(f \sim \mathcal{GP}(0, k)\) with wide uncertainty everywhere; adding data \(\mathcal{D}\) gives the right panel, the GP posterior \(f \mid \mathcal{D}\), narrow near the data and wide far away. Legend: training data, posterior mean, 95% CI.]

5. Python Simulation: GP Regression from Scratch

We implement GP regression using only NumPy and SciPy. The simulation shows: (1) prior samples from the RBF kernel, (2) the posterior mean and 95% confidence band after observing 9 noisy data points, (3) posterior samples, and (4) a comparison of RBF, MatΓ©rn-3/2, and periodic kernels, each with their log marginal likelihood score.
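The original 135-line script is not reproduced here; a condensed sketch of the same steps under the stated setup (RBF kernel, 9 noisy observations, a Matern-3/2 comparison), with helper names of my own choosing, might look like:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(42)

def rbf(a, b, sigma=1.0, ell=1.0):
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def matern32(a, b, sigma=1.0, ell=1.0):
    r = np.abs(a[:, None] - b[None, :])
    s = np.sqrt(3) * r / ell
    return sigma**2 * (1 + s) * np.exp(-s)

Xs = np.linspace(0, 5, 100)                  # test grid
jitter = 1e-6 * np.eye(len(Xs))              # numerical stabiliser

# (1) Prior samples: f ~ N(0, K(Xs, Xs)), drawn via a Cholesky factor.
L = cholesky(rbf(Xs, Xs) + jitter, lower=True)
prior_samples = L @ rng.standard_normal((100, 3))

# (2) Posterior after 9 noisy observations of sin(x).
X = np.sort(rng.uniform(0, 5, 9))
y = np.sin(X) + 0.1 * rng.standard_normal(9)
s_n = 0.1

def posterior_and_lml(kern):
    K = kern(X, X) + s_n**2 * np.eye(len(X))
    Lk = cholesky(K, lower=True)
    alpha = solve_triangular(Lk.T, solve_triangular(Lk, y, lower=True))
    Ks = kern(X, Xs)
    mu = Ks.T @ alpha
    V = solve_triangular(Lk, Ks, lower=True)
    cov = kern(Xs, Xs) - V.T @ V
    lml = (-0.5 * y @ alpha - np.log(np.diag(Lk)).sum()
           - 0.5 * len(X) * np.log(2 * np.pi))
    return mu, cov, lml

mu, cov, lml_rbf = posterior_and_lml(rbf)
std = np.sqrt(np.clip(np.diag(cov), 0, None))
band = (mu - 1.96 * std, mu + 1.96 * std)    # 95% band

# (3) Posterior samples from N(mu, cov).
Lp = cholesky(cov + jitter, lower=True)
post_samples = mu[:, None] + Lp @ rng.standard_normal((100, 3))

# (4) Kernel comparison via the log marginal likelihood.
_, _, lml_m32 = posterior_and_lml(matern32)
print(f"log ML  RBF: {lml_rbf:.2f}   Matern-3/2: {lml_m32:.2f}")
```

Plotting `Xs` against `mu`, the band, and the three posterior samples reproduces the posterior panel of the diagram above; the small `jitter` term is a standard trick to keep the Cholesky factorisation stable on near-singular covariance matrices.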
