Chapter 14: Gaussian Processes

A Gaussian Process (GP) is a distribution over functions: instead of a finite parameter vector, it places a prior over the entire function space. With a Gaussian noise model, GP regression yields an exact Bayesian posterior with analytic uncertainty estimates, making it the gold standard for small-data, high-uncertainty settings.

1. Multivariate Gaussian Conditioning

The GP posterior derivation rests on conditioning a multivariate Gaussian. Suppose:

\[ \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right) \]

Then the conditional distribution is:

\[ \mathbf{x}_1 \mid \mathbf{x}_2 \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_{1|2},\; \Sigma_{1|2}\right) \]

\[ \boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \]

\[ \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \]

Derivation sketch: Complete the square in the joint Gaussian density with respect to \(\mathbf{x}_1\) while treating \(\mathbf{x}_2\) as fixed. The resulting quadratic form identifies both the conditional mean (linear in \(\mathbf{x}_2\)) and the conditional covariance (independent of \(\mathbf{x}_2\)).
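These formulas are easy to sanity-check numerically. The sketch below (illustrative constants, not from the text) compares the analytic conditional mean and variance against a Monte Carlo estimate, obtained by drawing from the joint Gaussian and keeping only samples whose \(x_2\) falls in a narrow window around the conditioning value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint Gaussian over (x1, x2); both scalar here for clarity.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Conditioning formulas: x1 | x2 = b
b = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (b - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Monte Carlo check: sample the joint, keep draws with x2 close to b.
samples = rng.multivariate_normal(mu, Sigma, size=2_000_000)
near = samples[np.abs(samples[:, 1] - b) < 0.01, 0]

print(f"analytic:  mean={mu_cond:.3f}, var={var_cond:.3f}")
print(f"empirical: mean={near.mean():.3f}, var={near.var():.3f}")
```

The empirical moments should match the analytic ones to within Monte Carlo error, and the conditional variance is the same whatever value \(b\) takes, as the derivation sketch predicts.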

2. Gaussian Process Definition

A Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. It is fully specified by a mean function and a covariance (kernel) function:

\[ f \sim \mathcal{GP}(m(\mathbf{x}),\; k(\mathbf{x}, \mathbf{x}')) \]

For any finite set of inputs \(\mathbf{X} = \{x_1,\ldots,x_n\}\), the function values are jointly Gaussian: \(\mathbf{f} \sim \mathcal{N}(\mathbf{m},\mathbf{K})\) where \(K_{ij} = k(x_i, x_j)\).

2.1 Kernel Functions

RBF (Squared Exponential) Kernel
\[ k_{\rm RBF}(x, x') = \sigma^2 \exp\!\left(-\frac{\|x-x'\|^2}{2\ell^2}\right) \]

Infinitely differentiable, so it models very smooth functions. \(\ell\) controls the length scale (how quickly the correlation decays); \(\sigma^2\) controls the output variance.

MatΓ©rn-3/2 Kernel
\[ k_{3/2}(r) = \sigma^2\!\left(1 + \frac{\sqrt{3}\,r}{\ell}\right)\exp\!\left(-\frac{\sqrt{3}\,r}{\ell}\right), \quad r = \|x - x'\| \]

Once (mean-square) differentiable, so it is rougher than the RBF and often more realistic for physical processes and spatial data.

Periodic Kernel
\[ k_{\rm per}(x, x') = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\ell^2}\right) \]

Models periodic functions with period \(p\); useful for time series with daily or seasonal patterns.
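The three kernels above take only a few lines of NumPy each. The sketch below (helper names are my own) also checks the defining property of a valid kernel, namely that every Gram matrix is symmetric positive semi-definite:

```python
import numpy as np

def rbf(x, xp, sigma=1.0, ell=1.0):
    """Squared-exponential kernel on 1-D inputs (column-vs-row broadcast)."""
    d2 = (x[:, None] - xp[None, :]) ** 2
    return sigma**2 * np.exp(-d2 / (2 * ell**2))

def matern32(x, xp, sigma=1.0, ell=1.0):
    """Matern-3/2 kernel: once-differentiable sample paths."""
    r = np.abs(x[:, None] - xp[None, :])
    a = np.sqrt(3) * r / ell
    return sigma**2 * (1 + a) * np.exp(-a)

def periodic(x, xp, sigma=1.0, ell=1.0, p=1.0):
    """Periodic kernel with period p."""
    r = np.abs(x[:, None] - xp[None, :])
    return sigma**2 * np.exp(-2 * np.sin(np.pi * r / p) ** 2 / ell**2)

x = np.linspace(0, 5, 50)
for name, k in [("RBF", rbf), ("Matern-3/2", matern32), ("Periodic", periodic)]:
    K = k(x, x)
    # A valid kernel must give a symmetric PSD Gram matrix, so the
    # smallest eigenvalue should be nonnegative up to rounding error.
    eigmin = np.linalg.eigvalsh(K).min()
    print(f"{name:10s} min eigenvalue = {eigmin:.2e}")
```

Any nonnegative combination of valid kernels (sums, products) is again valid, which is how structured priors (e.g. periodic-plus-trend) are built in practice.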

3. GP Regression: Full Posterior Derivation

Given training data \(\mathcal{D} = \{(\mathbf{X}, \mathbf{y})\}\) with noise model \(y_i = f(x_i) + \varepsilon_i\), \(\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)\), we want the posterior over function values \(\mathbf{f}_* = f(\mathbf{X}_*)\) at test points \(\mathbf{X}_*\).

The joint distribution of the noisy training observations \(\mathbf{y}\) and the test function values \(\mathbf{f}_*\) under the prior is:

\[ \begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{pmatrix} K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I & K(\mathbf{X},\mathbf{X}_*) \\ K(\mathbf{X}_*,\mathbf{X}) & K(\mathbf{X}_*,\mathbf{X}_*) \end{pmatrix}\right) \]

Applying the conditional Gaussian formulas with \(\Sigma_{22} = K(\mathbf{X},\mathbf{X})+\sigma_n^2 I\), \(\Sigma_{12} = K(\mathbf{X}_*,\mathbf{X})\):

\[ \mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{y} \;\sim\; \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*) \]

\[ \boldsymbol{\mu}_* = K(\mathbf{X}_*,\mathbf{X})\underbrace{\left[K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I\right]^{-1} \mathbf{y}}_{\boldsymbol{\alpha}} \]

\[ \boldsymbol{\Sigma}_* = K(\mathbf{X}_*,\mathbf{X}_*) - K(\mathbf{X}_*,\mathbf{X})\left[K(\mathbf{X},\mathbf{X}) + \sigma_n^2 I\right]^{-1}K(\mathbf{X},\mathbf{X}_*) \]

The vector \(\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}\) is computed once. The posterior mean \(\mu_*(x_*) = \mathbf{k}(x_*,\mathbf{X})\boldsymbol{\alpha}\) is a kernel-weighted sum over training points: a non-parametric function that passes close to the data.

Computational note

Naive inversion is \(O(n^3)\). In practice use the Cholesky factorisation \(K + \sigma_n^2 I = LL^\top\): solve \(L\mathbf{z} = \mathbf{y}\) and then \(L^\top\boldsymbol{\alpha} = \mathbf{z}\), each a triangular solve costing \(O(n^2)\). The factorisation itself is still \(O(n^3)\) but is numerically stable. Prediction then costs \(O(n)\) per test point for the mean and \(O(n^2)\) for the variance. Memory is \(O(n^2)\).
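A minimal Cholesky-based implementation of the posterior equations might look like the following sketch; the RBF kernel, the data-generating function, and the hyperparameter values are illustrative assumptions, and `gp_posterior` is a name of my own choosing:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(a, b, sigma=1.0, ell=0.7):
    """Squared-exponential kernel on 1-D inputs."""
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def gp_posterior(X, y, Xs, sigma_n=0.1):
    """Posterior mean and covariance at test inputs Xs via Cholesky."""
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    Ks = rbf(X, Xs)                        # K(X, X*)
    Kss = rbf(Xs, Xs)                      # K(X*, X*)
    L = cholesky(K, lower=True)            # K + sigma_n^2 I = L L^T
    # Two triangular solves give alpha = (K + sigma_n^2 I)^{-1} y.
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    mu = Ks.T @ alpha                      # posterior mean
    V = solve_triangular(L, Ks, lower=True)
    cov = Kss - V.T @ V                    # posterior covariance
    return mu, cov

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, 9)
y = np.sin(X) + 0.1 * rng.standard_normal(9)
Xs = np.linspace(0, 5, 100)
mu, cov = gp_posterior(X, y, Xs)
print("max |posterior mean - sin|:", np.abs(mu - np.sin(Xs)).max())
```

Note that `cov` is obtained without ever forming the inverse explicitly: the intermediate `V` satisfies \(V^\top V = K_*^\top (K+\sigma_n^2 I)^{-1} K_*\).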

3.1 Marginal Likelihood for Hyperparameter Optimisation

The log marginal likelihood (integrating out \(\mathbf{f}\)) has a closed form:

\[ \log P(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top(K+\sigma_n^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log|K+\sigma_n^2 I| - \frac{n}{2}\log 2\pi \]

The first term rewards data fit; the second penalises model complexity (the log-determinant grows with model flexibility). Maximising this w.r.t. the kernel hyperparameters \(\boldsymbol{\theta} = (\sigma^2, \ell, \sigma_n^2)\) by gradient ascent (in practice, minimising the negative log marginal likelihood) automatically balances fit and complexity; no cross-validation is needed.
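As a sketch of hyperparameter selection, the closed-form expression can simply be evaluated on a grid over \(\ell\) (a stand-in for the gradient-based optimisation described above; the data and constants here are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(a, b, sigma=1.0, ell=1.0):
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def log_marginal_likelihood(X, y, ell, sigma=1.0, sigma_n=0.1):
    n = len(X)
    K = rbf(X, X, sigma, ell) + sigma_n**2 * np.eye(n)
    c, low = cho_factor(K, lower=True)
    alpha = cho_solve((c, low), y)          # (K + sigma_n^2 I)^{-1} y
    # log|K + sigma_n^2 I| = 2 * sum(log diag(L)) from the Cholesky factor
    logdet = 2 * np.sum(np.log(np.diag(c)))
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)

# Grid search over the length scale in place of gradient-based optimisation.
ells = np.logspace(-1, 1, 25)
lmls = np.array([log_marginal_likelihood(X, y, ell) for ell in ells])
best = ells[int(np.argmax(lmls))]
print(f"best length scale ~ {best:.2f}")
```

Both extremes should score poorly: a tiny \(\ell\) overfits and is punished by the complexity term, while a huge \(\ell\) underfits and is punished by the data-fit term, so the maximiser lies in between.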

4. GP Prior and Posterior Diagram

[Figure: left panel shows the GP prior \(f \sim \mathcal{GP}(0, k)\) with wide uncertainty everywhere; adding data \(\mathcal{D}\) gives the right panel, the GP posterior \(f \mid \mathcal{D}\), narrow near the data and wide far away. Legend: training data, posterior mean, 95% CI.]

5. Python Simulation: GP Regression from Scratch

We implement GP regression using only NumPy and SciPy. The simulation shows: (1) prior samples from the RBF kernel, (2) the posterior mean and 95% confidence band after observing 9 noisy data points, (3) posterior samples, and (4) a comparison of RBF, MatΓ©rn-3/2, and periodic kernels, each with their log marginal likelihood score.
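The original 135-line script is not reproduced here; a condensed sketch of the same steps under the stated setup (RBF kernel, 9 noisy observations, a Matern-3/2 comparison), with helper names of my own choosing, might look like:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(42)

def rbf(a, b, sigma=1.0, ell=1.0):
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell**2))

def matern32(a, b, sigma=1.0, ell=1.0):
    r = np.abs(a[:, None] - b[None, :])
    s = np.sqrt(3) * r / ell
    return sigma**2 * (1 + s) * np.exp(-s)

Xs = np.linspace(0, 5, 100)                  # test grid
jitter = 1e-6 * np.eye(len(Xs))              # numerical stabiliser

# (1) Prior samples: f ~ N(0, K(Xs, Xs)), drawn via a Cholesky factor.
L = cholesky(rbf(Xs, Xs) + jitter, lower=True)
prior_samples = L @ rng.standard_normal((100, 3))

# (2) Posterior after 9 noisy observations of sin(x).
X = np.sort(rng.uniform(0, 5, 9))
y = np.sin(X) + 0.1 * rng.standard_normal(9)
s_n = 0.1

def posterior_and_lml(kern):
    K = kern(X, X) + s_n**2 * np.eye(len(X))
    Lk = cholesky(K, lower=True)
    alpha = solve_triangular(Lk.T, solve_triangular(Lk, y, lower=True))
    Ks = kern(X, Xs)
    mu = Ks.T @ alpha
    V = solve_triangular(Lk, Ks, lower=True)
    cov = kern(Xs, Xs) - V.T @ V
    lml = (-0.5 * y @ alpha - np.log(np.diag(Lk)).sum()
           - 0.5 * len(X) * np.log(2 * np.pi))
    return mu, cov, lml

mu, cov, lml_rbf = posterior_and_lml(rbf)
std = np.sqrt(np.clip(np.diag(cov), 0, None))
band = (mu - 1.96 * std, mu + 1.96 * std)    # 95% band

# (3) Posterior samples from N(mu, cov).
Lp = cholesky(cov + jitter, lower=True)
post_samples = mu[:, None] + Lp @ rng.standard_normal((100, 3))

# (4) Kernel comparison via the log marginal likelihood.
_, _, lml_m32 = posterior_and_lml(matern32)
print(f"log ML  RBF: {lml_rbf:.2f}   Matern-3/2: {lml_m32:.2f}")
```

Plotting `Xs` against `mu`, the band, and the three posterior samples reproduces the posterior panel of the diagram above; the small `jitter` term is a standard trick to keep the Cholesky factorisation stable on near-singular covariance matrices.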
