Part IV · Chapter 11

Dimensionality Reduction

High-dimensional data often lies on or near lower-dimensional manifolds. We derive PCA rigorously from variance maximisation using Lagrange multipliers, extend it to the kernel setting, and explore the nonlinear methods t-SNE and UMAP for visualising complex structure.

1. PCA – Full Derivation from Variance Maximisation

Given centred data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) (rows are data points), the sample covariance matrix is:

\[ \mathbf{S} = \frac{1}{N-1}\mathbf{X}^\top \mathbf{X} \]

We seek a unit vector \(\mathbf{w} \in \mathbb{R}^D\) that maximises the variance of the projected data \(z = \mathbf{X}\mathbf{w}\):

\[ \mathrm{Var}[z] = \frac{1}{N-1}\sum_i (z_i - \bar{z})^2 = \mathbf{w}^\top \mathbf{S}\,\mathbf{w} \]

The constrained optimisation problem is:

\[ \mathbf{w}^* = \arg\max_{\mathbf{w}} \;\mathbf{w}^\top \mathbf{S}\,\mathbf{w} \quad \text{subject to} \quad \|\mathbf{w}\|^2 = 1 \]

Lagrangian formulation

Introduce Lagrange multiplier \(\lambda\):

\[ \mathcal{L}(\mathbf{w},\lambda) = \mathbf{w}^\top \mathbf{S}\,\mathbf{w} - \lambda(\mathbf{w}^\top\mathbf{w} - 1) \]

Setting the gradient to zero:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0} \]
\[ \boxed{\mathbf{S}\mathbf{w} = \lambda\mathbf{w}} \]

This is an eigenvalue equation. The optimal direction \(\mathbf{w}^*\) is an eigenvector of the covariance matrix. The maximised variance is \(\mathbf{w}^{\top}\mathbf{S}\mathbf{w} = \lambda\), so we choose the eigenvector with the largest eigenvalue.

Principal Components

Sort eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0\) and corresponding eigenvectors. The \(d\)-dimensional projection using the top \(d\) principal components captures a fraction \(\sum_{j=1}^d \lambda_j / \sum_{j=1}^D \lambda_j\) of total variance. This is the explained variance ratio.
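The derivation above translates directly into a few lines of NumPy: centre the data, form the covariance matrix, eigendecompose it, and project onto the top-\(d\) eigenvectors. The following sketch (function and variable names are our own, not from the chapter's script) also reports the explained variance ratio:

```python
import numpy as np

def pca(X, d):
    """Project data X (N x D) onto its top-d principal components."""
    Xc = X - X.mean(axis=0)                  # centre the data
    S = Xc.T @ Xc / (len(X) - 1)             # sample covariance (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :d]                       # top-d eigenvectors as columns
    evr = eigvals[:d].sum() / eigvals.sum()  # explained variance ratio
    return Xc @ W, evr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated data
Z, evr = pca(X, 2)
```

Since the eigenvalues are sorted, the variance along the first projected coordinate is guaranteed to be at least that along the second.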

2. PCA Projection Diagram


PCA projects 3D data onto the 2D plane spanned by the two principal components (eigenvectors with largest eigenvalues).

3. Kernel PCA

Standard PCA is linear. Kernel PCA applies the kernel trick to perform PCA in a high-dimensional feature space \(\phi(\mathbf{x})\) defined implicitly by a kernel \(k(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})^\top\phi(\mathbf{x}')\).

The covariance in feature space is \(\mathbf{C} = \frac{1}{N}\sum_i \phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^\top\). One can show that the eigenvectors of \(\mathbf{C}\) lie in the span of the \(\phi(\mathbf{x}_i)\), so the eigenvalue equation reduces to an \(N \times N\) system involving the kernel matrix \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\):

\[ N\lambda\,\boldsymbol{\alpha} = \tilde{\mathbf{K}}\,\boldsymbol{\alpha} \]

where \(\tilde{\mathbf{K}}\) is the centred kernel matrix. This enables nonlinear dimensionality reduction using kernels such as RBF (\(k = e^{-\gamma\|\mathbf{x}-\mathbf{x}'\|^2}\)) without explicitly computing \(\phi\).
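A minimal sketch of this procedure with an RBF kernel, assuming the standard double-centring formula \(\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N\mathbf{K}\mathbf{1}_N\) (with \(\mathbf{1}_N\) the \(N \times N\) matrix of entries \(1/N\)); the function name and normalisation details are ours:

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Kernel PCA with an RBF kernel: solve the N x N eigenproblem on K~."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    N = len(X)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # normalise coefficient vectors so feature-space eigenvectors are unit
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    return Kc @ alphas                           # projections (N x d)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Z = kernel_pca(X, 2, gamma=0.5)
```

Because \(\tilde{\mathbf{K}}\) is centred, the resulting projections sum to zero in each component, just as centred linear PCA projections do.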

4. t-SNE – Student-t Kernel & KL Minimisation

t-SNE (van der Maaten & Hinton 2008) models high-dimensional affinities with a Gaussian and low-dimensional affinities with a Student-t distribution (heavier tails), then minimises the KL divergence between them.

High-dimensional affinities (Gaussian)

\[ p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \]

Bandwidth \(\sigma_i\) is set via binary search to achieve a specified perplexity.
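The binary search exploits the fact that perplexity \(2^{H(p_{\cdot|i})}\) increases monotonically with \(\sigma_i\): a larger bandwidth flattens the conditional distribution and raises its entropy. A sketch for a single point (the function name, bounds, and tolerances are illustrative choices, not prescribed by the algorithm):

```python
import numpy as np

def sigma_for_perplexity(d2_i, target_perp, tol=1e-4, max_iter=64):
    """Binary-search sigma_i so that p_{j|i} (Gaussian over squared
    distances d2_i to the other points) has the requested perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-d2_i / (2 * sigma**2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
        perp = 2.0 ** H                       # perplexity = 2^H
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma                        # distribution too flat: shrink
        else:
            lo = sigma                        # too peaked: grow
    return sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
d2 = np.sum((X - X[0])**2, axis=1)[1:]        # squared dists from point 0
sigma0 = sigma_for_perplexity(d2, 10.0)
```

Perplexity is interpretable as a smooth effective number of neighbours, which is why it is the user-facing knob rather than \(\sigma_i\) itself.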

Low-dimensional affinities (Student-t with 1 d.f.)

\[ q_{ij} = \frac{\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}}{\displaystyle\sum_{k \neq l}\left(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\right)^{-1}} \]

The Student-t kernel alleviates the crowding problem: in two dimensions there is far less room at moderate distances than in high dimensions, so faithfully placed near neighbours force moderately distant points to pile up. The heavier tails let moderate high-dimensional distances map to larger low-dimensional distances without incurring a large penalty.

Objective: KL divergence minimisation

\[ C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij}\,\log\frac{p_{ij}}{q_{ij}} \]
\[ \frac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})\,(\mathbf{y}_i - \mathbf{y}_j)\,\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1} \]

UMAP Overview

UMAP (McInnes et al. 2018) is grounded in Riemannian geometry and fuzzy simplicial sets. It models the data manifold as a fuzzy topological structure, constructs a weighted nearest-neighbour graph, and finds a low-dimensional representation that minimises the cross-entropy between the high- and low-dimensional fuzzy sets. Compared with t-SNE it is typically faster, tends to preserve more global structure, and scales to millions of points.

5. Python: PCA vs t-SNE from Scratch

We implement PCA (eigendecomposition of the covariance matrix) and a simplified t-SNE entirely in NumPy, apply both to a 3D spiral and a synthetic 6D four-class dataset, and compare the projections.
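A condensed sketch of such a script is given below, shown here for the 3D spiral only. It is a simplification of the full implementation: in particular it uses a single crude bandwidth estimate instead of the per-point perplexity search, omits early exaggeration and momentum, and its dataset sizes, learning rate, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- 3D spiral dataset -------------------------------------------------
t = np.linspace(0, 4 * np.pi, 300)
spiral = np.column_stack([np.cos(t), np.sin(t), t / (4 * np.pi)])
spiral += rng.normal(scale=0.05, size=spiral.shape)

# --- PCA via eigendecomposition of the covariance ----------------------
def pca(X, d=2):
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    return Xc @ vecs[:, order[:d]]

# --- simplified t-SNE --------------------------------------------------
def tsne(X, d=2, perplexity=30, n_iter=200, lr=50.0):
    N = len(X)
    D2 = np.sum((X[:, None] - X[None]) ** 2, axis=-1)     # pairwise sq dists
    # single crude bandwidth (full t-SNE binary-searches each sigma_i)
    sigma2 = np.sort(D2, axis=1)[:, 1:perplexity + 1].mean()
    P = np.exp(-D2 / (2 * sigma2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)                     # p_{j|i}
    P = (P + P.T) / (2 * N)                               # symmetrise: p_ij
    Y = rng.normal(scale=1e-2, size=(N, d))               # tiny random init
    for _ in range(n_iter):
        E2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)
        inv = 1.0 / (1.0 + E2)                            # Student-t kernel
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()                               # q_ij
        PQ = (P - Q) * inv
        # grad_i = 4 sum_j PQ_ij (y_i - y_j), vectorised over all i
        grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
        Y -= lr * grad                                    # gradient descent
    return Y

Z_pca = pca(spiral)
Z_tsne = tsne(spiral)
```

PCA finds the best linear view of the spiral, while even this stripped-down t-SNE separates points along the curve; plotting `Z_pca` and `Z_tsne` side by side makes the comparison concrete.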
