Part IV · Chapter 11

Dimensionality Reduction

High-dimensional data often lies on or near lower-dimensional manifolds. We derive PCA rigorously from variance maximisation using Lagrange multipliers, extend it to the kernel setting, and explore the nonlinear methods t-SNE and UMAP for visualising complex structure.

1. PCA – Full Derivation from Variance Maximisation

Given centred data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) (rows are data points), the sample covariance matrix is:

\[ \mathbf{S} = \frac{1}{N-1}\mathbf{X}^\top \mathbf{X} \]

We seek a unit vector \(\mathbf{w} \in \mathbb{R}^D\) that maximises the variance of the projected data \(z = \mathbf{X}\mathbf{w}\):

\[ \mathrm{Var}[z] = \frac{1}{N-1}\sum_i (z_i - \bar{z})^2 = \mathbf{w}^\top \mathbf{S}\,\mathbf{w} \]

The constrained optimisation problem is:

\[ \mathbf{w}^* = \arg\max_{\mathbf{w}} \;\mathbf{w}^\top \mathbf{S}\,\mathbf{w} \quad \text{subject to} \quad \|\mathbf{w}\|^2 = 1 \]

Lagrangian formulation

Introduce Lagrange multiplier \(\lambda\):

\[ \mathcal{L}(\mathbf{w},\lambda) = \mathbf{w}^\top \mathbf{S}\,\mathbf{w} - \lambda(\mathbf{w}^\top\mathbf{w} - 1) \]

Setting the gradient to zero:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0} \]
\[ \boxed{\mathbf{S}\mathbf{w} = \lambda\mathbf{w}} \]

This is an eigenvalue equation. The optimal direction \(\mathbf{w}^*\) is an eigenvector of the covariance matrix. The maximised variance is \(\mathbf{w}^{\top}\mathbf{S}\mathbf{w} = \lambda\), so we choose the eigenvector with the largest eigenvalue.

Principal Components

Sort eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D \geq 0\) and corresponding eigenvectors. The \(d\)-dimensional projection using the top \(d\) principal components captures a fraction \(\sum_{j=1}^d \lambda_j / \sum_{j=1}^D \lambda_j\) of total variance. This is the explained variance ratio.
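The derivation above translates directly into a few lines of NumPy: centre the data, form the covariance matrix, eigendecompose it, and project onto the top-\(d\) eigenvectors. The following sketch (function and variable names are our own, not from the chapter's script) also reports the explained variance ratio:

```python
import numpy as np

def pca(X, d):
    """Project data X (N x D) onto its top-d principal components."""
    Xc = X - X.mean(axis=0)                  # centre the data
    S = Xc.T @ Xc / (len(X) - 1)             # sample covariance (D x D)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :d]                       # top-d eigenvectors as columns
    evr = eigvals[:d].sum() / eigvals.sum()  # explained variance ratio
    return Xc @ W, evr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated data
Z, evr = pca(X, 2)
```

Since the eigenvalues are sorted, the variance along the first projected coordinate is guaranteed to be at least that along the second.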

2. PCA Projection Diagram


PCA projects 3D data onto the 2D plane spanned by the two principal components (eigenvectors with largest eigenvalues).

3. Kernel PCA

Standard PCA is linear. Kernel PCA applies the kernel trick to perform PCA in a high-dimensional feature space \(\phi(\mathbf{x})\) defined implicitly by a kernel \(k(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})^\top\phi(\mathbf{x}')\).

The covariance in feature space is \(\mathbf{C} = \frac{1}{N}\sum_i \phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^\top\). One can show that the eigenvectors of \(\mathbf{C}\) lie in the span of the \(\phi(\mathbf{x}_i)\), so the eigenvalue equation reduces to an \(N \times N\) system involving the kernel matrix \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\):

\[ N\lambda\,\boldsymbol{\alpha} = \tilde{\mathbf{K}}\,\boldsymbol{\alpha} \]

where \(\tilde{\mathbf{K}}\) is the centred kernel matrix. This enables nonlinear dimensionality reduction using kernels such as RBF (\(k = e^{-\gamma\|\mathbf{x}-\mathbf{x}'\|^2}\)) without explicitly computing \(\phi\).
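A minimal sketch of this procedure with an RBF kernel, assuming the standard double-centring formula \(\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N\mathbf{K} - \mathbf{K}\mathbf{1}_N + \mathbf{1}_N\mathbf{K}\mathbf{1}_N\) (with \(\mathbf{1}_N\) the \(N \times N\) matrix of entries \(1/N\)); the function name and normalisation details are ours:

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Kernel PCA with an RBF kernel: solve the N x N eigenproblem on K~."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    N = len(X)
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # normalise coefficient vectors so feature-space eigenvectors are unit
    alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    return Kc @ alphas                           # projections (N x d)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
Z = kernel_pca(X, 2, gamma=0.5)
```

Because \(\tilde{\mathbf{K}}\) is centred, the resulting projections sum to zero in each component, just as centred linear PCA projections do.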

4. t-SNE – Student-t Kernel & KL Minimisation

t-SNE (van der Maaten & Hinton 2008) models high-dimensional affinities with a Gaussian and low-dimensional affinities with a Student-t distribution (heavier tails), then minimises the KL divergence between them.

High-dimensional affinities (Gaussian)

\[ p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \]

Bandwidth \(\sigma_i\) is set via binary search to achieve a specified perplexity.
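The binary search exploits the fact that perplexity \(2^{H(p_{\cdot|i})}\) increases monotonically with \(\sigma_i\): a larger bandwidth flattens the conditional distribution and raises its entropy. A sketch for a single point (the function name, bounds, and tolerances are illustrative choices, not prescribed by the algorithm):

```python
import numpy as np

def sigma_for_perplexity(d2_i, target_perp, tol=1e-4, max_iter=64):
    """Binary-search sigma_i so that p_{j|i} (Gaussian over squared
    distances d2_i to the other points) has the requested perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-d2_i / (2 * sigma**2))
        p /= p.sum()
        H = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy in bits
        perp = 2.0 ** H                       # perplexity = 2^H
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma                        # distribution too flat: shrink
        else:
            lo = sigma                        # too peaked: grow
    return sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
d2 = np.sum((X - X[0])**2, axis=1)[1:]        # squared dists from point 0
sigma0 = sigma_for_perplexity(d2, 10.0)
```

Perplexity is interpretable as a smooth effective number of neighbours, which is why it is the user-facing knob rather than \(\sigma_i\) itself.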

Low-dimensional affinities (Student-t with 1 d.f.)

\[ q_{ij} = \frac{\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}}{\displaystyle\sum_{k \neq l}\left(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\right)^{-1}} \]

The Student-t kernel alleviates the crowding problem: in two dimensions there is far less room at moderate distances than in high dimensions, so faithfully placed near neighbours force moderately distant points to pile up. The heavier tails let moderate high-dimensional distances map to larger low-dimensional distances without incurring a large penalty.

Objective: KL divergence minimisation

\[ C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij}\,\log\frac{p_{ij}}{q_{ij}} \]
\[ \frac{\partial C}{\partial \mathbf{y}_i} = 4\sum_j (p_{ij} - q_{ij})\,(\mathbf{y}_i - \mathbf{y}_j)\,\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1} \]

UMAP Overview

UMAP (McInnes et al. 2018) is grounded in Riemannian geometry and fuzzy simplicial sets. It models the data manifold as a fuzzy topological structure, constructs a weighted nearest-neighbour graph, and finds a low-dimensional representation that minimises the cross-entropy between the high- and low-dimensional fuzzy sets. Compared with t-SNE it is typically faster, tends to preserve more global structure, and scales to millions of points.

5. Python: PCA vs t-SNE from Scratch

We implement PCA (eigendecomposition of the covariance matrix) and a simplified t-SNE entirely in NumPy, apply both to a 3D spiral and a synthetic 6D four-class dataset, and compare the projections.
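A condensed sketch of such a script is given below, shown here for the 3D spiral only. It is a simplification of the full implementation: in particular it uses a single crude bandwidth estimate instead of the per-point perplexity search, omits early exaggeration and momentum, and its dataset sizes, learning rate, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- 3D spiral dataset -------------------------------------------------
t = np.linspace(0, 4 * np.pi, 300)
spiral = np.column_stack([np.cos(t), np.sin(t), t / (4 * np.pi)])
spiral += rng.normal(scale=0.05, size=spiral.shape)

# --- PCA via eigendecomposition of the covariance ----------------------
def pca(X, d=2):
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    return Xc @ vecs[:, order[:d]]

# --- simplified t-SNE --------------------------------------------------
def tsne(X, d=2, perplexity=30, n_iter=200, lr=50.0):
    N = len(X)
    D2 = np.sum((X[:, None] - X[None]) ** 2, axis=-1)     # pairwise sq dists
    # single crude bandwidth (full t-SNE binary-searches each sigma_i)
    sigma2 = np.sort(D2, axis=1)[:, 1:perplexity + 1].mean()
    P = np.exp(-D2 / (2 * sigma2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)                     # p_{j|i}
    P = (P + P.T) / (2 * N)                               # symmetrise: p_ij
    Y = rng.normal(scale=1e-2, size=(N, d))               # tiny random init
    for _ in range(n_iter):
        E2 = np.sum((Y[:, None] - Y[None]) ** 2, axis=-1)
        inv = 1.0 / (1.0 + E2)                            # Student-t kernel
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()                               # q_ij
        PQ = (P - Q) * inv
        # grad_i = 4 sum_j PQ_ij (y_i - y_j), vectorised over all i
        grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
        Y -= lr * grad                                    # gradient descent
    return Y

Z_pca = pca(spiral)
Z_tsne = tsne(spiral)
```

PCA finds the best linear view of the spiral, while even this stripped-down t-SNE separates points along the curve; plotting `Z_pca` and `Z_tsne` side by side makes the comparison concrete.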
