Machine Learning

Part IV: Unsupervised Learning

Unsupervised learning uncovers structure in data without labels. This part covers the three pillars of the field: clustering (finding groups), dimensionality reduction (finding compact representations), and generative modelling (learning the data distribution itself). Every algorithm is derived from first principles with full mathematical rigour.

What you will learn

Derive the K-means update rule from the objective function
Prove that Lloyd's algorithm monotonically decreases the K-means objective, and hence converges
Implement EM for Gaussian Mixture Models from scratch
Derive PCA from variance maximisation using Lagrange multipliers
Understand why principal components are eigenvectors of the covariance matrix
Explain the t-SNE crowding problem and the Student-t kernel solution
Derive the VAE ELBO from the log-likelihood, and apply the reparameterisation trick
Compute the KL divergence between two Gaussians in closed form
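As a taste of the first two objectives, here is a minimal sketch of Lloyd's algorithm (the function and variable names are illustrative, not from the text). It alternates the assignment step and the centroid-update step, and records the K-means objective after each iteration so you can verify empirically that it never increases:

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def objective(points, centroids):
    """K-means objective: total squared distance to nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def lloyd(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))  # initialise at k data points
    objectives = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid moves to its cluster's mean
        # (an empty cluster keeps its old centroid).
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = tuple(sum(xs) / len(cluster)
                                     for xs in zip(*cluster))
        objectives.append(objective(points, centroids))
    return centroids, objectives
```

Both steps can only lower (or preserve) the objective, which is the heart of the monotone-convergence proof developed in this part: checking that the recorded `objectives` sequence is non-increasing is a quick empirical sanity check of that argument.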

Prerequisites

Parts I–III. You should be comfortable with matrix eigendecomposition (Part I), maximum likelihood estimation and Bayes' theorem (Part I), and the concept of gradient descent (Part I). Familiarity with the Gaussian distribution is essential for Chapters 10 and 12.