Part IV · Chapter 10

Clustering: K-Means & GMM

Clustering partitions data into groups without labels. We derive K-means from its objective function, prove convergence of Lloyd's algorithm, build Gaussian Mixture Models as a probabilistic extension, and derive the full Expectation-Maximisation algorithm from first principles.

1. K-Means: Objective and Lloyd's Algorithm

Given \(N\) points \(\{\mathbf{x}_i\}_{i=1}^N\) and \(K\) clusters, K-means minimises the within-cluster sum of squared distances:

\[ J = \sum_{k=1}^K \sum_{i:\, r_{ik}=1} \left\|\mathbf{x}_i - \boldsymbol{\mu}_k\right\|^2 \]

where \(r_{ik} \in \{0,1\}\) is the assignment indicator (\(r_{ik}=1\) iff point \(i\) belongs to cluster \(k\)), and \(\boldsymbol{\mu}_k\) is the centroid of cluster \(k\).

E-step: Assignment (fixing \(\boldsymbol{\mu}_k\))

Minimise \(J\) over \(\{r_{ik}\}\) with centroids fixed. For each point, assign to the nearest centroid:

\[ r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases} \]

M-step: Update centroids (fixing \(r_{ik}\))

Minimise \(J\) over \(\{\boldsymbol{\mu}_k\}\) with assignments fixed. Taking \(\partial J/\partial \boldsymbol{\mu}_k = 0\):

\[ \frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2\sum_{i:\, r_{ik}=1}(\mathbf{x}_i - \boldsymbol{\mu}_k) = 0 \]
\[ \Rightarrow\quad \boldsymbol{\mu}_k = \frac{\sum_i r_{ik}\,\mathbf{x}_i}{\sum_i r_{ik}} \]

The new centroid is simply the mean of all points assigned to cluster \(k\).
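The two alternating steps translate directly into code. A minimal NumPy sketch of Lloyd's algorithm (the function name and the random-point initialisation are illustrative choices, not a prescribed scheme):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as K distinct data points (one common choice)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: squared distances to every centroid, then nearest assignment
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        r = d2.argmin(axis=1)
        # M-step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # assignments stable => J cannot decrease further
            break
        mu = new_mu
    J = d2[np.arange(len(X)), r].sum()  # within-cluster sum of squares
    return mu, r, J
```

Guarding against empty clusters (keeping the old centroid) is one of several conventions; restarting from a random point is another.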

Convergence proof sketch

Each E-step can only decrease or preserve \(J\) (reassigning to nearer centroid). Each M-step can only decrease or preserve \(J\) (mean minimises sum of squared distances). Since \(J \geq 0\) and there are finitely many assignments (\(K^N\)), the algorithm must converge. However, it may converge to a local minimum, not the global minimum.

2. Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a probabilistic model where data is assumed to come from one of \(K\) Gaussian components:

\[ p(\mathbf{x}) = \sum_{k=1}^K \pi_k\,\mathcal{N}\!\left(\mathbf{x}\,\big|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \]

where \(\pi_k \geq 0\), \(\sum_k \pi_k = 1\) are the mixing coefficients, \(\boldsymbol{\mu}_k\) are the component means, and \(\boldsymbol{\Sigma}_k\) are the covariance matrices.

Introduce latent variable \(\mathbf{z}_i \in \{0,1\}^K\) with \(p(z_{ik}=1) = \pi_k\). Then \(p(\mathbf{x}_i | z_{ik}=1) = \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\). The marginal \(p(\mathbf{x}) = \sum_k p(z_{ik}=1)\,p(\mathbf{x}|z_{ik}=1)\) recovers the GMM.
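The latent-variable view gives a direct recipe for sampling: draw the component label \(z\) from the categorical distribution \(\pi\), then draw \(\mathbf{x}\) from the selected Gaussian. A short sketch (the particular \(\pi_k\), \(\boldsymbol{\mu}_k\), \(\boldsymbol{\Sigma}_k\) values are illustrative):

```python
import numpy as np

def sample_gmm(n, pi, mus, Sigmas, seed=0):
    """Ancestral sampling: z ~ Categorical(pi), then x | z=k ~ N(mu_k, Sigma_k)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)  # latent component labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigmas = np.stack([np.eye(2) * 0.3] * 3)
X, z = sample_gmm(2000, pi, mus, Sigmas)
```

The empirical fraction of samples with \(z = k\) converges to \(\pi_k\), and each conditional slice is Gaussian, matching the marginal density above.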

3. EM Algorithm: Full Derivation

We want to maximise the log-likelihood \(\log p(\mathbf{X};\boldsymbol{\theta})\). With latent variables \(\mathbf{Z}\), direct maximisation is intractable because of the sum inside the log. EM maximises a lower bound instead.

ELBO decomposition

Introduce any distribution \(q(\mathbf{Z})\) over the latent variables. The log-likelihood then decomposes exactly as:

\[ \log p(\mathbf{X}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}\bigl(q(\mathbf{Z}) \,\|\, p(\mathbf{Z}|\mathbf{X};\boldsymbol{\theta})\bigr) \]

Since \(\mathrm{KL} \geq 0\), we have \(\log p(\mathbf{X}) \geq \mathcal{L}(q,\boldsymbol{\theta})\) (the ELBO). EM alternates between tightening the bound (E-step: set \(q = p(\mathbf{Z}|\mathbf{X})\)) and maximising it (M-step: maximise over \(\boldsymbol{\theta}\)).

E-step: Compute responsibilities

By Bayes' theorem:

\[ \gamma_{nk} := p(z_{nk}=1 | \mathbf{x}_n;\boldsymbol{\theta}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\displaystyle\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)} \]

\(\gamma_{nk}\) is the soft assignment: the probability that data point \(n\) came from component \(k\).
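In code, the responsibilities are best computed in log-space, since the Gaussian densities underflow easily in high dimensions. A NumPy sketch using the log-sum-exp trick (helper names are my own):

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log-density of N(mu, Sigma) via a Cholesky factorisation."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    maha = (np.linalg.solve(L, diff.T) ** 2).sum(axis=0)  # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2 * np.log(np.diag(L)).sum()
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def e_step(X, pi, mus, Sigmas):
    """Responsibilities gamma_{nk} by Bayes' theorem, normalised in log-space."""
    logp = np.stack([np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k])
                     for k in range(len(pi))], axis=1)   # shape (N, K)
    logp -= logp.max(axis=1, keepdims=True)              # log-sum-exp stabilisation
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Subtracting the row-wise maximum before exponentiating leaves the normalised ratio unchanged but keeps every exponent bounded above by zero.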

M-step: Derive \(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k\) updates

Maximise the expected complete-data log-likelihood \(Q = \sum_n \sum_k \gamma_{nk} \log[\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)]\). Setting \(\partial Q/\partial \boldsymbol{\mu}_k = 0\):

\[ -\sum_n \gamma_{nk}\,\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0 \;\Rightarrow\; \boldsymbol{\mu}_k^{\mathrm{new}} = \frac{\sum_n \gamma_{nk}\,\mathbf{x}_n}{N_k} \]

Setting \(\partial Q/\partial \boldsymbol{\Sigma}_k^{-1} = 0\):

\[ \boldsymbol{\Sigma}_k^{\mathrm{new}} = \frac{\sum_n \gamma_{nk}\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{new}})^\top}{N_k} \]

Maximising over \(\pi_k\) subject to \(\sum_k \pi_k = 1\) (Lagrange multiplier \(\lambda\)):

\[ \frac{\partial}{\partial \pi_k}\left[Q + \lambda\!\left(\sum_k\pi_k - 1\right)\right] = \frac{N_k}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k^{\mathrm{new}} = \frac{N_k}{N} \]

where \(N_k = \sum_n \gamma_{nk}\) is the effective count of points assigned to component \(k\).
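The three closed-form updates fit in a few vectorised lines. A sketch, taking the responsibility matrix \(\gamma\) of shape \((N, K)\) as input:

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form GMM M-step: weighted means, covariances, and mixing weights."""
    N, _ = X.shape
    Nk = gamma.sum(axis=0)                 # effective counts N_k = sum_n gamma_nk
    mus = (gamma.T @ X) / Nk[:, None]      # mu_k = sum_n gamma_nk x_n / N_k
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                  # centred at the *new* mean
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    pi = Nk / N                            # pi_k = N_k / N
    return pi, mus, np.stack(Sigmas)
```

With one-hot responsibilities this reduces to per-cluster sample means and covariances, foreshadowing the K-means connection below.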

Connection: K-means is Hard-Assignment EM

As we take the covariances \(\boldsymbol{\Sigma}_k \to \varepsilon^2 \mathbf{I}\) with \(\varepsilon \to 0\), the responsibilities \(\gamma_{nk}\) become one-hot (hard assignments): all probability mass concentrates on the nearest centroid. The GMM M-step for \(\boldsymbol{\mu}_k\) then becomes identical to the K-means M-step. K-means is therefore the limit of EM for isotropic equal-variance Gaussians.
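The limit is easy to check numerically: shrink a shared isotropic variance and watch the responsibilities harden. A toy sketch with two fixed 1-D centroids (values are illustrative):

```python
import numpy as np

def soft_assign(x, mus, eps):
    """Responsibilities under equal-weight isotropic Gaussians N(mu_k, eps^2 I)."""
    d2 = ((x - mus) ** 2).sum(axis=1)     # squared distance to each centroid
    logits = -d2 / (2 * eps ** 2)
    logits -= logits.max()                # stabilise before exponentiating
    g = np.exp(logits)
    return g / g.sum()

mus = np.array([[0.0], [1.0]])
x = np.array([0.3])                       # closer to the first centroid
for eps in (1.0, 0.3, 0.05):
    print(eps, soft_assign(x, mus, eps))  # first entry tends to 1 as eps shrinks
```

For large \(\varepsilon\) the assignment stays soft; as \(\varepsilon \to 0\) the exponential ratio \(e^{-\Delta d^2/2\varepsilon^2}\) drives all mass onto the nearest centroid.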

4. EM Algorithm Diagram


Three overlapping Gaussian components. E-step computes soft assignments; M-step updates the mean (✕), covariance (ellipse), and mixing weight.

5. Python: EM for GMM from Scratch

Full NumPy implementation of EM for a 3-component GMM. We visualise the evolving cluster assignments, covariance ellipses, and convergence of the log-likelihood.
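A condensed sketch of such an implementation (plotting omitted; function names, initialisation, and the small covariance ridge are my own choices). The log-likelihood is tracked each iteration so monotone convergence can be verified:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Row-wise log-density of N(mu, Sigma)."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    maha = (np.linalg.solve(L, diff.T) ** 2).sum(axis=0)
    return -0.5 * (d * np.log(2 * np.pi) + 2 * np.log(np.diag(L)).sum() + maha)

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=0):
    """EM for a K-component GMM; returns parameters and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)].astype(float)
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)  # broad initial covariances
    ll_old, lls = -np.inf, []
    for _ in range(n_iter):
        # E-step: responsibilities via log-sum-exp
        logp = np.stack([np.log(pi[k]) + log_gauss(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        m = logp.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))
        gamma = np.exp(logp - lse[:, None])
        ll = lse.sum()                      # log-likelihood at current parameters
        lls.append(ll)
        if ll - ll_old < tol:
            break
        ll_old = ll
        # M-step: closed-form weighted updates (tiny ridge keeps Sigma invertible)
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        Sigmas = np.stack([
            ((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k]
            + 1e-6 * np.eye(d) for k in range(K)])
        pi = Nk / N
    return pi, mus, Sigmas, np.array(lls)
```

Typical usage: fit `em_gmm(X, 3)` on 2-D data and plot `lls` to see the characteristic monotone, plateauing log-likelihood curve.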
