Clustering: K-Means & GMM
Clustering partitions data into groups without labels. We derive K-means from its objective function, prove convergence of Lloyd's algorithm, build Gaussian Mixture Models as a probabilistic extension, and derive the full Expectation-Maximisation algorithm from first principles.
1. K-Means: Objective and Lloyd's Algorithm
Given \(N\) points \(\{\mathbf{x}_i\}_{i=1}^N\) and \(K\) clusters, K-means minimises the within-cluster sum of squared distances:
\[J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\,\|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2,\]
where \(r_{ik} \in \{0,1\}\) is the assignment indicator (\(r_{ik}=1\) iff point \(i\) belongs to cluster \(k\)), and \(\boldsymbol{\mu}_k\) is the centroid of cluster \(k\).
E-step: Assignment (fixing \(\boldsymbol{\mu}_k\))
Minimise \(J\) over \(\{r_{ik}\}\) with centroids fixed. For each point, assign to the nearest centroid:
\[r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise.} \end{cases}\]
M-step: Update centroids (fixing \(r_{ik}\))
Minimise \(J\) over \(\{\boldsymbol{\mu}_k\}\) with assignments fixed. Taking \(\partial J/\partial \boldsymbol{\mu}_k = 0\):
\[\frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2\sum_{i=1}^{N} r_{ik}\,(\mathbf{x}_i - \boldsymbol{\mu}_k) = 0 \quad\Longrightarrow\quad \boldsymbol{\mu}_k = \frac{\sum_i r_{ik}\,\mathbf{x}_i}{\sum_i r_{ik}}.\]
The new centroid is simply the mean of all points assigned to cluster \(k\).
Convergence proof sketch
Each E-step can only decrease or preserve \(J\) (reassigning to nearer centroid). Each M-step can only decrease or preserve \(J\) (mean minimises sum of squared distances). Since \(J \geq 0\) and there are finitely many assignments (\(K^N\)), the algorithm must converge. However, it may converge to a local minimum, not the global minimum.
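The two alternating steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans` and the initialise-from-random-data-points scheme are choices made here, not prescribed by the text.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate hard assignments (E-step) and centroid means (M-step)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids at K distinct data points (one common heuristic)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: squared distance of every point to every centroid, shape (N, K)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)                      # nearest-centroid assignment
        # M-step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                # J can no longer decrease
            break
        mu = new_mu
    # Final assignments and objective J for the returned centroids
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    z = d2.argmin(axis=1)
    J = d2[np.arange(len(X)), z].sum()
    return mu, z, J
```

Because each iteration only lowers (or preserves) \(J\), the `np.allclose` check is a safe stopping rule; the `n_iters` cap guards against cycling among equal-cost assignments.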
2. Gaussian Mixture Models
A Gaussian Mixture Model (GMM) is a probabilistic model where data is assumed to come from one of \(K\) Gaussian components:
\[p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),\]
where \(\pi_k \geq 0\), \(\sum_k \pi_k = 1\) are the mixing coefficients, \(\boldsymbol{\mu}_k\) are the component means, and \(\boldsymbol{\Sigma}_k\) are the covariance matrices.
Introduce latent variable \(\mathbf{z}_i \in \{0,1\}^K\) with \(p(z_{ik}=1) = \pi_k\). Then \(p(\mathbf{x}_i | z_{ik}=1) = \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\). The marginal \(p(\mathbf{x}) = \sum_k p(z_{ik}=1)\,p(\mathbf{x}|z_{ik}=1)\) recovers the GMM.
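This generative story (draw a component label from \(\pi\), then draw \(\mathbf{x}\) from that component's Gaussian) can be sketched as ancestral sampling; the function name `sample_gmm` is illustrative.

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Ancestral sampling: z ~ Categorical(pi), then x | z ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)     # latent component labels
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z
```

Marginalising out the labels `z` recovers exactly the mixture density \(p(\mathbf{x})\) above.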
3. EM Algorithm: Full Derivation
We want to maximise the log-likelihood \(\log p(\mathbf{X};\boldsymbol{\theta})\). With latent variables \(\mathbf{Z}\), direct maximisation is intractable because of the sum inside the log. EM maximises a lower bound instead.
ELBO decomposition
Introduce any distribution \(q(\mathbf{Z})\). The log-likelihood splits exactly into a lower bound plus a KL term (the bound alone also follows from Jensen's inequality applied to the log of an expectation):
\[\log p(\mathbf{X};\boldsymbol{\theta}) = \underbrace{\sum_{\mathbf{Z}} q(\mathbf{Z}) \log \frac{p(\mathbf{X},\mathbf{Z};\boldsymbol{\theta})}{q(\mathbf{Z})}}_{\mathcal{L}(q,\boldsymbol{\theta})} + \underbrace{\mathrm{KL}\big(q(\mathbf{Z})\,\|\,p(\mathbf{Z}\mid\mathbf{X};\boldsymbol{\theta})\big)}_{\geq 0}.\]
Since \(\mathrm{KL} \geq 0\), we have \(\log p(\mathbf{X}) \geq \mathcal{L}(q,\boldsymbol{\theta})\) (the ELBO). EM alternates between tightening the bound (E-step: set \(q = p(\mathbf{Z}|\mathbf{X})\)) and maximising it (M-step: maximise over \(\boldsymbol{\theta}\)).
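The decomposition can be verified numerically on a toy 1-D two-component mixture; all parameter values below are illustrative, chosen only to make the arithmetic concrete.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Toy mixture parameters (illustrative, not from the text)
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
var = np.array([0.5, 1.0])
x = 0.4

joint = pi * normal_pdf(x, mu, var)      # p(x, z=k) for each component k
log_px = np.log(joint.sum())             # marginal log-likelihood log p(x)

q = np.array([0.9, 0.1])                 # an arbitrary q(z), NOT the posterior
elbo = (q * np.log(joint / q)).sum()     # E_q[log p(x,z) - log q(z)]
gamma = joint / joint.sum()              # true posterior p(z | x)
kl = (q * np.log(q / gamma)).sum()       # KL(q || posterior)

assert np.isclose(elbo + kl, log_px)     # exact decomposition holds
assert elbo <= log_px                    # ELBO is a lower bound
```

Setting `q = gamma` drives the KL term to zero and makes the bound tight, which is precisely the E-step.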
E-step: Compute responsibilities
By Bayes' theorem:
\[\gamma_{nk} \equiv p(z_{nk}=1 \mid \mathbf{x}_n) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.\]
\(\gamma_{nk}\) is the soft assignment: the probability that data point \(n\) came from component \(k\).
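The E-step formula translates directly to NumPy; the helper names `gauss_pdf` and `responsibilities` are choices made for this sketch (a real implementation would usually work in log space for numerical stability).

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) at each row of X."""
    d = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)     # Mahalanobis terms
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def responsibilities(X, pis, mus, Sigmas):
    """E-step: gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    dens = np.stack([pis[k] * gauss_pdf(X, mus[k], Sigmas[k])
                     for k in range(len(pis))], axis=1)  # shape (N, K)
    return dens / dens.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one: every point distributes a unit of "responsibility" across the \(K\) components.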
M-step: Derive \(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k\) updates
Maximise the expected complete-data log-likelihood \(Q = \sum_n \sum_k \gamma_{nk} \log[\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)]\). Setting \(\partial Q/\partial \boldsymbol{\mu}_k = 0\):
\[\boldsymbol{\mu}_k = \frac{\sum_{n} \gamma_{nk}\,\mathbf{x}_n}{\sum_{n} \gamma_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\,\mathbf{x}_n.\]
Setting \(\partial Q/\partial \boldsymbol{\Sigma}_k^{-1} = 0\):
\[\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\,(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}.\]
Maximising over \(\pi_k\) subject to \(\sum_k \pi_k = 1\) (Lagrange multiplier \(\lambda\)):
\[\pi_k = \frac{N_k}{N},\]
where \(N_k = \sum_n \gamma_{nk}\) is the effective count of points assigned to component \(k\).
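All three updates are simple weighted averages over the responsibilities, which makes the M-step a few lines of NumPy; the function name `m_step` is an illustrative choice.

```python
import numpy as np

def m_step(X, gamma):
    """M-step: closed-form pi, mu, Sigma updates from responsibilities gamma (N, K)."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                         # effective counts N_k
    pis = Nk / N                                   # mixing weights
    mus = (gamma.T @ X) / Nk[:, None]              # responsibility-weighted means
    Sigmas = np.empty((K, X.shape[1], X.shape[1]))
    for k in range(K):
        d = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * d).T @ d / Nk[k]  # weighted covariance
    return pis, mus, Sigmas
```

With one-hot responsibilities this reduces exactly to per-cluster sample means and covariances, foreshadowing the K-means connection below.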
Connection: K-means is Hard-Assignment EM
As we take the covariances \(\boldsymbol{\Sigma}_k \to \varepsilon^2 \mathbf{I}\) with \(\varepsilon \to 0\), the responsibilities \(\gamma_{nk}\) become one-hot (hard assignments): all probability mass concentrates on the nearest centroid. The GMM M-step for \(\boldsymbol{\mu}_k\) then becomes identical to the K-means M-step. K-means is therefore the limit of EM for isotropic equal-variance Gaussians.
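The hardening of the responsibilities is easy to see numerically in one dimension: with shared isotropic variance \(\varepsilon^2\), the mixing weights and normalising constants cancel, and the softmax over \(-\|x-\mu_k\|^2/2\varepsilon^2\) sharpens as \(\varepsilon\) shrinks. The values below are an illustrative toy setup.

```python
import numpy as np

# Two equal-weight 1-D components with shared variance eps^2 (toy values)
mu = np.array([0.0, 4.0])
x = 1.0                                        # nearer to mu[0]
for eps in [2.0, 0.5, 0.1]:
    # log N(x | mu_k, eps^2) up to terms shared by both components
    logits = -(x - mu) ** 2 / (2 * eps ** 2)
    g = np.exp(logits - logits.max())
    g /= g.sum()                               # responsibilities gamma
# At eps = 0.1 the soft assignment is effectively one-hot on the nearest centroid
assert g[0] > 0.999
```

In this limit the weighted-mean M-step degenerates to the plain per-cluster mean, i.e. the K-means update.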
4. EM Algorithm Diagram
Three overlapping Gaussian components. The E-step computes soft assignments; the M-step updates each component's mean (marker), covariance (ellipse), and mixing weight.
5. Python: EM for GMM from Scratch
Full NumPy implementation of EM for a 3-component GMM. We visualise the evolving cluster assignments, covariance ellipses, and convergence of the log-likelihood.
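A compact, plotting-free sketch of such an implementation is below. The names `em_gmm` and `log_gauss`, the `1e-6` covariance regulariser, and the initialisation scheme are all assumptions of this sketch, not details fixed by the text; it works in log space to avoid underflow.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X, via a Cholesky factor."""
    d = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    diff = np.linalg.solve(L, (X - mu).T).T            # whitened residuals
    return (-0.5 * (diff ** 2).sum(axis=1)
            - 0.5 * d * np.log(2 * np.pi)
            - np.log(np.diag(L)).sum())                # -0.5 * log det Sigma

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a K-component GMM; returns parameters and final log-likelihood."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]           # init means at data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: log p(x_n, z_n = k), then responsibilities via log-sum-exp
        logp = np.stack([np.log(pis[k]) + log_gauss(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        m = logp.max(axis=1, keepdims=True)
        ll = (m.ravel() + np.log(np.exp(logp - m).sum(axis=1))).sum()
        gamma = np.exp(logp - m)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = ((gamma[:, k, None] * diff).T @ diff / Nk[k]
                         + 1e-6 * np.eye(d))           # keep Sigma well-conditioned
        if ll - prev_ll < tol:                         # ll is monotone non-decreasing
            break
        prev_ll = ll
    return pis, mus, Sigmas, ll
```

Monitoring `ll` across iterations gives the convergence curve the section describes; because EM only increases the likelihood, a decreasing value signals a bug.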