Part IV · Chapter 10

Clustering: K-Means & GMM

Clustering partitions data into groups without labels. We derive K-means from its objective function, prove convergence of Lloyd's algorithm, build Gaussian Mixture Models as a probabilistic extension, and derive the full Expectation-Maximisation algorithm from first principles.

1. K-Means: Objective and Lloyd's Algorithm

Given \(N\) points \(\{\mathbf{x}_i\}_{i=1}^N\) and \(K\) clusters, K-means minimises the within-cluster sum of squared distances:

\[ J = \sum_{k=1}^K \sum_{i:\, r_{ik}=1} \left\|\mathbf{x}_i - \boldsymbol{\mu}_k\right\|^2 \]

where \(r_{ik} \in \{0,1\}\) is the assignment indicator (\(r_{ik}=1\) iff point \(i\) belongs to cluster \(k\)), and \(\boldsymbol{\mu}_k\) is the centroid of cluster \(k\).

E-step: Assignment (fixing \(\boldsymbol{\mu}_k\))

Minimise \(J\) over \(\{r_{ik}\}\) with centroids fixed. For each point, assign to the nearest centroid:

\[ r_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases} \]

M-step: Update centroids (fixing \(r_{ik}\))

Minimise \(J\) over \(\{\boldsymbol{\mu}_k\}\) with assignments fixed. Taking \(\partial J/\partial \boldsymbol{\mu}_k = 0\):

\[ \frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2\sum_{i:\, r_{ik}=1}(\mathbf{x}_i - \boldsymbol{\mu}_k) = 0 \]
\[ \Rightarrow\quad \boldsymbol{\mu}_k = \frac{\sum_i r_{ik}\,\mathbf{x}_i}{\sum_i r_{ik}} \]

The new centroid is simply the mean of all points assigned to cluster \(k\).
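The two alternating steps translate directly into code. A minimal NumPy sketch of Lloyd's algorithm (the function name and the random-point initialisation are illustrative choices, not a prescribed scheme):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean updates."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as K distinct data points (one common choice)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: squared distances to every centroid, then nearest assignment
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        r = d2.argmin(axis=1)
        # M-step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # assignments stable => J cannot decrease further
            break
        mu = new_mu
    J = d2[np.arange(len(X)), r].sum()  # within-cluster sum of squares
    return mu, r, J
```

Guarding against empty clusters (keeping the old centroid) is one of several conventions; restarting from a random point is another.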

Convergence proof sketch

Each E-step can only decrease or preserve \(J\) (reassigning to nearer centroid). Each M-step can only decrease or preserve \(J\) (mean minimises sum of squared distances). Since \(J \geq 0\) and there are finitely many assignments (\(K^N\)), the algorithm must converge. However, it may converge to a local minimum, not the global minimum.

2. Gaussian Mixture Models

A Gaussian Mixture Model (GMM) is a probabilistic model where data is assumed to come from one of \(K\) Gaussian components:

\[ p(\mathbf{x}) = \sum_{k=1}^K \pi_k\,\mathcal{N}\!\left(\mathbf{x}\,\big|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \]

where \(\pi_k \geq 0\), \(\sum_k \pi_k = 1\) are the mixing coefficients, \(\boldsymbol{\mu}_k\) are the component means, and \(\boldsymbol{\Sigma}_k\) are the covariance matrices.

Introduce latent variable \(\mathbf{z}_i \in \{0,1\}^K\) with \(p(z_{ik}=1) = \pi_k\). Then \(p(\mathbf{x}_i | z_{ik}=1) = \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\). The marginal \(p(\mathbf{x}) = \sum_k p(z_{ik}=1)\,p(\mathbf{x}|z_{ik}=1)\) recovers the GMM.
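The latent-variable view gives a direct recipe for sampling: draw the component label \(z\) from the categorical distribution \(\pi\), then draw \(\mathbf{x}\) from the selected Gaussian. A short sketch (the particular \(\pi_k\), \(\boldsymbol{\mu}_k\), \(\boldsymbol{\Sigma}_k\) values are illustrative):

```python
import numpy as np

def sample_gmm(n, pi, mus, Sigmas, seed=0):
    """Ancestral sampling: z ~ Categorical(pi), then x | z=k ~ N(mu_k, Sigma_k)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)  # latent component labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigmas = np.stack([np.eye(2) * 0.3] * 3)
X, z = sample_gmm(2000, pi, mus, Sigmas)
```

The empirical fraction of samples with \(z = k\) converges to \(\pi_k\), and each conditional slice is Gaussian, matching the marginal density above.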

3. EM Algorithm: Full Derivation

We want to maximise the log-likelihood \(\log p(\mathbf{X};\boldsymbol{\theta})\). With latent variables \(\mathbf{Z}\), direct maximisation is intractable because of the sum inside the log. EM maximises a lower bound instead.

ELBO decomposition

Introduce any distribution \(q(\mathbf{Z})\) over the latent variables. The log-likelihood then decomposes exactly as:

\[ \log p(\mathbf{X}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}\bigl(q(\mathbf{Z}) \,\|\, p(\mathbf{Z}|\mathbf{X};\boldsymbol{\theta})\bigr) \]

Since \(\mathrm{KL} \geq 0\), we have \(\log p(\mathbf{X}) \geq \mathcal{L}(q,\boldsymbol{\theta})\) (the ELBO). EM alternates between tightening the bound (E-step: set \(q = p(\mathbf{Z}|\mathbf{X})\)) and maximising it (M-step: maximise over \(\boldsymbol{\theta}\)).

E-step: Compute responsibilities

By Bayes' theorem:

\[ \gamma_{nk} := p(z_{nk}=1 | \mathbf{x}_n;\boldsymbol{\theta}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\displaystyle\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)} \]

\(\gamma_{nk}\) is the soft assignment: the probability that data point \(n\) came from component \(k\).
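In code, the responsibilities are best computed in log-space, since the Gaussian densities underflow easily in high dimensions. A NumPy sketch using the log-sum-exp trick (helper names are my own):

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log-density of N(mu, Sigma) via a Cholesky factorisation."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    maha = (np.linalg.solve(L, diff.T) ** 2).sum(axis=0)  # (x-mu)^T Sigma^{-1} (x-mu)
    logdet = 2 * np.log(np.diag(L)).sum()
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def e_step(X, pi, mus, Sigmas):
    """Responsibilities gamma_{nk} by Bayes' theorem, normalised in log-space."""
    logp = np.stack([np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k])
                     for k in range(len(pi))], axis=1)   # shape (N, K)
    logp -= logp.max(axis=1, keepdims=True)              # log-sum-exp stabilisation
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Subtracting the row-wise maximum before exponentiating leaves the normalised ratio unchanged but keeps every exponent bounded above by zero.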

M-step: Derive \(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k\) updates

Maximise the expected complete-data log-likelihood \(Q = \sum_n \sum_k \gamma_{nk} \log[\pi_k\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)]\). Setting \(\partial Q/\partial \boldsymbol{\mu}_k = 0\):

\[ -\sum_n \gamma_{nk}\,\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0 \;\Rightarrow\; \boldsymbol{\mu}_k^{\mathrm{new}} = \frac{\sum_n \gamma_{nk}\,\mathbf{x}_n}{N_k} \]

Setting \(\partial Q/\partial \boldsymbol{\Sigma}_k^{-1} = 0\):

\[ \boldsymbol{\Sigma}_k^{\mathrm{new}} = \frac{\sum_n \gamma_{nk}\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\mathrm{new}})^\top}{N_k} \]

Maximising over \(\pi_k\) subject to \(\sum_k \pi_k = 1\) (Lagrange multiplier \(\lambda\)):

\[ \frac{\partial}{\partial \pi_k}\left[Q + \lambda\!\left(\sum_k\pi_k - 1\right)\right] = \frac{N_k}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k^{\mathrm{new}} = \frac{N_k}{N} \]

where \(N_k = \sum_n \gamma_{nk}\) is the effective count of points assigned to component \(k\).
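The three closed-form updates fit in a few vectorised lines. A sketch, taking the responsibility matrix \(\gamma\) of shape \((N, K)\) as input:

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form GMM M-step: weighted means, covariances, and mixing weights."""
    N, _ = X.shape
    Nk = gamma.sum(axis=0)                 # effective counts N_k = sum_n gamma_nk
    mus = (gamma.T @ X) / Nk[:, None]      # mu_k = sum_n gamma_nk x_n / N_k
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                  # centred at the *new* mean
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    pi = Nk / N                            # pi_k = N_k / N
    return pi, mus, np.stack(Sigmas)
```

With one-hot responsibilities this reduces to per-cluster sample means and covariances, foreshadowing the K-means connection below.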

Connection: K-means is Hard-Assignment EM

As we take the covariances \(\boldsymbol{\Sigma}_k \to \varepsilon^2 \mathbf{I}\) with \(\varepsilon \to 0\), the responsibilities \(\gamma_{nk}\) become one-hot (hard assignments): all probability mass concentrates on the nearest centroid. The GMM M-step for \(\boldsymbol{\mu}_k\) then becomes identical to the K-means M-step. K-means is therefore the limit of EM for isotropic equal-variance Gaussians.
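The limit is easy to check numerically: shrink a shared isotropic variance and watch the responsibilities harden. A toy sketch with two fixed 1-D centroids (values are illustrative):

```python
import numpy as np

def soft_assign(x, mus, eps):
    """Responsibilities under equal-weight isotropic Gaussians N(mu_k, eps^2 I)."""
    d2 = ((x - mus) ** 2).sum(axis=1)     # squared distance to each centroid
    logits = -d2 / (2 * eps ** 2)
    logits -= logits.max()                # stabilise before exponentiating
    g = np.exp(logits)
    return g / g.sum()

mus = np.array([[0.0], [1.0]])
x = np.array([0.3])                       # closer to the first centroid
for eps in (1.0, 0.3, 0.05):
    print(eps, soft_assign(x, mus, eps))  # first entry tends to 1 as eps shrinks
```

For large \(\varepsilon\) the assignment stays soft; as \(\varepsilon \to 0\) the exponential ratio \(e^{-\Delta d^2/2\varepsilon^2}\) drives all mass onto the nearest centroid.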

4. EM Algorithm Diagram


Three overlapping Gaussian components. E-step computes soft assignments; M-step updates the mean (✕), covariance (ellipse), and mixing weight.

5. Python: EM for GMM from Scratch

Full NumPy implementation of EM for a 3-component GMM. We visualise the evolving cluster assignments, covariance ellipses, and convergence of the log-likelihood.
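A condensed sketch of such an implementation (plotting omitted; function names, initialisation, and the small covariance ridge are my own choices). The log-likelihood is tracked each iteration so monotone convergence can be verified:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Row-wise log-density of N(mu, Sigma)."""
    d = X.shape[1]
    diff = X - mu
    L = np.linalg.cholesky(Sigma)
    maha = (np.linalg.solve(L, diff.T) ** 2).sum(axis=0)
    return -0.5 * (d * np.log(2 * np.pi) + 2 * np.log(np.diag(L)).sum() + maha)

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=0):
    """EM for a K-component GMM; returns parameters and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)].astype(float)
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)  # broad initial covariances
    ll_old, lls = -np.inf, []
    for _ in range(n_iter):
        # E-step: responsibilities via log-sum-exp
        logp = np.stack([np.log(pi[k]) + log_gauss(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        m = logp.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))
        gamma = np.exp(logp - lse[:, None])
        ll = lse.sum()                      # log-likelihood at current parameters
        lls.append(ll)
        if ll - ll_old < tol:
            break
        ll_old = ll
        # M-step: closed-form weighted updates (tiny ridge keeps Sigma invertible)
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        Sigmas = np.stack([
            ((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k]
            + 1e-6 * np.eye(d) for k in range(K)])
        pi = Nk / N
    return pi, mus, Sigmas, np.array(lls)
```

Typical usage: fit `em_gmm(X, 3)` on 2-D data and plot `lls` to see the characteristic monotone, plateauing log-likelihood curve.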
