Part II · Chapter 6

Support Vector Machines

SVMs find the maximum-margin hyperplane separating two classes. Through Lagrangian duality, the problem transforms into an elegant dual formulation in which only the support vectors matter, and the kernel trick extends SVMs to non-linear boundaries without explicit feature computation.

1. Maximum Margin Classifier: Primal Formulation

A linear classifier predicts \(\hat{y} = \mathrm{sign}(\mathbf{w}^\top\mathbf{x} + b)\). For linearly separable data we can scale \(\mathbf{w}\) so that:

\[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \quad \forall i \]

Deriving the margin: The distance from the decision boundary \(\mathbf{w}^\top\mathbf{x}+b=0\) to a point on the margin \(\mathbf{w}^\top\mathbf{x}+b=1\) is \(1/\|\mathbf{w}\|\). The total margin between the two classes is therefore:

\[ \text{margin} = \frac{2}{\|\mathbf{w}\|} \]

Maximising \(2/\|\mathbf{w}\|\) is equivalent to minimising \(\|\mathbf{w}\|^2/2\). The hard-margin SVM primal problem is:

\[ \min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1, \; i=1,\ldots,n \]
Figure: The SVM maximises the margin \(2/\|\mathbf{w}\|\) between the margin hyperplanes \(\mathbf{w}^\top\mathbf{x}+b=\pm 1\), with the decision boundary \(\mathbf{w}^\top\mathbf{x}+b=0\) midway between them. Only support vectors (circled) determine the decision boundary.
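To make the primal concrete, the sketch below hands this exact QP to a general-purpose constrained solver (SciPy's SLSQP). The four-point dataset is invented for illustration; by hand one can check its optimum is \(\mathbf{w}=(1,0)\), \(b=-2\), margin 2, since the closest opposite-class points are \((1,0)\) and \((3,0)\).

```python
# Toy hard-margin primal solved with a generic solver (illustrative sketch
# only; real SVM solvers exploit the QP structure).
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0],   # class -1
              [3.0, 0.0], [4.0, 1.0]])  # class +1
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(z):                        # z = [w1, w2, b]
    w = z[:2]
    return 0.5 * w @ w                   # (1/2) ||w||^2

# one constraint  y_i (w . x_i + b) - 1 >= 0  per training point
cons = [{"type": "ineq",
         "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, np.zeros(3), method="SLSQP", constraints=cons)
w, b = res.x[:2], res.x[2]
print("w =", w.round(3), "b =", round(b, 3),
      "margin =", round(2 / np.linalg.norm(w), 3))
```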

2. Lagrangian Dual: Full Derivation

Introduce Lagrange multipliers \(\alpha_i \geq 0\) for each constraint and form the Lagrangian:

\[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i\Big[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1\Big] \]

Step 1 (stationarity): set \(\partial\mathcal{L}/\partial\mathbf{w} = 0\):

\[ \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \]

Step 2: Set \(\partial\mathcal{L}/\partial b = 0\):

\[ \sum_{i=1}^n \alpha_i y_i = 0 \]

Step 3: Substitute back into \(\mathcal{L}\) to get the dual objective. Using \(\|\mathbf{w}\|^2 = \mathbf{w}^\top\mathbf{w} = (\sum_i \alpha_i y_i \mathbf{x}_i)^\top(\sum_j \alpha_j y_j \mathbf{x}_j) = \sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j\):

\[ \max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j \]
\(\text{subject to } \alpha_i \geq 0,\; \sum_i \alpha_i y_i = 0\)

The dual only involves the inner products \(\mathbf{x}_i^\top\mathbf{x}_j\); this is the key insight enabling the kernel trick. After solving for \(\boldsymbol{\alpha}\), the prediction is:

\[ \hat{y} = \mathrm{sign}\!\left(\sum_{i} \alpha_i y_i \mathbf{x}_i^\top\mathbf{x} + b\right) \]
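The same toy problem can be solved through the dual. The sketch below (data and variable names invented for illustration) minimises the negated dual objective under the non-negativity and equality constraints, then recovers \(\mathbf{w}\) and \(b\); \(b\) comes from the fact that the margin constraint is tight at any support vector.

```python
# Dual of a toy hard-margin SVM solved numerically (sketch, not a
# production solver).
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i . x_j

def neg_dual(a):                               # minimise the negated dual
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, None)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                     # any index with alpha_i > 0
b = y[sv] - w @ X[sv]                          # constraint tight at a support vector
print("alpha =", alpha.round(3))
print("pred  =", np.sign(X @ w + b))
```

Only the two points nearest the boundary end up with \(\alpha_i > 0\); the other multipliers vanish.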

3. KKT Conditions for SVMs

The KKT complementary slackness conditions reveal which points are support vectors:

\[ \alpha_i\Big[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1\Big] = 0 \quad \forall i \]

This means either \(\alpha_i = 0\) (the point is not a support vector and lies strictly beyond the margin, with \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1\)) or \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1\) (the point lies exactly on the margin: a support vector with \(\alpha_i > 0\)). The weight vector is therefore a sparse combination of only the support vectors:

\[ \mathbf{w} = \sum_{i \in \mathcal{S}} \alpha_i y_i \mathbf{x}_i \quad \text{where } \mathcal{S} = \{i : \alpha_i > 0\} \]
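These conditions can be checked numerically on a hand-solved toy instance (the data, multipliers, and solution below are constructed for illustration; the boundary is \(x_1 = 1\)):

```python
# Verifying KKT complementary slackness on a hand-solved hard-margin SVM.
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w, b = np.array([1.0, 0.0]), -1.0          # known optimum for this data
alpha = np.array([0.5, 0.5, 0.0])          # third point is inactive

margins = y * (X @ w + b)
slack = alpha * (margins - 1.0)            # complementary slackness terms
print("margins:", margins, "slackness:", slack)

# alpha_i > 0  <=>  point sits exactly on the margin
assert np.allclose(slack, 0.0)
S = alpha > 0
assert np.allclose(margins[S], 1.0) and np.all(margins[~S] > 1.0)
# w is recovered from the support vectors alone
assert np.allclose((alpha[S] * y[S]) @ X[S], w)
```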

4. Soft Margin SVM (Non-Separable Data)

For non-separable data, introduce slack variables \(\xi_i \geq 0\) that allow misclassification at a cost:

\[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i+b) \geq 1 - \xi_i,\; \xi_i \geq 0 \]

The hyperparameter \(C\) controls the tradeoff: large \(C\) penalises violations heavily (hard margin), small \(C\) allows more violations (wide margin). The dual has the same form as before but with box constraints \(0 \leq \alpha_i \leq C\).

The soft-margin objective \(\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1-y_i(\mathbf{w}^\top\mathbf{x}_i+b))\) is known as the hinge loss formulation.
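The unconstrained objective translates directly into code. The sketch below (helper names are ours) evaluates the objective and one subgradient of it: margin violators contribute \(-C\,y_i\mathbf{x}_i\) to the gradient in \(\mathbf{w}\), everything else contributes only the regulariser.

```python
# Hinge-loss objective and a subgradient of it (minimal sketch).
import numpy as np

def hinge_objective(w, b, X, y, C):
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

def hinge_subgradient(w, b, X, y, C):
    viol = y * (X @ w + b) < 1.0                 # points violating the margin
    gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    gb = -C * y[viol].sum()
    return gw, gb

X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w, b = np.array([1.0, 0.0]), -1.0
# all margins >= 1 here, so the loss term is 0 and the objective is ||w||^2/2
print(hinge_objective(w, b, X, y, C=1.0))
```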

5. The Kernel Trick

The dual objective only involves inner products \(\mathbf{x}_i^\top\mathbf{x}_j\). We can replace these with a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\) that implicitly computes inner products in a high-dimensional (or infinite-dimensional) feature space \(\phi\):

\[ \max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \]

Common kernels:

  • Linear: \(K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}'\)
  • RBF (Gaussian): \(K(\mathbf{x},\mathbf{x}') = \exp\!\left(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\right)\), with \(\gamma = 1/(2\sigma^2)\) for bandwidth \(\sigma\)
  • Polynomial: \(K(\mathbf{x},\mathbf{x}') = (\mathbf{x}^\top\mathbf{x}' + c)^d\)

The RBF kernel corresponds to an infinite-dimensional feature map (the Gaussian RKHS), allowing SVMs to learn arbitrarily complex boundaries. By Mercer's theorem, any symmetric positive semi-definite function is a valid kernel.
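Mercer's condition can be checked empirically: the Gram matrix built from a valid kernel should have no significantly negative eigenvalues. A small sketch (data and parameters arbitrary):

```python
# PSD check of the three kernels above on random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def rbf(X, gamma=0.5):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

mins = []
for name, K in [("linear", X @ X.T),
                ("rbf", rbf(X)),
                ("poly", (X @ X.T + 1.0) ** 3)]:
    lam_min = np.linalg.eigvalsh(K).min()      # smallest eigenvalue
    mins.append(lam_min)
    print(f"{name}: smallest eigenvalue = {lam_min:.2e}")
```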

6. SMO Algorithm Overview

The Sequential Minimal Optimisation (SMO) algorithm solves the SVM dual efficiently by decomposing it into the smallest possible sub-problems. At each iteration it selects two multipliers \(\alpha_i, \alpha_j\) and optimises them analytically while keeping all others fixed. The two-variable constraint \(\alpha_i y_i + \alpha_j y_j = \text{const}\) (from \(\sum_i \alpha_i y_i = 0\)) makes this a 1D problem with an analytic solution:

\[ \alpha_j^{\mathrm{new}} = \alpha_j^{\mathrm{old}} + \frac{y_j(E_i - E_j)}{\eta}, \quad \eta = K_{ii} + K_{jj} - 2K_{ij} \]

where \(E_i = f(\mathbf{x}_i) - y_i\) is the error of the current real-valued decision function \(f(\mathbf{x}) = \sum_k \alpha_k y_k K(\mathbf{x}_k, \mathbf{x}) + b\). The new \(\alpha_j\) is clipped to an interval \([L, H] \subseteq [0, C]\) determined by the box constraints together with \(\sum_i \alpha_i y_i = 0\), and \(\alpha_i\) is then adjusted to keep that equality satisfied. SMO needs only \(O(n)\) memory and empirically converges in \(O(n^2)\) to \(O(n^3)\) time.
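The inner step can be sketched in isolation (this is only the analytic pair update, not Platt's full algorithm with its heuristic pair selection and threshold update; the helper name is ours):

```python
# One analytic SMO pair update on (alpha_i, alpha_j), linear kernel.
import numpy as np

def smo_pair_update(alpha, i, j, X, y, b, C):
    K = X @ X.T                                   # kernel (Gram) matrix
    E = (alpha * y) @ K + b - y                   # errors of current decision fn
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0:                                  # degenerate pair: skip
        return alpha
    # interval [L, H] allowed by the box [0, C] and sum_i alpha_i y_i = 0
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_j = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)   # preserve the equality
    out = alpha.copy()
    out[i], out[j] = a_i, a_j
    return out

X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
new = smo_pair_update(np.zeros(3), 0, 1, X, y, b=0.0, C=1.0)
print("alpha after one update:", new)
```

After the update, the equality constraint and the box constraints still hold, which is the invariant SMO maintains at every iteration.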

Python: SVM, Linear vs RBF Kernel

We implement a linear SVM via subgradient descent from scratch, compare it to a scikit-learn RBF kernel SVM on a non-linearly separable dataset, and visualise decision boundaries, margins, and support vectors.
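A condensed, text-only sketch of the linear half is below (the dataset, hyperparameters, and function name are invented here; the full script additionally plots boundaries, margins, and support vectors, and fits `sklearn.svm.SVC(kernel="rbf")` for the non-linear comparison):

```python
# From-scratch linear SVM: subgradient descent on the soft-margin
# (hinge-loss) objective. Minimal sketch, no plotting.
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1.0              # current margin violators
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

# two well-separated Gaussian blobs (toy data for the linear case)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.6, size=(40, 2)),   # class -1
               rng.normal(+2.0, 0.6, size=(40, 2))])  # class +1
y = np.array([-1.0] * 40 + [1.0] * 40)

w, b = train_linear_svm(X, y)
acc = float((np.sign(X @ w + b) == y).mean())
print("train accuracy:", acc)
```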


Key Takeaways

  • ✓ SVM maximises margin \(2/\|\mathbf{w}\|\) by minimising \(\|\mathbf{w}\|^2/2\) subject to \(y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1\).
  • ✓ The Lagrangian dual involves only inner products, giving: \(\max_\alpha \sum\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\mathbf{x}_i^\top\mathbf{x}_j\).
  • ✓ KKT complementary slackness identifies support vectors: those with \(\alpha_i > 0\) lying exactly on the margin.
  • ✓ Soft margin adds slack variables \(\xi_i\) and a cost \(C\) to handle non-separable data via hinge loss.
  • ✓ The kernel trick replaces \(\mathbf{x}_i^\top\mathbf{x}_j\) with \(K(\mathbf{x}_i,\mathbf{x}_j)\), enabling non-linear boundaries with RBF and polynomial kernels.