Support Vector Machines
SVMs find the maximum-margin hyperplane separating two classes. Through Lagrangian duality, the problem transforms into an elegant dual formulation where only the support vectors matter, and the kernel trick extends SVMs to non-linear boundaries without explicit feature computation.
1. Maximum Margin Classifier: Primal Formulation
A linear classifier predicts \(\hat{y} = \mathrm{sign}(\mathbf{w}^\top\mathbf{x} + b)\). For linearly separable data we can scale \(\mathbf{w}\) and \(b\) so that:
\[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \quad \text{for all } i, \]
with equality for the points closest to the boundary.
Deriving the margin: The distance from the decision boundary \(\mathbf{w}^\top\mathbf{x}+b=0\) to a point on the margin \(\mathbf{w}^\top\mathbf{x}+b=1\) is \(1/\|\mathbf{w}\|\). The total margin between the two classes is therefore:
\[ \text{margin} = \frac{2}{\|\mathbf{w}\|}. \]
Maximising \(2/\|\mathbf{w}\|\) is equivalent to minimising \(\|\mathbf{w}\|^2/2\). The hard-margin SVM primal problem is:
\[ \min_{\mathbf{w},\,b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1, \quad i = 1, \dots, n. \]
Figure: The SVM maximises the margin \(2/\|\mathbf{w}\|\) between the two classes; only the support vectors (circled) determine the decision boundary.
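The margin formula can be checked numerically. The following sketch uses a hand-constructed two-point dataset (not from the original article): with \(\mathbf{x}_1 = (1,1)\), \(\mathbf{x}_2 = (-1,-1)\), the maximum-margin hyperplane in canonical scaling is \(\mathbf{w} = (0.5, 0.5)\), \(b = 0\).

```python
import numpy as np

# Two points, one per class; the max-margin hyperplane is w = (0.5, 0.5), b = 0,
# scaled so that y_i (w.x_i + b) = 1 holds with equality at both points.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, 0.5])
b = 0.0

# Canonical-form constraints: y_i (w.x_i + b) >= 1
functional_margins = y * (X @ w + b)
print(functional_margins)   # [1. 1.] -> both points lie exactly on the margin

# Geometric margin between the classes: 2 / ||w||
margin = 2.0 / np.linalg.norm(w)
print(round(margin, 4))     # 2.8284 = 2*sqrt(2), the distance between the two points
```

Note that the margin equals the full distance between the two points, as it must when both are support vectors.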
2. Lagrangian Dual: Full Derivation
Introduce Lagrange multipliers \(\alpha_i \geq 0\) for each constraint and form the Lagrangian:
\[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 \right]. \]
Step 1: Stationarity. Set \(\partial\mathcal{L}/\partial\mathbf{w} = 0\):
\[ \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i. \]
Step 2: Set \(\partial\mathcal{L}/\partial b = 0\):
\[ \sum_{i=1}^n \alpha_i y_i = 0. \]
Step 3: Substitute back into \(\mathcal{L}\) to get the dual objective. Using \(\|\mathbf{w}\|^2 = \mathbf{w}^\top\mathbf{w} = (\sum_i \alpha_i y_i \mathbf{x}_i)^\top(\sum_j \alpha_j y_j \mathbf{x}_j) = \sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j\):
\[ \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0. \]
The dual only involves inner products \(\mathbf{x}_i^\top\mathbf{x}_j\), the key insight enabling the kernel trick. After solving, the prediction is:
\[ f(\mathbf{x}) = \mathrm{sign}\!\left( \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i^\top \mathbf{x} + b \right). \]
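The dual can be solved directly for small problems. The following sketch uses a hand-made four-point dataset and scipy's general-purpose SLSQP solver (a real implementation would use a dedicated QP or SMO solver); all dataset values are illustrative assumptions, not from the original article:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable dataset; the last two points are far from the boundary
X = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i.x_j

# Dual: maximise sum(a) - 0.5 a^T G a  <=>  minimise the negative
def neg_dual(a):
    return -(a.sum() - 0.5 * a @ G @ a)

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i a_i y_i = 0
bounds = [(0, None)] * len(y)                   # a_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                             # w = sum_i a_i y_i x_i
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)                  # from y_i (w.x_i + b) = 1 on SVs
print(np.round(alpha, 3))   # only the two points closest to the boundary get alpha > 0
print(np.round(w, 3), round(b, 3))
```

The recovered \(\mathbf{w}\) depends only on the two nearest points; the far points receive \(\alpha_i \approx 0\), previewing the sparsity discussed in the KKT section.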
3. KKT Conditions for SVMs
The KKT complementary slackness conditions reveal which points are support vectors:
\[ \alpha_i \left[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 \right] = 0 \quad \text{for all } i. \]
This means either \(\alpha_i = 0\) (the point is not a support vector and lies strictly outside the margin, with \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1\)) or \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1\) (the point lies exactly on the margin, a support vector with \(\alpha_i > 0\)). The weight vector is therefore a sparse combination of only the support vectors:
\[ \mathbf{w} = \sum_{i :\, \alpha_i > 0} \alpha_i y_i \mathbf{x}_i. \]
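This sparsity is easy to observe empirically. A minimal sketch, assuming scikit-learn is available, with an illustrative two-blob dataset (a large \(C\) approximates the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 50 points each
X = np.vstack([rng.normal(loc=2.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=-2.0, scale=0.5, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# KKT sparsity: only a handful of the 100 points have alpha_i > 0
print(len(clf.support_))    # number of support vectors (small)

# Support vectors lie (approximately) on the margin: y_i f(x_i) = 1
margins = y[clf.support_] * clf.decision_function(X[clf.support_])
print(np.round(margins, 2))
```

Dropping any non-support point and refitting would leave the decision boundary unchanged, which is exactly what complementary slackness predicts.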
4. Soft Margin SVM (Non-Separable Data)
For non-separable data, introduce slack variables \(\xi_i \geq 0\) that allow misclassification at a cost:
\[ \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0. \]
The hyperparameter \(C\) controls the tradeoff: a large \(C\) penalises violations heavily (approaching the hard margin), while a small \(C\) tolerates more violations in exchange for a wider margin. The dual has the same form as before but with box constraints \(0 \leq \alpha_i \leq C\).
The soft-margin objective \(\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1-y_i(\mathbf{w}^\top\mathbf{x}_i+b))\) is known as the hinge loss formulation.
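The hinge-loss form is directly minimisable by subgradient descent. A minimal from-scratch sketch on an illustrative two-blob dataset (all data values and hyperparameters are assumptions for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

w, b = np.zeros(2), 0.0
C, lr = 1.0, 0.01
for _ in range(200):
    margins = y * (X @ w + b)
    viol = margins < 1                      # points violating y(w.x+b) >= 1
    # Subgradient of 0.5||w||^2 + C * sum max(0, 1 - y(w.x+b)):
    # the hinge term contributes -C * y_i x_i for each violating point
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)   # 1.0 on this separable toy set
```

The hinge loss is non-differentiable at the kink \(y f(\mathbf{x}) = 1\), which is why a subgradient (rather than a gradient) step is used.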
5. The Kernel Trick
The dual objective only involves inner products \(\mathbf{x}_i^\top\mathbf{x}_j\). We can replace these with a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\) that implicitly computes inner products in a high-dimensional (or infinite-dimensional) feature space \(\phi\):
Common kernels:
- Linear: \(K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top\mathbf{z}\)
- Polynomial: \(K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d\)
- RBF (Gaussian): \(K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)\)
The RBF kernel corresponds to an infinite-dimensional feature map (Gaussian RKHS), allowing SVMs to learn arbitrarily complex boundaries. By Mercer's theorem, any symmetric positive semi-definite function is a valid kernel.
6. SMO Algorithm Overview
The Sequential Minimal Optimisation (SMO) algorithm solves the SVM dual efficiently by decomposing it into the smallest possible sub-problems. At each iteration it selects two multipliers \(\alpha_i, \alpha_j\) and optimises them analytically while keeping all others fixed. The two-variable constraint \(\alpha_i y_i + \alpha_j y_j = \text{const}\) (from \(\sum_i \alpha_i y_i = 0\)) makes this a 1D problem with an analytic solution:
\[ \alpha_j^{\text{new}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta}, \qquad \eta = K_{ii} + K_{jj} - 2K_{ij}, \]
where \(E_i = f(\mathbf{x}_i) - y_i\) is the prediction error, computed from the raw decision value \(f(\mathbf{x}_i)\) before taking the sign. The updated \(\alpha_j\) is clipped to the box \([0, C]\), and \(\alpha_i\) is then adjusted to preserve \(\sum_i \alpha_i y_i = 0\). SMO needs only \(O(n)\) memory and typically converges in \(O(n^2)\) to \(O(n^3)\) time.
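A single SMO update can be traced by hand. The sketch below applies one update to a hand-constructed two-point dataset (clipping to \([0, C]\) is omitted since this is the hard-margin case); on this tiny problem one step lands exactly on the dual optimum:

```python
import numpy as np

# Two points, one per class; the dual optimum is alpha = (0.25, 0.25)
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
K = X @ X.T                                  # linear kernel Gram matrix
alpha, b = np.zeros(2), 0.0

def f(k):                                    # raw decision value for point k
    return (alpha * y) @ K[:, k] + b

# One SMO step on the pair (i, j) = (0, 1)
i, j = 0, 1
E_i, E_j = f(i) - y[i], f(j) - y[j]          # E = f(x) - y (raw output, not sign)
eta = K[i, i] + K[j, j] - 2 * K[i, j]        # curvature along the constraint line
alpha_j_new = alpha[j] + y[j] * (E_i - E_j) / eta
alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)  # keep sum a_k y_k fixed
alpha = np.array([alpha_i_new, alpha_j_new])

w = (alpha * y) @ X
print(alpha)   # [0.25 0.25] -- one step reaches the optimum here
print(w)       # [0.5 0.5]
```

With only two multipliers the pair update solves the whole dual at once; on larger problems SMO repeats such pair updates, choosing pairs that violate the KKT conditions, until all conditions hold within tolerance.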
Python: SVM with Linear vs RBF Kernels
We implement a linear SVM via subgradient descent from scratch, compare it to a scikit-learn RBF kernel SVM on a non-linearly separable dataset, and visualise decision boundaries, margins, and support vectors.
Key Takeaways
- SVM maximises the margin \(2/\|\mathbf{w}\|\) by minimising \(\|\mathbf{w}\|^2/2\) subject to \(y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1\).
- The Lagrangian dual involves only inner products, giving: \(\max_\alpha \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\mathbf{x}_i^\top\mathbf{x}_j\).
- KKT complementary slackness identifies support vectors: those with \(\alpha_i > 0\) lying exactly on the margin.
- Soft margin adds slack variables \(\xi_i\) and a cost \(C\) to handle non-separable data via the hinge loss.
- The kernel trick replaces \(\mathbf{x}_i^\top\mathbf{x}_j\) with \(K(\mathbf{x}_i,\mathbf{x}_j)\), enabling non-linear boundaries with RBF and polynomial kernels.