Part II · Chapter 6

Support Vector Machines

SVMs find the maximum-margin hyperplane separating two classes. Through Lagrangian duality, the problem transforms into an elegant dual formulation in which only the support vectors matter, and the kernel trick extends SVMs to non-linear boundaries without explicit feature computation.

1. Maximum Margin Classifier: Primal Formulation

A linear classifier predicts \(\hat{y} = \mathrm{sign}(\mathbf{w}^\top\mathbf{x} + b)\). For linearly separable data we can scale \(\mathbf{w}\) so that:

\[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \quad \forall i \]

Deriving the margin: The distance from the decision boundary \(\mathbf{w}^\top\mathbf{x}+b=0\) to a point on the margin \(\mathbf{w}^\top\mathbf{x}+b=1\) is \(1/\|\mathbf{w}\|\). The total margin between the two classes is therefore:

\[ \text{margin} = \frac{2}{\|\mathbf{w}\|} \]

Maximising \(2/\|\mathbf{w}\|\) is equivalent to minimising \(\|\mathbf{w}\|^2/2\). The hard-margin SVM primal problem is:

\[ \min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1, \; i=1,\ldots,n \]
Figure: The SVM maximises the margin \(2/\|\mathbf{w}\|\) between the margin hyperplanes \(\mathbf{w}^\top\mathbf{x}+b=\pm 1\), with the decision boundary \(\mathbf{w}^\top\mathbf{x}+b=0\) midway between them. Only support vectors (circled) determine the decision boundary.
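To make the primal concrete, the sketch below hands this exact QP to a general-purpose constrained solver (SciPy's SLSQP). The four-point dataset is invented for illustration; by hand one can check its optimum is \(\mathbf{w}=(1,0)\), \(b=-2\), margin 2, since the closest opposite-class points are \((1,0)\) and \((3,0)\).

```python
# Toy hard-margin primal solved with a generic solver (illustrative sketch
# only; real SVM solvers exploit the QP structure).
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0],   # class -1
              [3.0, 0.0], [4.0, 1.0]])  # class +1
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(z):                        # z = [w1, w2, b]
    w = z[:2]
    return 0.5 * w @ w                   # (1/2) ||w||^2

# one constraint  y_i (w . x_i + b) - 1 >= 0  per training point
cons = [{"type": "ineq",
         "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, np.zeros(3), method="SLSQP", constraints=cons)
w, b = res.x[:2], res.x[2]
print("w =", w.round(3), "b =", round(b, 3),
      "margin =", round(2 / np.linalg.norm(w), 3))
```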

2. Lagrangian Dual: Full Derivation

Introduce Lagrange multipliers \(\alpha_i \geq 0\) for each constraint and form the Lagrangian:

\[ \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i\Big[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1\Big] \]

Step 1 (stationarity): set \(\partial\mathcal{L}/\partial\mathbf{w} = 0\):

\[ \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \]

Step 2: Set \(\partial\mathcal{L}/\partial b = 0\):

\[ \sum_{i=1}^n \alpha_i y_i = 0 \]

Step 3: Substitute back into \(\mathcal{L}\) to get the dual objective. Using \(\|\mathbf{w}\|^2 = \mathbf{w}^\top\mathbf{w} = (\sum_i \alpha_i y_i \mathbf{x}_i)^\top(\sum_j \alpha_j y_j \mathbf{x}_j) = \sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j\):

\[ \max_{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j \]
\(\text{subject to } \alpha_i \geq 0,\; \sum_i \alpha_i y_i = 0\)

The dual only involves the inner products \(\mathbf{x}_i^\top\mathbf{x}_j\); this is the key insight enabling the kernel trick. After solving for \(\boldsymbol{\alpha}\), the prediction is:

\[ \hat{y} = \mathrm{sign}\!\left(\sum_{i} \alpha_i y_i \mathbf{x}_i^\top\mathbf{x} + b\right) \]
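The same toy problem can be solved through the dual. The sketch below (data and variable names invented for illustration) minimises the negated dual objective under the non-negativity and equality constraints, then recovers \(\mathbf{w}\) and \(b\); \(b\) comes from the fact that the margin constraint is tight at any support vector.

```python
# Dual of a toy hard-margin SVM solved numerically (sketch, not a
# production solver).
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i . x_j

def neg_dual(a):                               # minimise the negated dual
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, None)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                     # any index with alpha_i > 0
b = y[sv] - w @ X[sv]                          # constraint tight at a support vector
print("alpha =", alpha.round(3))
print("pred  =", np.sign(X @ w + b))
```

Only the two points nearest the boundary end up with \(\alpha_i > 0\); the other multipliers vanish.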

3. KKT Conditions for SVMs

The KKT complementary slackness conditions reveal which points are support vectors:

\[ \alpha_i\Big[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1\Big] = 0 \quad \forall i \]

This means either \(\alpha_i = 0\) (the point is not a support vector and lies strictly beyond the margin, with \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1\)) or \(y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1\) (the point lies exactly on the margin: a support vector with \(\alpha_i > 0\)). The weight vector is therefore a sparse combination of only the support vectors:

\[ \mathbf{w} = \sum_{i \in \mathcal{S}} \alpha_i y_i \mathbf{x}_i \quad \text{where } \mathcal{S} = \{i : \alpha_i > 0\} \]
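These conditions can be checked numerically on a hand-solved toy instance (the data, multipliers, and solution below are constructed for illustration; the boundary is \(x_1 = 1\)):

```python
# Verifying KKT complementary slackness on a hand-solved hard-margin SVM.
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w, b = np.array([1.0, 0.0]), -1.0          # known optimum for this data
alpha = np.array([0.5, 0.5, 0.0])          # third point is inactive

margins = y * (X @ w + b)
slack = alpha * (margins - 1.0)            # complementary slackness terms
print("margins:", margins, "slackness:", slack)

# alpha_i > 0  <=>  point sits exactly on the margin
assert np.allclose(slack, 0.0)
S = alpha > 0
assert np.allclose(margins[S], 1.0) and np.all(margins[~S] > 1.0)
# w is recovered from the support vectors alone
assert np.allclose((alpha[S] * y[S]) @ X[S], w)
```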

4. Soft Margin SVM (Non-Separable Data)

For non-separable data, introduce slack variables \(\xi_i \geq 0\) that allow misclassification at a cost:

\[ y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]
\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i+b) \geq 1 - \xi_i,\; \xi_i \geq 0 \]

The hyperparameter \(C\) controls the tradeoff: large \(C\) penalises violations heavily (hard margin), small \(C\) allows more violations (wide margin). The dual has the same form as before but with box constraints \(0 \leq \alpha_i \leq C\).

The soft-margin objective \(\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1-y_i(\mathbf{w}^\top\mathbf{x}_i+b))\) is known as the hinge loss formulation.
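The unconstrained objective translates directly into code. The sketch below (helper names are ours) evaluates the objective and one subgradient of it: margin violators contribute \(-C\,y_i\mathbf{x}_i\) to the gradient in \(\mathbf{w}\), everything else contributes only the regulariser.

```python
# Hinge-loss objective and a subgradient of it (minimal sketch).
import numpy as np

def hinge_objective(w, b, X, y, C):
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

def hinge_subgradient(w, b, X, y, C):
    viol = y * (X @ w + b) < 1.0                 # points violating the margin
    gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    gb = -C * y[viol].sum()
    return gw, gb

X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w, b = np.array([1.0, 0.0]), -1.0
# all margins >= 1 here, so the loss term is 0 and the objective is ||w||^2/2
print(hinge_objective(w, b, X, y, C=1.0))
```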

5. The Kernel Trick

The dual objective only involves inner products \(\mathbf{x}_i^\top\mathbf{x}_j\). We can replace these with a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)\) that implicitly computes inner products in a high-dimensional (or infinite-dimensional) feature space \(\phi\):

\[ \max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \]

Common kernels:

  • Linear: \(K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}'\)
  • RBF (Gaussian): \(K(\mathbf{x},\mathbf{x}') = \exp\!\left(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\right)\), with \(\gamma = 1/(2\sigma^2)\) for bandwidth \(\sigma\)
  • Polynomial: \(K(\mathbf{x},\mathbf{x}') = (\mathbf{x}^\top\mathbf{x}' + c)^d\)

The RBF kernel corresponds to an infinite-dimensional feature map (the Gaussian RKHS), allowing SVMs to learn arbitrarily complex boundaries. By Mercer's theorem, any symmetric positive semi-definite function is a valid kernel.
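Mercer's condition can be checked empirically: the Gram matrix built from a valid kernel should have no significantly negative eigenvalues. A small sketch (data and parameters arbitrary):

```python
# PSD check of the three kernels above on random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def rbf(X, gamma=0.5):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

mins = []
for name, K in [("linear", X @ X.T),
                ("rbf", rbf(X)),
                ("poly", (X @ X.T + 1.0) ** 3)]:
    lam_min = np.linalg.eigvalsh(K).min()      # smallest eigenvalue
    mins.append(lam_min)
    print(f"{name}: smallest eigenvalue = {lam_min:.2e}")
```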

6. SMO Algorithm Overview

The Sequential Minimal Optimisation (SMO) algorithm solves the SVM dual efficiently by decomposing it into the smallest possible sub-problems. At each iteration it selects two multipliers \(\alpha_i, \alpha_j\) and optimises them analytically while keeping all others fixed. The two-variable constraint \(\alpha_i y_i + \alpha_j y_j = \text{const}\) (from \(\sum_i \alpha_i y_i = 0\)) makes this a 1D problem with an analytic solution:

\[ \alpha_j^{\mathrm{new}} = \alpha_j^{\mathrm{old}} + \frac{y_j(E_i - E_j)}{\eta}, \quad \eta = K_{ii} + K_{jj} - 2K_{ij} \]

where \(E_i = f(\mathbf{x}_i) - y_i\) is the error of the current real-valued decision function \(f(\mathbf{x}) = \sum_k \alpha_k y_k K(\mathbf{x}_k, \mathbf{x}) + b\). The new \(\alpha_j\) is clipped to an interval \([L, H] \subseteq [0, C]\) determined by the box constraints together with \(\sum_i \alpha_i y_i = 0\), and \(\alpha_i\) is then adjusted to keep that equality satisfied. SMO needs only \(O(n)\) memory and empirically converges in \(O(n^2)\) to \(O(n^3)\) time.
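The inner step can be sketched in isolation (this is only the analytic pair update, not Platt's full algorithm with its heuristic pair selection and threshold update; the helper name is ours):

```python
# One analytic SMO pair update on (alpha_i, alpha_j), linear kernel.
import numpy as np

def smo_pair_update(alpha, i, j, X, y, b, C):
    K = X @ X.T                                   # kernel (Gram) matrix
    E = (alpha * y) @ K + b - y                   # errors of current decision fn
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0:                                  # degenerate pair: skip
        return alpha
    # interval [L, H] allowed by the box [0, C] and sum_i alpha_i y_i = 0
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    a_j = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)   # preserve the equality
    out = alpha.copy()
    out[i], out[j] = a_i, a_j
    return out

X = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
new = smo_pair_update(np.zeros(3), 0, 1, X, y, b=0.0, C=1.0)
print("alpha after one update:", new)
```

After the update, the equality constraint and the box constraints still hold, which is the invariant SMO maintains at every iteration.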

Python: SVM, Linear vs RBF Kernel

We implement a linear SVM via subgradient descent from scratch, compare it to a scikit-learn RBF kernel SVM on a non-linearly separable dataset, and visualise decision boundaries, margins, and support vectors.
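A condensed, text-only sketch of the linear half is below (the dataset, hyperparameters, and function name are invented here; the full script additionally plots boundaries, margins, and support vectors, and fits `sklearn.svm.SVC(kernel="rbf")` for the non-linear comparison):

```python
# From-scratch linear SVM: subgradient descent on the soft-margin
# (hinge-loss) objective. Minimal sketch, no plotting.
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1.0              # current margin violators
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

# two well-separated Gaussian blobs (toy data for the linear case)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.6, size=(40, 2)),   # class -1
               rng.normal(+2.0, 0.6, size=(40, 2))])  # class +1
y = np.array([-1.0] * 40 + [1.0] * 40)

w, b = train_linear_svm(X, y)
acc = float((np.sign(X @ w + b) == y).mean())
print("train accuracy:", acc)
```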


Key Takeaways

  • ✓ SVM maximises margin \(2/\|\mathbf{w}\|\) by minimising \(\|\mathbf{w}\|^2/2\) subject to \(y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1\).
  • ✓ The Lagrangian dual involves only inner products, giving: \(\max_\alpha \sum\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\mathbf{x}_i^\top\mathbf{x}_j\).
  • ✓ KKT complementary slackness identifies support vectors: those with \(\alpha_i > 0\) lying exactly on the margin.
  • ✓ Soft margin adds slack variables \(\xi_i\) and a cost \(C\) to handle non-separable data via hinge loss.
  • ✓ The kernel trick replaces \(\mathbf{x}_i^\top\mathbf{x}_j\) with \(K(\mathbf{x}_i,\mathbf{x}_j)\), enabling non-linear boundaries with RBF and polynomial kernels.