Part III · Chapter 9

Convolutional Neural Networks

Convolutional networks exploit the spatial structure of images through local connectivity and parameter sharing. This chapter derives the convolution operation rigorously, analyses translation equivariance, and traces the architectural evolution from LeNet to EfficientNet.

1. Discrete Convolution

The 1D discrete convolution of signal \(x\) with filter (kernel) \(h\) is:

\[ (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k] \]

In deep learning, we commonly use cross-correlation (the filter is not flipped), but the operation is still called convolution by convention:

\[ (x \star h)[n] = \sum_{k=0}^{K-1} x[n+k]\,h[k] \]
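
As a sanity check, the valid-mode cross-correlation above can be implemented directly and compared against NumPy's built-in `np.correlate` (a minimal sketch; `xcorr1d` is our own name):

```python
import numpy as np

def xcorr1d(x, h):
    """Valid-mode 1D cross-correlation: y[n] = sum_k x[n+k] h[k]."""
    K = len(h)
    N = len(x) - K + 1
    return np.array([np.dot(x[n:n + K], h) for n in range(N)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, 0.0, -1.0])          # differencing filter: x[n] - x[n+2]
print(xcorr1d(x, h))                     # [-2. -2. -2.]
print(np.correlate(x, h, mode="valid"))  # NumPy agrees
```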

2D Convolution for Images

For a 2D input \(\mathbf{X} \in \mathbb{R}^{H \times W}\) and kernel \(\mathbf{K} \in \mathbb{R}^{K_h \times K_w}\):

\[ \mathbf{Y}[i,j] = \sum_{m=0}^{K_h-1}\sum_{n=0}^{K_w-1} \mathbf{X}[i \cdot s + m,\; j \cdot s + n]\;\mathbf{K}[m,n] \]

where \(s\) is the stride. With zero-padding \(p\), the output size is:

\[ H_\mathrm{out} = \left\lfloor \frac{H + 2p - K_h}{s} \right\rfloor + 1, \qquad W_\mathrm{out} = \left\lfloor \frac{W + 2p - K_w}{s} \right\rfloor + 1 \]

Common choices: valid padding (\(p=0\), the output shrinks) and same padding (\(p = (K-1)/2\) for odd \(K\) and \(s=1\), output size preserved).
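
The output-size formula is easy to wrap in a small helper (a sketch; `conv_out_size` is our own name):

```python
def conv_out_size(n, k, s=1, p=0):
    """Output length along one spatial axis: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

# Valid padding shrinks the map; same padding (p=(k-1)//2, s=1) preserves it.
print(conv_out_size(32, 3))            # 30  (valid)
print(conv_out_size(32, 3, p=1))       # 32  (same, odd kernel)
print(conv_out_size(32, 3, s=2, p=1))  # 16  (strided)
```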

2. Convolution Operation Illustrated

[Figure: a 5×5 input convolved (valid padding, stride 1) with the 3×3 Sobel-x kernel.

    Input:         Kernel (Sobel-x):
    2 1 0 3 1       −1 0 1
    1 3 2 1 0       −2 0 2
    0 2 1 3 2       −1 0 1
    3 1 0 2 1
    1 0 3 1 2

    Output element (1,1): (−1)·3 + (0)·2 + (1)·1 + (−2)·2 + (0)·1 + (2)·3 + (−1)·1 + (0)·0 + (1)·2 = 1]

Applying a 3×3 vertical-edge (Sobel-x) kernel to a 5×5 input (valid padding, stride 1) yields a 3×3 feature map; each output element sees a 3×3 receptive field of the input.

3. Parameter Sharing & Translation Equivariance

In a fully connected layer mapping an \(H \times W\) input to an \(H' \times W'\) output, the weight matrix has \(H \cdot W \cdot H' \cdot W'\) parameters. A convolutional layer with \(C\) filters of size \(K_h \times K_w\) has only \(C \cdot K_h \cdot K_w\) parameters (for a single input channel, ignoring biases), independent of the spatial size. The same kernel is applied at every spatial location: this is parameter sharing.
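
Plugging in illustrative numbers (a 32×32 single-channel input and 16 filters, both our own assumptions) makes the gap concrete:

```python
H = W = 32           # input spatial size (single channel, illustrative)
Hp = Wp = 32         # fully connected output kept at the same spatial size
C, Kh, Kw = 16, 3, 3 # 16 filters of size 3x3

fc_params = H * W * Hp * Wp   # one dense weight per (input, output) pair
conv_params = C * Kh * Kw     # C shared kernels, biases omitted
print(fc_params, conv_params) # 1048576 vs 144
```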

Translation Equivariance

A function \(f\) is equivariant to translation \(T_\delta\) if:

\[ f(T_\delta[\mathbf{X}]) = T_\delta[f(\mathbf{X})] \]

Convolution satisfies this: shifting the input shifts the feature map by the same amount. This is why CNNs detect features wherever they appear in the image. Note: max-pooling introduces approximate translation invariance (small shifts don't change the output).
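
With valid padding the identity holds only up to boundary effects; with circular (wrap-around) padding it is exact, which the following NumPy sketch verifies (the sizes and the shift are arbitrary choices of ours):

```python
import numpy as np

def circ_xcorr2d(X, K):
    """Circular 2D cross-correlation (wrap-around padding): output size
    equals input size, and translation equivariance holds exactly."""
    Y = np.zeros_like(X, dtype=float)
    for m in range(K.shape[0]):
        for n in range(K.shape[1]):
            Y += K[m, n] * np.roll(X, shift=(-m, -n), axis=(0, 1))
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
K = rng.standard_normal((3, 3))
shift = (2, 1)

lhs = circ_xcorr2d(np.roll(X, shift, axis=(0, 1)), K)  # f(T[X])
rhs = np.roll(circ_xcorr2d(X, K), shift, axis=(0, 1))  # T[f(X)]
print(np.allclose(lhs, rhs))  # True
```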

Pooling Layers

Pooling reduces spatial dimensions, providing spatial compression and limited translation invariance.

Max Pooling

\[ y[i,j] = \max_{m,n \in \mathcal{R}(i,j)} x[m,n] \]

Selects the most activated feature in each region. Gradient flows only to the maximally activated unit.

Average Pooling

\[ y[i,j] = \frac{1}{|\mathcal{R}|}\sum_{m,n \in \mathcal{R}(i,j)} x[m,n] \]

Smoother; gradient distributes uniformly. Global average pooling replaces flatten + FC layers in modern architectures.
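
Both pooling operations fit in a few lines of NumPy (a sketch; the function name is ours):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over size x size windows with the given stride."""
    Ho = (x.shape[0] - size) // stride + 1
    Wo = (x.shape[1] - size) // stride + 1
    out = np.empty((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
print(x.mean())       # global average pooling: collapse H x W to one value
```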

4. Backprop Through Convolution

Given upstream gradient \(\partial\mathcal{L}/\partial\mathbf{Y}\), we need gradients with respect to both the kernel and the input.

Gradient w.r.t. kernel \(\mathbf{K}\)

\[ \frac{\partial \mathcal{L}}{\partial K[m,n]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial Y[i,j]} \cdot X[i \cdot s + m,\; j \cdot s + n] \]

This is itself a (strided) cross-correlation of the input with the upstream gradient.

Gradient w.r.t. input \(\mathbf{X}\)

\[ \frac{\partial \mathcal{L}}{\partial X[p,q]} = \sum_{m,n} \frac{\partial \mathcal{L}}{\partial Y\!\left[\frac{p-m}{s}, \frac{q-n}{s}\right]} \cdot K[m,n] \]

where the sum runs only over terms whose indices \((p-m)/s\) and \((q-n)/s\) are integers within the output range. This is a full convolution of the upstream gradient with the flipped kernel; equivalently, a transposed convolution (sometimes loosely called deconvolution).
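
The kernel-gradient formula can be checked against central finite differences for a stride-1 example (a sketch; the loss is taken as the inner product of \(\mathbf{Y}\) with a random upstream gradient):

```python
import numpy as np

def xcorr2d(X, K):
    """Valid-mode 2D cross-correlation, stride 1."""
    Kh, Kw = K.shape
    Ho, Wo = X.shape[0] - Kh + 1, X.shape[1] - Kw + 1
    return np.array([[np.sum(X[i:i + Kh, j:j + Kw] * K)
                      for j in range(Wo)] for i in range(Ho)])

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))
dY = rng.standard_normal((3, 3))   # upstream gradient; L = sum(dY * Y)

# Analytic: dL/dK is the cross-correlation of the input with dY.
dK = xcorr2d(X, dY)

# Numerical check via central differences.
eps = 1e-6
dK_num = np.zeros_like(K)
for m in range(3):
    for n in range(3):
        Kp, Km = K.copy(), K.copy()
        Kp[m, n] += eps
        Km[m, n] -= eps
        dK_num[m, n] = (np.sum(dY * xcorr2d(X, Kp)) -
                        np.sum(dY * xcorr2d(X, Km))) / (2 * eps)
print(np.allclose(dK, dK_num, atol=1e-5))  # True
```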

5. Modern CNN Architectures

LeNet-5 (LeCun 1998)

First successful CNN. Two conv layers (5×5, tanh), two avg-pool layers, three FC layers. ~60K params. Digit recognition on MNIST.

AlexNet (Krizhevsky 2012)

Won ImageNet 2012 by a large margin. 5 conv + 3 FC, ReLU activations, Dropout, data augmentation. ~60M params. Launched the deep learning era.

VGGNet (Simonyan & Zisserman 2015)

All 3×3 kernels, deeper (16–19 layers). Key insight: two stacked 3×3 convolutions have the same receptive field as one 5×5 but with fewer parameters and an extra nonlinearity.
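
The parameter claim is one line of arithmetic: for \(C\) input and \(C\) output channels, two stacked 3×3 layers cost \(2 \cdot 3^2 C^2\) weights versus \(5^2 C^2\) for a single 5×5 layer (the channel count below is illustrative):

```python
C = 64                           # channels in and out (illustrative)
two_3x3 = 2 * (3 * 3 * C * C)    # stacked 3x3 convs, 5x5 receptive field
one_5x5 = 5 * 5 * C * C          # single 5x5 conv, same receptive field
print(two_3x3, one_5x5)          # 73728 vs 102400
```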

ResNet (He 2016)

Residual connections allow training 50–152 layers. Won ImageNet 2015. Made BatchNorm a standard component. Still widely used as a backbone.

EfficientNet (Tan & Le 2019)

Compound scaling: jointly scale depth, width, and resolution by a fixed ratio derived from a neural architecture search baseline. State-of-the-art accuracy/efficiency tradeoff.

6. Python: 2D Convolution from Scratch

We implement 2D cross-correlation in pure NumPy (no deep-learning libraries), apply four kernels (horizontal edge, vertical edge, blur, sharpen) to a synthetic image, and visualise the feature maps and max-pooled outputs.

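
The interactive script itself is not reproduced here; a condensed, plot-free sketch of the pipeline it describes (the kernel values and the synthetic image are our own choices) might look like:

```python
import numpy as np

def xcorr2d(X, K):
    """Valid-mode 2D cross-correlation, stride 1."""
    Kh, Kw = K.shape
    Ho, Wo = X.shape[0] - Kh + 1, X.shape[1] - Kw + 1
    return np.array([[np.sum(X[i:i + Kh, j:j + Kw] * K)
                      for j in range(Wo)] for i in range(Ho)])

def max_pool2d(x, size=2):
    """Non-overlapping max pooling via a reshape trick."""
    Ho, Wo = x.shape[0] // size, x.shape[1] // size
    return x[:Ho * size, :Wo * size].reshape(Ho, size, Wo, size).max(axis=(1, 3))

# Synthetic image: white square on a black background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0

kernels = {
    "horizontal edge": np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], float),
    "vertical edge":   np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
    "blur":            np.ones((3, 3)) / 9.0,
    "sharpen":         np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], float),
}

for name, K in kernels.items():
    fmap = xcorr2d(img, K)       # 14x14 feature map (valid padding)
    pooled = max_pool2d(fmap)    # 7x7 after 2x2 max pooling
    print(f"{name:16s} fmap {fmap.shape}, pooled {pooled.shape}, "
          f"max |activation| = {np.abs(fmap).max():.2f}")
```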