Part III · Chapter 9

Convolutional Neural Networks

Convolutional networks exploit the spatial structure of images through local connectivity and parameter sharing. This chapter derives the convolution operation rigorously, analyses translation equivariance, and traces the architectural evolution from LeNet to EfficientNet.

1. Discrete Convolution

The 1D discrete convolution of signal \(x\) with filter (kernel) \(h\) is:

\[ (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k] \]

In deep learning, we commonly use cross-correlation (the filter is not flipped), but the operation is still called convolution by convention:

\[ (x \star h)[n] = \sum_{k=0}^{K-1} x[n+k]\,h[k] \]
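
As a sanity check, the valid-mode cross-correlation above can be implemented directly and compared against NumPy's built-in `np.correlate` (a minimal sketch; `xcorr1d` is our own name):

```python
import numpy as np

def xcorr1d(x, h):
    """Valid-mode 1D cross-correlation: y[n] = sum_k x[n+k] h[k]."""
    K = len(h)
    N = len(x) - K + 1
    return np.array([np.dot(x[n:n + K], h) for n in range(N)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([1.0, 0.0, -1.0])          # differencing filter: x[n] - x[n+2]
print(xcorr1d(x, h))                     # [-2. -2. -2.]
print(np.correlate(x, h, mode="valid"))  # NumPy agrees
```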

2D Convolution for Images

For a 2D input \(\mathbf{X} \in \mathbb{R}^{H \times W}\) and kernel \(\mathbf{K} \in \mathbb{R}^{K_h \times K_w}\):

\[ \mathbf{Y}[i,j] = \sum_{m=0}^{K_h-1}\sum_{n=0}^{K_w-1} \mathbf{X}[i \cdot s + m,\; j \cdot s + n]\;\mathbf{K}[m,n] \]

where \(s\) is the stride. With zero-padding \(p\), the output size is:

\[ H_\mathrm{out} = \left\lfloor \frac{H + 2p - K_h}{s} \right\rfloor + 1, \qquad W_\mathrm{out} = \left\lfloor \frac{W + 2p - K_w}{s} \right\rfloor + 1 \]

Common choices: valid padding (\(p=0\), the output shrinks) and same padding (\(p = (K-1)/2\) for odd \(K\) and \(s=1\), output size preserved).
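
The output-size formula is easy to wrap in a small helper (a sketch; `conv_out_size` is our own name):

```python
def conv_out_size(n, k, s=1, p=0):
    """Output length along one spatial axis: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

# Valid padding shrinks the map; same padding (p=(k-1)//2, s=1) preserves it.
print(conv_out_size(32, 3))            # 30  (valid)
print(conv_out_size(32, 3, p=1))       # 32  (same, odd kernel)
print(conv_out_size(32, 3, s=2, p=1))  # 16  (strided)
```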

2. Convolution Operation Illustrated

[Figure: a 5×5 input convolved (valid padding, stride 1) with the 3×3 Sobel-x kernel.

    Input:         Kernel (Sobel-x):
    2 1 0 3 1       −1 0 1
    1 3 2 1 0       −2 0 2
    0 2 1 3 2       −1 0 1
    3 1 0 2 1
    1 0 3 1 2

    Output element (1,1): (−1)·3 + (0)·2 + (1)·1 + (−2)·2 + (0)·1 + (2)·3 + (−1)·1 + (0)·0 + (1)·2 = 1]

Applying a 3×3 vertical-edge (Sobel-x) kernel to a 5×5 input (valid padding, stride 1) yields a 3×3 feature map; each output element sees a 3×3 receptive field of the input.

3. Parameter Sharing & Translation Equivariance

In a fully connected layer mapping an \(H \times W\) input to an \(H' \times W'\) output, the weight matrix has \(H \cdot W \cdot H' \cdot W'\) parameters. A convolutional layer with \(C\) filters of size \(K_h \times K_w\) has only \(C \cdot K_h \cdot K_w\) parameters (for a single input channel, ignoring biases), independent of the spatial size. The same kernel is applied at every spatial location: this is parameter sharing.
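
Plugging in illustrative numbers (a 32×32 single-channel input and 16 filters, both our own assumptions) makes the gap concrete:

```python
H = W = 32           # input spatial size (single channel, illustrative)
Hp = Wp = 32         # fully connected output kept at the same spatial size
C, Kh, Kw = 16, 3, 3 # 16 filters of size 3x3

fc_params = H * W * Hp * Wp   # one dense weight per (input, output) pair
conv_params = C * Kh * Kw     # C shared kernels, biases omitted
print(fc_params, conv_params) # 1048576 vs 144
```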

Translation Equivariance

A function \(f\) is equivariant to translation \(T_\delta\) if:

\[ f(T_\delta[\mathbf{X}]) = T_\delta[f(\mathbf{X})] \]

Convolution satisfies this: shifting the input shifts the feature map by the same amount. This is why CNNs detect features wherever they appear in the image. Note: max-pooling introduces approximate translation invariance (small shifts don't change the output).
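
With valid padding the identity holds only up to boundary effects; with circular (wrap-around) padding it is exact, which the following NumPy sketch verifies (the sizes and the shift are arbitrary choices of ours):

```python
import numpy as np

def circ_xcorr2d(X, K):
    """Circular 2D cross-correlation (wrap-around padding): output size
    equals input size, and translation equivariance holds exactly."""
    Y = np.zeros_like(X, dtype=float)
    for m in range(K.shape[0]):
        for n in range(K.shape[1]):
            Y += K[m, n] * np.roll(X, shift=(-m, -n), axis=(0, 1))
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
K = rng.standard_normal((3, 3))
shift = (2, 1)

lhs = circ_xcorr2d(np.roll(X, shift, axis=(0, 1)), K)  # f(T[X])
rhs = np.roll(circ_xcorr2d(X, K), shift, axis=(0, 1))  # T[f(X)]
print(np.allclose(lhs, rhs))  # True
```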

Pooling Layers

Pooling reduces spatial dimensions, providing spatial compression and limited translation invariance.

Max Pooling

\[ y[i,j] = \max_{m,n \in \mathcal{R}(i,j)} x[m,n] \]

Selects the most activated feature in each region. Gradient flows only to the maximally activated unit.

Average Pooling

\[ y[i,j] = \frac{1}{|\mathcal{R}|}\sum_{m,n \in \mathcal{R}(i,j)} x[m,n] \]

Smoother; gradient distributes uniformly. Global average pooling replaces flatten + FC layers in modern architectures.
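
Both pooling operations fit in a few lines of NumPy (a sketch; the function name is ours):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over size x size windows with the given stride."""
    Ho = (x.shape[0] - size) // stride + 1
    Wo = (x.shape[1] - size) // stride + 1
    out = np.empty((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
print(x.mean())       # global average pooling: collapse H x W to one value
```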

4. Backprop Through Convolution

Given upstream gradient \(\partial\mathcal{L}/\partial\mathbf{Y}\), we need gradients with respect to both the kernel and the input.

Gradient w.r.t. kernel \(\mathbf{K}\)

\[ \frac{\partial \mathcal{L}}{\partial K[m,n]} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial Y[i,j]} \cdot X[i \cdot s + m,\; j \cdot s + n] \]

This is itself a (strided) cross-correlation of the input with the upstream gradient.

Gradient w.r.t. input \(\mathbf{X}\)

\[ \frac{\partial \mathcal{L}}{\partial X[p,q]} = \sum_{m,n} \frac{\partial \mathcal{L}}{\partial Y\!\left[\frac{p-m}{s}, \frac{q-n}{s}\right]} \cdot K[m,n] \]

where the sum runs only over terms whose indices \((p-m)/s\) and \((q-n)/s\) are integers within the output range. This is a full convolution of the upstream gradient with the flipped kernel; equivalently, a transposed convolution (sometimes loosely called deconvolution).
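
The kernel-gradient formula can be checked against central finite differences for a stride-1 example (a sketch; the loss is taken as the inner product of \(\mathbf{Y}\) with a random upstream gradient):

```python
import numpy as np

def xcorr2d(X, K):
    """Valid-mode 2D cross-correlation, stride 1."""
    Kh, Kw = K.shape
    Ho, Wo = X.shape[0] - Kh + 1, X.shape[1] - Kw + 1
    return np.array([[np.sum(X[i:i + Kh, j:j + Kw] * K)
                      for j in range(Wo)] for i in range(Ho)])

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 5))
K = rng.standard_normal((3, 3))
dY = rng.standard_normal((3, 3))   # upstream gradient; L = sum(dY * Y)

# Analytic: dL/dK is the cross-correlation of the input with dY.
dK = xcorr2d(X, dY)

# Numerical check via central differences.
eps = 1e-6
dK_num = np.zeros_like(K)
for m in range(3):
    for n in range(3):
        Kp, Km = K.copy(), K.copy()
        Kp[m, n] += eps
        Km[m, n] -= eps
        dK_num[m, n] = (np.sum(dY * xcorr2d(X, Kp)) -
                        np.sum(dY * xcorr2d(X, Km))) / (2 * eps)
print(np.allclose(dK, dK_num, atol=1e-5))  # True
```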

5. Modern CNN Architectures

LeNet-5 (LeCun 1998)

First successful CNN. Two conv layers (5×5, tanh), two avg-pool layers, three FC layers. ~60K params. Digit recognition on MNIST.

AlexNet (Krizhevsky 2012)

Won ImageNet 2012 by a large margin. 5 conv + 3 FC, ReLU activations, Dropout, data augmentation. ~60M params. Launched the deep learning era.

VGGNet (Simonyan & Zisserman 2015)

All 3×3 kernels, deeper (16–19 layers). Key insight: two stacked 3×3 convolutions have the same receptive field as one 5×5 but with fewer parameters and an extra nonlinearity.
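
The parameter claim is one line of arithmetic: for \(C\) input and \(C\) output channels, two stacked 3×3 layers cost \(2 \cdot 3^2 C^2\) weights versus \(5^2 C^2\) for a single 5×5 layer (the channel count below is illustrative):

```python
C = 64                           # channels in and out (illustrative)
two_3x3 = 2 * (3 * 3 * C * C)    # stacked 3x3 convs, 5x5 receptive field
one_5x5 = 5 * 5 * C * C          # single 5x5 conv, same receptive field
print(two_3x3, one_5x5)          # 73728 vs 102400
```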

ResNet (He 2016)

Residual connections allow training 50–152 layers. Won ImageNet 2015. Made BatchNorm a standard component. Still widely used as a backbone.

EfficientNet (Tan & Le 2019)

Compound scaling: jointly scale depth, width, and resolution by a fixed ratio derived from a neural architecture search baseline. State-of-the-art accuracy/efficiency tradeoff.

6. Python: 2D Convolution from Scratch

We implement 2D cross-correlation in pure NumPy (no deep-learning libraries), apply four kernels (horizontal edge, vertical edge, blur, sharpen) to a synthetic image, and visualise the feature maps and max-pooled outputs.

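
The interactive script itself is not reproduced here; a condensed, plot-free sketch of the pipeline it describes (the kernel values and the synthetic image are our own choices) might look like:

```python
import numpy as np

def xcorr2d(X, K):
    """Valid-mode 2D cross-correlation, stride 1."""
    Kh, Kw = K.shape
    Ho, Wo = X.shape[0] - Kh + 1, X.shape[1] - Kw + 1
    return np.array([[np.sum(X[i:i + Kh, j:j + Kw] * K)
                      for j in range(Wo)] for i in range(Ho)])

def max_pool2d(x, size=2):
    """Non-overlapping max pooling via a reshape trick."""
    Ho, Wo = x.shape[0] // size, x.shape[1] // size
    return x[:Ho * size, :Wo * size].reshape(Ho, size, Wo, size).max(axis=(1, 3))

# Synthetic image: white square on a black background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0

kernels = {
    "horizontal edge": np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], float),
    "vertical edge":   np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
    "blur":            np.ones((3, 3)) / 9.0,
    "sharpen":         np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], float),
}

for name, K in kernels.items():
    fmap = xcorr2d(img, K)       # 14x14 feature map (valid padding)
    pooled = max_pool2d(fmap)    # 7x7 after 2x2 max pooling
    print(f"{name:16s} fmap {fmap.shape}, pooled {pooled.shape}, "
          f"max |activation| = {np.abs(fmap).max():.2f}")
```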