ML for Science/Part II: Deep Learning

CNNs for Image & Signal Data

Exploiting translation equivariance — from convolution theory to ResNets and scientific imaging

Introduction

Convolutional Neural Networks (CNNs) are the dominant architecture for processing spatially structured data: images, spectra, time series, and fields on regular grids. Their key inductive bias — translation equivariance — means that the same features are detected regardless of spatial position, dramatically reducing the number of parameters compared to fully connected networks.

In science, CNNs have revolutionized analysis of microscopy images, astronomical surveys, particle detector readouts, and spectroscopic data. Understanding the mathematical foundations of convolution and pooling operations is essential for designing architectures tailored to scientific problems.

Key Topics

  • 1. The Convolution Operation
  • 2. Output Dimension Formula
  • 3. Pooling Operations
  • 4. Classic Architectures: LeNet to ResNet
  • 5. Transfer Learning
  • 6. Scientific Applications

1. The Convolution Operation

Continuous Convolution

In signal processing, the convolution of functions $f$ and $g$ is:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau)\,d\tau$$

This operation is commutative ($f*g = g*f$), associative ($(f*g)*h = f*(g*h)$), and distributive over addition. In the Fourier domain, convolution becomes multiplication: $\mathcal{F}(f*g) = \mathcal{F}(f)\cdot\mathcal{F}(g)$.

Discrete 2D Convolution (Cross-Correlation)

In CNNs, we actually compute cross-correlation (convolution without flipping the kernel). For input $\mathbf{I} \in \mathbb{R}^{H \times W}$ and kernel $\mathbf{K} \in \mathbb{R}^{k_H \times k_W}$:

$$\boxed{(\mathbf{I} \star \mathbf{K})_{ij} = \sum_{m=0}^{k_H-1}\sum_{n=0}^{k_W-1} I_{i+m,\, j+n} \cdot K_{m,n}}$$

Each output pixel is the inner product of a local patch of the input with the kernel. Since the kernel weights are shared across all spatial positions, CNNs have far fewer parameters than fully connected layers.
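The formula translates directly into code. A minimal NumPy sketch of "valid" cross-correlation, written with explicit loops for clarity rather than speed:

```python
import numpy as np

def cross_correlate2d(I, K):
    """'Valid' 2D cross-correlation: slide K over I without flipping it."""
    H, W = I.shape
    kH, kW = K.shape
    out = np.empty((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel is the inner product of a patch with the kernel
            out[i, j] = np.sum(I[i:i + kH, j:j + kW] * K)
    return out

I = np.arange(25.0).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(I, K).shape)  # (4, 4)
```

For this particular input, every output entry equals $I_{i,j} - I_{i+1,j+1} = -6$, which is an easy sanity check on the implementation.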

Multi-Channel Convolution

For $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels, the kernel is a 4D tensor $\mathbf{K} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times k_H \times k_W}$:

$$\text{out}_{c,i,j} = b_c + \sum_{c'=0}^{C_{\text{in}}-1}\sum_{m=0}^{k_H-1}\sum_{n=0}^{k_W-1} K_{c,c',m,n} \cdot I_{c',i+m,j+n}$$

The number of parameters per convolutional layer is $C_{\text{out}} \times (C_{\text{in}} \times k_H \times k_W + 1)$.

Translation Equivariance

Let $T_{\mathbf{a}}$ denote translation by vector $\mathbf{a}$. Convolution is equivariant:

$$(T_{\mathbf{a}}\mathbf{I}) \star \mathbf{K} = T_{\mathbf{a}}(\mathbf{I} \star \mathbf{K})$$

A feature detected at position $(i,j)$ in the input produces the same response at position $(i,j)$ in the output. This is a form of weight sharing that encodes the prior belief that the same local patterns can appear anywhere in the input.
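The identity can be checked numerically. The sketch below uses circular (wrap-around) boundary conditions, where equivariance under circular shifts holds exactly; with zero padding it holds only away from the borders:

```python
import numpy as np

def circ_xcorr2d(I, K):
    """Cross-correlation with circular boundaries; output has the input's size."""
    H, W = I.shape
    kH, kW = K.shape
    Ip = np.pad(I, ((0, kH - 1), (0, kW - 1)), mode="wrap")
    return np.array([[np.sum(Ip[i:i + kH, j:j + kW] * K)
                      for j in range(W)] for i in range(H)])

rng = np.random.default_rng(1)
I = rng.standard_normal((8, 8))
K = rng.standard_normal((3, 3))
shift = (1, 2)

# Translate then convolve ...
lhs = circ_xcorr2d(np.roll(I, shift, axis=(0, 1)), K)
# ... equals convolve then translate:
rhs = np.roll(circ_xcorr2d(I, K), shift, axis=(0, 1))
print(np.allclose(lhs, rhs))  # True
```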

2. Output Dimension Formula

Derivation

For an input of size $H$, kernel size $k$, stride $s$, and padding $p$:

$$\boxed{H_{\text{out}} = \left\lfloor\frac{H + 2p - k}{s}\right\rfloor + 1}$$

Derivation: The first valid position of the kernel is at index $0$ (after padding). The last valid position is at index $H + 2p - k$. With stride $s$, the number of valid positions is $\lfloor(H + 2p - k)/s\rfloor + 1$.

Special cases:

  • "Valid" convolution ($p=0, s=1$): $H_{\text{out}} = H - k + 1$ (output shrinks)
  • "Same" convolution ($p=\lfloor k/2\rfloor, s=1$): $H_{\text{out}} = H$ (output same size)
  • Strided ($s=2, p=\lfloor k/2\rfloor$): $H_{\text{out}} \approx H/2$ (downsampling)
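The formula and its special cases are easy to encode and check directly:

```python
def conv_out_dim(H, k, s=1, p=0):
    """Output size: floor((H + 2p - k) / s) + 1."""
    return (H + 2 * p - k) // s + 1

print(conv_out_dim(32, 3))            # "valid":   30
print(conv_out_dim(32, 3, p=1))       # "same":    32
print(conv_out_dim(32, 3, s=2, p=1))  # strided:   16
```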

Receptive Field

The receptive field of a neuron in layer $L$ is the region in the input that influences its value. For $L$ layers each with kernel size $k$ and stride 1:

$$\text{RF} = 1 + L(k-1)$$

The receptive field grows linearly with depth. With $3 \times 3$ kernels, each layer adds 2 pixels to the receptive field. Deeper networks "see" larger spatial contexts.

Dilated (Atrous) Convolution

Dilated convolution inserts gaps in the kernel with dilation rate $d$:

$$H_{\text{out}} = \left\lfloor\frac{H + 2p - d(k-1) - 1}{s}\right\rfloor + 1$$

The effective kernel size becomes $k_{\text{eff}} = d(k-1) + 1$, allowing large receptive fields without increasing the number of parameters.
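Substituting the effective kernel size into the output-dimension formula gives a small helper that covers the dilated case:

```python
def dilated_out_dim(H, k, s=1, p=0, d=1):
    """Output size for dilated convolution; d=1 recovers the standard formula."""
    k_eff = d * (k - 1) + 1  # effective kernel size
    return (H + 2 * p - k_eff) // s + 1

# A 3x3 kernel with dilation 2 covers a 5x5 window:
print(dilated_out_dim(32, 3, d=2))       # 32 - 5 + 1 = 28
print(dilated_out_dim(32, 3, p=2, d=2))  # "same" padding p = d(k-1)/2: 32
```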

3. Pooling Operations

Pooling reduces spatial dimensions, providing a form of translation invariance (not just equivariance) and controlling the number of parameters in subsequent layers.

Max Pooling

For a pooling window of size $k \times k$ with stride $s$:

$$\text{MaxPool}_{ij} = \max_{0 \leq m,n < k} I_{si+m, sj+n}$$

Max pooling selects the strongest activation in each window, providing invariance to small translations (up to $k-1$ pixels).

Average Pooling

$$\text{AvgPool}_{ij} = \frac{1}{k^2}\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I_{si+m, sj+n}$$

Average pooling is a low-pass filter that smooths the feature map. Global Average Pooling (GAP) averages over the entire spatial extent, producing a single value per channel — commonly used before the final classification layer.
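Both pooling operations can be written compactly with a reshape trick, assuming non-overlapping windows (stride equal to the window size) and dimensions divisible by $k$:

```python
import numpy as np

def pool2d(I, k, mode="max"):
    """Non-overlapping k x k pooling with stride k; H, W must be divisible by k."""
    H, W = I.shape
    # patches[a, m, b, n] == I[k*a + m, k*b + n]
    patches = I.reshape(H // k, k, W // k, k)
    return patches.max(axis=(1, 3)) if mode == "max" else patches.mean(axis=(1, 3))

I = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
print(pool2d(I, 2, "max"))  # [[4. 8.] [4. 1.]]
print(pool2d(I, 2, "avg"))  # [[2.5 6.5] [1.  1. ]]
```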

Backprop Through Pooling

Max pooling: Gradient flows only through the maximum element (winner-take-all):

$$\frac{\partial \text{MaxPool}}{\partial I_{si+m,sj+n}} = \begin{cases} 1 & \text{if } (m,n) = \arg\max \\ 0 & \text{otherwise} \end{cases}$$

Average pooling: Gradient is distributed equally to all elements in the window.
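The winner-take-all rule for max pooling can be sketched directly: each upstream gradient is routed to the argmax position of its window, and all other positions receive zero:

```python
import numpy as np

def maxpool_backward(I, grad_out, k):
    """Route each upstream gradient to the argmax of its k x k window."""
    H, W = I.shape
    grad_in = np.zeros_like(I)
    for a in range(H // k):
        for b in range(W // k):
            patch = I[a*k:(a+1)*k, b*k:(b+1)*k]
            m, n = np.unravel_index(np.argmax(patch), patch.shape)
            grad_in[a*k + m, b*k + n] = grad_out[a, b]  # winner takes all
    return grad_in

I = np.array([[1., 2.],
              [3., 0.]])
print(maxpool_backward(I, np.array([[5.]]), 2))
# [[0. 0.]
#  [5. 0.]]
```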

4. Classic Architectures

LeNet-5 (LeCun, 1998)

The original CNN for digit recognition. Architecture: Conv(6, 5x5) → Pool(2x2) → Conv(16, 5x5) → Pool(2x2) → FC(120) → FC(84) → FC(10).

~60K parameters. Established the Conv-Pool-Conv-Pool-FC pattern that dominated for 15 years.

VGGNet (2014)

Key insight: small 3x3 kernels stacked deeply. Two 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters and more nonlinearity:

$$\text{params}(5\times5) = 25C^2 \quad \text{vs} \quad \text{params}(2 \times 3\times3) = 18C^2$$

VGG-16 uses 16 weight layers with 138M parameters. Showed that depth matters.

ResNet (He et al., 2015)

The fundamental breakthrough that enabled very deep networks (100+ layers). The residual connection adds the input to the output of a block:

$$\boxed{\mathbf{h}^{(\ell+1)} = \phi\left(\mathbf{h}^{(\ell)} + \mathcal{F}(\mathbf{h}^{(\ell)}; \theta_\ell)\right)}$$

where $\mathcal{F}$ is typically two Conv-BN-ReLU layers. The gradient through the skip connection is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(\ell)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}}\prod_{k=\ell}^{L-1}\left(\mathbf{I} + \frac{\partial \mathcal{F}_k}{\partial \mathbf{h}^{(k)}}\right)$$

The identity term $\mathbf{I}$ ensures gradients can flow directly from loss to early layers, solving the vanishing gradient problem.
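The effect of the identity term can be illustrated with a toy numerical experiment: multiply together random per-layer Jacobians $\partial\mathcal{F}_k/\partial\mathbf{h}^{(k)}$ with and without the added $\mathbf{I}$ (a hypothetical setup that ignores normalization layers):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 50, 8
# Small random Jacobians standing in for dF_k/dh^(k)
Js = [0.1 * rng.standard_normal((d, d)) for _ in range(L)]

plain = np.eye(d)  # product of J_k alone (no skip connections)
resid = np.eye(d)  # product of (I + J_k) (with skip connections)
for J in Js:
    plain = J @ plain
    resid = (np.eye(d) + J) @ resid

print(np.linalg.norm(plain))  # vanishes toward zero
print(np.linalg.norm(resid))  # stays well away from zero
```

Without the identity term the gradient norm decays geometrically with depth; with it, the signal survives all 50 layers.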

ResNet as an ODE

In the limit of many layers with small updates, ResNet approximates an ODE:

$$\frac{d\mathbf{h}(t)}{dt} = \mathcal{F}(\mathbf{h}(t), t; \theta(t))$$

This connection to Neural ODEs (Chen et al., 2018) provides a continuous-depth interpretation and enables adaptive computation.

5. Transfer Learning

Transfer learning leverages features learned on large datasets (e.g., ImageNet) for new tasks with limited data — a critical technique in scientific applications where labeled data is scarce.

Feature Hierarchy

CNNs learn increasingly abstract features at deeper layers:

  • Layer 1: Edges, gradients, color patches — universal low-level features
  • Layers 2-3: Textures, corners, simple shapes — still fairly general
  • Layers 4-5: Object parts, complex textures — becoming task-specific
  • Final layers: Whole objects, scene categories — highly task-specific

Transfer Strategies

  • Feature extraction: Freeze all pretrained layers; train only new classification head
  • Fine-tuning: Unfreeze later layers; train with small learning rate
  • Progressive unfreezing: Gradually unfreeze layers from top to bottom

Rule of thumb: the more target data you have, and the more the target domain differs from the pretraining domain, the more layers you should fine-tune.

6. Scientific Applications

Microscopy & Cell Biology

U-Net (Ronneberger et al., 2015) is the standard architecture for cell segmentation. CNNs classify cell types, detect mitotic events, and segment organelles in electron microscopy images. Data augmentation (rotation, elastic deformation) is critical for small datasets.

Particle Physics

CNNs classify particle jet images at the LHC, distinguishing quark vs gluon jets. Detector readouts are naturally image-like (calorimeter hit patterns). 1D CNNs process time-series from scintillator detectors.

Astronomy

Galaxy morphology classification (Galaxy Zoo), gravitational lens detection, transient identification in sky surveys. Transfer learning from ImageNet works surprisingly well for astronomical images.

Medical Imaging

Pathology slide analysis, X-ray/CT classification, retinal disease detection. CNNs now match or exceed specialist radiologists on many diagnostic tasks.

7. Python Simulation: CNN from Scratch

This simulation implements a convolution layer and demonstrates how CNNs process image data, including output dimension verification and feature map computation.

CNN Convolution & Pooling from Scratch

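A minimal version of such a simulation, assembling a single Conv → ReLU → MaxPool stage in plain NumPy (shapes follow the output-dimension formula from Section 2):

```python
import numpy as np

def conv2d_valid(I, K):
    """'Valid' cross-correlation of a 2D input with a 2D kernel."""
    H, W = I.shape
    kH, kW = K.shape
    return np.array([[np.sum(I[i:i + kH, j:j + kW] * K)
                      for j in range(W - kW + 1)] for i in range(H - kH + 1)])

def maxpool2d(I, k):
    """Non-overlapping k x k max pooling (extra rows/cols are dropped)."""
    H, W = I.shape
    return I[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))  # an MNIST-sized input
kernel = rng.standard_normal((3, 3))

feat = conv2d_valid(image, kernel)     # (28 - 3)/1 + 1 = 26
feat = np.maximum(feat, 0.0)           # ReLU nonlinearity
pooled = maxpool2d(feat, 2)            # 26 / 2 = 13

print(feat.shape, pooled.shape)        # (26, 26) (13, 13)
```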

8. Advanced CNN Techniques

1x1 Convolutions (Network-in-Network)

A $1 \times 1$ convolution performs a linear combination across channels at each spatial position:

$$\text{out}_{c, i, j} = \sum_{c'=0}^{C_{\text{in}}-1} K_{c, c'} \cdot I_{c', i, j} + b_c$$

This is equivalent to a fully connected layer applied independently to each spatial position. Uses include:

  • Channel reduction: Reduce $C_{\text{in}}$ to $C_{\text{out}} < C_{\text{in}}$ channels (bottleneck layers in ResNet)
  • Feature mixing: Learn cross-channel interactions
  • Adding nonlinearity: 1x1 conv + ReLU increases the network's expressiveness without changing spatial dimensions
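Because a $1 \times 1$ kernel has no spatial extent, the whole operation reduces to a matrix product over the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 64, 16, 8, 8
I = rng.standard_normal((C_in, H, W))
K = rng.standard_normal((C_out, C_in))  # a 1x1 kernel is just a channel-mixing matrix

# out[c, i, j] = sum_c' K[c, c'] * I[c', i, j]  -- the same linear map at every pixel
out = np.tensordot(K, I, axes=([1], [0]))
print(out.shape)  # (16, 8, 8): channels reduced 64 -> 16, spatial size unchanged
```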

Depthwise Separable Convolutions

Standard convolution has $C_{\text{out}} \times C_{\text{in}} \times k^2$ parameters. Depthwise separable convolutions (MobileNet) decompose this into:

  1. Depthwise: One $k \times k$ filter per input channel ($C_{\text{in}} \times k^2$ params)
  2. Pointwise: A $1 \times 1$ convolution across channels ($C_{\text{out}} \times C_{\text{in}}$ params)

Parameter reduction ratio: $\frac{C_{\text{in}} k^2 + C_{\text{out}} C_{\text{in}}}{C_{\text{out}} C_{\text{in}} k^2} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}$. For $C_{\text{out}} = 64$ and $k = 3$, this is roughly $8\times$ fewer parameters.
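The parameter counts and the reduction ratio are simple to verify:

```python
def conv_params(C_in, C_out, k):
    return C_out * C_in * k * k  # standard convolution (biases ignored)

def separable_params(C_in, C_out, k):
    depthwise = C_in * k * k     # one k x k filter per input channel
    pointwise = C_out * C_in     # 1x1 convolution mixes channels
    return depthwise + pointwise

std = conv_params(128, 64, 3)       # 73728
sep = separable_params(128, 64, 3)  # 1152 + 8192 = 9344
print(std / sep)                    # ~7.9x fewer parameters, matching 1/C_out + 1/k^2
```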

Transposed Convolution (Deconvolution)

To upsample feature maps (e.g., in U-Net decoders), we use transposed convolution. If standard convolution with stride $s$ maps $H$ to roughly $H/s$, transposed convolution maps $H$ to roughly $sH$:

$$H_{\text{out}} = s(H_{\text{in}} - 1) + k - 2p$$

This is the gradient of the convolution operation — it redistributes values from the low-resolution output back to the high-resolution input space.
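A scatter-based sketch makes the operation concrete: each input value stamps a scaled copy of the kernel into the (larger) output, and overlaps accumulate:

```python
import numpy as np

def conv_transpose2d(I, K, s=1, p=0):
    """Transposed convolution: scatter each input value times the kernel."""
    H, W = I.shape
    k = K.shape[0]  # square kernel assumed
    out = np.zeros((s * (H - 1) + k, s * (W - 1) + k))
    for i in range(H):
        for j in range(W):
            out[s*i:s*i + k, s*j:s*j + k] += I[i, j] * K
    # Padding p trims p rows/columns from each border
    return out[p:out.shape[0] - p, p:out.shape[1] - p] if p else out

I = np.ones((4, 4))
K = np.ones((3, 3))
print(conv_transpose2d(I, K, s=2).shape)  # s(H-1) + k - 2p = 2*3 + 3 = (9, 9)
```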

Attention in CNNs

Squeeze-and-Excitation (SE) blocks add channel-wise attention:

  1. Squeeze: Global average pooling reduces each channel to a scalar
  2. Excitation: A small FC network produces per-channel scaling factors
  3. Scale: Multiply each channel by its learned importance weight

This allows the network to adaptively recalibrate channel responses, focusing on the most informative features for each input.
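The three steps fit in a few lines of NumPy. The weight shapes and reduction ratio below are illustrative choices, not the values from the original SE paper:

```python
import numpy as np

def se_block(X, W1, W2):
    """Squeeze-and-Excitation on a feature map X of shape (C, H, W)."""
    z = X.mean(axis=(1, 2))                # squeeze: one scalar per channel (GAP)
    s = np.maximum(W1 @ z, 0.0)            # excitation: FC -> ReLU (bottleneck)
    s = 1.0 / (1.0 + np.exp(-(W2 @ s)))    # FC -> sigmoid gives scales in (0, 1)
    return X * s[:, None, None]            # scale each channel by its importance

rng = np.random.default_rng(0)
C, r = 16, 4                               # r = channel reduction ratio
X = rng.standard_normal((C, 8, 8))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
print(se_block(X, W1, W2).shape)  # (16, 8, 8): same shape, recalibrated channels
```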

9. Data Augmentation for CNNs

Data augmentation applies label-preserving transformations to training images, effectively increasing the dataset size and improving generalization.

Common Augmentations

  • Geometric: Random rotation, horizontal/vertical flip, crop, scale, shear
  • Color: Brightness, contrast, saturation jitter, color channel shuffling
  • Noise: Gaussian noise, salt-and-pepper noise, blur
  • Cutout/Erasing: Randomly mask rectangular regions to force robust feature learning
  • Mixup: Blend two training images and their labels: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$
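Mixup is only a few lines; following the original formulation, $\lambda$ is drawn from a Beta distribution (the $\alpha$ value below is an illustrative choice):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

rng = np.random.default_rng(0)
x_i, x_j = rng.standard_normal((2, 28, 28))
y_i = np.array([1.0, 0.0])
y_j = np.array([0.0, 1.0])
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j, rng=rng)
print(x_mix.shape)  # (28, 28); y_mix is a soft label that still sums to 1
```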

Science-Specific Augmentations

  • Microscopy: Elastic deformation, intensity variation (staining differences)
  • Astronomy: PSF convolution, noise injection matching telescope characteristics
  • Particle physics: Rotations in $\eta$-$\phi$ space, pileup simulation
  • Medical imaging: Anatomically-aware deformations that preserve tissue topology

Domain-specific augmentations encode prior knowledge about what variations are physically meaningful, improving sample efficiency dramatically.

10. Convolution as Matrix Multiplication

im2col

In practice, convolution is implemented as matrix multiplication using the im2col transformation. Each local patch is unrolled into a row of a matrix:

  • Input: $\mathbf{I} \in \mathbb{R}^{C \times H \times W}$
  • Patches matrix: $\mathbf{P} \in \mathbb{R}^{(H_{\text{out}} \cdot W_{\text{out}}) \times (C \cdot k_H \cdot k_W)}$
  • Kernel matrix: $\mathbf{K} \in \mathbb{R}^{(C \cdot k_H \cdot k_W) \times C_{\text{out}}}$
  • Output: $\mathbf{O} = \mathbf{P} \cdot \mathbf{K}$, then reshaped

This trades memory for speed by leveraging highly optimized BLAS matrix multiplication routines. Modern GPU implementations use variants like implicit GEMM or Winograd convolution for further speedups.
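A loop-based im2col sketch (single channel, stride 1, no padding) makes the equivalence explicit:

```python
import numpy as np

def im2col(I, kH, kW):
    """Unroll each kH x kW patch of I (shape H x W) into a row."""
    H, W = I.shape
    rows = [I[i:i + kH, j:j + kW].ravel()
            for i in range(H - kH + 1) for j in range(W - kW + 1)]
    return np.stack(rows)  # (H_out * W_out, kH * kW)

rng = np.random.default_rng(0)
I = rng.standard_normal((6, 6))
K = rng.standard_normal((3, 3))

P = im2col(I, 3, 3)                  # (16, 9)
out = (P @ K.ravel()).reshape(4, 4)  # convolution as a single matrix product

# Agrees with the direct sliding-window computation:
direct = np.array([[np.sum(I[i:i+3, j:j+3] * K) for j in range(4)]
                   for i in range(4)])
print(np.allclose(out, direct))  # True
```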

FFT Convolution

For large kernels, convolution can be computed efficiently in the Fourier domain:

$$\mathbf{I} \star \mathbf{K} = \mathcal{F}^{-1}(\mathcal{F}(\mathbf{I}) \odot \mathcal{F}(\mathbf{K}))$$

This reduces the complexity from $O(H^2 k^2)$ to $O(H^2 \log H)$, beneficial when $k$ is large. However, for the small $3 \times 3$ kernels common in modern architectures, spatial convolution with Winograd transforms is usually faster.
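The identity is exact for circular convolution with the DFT, which the following check demonstrates (the kernel is taken at full image size; in practice a small kernel would be zero-padded up to it):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
I = rng.standard_normal((N, N))
K = rng.standard_normal((N, N))

# Circular convolution via the 2D FFT: multiply spectra, transform back
via_fft = np.fft.ifft2(np.fft.fft2(I) * np.fft.fft2(K)).real

# Direct circular convolution for comparison
direct = np.zeros_like(I)
for i in range(N):
    for j in range(N):
        direct[i, j] = sum(I[m, n] * K[(i - m) % N, (j - n) % N]
                           for m in range(N) for n in range(N))

print(np.allclose(direct, via_fft))  # True
```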

Summary

  • Convolution: Shared-weight, local operation that provides translation equivariance
  • Output dimensions: $H_{\text{out}} = \lfloor(H + 2p - k)/s\rfloor + 1$
  • Pooling: Spatial downsampling that provides partial translation invariance
  • ResNet: Skip connections solve vanishing gradients; enable 100+ layer networks
  • Transfer learning: Pretrained features transfer well even to scientific domains
  • Parameter efficiency: CNNs use orders of magnitude fewer parameters than FC networks for spatial data