ML for Science/Part I: Foundations

Backpropagation & Optimization

The chain rule as an algorithm — from computational graphs to Adam and beyond

Introduction

Backpropagation (Rumelhart, Hinton & Williams, 1986) is the algorithm that makes deep learning possible. It efficiently computes the gradient of a loss function with respect to every parameter in a neural network by applying the chain rule of calculus in reverse through the computational graph. Combined with stochastic optimization methods, it enables training networks with millions (or billions) of parameters.

Key Topics

  • 1. The Chain Rule and Computational Graphs
  • 2. Deriving Backpropagation
  • 3. Stochastic Gradient Descent (SGD)
  • 4. Momentum
  • 5. Adam Optimizer
  • 6. Learning Rate Schedules
  • 7. Batch Normalization
  • 8. Python Simulation: Backprop & Optimizer Comparison
  • 9. Vanishing and Exploding Gradients
  • 10. Automatic Differentiation

1. The Chain Rule and Computational Graphs

Multivariate Chain Rule

If $y = f(u_1, u_2, \ldots, u_m)$ and each $u_k = g_k(x_1, \ldots, x_n)$, then:

$$\frac{\partial y}{\partial x_i} = \sum_{k=1}^{m} \frac{\partial y}{\partial u_k}\frac{\partial u_k}{\partial x_i}$$

In matrix form, this is the chain of Jacobians: if $\mathbf{y} = f(\mathbf{u})$ and $\mathbf{u} = g(\mathbf{x})$, then $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$.
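A quick numerical sanity check of the Jacobian chain: the sketch below compares the analytic gradient of a toy composition against a central finite-difference estimate (the function and matrix here are illustrative, not from the text):

```python
import numpy as np

# Toy composition: u = g(x) = A x,  y = f(u) = ||u||^2
A = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]])  # Jacobian of g

def g(x):
    return A @ x

def f(u):
    return np.sum(u ** 2)

x = np.array([0.7, -1.2])
u = g(x)

# Chain of Jacobians: dy/dx = (dy/du)(du/dx) = (2u^T) A
grad_analytic = 2 * u @ A

# Central finite-difference estimate of the same gradient
eps = 1e-6
grad_fd = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # True
```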

Computational Graph View

A neural network defines a directed acyclic graph (DAG) of operations. Each node computes a function of its inputs and passes the result to its children. The key insight of backprop is that we can compute gradients by passing "error signals" backward through this graph.

Forward pass: Compute all intermediate values from inputs to loss.

Backward pass: Starting from $\partial L/\partial L = 1$, propagate gradients backward using the chain rule at each node. Each node multiplies incoming gradient by its local Jacobian.

Forward vs Reverse Mode

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$:

  • Forward mode (tangent): Computes one column of the Jacobian per pass — efficient when $n \ll m$
  • Reverse mode (adjoint): Computes one row of the Jacobian per pass — efficient when $m \ll n$

Since neural network losses are scalar ($m=1$) with many parameters ($n \gg 1$), reverse mode (backpropagation) is optimal: it computes all $n$ partial derivatives in a single backward pass, at roughly the same cost as the forward pass.

2. Deriving Backpropagation

Consider a network with $L$ layers. Layer $\ell$ computes:

$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \quad \mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)})$$

Step 1: Define the Error Signal

For each layer $\ell$, define the error signal (upstream gradient):

$$\boxed{\boldsymbol{\delta}^{(\ell)} \equiv \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}}}$$

Step 2: Output Layer Error

For MSE loss $\mathcal{L} = \frac{1}{2}\|\mathbf{y} - \mathbf{h}^{(L)}\|^2$ with identity output activation:

$$\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} = (\mathbf{h}^{(L)} - \mathbf{y})$$

For cross-entropy loss with softmax output: $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{p}} - \mathbf{y}$, where $\mathbf{y}$ is the one-hot label vector.

Step 3: Backward Recursion

For hidden layers $\ell = L-1, L-2, \ldots, 1$, apply the chain rule:

$$\boldsymbol{\delta}^{(\ell)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}} = \left(\frac{\partial \mathbf{h}^{(\ell)}}{\partial \mathbf{z}^{(\ell)}}\right)^T \left(\frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{h}^{(\ell)}}\right)^T \boldsymbol{\delta}^{(\ell+1)}$$

Since $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\mathbf{h}^{(\ell)} + \mathbf{b}^{(\ell+1)}$ and $\mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)})$:

$$\boxed{\boldsymbol{\delta}^{(\ell)} = \left((\mathbf{W}^{(\ell+1)})^T \boldsymbol{\delta}^{(\ell+1)}\right) \odot \phi'(\mathbf{z}^{(\ell)})}$$

where $\odot$ is element-wise multiplication (Hadamard product).

Step 4: Parameter Gradients

Once we have $\boldsymbol{\delta}^{(\ell)}$ for each layer, the parameter gradients are:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}(\mathbf{h}^{(\ell-1)})^T$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}$$

Each gradient is an outer product of the error signal and the input to that layer. This is why backprop requires storing all forward-pass activations — memory scales linearly with depth.
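Steps 1–4 can be sketched end-to-end for a tiny two-layer network with tanh hidden units, MSE loss, and an identity output activation, then checked against a finite difference (all sizes and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), rng.normal(size=2)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    z2 = W2 @ h1 + b2                  # identity output activation
    loss = 0.5 * np.sum((z2 - y) ** 2)
    return z1, h1, z2, loss

z1, h1, z2, loss = forward(W1, b1, W2, b2)

# Step 2: output error for MSE + identity output
delta2 = z2 - y
# Step 4: parameter gradients are outer products with stored activations
grad_W2, grad_b2 = np.outer(delta2, h1), delta2
# Step 3: backward recursion (tanh'(z) = 1 - tanh(z)^2)
delta1 = (W2.T @ delta2) * (1 - h1 ** 2)
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# Finite-difference check on a single weight
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
fd = (forward(Wp, b1, W2, b2)[-1] - forward(Wm, b1, W2, b2)[-1]) / (2 * eps)
print(abs(fd - grad_W1[0, 0]) < 1e-6)  # True
```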

Computational Complexity

For a network with $P$ total parameters:

  • Forward pass: $O(P)$ multiply-add operations
  • Backward pass: $O(P)$ multiply-add operations (roughly 2x forward)
  • Memory: Must store all $\mathbf{h}^{(\ell)}$ and $\mathbf{z}^{(\ell)}$ from the forward pass

Without backprop, computing gradients by finite differences would cost $O(P^2)$ — intractable for modern networks.

3. Stochastic Gradient Descent (SGD)

From Full-Batch to Mini-Batch

The full gradient over $n$ samples is:

$$\nabla\mathcal{L}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\nabla\ell_i(\mathbf{w})$$

SGD approximates this with a random mini-batch $\mathcal{B} \subset \{1,\ldots,n\}$ of size $B$:

$$\hat{\nabla}\mathcal{L}(\mathbf{w}) = \frac{1}{B}\sum_{i \in \mathcal{B}}\nabla\ell_i(\mathbf{w})$$

This is an unbiased estimate: $\mathbb{E}[\hat{\nabla}\mathcal{L}] = \nabla\mathcal{L}$.

The variance of the estimator is $\text{Var}(\hat{\nabla}\mathcal{L}) = \frac{\sigma^2}{B}$, where $\sigma^2$ is the per-sample gradient variance. Larger batches reduce variance but cost more computation per step.
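Both properties — unbiasedness and the $1/B$ variance scaling — are easy to verify empirically. In this sketch the per-sample "gradients" are just scalar random draws, which is all the argument needs:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
per_sample = rng.normal(loc=2.0, scale=3.0, size=n)  # scalar per-sample gradients
full_grad = per_sample.mean()

def minibatch_estimate(B, trials=5_000):
    """Mean and variance of the size-B mini-batch gradient estimator."""
    ests = np.array([
        per_sample[rng.choice(n, size=B, replace=False)].mean()
        for _ in range(trials)
    ])
    return ests.mean(), ests.var()

mean8, var8 = minibatch_estimate(8)
mean64, var64 = minibatch_estimate(64)
print(abs(mean8 - full_grad) < 0.1)  # unbiased
print(var8 / var64)                  # close to 64/8 = 8
```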

SGD Convergence

For convex functions with learning rate $\eta_t$:

  • Fixed LR: SGD converges to a neighborhood of the optimum, with radius $\propto \eta\sigma^2$
  • Decreasing LR: If $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, SGD converges exactly
  • Common schedule: $\eta_t = \eta_0 / (1 + \alpha t)$ satisfies both conditions

4. Momentum

Classical Momentum (Polyak, 1964)

Momentum accelerates SGD by accumulating an exponentially decaying moving average of past gradients:

$$\boxed{\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla\mathcal{L}(\mathbf{w}_t), \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta\mathbf{v}_{t+1}}$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically 0.9). The velocity $\mathbf{v}$ acts like physical momentum, building up speed along consistent gradient directions.

The effective learning rate along consistent directions is amplified by $1/(1-\beta)$. For $\beta = 0.9$, this is a 10x amplification. Oscillating components get damped.

Nesterov Accelerated Gradient (NAG)

Nesterov momentum evaluates the gradient at the "look-ahead" position:

$$\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla\mathcal{L}(\mathbf{w}_t - \eta\beta\mathbf{v}_t), \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta\mathbf{v}_{t+1}$$

This look-ahead correction has a provable benefit: for smooth, strongly convex problems, NAG needs $O(\sqrt{\kappa})$ iterations versus $O(\kappa)$ for plain gradient descent, where $\kappa$ is the condition number of the loss surface; classical momentum attains this accelerated rate only on quadratics.
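A minimal comparison of the three updates on an ill-conditioned quadratic $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w}$ (the step size, momentum coefficient, and condition number are illustrative choices):

```python
import numpy as np

A = np.diag([1.0, 100.0])           # condition number kappa = 100
grad = lambda w: A @ w
loss = lambda w: 0.5 * w @ A @ w
eta, beta = 0.009, 0.9

def run(step, steps=100):
    w, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        w, v = step(w, v)
    return loss(w)

# Plain gradient descent (velocity unused)
gd = run(lambda w, v: (w - eta * grad(w), v))
# Classical momentum: v <- beta v + g(w), w <- w - eta v
mom = run(lambda w, v: (w - eta * (beta * v + grad(w)),
                        beta * v + grad(w)))
# Nesterov: gradient evaluated at the look-ahead point w - eta beta v
nag = run(lambda w, v: (w - eta * (beta * v + grad(w - eta * beta * v)),
                        beta * v + grad(w - eta * beta * v)))

print(gd, mom, nag)  # both momentum variants end far below plain GD
```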

5. Adam Optimizer

Adam (Adaptive Moment Estimation, Kingma & Ba 2015) combines momentum with per-parameter adaptive learning rates using second moment estimates.

The Adam Algorithm

Initialize $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{v}_0 = \mathbf{0}$, $t = 0$. At each step:

$$\mathbf{g}_t = \nabla\mathcal{L}(\mathbf{w}_t) \quad \text{(compute gradient)}$$
$$\mathbf{m}_t = \beta_1\mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t \quad \text{(first moment estimate)}$$
$$\mathbf{v}_t = \beta_2\mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2 \quad \text{(second moment estimate)}$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$
$$\boxed{\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.001$.
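The five update equations map line-for-line onto code. A minimal sketch, here driving a toy quadratic (the objective, step count, and learning rate are illustrative):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * g           # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2 from a distant start
w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 5001):
    g = 2 * w                                  # gradient of ||w||^2
    w, m, v = adam_step(w, g, m, v, t, eta=0.01)
print(np.abs(w).max())  # near zero
```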

Why Bias Correction?

Since $\mathbf{m}_0 = \mathbf{0}$, early estimates are biased toward zero:

$$\mathbb{E}[\mathbf{m}_t] = (1-\beta_1^t)\mathbb{E}[\mathbf{g}_t]$$

Dividing by $(1-\beta_1^t)$ corrects this. For $\beta_1 = 0.9$, the correction is significant for the first ~10 steps, then becomes negligible.

AdamW: Decoupled Weight Decay

Loshchilov & Hutter (2019) showed that L2 regularization and weight decay are not equivalent with Adam. The correct formulation is:

$$\mathbf{w}_{t+1} = (1 - \eta\lambda)\mathbf{w}_t - \eta\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$$

This decouples the weight decay from the adaptive gradient, leading to better generalization. AdamW is the default optimizer for modern transformers.
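A single-step illustration of the difference, with made-up moment values (everything here is illustrative): in AdamW the decay bypasses the adaptive denominator, while Adam-with-L2 rescales it by $1/\sqrt{\hat{v}_t}$.

```python
import numpy as np

eta, lam, eps = 0.1, 0.01, 1e-8
w = np.array([1.0])
m_hat = np.array([0.5])   # illustrative bias-corrected first moment
v_hat = np.array([4.0])   # illustrative bias-corrected second moment

# AdamW: decay applied directly to the weights
w_adamw = (1 - eta * lam) * w - eta * m_hat / (np.sqrt(v_hat) + eps)

# Adam + L2: lam * w folded into the gradient, then rescaled adaptively
w_adam_l2 = w - eta * (m_hat + lam * w) / (np.sqrt(v_hat) + eps)

print(w_adamw, w_adam_l2)  # the two updates give different weights here
```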

6. Learning Rate Schedules

Common Schedules

  • Step Decay:
    $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$

    Reduce LR by factor $\gamma$ every $s$ epochs

  • Cosine Annealing:
    $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$

    Smooth decay to $\eta_{\min}$ over $T$ steps

  • Linear Warmup + Decay:
    $\eta_t = \begin{cases} \eta_0 \cdot t/T_w & t < T_w \\ \eta_0 \cdot (T-t)/(T-T_w) & t \geq T_w \end{cases}$

    Used in transformer training; warmup stabilizes early training with Adam

  • Exponential Decay:
    $\eta_t = \eta_0 \cdot e^{-\alpha t}$

    Continuous version of step decay
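The four schedules as plain functions of the step $t$ (hyperparameter values are illustrative):

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, s=10):
    return eta0 * gamma ** (t // s)

def cosine_annealing(t, eta0=0.1, eta_min=1e-4, T=100):
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def warmup_linear_decay(t, eta0=0.1, T_w=10, T=100):
    if t < T_w:
        return eta0 * t / T_w            # linear warmup
    return eta0 * (T - t) / (T - T_w)    # linear decay to zero

def exponential_decay(t, eta0=0.1, alpha=0.01):
    return eta0 * math.exp(-alpha * t)

print(step_decay(25))                    # 0.1 * 0.5^2 = 0.025
print(round(cosine_annealing(100), 6))   # decays to eta_min = 0.0001
print(warmup_linear_decay(5))            # halfway through warmup: 0.05
```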

7. Batch Normalization

Batch normalization (Ioffe & Szegedy, 2015) normalizes layer inputs across the mini-batch, stabilizing training and enabling higher learning rates.

The BatchNorm Transform

For a mini-batch $\{z_i\}_{i=1}^{B}$ at a given layer:

$$\mu_B = \frac{1}{B}\sum_{i=1}^{B} z_i, \quad \sigma_B^2 = \frac{1}{B}\sum_{i=1}^{B}(z_i - \mu_B)^2$$
$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$\boxed{\text{BN}(z_i) = \gamma\hat{z}_i + \beta}$$

where $\gamma$ and $\beta$ are learnable scale and shift parameters that restore the network's representational capacity.

Backprop Through BatchNorm

The gradient through BatchNorm requires careful derivation. Let $\delta_i = \partial\mathcal{L}/\partial\hat{z}_i$:

$$\frac{\partial\mathcal{L}}{\partial z_i} = \frac{1}{\sqrt{\sigma_B^2+\epsilon}}\left(\delta_i - \frac{1}{B}\sum_j\delta_j - \frac{\hat{z}_i}{B}\sum_j\delta_j\hat{z}_j\right)$$

This ensures gradients are properly normalized, preventing both vanishing and exploding gradients.
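This formula can be checked against a finite-difference gradient. The sketch below omits $\gamma$ and $\beta$ (i.e. $\gamma = 1$, $\beta = 0$) and uses a toy linear loss on $\hat{z}$; everything else is the normalization exactly as defined above:

```python
import numpy as np

rng = np.random.default_rng(2)
B, eps = 8, 1e-5
z = rng.normal(size=B)
w = rng.normal(size=B)             # weights of a toy downstream loss

def normalize(z):
    return (z - z.mean()) / np.sqrt(z.var() + eps)  # biased (1/B) variance

def loss(z):
    return w @ normalize(z)        # L = w . z_hat

z_hat = normalize(z)
delta = w                          # dL/dz_hat for this linear loss

# Closed-form backward pass through the normalization
grad = (delta - delta.mean() - z_hat * (delta * z_hat).mean()) \
       / np.sqrt(z.var() + eps)

# Finite-difference check
h = 1e-6
fd = np.array([(loss(z + h * e) - loss(z - h * e)) / (2 * h)
               for e in np.eye(B)])
print(np.allclose(grad, fd, atol=1e-5))  # True
```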

8. Python Simulation: Backprop & Optimizer Comparison

This simulation implements backpropagation from scratch and compares SGD, Momentum, and Adam optimizers on a nonlinear regression task.

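The interactive script itself is not reproduced here, but its idea fits in a short standalone sketch: backprop by hand for a one-hidden-layer tanh network fitting $y = \sin(x)$, trained with each of the three optimizers (architecture, learning rates, and step counts are illustrative choices, not the original script's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(256, 1))
Y = np.sin(X)

def init():
    return {"W1": rng.normal(0, 1.0, (1, 32)), "b1": np.zeros(32),
            "W2": rng.normal(0, 0.3, (32, 1)), "b2": np.zeros(1)}

def loss_and_grads(p):
    h1 = np.tanh(X @ p["W1"] + p["b1"])           # forward pass
    pred = h1 @ p["W2"] + p["b2"]
    d2 = (pred - Y) / len(X)                      # output error (MSE)
    d1 = (d2 @ p["W2"].T) * (1 - h1 ** 2)         # backward recursion
    grads = {"W1": X.T @ d1, "b1": d1.sum(0),
             "W2": h1.T @ d2, "b2": d2.sum(0)}
    return 0.5 * np.mean((pred - Y) ** 2), grads

def sgd(p, g, s, t, eta=0.05):
    for k in p:
        p[k] -= eta * g[k]

def momentum(p, g, s, t, eta=0.01, beta=0.9):
    for k in p:
        s[k] = beta * s.get(k, 0) + g[k]
        p[k] -= eta * s[k]

def adam(p, g, s, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    for k in p:
        s[k, "m"] = b1 * s.get((k, "m"), 0) + (1 - b1) * g[k]
        s[k, "v"] = b2 * s.get((k, "v"), 0) + (1 - b2) * g[k] ** 2
        m_hat = s[k, "m"] / (1 - b1 ** t)
        v_hat = s[k, "v"] / (1 - b2 ** t)
        p[k] -= eta * m_hat / (np.sqrt(v_hat) + eps)

def train(update, steps=2000):
    p, s = init(), {}
    for t in range(1, steps + 1):
        _, g = loss_and_grads(p)
        update(p, g, s, t)
    return loss_and_grads(p)[0]

results = {name: train(u) for name, u in
           [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]}
for name, final in results.items():
    print(f"{name:8s} final loss: {final:.5f}")
```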

9. Vanishing and Exploding Gradients

The Gradient Product Problem

The backpropagation recursion involves a product of matrices and activation derivatives:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(1)}} = \left(\prod_{\ell=1}^{L-1} \text{diag}(\phi'(\mathbf{z}^{(\ell)})) \cdot \mathbf{W}^{(\ell+1)T}\right) \boldsymbol{\delta}^{(L)}$$

If the spectral norm $\|\text{diag}(\phi') \cdot \mathbf{W}^T\| < 1$ at each layer, gradients vanish exponentially with depth. If the norm $> 1$, gradients explode exponentially.

Why Sigmoid Causes Vanishing Gradients

The sigmoid derivative satisfies $\sigma'(z) \leq 0.25$ for all $z$. At each layer, the gradient is multiplied by at most 0.25. After $L$ layers:

$$\|\boldsymbol{\delta}^{(1)}\| \leq 0.25^{L-1} \cdot \|\mathbf{W}\|^{L-1} \cdot \|\boldsymbol{\delta}^{(L)}\|$$

For $L = 10$ layers with well-initialized weights ($\|\mathbf{W}\| \approx 1$), gradients shrink by a factor of $0.25^9 \approx 4 \times 10^{-6}$. This is why ReLU ($\phi'(z) = 1$ for $z > 0$) revolutionized deep learning.
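The shrinkage is easy to observe numerically. This sketch pushes a unit-norm error signal backward through random layers whose weights are scaled to be roughly norm-preserving (depth, width, and the stand-in pre-activations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def backprop_norm(phi_prime, L=10, width=64):
    """Norm of delta^(1) after backpropagating through L-1 random layers."""
    delta = rng.normal(size=width)
    delta /= np.linalg.norm(delta)
    for _ in range(L - 1):
        W = rng.normal(0, 1 / np.sqrt(width), size=(width, width))
        z = rng.normal(size=width)             # stand-in pre-activations
        delta = (W.T @ delta) * phi_prime(z)   # the backward recursion
    return np.linalg.norm(delta)

sigmoid_prime = lambda z: np.exp(-z) / (1 + np.exp(-z)) ** 2
relu_prime = lambda z: (z > 0).astype(float)

sig = backprop_norm(sigmoid_prime)
rel = backprop_norm(relu_prime)
print(f"sigmoid: {sig:.2e}   relu: {rel:.2e}")  # sigmoid many orders smaller
```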

Solutions

  • ReLU activation: Gradient is exactly 1 for active neurons
  • Residual connections: Skip connections add an identity path for gradients
  • Batch normalization: Keeps pre-activations in a well-conditioned range
  • Gradient clipping: Cap gradient norms to prevent explosion: $\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, c/\|\mathbf{g}\|)$
  • Proper initialization: He/Xavier initialization maintains gradient variance across layers
  • Layer normalization: Normalize across features rather than batch, used in transformers
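Of these, gradient clipping is simple enough to show inline — a sketch of the norm-clipping rule from the list above:

```python
import numpy as np

def clip_by_norm(g, c):
    """Rescale g to norm at most c, leaving its direction unchanged."""
    norm = np.linalg.norm(g)
    return g * min(1.0, c / norm) if norm > 0 else g

print(clip_by_norm(np.array([3.0, 4.0]), 1.0))   # norm 5 -> scaled to norm 1
print(clip_by_norm(np.array([0.1, 0.0]), 1.0))   # already small -> unchanged
```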

10. Automatic Differentiation

Backpropagation is a specific case of reverse-mode automatic differentiation (AD), a general technique for computing derivatives of programs.

Forward Mode vs Reverse Mode

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ composed of $T$ elementary operations:

| Property | Forward Mode | Reverse Mode |
| --- | --- | --- |
| Computes | Jacobian-vector product $\mathbf{J}\mathbf{v}$ | Vector-Jacobian product $\mathbf{v}^T\mathbf{J}$ |
| Cost per pass | $O(T)$ | $O(T)$ |
| Full Jacobian | $n$ passes | $m$ passes |
| Best when | $n \ll m$ | $m \ll n$ (neural nets) |

AD Frameworks

Modern deep learning frameworks implement reverse-mode AD automatically:

  • PyTorch: Dynamic computation graph (define-by-run), eager execution
  • JAX: Functional transformations, supports forward and reverse mode, JIT compilation
  • TensorFlow: Static computation graph (with eager mode), XLA compilation

These frameworks compute exact gradients (up to floating point) — not numerical approximations. The user only writes the forward pass; the backward pass is generated automatically.

Summary

  • Backpropagation: Reverse-mode automatic differentiation; computes all gradients in $O(P)$ time
  • Error signal: $\boldsymbol{\delta}^{(\ell)} = ((\mathbf{W}^{(\ell+1)})^T\boldsymbol{\delta}^{(\ell+1)}) \odot \phi'(\mathbf{z}^{(\ell)})$
  • SGD: Unbiased gradient estimate from mini-batches; variance decreases as $1/B$
  • Momentum: Accumulates gradient history; accelerates convergence along consistent directions
  • Adam: Adaptive per-parameter learning rates with bias correction; the default optimizer in deep learning
  • BatchNorm: Normalizes layer inputs; stabilizes training and enables higher learning rates