Backpropagation & Optimization
The chain rule as an algorithm — from computational graphs to Adam and beyond
Introduction
Backpropagation (Rumelhart, Hinton & Williams, 1986) is the algorithm that makes deep learning possible. It efficiently computes the gradient of a loss function with respect to every parameter in a neural network by applying the chain rule of calculus in reverse through the computational graph. Combined with stochastic optimization methods, it enables training networks with millions (or billions) of parameters.
Key Topics
- 1. The Chain Rule and Computational Graphs
- 2. Deriving Backpropagation
- 3. Stochastic Gradient Descent (SGD)
- 4. Momentum
- 5. Adam Optimizer
- 6. Learning Rate Schedules
- 7. Batch Normalization
1. The Chain Rule and Computational Graphs
Multivariate Chain Rule
If $y = f(u_1, u_2, \ldots, u_m)$ and each $u_k = g_k(x_1, \ldots, x_n)$, then:
$$\frac{\partial y}{\partial x_j} = \sum_{k=1}^{m} \frac{\partial f}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}$$
In matrix form, this is the chain of Jacobians: if $\mathbf{y} = f(\mathbf{u})$ and $\mathbf{u} = g(\mathbf{x})$, then $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$.
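A quick numerical sanity check of the Jacobian chain, with illustrative functions $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $f: \mathbb{R}^3 \to \mathbb{R}$ (the functions and the finite-difference helper are assumptions for this sketch):

```python
import numpy as np

def g(x):                      # u = g(x): R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(u):                      # y = f(u): R^3 -> R (scalar)
    return u[0] + 2.0 * u[1] * u[2]

def jac(fn, x, eps=1e-6):
    """Numerical Jacobian of fn at x via central differences."""
    y0 = np.atleast_1d(fn(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (np.atleast_1d(fn(x + dx)) - np.atleast_1d(fn(x - dx))) / (2 * eps)
    return J

x = np.array([0.7, -1.3])
J_composed = jac(lambda x: f(g(x)), x)   # dy/dx computed directly
J_chain = jac(f, g(x)) @ jac(g, x)       # (dy/du)(du/dx): chain of Jacobians
assert np.allclose(J_composed, J_chain, atol=1e-5)
```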
Computational Graph View
A neural network defines a directed acyclic graph (DAG) of operations. Each node computes a function of its inputs and passes the result to its children. The key insight of backprop is that we can compute gradients by passing "error signals" backward through this graph.
Forward pass: Compute all intermediate values from inputs to loss.
Backward pass: Starting from $\partial L/\partial L = 1$, propagate gradients backward using the chain rule at each node. Each node multiplies incoming gradient by its local Jacobian.
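As a sketch, here is a manual forward and backward pass through the tiny graph $L = (x \cdot w + b)^2$; the variable names are illustrative:

```python
# Forward pass: compute and cache intermediates for L = (x*w + b)**2
x, w, b = 3.0, -2.0, 1.0
z = x * w + b          # z = -5.0
L = z ** 2             # L = 25.0

# Backward pass: start from dL/dL = 1 and multiply by local Jacobians
dL_dL = 1.0
dL_dz = dL_dL * 2 * z        # node z**2 has local derivative 2z
dL_dw = dL_dz * x            # node x*w has local derivative x w.r.t. w
dL_db = dL_dz * 1.0          # addition passes the gradient through
dL_dx = dL_dz * w

assert (dL_dw, dL_db, dL_dx) == (-30.0, -10.0, 20.0)
```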
Forward vs Reverse Mode
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$:
- Forward mode (tangent): Computes one column of the Jacobian per pass — efficient when $n \ll m$
- Reverse mode (adjoint): Computes one row of the Jacobian per pass — efficient when $m \ll n$
Since neural network losses are scalar ($m=1$) with many parameters ($n \gg 1$), reverse mode (backpropagation) is optimal: it computes all $n$ partial derivatives in a single backward pass, at roughly the same cost as the forward pass.
2. Deriving Backpropagation
Consider a network with $L$ layers. Layer $\ell$ computes:
$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)}),$$
with input $\mathbf{h}^{(0)} = \mathbf{x}$ and a scalar loss $\mathcal{L}$ computed from the output $\mathbf{h}^{(L)}$.
Step 1: Define the Error Signal
For each layer $\ell$, define the error signal (upstream gradient):
$$\boldsymbol{\delta}^{(\ell)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}}$$
Step 2: Output Layer Error
For MSE loss $\mathcal{L} = \frac{1}{2}\|\mathbf{y} - \mathbf{h}^{(L)}\|^2$ with identity output activation:
$$\boldsymbol{\delta}^{(L)} = \mathbf{h}^{(L)} - \mathbf{y}$$
For cross-entropy loss with softmax output: $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{p}} - \mathbf{y}$ (one-hot).
Step 3: Backward Recursion
For hidden layers $\ell = L-1, L-2, \ldots, 1$, apply the chain rule:
$$\boldsymbol{\delta}^{(\ell)} = \left(\frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{z}^{(\ell)}}\right)^{\!T} \boldsymbol{\delta}^{(\ell+1)}$$
Since $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\mathbf{h}^{(\ell)} + \mathbf{b}^{(\ell+1)}$ and $\mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)})$:
$$\boldsymbol{\delta}^{(\ell)} = \left((\mathbf{W}^{(\ell+1)})^T \boldsymbol{\delta}^{(\ell+1)}\right) \odot \phi'(\mathbf{z}^{(\ell)})$$
where $\odot$ is element-wise multiplication (Hadamard product).
Step 4: Parameter Gradients
Once we have $\boldsymbol{\delta}^{(\ell)}$ for each layer, the parameter gradients are:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)} \left(\mathbf{h}^{(\ell-1)}\right)^T, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}$$
Each gradient is an outer product of the error signal and the input to that layer. This is why backprop requires storing all forward-pass activations — memory scales linearly with depth.
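The four steps above can be sketched for a two-layer network with tanh hidden units and MSE loss; the sizes and names are illustrative, and a finite-difference check confirms one entry of the gradient:

```python
import numpy as np
rng = np.random.default_rng(0)

# Tiny network: x (3,) -> hidden (4,) -> output (2,), MSE loss
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), rng.normal(size=2)

# Forward pass, caching z and h for the backward pass
z1 = W1 @ x + b1; h1 = np.tanh(z1)
z2 = W2 @ h1 + b2; h2 = z2                      # identity output
loss = 0.5 * np.sum((y - h2) ** 2)

# Backward pass (Steps 2-4)
d2 = h2 - y                                     # output error signal
d1 = (W2.T @ d2) * (1 - h1 ** 2)                # backward recursion
gW2, gb2 = np.outer(d2, h1), d2                 # outer products
gW1, gb1 = np.outer(d1, x), d1

# Check gW1[0, 0] against a finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
lp = 0.5 * np.sum((y - (W2 @ np.tanh(W1p @ x + b1) + b2)) ** 2)
assert abs((lp - loss) / eps - gW1[0, 0]) < 1e-4
```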
Computational Complexity
For a network with $P$ total parameters:
- Forward pass: $O(P)$ multiply-add operations
- Backward pass: $O(P)$ multiply-add operations (roughly 2x forward)
- Memory: Must store all $\mathbf{h}^{(\ell)}$ and $\mathbf{z}^{(\ell)}$ from the forward pass
Without backprop, computing gradients by finite differences would cost $O(P^2)$ — intractable for modern networks.
3. Stochastic Gradient Descent (SGD)
From Full-Batch to Mini-Batch
The full gradient over $n$ samples is:
$$\nabla\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \nabla \ell_i(\boldsymbol{\theta})$$
SGD approximates this with a random mini-batch $\mathcal{B} \subset \{1,\ldots,n\}$ of size $B$:
$$\hat{\nabla}\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell_i(\boldsymbol{\theta})$$
This is an unbiased estimate: $\mathbb{E}[\hat{\nabla}\mathcal{L}] = \nabla\mathcal{L}$.
The variance of the estimator is $\text{Var}(\hat{\nabla}\mathcal{L}) = \frac{\sigma^2}{B}$, where $\sigma^2$ is the per-sample gradient variance. Larger batches reduce variance but cost more computation per step.
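Both properties are easy to verify empirically on a toy problem (the 1-D quadratic loss here is an assumption for the sketch):

```python
import numpy as np
rng = np.random.default_rng(1)

# Toy problem: per-sample loss 0.5*(theta - x_i)^2 has gradient (theta - x_i)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)
theta = 0.0
full_grad = np.mean(theta - x)              # close to -2

def minibatch_grad(B):
    idx = rng.integers(0, x.size, size=B)   # sample with replacement
    return np.mean(theta - x[idx])

# The estimator is unbiased, and its variance falls as 1/B
for B in (1, 10, 100):
    est = np.array([minibatch_grad(B) for _ in range(2000)])
    print(f"B={B:3d}  mean={est.mean():+.3f}  var={est.var():.4f}")
```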
SGD Convergence
For convex functions with learning rate $\eta_t$:
- Fixed LR: SGD converges to a neighborhood of the optimum, with radius $\propto \eta\sigma^2$
- Decreasing LR: If $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, SGD converges exactly
- Common schedule: $\eta_t = \eta_0 / (1 + \alpha t)$ satisfies both conditions
4. Momentum
Classical Momentum (Polyak, 1964)
Momentum accelerates SGD by accumulating an exponentially decaying moving average of past gradients:
$$\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla\mathcal{L}(\boldsymbol{\theta}_t), \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\mathbf{v}_{t+1}$$
where $\beta \in [0, 1)$ is the momentum coefficient (typically 0.9). The velocity $\mathbf{v}$ acts like physical momentum, building up speed along consistent gradient directions.
The effective learning rate along consistent directions is amplified by $1/(1-\beta)$. For $\beta = 0.9$, this is a 10x amplification. Oscillating components get damped.
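The $1/(1-\beta)$ amplification can be checked directly: with a constant unit gradient, the velocity converges to the geometric series limit $g/(1-\beta)$.

```python
# With a constant gradient g, the velocity approaches g / (1 - beta),
# i.e. an effective step-size amplification of 1/(1-0.9) = 10
beta, g = 0.9, 1.0
v = 0.0
for _ in range(200):
    v = beta * v + g
print(v)   # converges to 10
```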
Nesterov Accelerated Gradient (NAG)
Nesterov momentum evaluates the gradient at the "look-ahead" position:
$$\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \nabla\mathcal{L}(\boldsymbol{\theta}_t - \eta\beta\mathbf{v}_t), \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\mathbf{v}_{t+1}$$
This look-ahead "correction" improves the iteration complexity on smooth, strongly convex problems from the $O(\kappa\log(1/\varepsilon))$ of plain gradient descent to $O(\sqrt{\kappa}\log(1/\varepsilon))$, where $\kappa$ is the condition number of the loss surface.
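A minimal sketch of one Nesterov step (the function name and the quadratic test problem are assumptions for illustration):

```python
def nag_step(theta, v, grad, eta=0.1, beta=0.9):
    """One Nesterov step: the gradient is taken at the look-ahead point."""
    g = grad(theta - eta * beta * v)   # look-ahead gradient
    v = beta * v + g
    return theta - eta * v, v

# Quadratic f(x) = x^2 / 2 with gradient x: iterates are driven to 0
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nag_step(theta, v, lambda t: t)
print(theta)   # very close to 0
```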
5. Adam Optimizer
Adam (Adaptive Moment Estimation, Kingma & Ba 2015) combines momentum with per-parameter adaptive learning rates using second moment estimates.
The Adam Algorithm
Initialize $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{v}_0 = \mathbf{0}$, $t = 0$. At each step:
$$\begin{aligned} t &\leftarrow t + 1\\ \mathbf{g}_t &= \nabla\mathcal{L}(\boldsymbol{\theta}_{t-1})\\ \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\,\mathbf{g}_t\\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\,\mathbf{g}_t^2\\ \hat{\mathbf{m}}_t &= \frac{\mathbf{m}_t}{1-\beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}\\ \boldsymbol{\theta}_t &= \boldsymbol{\theta}_{t-1} - \eta\,\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \end{aligned}$$
Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.001$.
Why Bias Correction?
Since $\mathbf{m}_0 = \mathbf{0}$, early estimates are biased toward zero: for a stationary gradient distribution,
$$\mathbb{E}[\mathbf{m}_t] = (1 - \beta_1^t)\,\mathbb{E}[\mathbf{g}]$$
Dividing by $(1-\beta_1^t)$ corrects this. For $\beta_1 = 0.9$, the correction is significant for the first ~10 steps, then becomes negligible.
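A minimal sketch of a single Adam step following the update above (the function name and demo values are illustrative); note how, thanks to bias correction, the very first step has magnitude roughly $\eta$ regardless of the gradient's scale:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# On step 1 the update is eta * sign(g), whether |g| is 100 or 0.01
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
g = np.array([100.0, -0.01])
theta, m, v = adam_step(theta, g, m, v, t=1)
print(theta)   # approximately [-0.001, 0.001]
```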
AdamW: Decoupled Weight Decay
Loshchilov & Hutter (2019) showed that L2 regularization and weight decay are not equivalent with Adam. The correct formulation applies the decay directly to the weights:
$$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda\,\boldsymbol{\theta}_{t-1}\right)$$
This decouples the weight decay from the adaptive gradient, leading to better generalization. AdamW is the default optimizer for modern transformers.
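The only change relative to the Adam sketch above is where the decay term enters; a hedged sketch (the function name and decay value are assumptions):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied to theta directly, not added to g."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay: the wd term is NOT rescaled by sqrt(v_hat)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# With zero gradient the weights simply shrink by a factor (1 - eta*wd)
theta, m, v = np.ones(2), np.zeros(2), np.zeros(2)
theta, m, v = adamw_step(theta, np.zeros(2), m, v, t=1)
print(theta)   # [0.99999, 0.99999]
```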
6. Learning Rate Schedules
Common Schedules
- Step Decay: $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$
Reduce LR by factor $\gamma$ every $s$ epochs
- Cosine Annealing: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$
Smooth decay to $\eta_{\min}$ over $T$ steps
- Linear Warmup + Decay: $\eta_t = \begin{cases} \eta_0 \cdot t/T_w & t < T_w \\ \eta_0 \cdot (T-t)/(T-T_w) & t \geq T_w \end{cases}$
Used in transformer training; warmup stabilizes early training with Adam
- Exponential Decay: $\eta_t = \eta_0 \cdot e^{-\alpha t}$
Continuous version of step decay
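The four schedules translate directly into code; the default hyperparameter values below are illustrative:

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, s=10):
    return eta0 * gamma ** (t // s)

def cosine_anneal(t, eta0=0.1, eta_min=1e-4, T=100):
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def warmup_linear(t, eta0=0.1, Tw=10, T=100):
    return eta0 * t / Tw if t < Tw else eta0 * (T - t) / (T - Tw)

def exp_decay(t, eta0=0.1, alpha=0.01):
    return eta0 * math.exp(-alpha * t)

# e.g. step decay has halved twice by t=25; warmup is halfway done at t=5
print(step_decay(25), cosine_anneal(50), warmup_linear(5), exp_decay(0))
```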
7. Batch Normalization
Batch normalization (Ioffe & Szegedy, 2015) normalizes layer inputs across the mini-batch, stabilizing training and enabling higher learning rates.
The BatchNorm Transform
For a mini-batch $\{z_i\}_{i=1}^{B}$ at a given layer:
$$\mu_{\mathcal{B}} = \frac{1}{B}\sum_{i=1}^{B} z_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{B}\sum_{i=1}^{B} (z_i - \mu_{\mathcal{B}})^2, \qquad \hat{z}_i = \frac{z_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{z}_i + \beta$$
where $\gamma$ and $\beta$ are learnable scale and shift parameters that restore the network's representational capacity.
Backprop Through BatchNorm
The gradient through BatchNorm requires careful derivation, because $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ depend on every $z_i$ in the batch. Let $\delta_i = \partial\mathcal{L}/\partial\hat{z}_i$:
$$\frac{\partial\mathcal{L}}{\partial z_i} = \frac{1}{B\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}\left(B\,\delta_i - \sum_{j=1}^{B}\delta_j - \hat{z}_i\sum_{j=1}^{B}\delta_j\,\hat{z}_j\right)$$
This ensures gradients are properly normalized, preventing both vanishing and exploding gradients.
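The forward and backward passes can be sketched for a single feature over a batch, with a finite-difference check of the backward formula (function names and the toy loss are assumptions):

```python
import numpy as np
rng = np.random.default_rng(2)

def bn_forward(z, gamma, beta, eps=1e-5):
    mu, var = z.mean(), z.var()
    zhat = (z - mu) / np.sqrt(var + eps)
    return gamma * zhat + beta, zhat, var

def bn_backward(dy, zhat, var, gamma, eps=1e-5):
    """Gradient w.r.t. the inputs z, given dL/dy for the whole batch."""
    B = dy.size
    d = dy * gamma                         # dL/dzhat
    return (B * d - d.sum() - zhat * (d * zhat).sum()) / (B * np.sqrt(var + eps))

# Finite-difference check on a scalar loss L = y . c
z = rng.normal(size=8); gamma, beta = 1.5, 0.3
c = rng.normal(size=8)
y, zhat, var = bn_forward(z, gamma, beta)
dz = bn_backward(c, zhat, var, gamma)

eps_fd = 1e-6
zp = z.copy(); zp[3] += eps_fd
yp, _, _ = bn_forward(zp, gamma, beta)
assert abs((yp @ c - y @ c) / eps_fd - dz[3]) < 1e-4
```

Note that the input gradients sum to (numerically) zero: shifting the whole batch by a constant does not change the normalized outputs.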
8. Python Simulation: Backprop & Optimizer Comparison
This simulation implements backpropagation from scratch and compares SGD, Momentum, and Adam optimizers on a nonlinear regression task.
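The executable version of the simulation is not reproduced here; a condensed sketch of the same experiment might look like the following (the 1-16-1 tanh network, the sin-fitting task, and all hyperparameters are assumptions for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

# Task: fit y = sin(x) with a 1-16-1 tanh network trained by backprop
X = rng.uniform(-3, 3, (256, 1)); Y = np.sin(X)

def init_params():
    return [rng.normal(0, 0.5, (1, 16)), np.zeros((1, 16)),
            rng.normal(0, 0.5, (16, 1)), np.zeros((1, 1))]

def loss_grads(p, xb, yb):
    W1, b1, W2, b2 = p
    h = np.tanh(xb @ W1 + b1)                 # forward pass
    pred = h @ W2 + b2
    d2 = 2.0 * (pred - yb) / len(xb)          # backward pass (MSE)
    d1 = (d2 @ W2.T) * (1 - h ** 2)
    g = [xb.T @ d1, d1.sum(0, keepdims=True),
         h.T @ d2, d2.sum(0, keepdims=True)]
    return float(np.mean((pred - yb) ** 2)), g

def train(opt_step, steps=3000, B=32):
    p = init_params()
    for t in range(1, steps + 1):
        idx = rng.integers(0, len(X), B)
        _, g = loss_grads(p, X[idx], Y[idx])
        p = opt_step(p, g, t)
    return loss_grads(p, X, Y)[0]

def make_sgd(eta=0.1):
    return lambda p, g, t: [w - eta * gw for w, gw in zip(p, g)]

def make_momentum(eta=0.01, beta=0.9):
    v = [0.0] * 4
    def step(p, g, t):
        nonlocal v
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        return [w - eta * vi for w, vi in zip(p, v)]
    return step

def make_adam(eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m, v = [0.0] * 4, [0.0] * 4
    def step(p, g, t):
        nonlocal m, v
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi ** 2 for vi, gi in zip(v, g)]
        return [w - eta * (mi / (1 - b1 ** t)) /
                (np.sqrt(vi / (1 - b2 ** t)) + eps)
                for w, mi, vi in zip(p, m, v)]
    return step

final = {name: train(opt) for name, opt in
         [("SGD", make_sgd()), ("Momentum", make_momentum()),
          ("Adam", make_adam())]}
print(final)   # final full-data MSE per optimizer
```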
9. Vanishing and Exploding Gradients
The Gradient Product Problem
The backpropagation recursion involves a product of weight matrices and activation derivatives:
$$\boldsymbol{\delta}^{(\ell)} = \left(\prod_{k=\ell}^{L-1} \text{diag}\!\left(\phi'(\mathbf{z}^{(k)})\right)(\mathbf{W}^{(k+1)})^T\right)\boldsymbol{\delta}^{(L)}$$
If the spectral norm $\|\text{diag}(\phi') \cdot \mathbf{W}^T\| < 1$ at each layer, gradients vanish exponentially with depth. If the norm $> 1$, gradients explode exponentially.
Why Sigmoid Causes Vanishing Gradients
The sigmoid derivative satisfies $\sigma'(z) \leq 0.25$ for all $z$. At each layer, the gradient is multiplied by at most 0.25. After $L$ layers:
$$\|\boldsymbol{\delta}^{(1)}\| \leq (0.25)^{L-1} \prod_{k=2}^{L}\|\mathbf{W}^{(k)}\| \; \|\boldsymbol{\delta}^{(L)}\|$$
For $L = 10$ layers with well-initialized weights ($\|\mathbf{W}\| \approx 1$), gradients shrink by a factor of $0.25^9 \approx 4 \times 10^{-6}$. This is why ReLU ($\phi'(z) = 1$ for $z > 0$) revolutionized deep learning.
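The effect is easy to demonstrate by pushing a unit error signal backward through random layers (the width, depth, and initialization here are assumptions for the sketch):

```python
import numpy as np
rng = np.random.default_rng(3)

def grad_norm_through_depth(phi_prime, L=10, width=64):
    """Backpropagate a unit error signal through L layers of random
    weights, applying phi'(z) elementwise for z ~ N(0, 1)."""
    delta = np.ones(width) / np.sqrt(width)        # unit-norm error signal
    for _ in range(L - 1):
        W = rng.normal(0, 1 / np.sqrt(width), (width, width))
        z = rng.normal(size=width)
        delta = (W.T @ delta) * phi_prime(z)       # one step of the recursion
    return np.linalg.norm(delta)

sigmoid_p = lambda z: np.exp(-z) / (1 + np.exp(-z)) ** 2
relu_p = lambda z: (z > 0).astype(float)
print("sigmoid:", grad_norm_through_depth(sigmoid_p))   # tiny
print("relu:   ", grad_norm_through_depth(relu_p))      # far larger
```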
Solutions
- ReLU activation: Gradient is exactly 1 for active neurons
- Residual connections: Skip connections add an identity path for gradients
- Batch normalization: Keeps pre-activations in a well-conditioned range
- Gradient clipping: Cap gradient norms to prevent explosion: $\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, c/\|\mathbf{g}\|)$
- Proper initialization: He/Xavier initialization maintains gradient variance across layers
- Layer normalization: Normalize across features rather than batch, used in transformers
10. Automatic Differentiation
Backpropagation is a specific case of reverse-mode automatic differentiation (AD), a general technique for computing derivatives of programs.
Forward Mode vs Reverse Mode
For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ composed of $T$ elementary operations:
| Property | Forward Mode | Reverse Mode |
|---|---|---|
| Computes | Jacobian-vector product $\mathbf{J}\mathbf{v}$ | Vector-Jacobian product $\mathbf{v}^T\mathbf{J}$ |
| Cost per pass | $O(T)$ | $O(T)$ |
| Full Jacobian | $n$ passes | $m$ passes |
| Best when | $n \ll m$ | $m \ll n$ (neural nets) |
AD Frameworks
Modern deep learning frameworks implement reverse-mode AD automatically:
- PyTorch: Dynamic computation graph (define-by-run), eager execution
- JAX: Functional transformations, supports forward and reverse mode, JIT compilation
- TensorFlow: Static computation graph (with eager mode), XLA compilation
These frameworks compute exact gradients (up to floating point) — not numerical approximations. The user only writes the forward pass; the backward pass is generated automatically.
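The core of such a framework fits in a few dozen lines. As a sketch (modeled on the scalar-autodiff idea, with illustrative names, not any framework's real API):

```python
import math

class Value:
    """Minimal reverse-mode AD on scalars."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents = parents          # (node, local_derivative) pairs

    def __add__(self, o):
        o = o if isinstance(o, Value) else Value(o)
        return Value(self.data + o.data, ((self, 1.0), (o, 1.0)))

    def __mul__(self, o):
        o = o if isinstance(o, Value) else Value(o)
        return Value(self.data * o.data, ((self, o.data), (o, self.data)))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, ((self, 1 - t * t),))

    def backward(self):
        # Topologically order the graph, then do one reverse chain-rule sweep
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v._parents: visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, local in v._parents:
                p.grad += v.grad * local   # accumulate vector-Jacobian products

# The user writes only the forward pass; backward() generates the gradients
x, w = Value(0.5), Value(-2.0)
y = (x * w).tanh()
y.backward()
print(x.grad)   # equals w * (1 - tanh(x*w)^2)
```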
Summary
- Backpropagation: Reverse-mode automatic differentiation; computes all gradients in $O(P)$ time
- Error signal: $\boldsymbol{\delta}^{(\ell)} = ((\mathbf{W}^{(\ell+1)})^T\boldsymbol{\delta}^{(\ell+1)}) \odot \phi'(\mathbf{z}^{(\ell)})$
- SGD: Unbiased gradient estimate from mini-batches; variance decreases as $1/B$
- Momentum: Accumulates gradient history; accelerates convergence along consistent directions
- Adam: Adaptive per-parameter learning rates with bias correction; the default optimizer in deep learning
- BatchNorm: Normalizes layer inputs; stabilizes training and enables higher learning rates