ML for Science/Part I: Foundations

Backpropagation & Optimization

The chain rule as an algorithm — from computational graphs to Adam and beyond

Introduction

Backpropagation (Rumelhart, Hinton & Williams, 1986) is the algorithm that makes deep learning possible. It efficiently computes the gradient of a loss function with respect to every parameter in a neural network by applying the chain rule of calculus in reverse through the computational graph. Combined with stochastic optimization methods, it enables training networks with millions (or billions) of parameters.

Key Topics

  • 1. The Chain Rule and Computational Graphs
  • 2. Deriving Backpropagation
  • 3. Stochastic Gradient Descent (SGD)
  • 4. Momentum
  • 5. Adam Optimizer
  • 6. Learning Rate Schedules
  • 7. Batch Normalization
  • 8. Python Simulation: Backprop & Optimizer Comparison
  • 9. Vanishing and Exploding Gradients
  • 10. Automatic Differentiation

1. The Chain Rule and Computational Graphs

Multivariate Chain Rule

If $y = f(u_1, u_2, \ldots, u_m)$ and each $u_k = g_k(x_1, \ldots, x_n)$, then:

$$\frac{\partial y}{\partial x_i} = \sum_{k=1}^{m} \frac{\partial y}{\partial u_k}\frac{\partial u_k}{\partial x_i}$$

In matrix form, this is the chain of Jacobians: if $\mathbf{y} = f(\mathbf{u})$ and $\mathbf{u} = g(\mathbf{x})$, then $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$.
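A quick numerical sanity check of the Jacobian chain: the sketch below compares the analytic gradient of a toy composition against a central finite-difference estimate (the function and matrix here are illustrative, not from the text):

```python
import numpy as np

# Toy composition: u = g(x) = A x,  y = f(u) = ||u||^2
A = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]])  # Jacobian of g

def g(x):
    return A @ x

def f(u):
    return np.sum(u ** 2)

x = np.array([0.7, -1.2])
u = g(x)

# Chain of Jacobians: dy/dx = (dy/du)(du/dx) = (2u^T) A
grad_analytic = 2 * u @ A

# Central finite-difference estimate of the same gradient
eps = 1e-6
grad_fd = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # True
```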

Computational Graph View

A neural network defines a directed acyclic graph (DAG) of operations. Each node computes a function of its inputs and passes the result to its children. The key insight of backprop is that we can compute gradients by passing "error signals" backward through this graph.

Forward pass: Compute all intermediate values from inputs to loss.

Backward pass: Starting from $\partial L/\partial L = 1$, propagate gradients backward using the chain rule at each node. Each node multiplies incoming gradient by its local Jacobian.

Forward vs Reverse Mode

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$:

  • Forward mode (tangent): Computes one column of the Jacobian per pass — efficient when $n \ll m$
  • Reverse mode (adjoint): Computes one row of the Jacobian per pass — efficient when $m \ll n$

Since neural network losses are scalar ($m=1$) with many parameters ($n \gg 1$), reverse mode (backpropagation) is optimal: it computes all $n$ partial derivatives in a single backward pass, at roughly the same cost as the forward pass.

2. Deriving Backpropagation

Consider a network with $L$ layers. Layer $\ell$ computes:

$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \quad \mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)})$$

Step 1: Define the Error Signal

For each layer $\ell$, define the error signal (upstream gradient):

$$\boxed{\boldsymbol{\delta}^{(\ell)} \equiv \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}}}$$

Step 2: Output Layer Error

For MSE loss $\mathcal{L} = \frac{1}{2}\|\mathbf{y} - \mathbf{h}^{(L)}\|^2$ with identity output activation:

$$\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} = (\mathbf{h}^{(L)} - \mathbf{y})$$

For cross-entropy loss with softmax output: $\boldsymbol{\delta}^{(L)} = \hat{\mathbf{p}} - \mathbf{y}$, where $\mathbf{y}$ is the one-hot label vector.

Step 3: Backward Recursion

For hidden layers $\ell = L-1, L-2, \ldots, 1$, apply the chain rule:

$$\boldsymbol{\delta}^{(\ell)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}} = \left(\frac{\partial \mathbf{h}^{(\ell)}}{\partial \mathbf{z}^{(\ell)}}\right)^T \left(\frac{\partial \mathbf{z}^{(\ell+1)}}{\partial \mathbf{h}^{(\ell)}}\right)^T \boldsymbol{\delta}^{(\ell+1)}$$

Since $\mathbf{z}^{(\ell+1)} = \mathbf{W}^{(\ell+1)}\mathbf{h}^{(\ell)} + \mathbf{b}^{(\ell+1)}$ and $\mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)})$:

$$\boxed{\boldsymbol{\delta}^{(\ell)} = \left((\mathbf{W}^{(\ell+1)})^T \boldsymbol{\delta}^{(\ell+1)}\right) \odot \phi'(\mathbf{z}^{(\ell)})}$$

where $\odot$ is element-wise multiplication (Hadamard product).

Step 4: Parameter Gradients

Once we have $\boldsymbol{\delta}^{(\ell)}$ for each layer, the parameter gradients are:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}(\mathbf{h}^{(\ell-1)})^T$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}$$

Each gradient is an outer product of the error signal and the input to that layer. This is why backprop requires storing all forward-pass activations — memory scales linearly with depth.
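Steps 1–4 can be sketched end-to-end for a tiny two-layer network with tanh hidden units, MSE loss, and an identity output activation, then checked against a finite difference (all sizes and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = rng.normal(size=3), rng.normal(size=2)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    z2 = W2 @ h1 + b2                  # identity output activation
    loss = 0.5 * np.sum((z2 - y) ** 2)
    return z1, h1, z2, loss

z1, h1, z2, loss = forward(W1, b1, W2, b2)

# Step 2: output error for MSE + identity output
delta2 = z2 - y
# Step 4: parameter gradients are outer products with stored activations
grad_W2, grad_b2 = np.outer(delta2, h1), delta2
# Step 3: backward recursion (tanh'(z) = 1 - tanh(z)^2)
delta1 = (W2.T @ delta2) * (1 - h1 ** 2)
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# Finite-difference check on a single weight
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
fd = (forward(Wp, b1, W2, b2)[-1] - forward(Wm, b1, W2, b2)[-1]) / (2 * eps)
print(abs(fd - grad_W1[0, 0]) < 1e-6)  # True
```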

Computational Complexity

For a network with $P$ total parameters:

  • Forward pass: $O(P)$ multiply-add operations
  • Backward pass: $O(P)$ multiply-add operations (roughly 2x forward)
  • Memory: Must store all $\mathbf{h}^{(\ell)}$ and $\mathbf{z}^{(\ell)}$ from the forward pass

Without backprop, computing gradients by finite differences would cost $O(P^2)$ — intractable for modern networks.

3. Stochastic Gradient Descent (SGD)

From Full-Batch to Mini-Batch

The full gradient over $n$ samples is:

$$\nabla\mathcal{L}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\nabla\ell_i(\mathbf{w})$$

SGD approximates this with a random mini-batch $\mathcal{B} \subset \{1,\ldots,n\}$ of size $B$:

$$\hat{\nabla}\mathcal{L}(\mathbf{w}) = \frac{1}{B}\sum_{i \in \mathcal{B}}\nabla\ell_i(\mathbf{w})$$

This is an unbiased estimate: $\mathbb{E}[\hat{\nabla}\mathcal{L}] = \nabla\mathcal{L}$.

The variance of the estimator is $\text{Var}(\hat{\nabla}\mathcal{L}) = \frac{\sigma^2}{B}$, where $\sigma^2$ is the per-sample gradient variance. Larger batches reduce variance but cost more computation per step.
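Both properties — unbiasedness and the $1/B$ variance scaling — are easy to verify empirically. In this sketch the per-sample "gradients" are just scalar random draws, which is all the argument needs:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
per_sample = rng.normal(loc=2.0, scale=3.0, size=n)  # scalar per-sample gradients
full_grad = per_sample.mean()

def minibatch_estimate(B, trials=5_000):
    """Mean and variance of the size-B mini-batch gradient estimator."""
    ests = np.array([
        per_sample[rng.choice(n, size=B, replace=False)].mean()
        for _ in range(trials)
    ])
    return ests.mean(), ests.var()

mean8, var8 = minibatch_estimate(8)
mean64, var64 = minibatch_estimate(64)
print(abs(mean8 - full_grad) < 0.1)  # unbiased
print(var8 / var64)                  # close to 64/8 = 8
```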

SGD Convergence

For convex functions with learning rate $\eta_t$:

  • Fixed LR: SGD converges to a neighborhood of the optimum, with radius $\propto \eta\sigma^2$
  • Decreasing LR: If $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, SGD converges exactly
  • Common schedule: $\eta_t = \eta_0 / (1 + \alpha t)$ satisfies both conditions

4. Momentum

Classical Momentum (Polyak, 1964)

Momentum accelerates SGD by accumulating an exponentially decaying moving average of past gradients:

$$\boxed{\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla\mathcal{L}(\mathbf{w}_t), \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta\mathbf{v}_{t+1}}$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically 0.9). The velocity $\mathbf{v}$ acts like physical momentum, building up speed along consistent gradient directions.

The effective learning rate along consistent directions is amplified by $1/(1-\beta)$. For $\beta = 0.9$, this is a 10x amplification. Oscillating components get damped.

Nesterov Accelerated Gradient (NAG)

Nesterov momentum evaluates the gradient at the "look-ahead" position:

$$\mathbf{v}_{t+1} = \beta\mathbf{v}_t + \nabla\mathcal{L}(\mathbf{w}_t - \eta\beta\mathbf{v}_t), \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \eta\mathbf{v}_{t+1}$$

This look-ahead correction has a provable benefit: for smooth, strongly convex problems, NAG needs $O(\sqrt{\kappa})$ iterations versus $O(\kappa)$ for plain gradient descent, where $\kappa$ is the condition number of the loss surface; classical momentum attains this accelerated rate only on quadratics.
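A minimal comparison of the three updates on an ill-conditioned quadratic $\mathcal{L}(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w}$ (the step size, momentum coefficient, and condition number are illustrative choices):

```python
import numpy as np

A = np.diag([1.0, 100.0])           # condition number kappa = 100
grad = lambda w: A @ w
loss = lambda w: 0.5 * w @ A @ w
eta, beta = 0.009, 0.9

def run(step, steps=100):
    w, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        w, v = step(w, v)
    return loss(w)

# Plain gradient descent (velocity unused)
gd = run(lambda w, v: (w - eta * grad(w), v))
# Classical momentum: v <- beta v + g(w), w <- w - eta v
mom = run(lambda w, v: (w - eta * (beta * v + grad(w)),
                        beta * v + grad(w)))
# Nesterov: gradient evaluated at the look-ahead point w - eta beta v
nag = run(lambda w, v: (w - eta * (beta * v + grad(w - eta * beta * v)),
                        beta * v + grad(w - eta * beta * v)))

print(gd, mom, nag)  # both momentum variants end far below plain GD
```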

5. Adam Optimizer

Adam (Adaptive Moment Estimation, Kingma & Ba 2015) combines momentum with per-parameter adaptive learning rates using second moment estimates.

The Adam Algorithm

Initialize $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{v}_0 = \mathbf{0}$, $t = 0$. At each step:

$$\mathbf{g}_t = \nabla\mathcal{L}(\mathbf{w}_t) \quad \text{(compute gradient)}$$
$$\mathbf{m}_t = \beta_1\mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t \quad \text{(first moment estimate)}$$
$$\mathbf{v}_t = \beta_2\mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2 \quad \text{(second moment estimate)}$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$
$$\boxed{\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 0.001$.
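The five update equations map line-for-line onto code. A minimal sketch, here driving a toy quadratic (the objective, step count, and learning rate are illustrative):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * g           # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2 from a distant start
w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 5001):
    g = 2 * w                                  # gradient of ||w||^2
    w, m, v = adam_step(w, g, m, v, t, eta=0.01)
print(np.abs(w).max())  # near zero
```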

Why Bias Correction?

Since $\mathbf{m}_0 = \mathbf{0}$, early estimates are biased toward zero:

$$\mathbb{E}[\mathbf{m}_t] = (1-\beta_1^t)\mathbb{E}[\mathbf{g}_t]$$

Dividing by $(1-\beta_1^t)$ corrects this. For $\beta_1 = 0.9$, the correction is significant for the first ~10 steps, then becomes negligible.

AdamW: Decoupled Weight Decay

Loshchilov & Hutter (2019) showed that L2 regularization and weight decay are not equivalent with Adam. The correct formulation is:

$$\mathbf{w}_{t+1} = (1 - \eta\lambda)\mathbf{w}_t - \eta\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$$

This decouples the weight decay from the adaptive gradient, leading to better generalization. AdamW is the default optimizer for modern transformers.
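A single-step illustration of the difference, with made-up moment values (everything here is illustrative): in AdamW the decay bypasses the adaptive denominator, while Adam-with-L2 rescales it by $1/\sqrt{\hat{v}_t}$.

```python
import numpy as np

eta, lam, eps = 0.1, 0.01, 1e-8
w = np.array([1.0])
m_hat = np.array([0.5])   # illustrative bias-corrected first moment
v_hat = np.array([4.0])   # illustrative bias-corrected second moment

# AdamW: decay applied directly to the weights
w_adamw = (1 - eta * lam) * w - eta * m_hat / (np.sqrt(v_hat) + eps)

# Adam + L2: lam * w folded into the gradient, then rescaled adaptively
w_adam_l2 = w - eta * (m_hat + lam * w) / (np.sqrt(v_hat) + eps)

print(w_adamw, w_adam_l2)  # the two updates give different weights here
```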

6. Learning Rate Schedules

Common Schedules

  • Step Decay:
    $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$

    Reduce LR by factor $\gamma$ every $s$ epochs

  • Cosine Annealing:
    $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$

    Smooth decay to $\eta_{\min}$ over $T$ steps

  • Linear Warmup + Decay:
    $\eta_t = \begin{cases} \eta_0 \cdot t/T_w & t < T_w \\ \eta_0 \cdot (T-t)/(T-T_w) & t \geq T_w \end{cases}$

    Used in transformer training; warmup stabilizes early training with Adam

  • Exponential Decay:
    $\eta_t = \eta_0 \cdot e^{-\alpha t}$

    Continuous version of step decay
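The four schedules as plain functions of the step $t$ (hyperparameter values are illustrative):

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, s=10):
    return eta0 * gamma ** (t // s)

def cosine_annealing(t, eta0=0.1, eta_min=1e-4, T=100):
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def warmup_linear_decay(t, eta0=0.1, T_w=10, T=100):
    if t < T_w:
        return eta0 * t / T_w            # linear warmup
    return eta0 * (T - t) / (T - T_w)    # linear decay to zero

def exponential_decay(t, eta0=0.1, alpha=0.01):
    return eta0 * math.exp(-alpha * t)

print(step_decay(25))                    # 0.1 * 0.5^2 = 0.025
print(round(cosine_annealing(100), 6))   # decays to eta_min = 0.0001
print(warmup_linear_decay(5))            # halfway through warmup: 0.05
```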

7. Batch Normalization

Batch normalization (Ioffe & Szegedy, 2015) normalizes layer inputs across the mini-batch, stabilizing training and enabling higher learning rates.

The BatchNorm Transform

For a mini-batch $\{z_i\}_{i=1}^{B}$ at a given layer:

$$\mu_B = \frac{1}{B}\sum_{i=1}^{B} z_i, \quad \sigma_B^2 = \frac{1}{B}\sum_{i=1}^{B}(z_i - \mu_B)^2$$
$$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$\boxed{\text{BN}(z_i) = \gamma\hat{z}_i + \beta}$$

where $\gamma$ and $\beta$ are learnable scale and shift parameters that restore the network's representational capacity.

Backprop Through BatchNorm

The gradient through BatchNorm requires careful derivation. Let $\delta_i = \partial\mathcal{L}/\partial\hat{z}_i$:

$$\frac{\partial\mathcal{L}}{\partial z_i} = \frac{1}{\sqrt{\sigma_B^2+\epsilon}}\left(\delta_i - \frac{1}{B}\sum_j\delta_j - \frac{\hat{z}_i}{B}\sum_j\delta_j\hat{z}_j\right)$$

This ensures gradients are properly normalized, preventing both vanishing and exploding gradients.
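This formula can be checked against a finite-difference gradient. The sketch below omits $\gamma$ and $\beta$ (i.e. $\gamma = 1$, $\beta = 0$) and uses a toy linear loss on $\hat{z}$; everything else is the normalization exactly as defined above:

```python
import numpy as np

rng = np.random.default_rng(2)
B, eps = 8, 1e-5
z = rng.normal(size=B)
w = rng.normal(size=B)             # weights of a toy downstream loss

def normalize(z):
    return (z - z.mean()) / np.sqrt(z.var() + eps)  # biased (1/B) variance

def loss(z):
    return w @ normalize(z)        # L = w . z_hat

z_hat = normalize(z)
delta = w                          # dL/dz_hat for this linear loss

# Closed-form backward pass through the normalization
grad = (delta - delta.mean() - z_hat * (delta * z_hat).mean()) \
       / np.sqrt(z.var() + eps)

# Finite-difference check
h = 1e-6
fd = np.array([(loss(z + h * e) - loss(z - h * e)) / (2 * h)
               for e in np.eye(B)])
print(np.allclose(grad, fd, atol=1e-5))  # True
```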

8. Python Simulation: Backprop & Optimizer Comparison

This simulation implements backpropagation from scratch and compares SGD, Momentum, and Adam optimizers on a nonlinear regression task.

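The interactive script itself is not reproduced here, but its idea fits in a short standalone sketch: backprop by hand for a one-hidden-layer tanh network fitting $y = \sin(x)$, trained with each of the three optimizers (architecture, learning rates, and step counts are illustrative choices, not the original script's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(256, 1))
Y = np.sin(X)

def init():
    return {"W1": rng.normal(0, 1.0, (1, 32)), "b1": np.zeros(32),
            "W2": rng.normal(0, 0.3, (32, 1)), "b2": np.zeros(1)}

def loss_and_grads(p):
    h1 = np.tanh(X @ p["W1"] + p["b1"])           # forward pass
    pred = h1 @ p["W2"] + p["b2"]
    d2 = (pred - Y) / len(X)                      # output error (MSE)
    d1 = (d2 @ p["W2"].T) * (1 - h1 ** 2)         # backward recursion
    grads = {"W1": X.T @ d1, "b1": d1.sum(0),
             "W2": h1.T @ d2, "b2": d2.sum(0)}
    return 0.5 * np.mean((pred - Y) ** 2), grads

def sgd(p, g, s, t, eta=0.05):
    for k in p:
        p[k] -= eta * g[k]

def momentum(p, g, s, t, eta=0.01, beta=0.9):
    for k in p:
        s[k] = beta * s.get(k, 0) + g[k]
        p[k] -= eta * s[k]

def adam(p, g, s, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    for k in p:
        s[k, "m"] = b1 * s.get((k, "m"), 0) + (1 - b1) * g[k]
        s[k, "v"] = b2 * s.get((k, "v"), 0) + (1 - b2) * g[k] ** 2
        m_hat = s[k, "m"] / (1 - b1 ** t)
        v_hat = s[k, "v"] / (1 - b2 ** t)
        p[k] -= eta * m_hat / (np.sqrt(v_hat) + eps)

def train(update, steps=2000):
    p, s = init(), {}
    for t in range(1, steps + 1):
        _, g = loss_and_grads(p)
        update(p, g, s, t)
    return loss_and_grads(p)[0]

results = {name: train(u) for name, u in
           [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]}
for name, final in results.items():
    print(f"{name:8s} final loss: {final:.5f}")
```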

9. Vanishing and Exploding Gradients

The Gradient Product Problem

The backpropagation recursion involves a product of matrices and activation derivatives:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(1)}} = \left(\prod_{\ell=1}^{L-1} \text{diag}(\phi'(\mathbf{z}^{(\ell)})) \cdot \mathbf{W}^{(\ell+1)T}\right) \boldsymbol{\delta}^{(L)}$$

If the spectral norm $\|\text{diag}(\phi') \cdot \mathbf{W}^T\| < 1$ at each layer, gradients vanish exponentially with depth. If the norm $> 1$, gradients explode exponentially.

Why Sigmoid Causes Vanishing Gradients

The sigmoid derivative satisfies $\sigma'(z) \leq 0.25$ for all $z$. At each layer, the gradient is multiplied by at most 0.25. After $L$ layers:

$$\|\boldsymbol{\delta}^{(1)}\| \leq 0.25^{L-1} \cdot \|\mathbf{W}\|^{L-1} \cdot \|\boldsymbol{\delta}^{(L)}\|$$

For $L = 10$ layers with well-initialized weights ($\|\mathbf{W}\| \approx 1$), gradients shrink by a factor of $0.25^9 \approx 4 \times 10^{-6}$. This is why ReLU ($\phi'(z) = 1$ for $z > 0$) revolutionized deep learning.
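The shrinkage is easy to observe numerically. This sketch pushes a unit-norm error signal backward through random layers whose weights are scaled to be roughly norm-preserving (depth, width, and the stand-in pre-activations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def backprop_norm(phi_prime, L=10, width=64):
    """Norm of delta^(1) after backpropagating through L-1 random layers."""
    delta = rng.normal(size=width)
    delta /= np.linalg.norm(delta)
    for _ in range(L - 1):
        W = rng.normal(0, 1 / np.sqrt(width), size=(width, width))
        z = rng.normal(size=width)             # stand-in pre-activations
        delta = (W.T @ delta) * phi_prime(z)   # the backward recursion
    return np.linalg.norm(delta)

sigmoid_prime = lambda z: np.exp(-z) / (1 + np.exp(-z)) ** 2
relu_prime = lambda z: (z > 0).astype(float)

sig = backprop_norm(sigmoid_prime)
rel = backprop_norm(relu_prime)
print(f"sigmoid: {sig:.2e}   relu: {rel:.2e}")  # sigmoid many orders smaller
```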

Solutions

  • ReLU activation: Gradient is exactly 1 for active neurons
  • Residual connections: Skip connections add an identity path for gradients
  • Batch normalization: Keeps pre-activations in a well-conditioned range
  • Gradient clipping: Cap gradient norms to prevent explosion: $\mathbf{g} \leftarrow \mathbf{g} \cdot \min(1, c/\|\mathbf{g}\|)$
  • Proper initialization: He/Xavier initialization maintains gradient variance across layers
  • Layer normalization: Normalize across features rather than batch, used in transformers
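Of these, gradient clipping is simple enough to show inline — a sketch of the norm-clipping rule from the list above:

```python
import numpy as np

def clip_by_norm(g, c):
    """Rescale g to norm at most c, leaving its direction unchanged."""
    norm = np.linalg.norm(g)
    return g * min(1.0, c / norm) if norm > 0 else g

print(clip_by_norm(np.array([3.0, 4.0]), 1.0))   # norm 5 -> scaled to norm 1
print(clip_by_norm(np.array([0.1, 0.0]), 1.0))   # already small -> unchanged
```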

10. Automatic Differentiation

Backpropagation is a specific case of reverse-mode automatic differentiation (AD), a general technique for computing derivatives of programs.

Forward Mode vs Reverse Mode

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ composed of $T$ elementary operations:

| Property | Forward Mode | Reverse Mode |
| --- | --- | --- |
| Computes | Jacobian-vector product $\mathbf{J}\mathbf{v}$ | Vector-Jacobian product $\mathbf{v}^T\mathbf{J}$ |
| Cost per pass | $O(T)$ | $O(T)$ |
| Full Jacobian | $n$ passes | $m$ passes |
| Best when | $n \ll m$ | $m \ll n$ (neural nets) |

AD Frameworks

Modern deep learning frameworks implement reverse-mode AD automatically:

  • PyTorch: Dynamic computation graph (define-by-run), eager execution
  • JAX: Functional transformations, supports forward and reverse mode, JIT compilation
  • TensorFlow: Static computation graph (with eager mode), XLA compilation

These frameworks compute exact gradients (up to floating point) — not numerical approximations. The user only writes the forward pass; the backward pass is generated automatically.

Summary

  • Backpropagation: Reverse-mode automatic differentiation; computes all gradients in $O(P)$ time
  • Error signal: $\boldsymbol{\delta}^{(\ell)} = ((\mathbf{W}^{(\ell+1)})^T\boldsymbol{\delta}^{(\ell+1)}) \odot \phi'(\mathbf{z}^{(\ell)})$
  • SGD: Unbiased gradient estimate from mini-batches; variance decreases as $1/B$
  • Momentum: Accumulates gradient history; accelerates convergence along consistent directions
  • Adam: Adaptive per-parameter learning rates with bias correction; the default optimizer in deep learning
  • BatchNorm: Normalizes layer inputs; stabilizes training and enables higher learning rates