Machine Learning

Part III: Neural Networks

Neural networks are universal function approximators (given enough hidden units) built by composing simple parametric transformations. This part derives everything from scratch: the perceptron learning rule, the full backpropagation algorithm, the engineering innovations that make deep networks trainable, and the convolutional inductive bias that powered the deep learning revolution in vision.
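As a taste of where the part begins, here is a minimal sketch of the perceptron learning rule mentioned above. The toy task (learning logical AND, which is linearly separable) and the learning rate are illustrative choices, not part of the text:

```python
import numpy as np

# Toy linearly separable task: logical AND (illustrative choice).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate (illustrative)

for epoch in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        # Perceptron rule: update the boundary only on mistakes.
        w += lr * (yi - pred) * xi
        b += lr * (yi - pred)

preds = (X @ w + b > 0).astype(int)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates; XOR, covered in the chapters that follow, is exactly the case where this guarantee fails and a hidden layer becomes necessary.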

What you will learn

Derive backpropagation from the chain rule of calculus
Implement a neural network for XOR from scratch in NumPy
Explain why sigmoid activations cause vanishing gradients
Derive the BatchNorm normalisation, scale, and shift updates
Prove that residual connections preserve gradient magnitude
Derive Xavier and He weight initialisation from variance arguments
Implement 2D convolution from scratch and apply edge-detection kernels
Understand the design evolution from LeNet to EfficientNet
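Several of these objectives come together in one small program. The sketch below previews the from-scratch XOR network: a 2-layer sigmoid MLP trained with backpropagation derived from the chain rule. The hidden width, learning rate, step count, and random seed are all illustrative assumptions; the chapters derive each update in full:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (assumption)

# XOR is not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2-4-1 network; width and learning rate are illustrative choices.
W1 = rng.normal(0.0, 1.0, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 1.0, (4, 1))
b2 = np.zeros((1, 1))
lr = 1.0

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule applied layer by layer
    # (squared-error loss; sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

preds = (out > 0.5).astype(int).ravel()
```

Note the `h * (1 - h)` factor in the backward pass: it is never larger than 0.25, which is precisely the mechanism behind the vanishing-gradient objective above.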

Prerequisites

Part I (linear algebra, calculus, probability) and Part II (supervised learning, gradient descent). You should be comfortable with matrix calculus and the concept of a loss function before beginning.