Machine Learning

Part VI: Sequence Models

Language, speech, time series, and video are all sequences: order matters and context accumulates. This part traces the evolution of sequence modelling from recurrent networks through the attention revolution to the large language models that now power modern AI. Every architecture is derived from first principles, with the mathematical intuition for why each innovation was necessary.

What you will learn

✓ Derive backpropagation through time (BPTT) and explain the vanishing gradient problem
✓ Write out all four LSTM gate equations and explain what each gate controls
✓ Derive scaled dot-product attention and prove why scaling by √d_k stabilises training
✓ Build a multi-head attention layer and understand its role in the Transformer
✓ Construct sinusoidal positional encodings and explain their properties
✓ Distinguish GPT (causal) from BERT (bidirectional) training objectives
✓ Explain the empirical scaling laws for language models
✓ Describe the RLHF pipeline: reward model, PPO, and alignment
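As a taste of the derivations ahead, the scaled dot-product attention objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's reference implementation; the function name and the toy shapes are chosen here for demonstration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # With unit-variance entries, each dot product q·k has variance d_k;
    # dividing by sqrt(d_k) keeps the logits O(1), so the softmax stays
    # out of its saturated, vanishing-gradient regime.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))   # 4 queries, d_k = 64
K = rng.standard_normal((4, 64))   # 4 keys
V = rng.standard_normal((4, 8))    # 4 values, d_v = 8
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one d_v-dimensional output per query
```

Passing the identity matrix as `V` returns the attention weights themselves, which is a handy way to check that each row is a valid probability distribution.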

Prerequisites

Part III: Deep Learning
Backpropagation, chain rule, feedforward networks, batch normalisation
Part I: Linear Algebra
Matrix multiplication, softmax, and the geometry of dot products
Part I: Probability
Cross-entropy, KL divergence, and language model perplexity