Part VI: Sequence Models
Language, speech, time series, and video are all sequences: order matters and context accumulates. This part traces the evolution of sequence modelling from recurrent networks through the attention revolution to the large language models that now power modern AI. Every architecture is derived from first principles, with the mathematical intuition for why each innovation was necessary.
Chapter 16: Recurrent Networks
RNNs for sequential data, the backpropagation-through-time (BPTT) derivation of vanishing gradients, LSTM gates with full equations, GRU simplification, and bidirectional architectures.
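As a preview of the LSTM material, here is a minimal NumPy sketch of a single LSTM time step. The stacked gate ordering (input, forget, candidate, output) and the function names are assumptions of this sketch, not a fixed convention; the chapter derives the full equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,).
    Assumed gate order in the stacked weights: i, f, g, o."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[0*H:1*H])        # input gate: how much new info to write
    f = sigmoid(z[1*H:2*H])        # forget gate: how much old state to keep
    g = np.tanh(z[2*H:3*H])        # candidate cell state
    o = sigmoid(z[3*H:4*H])        # output gate: how much state to expose
    c = f * c_prev + i * g         # additive update: gradients flow through f
    h = o * np.tanh(c)             # new hidden state
    return h, c

# Run a short random sequence through the cell.
rng = np.random.default_rng(0)
D, H = 5, 4
W = rng.normal(scale=0.1, size=(4*H, D))
U = rng.normal(scale=0.1, size=(4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for t in range(3):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```

The additive cell update `c = f * c_prev + i * g` is the key difference from a plain RNN: the gradient path through `c` is gated rather than repeatedly squashed, which is what mitigates vanishing gradients.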
Chapter 17: Attention & Transformers
Scaled dot-product attention derived from first principles, multi-head attention, sinusoidal positional encoding, and the full Transformer encoder-decoder.
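The core operation of Chapter 17 fits in a few lines. A minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with the max-subtraction trick for numerical stability (the helper name is ours):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity logits, (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights                   # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 4))
out, w = attention(Q, K, V)
```

Multi-head attention simply runs several such maps in parallel on learned projections of Q, K, and V, then concatenates the results.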
Chapter 18: Large Language Models
Autoregressive language modelling, GPT and BERT architectures, scaling laws, RLHF training pipeline, emergent abilities, and tokenisation.
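The autoregressive decoding loop at the heart of Chapter 18 can be sketched independently of any particular model: each new token is chosen conditioned on everything generated so far. The `toy_logits` model below is purely hypothetical, standing in for a trained network:

```python
import numpy as np

def generate(logits_fn, start, n_steps):
    """Greedy autoregressive decoding: repeatedly score the prefix
    and append the most likely next token."""
    seq = list(start)
    for _ in range(n_steps):
        logits = logits_fn(seq)            # model scores the current prefix
        seq.append(int(np.argmax(logits))) # greedy: take the argmax token
    return seq

# Hypothetical toy 'model': always prefers token (last + 1) mod V.
V = 5
def toy_logits(seq):
    logits = np.zeros(V)
    logits[(seq[-1] + 1) % V] = 1.0
    return logits

out = generate(toy_logits, [0], 4)
# out == [0, 1, 2, 3, 4]
```

Replacing the argmax with sampling from the softmax distribution (optionally with temperature or top-k truncation) gives the stochastic decoding strategies used in practice.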