Machine Learning

Part VI: Sequence Models

Language, speech, time series, and video are all sequences: order matters and context accumulates. This part traces the evolution of sequence modelling from recurrent networks through the attention revolution to the large language models that now power modern AI. Every architecture is derived from first principles, with the mathematical intuition for why each innovation was necessary.

What you will learn

✓ Derive backpropagation through time (BPTT) and explain the vanishing gradient problem
✓ Write out all four LSTM gate equations and explain what each gate controls
✓ Derive scaled dot-product attention and prove why scaling by √d_k stabilises training
✓ Build a multi-head attention layer and understand its role in the Transformer
✓ Construct sinusoidal positional encodings and explain their properties
✓ Distinguish GPT (causal) from BERT (bidirectional) training objectives
✓ Explain the empirical scaling laws for language models
✓ Describe the RLHF pipeline: reward model, PPO, and alignment
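As a taste of the derivations ahead, the scaled dot-product attention objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's reference implementation; the function name and the toy shapes are chosen here for demonstration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # With unit-variance entries, each dot product q·k has variance d_k;
    # dividing by sqrt(d_k) keeps the logits O(1), so the softmax stays
    # out of its saturated, vanishing-gradient regime.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))   # 4 queries, d_k = 64
K = rng.standard_normal((4, 64))   # 4 keys
V = rng.standard_normal((4, 8))    # 4 values, d_v = 8
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one d_v-dimensional output per query
```

Passing the identity matrix as `V` returns the attention weights themselves, which is a handy way to check that each row is a valid probability distribution.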

Prerequisites

Part III: Deep Learning
Backpropagation, chain rule, feedforward networks, batch normalisation
Part I: Linear Algebra
Matrix multiplication, softmax, and the geometry of dot products
Part I: Probability
Cross-entropy, KL divergence, and language model perplexity