Chapter 17: Attention & Transformers
The attention mechanism broke the bottleneck of fixed-length context vectors; the Transformer architecture replaced recurrence entirely with attention. “Attention Is All You Need” (Vaswani et al., 2017) is now the foundation of virtually every state-of-the-art NLP, vision, and multimodal system.
1. Motivation: Breaking the Information Bottleneck
In seq2seq RNNs, the encoder must compress an entire sentence into one fixed-length vector, so information loss is inevitable for long sequences. Attention lets the decoder look directly at all encoder hidden states \(\mathbf{h}_1, \dots, \mathbf{h}_T\), computing a weighted sum based on relevance to the current decoder state \(\mathbf{s}_t\):

\(\mathbf{c}_t = \sum_{s=1}^{T} \alpha_{t,s}\,\mathbf{h}_s, \qquad \alpha_{t,s} = \frac{\exp\big(\mathrm{score}(\mathbf{s}_t, \mathbf{h}_s)\big)}{\sum_{s'=1}^{T}\exp\big(\mathrm{score}(\mathbf{s}_t, \mathbf{h}_{s'})\big)}\)
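This weighted sum can be sketched in a few lines of NumPy. The sizes, the random encoder/decoder states, and the dot-product score function are illustrative assumptions, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T = 5 encoder hidden states and one decoder state, d = 8.
T, d = 5, 8
encoder_states = rng.normal(size=(T, d))   # h_1 ... h_T
decoder_state = rng.normal(size=(d,))      # s_t

# Dot-product relevance scores, one per encoder state.
scores = encoder_states @ decoder_state            # shape (T,)
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights

# Context vector: weighted sum of ALL encoder states, no fixed bottleneck.
context = weights @ encoder_states                 # shape (d,)

print(weights.round(3), context.shape)
```

The decoder sees a different context vector at every step, because the weights are recomputed against each new decoder state.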
2. Scaled Dot-Product Attention: Full Derivation
The Transformer generalises attention to three distinct roles. Given input matrix \(\mathbf{X} \in \mathbb{R}^{T \times d}\), project to queries, keys, and values:

\(\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^V\)

The attention output is:

\(\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}\)
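A minimal NumPy sketch of this formula follows; the projection matrices here are random stand-ins for weights that would be learned in a real model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T_q, T_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # output (T_q, d_v), weights

rng = np.random.default_rng(42)
T, d = 4, 16
X = rng.normal(size=(T, d))
# Hypothetical projections W_Q, W_K, W_V (randomly initialised for illustration).
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out, w = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, w.shape)
```

Each row of `w` is a probability distribution over the keys, so every output row is a convex combination of the value rows.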
2.1 Why scale by \(\sqrt{d_k}\)?
Suppose \(\mathbf{q}, \mathbf{k} \sim \mathcal{N}(\mathbf{0}, I)\) independently. The dot product is:

\(\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i\)
Each term \(q_i k_i\) has mean 0 and variance 1, so \(\mathrm{Var}(\mathbf{q}\cdot\mathbf{k}) = d_k\). Without scaling, for large \(d_k\) (e.g., 512), the dot products have standard deviation \(\sqrt{d_k} \approx 22\). These large values push softmax into saturation regions where gradients are near zero.
Scaling by \(1/\sqrt{d_k}\) restores unit variance: \(\mathrm{Var}(\mathbf{q}\cdot\mathbf{k}/\sqrt{d_k}) = 1\). The softmax operates in a stable gradient region, and training converges reliably.
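The variance argument is easy to check empirically. This sketch samples many independent standard-normal query/key pairs and compares the variance of their dot products with and without the \(1/\sqrt{d_k}\) factor:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 512, 100_000

# n independent query/key pairs with q, k ~ N(0, I).
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
dots = np.einsum('nd,nd->n', q, k)   # row-wise dot products

print(f"unscaled var ~ {dots.var():.1f}  (theory: d_k = {d_k})")
print(f"scaled   var ~ {(dots / np.sqrt(d_k)).var():.3f}  (theory: 1)")
```

The unscaled variance comes out near 512 (standard deviation about 22), while the scaled version sits near 1, exactly as the derivation predicts.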
3. Multi-Head Attention
Rather than computing one attention function, project to \(h\) different \((Q, K, V)\) subspaces, attend in each, and concatenate:

\(\mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,\mathbf{W}^O, \qquad \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{K}\text{-free notation omitted}, \mathbf{V}\mathbf{W}_i^V)\)
With \(d_k = d_v = d_{\rm model}/h\), the total computation cost matches a single-head model. Different heads learn to attend to different types of relationships simultaneously (syntactic, semantic, positional).
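The split-project-concatenate pattern can be sketched as follows. The weight matrices are random stand-ins for learned parameters, and all heads are computed in one batched einsum-free pass by folding the head dimension into the array shape:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention sketch: one big projection, split into h heads."""
    T, d_model = X.shape
    d_k = d_model // h
    Q = (X @ W_Q).reshape(T, h, d_k).transpose(1, 0, 2)  # (h, T, d_k)
    K = (X @ W_K).reshape(T, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(T, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, T, T), per-head
    heads = softmax(scores) @ V                          # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_O                                  # (T, d_model)

rng = np.random.default_rng(1)
T, d_model, h = 6, 32, 4
X = rng.normal(size=(T, d_model))
# Hypothetical learned matrices, randomly initialised here.
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)
```

Note that with \(d_k = d_{\rm model}/h\) the total matrix-multiply cost is the same as one full-width head, matching the claim above.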
4. Sinusoidal Positional Encoding
Attention is permutation-equivariant: it treats the sequence as a bag of tokens. Positional information is injected by adding positional encodings to the token embeddings:

\(PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\rm model}}}\right), \qquad PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\rm model}}}\right)\)
This choice has a key property: \(PE(\text{pos}+k)\) can be expressed as a linear function of \(PE(\text{pos})\) for any fixed offset \(k\), enabling the model to easily learn relative positions. Additionally, sinusoidal encodings generalise to sequence lengths unseen during training.
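A short NumPy sketch of the encoding, with the max length and model width chosen arbitrarily for illustration. It also checks the similarity property mentioned in section 7: nearby positions get more similar encodings than distant ones:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) frequency index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims: sine
    pe[:, 1::2] = np.cos(angles)             # odd dims: cosine
    return pe

pe = sinusoidal_pe(128, 64)
# Cosine similarity between positions decays with distance.
unit = pe / np.linalg.norm(pe, axis=1, keepdims=True)
sim = unit @ unit.T
print(pe.shape, sim[10, 11] > sim[10, 50])
```

Because any max length can be passed in, the same function covers sequences longer than those seen in training, which is the extrapolation property noted above.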
5. Transformer Architecture
Each Transformer encoder layer consists of:
- Multi-head self-attention: each token attends to all others
- Add & LayerNorm: residual connection + layer normalisation
- Position-wise FFN: \(\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}W_1+\mathbf{b}_1)W_2+\mathbf{b}_2\) applied independently to each position
- Add & LayerNorm again
Each decoder layer adds a third sub-layer: cross-attention over the encoder output.
5.1 Layer Normalisation
LayerNorm normalises across the feature dimension (not the batch), making it independent of batch size: \(\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \varepsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}\), where \(\mu\) and \(\sigma^2\) are the mean and variance of the features of \(\mathbf{x}\). Pre-norm (normalise before each sub-layer, GPT-2 style) trains more stably than the original post-norm for very deep models.
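A minimal sketch of the formula, with \(\boldsymbol{\gamma}\) and \(\boldsymbol{\beta}\) set to their usual ones/zeros initialisation rather than learned values:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each row (token) across its own features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 16))  # 4 tokens, 16 features each
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))  # per-row ~0 mean, ~1 std
```

Every token is normalised independently, so the result is identical whether the batch holds one sequence or a thousand, unlike BatchNorm.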
6. Transformer Encoder-Decoder Diagram
7. Python: Self-Attention from Scratch
We implement scaled dot-product attention and multi-head attention from scratch using NumPy. Visualisations show: (1) full attention weight matrix, (2) causal masked attention, (3) multi-head diversity, (4) the scaling variance argument, (5) sinusoidal positional encoding heatmap, and (6) PE cosine similarity showing nearby positions are more similar.