Machine Learning/Part VI: Sequence Models/Chapter 17

Chapter 17: Attention & Transformers

The attention mechanism broke the bottleneck of fixed-length context vectors; the Transformer architecture replaced recurrence entirely with attention. “Attention Is All You Need” (Vaswani et al., 2017) is now the foundation of virtually every state-of-the-art NLP, vision, and multimodal system.

1. Motivation: Breaking the Information Bottleneck

In seq2seq RNNs, the encoder must compress an entire sentence into one fixed-length vector — information loss is inevitable for long sequences. Attention lets the decoder look directly at all encoder hidden states, computing a weighted sum based on relevance:

\[ \mathbf{c}_t = \sum_{s=1}^{T_{\rm enc}} \alpha_{ts}\,\mathbf{h}_s^{\rm enc}, \qquad \alpha_{ts} = \frac{\exp(e_{ts})}{\sum_{s'}\exp(e_{ts'})}, \quad e_{ts} = \text{score}(\mathbf{h}_{t-1}^{\rm dec}, \mathbf{h}_s^{\rm enc}) \]

2. Scaled Dot-Product Attention: Full Derivation

The Transformer generalises attention to three distinct roles. Given input matrix \(\mathbf{X} \in \mathbb{R}^{T \times d}\), project to queries, keys, and values:

\[ \mathbf{Q} = \mathbf{X}W_Q \in \mathbb{R}^{T \times d_k}, \quad \mathbf{K} = \mathbf{X}W_K \in \mathbb{R}^{T \times d_k}, \quad \mathbf{V} = \mathbf{X}W_V \in \mathbb{R}^{T \times d_v} \]

The attention output is:

\[ \text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V} \]

2.1 Why scale by \(\sqrt{d_k}\)?

Suppose \(\mathbf{q}, \mathbf{k} \sim \mathcal{N}(\mathbf{0}, I)\) independently. The dot product is:

\[ \mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i \]

Each term \(q_i k_i\) has mean 0 and variance 1, so \(\mathrm{Var}(\mathbf{q}\cdot\mathbf{k}) = d_k\). Without scaling, for large \(d_k\) (e.g., 512), the dot products have standard deviation \(\sqrt{d_k} \approx 22\). These large values push softmax into saturation regions where gradients are near zero.

Scaling by \(1/\sqrt{d_k}\) restores unit variance: \(\mathrm{Var}(\mathbf{q}\cdot\mathbf{k}/\sqrt{d_k}) = 1\). The softmax operates in a stable gradient region, and training converges reliably.

3. Multi-Head Attention

Rather than computing one attention function, project to \(h\) different \((Q,K,V)\) subspaces and concatenate:

\[ \text{MultiHead}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)\,W_O \]\[ \text{head}_i = \text{Attention}(\mathbf{Q}W_Q^{(i)},\; \mathbf{K}W_K^{(i)},\; \mathbf{V}W_V^{(i)}) \]

With \(d_k = d_v = d_{\rm model}/h\), the total computation cost matches a single-head model. Different heads learn to attend to different types of relationships simultaneously (syntactic, semantic, positional).

4. Sinusoidal Positional Encoding

Attention is permutation-equivariant — it treats the sequence as a bag of tokens. Positional information is injected by adding positional encodings to the token embeddings:

\[ PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\rm model}}}\right) \]\[ PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\rm model}}}\right) \]

This choice has a key property: \(PE(\text{pos}+k)\) can be expressed as a linear function of \(PE(\text{pos})\) for any fixed offset \(k\), enabling the model to easily learn relative positions. Additionally, sinusoidal encodings generalise to sequence lengths unseen during training.

5. Transformer Architecture

Each Transformer encoder layer consists of:

Multi-head self-attention: each token attends to all others
Add & LayerNorm: residual connection + layer normalisation
Position-wise FFN: \(\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}W_1+\mathbf{b}_1)W_2+\mathbf{b}_2\) applied independently to each position
Add & LayerNorm again

Each decoder layer adds a third sub-layer: cross-attention over the encoder output.

5.1 Layer Normalisation

LayerNorm normalises across the feature dimension (not batch), making it independent of batch size:\(\text{LN}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sigma + \varepsilon} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}\). Pre-norm (normalise before each sub-layer, GPT-2 style) is more training-stable than the original post-norm for very deep models.

6. Transformer Encoder-Decoder Diagram

7. Python: Self-Attention from Scratch

We implement scaled dot-product attention and multi-head attention from scratch using NumPy. Visualisations show: (1) full attention weight matrix, (2) causal masked attention, (3) multi-head diversity, (4) the scaling variance argument, (5) sinusoidal positional encoding heatmap, and (6) PE cosine similarity showing nearby positions are more similar.

Python

script.py183 lines

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

np.random.seed(42)

# ============================================================
# Self-Attention from scratch
# ============================================================

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - x_max)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V, mask=None):
    d_k = W_Q.shape[1]
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask * (-1e9)
    attn_weights = softmax(scores, axis=-1)
    output = attn_weights @ V
    return output, attn_weights, Q, K, V

def positional_encoding(seq_len, d_model):
    PE = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, None]
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(positions * div_term)
    PE[:, 1::2] = np.cos(positions * div_term[:d_model//2])
    return PE

# Small example: sequence of 8 tokens, d_model=16
T = 8
d_model = 16
d_k = 8

# Random token embeddings
X = np.random.randn(T, d_model)

# Add positional encoding
PE = positional_encoding(T, d_model)
X_pe = X + PE

# Random projection matrices
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1

output, attn_weights, Q, K, V = self_attention(X_pe, W_Q, W_K, W_V)

# Causal mask (for autoregressive generation)
causal_mask = np.triu(np.ones((T, T)), k=1)
output_causal, attn_causal, _, _, _ = self_attention(X_pe, W_Q, W_K, W_V, mask=causal_mask)

# Multi-head attention
n_heads = 4
d_kh = d_k
attn_heads = []
for head in range(n_heads):
    Wq_h = np.random.randn(d_model, d_kh) * 0.1
    Wk_h = np.random.randn(d_model, d_kh) * 0.1
    Wv_h = np.random.randn(d_model, d_kh) * 0.1
    _, attn_h, _, _, _ = self_attention(X_pe, Wq_h, Wk_h, Wv_h)
    attn_heads.append(attn_h)

# Variance analysis: why scale by sqrt(d_k)?
d_k_vals = [1, 2, 4, 8, 16, 32, 64, 128]
pre_softmax_vars = []
post_softmax_vars = []
for dk in d_k_vals:
    q = np.random.randn(1000, dk)
    k = np.random.randn(1000, dk)
    raw_scores = np.array([q[i] @ k[i] for i in range(1000)])
    scaled_scores = raw_scores / np.sqrt(dk)
    pre_softmax_vars.append(np.var(raw_scores))
    post_softmax_vars.append(np.var(scaled_scores))

# ============================================================
# Positional encoding visualisation
# ============================================================
PE_vis = positional_encoding(50, 64)

# ============================================================
# Plots
# ============================================================
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.patch.set_facecolor('#0a0a0f')
fig.suptitle('Self-Attention and Transformer Components', fontsize=14, color='white', fontweight='bold')

# Panel 1: Attention weight matrix (full)
ax = axes[0, 0]
ax.set_facecolor('#0d0d1a')
im = ax.imshow(attn_weights, cmap='magma', vmin=0, vmax=attn_weights.max())
plt.colorbar(im, ax=ax, label='Attention weight')
ax.set_title('Full Self-Attention Weights', color='white', fontsize=11, fontweight='bold')
ax.set_xlabel('Key position', color='white')
ax.set_ylabel('Query position', color='white')
ax.tick_params(colors='white')
labels = [f'pos {i}' for i in range(T)]
ax.set_xticks(range(T)); ax.set_xticklabels(labels, rotation=45, color='white', fontsize=8)
ax.set_yticks(range(T)); ax.set_yticklabels(labels, color='white', fontsize=8)
for sp in ax.spines.values(): sp.set_color('#7c3aed')

# Panel 2: Causal attention mask
ax = axes[0, 1]
ax.set_facecolor('#0d0d1a')
im2 = ax.imshow(attn_causal, cmap='magma', vmin=0)
plt.colorbar(im2, ax=ax, label='Attention weight')
ax.set_title('Causal (Masked) Self-Attention', color='white', fontsize=11, fontweight='bold')
ax.set_xlabel('Key position', color='white')
ax.set_ylabel('Query position', color='white')
ax.tick_params(colors='white')
ax.set_xticks(range(T)); ax.set_xticklabels(labels, rotation=45, color='white', fontsize=8)
ax.set_yticks(range(T)); ax.set_yticklabels(labels, color='white', fontsize=8)
for sp in ax.spines.values(): sp.set_color('#7c3aed')

# Panel 3: Multi-head attention patterns
ax = axes[0, 2]
ax.set_facecolor('#0d0d1a')
combined = np.mean(attn_heads, axis=0)
for h_idx, attn_h in enumerate(attn_heads):
    diag = np.diag(attn_h)
    ax.plot(range(T), diag, 'o-', linewidth=2, markersize=5,
            label=f'Head {h_idx+1}', color=['#a78bfa','#7c3aed','#34d399','#fbbf24'][h_idx])
ax.set_title('Self-Attention Weight on Diagonal (per Head)', color='white', fontsize=11, fontweight='bold')
ax.set_xlabel('Sequence position', color='white')
ax.set_ylabel('Attention to self', color='white')
ax.tick_params(colors='white')
for sp in ax.spines.values(): sp.set_color('#7c3aed')
ax.grid(True, alpha=0.2, color='#7c3aed')
ax.legend(fontsize=9, facecolor='#0d0d1a', edgecolor='#7c3aed', labelcolor='white')

# Panel 4: Variance of dot product vs d_k
ax = axes[1, 0]
ax.set_facecolor('#0d0d1a')
ax.semilogx(d_k_vals, pre_softmax_vars, 'o-', color='#f87171', linewidth=2.5, markersize=7, label='Unscaled Q·K variance')
ax.semilogx(d_k_vals, post_softmax_vars, 's-', color='#34d399', linewidth=2.5, markersize=7, label='Scaled Q·K/√d_k variance')
ax.axhline(1.0, color='#fbbf24', linestyle='--', linewidth=1.5, alpha=0.7, label='Var = 1 (target)')
ax.set_title('Why Scale by sqrt(d_k)?', color='white', fontsize=11, fontweight='bold')
ax.set_xlabel('d_k (log scale)', color='white')
ax.set_ylabel('Variance of pre-softmax scores', color='white')
ax.tick_params(colors='white')
for sp in ax.spines.values(): sp.set_color('#7c3aed')
ax.grid(True, alpha=0.2, color='#7c3aed')
ax.legend(fontsize=9, facecolor='#0d0d1a', edgecolor='#7c3aed', labelcolor='white')

# Panel 5: Positional encoding heatmap
ax = axes[1, 1]
ax.set_facecolor('#0d0d1a')
im3 = ax.imshow(PE_vis.T[:32, :], cmap='RdBu', aspect='auto')
plt.colorbar(im3, ax=ax)
ax.set_title('Sinusoidal Positional Encoding (first 32 dims)', color='white', fontsize=11, fontweight='bold')
ax.set_xlabel('Sequence position', color='white')
ax.set_ylabel('Encoding dimension', color='white')
ax.tick_params(colors='white')
for sp in ax.spines.values(): sp.set_color('#7c3aed')

# Panel 6: PE dot product similarity
ax = axes[1, 2]
ax.set_facecolor('#0d0d1a')
pe_sim = PE_vis @ PE_vis.T
pe_sim_norm = pe_sim / np.outer(np.linalg.norm(PE_vis, axis=1), np.linalg.norm(PE_vis, axis=1))
im4 = ax.imshow(pe_sim_norm[:20, :20], cmap='magma', vmin=-0.2, vmax=1)
plt.colorbar(im4, ax=ax)
ax.set_title('PE Cosine Similarity (nearby positions similar)', color='white', fontsize=11, fontweight='bold')
ax.set_xlabel('Position', color='white')
ax.set_ylabel('Position', color='white')
ax.tick_params(colors='white')
for sp in ax.spines.values(): sp.set_color('#7c3aed')

plt.tight_layout()
plt.savefig('output.png', dpi=130, bbox_inches='tight', facecolor='#0a0a0f')
print(f"Attention output shape: {output.shape}")
print(f"Attention weights sum per row: {attn_weights.sum(axis=1).round(4)}")
print(f"Variance of unscaled scores (d_k=64): {pre_softmax_vars[-2]:.2f}")
print(f"Variance of scaled scores (d_k=64): {post_softmax_vars[-2]:.4f}")
print("Done.")

Click Run to execute the Python code

Code will be executed with Python 3 on the server

Share:X Reddit LinkedIn

← Ch 16: Recurrent Networks Ch 18: Large Language Models →