Part II: Deep Learning Architectures | Chapter 6

RNNs & Transformers for Sequences

From vanilla recurrent networks through LSTMs to the self-attention revolution — a complete derivation of modern sequence modelling

Why Sequences Matter in Science

Scientific data is often inherently sequential: time series from particle detectors, protein amino-acid sequences, atmospheric measurements, gravitational wave signals, and molecular dynamics trajectories. Unlike feedforward networks that process fixed-size inputs, recurrent architectures maintain an internal state that evolves as they read a sequence, while transformers use attention mechanisms to relate all positions simultaneously.

This chapter derives the forward pass of vanilla RNNs, exposes the vanishing gradient problem analytically, introduces LSTM gating, and then develops the full transformer architecture from first principles — including scaled dot-product attention and positional encoding.

1. The Vanilla Recurrent Neural Network

A recurrent neural network processes a sequence $\{x_1, x_2, \ldots, x_T\}$ one element at a time, maintaining a hidden state $h_t$ that serves as a compressed summary of all inputs seen so far. The fundamental equations defining the vanilla (Elman) RNN are:

RNN Forward Pass

At each time step $t = 1, 2, \ldots, T$:

$$h_t = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h)$$

$$\hat{y}_t = W_{hy}\, h_t + b_y$$

Here $W_{hh} \in \mathbb{R}^{d_h \times d_h}$ is the recurrent weight matrix, $W_{xh} \in \mathbb{R}^{d_h \times d_x}$ maps inputs to hidden states, $W_{hy} \in \mathbb{R}^{d_y \times d_h}$ maps hidden states to outputs, and $b_h, b_y$ are biases. The initial hidden state $h_0$ is typically set to zero.

The key insight is that the same weight matrices are shared across all time steps — this is weight tying. It means an RNN can process sequences of arbitrary length with a fixed number of parameters. The hidden state $h_t$ acts as a fixed-dimensional memory that is updated at each step.

Derivation: Unrolling Through Time

To understand the computational graph, we "unroll" the RNN across time. Define the pre-activation $z_t = W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h$, so $h_t = \tanh(z_t)$. Unrolling gives:

$$h_1 = \tanh(W_{hh}\, h_0 + W_{xh}\, x_1 + b_h)$$

$$h_2 = \tanh(W_{hh}\, \tanh(W_{hh}\, h_0 + W_{xh}\, x_1 + b_h) + W_{xh}\, x_2 + b_h)$$

and so on. The composition of $W_{hh}$ with itself through the tanh nonlinearity is what creates both the power and the difficulty of RNNs.

For a total loss $\mathcal{L} = \sum_{t=1}^{T} \ell_t(\hat{y}_t, y_t)$ summed over time steps, we need gradients with respect to all weights. The algorithm for computing these gradients is called Backpropagation Through Time (BPTT).

2. The Vanishing & Exploding Gradient Problem

The central difficulty with vanilla RNNs is learning long-range dependencies. Consider the gradient of the loss at time $T$ with respect to the hidden state at an earlier time $k$:

Gradient Chain Through Time

By the chain rule:

$$\frac{\partial \ell_T}{\partial h_k} = \frac{\partial \ell_T}{\partial h_T} \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

Each Jacobian factor is:

$$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - h_t^2)\, W_{hh}$$

where $\text{diag}(1 - h_t^2)$ is the diagonal matrix of tanh derivatives. Taking norms:

$$\left\|\prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}\right\| \leq \prod_{t=k+1}^{T} \|W_{hh}\| \cdot \|\text{diag}(1 - h_t^2)\|$$

Since $|\tanh'(z)| \leq 1$, if the largest singular value of $W_{hh}$ satisfies $\sigma_{\max}(W_{hh}) < 1$, the product shrinks exponentially as $(T - k) \to \infty$: gradients vanish. If $\sigma_{\max}(W_{hh}) > 1$, they can explode.

Formal Bound (Bengio et al., 1994)

Let $\gamma = \sigma_{\max}(W_{hh}) \cdot \max_t |\tanh'(z_t)|$. Then:

$$\left\|\frac{\partial h_T}{\partial h_k}\right\| \leq \gamma^{T-k}$$

When $\gamma < 1$, the gradient decays exponentially with the temporal distance $T - k$. For a sequence of length 100 with $\gamma = 0.9$, the gradient is attenuated by a factor of $0.9^{100} \approx 2.66 \times 10^{-5}$. This makes it essentially impossible to learn dependencies spanning many time steps.

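The bound above is easy to check numerically. The following NumPy sketch (the helper names `make_W` and `jacobian_norm` are illustrative, not from any library) builds a random $W_{hh}$ rescaled to a chosen largest singular value, then measures the 2-norm of the Jacobian product $\prod_t \text{diag}(1 - h_t^2)\, W_{hh}$ along a random input trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 32  # hidden dimension (illustrative choice)

def make_W(sigma_max):
    """Random recurrent matrix rescaled to a target largest singular value."""
    W = rng.standard_normal((d_h, d_h))
    return W * (sigma_max / np.linalg.svd(W, compute_uv=False)[0])

def jacobian_norm(W_hh, T):
    """2-norm of prod_{t=1}^{T} diag(1 - h_t^2) W_hh along a random run."""
    h = np.zeros(d_h)
    J = np.eye(d_h)
    for _ in range(T):
        h = np.tanh(W_hh @ h + rng.standard_normal(d_h))
        J = np.diag(1.0 - h**2) @ W_hh @ J
    return np.linalg.norm(J, 2)

# Gradient norms decay rapidly for sigma_max < 1 and may grow for sigma_max > 1.
for sigma in (0.9, 1.3):
    print(f"sigma_max = {sigma}:",
          [f"{jacobian_norm(make_W(sigma), T):.2e}" for T in (10, 50, 100)])
```

With $\sigma_{\max} = 0.9$ the norm at $T = 100$ is guaranteed by the bound to fall below $0.9^{100} \approx 2.7 \times 10^{-5}$, regardless of the random trajectory.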

3. Gradient Clipping

A practical remedy for exploding gradients (but not vanishing gradients) is gradient clipping. Before each parameter update, we rescale the gradient if its norm exceeds a threshold:

$$\tilde{g} = \begin{cases} g & \text{if } \|g\| \leq \theta \\ \theta \cdot \frac{g}{\|g\|} & \text{if } \|g\| > \theta \end{cases}$$

This preserves the gradient direction but caps its magnitude. Pascanu et al. (2013) showed this is essential for stable RNN training. Typical values are $\theta \in [1, 5]$.
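The clipping rule is a few lines of NumPy; the function name `clip_gradient` is illustrative (deep learning frameworks provide their own equivalents, e.g. norm-based gradient clipping utilities):

```python
import numpy as np

def clip_gradient(g, theta=5.0):
    """Rescale g to norm theta when ||g|| > theta; the direction is unchanged."""
    norm = np.linalg.norm(g)
    return g * (theta / norm) if norm > theta else g

g = np.full(100, 1.0)                              # ||g|| = 10, above threshold
print(np.linalg.norm(clip_gradient(g, theta=5.0)))  # capped at theta
small = 0.1 * g                                     # ||g|| = 1, below threshold
print(np.linalg.norm(clip_gradient(small, theta=5.0)))  # left unchanged
```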

4. Long Short-Term Memory (LSTM)

The LSTM (Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem by introducing a cell state $c_t$ that flows through time with only linear interactions, controlled by learned gates. This creates a "gradient highway" that allows gradients to flow unimpeded across many time steps.

LSTM Gate Equations

At each time step $t$, define the concatenated input $[h_{t-1}, x_t]$. The LSTM computes:

Forget gate (what to erase from cell state):

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Input gate (what new information to store):

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

Candidate cell state:

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$

Cell state update:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Output gate (what to expose as hidden state):

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

Hidden state:

$$h_t = o_t \odot \tanh(c_t)$$

Here $\sigma$ is the sigmoid function $\sigma(z) = 1/(1+e^{-z})$, and $\odot$ denotes element-wise (Hadamard) multiplication.

Why LSTM Solves Vanishing Gradients

Consider the gradient of $c_T$ with respect to $c_k$:

$$\frac{\partial c_T}{\partial c_k} = \prod_{t=k+1}^{T} \frac{\partial c_t}{\partial c_{t-1}} = \prod_{t=k+1}^{T} f_t$$

Each factor $f_t$ is a sigmoid output in $(0,1)$. Crucially, when the forget gate is close to 1 (i.e., the network learns to "remember"), the product $\prod f_t \approx 1$ and gradients propagate without decay. Unlike the vanilla RNN, where the spectral radius of $W_{hh}$ is fixed, each $f_t$ is learned and input-dependent, allowing the network to selectively preserve gradients for relevant information.

The bias of the forget gate is typically initialized to a large positive value (e.g., 1 or 2) so that $f_t \approx 1$ at the start of training, ensuring information flows freely until the network learns what to forget (Jozefowicz et al., 2015).

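A minimal NumPy sketch of one LSTM step, following the gate equations above. The names `lstm_step` and `init_params` are illustrative, and the forget-gate bias is initialised to a positive value as discussed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following the gate equations above."""
    z = np.concatenate([h_prev, x])              # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ z + p["b_f"])         # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])         # input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate cell state
    c = f * c_prev + i * c_tilde                 # cell state update
    o = sigmoid(p["W_o"] @ z + p["b_o"])         # output gate
    h = o * np.tanh(c)                           # hidden state
    return h, c

def init_params(d_x, d_h, rng, forget_bias=1.0):
    p = {}
    for g in "fico":
        p[f"W_{g}"] = rng.standard_normal((d_h, d_h + d_x)) * 0.1
        p[f"b_{g}"] = np.zeros(d_h)
    p["b_f"] += forget_bias  # positive forget bias: remember by default
    return p

rng = np.random.default_rng(0)
d_x, d_h, T = 4, 8, 20
p = init_params(d_x, d_h, rng)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(T):
    h, c = lstm_step(rng.standard_normal(d_x), h, c, p)
print(h.shape, float(np.abs(h).max()))  # |h_t| < 1, since h_t = o_t * tanh(c_t)
```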

5. Gated Recurrent Unit (GRU)

The GRU (Cho et al., 2014) is a simplified variant that merges the forget and input gates into a single update gate and combines the cell and hidden states:

GRU Equations

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$

$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

The update gate $z_t$ interpolates between the old hidden state and the candidate. When $z_t \approx 0$, the hidden state is copied forward unchanged; when $z_t \approx 1$, it is fully replaced. The GRU has fewer parameters than the LSTM (3 gate matrices vs. 4) and often performs comparably.
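The GRU step is correspondingly shorter than the LSTM's. A NumPy sketch (the function name `gru_step` and the parameter dictionary layout are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step following the equations above."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ hx + p["b_r"])  # reset gate
    h_tilde = np.tanh(p["W_h"] @ np.concatenate([r * h_prev, x]) + p["b_h"])
    return (1 - z) * h_prev + z * h_tilde  # interpolate old state and candidate

rng = np.random.default_rng(0)
d_x, d_h = 4, 8
p = {f"W_{g}": rng.standard_normal((d_h, d_h + d_x)) * 0.1 for g in "zrh"}
p.update({f"b_{g}": np.zeros(d_h) for g in "zrh"})
h = np.zeros(d_h)
for t in range(20):
    h = gru_step(rng.standard_normal(d_x), h, p)
print(h.shape)  # hidden state stays in (-1, 1) by induction
```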

6. Self-Attention Mechanism

The transformer architecture (Vaswani et al., 2017) abandons recurrence entirely in favour of attention. The fundamental operation is scaled dot-product attention, which allows every position in the sequence to attend to every other position in parallel.

Queries, Keys, and Values

Given an input sequence of embeddings $X \in \mathbb{R}^{T \times d_{\text{model}}}$, we compute three projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

where $W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.

Intuition: The query $q_i$ at position $i$ asks "what information do I need?"; the key $k_j$ at position $j$ advertises "what information do I contain?"; and the value $v_j$ is the actual content that gets transmitted when position $j$ is attended to.

Scaled Dot-Product Attention

The attention output is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Why the $\sqrt{d_k}$ scaling? Consider $q_i^\top k_j = \sum_{l=1}^{d_k} q_{il} k_{jl}$. If the entries of $q$ and $k$ are i.i.d. with zero mean and unit variance, then:

$$\mathbb{E}[q_i^\top k_j] = 0, \quad \text{Var}(q_i^\top k_j) = d_k$$

So the dot products grow as $\mathcal{O}(\sqrt{d_k})$ in standard deviation. For large $d_k$, this pushes the softmax into regions with extremely small gradients (saturation). Dividing by $\sqrt{d_k}$ normalises the variance to 1, keeping the softmax in a well-behaved regime.

The attention matrix $A = \text{softmax}(QK^\top/\sqrt{d_k}) \in \mathbb{R}^{T \times T}$ has rows that sum to 1. Entry $A_{ij}$ is the attention weight that position $i$ places on position $j$.

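Scaled dot-product attention is a few lines of NumPy. The sketch below (function names are illustrative) also checks the variance argument empirically: dot products of i.i.d. unit-variance vectors have variance close to $d_k$:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T) similarity scores
    A = softmax(scores)              # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 16, 16
Q, K, V = (rng.standard_normal((T, d)) for d in (d_k, d_k, d_v))
out, A = scaled_dot_product_attention(Q, K, V)
print(out.shape, A.shape)  # (8, 16) (8, 8)

# Empirical check of the variance argument: Var(q . k) is close to d_k.
q = rng.standard_normal((100_000, d_k))
k = rng.standard_normal((100_000, d_k))
print((q * k).sum(axis=1).var())  # close to d_k = 16
```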

7. Multi-Head Attention

A single attention head can only capture one type of relationship. Multi-head attention runs $h$ attention heads in parallel, each with its own projections, and concatenates the results:

Multi-Head Attention Equations

$$\text{head}_i = \text{Attention}(X W_Q^{(i)}, X W_K^{(i)}, X W_V^{(i)})$$

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_V^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and the output projection $W_O \in \mathbb{R}^{h \cdot d_v \times d_{\text{model}}}$.

Typically $d_k = d_v = d_{\text{model}}/h$, so the total computation is comparable to a single head with full dimensionality. Different heads learn to attend to different aspects: syntactic relationships, positional proximity, semantic similarity, etc.

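A NumPy sketch of multi-head attention with $d_k = d_v = d_{\text{model}}/h$ (the function name `multi_head_attention` and the parameter layout are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, heads, W_O):
    """heads: one (W_Q, W_K, W_V) projection triple per head."""
    outs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O  # concat heads, then project

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 32, 4
d_k = d_model // n_heads  # per-head dimension: d_k = d_v = d_model / h
X = rng.standard_normal((T, d_model))
heads = [tuple(rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.standard_normal((n_heads * d_k, d_model)) * 0.1
Y = multi_head_attention(X, heads, W_O)
print(Y.shape)  # (6, 32): same shape as the input, as required for stacking
```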

8. Positional Encoding

Since self-attention is permutation-equivariant (it treats the input as a set, not a sequence), we must inject positional information explicitly. The original transformer uses sinusoidal positional encodings:

Sinusoidal Positional Encoding

For position $\text{pos}$ and dimension $i$:

$$\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

$$\text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

Why sinusoidal? For any fixed offset $k$, $\text{PE}(\text{pos}+k)$ can be written as a linear function of $\text{PE}(\text{pos})$. Specifically, using the angle addition formula:

$$\sin(\omega(\text{pos}+k)) = \sin(\omega \cdot \text{pos})\cos(\omega k) + \cos(\omega \cdot \text{pos})\sin(\omega k)$$

This means relative positions can be captured by linear transformations of the encoding, which the attention mechanism can learn. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$, covering a wide range of scales.

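The sinusoidal encoding vectorises cleanly in NumPy (the function name `positional_encoding` is illustrative):

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(.), PE[pos, 2i+1] = cos(.)."""
    pos = np.arange(T)[:, None]           # (T, 1) positions
    i = np.arange(d_model // 2)[None, :]  # (1, d_model/2) dimension indices
    angle = pos / 10000.0 ** (2 * i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angle)           # even dimensions
    pe[:, 1::2] = np.cos(angle)           # odd dimensions
    return pe

pe = positional_encoding(T=50, d_model=64)
print(pe.shape)    # (50, 64)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1 alternate
```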

9. The Full Transformer Architecture

The transformer stacks multi-head attention with feedforward layers, using residual connections and layer normalisation throughout.

Encoder Block

Each encoder layer performs:

$$Z = \text{LayerNorm}(X + \text{MultiHead}(X))$$

$$\text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z))$$

The feedforward network (FFN) is applied position-wise:

$$\text{FFN}(z) = W_2 \cdot \text{ReLU}(W_1 z + b_1) + b_2$$

where typically $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, with $d_{\text{ff}} = 4 \cdot d_{\text{model}}$. The residual connections $X + \text{MultiHead}(X)$ ensure gradients can flow directly through the network, analogous to the cell state highway in LSTMs.

Layer Normalisation

Unlike batch normalisation, layer normalisation computes statistics across the feature dimension for each individual example:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ and$\sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$ are computed per-token. The parameters $\gamma, \beta \in \mathbb{R}^d$ are learned.

Decoder and Causal Masking

In autoregressive generation (e.g., language modelling), the decoder uses causal (masked) attention: position $i$ can only attend to positions $j \leq i$. This is implemented by adding $-\infty$ to the attention scores for $j > i$ before softmax:

$$\text{score}_{ij} = \begin{cases} q_i^\top k_j / \sqrt{d_k} & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

The decoder also contains a cross-attention layer where the queries come from the decoder and the keys and values come from the encoder output.

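The pieces above assemble into a complete encoder block. This is a minimal single-head, post-norm NumPy sketch (function names and the parameter dictionary are illustrative; real implementations use multi-head attention and batched tensors). Passing a causal mask turns the same attention computation into the decoder-style masked variant:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise across the feature dimension, per token."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def causal_mask(T):
    """-inf above the diagonal, 0 elsewhere: blocks attention to j > i."""
    return np.triu(np.full((T, T), -np.inf), k=1)

def encoder_block(X, p, mask=None):
    """Single-head self-attention + position-wise FFN, post-norm residuals."""
    d_k = p["W_Q"].shape[1]
    scores = (X @ p["W_Q"]) @ (X @ p["W_K"]).T / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask
    attn = softmax(scores) @ (X @ p["W_V"]) @ p["W_O"]
    Z = layer_norm(X + attn, p["g1"], p["beta1"])            # first residual
    ffn = np.maximum(0.0, Z @ p["W_1"] + p["c1"]) @ p["W_2"] + p["c2"]
    return layer_norm(Z + ffn, p["g2"], p["beta2"])          # second residual

rng = np.random.default_rng(0)
T, d_model, d_ff = 5, 16, 64  # d_ff = 4 * d_model
p = {k: rng.standard_normal((d_model, d_model)) * 0.1
     for k in ("W_Q", "W_K", "W_V", "W_O")}
p.update(W_1=rng.standard_normal((d_model, d_ff)) * 0.1, c1=np.zeros(d_ff),
         W_2=rng.standard_normal((d_ff, d_model)) * 0.1, c2=np.zeros(d_model),
         g1=np.ones(d_model), beta1=np.zeros(d_model),
         g2=np.ones(d_model), beta2=np.zeros(d_model))
X = rng.standard_normal((T, d_model))
Y = encoder_block(X, p)                         # encoder: full attention
Y_causal = encoder_block(X, p, causal_mask(T))  # decoder-style masked attention
print(Y.shape, Y_causal.shape)  # both (5, 16)
```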

10. Computational Complexity: RNN vs. Transformer

| Property | RNN | Transformer |
|---|---|---|
| Time per layer | $\mathcal{O}(T \cdot d^2)$ | $\mathcal{O}(T^2 \cdot d + T \cdot d^2)$ |
| Parallelisable | No (sequential) | Yes (fully parallel) |
| Max path length | $\mathcal{O}(T)$ | $\mathcal{O}(1)$ |
| Memory | $\mathcal{O}(d)$ per step | $\mathcal{O}(T^2)$ attention matrix |

The $\mathcal{O}(T^2)$ attention cost is the main bottleneck for very long sequences. This has motivated efficient attention variants: sparse attention, linear attention, FlashAttention (which optimises memory access patterns), and state-space models.

11. Scientific Applications

Time Series in Physics

LSTMs and transformers are used to model chaotic dynamical systems, predict turbulence evolution, and denoise gravitational wave signals from LIGO. The ability to capture long-range temporal correlations is critical for these applications.

Protein Language Models

Transformer architectures (e.g., ESM, AlphaFold2's Evoformer) treat amino acid sequences as "protein language." Self-attention captures co-evolutionary relationships between residues, enabling structure prediction and function annotation.

Molecular Generation (SMILES)

Molecules represented as SMILES strings are sequences that can be generated autoregressively. Transformer decoders learn the grammar of valid SMILES, enabling drug discovery through conditional generation of molecules with desired properties.

Climate & Weather

Vision Transformers (ViT) and spatio-temporal transformers like Pangu-Weather and GraphCast have achieved state-of-the-art weather forecasting, processing atmospheric fields as sequences of spatial patches across pressure levels and time.


Chapter Summary

  • Vanilla RNNs process sequences via shared weights and a recurrent hidden state, but suffer from vanishing/exploding gradients for long sequences.
  • LSTMs introduce a cell state with learned forget/input/output gates, creating a gradient highway that preserves information across hundreds of time steps.
  • GRUs simplify the gating mechanism to two gates (update and reset) with comparable performance.
  • Self-attention computes $\text{softmax}(QK^\top/\sqrt{d_k})V$, relating all positions in parallel with $\mathcal{O}(1)$ path length.
  • Multi-head attention runs $h$ parallel attention heads to capture diverse relationships.
  • Positional encoding injects sequence order via sinusoidal functions with a provable linear relationship for relative positions.
  • Transformers stack multi-head attention + FFN layers with residual connections and layer normalisation, enabling massive parallelism and state-of-the-art performance across scientific domains.