RNNs & Transformers for Sequences
From vanilla recurrent networks through LSTMs to the self-attention revolution — a complete derivation of modern sequence modelling
Why Sequences Matter in Science
Scientific data is often inherently sequential: time series from particle detectors, protein amino-acid sequences, atmospheric measurements, gravitational wave signals, and molecular dynamics trajectories. Unlike feedforward networks that process fixed-size inputs, recurrent architectures maintain an internal state that evolves as they read a sequence, while transformers use attention mechanisms to relate all positions simultaneously.
This chapter derives the forward pass of vanilla RNNs, exposes the vanishing gradient problem analytically, introduces LSTM gating, and then develops the full transformer architecture from first principles — including scaled dot-product attention and positional encoding.
1. The Vanilla Recurrent Neural Network
A recurrent neural network processes a sequence $\{x_1, x_2, \ldots, x_T\}$ one element at a time, maintaining a hidden state $h_t$ that serves as a compressed summary of all inputs seen so far. The fundamental equations defining the vanilla (Elman) RNN are:
RNN Forward Pass
At each time step $t = 1, 2, \ldots, T$:
$$h_t = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h)$$
$$\hat{y}_t = W_{hy}\, h_t + b_y$$
Here $W_{hh} \in \mathbb{R}^{d_h \times d_h}$ is the recurrent weight matrix, $W_{xh} \in \mathbb{R}^{d_h \times d_x}$ maps inputs to hidden states, $W_{hy} \in \mathbb{R}^{d_y \times d_h}$ maps hidden states to outputs, and $b_h, b_y$ are biases. The initial hidden state $h_0$ is typically set to zero.
The key insight is that the same weight matrices are shared across all time steps (parameter sharing, sometimes called weight tying across time). This means an RNN can process sequences of arbitrary length with a fixed number of parameters. The hidden state $h_t$ acts as a fixed-dimensional memory that is updated at each step.
Derivation: Unrolling Through Time
To understand the computational graph, we "unroll" the RNN across time. Define the pre-activation $z_t = W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h$, so $h_t = \tanh(z_t)$. Unrolling gives:
$$h_1 = \tanh(W_{hh}\, h_0 + W_{xh}\, x_1 + b_h)$$
$$h_2 = \tanh(W_{hh}\, \tanh(W_{hh}\, h_0 + W_{xh}\, x_1 + b_h) + W_{xh}\, x_2 + b_h)$$
and so on. The composition of $W_{hh}$ with itself through the tanh nonlinearity is what creates both the power and the difficulty of RNNs.
For a total loss $\mathcal{L} = \sum_{t=1}^{T} \ell_t(\hat{y}_t, y_t)$ summed over time steps, we need gradients with respect to all weights. The algorithm for computing these gradients is called Backpropagation Through Time (BPTT).
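The forward pass above can be sketched in a few lines of NumPy (the dimensions, seed, and random initialisation below are illustrative choices, not prescriptions):

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Vanilla RNN forward pass: the same weights are reused at every step."""
    h = np.zeros(W_hh.shape[0])
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # hidden state update
        hs.append(h)
        ys.append(W_hy @ h + b_y)                  # linear readout
    return hs, ys

rng = np.random.default_rng(0)
d_x, d_h, d_y, T = 3, 8, 2, 20                     # illustrative sizes
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
W_xh = rng.standard_normal((d_h, d_x)) * 0.1
W_hy = rng.standard_normal((d_y, d_h)) * 0.1
b_h, b_y = np.zeros(d_h), np.zeros(d_y)

xs = rng.standard_normal((T, d_x))
hs, ys = rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y)
print(len(hs), hs[0].shape, ys[0].shape)
```

Note that the parameter count is independent of $T$: the loop reuses the same five arrays at every step.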
2. The Vanishing & Exploding Gradient Problem
The central difficulty with vanilla RNNs is learning long-range dependencies. Consider the gradient of the loss at time $T$ with respect to the hidden state at an earlier time $k$:
Gradient Chain Through Time
By the chain rule:
$$\frac{\partial \ell_T}{\partial h_k} = \frac{\partial \ell_T}{\partial h_T} \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$
Each Jacobian factor is:
$$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - h_t^2)\, W_{hh}$$
where $\text{diag}(1 - h_t^2)$ is the diagonal matrix of tanh derivatives. Taking norms:
$$\left\|\prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}\right\| \leq \prod_{t=k+1}^{T} \|W_{hh}\| \cdot \|\text{diag}(1 - h_t^2)\|$$
Since $|\tanh'(z)| \leq 1$, if the largest singular value of $W_{hh}$ satisfies $\sigma_{\max}(W_{hh}) < 1$, the product shrinks exponentially as $(T - k) \to \infty$: gradients vanish. If $\sigma_{\max}(W_{hh}) > 1$, they can explode.
Formal Bound (Bengio et al., 1994)
Let $\gamma = \sigma_{\max}(W_{hh}) \cdot \max_t |\tanh'(z_t)|$. Then:
$$\left\|\frac{\partial h_T}{\partial h_k}\right\| \leq \gamma^{T-k}$$
When $\gamma < 1$, the gradient decays exponentially with the temporal distance $T - k$. For a sequence of length 100 with $\gamma = 0.9$, the gradient is attenuated by a factor of $0.9^{100} \approx 2.66 \times 10^{-5}$. This makes it essentially impossible to learn dependencies spanning many time steps.
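The bound can be checked numerically. The sketch below (sizes and seed are arbitrary; we take $\tanh'(z) = 1$, the most optimistic case, so each Jacobian reduces to $W_{hh}$) rescales a random recurrent matrix to $\sigma_{\max} = 0.9$ and tracks the norm of the Jacobian product over 100 steps:

```python
import numpy as np

# Illustrative check of the gamma^(T-k) bound: with tanh' = 1 each
# Jacobian dh_t/dh_{t-1} is exactly W_hh -- the best case for gradient flow.
rng = np.random.default_rng(0)
d_h, T = 32, 100

W_hh = rng.standard_normal((d_h, d_h))
W_hh *= 0.9 / np.linalg.svd(W_hh, compute_uv=False)[0]  # set sigma_max = 0.9

J = np.eye(d_h)
norms = []
for _ in range(T):
    J = W_hh @ J                        # accumulate the Jacobian product
    norms.append(np.linalg.norm(J, 2))  # spectral norm

print(f"||dh_T/dh_0|| after {T} steps: {norms[-1]:.3e}")
print(f"gamma^T bound:                {0.9**T:.3e}")
```

Even in this best case the gradient norm stays below $0.9^{100} \approx 2.66 \times 10^{-5}$; the tanh derivatives only shrink it further.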
3. Gradient Clipping
A practical remedy for exploding gradients (but not vanishing gradients) is gradient clipping. Before each parameter update, we rescale the gradient if its norm exceeds a threshold:
$$\tilde{g} = \begin{cases} g & \text{if } \|g\| \leq \theta \\ \theta \cdot \frac{g}{\|g\|} & \text{if } \|g\| > \theta \end{cases}$$
This preserves the gradient direction but caps its magnitude. Pascanu et al. (2013) showed this is essential for stable RNN training. Typical values are $\theta \in [1, 5]$.
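A minimal implementation of norm clipping (the threshold $\theta = 5$ and the example gradients are arbitrary):

```python
import numpy as np

def clip_gradient(g, theta=5.0):
    """Rescale g to norm theta if ||g|| > theta; the direction is unchanged."""
    norm = np.linalg.norm(g)
    return g * (theta / norm) if norm > theta else g

g_small = np.array([1.0, 2.0])    # norm ~2.24: left untouched
g_large = np.array([30.0, 40.0])  # norm 50: rescaled down to norm 5

clipped = clip_gradient(g_large)
print(clipped, np.linalg.norm(clipped))  # [3. 4.] 5.0
```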
4. Long Short-Term Memory (LSTM)
The LSTM (Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem by introducing a cell state $c_t$ that flows through time with only linear interactions, controlled by learned gates. This creates a "gradient highway" that allows gradients to flow unimpeded across many time steps.
LSTM Gate Equations
At each time step $t$, define the concatenated input $[h_{t-1}, x_t]$. The LSTM computes:
Forget gate (what to erase from cell state):
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
Input gate (what new information to store):
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
Candidate cell state:
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
Cell state update:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Output gate (what to expose as hidden state):
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
Hidden state:
$$h_t = o_t \odot \tanh(c_t)$$
Here $\sigma$ is the sigmoid function $\sigma(z) = 1/(1+e^{-z})$, and $\odot$ denotes element-wise (Hadamard) multiplication.
Why LSTM Solves Vanishing Gradients
Consider the gradient of $c_T$ with respect to $c_k$:
$$\frac{\partial c_T}{\partial c_k} = \prod_{t=k+1}^{T} \frac{\partial c_t}{\partial c_{t-1}} = \prod_{t=k+1}^{T} f_t$$
Each factor $f_t$ is a sigmoid output in $(0,1)$. (This expression treats the gates as constants with respect to $c_{t-1}$; the full Jacobian contains additional terms through $h_{t-1}$, but this direct path is the one that matters for long-range memory.) Crucially, when the forget gate is close to 1 (i.e., the network learns to "remember"), the product $\prod f_t \approx 1$ and gradients propagate without decay. Unlike the vanilla RNN, where the spectral radius of $W_{hh}$ is fixed, each $f_t$ is learned and input-dependent, allowing the network to selectively preserve gradients for relevant information.
The bias of the forget gate is typically initialized to a large positive value (e.g., 1 or 2) so that $f_t \approx 1$ at the start of training, ensuring information flows freely until the network learns what to forget (Jozefowicz et al., 2015).
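A single LSTM step, written directly from the gate equations above (the stacked weight layout, the sizes, and the forget-bias value are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_{t-1}, x_t] to the four stacked
    gate pre-activations, in the order (f, i, c_tilde, o)."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * d_h:1 * d_h])        # forget gate
    i = sigmoid(z[1 * d_h:2 * d_h])        # input gate
    c_tilde = np.tanh(z[2 * d_h:3 * d_h])  # candidate cell state
    o = sigmoid(z[3 * d_h:4 * d_h])        # output gate
    c = f * c_prev + i * c_tilde           # cell state update
    h = o * np.tanh(c)                     # hidden state
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 4, 8
W = rng.standard_normal((4 * d_h, d_h + d_x)) * 0.1
b = np.zeros(4 * d_h)
b[:d_h] = 1.0   # forget-gate bias initialised positive, as discussed above

h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(10):
    h, c = lstm_step(rng.standard_normal(d_x), h, c, W, b)
print(h.shape, c.shape)
```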
5. Gated Recurrent Unit (GRU)
The GRU (Cho et al., 2014) is a simplified variant that merges the forget and input gates into a single update gate and combines the cell and hidden states:
GRU Equations
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
The update gate $z_t$ interpolates between the old hidden state and the candidate. When $z_t \approx 0$, the hidden state is copied forward unchanged; when $z_t \approx 1$, it is fully replaced. The GRU has fewer parameters than the LSTM (3 gate matrices vs. 4) and often performs comparably.
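The GRU equations translate just as directly (again with illustrative sizes and initialisation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step following the update/reset-gate equations."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx + bz)                      # update gate
    r = sigmoid(Wr @ hx + br)                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1 - z) * h_prev + z * h_tilde          # interpolation

rng = np.random.default_rng(1)
d_x, d_h = 3, 5
Wz, Wr, Wh = (rng.standard_normal((d_h, d_h + d_x)) * 0.1 for _ in range(3))
bz, br, bh = np.zeros(d_h), np.zeros(d_h), np.zeros(d_h)

h = np.zeros(d_h)
for t in range(5):
    h = gru_step(rng.standard_normal(d_x), h, Wz, Wr, Wh, bz, br, bh)
print(h.shape)
```

Because $h_t$ is a convex combination of $h_{t-1}$ and $\tilde{h}_t$, its entries stay bounded in $(-1, 1)$.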
6. Self-Attention Mechanism
The transformer architecture (Vaswani et al., 2017) abandons recurrence entirely in favour of attention. The fundamental operation is scaled dot-product attention, which allows every position in the sequence to attend to every other position in parallel.
Queries, Keys, and Values
Given an input sequence of embeddings $X \in \mathbb{R}^{T \times d_{\text{model}}}$, we compute three projections:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$
where $W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.
Intuition: The query $q_i$ at position $i$ asks "what information do I need?"; the key $k_j$ at position $j$ advertises "what information do I contain?"; and the value $v_j$ is the actual content that gets transmitted when position $j$ is attended to.
Scaled Dot-Product Attention
The attention output is:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
Why the $\sqrt{d_k}$ scaling? Consider $q_i^\top k_j = \sum_{l=1}^{d_k} q_{il} k_{jl}$. If the entries of $q$ and $k$ are i.i.d. with zero mean and unit variance, then:
$$\mathbb{E}[q_i^\top k_j] = 0, \quad \text{Var}(q_i^\top k_j) = d_k$$
So the dot products grow as $\mathcal{O}(\sqrt{d_k})$ in standard deviation. For large $d_k$, this pushes the softmax into regions with extremely small gradients (saturation). Dividing by $\sqrt{d_k}$ normalises the variance to 1, keeping the softmax in a well-behaved regime.
The attention matrix $A = \text{softmax}(QK^\top/\sqrt{d_k}) \in \mathbb{R}^{T \times T}$ has rows that sum to 1. Entry $A_{ij}$ is the attention weight that position $i$ places on position $j$.
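Scaled dot-product attention is only a few lines of NumPy (the shapes below are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    A = softmax(scores, axis=-1)   # (T, T); each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 8, 8
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

out, A = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (6, 8)
print(A.sum(axis=1))    # each row sums to 1
```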
7. Multi-Head Attention
A single attention head can only capture one type of relationship. Multi-head attention runs $h$ attention heads in parallel, each with its own projections, and concatenates the results:
Multi-Head Attention Equations
$$\text{head}_i = \text{Attention}(X W_Q^{(i)}, X W_K^{(i)}, X W_V^{(i)})$$
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$
where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_V^{(i)} \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and the output projection $W_O \in \mathbb{R}^{h \cdot d_v \times d_{\text{model}}}$.
Typically $d_k = d_v = d_{\text{model}}/h$, so the total computation is comparable to a single head with full dimensionality. Different heads learn to attend to different aspects: syntactic relationships, positional proximity, semantic similarity, etc.
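Multi-head attention can be sketched as a loop over per-head projections (the head count follows the usual $d_k = d_v = d_{\text{model}}/h$ convention; all other values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(X, WQ, WK, WV, WO):
    """Run h heads in parallel, concatenate, then project with WO."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 16, 4
d_k = d_model // n_heads                 # d_k = d_v = d_model / h

WQ = rng.standard_normal((n_heads, d_model, d_k))
WK = rng.standard_normal((n_heads, d_model, d_k))
WV = rng.standard_normal((n_heads, d_model, d_k))
WO = rng.standard_normal((n_heads * d_k, d_model))

X = rng.standard_normal((T, d_model))
Y = multi_head_attention(X, WQ, WK, WV, WO)
print(Y.shape)   # (6, 16): same shape as the input
```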
8. Positional Encoding
Since self-attention is permutation-equivariant (it treats the input as a set, not a sequence), we must inject positional information explicitly. The original transformer uses sinusoidal positional encodings:
Sinusoidal Positional Encoding
For position $\text{pos}$ and dimension $i$:
$$\text{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$
$$\text{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$
Why sinusoidal? For any fixed offset $k$, $\text{PE}(\text{pos}+k)$ can be written as a linear function of $\text{PE}(\text{pos})$. Specifically, using the angle addition formula:
$$\sin(\omega(\text{pos}+k)) = \sin(\omega \cdot \text{pos})\cos(\omega k) + \cos(\omega \cdot \text{pos})\sin(\omega k)$$
This means relative positions can be captured by linear transformations of the encoding, which the attention mechanism can learn. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$, covering a wide range of scales.
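The sinusoidal encoding in code (the sequence length and model width are arbitrary):

```python
import numpy as np

def sinusoidal_pe(T, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(T)[:, None]                    # (T, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / (10000.0 ** (2 * i / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions
    return pe

pe = sinusoidal_pe(T=50, d_model=32)
print(pe.shape)   # (50, 32)
# Each (sin, cos) pair shares an angle, so it lies on the unit circle:
print(np.allclose(pe[:, 0::2]**2 + pe[:, 1::2]**2, 1.0))  # True
```

The shared angle per pair is exactly what makes the linear relative-position property hold: a rotation matrix maps $\text{PE}(\text{pos})$ to $\text{PE}(\text{pos}+k)$.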
9. The Full Transformer Architecture
The transformer stacks multi-head attention with feedforward layers, using residual connections and layer normalisation throughout.
Encoder Block
Each encoder layer performs:
$$Z = \text{LayerNorm}(X + \text{MultiHead}(X))$$
$$\text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z))$$
The feedforward network (FFN) is applied position-wise:
$$\text{FFN}(z) = W_2 \cdot \text{ReLU}(W_1 z + b_1) + b_2$$
where, in the column-vector convention used throughout this chapter, $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, typically with $d_{\text{ff}} = 4 \cdot d_{\text{model}}$. The residual connections $X + \text{MultiHead}(X)$ ensure gradients can flow directly through the network, analogous to the cell state highway in LSTMs.
Layer Normalisation
Unlike batch normalisation, layer normalisation computes statistics across the feature dimension for each individual example:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ and$\sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$ are computed per-token. The parameters $\gamma, \beta \in \mathbb{R}^d$ are learned.
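Layer normalisation per token, written directly from the formula (the token count, feature width, and identity affine parameters are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token across its feature dimension independently."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)) * 3 + 2   # 4 tokens, 16 features each
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))

print(y.mean(axis=-1))   # ~0 for every token
print(y.std(axis=-1))    # ~1 for every token
```

Unlike batch normalisation, no statistics are shared across examples, so the operation behaves identically at training and inference time.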
Decoder and Causal Masking
In autoregressive generation (e.g., language modelling), the decoder uses causal (masked) attention: position $i$ can only attend to positions $j \leq i$. This is implemented by adding $-\infty$ to the attention scores for $j > i$ before the softmax:
$$\text{score}_{ij} = \begin{cases} q_i^\top k_j / \sqrt{d_k} & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$
The decoder also contains a cross-attention layer where the queries come from the decoder and the keys and values come from the encoder output.
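Causal masking drops straight out of the score definition: set the scores above the diagonal to $-\infty$ before the softmax (shapes below are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Masked attention: position i attends only to positions j <= i."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > i
    scores = np.where(mask, -np.inf, scores)          # -inf above the diagonal
    A = softmax(scores, axis=-1)                      # masked entries become 0
    return A @ V, A

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out, A = causal_attention(Q, K, V)

print(np.allclose(A, np.tril(A)))  # True: no weight on future positions
print(A.sum(axis=1))               # rows still sum to 1
```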
10. Computational Complexity: RNN vs. Transformer
| Property | RNN | Transformer |
|---|---|---|
| Time per layer | $\mathcal{O}(T \cdot d^2)$ | $\mathcal{O}(T^2 \cdot d + T \cdot d^2)$ |
| Parallelisable | No (sequential) | Yes (fully parallel) |
| Max path length | $\mathcal{O}(T)$ | $\mathcal{O}(1)$ |
| Memory | $\mathcal{O}(d)$ per step | $\mathcal{O}(T^2)$ attention matrix |
The $\mathcal{O}(T^2)$ attention cost is the main bottleneck for very long sequences. This has motivated efficient attention variants: sparse attention, linear attention, FlashAttention (which optimises memory access patterns), and state-space models.
11. Scientific Applications
Time Series in Physics
LSTMs and transformers are used to model chaotic dynamical systems, predict turbulence evolution, and denoise gravitational wave signals from LIGO. The ability to capture long-range temporal correlations is critical for these applications.
Protein Language Models
Transformer architectures (e.g., ESM, AlphaFold2's Evoformer) treat amino acid sequences as "protein language." Self-attention captures co-evolutionary relationships between residues, enabling structure prediction and function annotation.
Molecular Generation (SMILES)
Molecules represented as SMILES strings are sequences that can be generated autoregressively. Transformer decoders learn the grammar of valid SMILES, enabling drug discovery through conditional generation of molecules with desired properties.
Climate & Weather
Vision Transformers (ViT) and spatio-temporal transformers like Pangu-Weather and GraphCast have achieved state-of-the-art weather forecasting, processing atmospheric fields as sequences of spatial patches across pressure levels and time.
Chapter Summary
- Vanilla RNNs process sequences via shared weights and a recurrent hidden state, but suffer from vanishing/exploding gradients for long sequences.
- LSTMs introduce a cell state with learned forget/input/output gates, creating a gradient highway that preserves information across hundreds of time steps.
- GRUs simplify the gating mechanism to two gates (update and reset) with comparable performance.
- Self-attention computes $\text{softmax}(QK^\top/\sqrt{d_k})V$, relating all positions in parallel with $\mathcal{O}(1)$ path length.
- Multi-head attention runs $h$ parallel attention heads to capture diverse relationships.
- Positional encoding injects sequence order via sinusoidal functions with a provable linear relationship for relative positions.
- Transformers stack multi-head attention + FFN layers with residual connections and layer normalisation, enabling massive parallelism and state-of-the-art performance across scientific domains.