Chapter 16: Recurrent Networks

Recurrent neural networks process sequences by maintaining a hidden state that accumulates information from all past inputs. The vanishing gradient problem plagued early RNNs; the LSTM's gating mechanism was the breakthrough that made deep sequence learning practical.

1. The Vanilla RNN

At each time step \(t\), the RNN reads input \(\mathbf{x}_t\) and updates its hidden state:

\[ \mathbf{h}_t = \tanh\!\left(W_{hh}\mathbf{h}_{t-1} + W_{xh}\mathbf{x}_t + \mathbf{b}_h\right) \]\[ \hat{\mathbf{y}}_t = W_{hy}\mathbf{h}_t + \mathbf{b}_y \]

The same weights \(W_{hh}, W_{xh}\) are reused at every step. This parameter sharing is what lets an RNN process sequences of arbitrary length with a fixed number of parameters.
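As a concrete sketch, one time step can be written in a few lines of NumPy (the sizes and random initialization below are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                          # hidden size, input size (arbitrary)
W_hh = rng.normal(0, 0.5, (H, H))    # hidden-to-hidden weights
W_xh = rng.normal(0, 0.5, (H, D))    # input-to-hidden weights
b_h = np.zeros(H)

def rnn_step(h_prev, x):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b_h)

# The same W_hh, W_xh are reused at every step (parameter sharing),
# so the loop handles a sequence of any length.
h = np.zeros(H)
for x in rng.normal(size=(5, D)):    # a length-5 toy sequence
    h = rnn_step(h, x)
print(h)                             # final hidden state, each component in (-1, 1)
```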

2. Backpropagation Through Time (BPTT)

For a loss \(\mathcal{L} = \sum_t \mathcal{L}_t\) over all steps, the gradient with respect to \(W_{hh}\) requires the chain rule unrolled through time:

\[ \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t} \, \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \, \frac{\partial^{+} \mathbf{h}_k}{\partial W_{hh}} \]

where \(\partial^{+}\mathbf{h}_k/\partial W_{hh}\) denotes the immediate partial derivative, treating \(\mathbf{h}_{k-1}\) as a constant. Every earlier step \(k \le t\) contributes, because \(W_{hh}\) is reused at every step.

The gradient of \(\mathbf{h}_t\) with respect to \(\mathbf{h}_k\) (for \(k < t\)) is:

\[ \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} = \prod_{j=k+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} = \prod_{j=k+1}^{t} \mathrm{diag}(1 - \mathbf{h}_j^2)\cdot W_{hh} \]

This is a product of \(t-k\) Jacobian matrices. Because \(|\tanh'| \le 1\), if the largest singular value \(\lambda_1\) of \(W_{hh}\) satisfies \(\lambda_1 < 1\), the product shrinks exponentially in \(t-k\): vanishing gradients. If \(\lambda_1 > 1\), the product can instead grow exponentially: exploding gradients. Either way, learning dependencies that span many time steps becomes extremely difficult.
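This shrinkage is easy to observe numerically. The sketch below (random weights rescaled so that \(\lambda_1 = 0.9\), an arbitrary choice for illustration) accumulates the Jacobian product from the formula above and tracks its spectral norm:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 16
W = rng.normal(size=(H, H))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]   # rescale so lambda_1 = 0.9

h = np.zeros(H)
J = np.eye(H)          # running product of Jacobians dh_t / dh_k
norms = []
for _ in range(50):
    h = np.tanh(W @ h + rng.normal(size=H))        # one recurrence step
    J = np.diag(1 - h**2) @ W @ J                  # prepend the new factor diag(1-h^2) W
    norms.append(np.linalg.norm(J, 2))             # spectral norm

print(f"{norms[0]:.3f} -> {norms[-1]:.2e}")        # decays roughly like 0.9^n
```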

Gradient accumulation formula

\[ \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \boldsymbol{\delta}_t \mathbf{h}_{t-1}^\top, \qquad \boldsymbol{\delta}_t = \left(1 - \mathbf{h}_t^2\right) \odot \left(\frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t} + W_{hh}^\top \boldsymbol{\delta}_{t+1}\right), \qquad \boldsymbol{\delta}_{T+1} = \mathbf{0} \]
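One way to gain confidence in this backward recursion is to check the accumulated gradient against a finite-difference estimate. A minimal sketch, with a made-up per-step squared-error loss \(\mathcal{L}_t = \tfrac{1}{2}\lVert\mathbf{h}_t - \mathbf{y}_t\rVert^2\) and inputs already projected to the hidden dimension (both assumptions for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
H, T = 8, 20
W = rng.normal(0, 0.4, (H, H))
xs = rng.normal(size=(T, H))        # inputs pre-projected to H dims (assumption)
ys = rng.normal(size=(T, H))        # targets for L_t = 0.5 * ||h_t - y_t||^2

# Forward pass, storing all hidden states (h[0] is the initial state).
h = np.zeros((T + 1, H))
for t in range(T):
    h[t + 1] = np.tanh(W @ h[t] + xs[t])

# Backward pass: delta_t = (1 - h_t^2) * (dL_t/dh_t + W^T delta_{t+1}),
# dL/dW = sum_t outer(delta_t, h_{t-1}).
dW = np.zeros_like(W)
delta_next = np.zeros(H)            # delta_{T+1} = 0
for t in range(T, 0, -1):
    delta = (1 - h[t]**2) * ((h[t] - ys[t - 1]) + W.T @ delta_next)
    dW += np.outer(delta, h[t - 1])
    delta_next = delta

# Finite-difference check of a single weight entry.
def total_loss(Wp):
    hh, L = np.zeros(H), 0.0
    for t in range(T):
        hh = np.tanh(Wp @ hh + xs[t])
        L += 0.5 * np.sum((hh - ys[t])**2)
    return L

eps, i, j = 1e-5, 0, 1
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
num = (total_loss(Wp) - total_loss(Wm)) / (2 * eps)
print(abs(num - dW[i, j]))          # agreement to numerical precision
```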

3. Long Short-Term Memory (LSTM)

The LSTM (Hochreiter & Schmidhuber, 1997) adds a separate cell state \(\mathbf{c}_t\) that flows through time with only multiplicative interactions — no squashing — enabling gradients to flow over hundreds of steps.

3.1 Full Gate Equations

Forget gate: \(\mathbf{f}_t = \sigma\!\left(W_f \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_f\right)\) (how much of the old cell state to keep)

Input gate: \(\mathbf{i}_t = \sigma\!\left(W_i \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_i\right)\) (how much of the new candidate to add)

Cell candidate: \(\tilde{\mathbf{c}}_t = \tanh\!\left(W_c \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_c\right)\) (new information to potentially add)

Cell update: \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\) (gated memory update; key to gradient flow)

Output gate: \(\mathbf{o}_t = \sigma\!\left(W_o \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_o\right)\) (how much of the cell to expose)

Hidden state: \(\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)\) (output to the next layer / time step)
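A single LSTM step, transcribed directly from these equations (shapes and random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D = 4, 3
Z = H + D                                    # concatenated [h_{t-1}; x_t]
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate.
W_f, W_i, W_c, W_o = (rng.normal(0, 0.5, (H, Z)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(H) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    """One LSTM update, following the six gate equations above."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)               # forget gate
    i = sigmoid(W_i @ z + b_i)               # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # cell candidate
    c = f * c_prev + i * c_tilde             # gated memory update
    o = sigmoid(W_o @ z + b_o)               # output gate
    h = o * np.tanh(c)                       # hidden state
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(h, c, x)
print(h.shape, c.shape)
```

Stacking \([\mathbf{h}_{t-1}; \mathbf{x}_t]\) into one vector lets each gate use a single weight matrix, matching the block-matrix form of the equations.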

Why LSTMs solve vanishing gradients

The gradient through the cell-state pathway is \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathrm{diag}(\mathbf{f}_t)\) (treating the gate values as constants). This is an elementwise multiplication with no weight matrix and no squashing nonlinearity, so the gradient magnitude along this path is controlled directly by the forget-gate values. When the forget gate is near 1, gradients flow through almost unchanged; the network can learn to keep gradients alive for as long as the task requires.
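A toy numerical illustration of this point: the cell-path gradient over \(n\) steps is a product of \(n\) forget-gate values, so gates near 1 preserve it while mid-range gates destroy it (the two gate values below are made up):

```python
import numpy as np

# 100 steps; forget gates held constant at 0.99 vs. 0.5 for illustration.
f_open = np.full(100, 0.99)
f_leaky = np.full(100, 0.5)
print(np.prod(f_open))    # ~0.366: the gradient largely survives
print(np.prod(f_leaky))   # ~7.9e-31: the gradient has vanished
```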

4. LSTM Cell Diagram

(Diagram: the LSTM cell. The cell state \(\mathbf{c}_{t-1}\) flows to \(\mathbf{c}_t\) through multiplication by the forget gate \(\sigma(W_f[\mathbf{h},\mathbf{x}])\) and addition of the gated candidate \(\mathbf{i}_t \odot \tilde{\mathbf{c}}_t\); the output gate \(\sigma(W_o[\mathbf{h},\mathbf{x}])\) multiplies \(\tanh(\mathbf{c}_t)\) to produce \(\mathbf{h}_t\). Inputs: \(\mathbf{x}_t, \mathbf{h}_{t-1}\).)

5. GRU: Simplified Gating

The Gated Recurrent Unit (Cho et al., 2014) merges the forget and input gates into one update gate and eliminates the cell state, using only the hidden state:

\[ \mathbf{z}_t = \sigma(W_z[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(update gate)} \]
\[ \mathbf{r}_t = \sigma(W_r[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(reset gate)} \]
\[ \tilde{\mathbf{h}}_t = \tanh(W[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]) \]
\[ \mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t \]
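The four equations transcribe directly to NumPy (biases omitted and sizes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
H, D = 4, 3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W_z, W_r, W_h = (rng.normal(0, 0.5, (H, H + D)) for _ in range(3))

def gru_step(h_prev, x):
    """One GRU update following the four equations above (no biases)."""
    z = sigmoid(W_z @ np.concatenate([h_prev, x]))            # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, x]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate
    return (1 - z) * h_prev + z * h_tilde                     # interpolate

h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(h, x)
print(h.shape)
```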

The GRU has fewer parameters than the LSTM and is often preferred for smaller datasets; in practice the two often reach comparable accuracy.

5.1 Bidirectional RNNs

A bidirectional RNN runs one RNN left-to-right and another right-to-left, concatenating both hidden states: \(\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]\). This gives access to both past and future context at each position — essential for NLP tasks like named entity recognition and machine translation encoders.
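A minimal sketch of this wiring, reusing a vanilla RNN for each direction (all shapes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
H, D, T = 4, 3, 6
Wf, Wb = rng.normal(0, 0.5, (2, H, H + D))   # forward and backward weights

def run(W, xs):
    """Run a vanilla RNN over xs, returning the hidden state at every step."""
    h, out = np.zeros(H), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        out.append(h)
    return np.array(out)

xs = rng.normal(size=(T, D))
h_fwd = run(Wf, xs)                  # left-to-right pass
h_bwd = run(Wb, xs[::-1])[::-1]      # right-to-left pass, realigned by position
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)   # [h_fwd; h_bwd] per position
print(h_bi.shape)                    # (T, 2H): past and future context at each t
```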

6. Python: RNN vs LSTM for Sequence Prediction

We implement both a vanilla RNN and an LSTM from scratch using NumPy. The task is one-step-ahead prediction of a noisy sine wave. We also measure gradient norms as a function of BPTT steps to show the vanishing gradient effect directly.
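The original interactive script is not reproduced here; below is a condensed, self-contained sketch of such an experiment. All sizes, learning rates, initializations, and the +3 forget-gate bias are illustrative choices, not values from the original script:

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Task: one-step-ahead prediction of a noisy sine wave.
T = 60
wave = np.sin(np.arange(T + 1) * 0.2) + 0.05 * rng.normal(size=T + 1)
xs, ys = wave[:-1], wave[1:]              # each input predicts the next sample

# --- Vanilla RNN with a linear readout, trained by full BPTT ---------------
H = 16
W_hh = rng.normal(0, 0.1, (H, H))         # small init: spectral norm < 1
w_xh = rng.normal(0, 0.1, H)
w_hy = rng.normal(0, 0.1, H)

def forward():
    h = np.zeros((T + 1, H))
    for t in range(T):
        h[t + 1] = np.tanh(W_hh @ h[t] + w_xh * xs[t])
    return h, h[1:] @ w_hy                # hidden states, predictions

loss0 = 0.5 * np.mean((forward()[1] - ys) ** 2)
lr = 0.05
for epoch in range(200):
    h, preds = forward()
    err = preds - ys                      # dL_t/d(pred_t) for 0.5 * err_t^2
    dW_hh, dw_xh = np.zeros_like(W_hh), np.zeros_like(w_xh)
    dw_hy = h[1:].T @ err
    delta = np.zeros(H)                   # delta_{T+1} = 0
    for t in range(T, 0, -1):             # the delta recursion from Section 2
        delta = (1 - h[t] ** 2) * (err[t - 1] * w_hy + W_hh.T @ delta)
        dW_hh += np.outer(delta, h[t - 1])
        dw_xh += delta * xs[t - 1]
    W_hh -= lr * dW_hh / T
    w_xh -= lr * dw_xh / T
    w_hy -= lr * dw_hy / T
loss = 0.5 * np.mean((forward()[1] - ys) ** 2)
print(f"RNN mean loss: {loss0:.4f} -> {loss:.4f}")

# --- Gradient norm vs. BPTT lag --------------------------------------------
# RNN: the spectral norm of the Jacobian product dh_T/dh_k decays with lag.
h, _ = forward()
J = np.eye(H)
rnn_norms = []
for t in range(T, 0, -1):
    J = J @ (np.diag(1 - h[t] ** 2) @ W_hh)
    rnn_norms.append(np.linalg.norm(J, 2))

# LSTM: the cell-state path dc_T/dc_k is a product of forget-gate values.
# A +3 forget-gate bias (a common init trick) starts the gates near 1.
Wl = {g: rng.normal(0, 0.3, (H, H + 1)) for g in "fico"}
hl, cl = np.zeros(H), np.zeros(H)
f_prod, lstm_norms = np.ones(H), []
for x in xs:
    z = np.concatenate([hl, [x]])
    f = sigmoid(Wl["f"] @ z + 3.0)
    i = sigmoid(Wl["i"] @ z)
    o = sigmoid(Wl["o"] @ z)
    cl = f * cl + i * np.tanh(Wl["c"] @ z)
    hl = o * np.tanh(cl)
    f_prod = f_prod * f
    lstm_norms.append(np.max(f_prod))
print(f"lag {T}: RNN grad norm {rnn_norms[-1]:.2e}, "
      f"LSTM cell-path factor {lstm_norms[-1]:.2e}")
```

The RNN's gradient norm at lag \(T\) is many orders of magnitude below the LSTM's cell-path factor, showing the vanishing-gradient effect directly.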
