Chapter 16: Recurrent Networks

Recurrent neural networks process sequences by maintaining a hidden state that accumulates information from all past inputs. The vanishing gradient problem plagued early RNNs; the LSTM's gating mechanism was the breakthrough that made deep sequence learning practical.

1. The Vanilla RNN

At each time step \(t\), the RNN reads input \(\mathbf{x}_t\) and updates its hidden state:

\[ \mathbf{h}_t = \tanh\!\left(W_{hh}\mathbf{h}_{t-1} + W_{xh}\mathbf{x}_t + \mathbf{b}_h\right) \]\[ \hat{\mathbf{y}}_t = W_{hy}\mathbf{h}_t + \mathbf{b}_y \]

The same weights \(W_{hh}, W_{xh}\) are reused at every step. This parameter sharing is what lets an RNN process sequences of arbitrary length with a fixed number of parameters.
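As a concrete sketch, one time step can be written in a few lines of NumPy (the sizes and random initialization below are illustrative assumptions, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                          # hidden size, input size (arbitrary)
W_hh = rng.normal(0, 0.5, (H, H))    # hidden-to-hidden weights
W_xh = rng.normal(0, 0.5, (H, D))    # input-to-hidden weights
b_h = np.zeros(H)

def rnn_step(h_prev, x):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b_h)

# The same W_hh, W_xh are reused at every step (parameter sharing),
# so the loop handles a sequence of any length.
h = np.zeros(H)
for x in rng.normal(size=(5, D)):    # a length-5 toy sequence
    h = rnn_step(h, x)
print(h)                             # final hidden state, each component in (-1, 1)
```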

2. Backpropagation Through Time (BPTT)

For a loss \(\mathcal{L} = \sum_t \mathcal{L}_t\) over all steps, the gradient with respect to \(W_{hh}\) requires the chain rule unrolled through time:

\[ \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t} \, \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} \, \frac{\partial^{+} \mathbf{h}_k}{\partial W_{hh}} \]

where \(\partial^{+}\mathbf{h}_k/\partial W_{hh}\) denotes the immediate partial derivative, treating \(\mathbf{h}_{k-1}\) as a constant. Every earlier step \(k \le t\) contributes, because \(W_{hh}\) is reused at every step.

The gradient of \(\mathbf{h}_t\) with respect to \(\mathbf{h}_k\) (for \(k < t\)) is:

\[ \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} = \prod_{j=k+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} = \prod_{j=k+1}^{t} \mathrm{diag}(1 - \mathbf{h}_j^2)\cdot W_{hh} \]

This is a product of \(t-k\) Jacobian matrices. Because \(|\tanh'| \le 1\), if the largest singular value \(\lambda_1\) of \(W_{hh}\) satisfies \(\lambda_1 < 1\), the product shrinks exponentially in \(t-k\): vanishing gradients. If \(\lambda_1 > 1\), the product can instead grow exponentially: exploding gradients. Either way, learning dependencies that span many time steps becomes extremely difficult.
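This shrinkage is easy to observe numerically. The sketch below (random weights rescaled so that \(\lambda_1 = 0.9\), an arbitrary choice for illustration) accumulates the Jacobian product from the formula above and tracks its spectral norm:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 16
W = rng.normal(size=(H, H))
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]   # rescale so lambda_1 = 0.9

h = np.zeros(H)
J = np.eye(H)          # running product of Jacobians dh_t / dh_k
norms = []
for _ in range(50):
    h = np.tanh(W @ h + rng.normal(size=H))        # one recurrence step
    J = np.diag(1 - h**2) @ W @ J                  # prepend the new factor diag(1-h^2) W
    norms.append(np.linalg.norm(J, 2))             # spectral norm

print(f"{norms[0]:.3f} -> {norms[-1]:.2e}")        # decays roughly like 0.9^n
```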

Gradient accumulation formula

\[ \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \boldsymbol{\delta}_t \mathbf{h}_{t-1}^\top, \qquad \boldsymbol{\delta}_t = \left(1 - \mathbf{h}_t^2\right) \odot \left(\frac{\partial \mathcal{L}_t}{\partial \mathbf{h}_t} + W_{hh}^\top \boldsymbol{\delta}_{t+1}\right), \qquad \boldsymbol{\delta}_{T+1} = \mathbf{0} \]
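One way to gain confidence in this backward recursion is to check the accumulated gradient against a finite-difference estimate. A minimal sketch, with a made-up per-step squared-error loss \(\mathcal{L}_t = \tfrac{1}{2}\lVert\mathbf{h}_t - \mathbf{y}_t\rVert^2\) and inputs already projected to the hidden dimension (both assumptions for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
H, T = 8, 20
W = rng.normal(0, 0.4, (H, H))
xs = rng.normal(size=(T, H))        # inputs pre-projected to H dims (assumption)
ys = rng.normal(size=(T, H))        # targets for L_t = 0.5 * ||h_t - y_t||^2

# Forward pass, storing all hidden states (h[0] is the initial state).
h = np.zeros((T + 1, H))
for t in range(T):
    h[t + 1] = np.tanh(W @ h[t] + xs[t])

# Backward pass: delta_t = (1 - h_t^2) * (dL_t/dh_t + W^T delta_{t+1}),
# dL/dW = sum_t outer(delta_t, h_{t-1}).
dW = np.zeros_like(W)
delta_next = np.zeros(H)            # delta_{T+1} = 0
for t in range(T, 0, -1):
    delta = (1 - h[t]**2) * ((h[t] - ys[t - 1]) + W.T @ delta_next)
    dW += np.outer(delta, h[t - 1])
    delta_next = delta

# Finite-difference check of a single weight entry.
def total_loss(Wp):
    hh, L = np.zeros(H), 0.0
    for t in range(T):
        hh = np.tanh(Wp @ hh + xs[t])
        L += 0.5 * np.sum((hh - ys[t])**2)
    return L

eps, i, j = 1e-5, 0, 1
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
num = (total_loss(Wp) - total_loss(Wm)) / (2 * eps)
print(abs(num - dW[i, j]))          # agreement to numerical precision
```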

3. Long Short-Term Memory (LSTM)

The LSTM (Hochreiter & Schmidhuber, 1997) adds a separate cell state \(\mathbf{c}_t\) that flows through time with only multiplicative interactions — no squashing — enabling gradients to flow over hundreds of steps.

3.1 Full Gate Equations

Forget gate: \(\mathbf{f}_t = \sigma\!\left(W_f \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_f\right)\) (how much of the old cell state to keep)

Input gate: \(\mathbf{i}_t = \sigma\!\left(W_i \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_i\right)\) (how much of the new candidate to add)

Cell candidate: \(\tilde{\mathbf{c}}_t = \tanh\!\left(W_c \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_c\right)\) (new information to potentially add)

Cell update: \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\) (gated memory update; key to gradient flow)

Output gate: \(\mathbf{o}_t = \sigma\!\left(W_o \begin{bmatrix}\mathbf{h}_{t-1}\\\mathbf{x}_t\end{bmatrix} + \mathbf{b}_o\right)\) (how much of the cell to expose)

Hidden state: \(\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)\) (output to the next layer / time step)
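A single LSTM step, transcribed directly from these equations (shapes and random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D = 4, 3
Z = H + D                                    # concatenated [h_{t-1}; x_t]
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate.
W_f, W_i, W_c, W_o = (rng.normal(0, 0.5, (H, Z)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(H) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    """One LSTM update, following the six gate equations above."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b_f)               # forget gate
    i = sigmoid(W_i @ z + b_i)               # input gate
    c_tilde = np.tanh(W_c @ z + b_c)         # cell candidate
    c = f * c_prev + i * c_tilde             # gated memory update
    o = sigmoid(W_o @ z + b_o)               # output gate
    h = o * np.tanh(c)                       # hidden state
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(h, c, x)
print(h.shape, c.shape)
```

Stacking \([\mathbf{h}_{t-1}; \mathbf{x}_t]\) into one vector lets each gate use a single weight matrix, matching the block-matrix form of the equations.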

Why LSTMs solve vanishing gradients

The gradient through the cell-state pathway is \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathrm{diag}(\mathbf{f}_t)\) (treating the gate values as constants). This is an elementwise multiplication with no weight matrix and no squashing nonlinearity, so the gradient magnitude along this path is controlled directly by the forget-gate values. When the forget gate is near 1, gradients flow through almost unchanged; the network can learn to keep gradients alive for as long as the task requires.
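A toy numerical illustration of this point: the cell-path gradient over \(n\) steps is a product of \(n\) forget-gate values, so gates near 1 preserve it while mid-range gates destroy it (the two gate values below are made up):

```python
import numpy as np

# 100 steps; forget gates held constant at 0.99 vs. 0.5 for illustration.
f_open = np.full(100, 0.99)
f_leaky = np.full(100, 0.5)
print(np.prod(f_open))    # ~0.366: the gradient largely survives
print(np.prod(f_leaky))   # ~7.9e-31: the gradient has vanished
```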

4. LSTM Cell Diagram

(Diagram: the LSTM cell. The cell state \(\mathbf{c}_{t-1}\) flows to \(\mathbf{c}_t\) through multiplication by the forget gate \(\sigma(W_f[\mathbf{h},\mathbf{x}])\) and addition of the gated candidate \(\mathbf{i}_t \odot \tilde{\mathbf{c}}_t\); the output gate \(\sigma(W_o[\mathbf{h},\mathbf{x}])\) multiplies \(\tanh(\mathbf{c}_t)\) to produce \(\mathbf{h}_t\). Inputs: \(\mathbf{x}_t, \mathbf{h}_{t-1}\).)

5. GRU: Simplified Gating

The Gated Recurrent Unit (Cho et al., 2014) merges the forget and input gates into one update gate and eliminates the cell state, using only the hidden state:

\[ \mathbf{z}_t = \sigma(W_z[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(update gate)} \]
\[ \mathbf{r}_t = \sigma(W_r[\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(reset gate)} \]
\[ \tilde{\mathbf{h}}_t = \tanh(W[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]) \]
\[ \mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t \]
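The four equations transcribe directly to NumPy (biases omitted and sizes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
H, D = 4, 3
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W_z, W_r, W_h = (rng.normal(0, 0.5, (H, H + D)) for _ in range(3))

def gru_step(h_prev, x):
    """One GRU update following the four equations above (no biases)."""
    z = sigmoid(W_z @ np.concatenate([h_prev, x]))            # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, x]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate
    return (1 - z) * h_prev + z * h_tilde                     # interpolate

h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(h, x)
print(h.shape)
```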

The GRU has fewer parameters than the LSTM and is often preferred for smaller datasets; in practice the two often reach comparable accuracy.

5.1 Bidirectional RNNs

A bidirectional RNN runs one RNN left-to-right and another right-to-left, concatenating both hidden states: \(\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t; \overleftarrow{\mathbf{h}}_t]\). This gives access to both past and future context at each position — essential for NLP tasks like named entity recognition and machine translation encoders.
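A minimal sketch of this wiring, reusing a vanilla RNN for each direction (all shapes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
H, D, T = 4, 3, 6
Wf, Wb = rng.normal(0, 0.5, (2, H, H + D))   # forward and backward weights

def run(W, xs):
    """Run a vanilla RNN over xs, returning the hidden state at every step."""
    h, out = np.zeros(H), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        out.append(h)
    return np.array(out)

xs = rng.normal(size=(T, D))
h_fwd = run(Wf, xs)                  # left-to-right pass
h_bwd = run(Wb, xs[::-1])[::-1]      # right-to-left pass, realigned by position
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)   # [h_fwd; h_bwd] per position
print(h_bi.shape)                    # (T, 2H): past and future context at each t
```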

6. Python: RNN vs LSTM for Sequence Prediction

We implement both a vanilla RNN and an LSTM from scratch using NumPy. The task is one-step-ahead prediction of a noisy sine wave. We also measure gradient norms as a function of BPTT steps to show the vanishing gradient effect directly.
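The original interactive script is not reproduced here; below is a condensed, self-contained sketch of such an experiment. All sizes, learning rates, initializations, and the +3 forget-gate bias are illustrative choices, not values from the original script:

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Task: one-step-ahead prediction of a noisy sine wave.
T = 60
wave = np.sin(np.arange(T + 1) * 0.2) + 0.05 * rng.normal(size=T + 1)
xs, ys = wave[:-1], wave[1:]              # each input predicts the next sample

# --- Vanilla RNN with a linear readout, trained by full BPTT ---------------
H = 16
W_hh = rng.normal(0, 0.1, (H, H))         # small init: spectral norm < 1
w_xh = rng.normal(0, 0.1, H)
w_hy = rng.normal(0, 0.1, H)

def forward():
    h = np.zeros((T + 1, H))
    for t in range(T):
        h[t + 1] = np.tanh(W_hh @ h[t] + w_xh * xs[t])
    return h, h[1:] @ w_hy                # hidden states, predictions

loss0 = 0.5 * np.mean((forward()[1] - ys) ** 2)
lr = 0.05
for epoch in range(200):
    h, preds = forward()
    err = preds - ys                      # dL_t/d(pred_t) for 0.5 * err_t^2
    dW_hh, dw_xh = np.zeros_like(W_hh), np.zeros_like(w_xh)
    dw_hy = h[1:].T @ err
    delta = np.zeros(H)                   # delta_{T+1} = 0
    for t in range(T, 0, -1):             # the delta recursion from Section 2
        delta = (1 - h[t] ** 2) * (err[t - 1] * w_hy + W_hh.T @ delta)
        dW_hh += np.outer(delta, h[t - 1])
        dw_xh += delta * xs[t - 1]
    W_hh -= lr * dW_hh / T
    w_xh -= lr * dw_xh / T
    w_hy -= lr * dw_hy / T
loss = 0.5 * np.mean((forward()[1] - ys) ** 2)
print(f"RNN mean loss: {loss0:.4f} -> {loss:.4f}")

# --- Gradient norm vs. BPTT lag --------------------------------------------
# RNN: the spectral norm of the Jacobian product dh_T/dh_k decays with lag.
h, _ = forward()
J = np.eye(H)
rnn_norms = []
for t in range(T, 0, -1):
    J = J @ (np.diag(1 - h[t] ** 2) @ W_hh)
    rnn_norms.append(np.linalg.norm(J, 2))

# LSTM: the cell-state path dc_T/dc_k is a product of forget-gate values.
# A +3 forget-gate bias (a common init trick) starts the gates near 1.
Wl = {g: rng.normal(0, 0.3, (H, H + 1)) for g in "fico"}
hl, cl = np.zeros(H), np.zeros(H)
f_prod, lstm_norms = np.ones(H), []
for x in xs:
    z = np.concatenate([hl, [x]])
    f = sigmoid(Wl["f"] @ z + 3.0)
    i = sigmoid(Wl["i"] @ z)
    o = sigmoid(Wl["o"] @ z)
    cl = f * cl + i * np.tanh(Wl["c"] @ z)
    hl = o * np.tanh(cl)
    f_prod = f_prod * f
    lstm_norms.append(np.max(f_prod))
print(f"lag {T}: RNN grad norm {rnn_norms[-1]:.2e}, "
      f"LSTM cell-path factor {lstm_norms[-1]:.2e}")
```

The RNN's gradient norm at lag \(T\) is many orders of magnitude below the LSTM's cell-path factor, showing the vanishing-gradient effect directly.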
