Chapter 20: Graph Neural Networks
Graph neural networks extend deep learning to irregular, relational data structures. We derive the spectral convolution framework from the graph Laplacian, trace the Chebyshev approximation to the simplified GCN propagation rule, and implement a two-layer GCN that classifies nodes using only the graph structure.
1. Graph Representation
A graph \( \mathcal{G} = (\mathcal{V}, \mathcal{E}) \) has \( N = |\mathcal{V}| \) nodes and edges \( (i,j) \in \mathcal{E} \). We represent it with three matrices:
Adjacency Matrix
\( A_{ij} = 1 \) if \( (i,j) \in \mathcal{E} \), else \( A_{ij} = 0 \). Symmetric for undirected graphs.
Degree Matrix
Diagonal matrix of node degrees. Off-diagonal entries are 0.
Feature Matrix
Row i contains the d-dimensional feature vector for node i.
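To make these three matrices concrete, here is a minimal NumPy sketch for a small undirected path graph (the graph and the one-hot placeholder features are illustrative assumptions):

```python
import numpy as np

# A 4-node undirected path graph 0 - 1 - 2 - 3 (graph and features are
# illustrative assumptions).
edges = [(0, 1), (1, 2), (2, 3)]
N = 4

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0        # symmetric: undirected edges

D = np.diag(A.sum(axis=1))         # degree matrix from the row sums of A
X = np.eye(N)                      # placeholder one-hot feature matrix (d = N)

print(np.diag(D))   # [1. 2. 2. 1.]
```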
2. Message Passing Framework
Modern GNNs share a unified message-passing abstraction. At layer \( \ell \), each node \( v \) updates its representation by:
\[ \mathbf{m}_v^{(\ell)} = \text{AGGREGATE}^{(\ell)}\!\left(\left\{ \mathbf{h}_u^{(\ell-1)} : u \in \mathcal{N}(v) \right\}\right), \qquad \mathbf{h}_v^{(\ell)} = \text{UPDATE}^{(\ell)}\!\left(\mathbf{h}_v^{(\ell-1)}, \mathbf{m}_v^{(\ell)}\right) \]
Different GNN variants differ only in how AGGREGATE and UPDATE are defined. Mean, max, and sum aggregation correspond to distinct inductive biases. The key insight is that after \( K \) layers, each node's representation captures its \( K \)-hop neighbourhood.
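One instance of this abstraction can be sketched directly. In the illustrative example below we assume mean aggregation and an UPDATE that applies a linear map to the concatenation of a node's previous state and its aggregated message, followed by ReLU; the function name `mp_layer` and all dimensions are our choices:

```python
import numpy as np

def mp_layer(A, H, W):
    # AGGREGATE: mean of neighbour representations (one illustrative choice)
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    M = (A @ H) / deg
    # UPDATE: linear map of [own state || aggregated message], then ReLU
    return np.maximum(np.concatenate([H, M], axis=1) @ W, 0)

# toy triangle graph with one-hot initial features
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
H0 = np.eye(3)
W = np.full((6, 2), 0.5)           # arbitrary fixed weights for the demo
H1 = mp_layer(A, H0, W)
print(H1.shape)   # (3, 2)
```

Stacking \( K \) such layers widens each node's receptive field to its \( K \)-hop neighbourhood, as noted above.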
3. Spectral Graph Theory → GCN: Full Derivation
3.1 Graph Laplacian
The graph Laplacian is defined as \( L = D - A \).
\( L \) is symmetric positive semi-definite and admits the eigendecomposition \( L = U \Lambda U^T \), where \( U \in \mathbb{R}^{N\times N} \) holds orthonormal eigenvectors and \( \Lambda = \text{diag}(\lambda_1,\ldots,\lambda_N) \) with \( 0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \). The normalised Laplacian is \( I - D^{-1/2} A D^{-1/2} \).
3.2 Spectral Graph Convolution
The graph Fourier transform of a signal \( \mathbf{x} \in \mathbb{R}^N \) is \( \hat{\mathbf{x}} = U^T \mathbf{x} \). Spectral convolution with filter \( g_\theta \) is:
\[ g_\theta \star \mathbf{x} = U\, g_\theta(\Lambda)\, U^T \mathbf{x} \]
This requires computing the full eigendecomposition, costing \( O(N^3) \), and the resulting filter is not spatially localised. Chebyshev polynomials provide a tractable approximation.
3.3 Chebyshev Approximation
Approximate \( g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \), where \( \tilde{\Lambda} = \frac{2}{\lambda_{\max}}\Lambda - I \) and \( T_k \) are Chebyshev polynomials. This gives a \( K \)-localised filter computable in \( O(K|\mathcal{E}|) \) (Defferrard et al., 2016).
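The recurrence \( T_0(x) = 1 \), \( T_1(x) = x \), \( T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x) \) lets the filter be applied with one sparse matrix-vector product per order and no eigendecomposition. A minimal sketch (the function name `cheb_filter` is ours):

```python
import numpy as np

def cheb_filter(L, x, theta, lam_max=2.0):
    """Apply sum_k theta_k T_k(L_tilde) x via the three-term Chebyshev
    recurrence; one matrix-vector product per order, no eigendecomposition."""
    L_t = (2.0 / lam_max) * L - np.eye(L.shape[0])   # rescale spectrum into [-1, 1]
    Tkm2, Tkm1 = x, L_t @ x                          # T_0 x and T_1 x
    out = theta[0] * Tkm2
    if len(theta) > 1:
        out = out + theta[1] * Tkm1
    for k in range(2, len(theta)):
        Tk = 2.0 * (L_t @ Tkm1) - Tkm2               # T_k = 2 L_t T_{k-1} - T_{k-2}
        out = out + theta[k] * Tk
        Tkm2, Tkm1 = Tkm1, Tk
    return out

# sanity check on a 3-node path: theta = [1] keeps only the T_0 (identity) term
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
Dis = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(3) - Dis @ A @ Dis                        # normalised Laplacian
x = np.array([1.0, -2.0, 3.0])
print(np.allclose(cheb_filter(L, x, [1.0]), x))      # True
```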
3.4 Kipf & Welling Simplification → GCN
Set \( K=1 \), take \( \lambda_{\max} \approx 2 \), and use the normalised Laplacian:
\[ g_\theta \star \mathbf{x} \approx \theta_0 \mathbf{x} - \theta_1 D^{-1/2} A D^{-1/2} \mathbf{x} \]
Constrain \( \theta = \theta_0 = -\theta_1 \) to reduce parameters:
\[ g_\theta \star \mathbf{x} \approx \theta \left( I + D^{-1/2} A D^{-1/2} \right) \mathbf{x} \]
Apply the renormalisation trick (add self-loops, \( \tilde{A} = A + I \), and recompute degrees, \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \)) to avoid numerical instability:
\[ H^{(\ell+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(\ell)} W^{(\ell)} \right) \]
where \( \tilde{A} = A + I_N \), \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \), \( W^{(\ell)} \) is a trainable weight matrix, and \( \sigma \) is a non-linearity (e.g. ReLU). This is the GCN layer (Kipf & Welling, 2017).
Interpretation
The normalised adjacency \( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) performs a symmetric normalised mean aggregation over each node's neighbourhood (including itself). The \( \tilde{D}^{-1/2} \) factors on both sides prevent high-degree nodes from dominating the aggregation. The weight matrix \( W^{(\ell)} \) provides a learnable linear projection in feature space.
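The layer above is a few lines of NumPy. This sketch (the function names `gcn_norm` and `gcn_layer` are ours) builds the renormalised adjacency and applies one propagation step; on a regular graph such as a triangle, every entry of \( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) equals \( 1/3 \):

```python
import numpy as np

def gcn_norm(A):
    """Renormalisation trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_t = A + np.eye(A.shape[0])
    d = 1.0 / np.sqrt(A_t.sum(axis=1))
    return A_t * d[:, None] * d[None, :]

def gcn_layer(A_hat, H, W):
    """One GCN layer: H' = ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0)

# triangle graph: 3-regular after self-loops, so A_hat = (A + I) / 3
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
A_hat = gcn_norm(A)
H = gcn_layer(A_hat, np.eye(3), np.ones((3, 2)))
print(A_hat[0, 0])   # 1/3 for this regular graph
```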
4. Graph Attention Networks (GAT)
GCN uses a fixed symmetric normalisation. GAT (Velickovic et al., 2018) replaces it with learned attention coefficients. The unnormalised attention from node \( j \) to node \( i \) is:
\[ e_{ij} = \text{LeakyReLU}\!\left( \mathbf{a}^T \left[ W\mathbf{h}_i \,\|\, W\mathbf{h}_j \right] \right) \]
where \( \mathbf{a} \in \mathbb{R}^{2d'} \) is a learnable attention vector and \( \| \) denotes concatenation. These scores are normalised over the neighbourhood via softmax:
\[ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})} \]
Multi-head attention concatenates or averages \( K \) independent attention mechanisms, stabilising training. GAT assigns different importance to different neighbours, making it more expressive than GCN for heterogeneous graphs.
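A single attention head can be sketched as follows. The score \( \mathbf{a}^T[W\mathbf{h}_i \| W\mathbf{h}_j] \) splits into two dot products, so all pairwise scores can be computed at once and non-edges masked out before the softmax; the LeakyReLU slope of 0.2 and the inclusion of each node in its own neighbourhood follow the paper, while the function name and dimensions are our choices:

```python
import numpy as np

def gat_attention(A, H, W, a):
    """Single-head GAT attention coefficients (sketch):
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), softmax over each neighbourhood."""
    Z = H @ W                                   # projected features, N x d'
    N, dp = Z.shape
    # a^T [z_i || z_j] = a_left . z_i + a_right . z_j
    e = (Z @ a[:dp])[:, None] + (Z @ a[dp:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)             # LeakyReLU, slope 0.2 as in the paper
    mask = (A + np.eye(N)) > 0                  # attend over edges plus self-loop
    e = np.where(mask, e, -np.inf)
    e = np.exp(e - e.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)   # 3-node path
alpha = gat_attention(A, rng.normal(size=(3, 4)),
                      rng.normal(size=(4, 2)), rng.normal(size=4))
print(alpha.sum(axis=1))   # each row sums to 1
```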
5. Graph-Level Tasks & Readout
Node classification uses the final node embeddings directly. For graph-level tasks (e.g. molecular property prediction), we need a readout/pooling function that aggregates all node representations into a single graph vector:
Global Sum Pool
\( \mathbf{h}_{\mathcal{G}} = \sum_{v \in \mathcal{V}} \mathbf{h}_v \). Simple, but its magnitude scales with graph size.
Global Mean Pool
\( \mathbf{h}_{\mathcal{G}} = \frac{1}{N} \sum_{v \in \mathcal{V}} \mathbf{h}_v \). Invariant to graph size; may lose count information.
Hierarchical (DiffPool)
Learned soft cluster assignments; captures hierarchy.
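Sum and mean readout are one-liners. The sketch below (the function name `readout` is ours) also demonstrates the trade-off noted above: duplicating every node leaves the mean unchanged but doubles the sum:

```python
import numpy as np

def readout(H, mode="mean"):
    """Aggregate node embeddings H (N x d) into one graph-level vector."""
    if mode == "sum":
        return H.sum(axis=0)
    if mode == "mean":
        return H.mean(axis=0)
    raise ValueError(f"unknown readout: {mode}")

# mean pooling is size-invariant; sum pooling scales with graph size
H = np.array([[1.0, 2.0], [3.0, 4.0]])
H_doubled = np.vstack([H, H])
print(readout(H, "mean"), readout(H_doubled, "mean"))   # identical
print(readout(H, "sum"), readout(H_doubled, "sum"))     # second is twice the first
```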
Applications: molecular property prediction (QM9, PCQM4M), drug discovery, protein structure prediction (residue-level GNNs), citation network classification, knowledge graph completion, and social network community detection.
Python Simulation: GCN on Karate Club
We implement a two-layer GCN from scratch using only NumPy, train it on the Zachary Karate Club graph with just two labelled nodes (node 0 and node 33), and visualise the learned 2D node embeddings. The GCN propagates label information through the graph structure: this is semi-supervised learning.