Part VII — Advanced Topics

Chapter 20: Graph Neural Networks

Graph neural networks extend deep learning to irregular, relational data structures. We derive the spectral convolution framework from the graph Laplacian, trace the Chebyshev approximation to the simplified GCN propagation rule, and implement a two-layer GCN that classifies nodes using only the graph structure.

1. Graph Representation

A graph \( \mathcal{G} = (\mathcal{V}, \mathcal{E}) \) has \( N = |\mathcal{V}| \) nodes and edges \( (i,j) \in \mathcal{E} \). We represent it with three matrices:

Adjacency Matrix

\( A \in \{0,1\}^{N\times N} \)

\( A_{ij} = 1 \) if \( (i,j) \in \mathcal{E} \), else \( 0 \). Symmetric for undirected graphs.

Degree Matrix

\( D_{ii} = \sum_j A_{ij} \)

Diagonal matrix of node degrees. Off-diagonal entries are 0.

Feature Matrix

\( X \in \mathbb{R}^{N\times d} \)

Row i contains the d-dimensional feature vector for node i.
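All three matrices are straightforward to build in NumPy. A minimal sketch on a hypothetical four-node graph (the edge list and one-hot features are illustrative choices, not from the text):

```python
import numpy as np

# Toy undirected graph with edges 0-1, 0-2, 1-2, 2-3 (hypothetical example)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
N = 4

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0        # symmetric: the graph is undirected

D = np.diag(A.sum(axis=1))         # diagonal degree matrix, D_ii = sum_j A_ij

X = np.eye(N)                      # placeholder one-hot node features (d = N)

print(np.diag(D))                  # node degrees: [2. 2. 3. 1.]
```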

2. Message Passing Framework

Modern GNNs share a unified message passing abstraction. At layer \( \ell \), each node \( v \) updates its representation by:

\[ \mathbf{m}_v^{(\ell)} = \texttt{AGGREGATE}\!\left(\left\{\mathbf{h}_u^{(\ell)} : u \in \mathcal{N}(v)\right\}\right) \]
\[ \mathbf{h}_v^{(\ell+1)} = \texttt{UPDATE}\!\left(\mathbf{h}_v^{(\ell)},\, \mathbf{m}_v^{(\ell)}\right) \]
[Figure: node \( v \) receives messages from its neighbours \( u_1, u_2, u_3, u_4 \), aggregates them, and updates its representation \( \mathbf{h}_v^{(\ell+1)} = \texttt{UPDATE}(\mathbf{h}_v^{(\ell)}, \texttt{AGGREGATE}(\{\mathbf{h}_u : u \in \mathcal{N}(v)\})) \).]

Different GNN variants differ only in how AGGREGATE and UPDATE are defined. Mean, max, and sum aggregation correspond to distinct inductive biases. The key insight is that after \( K \) layers, each node's representation captures its \( K \)-hop neighbourhood.
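As one concrete instance of the abstraction, here is a sketch of a mean-aggregation layer; the function names and the ReLU update are our illustrative choices, not a specific published variant:

```python
import numpy as np

def mean_aggregate(A, H):
    """AGGREGATE: average the representations of each node's neighbours."""
    deg = A.sum(axis=1, keepdims=True)
    return (A @ H) / np.maximum(deg, 1.0)      # guard against isolated nodes

def gnn_layer(A, H, W_self, W_msg):
    """UPDATE: combine each node's own state with its aggregated message (ReLU)."""
    M = mean_aggregate(A, H)
    return np.maximum(H @ W_self + M @ W_msg, 0.0)
```

Swapping `mean_aggregate` for a sum or element-wise max yields the other aggregation schemes mentioned above, each with its own inductive bias.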

3. Spectral Graph Theory — GCN Full Derivation

3.1 Graph Laplacian

The graph Laplacian is defined as:

\[ L = D - A \]

\( L \) is symmetric positive semi-definite and admits eigendecomposition \( L = U \Lambda U^T \) where \( U \in \mathbb{R}^{N\times N} \) are orthonormal eigenvectors and \( \Lambda = \text{diag}(\lambda_1,\ldots,\lambda_N) \) with \( 0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \). The normalised Laplacian is:

\[ \tilde{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2} \]
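Both definitions and the positive semi-definiteness claim are easy to verify numerically. A sketch on a small connected graph (the particular edge list is arbitrary):

```python
import numpy as np

# Adjacency of a small connected graph (edges 0-1, 0-2, 1-2, 2-3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                      # combinatorial Laplacian L = D - A

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt           # I - D^{-1/2} A D^{-1/2}

lam, U = np.linalg.eigh(L)                     # eigendecomposition L = U diag(lam) U^T
print(lam.min())                               # ~0: L is PSD with a zero eigenvalue
```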

3.2 Spectral Graph Convolution

The graph Fourier transform of a signal \( \mathbf{x} \in \mathbb{R}^N \) is \( \hat{\mathbf{x}} = U^T \mathbf{x} \). Spectral convolution with filter \( g_\theta \) is:

\[ g_\theta \star \mathbf{x} = U\, g_\theta(\Lambda)\, U^T \mathbf{x} \]

This requires computing the full eigendecomposition — \( O(N^3) \) and non-local. Chebyshev polynomials provide a tractable approximation.
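For intuition, the full spectral pipeline can still be run on a tiny graph where the \( O(N^3) \) eigendecomposition is affordable. A sketch on a hypothetical 4-node path graph, with a heat-kernel filter chosen purely for illustration:

```python
import numpy as np

# Normalised Laplacian of a 4-node path graph 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

lam, U = np.linalg.eigh(L)                  # graph Fourier basis

def spectral_filter(x, g):
    """g_theta * x = U g(Lambda) U^T x: filter in the graph Fourier domain."""
    return U @ (g(lam) * (U.T @ x))

x = np.array([1.0, -2.0, 3.0, 0.5])
smoothed = spectral_filter(x, lambda l: np.exp(-l))   # low-pass (heat-kernel) filter
```

Choosing \( g(\lambda) = 1 \) recovers \( \mathbf{x} \) itself, and \( g(\lambda) = \lambda \) recovers \( L\mathbf{x} \), which is a quick sanity check on the transform.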

3.3 Chebyshev Approximation

Approximate \( g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \) where \( \tilde{\Lambda} = \frac{2}{\lambda_{\max}}\Lambda - I \) and \( T_k \) are Chebyshev polynomials. This gives a \( K \)-localised filter computable in \( O(K|\mathcal{E}|) \) (Defferrard et al., 2016).
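The Chebyshev recurrence \( T_0(x) = 1 \), \( T_1(x) = x \), \( T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x) \) lets the filter be applied with \( K \) matrix–vector products and no eigendecomposition. A sketch (the function name is ours):

```python
import numpy as np

def cheb_filter(L, x, theta, lam_max=2.0):
    """K-localised Chebyshev filter: sum_k theta_k T_k(L_scaled) x,
    computed via T_k = 2 L_scaled T_{k-1} - T_{k-2} using only matvecs."""
    L_scaled = (2.0 / lam_max) * L - np.eye(L.shape[0])
    Tx_prev, Tx = x, L_scaled @ x              # T_0 x and T_1 x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out = out + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_prev, Tx = Tx, 2.0 * L_scaled @ Tx - Tx_prev
        out = out + theta[k] * Tx
    return out
```

With a sparse \( L \), each `L_scaled @ Tx` costs \( O(|\mathcal{E}|) \), giving the stated \( O(K|\mathcal{E}|) \) total.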

3.4 Kipf & Welling Simplification → GCN

Set \( K=1 \) and \( \lambda_{\max} \approx 2 \):

\[ g_\theta \star \mathbf{x} \approx \theta_0 \mathbf{x} + \theta_1 (\tilde{L} - I)\mathbf{x} = \theta_0 \mathbf{x} - \theta_1 D^{-1/2}AD^{-1/2}\mathbf{x} \]

Constrain \( \theta = \theta_0 = -\theta_1 \) to reduce parameters:

\[ \approx \theta\left(I + D^{-1/2}AD^{-1/2}\right)\mathbf{x} \]

Apply renormalisation trick — add self-loops \( \tilde{A} = A + I \), recompute degree \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \) — to avoid numerical instability:

\[ \boxed{H^{(\ell+1)} = \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\, H^{(\ell)}\, W^{(\ell)}\right)} \]

where \( \tilde{A} = A + I_N \), \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \), \( W^{(\ell)} \) is a trainable weight matrix, and \( \sigma \) is a non-linearity (e.g. ReLU). This is the GCN layer (Kipf & Welling, 2017).

Interpretation

The normalised adjacency \( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) performs a symmetric normalised mean aggregation over each node's neighbourhood (including itself). The \( \tilde{D}^{-1/2} \) factors on both sides prevent high-degree nodes from dominating the aggregation. The weight matrix \( W^{(\ell)} \) provides a learnable linear projection in feature space.
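The boxed propagation rule fits in a few lines of NumPy. A dense sketch (real implementations use sparse matrices for \( \hat{A} \)):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W), with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])             # renormalisation trick: A~ = A + I
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)        # sigma = ReLU
```

Note that \( \hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) depends only on the graph, so it can be precomputed once and reused across layers and epochs.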

4. Graph Attention Networks (GAT)

GCN uses fixed symmetric normalisation. GAT (Veličković et al., 2018) replaces it with learned attention coefficients. The unnormalised attention from node \( j \) to \( i \) is:

\[ e_{ij} = \text{LeakyReLU}\!\left(\mathbf{a}^T\!\left[W\mathbf{h}_i \,\|\, W\mathbf{h}_j\right]\right) \]

where \( \mathbf{a} \in \mathbb{R}^{2d'} \) is a learnable attention vector and \( \| \) denotes concatenation. These are normalised over the neighbourhood via softmax:

\[ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \quad \mathbf{h}_i' = \sigma\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W\mathbf{h}_j\right) \]

Multi-head attention concatenates or averages \( K \) independent attention mechanisms, stabilising training. GAT assigns different importance to different neighbours, making it more expressive than GCN for heterogeneous graphs.
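A single-head GAT layer can be sketched with dense masking; the LeakyReLU slope of 0.2 follows the paper, self-loops are included so each node attends to itself, and everything else is an illustrative simplification:

```python
import numpy as np

def gat_layer(A, H, W, a):
    """Single-head GAT layer sketch. `a` has length 2*d_out."""
    N = A.shape[0]
    Z = H @ W                                    # W h_i for every node
    d_out = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]), split into the "i" and "j" halves
    e = (Z @ a[:d_out])[:, None] + (Z @ a[d_out:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU, negative slope 0.2
    mask = (A + np.eye(N)) > 0                   # attend over N(i) plus self-loop
    e = np.where(mask, e, -np.inf)               # non-edges get zero attention
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over each neighbourhood
    return np.maximum(alpha @ Z, 0.0)            # sigma(sum_j alpha_ij W h_j)
```

Unlike the fixed \( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) weighting in GCN, the rows of `alpha` are recomputed from the current features, which is what lets GAT weight neighbours unequally.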

5. Graph-Level Tasks & Readout

Node classification uses the final node embeddings directly. For graph-level tasks (e.g. molecular property prediction), we need a readout/pooling function that aggregates all node representations into a single graph vector:

\[ \mathbf{h}_{\mathcal{G}} = \texttt{READOUT}\!\left(\left\{\mathbf{h}_v^{(K)} : v \in \mathcal{V}\right\}\right) \]

Global Sum Pool

\( h_G = \sum_v h_v^{(K)} \)

Simple but scales with graph size.

Global Mean Pool

\( h_G = \frac{1}{N}\sum_v h_v^{(K)} \)

Invariant to graph size; may lose count information.

Hierarchical (DiffPool)

\( S^{(l)} = \text{softmax}(\text{GNN}(A,H)) \)

Learned soft cluster assignments; captures hierarchy.
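The sum and mean readouts are one-liners over the final embedding matrix; a sketch with deterministic toy embeddings to make the size-dependence visible:

```python
import numpy as np

H = np.arange(10.0).reshape(5, 2)   # final embeddings of a 5-node toy graph

h_sum = H.sum(axis=0)      # global sum pool:  [20., 25.] — magnitude grows with N
h_mean = H.mean(axis=0)    # global mean pool: [ 4.,  5.] — invariant to graph size
```

Doubling the graph by repeating every node would double `h_sum` but leave `h_mean` unchanged, which is exactly the trade-off noted in the table above.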

Applications: molecular property prediction (QM9, PCQM4M), drug discovery, protein structure prediction (residue-level GNNs), citation network classification, knowledge graph completion, and social network community detection.

Python Simulation: GCN on Karate Club

We implement a two-layer GCN from scratch using only NumPy, train it on the Zachary Karate Club graph with only 2 labelled nodes (node 0 and node 33), and visualise the learned 2D node embeddings. The GCN propagates label information through graph structure — this is semi-supervised learning.
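The full 202-line script is not reproduced here. The following condensed sketch shows the core training loop; it substitutes a small two-clique graph for the actual karate-club edge list (34 nodes, 78 edges) and labels one node per community, analogous to nodes 0 and 33:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in graph: two 5-cliques joined by a single bridge edge.
# (The real Zachary edge list is omitted here for brevity.)
N = 10
A = np.zeros((N, N))
for i in range(5):
    for j in range(i + 1, 5):
        A[i, j] = A[j, i] = 1.0                      # community 0: nodes 0-4
        A[i + 5, j + 5] = A[j + 5, i + 5] = 1.0      # community 1: nodes 5-9
A[4, 5] = A[5, 4] = 1.0                              # bridge edge

# Renormalisation trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}, precomputed once
A_tilde = A + np.eye(N)
d = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))

X = np.eye(N)                        # featureless graph: one-hot identity features
labelled, y = [0, 9], {0: 0, 9: 1}   # one labelled node per community

W1 = rng.normal(0, 0.1, (N, 2))      # layer 1: N -> 2 (the 2-D embeddings)
W2 = rng.normal(0, 0.1, (2, 2))      # layer 2: 2 -> 2 classes
lr = 0.2

for epoch in range(300):
    H1 = np.tanh(A_hat @ X @ W1)                 # layer 1 (tanh for simple backprop)
    logits = A_hat @ H1 @ W2                     # layer 2
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)                 # row-wise softmax
    G = np.zeros_like(p)                         # cross-entropy grad, labelled only
    for v in labelled:
        G[v] = p[v]
        G[v, y[v]] -= 1.0
    dH1 = (A_hat.T @ G @ W2.T) * (1 - H1 ** 2)   # backprop through tanh
    W2 -= lr * (A_hat @ H1).T @ G
    W1 -= lr * (A_hat @ X).T @ dH1

pred = logits.argmax(axis=1)
print(pred)          # the two communities should receive different labels
```

After training, the rows of `H1` are the 2-D node embeddings that would be plotted; label information from just two nodes reaches the rest of the graph purely through the \( \hat{A} \) propagation, which is the semi-supervised effect described above.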
