Chapter 20: Graph Neural Networks
Graph neural networks extend deep learning to irregular, relational data structures. We derive the spectral convolution framework from the graph Laplacian, trace the Chebyshev approximation to the simplified GCN propagation rule, and implement a two-layer GCN that classifies nodes using only the graph structure.
1. Graph Representation
A graph \( \mathcal{G} = (\mathcal{V}, \mathcal{E}) \) has \( N = |\mathcal{V}| \) nodes and edges \( (i,j) \in \mathcal{E} \). We represent it with three matrices:
Adjacency Matrix
\( A_{ij} = 1 \) if \( (i,j) \in \mathcal{E} \), else \( A_{ij} = 0 \). Symmetric for undirected graphs.
Degree Matrix
Diagonal matrix of node degrees. Off-diagonal entries are 0.
Feature Matrix
Row i contains the d-dimensional feature vector for node i.
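To make these three matrices concrete, here is a minimal NumPy sketch for a small undirected path graph (the graph and the one-hot placeholder features are illustrative assumptions):

```python
import numpy as np

# A 4-node undirected path graph 0 - 1 - 2 - 3 (graph and features are
# illustrative assumptions).
edges = [(0, 1), (1, 2), (2, 3)]
N = 4

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0        # symmetric: undirected edges

D = np.diag(A.sum(axis=1))         # degree matrix from the row sums of A
X = np.eye(N)                      # placeholder one-hot feature matrix (d = N)

print(np.diag(D))   # [1. 2. 2. 1.]
```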
2. Message Passing Framework
Modern GNNs share a unified message-passing abstraction. At layer \( \ell \), each node \( v \) updates its representation by:
\[ \mathbf{m}_v^{(\ell)} = \text{AGGREGATE}^{(\ell)}\!\left(\left\{ \mathbf{h}_u^{(\ell-1)} : u \in \mathcal{N}(v) \right\}\right), \qquad \mathbf{h}_v^{(\ell)} = \text{UPDATE}^{(\ell)}\!\left(\mathbf{h}_v^{(\ell-1)}, \mathbf{m}_v^{(\ell)}\right) \]
Different GNN variants differ only in how AGGREGATE and UPDATE are defined. Mean, max, and sum aggregation correspond to distinct inductive biases. The key insight is that after \( K \) layers, each node's representation captures its \( K \)-hop neighbourhood.
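One instance of this abstraction can be sketched directly. In the illustrative example below we assume mean aggregation and an UPDATE that applies a linear map to the concatenation of a node's previous state and its aggregated message, followed by ReLU; the function name `mp_layer` and all dimensions are our choices:

```python
import numpy as np

def mp_layer(A, H, W):
    # AGGREGATE: mean of neighbour representations (one illustrative choice)
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)
    M = (A @ H) / deg
    # UPDATE: linear map of [own state || aggregated message], then ReLU
    return np.maximum(np.concatenate([H, M], axis=1) @ W, 0)

# toy triangle graph with one-hot initial features
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
H0 = np.eye(3)
W = np.full((6, 2), 0.5)           # arbitrary fixed weights for the demo
H1 = mp_layer(A, H0, W)
print(H1.shape)   # (3, 2)
```

Stacking \( K \) such layers widens each node's receptive field to its \( K \)-hop neighbourhood, as noted above.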
3. Spectral Graph Theory → GCN: Full Derivation
3.1 Graph Laplacian
The graph Laplacian is defined as \( L = D - A \).
\( L \) is symmetric positive semi-definite and admits the eigendecomposition \( L = U \Lambda U^T \), where \( U \in \mathbb{R}^{N\times N} \) holds orthonormal eigenvectors and \( \Lambda = \text{diag}(\lambda_1,\ldots,\lambda_N) \) with \( 0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N \). The normalised Laplacian is \( I - D^{-1/2} A D^{-1/2} \).
3.2 Spectral Graph Convolution
The graph Fourier transform of a signal \( \mathbf{x} \in \mathbb{R}^N \) is \( \hat{\mathbf{x}} = U^T \mathbf{x} \). Spectral convolution with filter \( g_\theta \) is:
\[ g_\theta \star \mathbf{x} = U\, g_\theta(\Lambda)\, U^T \mathbf{x} \]
This requires computing the full eigendecomposition, costing \( O(N^3) \), and the resulting filter is not spatially localised. Chebyshev polynomials provide a tractable approximation.
3.3 Chebyshev Approximation
Approximate \( g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \), where \( \tilde{\Lambda} = \frac{2}{\lambda_{\max}}\Lambda - I \) and \( T_k \) are Chebyshev polynomials. This gives a \( K \)-localised filter computable in \( O(K|\mathcal{E}|) \) (Defferrard et al., 2016).
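The recurrence \( T_0(x) = 1 \), \( T_1(x) = x \), \( T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x) \) lets the filter be applied with one sparse matrix-vector product per order and no eigendecomposition. A minimal sketch (the function name `cheb_filter` is ours):

```python
import numpy as np

def cheb_filter(L, x, theta, lam_max=2.0):
    """Apply sum_k theta_k T_k(L_tilde) x via the three-term Chebyshev
    recurrence; one matrix-vector product per order, no eigendecomposition."""
    L_t = (2.0 / lam_max) * L - np.eye(L.shape[0])   # rescale spectrum into [-1, 1]
    Tkm2, Tkm1 = x, L_t @ x                          # T_0 x and T_1 x
    out = theta[0] * Tkm2
    if len(theta) > 1:
        out = out + theta[1] * Tkm1
    for k in range(2, len(theta)):
        Tk = 2.0 * (L_t @ Tkm1) - Tkm2               # T_k = 2 L_t T_{k-1} - T_{k-2}
        out = out + theta[k] * Tk
        Tkm2, Tkm1 = Tkm1, Tk
    return out

# sanity check on a 3-node path: theta = [1] keeps only the T_0 (identity) term
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
Dis = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(3) - Dis @ A @ Dis                        # normalised Laplacian
x = np.array([1.0, -2.0, 3.0])
print(np.allclose(cheb_filter(L, x, [1.0]), x))      # True
```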
3.4 Kipf & Welling Simplification → GCN
Set \( K=1 \), take \( \lambda_{\max} \approx 2 \), and use the normalised Laplacian:
\[ g_\theta \star \mathbf{x} \approx \theta_0 \mathbf{x} - \theta_1 D^{-1/2} A D^{-1/2} \mathbf{x} \]
Constrain \( \theta = \theta_0 = -\theta_1 \) to reduce parameters:
\[ g_\theta \star \mathbf{x} \approx \theta \left( I + D^{-1/2} A D^{-1/2} \right) \mathbf{x} \]
Apply the renormalisation trick (add self-loops, \( \tilde{A} = A + I \), and recompute degrees, \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \)) to avoid numerical instability:
\[ H^{(\ell+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(\ell)} W^{(\ell)} \right) \]
where \( \tilde{A} = A + I_N \), \( \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \), \( W^{(\ell)} \) is a trainable weight matrix, and \( \sigma \) is a non-linearity (e.g. ReLU). This is the GCN layer (Kipf & Welling, 2017).
Interpretation
The normalised adjacency \( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) performs a symmetric normalised mean aggregation over each node's neighbourhood (including itself). The \( \tilde{D}^{-1/2} \) factors on both sides prevent high-degree nodes from dominating the aggregation. The weight matrix \( W^{(\ell)} \) provides a learnable linear projection in feature space.
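The layer above is a few lines of NumPy. This sketch (the function names `gcn_norm` and `gcn_layer` are ours) builds the renormalised adjacency and applies one propagation step; on a regular graph such as a triangle, every entry of \( \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} \) equals \( 1/3 \):

```python
import numpy as np

def gcn_norm(A):
    """Renormalisation trick: A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_t = A + np.eye(A.shape[0])
    d = 1.0 / np.sqrt(A_t.sum(axis=1))
    return A_t * d[:, None] * d[None, :]

def gcn_layer(A_hat, H, W):
    """One GCN layer: H' = ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0)

# triangle graph: 3-regular after self-loops, so A_hat = (A + I) / 3
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
A_hat = gcn_norm(A)
H = gcn_layer(A_hat, np.eye(3), np.ones((3, 2)))
print(A_hat[0, 0])   # 1/3 for this regular graph
```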
4. Graph Attention Networks (GAT)
GCN uses a fixed symmetric normalisation. GAT (Velickovic et al., 2018) replaces it with learned attention coefficients. The unnormalised attention from node \( j \) to node \( i \) is:
\[ e_{ij} = \text{LeakyReLU}\!\left( \mathbf{a}^T \left[ W\mathbf{h}_i \,\|\, W\mathbf{h}_j \right] \right) \]
where \( \mathbf{a} \in \mathbb{R}^{2d'} \) is a learnable attention vector and \( \| \) denotes concatenation. These scores are normalised over the neighbourhood via softmax:
\[ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})} \]
Multi-head attention concatenates or averages \( K \) independent attention mechanisms, stabilising training. GAT assigns different importance to different neighbours, making it more expressive than GCN for heterogeneous graphs.
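A single attention head can be sketched as follows. The score \( \mathbf{a}^T[W\mathbf{h}_i \| W\mathbf{h}_j] \) splits into two dot products, so all pairwise scores can be computed at once and non-edges masked out before the softmax; the LeakyReLU slope of 0.2 and the inclusion of each node in its own neighbourhood follow the paper, while the function name and dimensions are our choices:

```python
import numpy as np

def gat_attention(A, H, W, a):
    """Single-head GAT attention coefficients (sketch):
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), softmax over each neighbourhood."""
    Z = H @ W                                   # projected features, N x d'
    N, dp = Z.shape
    # a^T [z_i || z_j] = a_left . z_i + a_right . z_j
    e = (Z @ a[:dp])[:, None] + (Z @ a[dp:])[None, :]
    e = np.where(e > 0, e, 0.2 * e)             # LeakyReLU, slope 0.2 as in the paper
    mask = (A + np.eye(N)) > 0                  # attend over edges plus self-loop
    e = np.where(mask, e, -np.inf)
    e = np.exp(e - e.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)   # 3-node path
alpha = gat_attention(A, rng.normal(size=(3, 4)),
                      rng.normal(size=(4, 2)), rng.normal(size=4))
print(alpha.sum(axis=1))   # each row sums to 1
```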
5. Graph-Level Tasks & Readout
Node classification uses the final node embeddings directly. For graph-level tasks (e.g. molecular property prediction), we need a readout/pooling function that aggregates all node representations into a single graph vector:
Global Sum Pool
\( \mathbf{h}_{\mathcal{G}} = \sum_{v \in \mathcal{V}} \mathbf{h}_v \). Simple, but its magnitude scales with graph size.
Global Mean Pool
\( \mathbf{h}_{\mathcal{G}} = \frac{1}{N} \sum_{v \in \mathcal{V}} \mathbf{h}_v \). Invariant to graph size; may lose count information.
Hierarchical (DiffPool)
Learned soft cluster assignments; captures hierarchy.
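Sum and mean readout are one-liners. The sketch below (the function name `readout` is ours) also demonstrates the trade-off noted above: duplicating every node leaves the mean unchanged but doubles the sum:

```python
import numpy as np

def readout(H, mode="mean"):
    """Aggregate node embeddings H (N x d) into one graph-level vector."""
    if mode == "sum":
        return H.sum(axis=0)
    if mode == "mean":
        return H.mean(axis=0)
    raise ValueError(f"unknown readout: {mode}")

# mean pooling is size-invariant; sum pooling scales with graph size
H = np.array([[1.0, 2.0], [3.0, 4.0]])
H_doubled = np.vstack([H, H])
print(readout(H, "mean"), readout(H_doubled, "mean"))   # identical
print(readout(H, "sum"), readout(H_doubled, "sum"))     # second is twice the first
```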
Applications: molecular property prediction (QM9, PCQM4M), drug discovery, protein structure prediction (residue-level GNNs), citation network classification, knowledge graph completion, and social network community detection.
Python Simulation: GCN on Karate Club
We implement a two-layer GCN from scratch using only NumPy, train it on the Zachary Karate Club graph with just two labelled nodes (node 0 and node 33), and visualise the learned 2D node embeddings. The GCN propagates label information through the graph structure: this is semi-supervised learning.