Neural Networks & Universal Approximation
From the perceptron to universal function approximators — the theoretical power of neural networks
Introduction
Neural networks are parametric function approximators inspired loosely by biological neurons. Their key property is compositionality: by stacking layers of simple nonlinear transformations, they can represent extraordinarily complex functions. The Universal Approximation Theorem tells us that even a single hidden layer suffices in principle, but in practice depth provides crucial computational advantages.
In this chapter we build up from the single perceptron to multilayer networks, examine the activation functions that provide nonlinearity, and state and interpret the Universal Approximation Theorem.
Key Topics
- 1. The Perceptron
- 2. Multilayer Feedforward Networks
- 3. Activation Functions (ReLU, Sigmoid, Tanh)
- 4. Universal Approximation Theorem (Cybenko 1989)
- 5. Depth vs. Width
- 6. Representational Power and Limitations
1. The Perceptron
The perceptron (Rosenblatt, 1958) is the simplest neural network: a single neuron that computes a weighted sum of inputs and passes it through a step function:

$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$
Perceptron Learning Rule
The perceptron learning algorithm updates weights for each misclassified example (with labels $y_i \in \{-1, +1\}$):

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i \quad \text{whenever } y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$$
Perceptron Convergence Theorem: If the data is linearly separable with margin $\gamma$ and all points satisfy $\|\mathbf{x}_i\| \leq R$, the perceptron converges in at most $(R/\gamma)^2$ updates.
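The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration (function name and toy data are our own), using the $\{-1,+1\}$ label convention from the convergence theorem:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Rosenblatt's perceptron: labels y in {-1, +1}, update on each mistake."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # update rule: w <- w + y_i x_i
                b += yi
                mistakes += 1
        if mistakes == 0:                # a full pass with no mistakes: converged
            break
    return w, b

# Linearly separable toy data: class +1 lies above the line x1 + x2 = 1
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.0], [0.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
assert all(np.sign(X @ w + b) == y)
```

Because the data is separable, the theorem guarantees the loop terminates after finitely many updates.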
The XOR Problem
Minsky and Papert (1969) showed that a single perceptron cannot compute XOR. The XOR function requires nonlinear decision boundaries:
- $\text{XOR}(0,0) = 0$, $\text{XOR}(0,1) = 1$, $\text{XOR}(1,0) = 1$, $\text{XOR}(1,1) = 0$
No single hyperplane can separate the classes. This limitation motivated the development of multilayer networks, which can compose multiple linear boundaries to represent any Boolean function.
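One hidden layer resolves the limitation. Below is a hand-weighted 2-2-1 ReLU network that computes XOR exactly; the weights are chosen by hand (not learned) purely to show that composing two linear boundaries suffices:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hand-chosen weights for a 2-2-1 network: h = ReLU(W1 x + b1), y = w2 . h
W1 = np.array([[1.0, 1.0],    # h1 = ReLU(x1 + x2)
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])    # h2 = ReLU(x1 + x2 - 1)
w2 = np.array([1.0, -2.0])    # y  = h1 - 2*h2

def xor_net(x):
    return w2 @ relu(W1 @ x + b1)

# Verify against the full XOR truth table
for x1 in (0, 1):
    for x2 in (0, 1):
        out = xor_net(np.array([x1, x2], dtype=float))
        assert out == (x1 ^ x2)
```

The second hidden unit "carves out" the (1,1) corner that a single hyperplane cannot separate.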
2. Multilayer Feedforward Networks
A feedforward neural network with $L$ layers computes a composition of affine transformations and nonlinear activations:

$$f(\mathbf{x}) = \mathbf{W}^{(L)}\phi\!\left(\mathbf{W}^{(L-1)}\phi\!\left(\cdots\phi\!\left(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right)\cdots\right) + \mathbf{b}^{(L-1)}\right) + \mathbf{b}^{(L)}$$
Layer-by-Layer Computation
Each layer $\ell$ transforms its input as follows:

$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)})$$

where $\mathbf{W}^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ is the weight matrix, $\mathbf{b}^{(\ell)} \in \mathbb{R}^{n_\ell}$ is the bias vector, and $\phi$ is the activation function applied element-wise.
The input layer has $\mathbf{h}^{(0)} = \mathbf{x}$. The total number of parameters is:

$$P = \sum_{\ell=1}^{L} (n_{\ell-1} + 1)\, n_\ell$$
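The layer-by-layer computation is a short loop in NumPy. This sketch (function names are our own) runs a forward pass and counts parameters for an assumed architecture $(3, 16, 16, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(widths):
    """widths = (n0, n1, ..., nL); returns one (W, b) pair per layer."""
    return [(rng.standard_normal((widths[i + 1], widths[i])),
             np.zeros(widths[i + 1])) for i in range(len(widths) - 1)]

def forward(x, layers, phi=np.tanh):
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b                                # affine step
        h = phi(z) if i < len(layers) - 1 else z     # no activation on the output
    return h

widths = (3, 16, 16, 1)
layers = init_layers(widths)

# Parameter count: sum over layers of (n_{l-1} + 1) * n_l
# Here: 4*16 + 17*16 + 17*1 = 64 + 272 + 17 = 353
n_params = sum(W.size + b.size for W, b in layers)
assert n_params == 353
assert forward(rng.standard_normal(3), layers).shape == (1,)
```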
Why Nonlinearity is Essential
Without activation functions, a multilayer network reduces to a single linear map:

$$\mathbf{W}^{(L)}\mathbf{W}^{(L-1)}\cdots\mathbf{W}^{(1)}\mathbf{x} = \tilde{\mathbf{W}}\mathbf{x}$$
The composition of linear maps is linear. Nonlinear activations between layers are what give neural networks their extraordinary representational power.
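The collapse is easy to verify numerically. Three activation-free "layers" produce exactly the same output as one linear map with the product weight matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((5, 4))
W3 = rng.standard_normal((2, 5))

x = rng.standard_normal(3)
# Three "layers" with no activations...
deep = W3 @ (W2 @ (W1 @ x))
# ...equal a single linear map with the collapsed matrix W3 W2 W1
collapsed = (W3 @ W2 @ W1) @ x
assert np.allclose(deep, collapsed)
```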
3. Activation Functions
Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Range: $(0, 1)$
Pros: Smooth, bounded, natural probabilistic interpretation
Cons: Saturates for large $|z|$ causing vanishing gradients; outputs not zero-centered
Max gradient: $\sigma'(0) = 0.25$, so each sigmoid layer scales backpropagated gradients by a factor of at most $0.25$ (a shrinkage of at least 4x per layer)
Hyperbolic Tangent (Tanh)

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

Range: $(-1, 1)$
Pros: Zero-centered (better gradient flow), stronger gradients than sigmoid
Cons: Still saturates, though less severely
Max gradient: $\tanh'(0) = 1$, four times sigmoid's maximum of $0.25$
Rectified Linear Unit (ReLU)

$$\text{ReLU}(z) = \max(0, z)$$

Range: $[0, \infty)$
Pros: No saturation for $z > 0$, computationally efficient, promotes sparsity
Cons: "Dead neurons" problem — if $z < 0$ for all inputs, gradient is permanently zero
Variants: Leaky ReLU ($\max(\alpha z, z)$), ELU, GELU, Swish
GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z\,\Phi(z)$$
where $\Phi$ is the CDF of the standard normal. GELU is the default activation in modern transformers (BERT, GPT). Unlike ReLU, it is smooth everywhere and has nonzero gradients for small negative inputs.
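The gradient properties discussed above can be checked directly. A short sketch (helper names are our own) confirms the maximum gradients and the saturation behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)        # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

# Maximum gradients at z = 0, as stated above
assert np.isclose(sigmoid_grad(0.0), 0.25)
assert np.isclose(tanh_grad(0.0), 1.0)

# Saturation: gradients vanish for large |z| for sigmoid/tanh, but not for ReLU
z = 10.0
assert sigmoid_grad(z) < 1e-4 and tanh_grad(z) < 1e-4
assert relu(z) == z   # identity for positive inputs, so gradient stays 1
```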
4. Universal Approximation Theorem
Theorem (Cybenko, 1989; Hornik et al., 1989)
Let $\phi: \mathbb{R} \to \mathbb{R}$ be a continuous, non-constant, bounded, monotonically increasing function (e.g., sigmoid). Then for any continuous function $f: [0,1]^d \to \mathbb{R}$ and any $\epsilon > 0$, there exist $N \in \mathbb{N}$ and weights $\{w_{ij}, v_j, b_j, c\}$ such that the function:

$$F(\mathbf{x}) = \sum_{j=1}^{N} v_j\, \phi(\mathbf{w}_j^T\mathbf{x} + b_j) + c$$
satisfies $\sup_{\mathbf{x} \in [0,1]^d}|F(\mathbf{x}) - f(\mathbf{x})| < \epsilon$.
Proof Sketch
The proof uses a density argument from functional analysis:
- Step 1: Show that the span of $\{\phi(\mathbf{w}^T\mathbf{x} + b) : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$ is dense in $C([0,1]^d)$ by contradiction.
- Step 2: Assume the span is not dense. Then by the Hahn-Banach theorem, there exists a nonzero bounded linear functional (signed measure $\mu$) that vanishes on all functions in the span.
- Step 3: Show that $\int \phi(\mathbf{w}^T\mathbf{x} + b)d\mu(\mathbf{x}) = 0$for all $\mathbf{w}, b$ implies $\mu = 0$, reaching a contradiction.
- Step 4: The key insight is that as $\|\mathbf{w}\| \to \infty$, the sigmoid function approaches a step function, and step functions can approximate indicator functions of half-spaces. The Fourier transform of $\mu$ must vanish, implying $\mu = 0$.
What the Theorem Does NOT Say
- Not constructive: It doesn't tell us how to find the weights
- Width may be exponential: The required $N$ may grow exponentially with $d$
- No learning guarantee: Gradient descent may not find the optimal weights
- Compact domain: Only guarantees approximation on a compact set
- Not unique: Many different networks can approximate the same function
Extensions
The theorem has been extended to ReLU networks (Leshno et al., 1993), showing that any non-polynomial activation function yields universal approximation. For ReLU specifically:
- A one-hidden-layer ReLU network of width $N$ computes a piecewise linear function with at most $N + 1$ linear pieces in one dimension
- The number of linear regions grows exponentially with depth but only polynomially with width
- This gives deep networks an exponential advantage over shallow ones for representing certain functions
5. Depth vs. Width
Linear Regions of ReLU Networks
A ReLU network with $L$ layers of width $n$ and input dimension $d$ can partition $\mathbb{R}^d$ into a number of linear regions as large as:

$$\left\lfloor \frac{n}{d} \right\rfloor^{d(L-1)} \sum_{j=0}^{d} \binom{n}{j}$$

(Montúfar et al., 2014).
For a deep network this grows as $O((n/d)^{d(L-1)})$ — exponential in depth. A shallow (1-layer) network achieves only $O(n^d)$ regions — polynomial in width.
Depth Separation Results
There exist functions that can be represented by a depth-$L$ network of polynomial size but require exponential width for a depth-$(L-1)$ network (Telgarsky, 2016):

$$f_L(x) = \underbrace{\varphi \circ \varphi \circ \cdots \circ \varphi}_{L \text{ times}}(x), \qquad \varphi(x) = 2\min(x, 1-x)$$

This "sawtooth" function oscillates $2^L$ times on $[0,1]$: it consists of $2^L$ monotone linear pieces. A network with $2L$ layers and 2 neurons per layer represents it exactly, but any network with $O(L)$ layers needs $\Omega(2^{L/2})$ neurons.
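The exponential oscillation count is easy to observe numerically. The sketch below (our own construction, following the standard tent-map argument) builds one "tent" from two ReLUs and composes it; four compositions already yield $2^3 = 8$ peaks:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def triangle(x):
    """One tent map from two ReLUs: equals 2*min(x, 1-x) on [0, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, L):
    """Compose the tent map L times: 2^(L-1) full teeth on [0, 1]."""
    for _ in range(L):
        x = triangle(x)
    return x

xs = np.linspace(0.0, 1.0, 10001)
ys = sawtooth(xs, 4)
# Count sampled local maxima: the 4-fold composition has 2^3 = 8 teeth
peaks = np.sum((ys[1:-1] > ys[:-2]) & (ys[1:-1] >= ys[2:]))
assert peaks == 8
```

Each additional composition doubles the number of teeth while adding only a constant number of neurons, which is the heart of the depth separation.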
Practical Implications
- Hierarchical features: Deep networks learn features at multiple scales of abstraction
- Parameter efficiency: Deep networks can represent complex functions with fewer total parameters
- Training difficulty: Deeper networks are harder to train (vanishing/exploding gradients)
- Over-parameterization: Modern networks have more parameters than data points, yet generalize well
6. Weight Initialization
Proper initialization is critical for training deep networks. The goal is to maintain signal variance as it propagates through layers.
Xavier/Glorot Initialization
For sigmoid/tanh activations, initialize weights as:

$$W_{ij}^{(\ell)} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\ell-1} + n_\ell}\right)$$

Derivation: If $\text{Var}(h_j^{(\ell-1)}) = v$ and weights have variance $\sigma_w^2$, then $\text{Var}(z_j^{(\ell)}) = n_{\ell-1}\sigma_w^2 v$. Setting $n_{\ell-1}\sigma_w^2 = 1$ preserves variance forward. Similarly, preserving variance backward requires $n_\ell\sigma_w^2 = 1$. Xavier initialization compromises with $\sigma_w^2 = 2/(n_{\ell-1} + n_\ell)$.
He Initialization
For ReLU activations, since ReLU zeros out half the outputs, we need:

$$W_{ij}^{(\ell)} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\ell-1}}\right)$$

The factor of 2 compensates for the fact that $\text{Var}[\text{ReLU}(z)] = \frac{1}{2}\text{Var}[z]$ when $z$ is symmetric around zero.
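The variance argument can be tested empirically. In this sketch (function name and widths are our own choices; with equal layer widths, Xavier's $2/(n_{\ell-1}+n_\ell)$ reduces to $1/n$), He scaling keeps the signal magnitude stable through 20 ReLU layers, while $1/n$ scaling lets it decay roughly as $2^{-\text{depth}/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def signal_std(depth=20, width=256, var_scale=2.0, n_samples=512):
    """Push random inputs through `depth` ReLU layers with Var(W_ij) = var_scale / fan_in."""
    h = rng.standard_normal((n_samples, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(var_scale / width)
        h = relu(h @ W.T)
    return h.std()

he = signal_std(var_scale=2.0)     # He: factor 2 compensates ReLU's variance halving
naive = signal_std(var_scale=1.0)  # 1/fan_in: signal shrinks by ~sqrt(2) per layer
assert 0.1 < he < 10.0             # magnitude preserved across 20 layers
assert naive < he / 10             # severe decay without the factor of 2
```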
7. Python Simulation: Neural Network Approximation
This simulation demonstrates universal approximation by training a neural network to learn a complex target function, comparing different widths and depths.
Neural Network Universal Approximation Demo
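The interactive demo cannot run in static text, so here is a self-contained stand-in in the spirit of Cybenko's proof: rather than training, it explicitly constructs the shallow sum $F(\mathbf{x})$ from steep sigmoids (the target $f$, the knot placement, and the `sharpness` parameter are our own choices) and measures how the sup-norm error shrinks as the width $N$ grows:

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for very steep arguments
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def shallow_approximator(f, N, sharpness=100.0):
    """Build F(x) = c + sum_j v_j * sigmoid(k (x - b_j)), a width-N approximant on [0, 1].

    Steep sigmoids act as soft step functions: term j contributes the jump
    f(t_{j+1}) - f(t_j) at knot b_j, producing a staircase that tracks f.
    """
    k = sharpness * N
    t = np.linspace(0.0, 1.0, N + 1)
    jumps = np.diff(f(t))                  # v_j
    knots = t[:-1] + 0.5 / N               # b_j: step at mid-interval

    def F(x):
        x = np.asarray(x, dtype=float)
        return f(t[0]) + sigmoid(k * (x[..., None] - knots)) @ jumps

    return F

f = lambda x: np.sin(4 * np.pi * x) * np.exp(-x)   # an arbitrary continuous target
xs = np.linspace(0.0, 1.0, 2000)
errors = {N: np.max(np.abs(shallow_approximator(f, N)(xs) - f(xs)))
          for N in (10, 50, 250)}
assert errors[250] < errors[50] < errors[10]   # sup error shrinks as width grows
assert errors[250] < 0.1
```

This mirrors the theorem exactly: one hidden layer suffices, but driving the error down requires growing $N$.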
8. Expressiveness: Counting Parameters and Functions
Parameter Counting
For a fully connected network with layers of widths $(n_0, n_1, \ldots, n_L)$:

$$P = \sum_{\ell=1}^{L} (n_{\ell-1} + 1)\, n_\ell$$

For example, a network with architecture $(784, 256, 128, 10)$ has:

$$P = 785 \cdot 256 + 257 \cdot 128 + 129 \cdot 10 = 200{,}960 + 32{,}896 + 1{,}290 = 235{,}146$$
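The count for any architecture can be computed in one line (the helper name is our own):

```python
def count_params(widths):
    """Total parameters: sum over layers of (n_in + 1) * n_out (weights + biases)."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(widths, widths[1:]))

# The (784, 256, 128, 10) architecture from the example above
assert count_params((784, 256, 128, 10)) == 235_146
```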
VC Dimension
The Vapnik-Chervonenkis dimension of a neural network measures its capacity to shatter (perfectly classify all possible labelings of) a set of points. For a network with $P$ parameters, depth $L$, and piecewise linear activations:

$$\text{VCdim} = O(P L \log P)$$

(Bartlett et al., 2019).
This means the network can memorize datasets of size up to its VC dimension. In practice, modern networks have VC dimension far exceeding the training set size, yet still generalize well — a phenomenon not fully explained by classical learning theory.
The Double Descent Phenomenon
Classical bias-variance theory predicts that test error increases when models become too complex (overfitting). However, modern deep learning exhibits double descent:
- Under-parameterized regime ($P < n$): Classical U-shaped curve
- Interpolation threshold ($P \approx n$): Peak in test error
- Over-parameterized regime ($P \gg n$): Test error decreases again
This suggests that over-parameterized networks find solutions with special properties (e.g., minimum-norm solutions) that generalize well despite perfectly fitting training data.
9. Regularization for Neural Networks
Dropout (Srivastava et al., 2014)
During training, randomly set each neuron's output to zero with probability $p$:

$$\tilde{h}_j = m_j\, h_j, \qquad m_j \sim \text{Bernoulli}(1 - p)$$

At test time, multiply by $(1-p)$ to compensate (or equivalently, divide by $(1-p)$ during training — "inverted dropout").
Dropout can be interpreted as training an ensemble of $2^N$ subnetworks simultaneously (where $N$ is the number of neurons), with shared weights. It also corresponds approximately to Bayesian inference with a specific prior on weights.
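Inverted dropout fits in a few lines; this sketch (function name is our own) checks that the train-time rescaling by $1/(1-p)$ preserves the expected activation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Inverted dropout: zero each unit w.p. p and rescale by 1/(1-p) at train time."""
    if not training or p == 0.0:
        return h                       # test time: identity, no rescaling needed
    mask = rng.random(h.shape) >= p    # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones(100_000)
out = dropout(h, p=0.5)
# Expected value is preserved: E[out] = E[h] = 1
assert abs(out.mean() - 1.0) < 0.02
# Test-time behavior is the identity
assert np.isclose(dropout(h, p=0.5, training=False).mean(), 1.0)
```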
Early Stopping
Monitor validation loss during training and stop when it starts increasing:
- Track validation loss after each epoch
- Save the model with the lowest validation loss
- Stop if validation loss hasn't improved for $k$ epochs (patience)
Early stopping is equivalent to an implicit L2 regularization: for quadratic losses, stopping at step $T$ of gradient descent gives approximately the same solution as Ridge regression with $\lambda \propto 1/(\eta T)$.
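The patience logic above can be sketched as a generic loop; the `step` and `val_loss` callbacks are hypothetical stand-ins for a real training epoch and validation pass:

```python
def train_with_early_stopping(step, val_loss, max_epochs=1000, patience=10):
    """Run `step()` each epoch; stop after `patience` epochs without improvement."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch in range(max_epochs):
        step()
        loss = val_loss()
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
            # in practice: save a model checkpoint here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                  # no improvement for `patience` epochs
    return best_epoch, best_loss

# Toy validation curve: improves until epoch 20, then overfits
losses = [1.0 / (e + 1) if e <= 20 else 1.0 / 21 + 0.01 * (e - 20) for e in range(100)]
it = iter(losses)
epoch, loss = train_with_early_stopping(lambda: None, lambda: next(it))
assert epoch == 20   # the checkpoint with the lowest validation loss
```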
10. Neural Networks in Scientific Computing
Function Approximation in Physics
Neural networks approximate potential energy surfaces for molecular dynamics, replacing expensive quantum chemistry calculations. The universal approximation theorem guarantees they can represent any smooth energy landscape.
Surrogate Models
In climate science and engineering, neural networks serve as fast surrogates for expensive simulations. A network trained on simulation outputs can predict results 1000x faster than running the full simulation.
Emulating Physical Systems
Neural networks learn mappings from initial conditions to evolved states, effectively learning the time-evolution operator of dynamical systems. Depth corresponds to compositional structure in the dynamics.
Symmetry-Preserving Networks
Equivariant neural networks (e.g., E(3)-equivariant) build physical symmetries (rotation, translation) directly into the architecture, dramatically improving data efficiency for molecular and materials science.
Approximation Rates
For smooth functions (having $s$ bounded derivatives), the approximation error of a ReLU network with $N$ parameters scales as:

$$\inf_{F} \sup_{\mathbf{x}} |F(\mathbf{x}) - f(\mathbf{x})| = O\!\left(N^{-s/d}\right)$$
where $d$ is the input dimension. This rate suffers from the curse of dimensionality — the error decreases slowly with $N$ when $d$ is large. However, for functions with special structure (e.g., compositional functions), deep networks can break this curse.
Summary
- Perceptron: A single neuron that computes a linear decision boundary; cannot solve XOR
- Multilayer networks: Compositions of affine maps and nonlinearities; the building block of deep learning
- Activation functions: ReLU dominates practice; sigmoid/tanh still used in special cases (e.g., gates in LSTMs)
- Universal approximation: One hidden layer suffices to approximate any continuous function, but may need exponential width
- Depth advantage: Deep networks have exponentially more linear regions than shallow ones with the same parameter count
- Initialization: He init for ReLU, Xavier for sigmoid/tanh — maintaining variance across layers is critical