ML for Science/Part I: Foundations

Neural Networks & Universal Approximation

From the perceptron to universal function approximators — the theoretical power of neural networks

Introduction

Neural networks are parametric function approximators inspired loosely by biological neurons. Their key property is compositionality: by stacking layers of simple nonlinear transformations, they can represent extraordinarily complex functions. The Universal Approximation Theorem tells us that even a single hidden layer suffices in principle, but in practice depth provides crucial computational advantages.

In this chapter we build up from the single perceptron to multilayer networks, examine the activation functions that provide nonlinearity, and state and interpret the Universal Approximation Theorem.

Key Topics

  • 1. The Perceptron
  • 2. Multilayer Feedforward Networks
  • 3. Activation Functions (ReLU, Sigmoid, Tanh)
  • 4. Universal Approximation Theorem (Cybenko 1989)
  • 5. Depth vs. Width
  • 6. Representational Power and Limitations

1. The Perceptron

The perceptron (Rosenblatt, 1958) is the simplest neural network: a single neuron that computes a weighted sum of inputs and passes it through a step function:

$$\boxed{y = \text{step}\left(\sum_{j=1}^{d} w_j x_j + b\right) = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} + b \geq 0 \\ 0 & \text{otherwise} \end{cases}}$$

Perceptron Learning Rule

The perceptron learning algorithm updates weights for each misclassified example:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta(y_i - \hat{y}_i)\mathbf{x}_i$$

Perceptron Convergence Theorem: If the data is linearly separable with margin $\gamma$ and all points satisfy $\|\mathbf{x}_i\| \leq R$, the perceptron converges in at most $(R/\gamma)^2$ updates.
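The update rule and convergence behavior can be checked with a minimal NumPy sketch; the toy dataset and learning rate below are illustrative choices, not part of the theorem.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Perceptron learning rule: w += eta * (y - y_hat) * x on each mistake."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0   # step activation
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi       # update only on errors
                b += eta * (yi - y_hat)
                mistakes += 1
        if mistakes == 0:                          # converged: zero mistakes
            break
    return w, b

# Toy linearly separable data: label 1 iff x1 + x2 > 1.5 (roughly)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0, 0, 0, 1, 1])

w, b = perceptron_train(X, y)
preds = (X @ w + b >= 0).astype(int)
```

Because the data is separable, the convergence theorem guarantees the loop terminates after finitely many updates regardless of the visiting order.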

The XOR Problem

Minsky and Papert (1969) showed that a single perceptron cannot compute XOR. The XOR function requires nonlinear decision boundaries:

  • $\text{XOR}(0,0) = 0$, $\text{XOR}(0,1) = 1$, $\text{XOR}(1,0) = 1$, $\text{XOR}(1,1) = 0$

No single hyperplane can separate the classes. This limitation motivated the development of multilayer networks, which can compose multiple linear boundaries to represent any Boolean function.
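The standard two-layer construction composes OR, NAND, and AND units; the weights below are one conventional hand-picked choice, not unique.

```python
import numpy as np

def step(z):
    return (np.asarray(z) >= 0).astype(int)

def xor_net(x):
    """Two-layer threshold network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))."""
    W1 = np.array([[1.0, 1.0],      # hidden unit 1: OR
                   [-1.0, -1.0]])   # hidden unit 2: NAND
    b1 = np.array([-0.5, 1.5])
    h = step(W1 @ x + b1)
    w2 = np.array([1.0, 1.0])       # output unit: AND of the hidden units
    b2 = -1.5
    return int((w2 @ h + b2) >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_net(np.array(p, dtype=float)) for p in inputs]
```

Each hidden unit is itself a perceptron; only their composition produces the nonlinear decision boundary XOR requires.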

2. Multilayer Feedforward Networks

A feedforward neural network with $L$ layers computes a composition of affine transformations and nonlinear activations:

$$\boxed{f(\mathbf{x}) = \mathbf{W}^{(L)}\phi(\mathbf{W}^{(L-1)}\phi(\cdots\phi(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)})\cdots) + \mathbf{b}^{(L-1)}) + \mathbf{b}^{(L)}}$$

Layer-by-Layer Computation

Each layer $\ell$ transforms its input as follows:

$$\mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)} \quad \text{(pre-activation)}$$
$$\mathbf{h}^{(\ell)} = \phi(\mathbf{z}^{(\ell)}) \quad \text{(post-activation)}$$

where $\mathbf{W}^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ is the weight matrix, $\mathbf{b}^{(\ell)} \in \mathbb{R}^{n_\ell}$ is the bias vector, and $\phi$ is the activation function applied element-wise.

The input layer has $\mathbf{h}^{(0)} = \mathbf{x}$. The total number of parameters is:

$$\text{params} = \sum_{\ell=1}^{L}(n_\ell \cdot n_{\ell-1} + n_\ell)$$
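The layer recursion and the parameter count can both be checked with a small NumPy sketch; the architecture and random weights below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases, phi=np.tanh):
    """h^(0) = x; z^(l) = W^(l) h^(l-1) + b^(l); h^(l) = phi(z^(l)).
    The output layer is left linear, matching the boxed formula."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b                              # pre-activation
        h = phi(z) if l < len(weights) - 1 else z  # post-activation
    return h

sizes = [3, 5, 2]   # (n_0, n_1, n_2)
weights = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
biases = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]

y = forward(rng.normal(size=sizes[0]), weights, biases)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))  # sum n_l (n_{l-1} + 1)
```

For this architecture the formula gives $5(3+1) + 2(5+1) = 32$ parameters.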

Why Nonlinearity is Essential

Without activation functions, a multilayer network reduces to a single linear map:

$$\mathbf{W}^{(L)}\mathbf{W}^{(L-1)}\cdots\mathbf{W}^{(1)} = \tilde{\mathbf{W}} \quad \text{(a single matrix)}$$

The composition of linear maps is linear. Nonlinear activations between layers are what give neural networks their extraordinary representational power.
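A two-line numerical check of this collapse (random square matrices, biases omitted for clarity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "layers" with the activation removed: the network is W3 W2 W1, one matrix
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
x = rng.normal(size=4)

deep_linear = W3 @ (W2 @ (W1 @ x))   # computed layer by layer
collapsed = (W3 @ W2 @ W1) @ x       # the single matrix W-tilde
same = np.allclose(deep_linear, collapsed)
```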

3. Activation Functions

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))$$

Range: $(0, 1)$

Pros: Smooth, bounded, natural probabilistic interpretation

Cons: Saturates for large $|z|$ causing vanishing gradients; outputs not zero-centered

Max gradient: $\sigma'(0) = 0.25$, so backpropagating through each sigmoid scales gradients by at most $0.25$, a shrinkage of at least 4x per layer

Hyperbolic Tangent (Tanh)

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1, \quad \tanh'(z) = 1 - \tanh^2(z)$$

Range: $(-1, 1)$

Pros: Zero-centered (better gradient flow), stronger gradients than sigmoid

Cons: Still saturates, though less severely

Max gradient: $\tanh'(0) = 1$, four times as large as sigmoid's

Rectified Linear Unit (ReLU)

$$\text{ReLU}(z) = \max(0, z), \quad \text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$

Range: $[0, \infty)$

Pros: No saturation for $z > 0$, computationally efficient, promotes sparsity

Cons: "Dead neurons" problem — if $z < 0$ for all inputs, gradient is permanently zero

Variants: Leaky ReLU ($\max(\alpha z, z)$), ELU, GELU, Swish

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z \cdot \Phi(z) \approx z \cdot \sigma(1.702z)$$

where $\Phi$ is the CDF of the standard normal. GELU is the default activation in modern transformers (BERT, GPT). Unlike ReLU, it is smooth everywhere and has nonzero gradients for small negative inputs.
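The quoted properties are easy to verify numerically: finite differences recover the max gradients, and a grid scan bounds the error of the $1.702$ sigmoid approximation to GELU (the tolerance values below are illustrative).

```python
import numpy as np
from math import erf, sqrt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu_exact(z):
    # z * Phi(z) with the standard normal CDF, written via the error function
    return z * 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Max gradients quoted above, checked by central finite differences at z = 0
eps = 1e-6
sig_grad0 = (sigmoid(eps) - sigmoid(-eps)) / (2 * eps)    # ~ 0.25
tanh_grad0 = (np.tanh(eps) - np.tanh(-eps)) / (2 * eps)   # ~ 1.0

# How tight is the z * sigmoid(1.702 z) approximation to exact GELU?
z = np.linspace(-3.0, 3.0, 61)
gap = np.abs(z * sigmoid(1.702 * z) - np.array([gelu_exact(t) for t in z]))
max_gap = float(gap.max())
```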

4. Universal Approximation Theorem

Theorem (Cybenko, 1989; Hornik et al., 1989)

Let $\phi: \mathbb{R} \to \mathbb{R}$ be a continuous, non-constant, bounded, monotonically increasing function (e.g., sigmoid). Then for any continuous function $f: [0,1]^d \to \mathbb{R}$ and any $\epsilon > 0$, there exist $N \in \mathbb{N}$ and weights $\{w_{ij}, v_j, b_j, c\}$ such that the function:

$$\boxed{F(\mathbf{x}) = \sum_{j=1}^{N} v_j \phi\left(\sum_{i=1}^{d} w_{ij}x_i + b_j\right) + c}$$

satisfies $\sup_{\mathbf{x} \in [0,1]^d}|F(\mathbf{x}) - f(\mathbf{x})| < \epsilon$.

Proof Sketch

The proof uses a density argument from functional analysis:

  1. Step 1: Show that the span of $\{\phi(\mathbf{w}^T\mathbf{x} + b) : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$ is dense in $C([0,1]^d)$ by contradiction.
  2. Step 2: Assume the span is not dense. Then by the Hahn-Banach theorem, there exists a nonzero bounded linear functional (signed measure $\mu$) that vanishes on all functions in the span.
  3. Step 3: Show that $\int \phi(\mathbf{w}^T\mathbf{x} + b)\,d\mu(\mathbf{x}) = 0$ for all $\mathbf{w}, b$ implies $\mu = 0$, reaching a contradiction.
  4. Step 4: The key insight is that as $\|\mathbf{w}\| \to \infty$, the sigmoid function approaches a step function, and step functions can approximate indicator functions of half-spaces. The Fourier transform of $\mu$ must vanish, implying $\mu = 0$.
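The theorem is purely an existence statement, but a crude numerical illustration is easy: fix random hidden weights $(w_j, b_j)$ and solve for the outer coefficients $v_j$ and $c$ by least squares. As $N$ grows, the error on a grid shrinks. The target function, weight scales, and grid below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_sigmoid_features(x, f, N):
    """Least-squares fit of F(x) = sum_j v_j sigmoid(w_j x + b_j) + c
    with random hidden parameters (w_j, b_j) held fixed."""
    w = rng.normal(scale=10.0, size=N)        # steep sigmoids resemble steps
    b = rng.uniform(-10.0, 10.0, size=N)
    Phi = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))   # (len(x), N) features
    A = np.hstack([Phi, np.ones((len(x), 1))])          # append the constant c
    coef, *_ = np.linalg.lstsq(A, f, rcond=None)
    return A @ coef

x = np.linspace(0.0, 1.0, 400)
target = np.sin(6 * np.pi * x) * np.exp(-x)   # a continuous target on [0, 1]

errs = {N: float(np.abs(fit_sigmoid_features(x, target, N) - target).max())
        for N in (5, 50, 500)}
```

This random-feature shortcut sidesteps the (harder) problem of training the hidden weights, which section 7 addresses with gradient descent.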

What the Theorem Does NOT Say

  • Not constructive: It doesn't tell us how to find the weights
  • Width may be exponential: The required $N$ may grow exponentially with $d$
  • No learning guarantee: Gradient descent may not find the optimal weights
  • Compact domain: Only guarantees approximation on a compact set
  • Not unique: Many different networks can approximate the same function

Extensions

The theorem has been extended to ReLU networks (Leshno et al., 1993), showing that any non-polynomial activation function yields universal approximation. For ReLU specifically:

  • A one-hidden-layer ReLU network of width $N$ computes a piecewise linear function with at most $N + 1$ pieces on a one-dimensional input
  • The number of linear regions grows exponentially with depth but only polynomially with width
  • This gives deep networks an exponential advantage over shallow ones for representing certain functions

5. Depth vs. Width

Linear Regions of ReLU Networks

A ReLU network with $L$ layers of width $n$ and input dimension $d$ can partition $\mathbb{R}^d$ into at most:

$$\text{regions} \leq \left(\prod_{\ell=1}^{L-1} \left\lfloor\frac{n}{d}\right\rfloor^d\right) \sum_{j=0}^{d}\binom{n}{j}$$

For a deep network this grows as $O((n/d)^{d(L-1)})$, exponential in depth. A shallow (1-layer) network achieves only $O(n^d)$ regions — polynomial in width.

Depth Separation Results

There exist functions that can be represented by a depth-$L$ network of polynomial size but require exponential width for a depth-$(L-1)$ network (Telgarsky, 2016):

$$f_L(x) = \underbrace{g \circ g \circ \cdots \circ g}_{L \text{ times}}(x), \quad g(x) = \begin{cases} 2x & x \leq 1/2 \\ 2(1-x) & x > 1/2 \end{cases}$$

This "sawtooth" function oscillates $2^L$ times on $[0,1]$. A network with $2L$ layers and 2 neurons per layer represents it exactly, but shallow networks cannot match it cheaply: a depth-$k$ ReLU network of width $n$ produces at most roughly $n^k$ linear pieces in one dimension, so reproducing $2^L$ oscillations at depth $k$ requires width $2^{\Omega(L/k)}$.
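On $[0,1]$ the tent map is exactly two ReLU units, since $g(x) = 2\,\text{ReLU}(x) - 4\,\text{ReLU}(x - 1/2)$ there; composing it lets us verify the $2^L$ oscillation count numerically (the grid resolution is an arbitrary choice):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # g(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1], via two ReLU units
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def oscillations(L, n_grid=100001):
    """Count how many times the L-fold composition crosses the level 1/2."""
    y = np.linspace(0.0, 1.0, n_grid)
    for _ in range(L):
        y = tent(y)
    above = y > 0.5
    return int(np.sum(above[1:] != above[:-1]))

counts = {L: oscillations(L) for L in (1, 2, 3, 4)}
```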

Practical Implications

  • Hierarchical features: Deep networks learn features at multiple scales of abstraction
  • Parameter efficiency: Deep networks can represent complex functions with fewer total parameters
  • Training difficulty: Deeper networks are harder to train (vanishing/exploding gradients)
  • Over-parameterization: Modern networks have more parameters than data points, yet generalize well

6. Weight Initialization

Proper initialization is critical for training deep networks. The goal is to maintain signal variance as it propagates through layers.

Xavier/Glorot Initialization

For sigmoid/tanh activations, initialize weights as:

$$W_{ij}^{(\ell)} \sim \mathcal{N}\left(0, \frac{2}{n_{\ell-1} + n_\ell}\right) \quad \text{or} \quad \text{Uniform}\left(-\sqrt{\frac{6}{n_{\ell-1}+n_\ell}}, \sqrt{\frac{6}{n_{\ell-1}+n_\ell}}\right)$$

Derivation: If $\text{Var}(h_j^{(\ell-1)}) = v$ and weights have variance $\sigma_w^2$, then $\text{Var}(z_j^{(\ell)}) = n_{\ell-1}\sigma_w^2 v$. Setting $n_{\ell-1}\sigma_w^2 = 1$ preserves variance forward. Similarly, preserving variance backward requires $n_\ell\sigma_w^2 = 1$. Xavier initialization compromises with $\sigma_w^2 = 2/(n_{\ell-1} + n_\ell)$.

He Initialization

For ReLU activations, since ReLU zeros out half the outputs, we need:

$$W_{ij}^{(\ell)} \sim \mathcal{N}\left(0, \frac{2}{n_{\ell-1}}\right)$$

The factor of 2 compensates for the fact that $\mathbb{E}[\text{ReLU}(z)^2] = \frac{1}{2}\text{Var}[z]$ when $z$ is symmetric around zero.
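A simulation makes the variance argument concrete (width, depth, and batch size below are arbitrary): with He scaling the mean squared activation stays near its input scale through many ReLU layers, while the smaller forward Xavier scaling $\sigma_w^2 = 1/n_{\ell-1}$ decays by roughly half per layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_square_after(depth, width, sigma2, batch=2000):
    """Mean squared activation after `depth` ReLU layers with weight variance sigma2."""
    h = rng.normal(size=(width, batch))                   # unit-scale input signals
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(sigma2), size=(width, width))
        h = np.maximum(0.0, W @ h)                        # ReLU layer, biases omitted
    return float((h ** 2).mean())

width, depth = 256, 20
ms_he = mean_square_after(depth, width, 2.0 / width)        # He: scale preserved
ms_small = mean_square_after(depth, width, 1.0 / width)     # halves at every layer
```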

7. Python Simulation: Neural Network Approximation

This simulation demonstrates universal approximation by training a neural network to learn a complex target function, comparing different widths and depths.

Neural Network Universal Approximation Demo

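The original script is not reproduced here; the sketch below captures the idea with a one-hidden-layer tanh network trained by full-batch gradient descent on a toy target. All hyperparameters (target, widths, learning rate, step count) are illustrative: the wider network should reach a much lower training error than the narrow one.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_mlp(width, x, y, lr=0.1, steps=5000):
    """One-hidden-layer tanh network trained by full-batch gradient descent (MSE)."""
    W1 = rng.normal(scale=3.0, size=(width, 1))            # spread the tanh kinks
    b1 = np.zeros((width, 1))
    w2 = rng.normal(scale=1.0 / np.sqrt(width), size=(1, width))
    b2 = np.zeros((1, 1))
    n = x.shape[1]
    mse = np.inf
    for _ in range(steps):
        h = np.tanh(W1 @ x + b1)                           # hidden activations
        err = (w2 @ h + b2) - y
        mse = float((err ** 2).mean())
        # Backpropagation of the mean squared error
        g_pred = 2.0 * err / n
        g_w2 = g_pred @ h.T
        g_b2 = g_pred.sum(axis=1, keepdims=True)
        g_z = (w2.T @ g_pred) * (1.0 - h ** 2)             # tanh'(z) = 1 - tanh^2(z)
        g_W1 = g_z @ x.T
        g_b1 = g_z.sum(axis=1, keepdims=True)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return mse

x = np.linspace(-1.0, 1.0, 200).reshape(1, -1)
y = np.sin(2 * np.pi * x)          # two full oscillations on [-1, 1]

mse_narrow = train_mlp(2, x, y)    # too few units to capture the oscillations
mse_wide = train_mlp(32, x, y)     # enough capacity to fit the target well
```

Increasing the width (or depth) and re-running shows the approximation improving, which is the empirical face of the universal approximation theorem.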

8. Expressiveness: Counting Parameters and Functions

Parameter Counting

For a fully connected network with layers of widths $(n_0, n_1, \ldots, n_L)$:

$$P = \sum_{\ell=1}^{L}(n_\ell \cdot n_{\ell-1} + n_\ell) = \sum_{\ell=1}^{L} n_\ell(n_{\ell-1} + 1)$$

For example, a network with architecture $(784, 256, 128, 10)$ has:

$$P = 256 \times 785 + 128 \times 257 + 10 \times 129 = 200{,}960 + 32{,}896 + 1{,}290 = 235{,}146$$
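The formula is a one-liner, worth having as a sanity check when sizing a model:

```python
def count_params(sizes):
    """P = sum over layers of n_l * (n_{l-1} + 1), including biases."""
    return sum(n * (m + 1) for m, n in zip(sizes[:-1], sizes[1:]))

P = count_params([784, 256, 128, 10])   # -> 235146, matching the worked example
```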

VC Dimension

The Vapnik-Chervonenkis dimension of a neural network measures its capacity to shatter (perfectly classify all possible labelings of) a set of points. For a network with $P$ parameters and piecewise linear activations:

$$\text{VC-dim} = O(P \cdot L \cdot \log P)$$

This means the network can memorize datasets of size up to its VC dimension. In practice, modern networks have VC dimension far exceeding the training set size, yet still generalize well — a phenomenon not fully explained by classical learning theory.

The Double Descent Phenomenon

Classical bias-variance theory predicts that test error increases when models become too complex (overfitting). However, modern deep learning exhibits double descent:

  • Under-parameterized regime ($P < n$): Classical U-shaped curve
  • Interpolation threshold ($P \approx n$): Peak in test error
  • Over-parameterized regime ($P \gg n$): Test error decreases again

This suggests that over-parameterized networks find solutions with special properties (e.g., minimum-norm solutions) that generalize well despite perfectly fitting training data.

9. Regularization for Neural Networks

Dropout (Srivastava et al., 2014)

During training, randomly set each neuron's output to zero with probability $p$:

$$\tilde{h}_j^{(\ell)} = m_j^{(\ell)} \cdot h_j^{(\ell)}, \quad m_j^{(\ell)} \sim \text{Bernoulli}(1-p)$$

At test time, multiply by $(1-p)$ to compensate (or equivalently, divide by $(1-p)$ during training — "inverted dropout").

Dropout can be interpreted as training an ensemble of $2^N$ subnetworks simultaneously (where $N$ is the number of neurons), with shared weights. It also corresponds approximately to Bayesian inference with a specific prior on weights.
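A minimal inverted-dropout sketch (the array shape and rate below are arbitrary): scaling by $1/(1-p)$ during training keeps the expected activation unchanged, so inference needs no correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Inverted dropout: zero each unit with probability p and rescale survivors
    by 1/(1-p), so E[dropout(h)] = h and test time needs no adjustment."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p          # keep with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((1000, 100))
out = dropout(h, p=0.5)

keep_frac = float((out != 0).mean())         # about 1 - p of units survive
mean_activation = float(out.mean())          # about 1.0 thanks to the rescaling
```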

Early Stopping

Monitor validation loss during training and stop when it starts increasing:

  • Track validation loss after each epoch
  • Save the model with the lowest validation loss
  • Stop if validation loss hasn't improved for $k$ epochs (patience)

Early stopping is equivalent to an implicit L2 regularization: for quadratic losses, stopping at step $T$ of gradient descent gives approximately the same solution as Ridge regression with $\lambda \propto 1/(\eta T)$.
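The patience loop above can be sketched generically; `train_step` and `val_loss` are placeholder callbacks standing in for a real training step and validation pass, and the synthetic validation curve mimics the classic decrease-then-increase shape.

```python
import numpy as np

def train_with_early_stopping(train_step, val_loss, patience=5, max_epochs=200):
    """Keep the best model seen on validation; stop after `patience`
    consecutive epochs without improvement."""
    best_loss, best_epoch, wait = np.inf, -1, 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = val_loss(epoch)
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # checkpoint would go here
        else:
            wait += 1
            if wait >= patience:                            # patience exhausted
                break
    return best_epoch, best_loss

# Synthetic validation curve: improves until epoch 30, then "overfits" upward
curve = lambda e: (e - 30) ** 2 / 900 + 0.1
best_epoch, best_loss = train_with_early_stopping(lambda e: None, curve, patience=5)
```

With this curve the loop stops five epochs past the minimum and reports epoch 30 as the best checkpoint.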

10. Neural Networks in Scientific Computing

Function Approximation in Physics

Neural networks approximate potential energy surfaces for molecular dynamics, replacing expensive quantum chemistry calculations. The universal approximation theorem guarantees they can represent any smooth energy landscape.

Surrogate Models

In climate science and engineering, neural networks serve as fast surrogates for expensive simulations. A network trained on simulation outputs can predict results 1000x faster than running the full simulation.

Emulating Physical Systems

Neural networks learn mappings from initial conditions to evolved states, effectively learning the time-evolution operator of dynamical systems. Depth corresponds to compositional structure in the dynamics.

Symmetry-Preserving Networks

Equivariant neural networks (e.g., E(3)-equivariant) build physical symmetries (rotation, translation) directly into the architecture, dramatically improving data efficiency for molecular and materials science.

Approximation Rates

For smooth functions (having $s$ bounded derivatives), the approximation error of a ReLU network with $N$ parameters scales as:

$$\|f - f_N\|_\infty \leq C \cdot N^{-2s/d}$$

where $d$ is the input dimension. This rate suffers from the curse of dimensionality — the error decreases slowly with $N$ when $d$ is large. However, for functions with special structure (e.g., compositional functions), deep networks can break this curse.

Summary

  • Perceptron: A single neuron that computes a linear decision boundary; cannot solve XOR
  • Multilayer networks: Compositions of affine maps and nonlinearities; the building block of deep learning
  • Activation functions: ReLU dominates practice; sigmoid/tanh still used in special cases (e.g., gates in LSTMs)
  • Universal approximation: One hidden layer suffices to approximate any continuous function, but may need exponential width
  • Depth advantage: Deep networks have exponentially more linear regions than shallow ones with the same parameter count
  • Initialization: He init for ReLU, Xavier for sigmoid/tanh — maintaining variance across layers is critical