Generative Models (VAEs & GANs)
Learning to generate data — from variational inference to adversarial training and diffusion
Introduction
Generative models learn the underlying probability distribution of data and can generate new samples from it. Unlike discriminative models that learn $p(y|\mathbf{x})$, generative models learn $p(\mathbf{x})$ or $p(\mathbf{x}|y)$. In science, generative models enable data augmentation, simulation acceleration, anomaly detection, and sampling from complex distributions.
Key Topics
- 1. Variational Autoencoders (VAEs): ELBO Derivation
- 2. The Reparameterization Trick
- 3. Generative Adversarial Networks (GANs)
- 4. Normalizing Flows
- 5. Diffusion Models
- 6. Scientific Applications
1. Latent Variable Models
We assume data $\mathbf{x}$ is generated by first sampling a latent variable $\mathbf{z} \sim p(\mathbf{z})$ from a simple prior (e.g., Gaussian), then sampling $\mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z})$ from a decoder:

$$\mathbf{z} \sim p(\mathbf{z}), \qquad \mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z}), \qquad p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})$$
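As a concrete toy instance of this two-step process, the sketch below samples $\mathbf{z}$ from a standard normal prior and pushes it through a fixed decoder with Gaussian observation noise. The decoder weights here are made up for illustration, not learned:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    # Hypothetical fixed decoder: a nonlinear map from latent to data space.
    W = np.array([[1.0, -0.5], [0.5, 1.0]])
    return np.tanh(W @ z)

# Ancestral sampling: z ~ p(z), then x ~ p(x|z) with Gaussian observation noise.
z = rng.standard_normal(2)                        # z ~ N(0, I)
x = decoder(z) + 0.1 * rng.standard_normal(2)     # x ~ N(decoder(z), 0.1^2 I)
print(x.shape)  # (2,)
```

A trained VAE replaces this fixed map with a neural network and learns its weights by maximizing the ELBO derived below.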
The Intractability Problem
We want to maximize the marginal likelihood $\log p_\theta(\mathbf{x})$, but the integral is intractable for neural network decoders. The posterior $p_\theta(\mathbf{z}|\mathbf{x})$ is also intractable:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}, \qquad p_\theta(\mathbf{z}|\mathbf{x}) = \frac{p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{p_\theta(\mathbf{x})}$$
The VAE resolves this by introducing an approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ (the encoder) and deriving a tractable lower bound.
2. VAE: Deriving the ELBO
Derivation from KL Divergence
Start with the KL divergence between the approximate and true posterior:

$$D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big) = \mathbb{E}_{q_\phi}\left[\log\frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right]$$

Substituting Bayes' rule $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p_\theta(\mathbf{x})$:

$$D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big) = \mathbb{E}_{q_\phi}\big[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})\big] + \log p_\theta(\mathbf{x})$$

Rearranging (noting $\log p_\theta(\mathbf{x})$ is constant w.r.t. $\mathbf{z}$):

$$\log p_\theta(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\text{ELBO}} + D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\big)$$

Since the last term is non-negative, we get the Evidence Lower Bound (ELBO):

$$\log p_\theta(\mathbf{x}) \;\geq\; \mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)$$
Interpreting the ELBO Terms
- Reconstruction term $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})]$: How well can the decoder reconstruct $\mathbf{x}$ from sampled $\mathbf{z}$? For Gaussian decoder, this is negative MSE: $-\|\mathbf{x} - \hat{\mathbf{x}}\|^2/(2\sigma^2)$.
- KL regularization $D_{\text{KL}}(q_\phi \| p)$: Keeps the encoder distribution close to the prior. Prevents the latent space from collapsing to point masses.
Closed-Form KL for Gaussians
When $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$D_{\text{KL}}(q_\phi \,\|\, p) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2\right)$$
Derivation: Using $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\|\mathcal{N}(\mathbf{0},\mathbf{I})) = \frac{1}{2}[\text{tr}(\boldsymbol{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - d - \log\det\boldsymbol{\Sigma}]$ with diagonal $\boldsymbol{\Sigma}$.
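The closed-form KL is a few lines of NumPy. The sketch below sums it over latent dimensions, parameterizing the variance by its log as VAE encoders typically do, and sanity-checks two cases that can be verified by hand:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# q equal to the prior: each term is (1 + 0 - 1 - 0)/2 = 0.
print(kl_diag_gaussian(np.zeros(4), np.zeros(4)))  # 0.0

# Shifted mean, unit variance: KL = 0.5 * sum(mu^2) = 0.5 * 4 = 2.0.
print(kl_diag_gaussian(np.ones(4), np.zeros(4)))   # 2.0
```

In a real VAE this quantity is averaged over the mini-batch and added (with its sign flipped) to the reconstruction term to form the negative ELBO loss.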
3. The Reparameterization Trick
The Problem
The ELBO involves an expectation over $q_\phi(\mathbf{z}|\mathbf{x})$. We approximate this by sampling, but sampling is not differentiable — we cannot backpropagate through $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$.
The Solution
Instead of sampling $\mathbf{z}$ directly, reparameterize it as a deterministic function of the parameters and an independent noise variable:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Now $\mathbf{z}$ is a differentiable function of $\phi$ (through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$), and the stochasticity comes from $\boldsymbol{\epsilon}$ which does not depend on $\phi$. Gradients flow through the deterministic path.
The expectation becomes $\mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})}[\cdot]$, which we estimate with a single sample per data point in each mini-batch.
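A quick way to see the trick at work without any deep learning library: estimate $\partial_\mu\,\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$ by differentiating through reparameterized samples and compare with the analytic answer $\partial_\mu(\mu^2+\sigma^2)=2\mu$. Everything below (the values of $\mu$, $\sigma$, and the test function $z^2$) is a toy setup:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)

# Reparameterize: z = mu + sigma * eps is a differentiable function of (mu, sigma).
z = mu + sigma * eps

# Per-sample derivative of z^2 w.r.t. mu:  d/dmu (mu + sigma*eps)^2 = 2*(mu + sigma*eps).
# Averaging these gives an unbiased Monte Carlo gradient estimate.
grad_est = np.mean(2.0 * z)
print(grad_est)  # ≈ 2*mu = 3.0
```

This is exactly what autodiff frameworks compute when the sampling node is rewritten as `mu + sigma * eps`: the gradient passes through `mu` and `sigma` while `eps` is treated as a constant.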
4. Generative Adversarial Networks (GANs)
GANs (Goodfellow et al., 2014) learn to generate data through a two-player minimax game between a generator $G$ and a discriminator $D$.
The Minimax Objective

$$\min_G \max_D\; V(D, G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]$$
Discriminator ($\max_D$): Tries to correctly classify real data as real ($D(\mathbf{x}) \to 1$) and fake data as fake ($D(G(\mathbf{z})) \to 0$).
Generator ($\min_G$): Tries to fool the discriminator by generating data that $D$ classifies as real ($D(G(\mathbf{z})) \to 1$).
Optimal Discriminator
For fixed $G$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_G(\mathbf{x})}$$
Proof: The value function can be written as:

$$V(D, G) = \int \big[p_{\text{data}}(\mathbf{x})\log D(\mathbf{x}) + p_G(\mathbf{x})\log(1 - D(\mathbf{x}))\big]\,d\mathbf{x}$$
Taking the derivative w.r.t. $D(\mathbf{x})$ and setting to zero gives the result. The function$a\log y + b\log(1-y)$ is maximized at $y = a/(a+b)$.
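The pointwise result $y^* = a/(a+b)$ is easy to verify numerically. The snippet below maximizes $a\log y + b\log(1-y)$ on a grid, with made-up values of $a$ and $b$ standing in for $p_{\text{data}}(\mathbf{x})$ and $p_G(\mathbf{x})$ at one fixed point $\mathbf{x}$:

```python
import numpy as np

a, b = 0.8, 0.3   # stand-ins for p_data(x) and p_G(x) at a fixed x
y = np.linspace(1e-6, 1 - 1e-6, 100_001)
f = a * np.log(y) + b * np.log(1 - y)

# The grid maximizer should match the analytic optimum a / (a + b).
y_star = y[np.argmax(f)]
print(y_star, a / (a + b))   # both ≈ 0.7273
```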
Global Optimum
Substituting $D^*$ back into $V$:

$$V(D^*, G) = -\log 4 + 2\,D_{\text{JS}}(p_{\text{data}} \,\|\, p_G)$$

where $D_{\text{JS}}$ is the Jensen-Shannon divergence:

$$D_{\text{JS}}(p \,\|\, q) = \frac{1}{2}D_{\text{KL}}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2}D_{\text{KL}}\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$$

The global minimum is achieved when $p_G = p_{\text{data}}$, giving $D_{\text{JS}} = 0$ and $D^*(\mathbf{x}) = 1/2$ everywhere.
Training Instabilities
- Mode collapse: Generator produces only a few modes of the data distribution
- Vanishing gradients: If $D$ is too strong, $D(G(\mathbf{z})) \to 0$ and $\log(1-D(G(\mathbf{z})))$ saturates, providing almost no gradient signal to $G$
- Non-saturating loss: Train $G$ to maximize $\log D(G(\mathbf{z}))$ instead (provides stronger gradients early)
- Wasserstein GAN: Uses Earth Mover distance instead of JS divergence, providing smoother gradients
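The vanishing-gradient and non-saturating-loss points can be made concrete by comparing how sensitive each generator loss is to the discriminator's score $s = D(G(\mathbf{z}))$ when the discriminator is confidently rejecting samples (the full generator gradient also includes $\partial s/\partial G$ factors, which this toy comparison ignores):

```python
# Early in training the discriminator often wins: s = D(G(z)) is close to 0.
s = 1e-3

# Saturating loss  log(1 - s):  |d/ds| = 1 / (1 - s)  -> stays near 1
grad_saturating = 1.0 / (1.0 - s)

# Non-saturating loss  -log(s):  |d/ds| = 1 / s  -> blows up as s -> 0
grad_non_saturating = 1.0 / s

print(grad_saturating, grad_non_saturating)  # ≈ 1.001 vs 1000.0
```

This is why maximizing $\log D(G(\mathbf{z}))$ gives the generator a much stronger early learning signal than minimizing $\log(1-D(G(\mathbf{z})))$.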
5. Normalizing Flows
Normalizing flows transform a simple base distribution through a sequence of invertible maps, providing exact likelihood computation.
Change of Variables
If $\mathbf{x} = f(\mathbf{z})$ where $f$ is invertible and $\mathbf{z} \sim p_Z(\mathbf{z})$:

$$p_X(\mathbf{x}) = p_Z\big(f^{-1}(\mathbf{x})\big)\left|\det\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right|$$
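For a scalar example, take $f(z) = e^z$ with $z \sim \mathcal{N}(0,1)$. The change of variables gives $p_X(x) = p_Z(\log x)\cdot|d(\log x)/dx| = p_Z(\log x)/x$ (the standard log-normal density), and the sketch below checks numerically that this transformed density still integrates to 1:

```python
import numpy as np

# x = exp(z), z ~ N(0,1).  Change of variables: p_X(x) = p_Z(log x) / x.
def p_x(x):
    return np.exp(-0.5 * np.log(x)**2) / (np.sqrt(2 * np.pi) * x)

# Trapezoidal integration over (0, 200]; the tail beyond 200 is negligible.
xs = np.linspace(1e-6, 200.0, 400_000)
p = p_x(xs)
integral = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(xs))
print(integral)  # ≈ 1.0
```

The Jacobian factor $1/x$ is what keeps the density normalized; dropping it would not yield a valid probability distribution.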
Composition of Flows
A normalizing flow composes $K$ invertible transformations:

$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0), \qquad \mathbf{z}_0 \sim p_0(\mathbf{z}_0)$$

The log-likelihood is:

$$\log p_X(\mathbf{x}) = \log p_0(\mathbf{z}_0) - \sum_{k=1}^{K}\log\left|\det\frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right|$$
The key design challenge is making the Jacobian determinant efficient to compute. Coupling layers (RealNVP, Glow) achieve $O(d)$ Jacobian computation.
6. Diffusion Models
Diffusion models (Ho et al., 2020; Song & Ermon, 2019) generate data by learning to reverse a gradual noising process. They now produce the highest quality samples across many domains.
Forward Process (Noising)
Gradually add Gaussian noise over $T$ steps with variance schedule $\beta_1, \ldots, \beta_T$:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t\mathbf{I}\big)$$

Using $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, we can sample any step directly:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
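The closed-form shortcut can be checked against the step-by-step process on toy 1-D data. With standard-normal "data" the marginal variance stays 1 at every step, so the two sampling routes should agree in distribution (the schedule values below are illustrative, loosely following the DDPM linear schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = rng.standard_normal(50_000)     # toy 1-D "data": standard normal

# Route 1: iterate q(x_t | x_{t-1}) for all T steps.
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

# Route 2: jump straight to step T with the closed form.
x_direct = (np.sqrt(alpha_bar[-1]) * x0
            + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(x0.shape))

print(np.var(x), np.var(x_direct))  # both ≈ 1: the marginals match
```

This shortcut is what makes training efficient: each mini-batch samples a random $t$ and noises $\mathbf{x}_0$ in one shot instead of simulating the whole chain.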
Reverse Process (Denoising)
Learn a neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ to predict the noise:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \sigma_t^2\mathbf{I}\big)$$

The training loss (simplified) is:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\left[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\big)\big\|^2\right]$$

This is derived from the variational bound and simplifies to predicting the noise $\boldsymbol{\epsilon}$ added to the clean data.
Score-Based Perspective
The connection to score matching: the model learns the score function (the gradient of the log-density):

$$\mathbf{s}_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
Sampling then corresponds to Langevin dynamics, following the score to high-density regions.
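As a minimal illustration of score-following, the sketch below runs unadjusted Langevin dynamics with the known score of $\mathcal{N}(2, 1)$ standing in for a learned $\mathbf{s}_\theta$; the chains drift toward and then fluctuate around the high-density region, matching the target's mean and variance (the step size and chain count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Known score of the target N(2, 1):  grad_x log p(x) = -(x - 2).
def score(x):
    return -(x - 2.0)

# Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * noise
x = rng.standard_normal(10_000)   # many independent chains, arbitrary init
step = 0.01
for _ in range(1_500):
    x = x + 0.5 * step * score(x) + np.sqrt(step) * rng.standard_normal(x.shape)

print(x.mean(), x.var())  # ≈ 2.0 and ≈ 1.0, the target N(2, 1)
```

In a diffusion model the hand-written `score` is replaced by the network output $-\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)/\sqrt{1-\bar{\alpha}_t}$, evaluated at decreasing noise levels.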
7. Scientific Applications
Molecular Design
VAEs generate novel drug-like molecules by learning a continuous latent space of molecular structures. Interpolation in latent space produces smooth transitions between molecules, enabling property-guided optimization.
Simulation Acceleration
GANs generate synthetic physics simulation outputs (e.g., particle showers in calorimeters) orders of magnitude faster than full Monte Carlo simulation. CaloGAN and similar models produce detector-level data for high-energy physics.
Cosmological Simulation
Generative models create synthetic dark matter density fields and galaxy catalogs, accelerating Bayesian inference for cosmological parameters by replacing expensive N-body simulations.
Anomaly Detection
VAEs detect anomalies via high reconstruction error: data points unlike the training distribution yield poor reconstructions. This is used for new physics searches in particle collider data and for detecting unusual astronomical transients.
Boltzmann Generators
Normalizing flows can be trained to sample from Boltzmann distributions in statistical mechanics (Noe et al., 2019). Given a target energy function $U(\mathbf{x})$, the target density is:

$$p(\mathbf{x}) = \frac{1}{Z}\,e^{-U(\mathbf{x})/(k_B T)}, \qquad Z = \int e^{-U(\mathbf{x})/(k_B T)}\,d\mathbf{x}$$
A normalizing flow trained with the loss $D_{\text{KL}}(q_\theta\|p)$ generates independent samples from the target distribution, bypassing the autocorrelation problem of MCMC methods. This enables efficient sampling of molecular configurations.
8. Comparison of Generative Models
| Property | VAE | GAN | Flow | Diffusion |
|---|---|---|---|---|
| Exact likelihood | Lower bound | No | Yes | Lower bound |
| Sample quality | Blurry | Sharp | Good | Excellent |
| Training | Stable | Unstable | Stable | Stable |
| Latent space | Structured | Unstructured | Structured | N/A |
| Mode coverage | Good | Mode collapse | Good | Excellent |
9. Python Simulation: VAE & GAN
This simulation implements a VAE and GAN on a 2D mixture of Gaussians, demonstrating the ELBO, reparameterization trick, and adversarial training.
VAE & GAN on 2D Gaussian Mixture
Summary
- VAE ELBO: $\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$
- Reparameterization: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ enables backprop through sampling
- GAN objective: $\min_G\max_D\;\mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1-D(G(\mathbf{z})))]$
- Normalizing flows: Exact likelihood via change of variables and invertible transforms
- Diffusion: Learn to denoise; train by predicting added noise at random timesteps
- Science: Generative models accelerate simulations, augment data, and sample from complex distributions