Module 2 · Continuous Black-Box Optimisation

Evolution Strategies & CMA-ES

Evolution strategies (ES) work in continuous spaces \(\mathbb{R}^n\). Their modern form, the covariance-matrix adaptation evolution strategy (CMA-ES, Hansen & Ostermeier 2001), is among the strongest general-purpose black-box optimisers published — competitive on standard benchmarks with problem-specific solvers, and a default tool when gradients are unavailable, noisy, or expensive. This module builds it from scratch.

1. The (1+1)-ES

The simplest ES has one parent and one offspring. At iteration \(t\):

\[ \mathbf{x}' \;=\; \mathbf{x}_t + \sigma_t\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \]

If \(f(\mathbf{x}') < f(\mathbf{x}_t)\) (assuming minimisation), accept; otherwise reject. The step size \(\sigma_t\) is adapted by Rechenberg's 1/5 success rule: increase \(\sigma\) if successes are too frequent (the search is wasting opportunities for bigger steps); decrease it if they are too rare. The empirical target is a 1-in-5 success rate.

if success_rate_over_last_n_iters > 0.2:    sigma *= a    # a > 1: frequent success, take bigger steps
elif success_rate_over_last_n_iters < 0.2:  sigma /= a    # typical per-iteration factor a = 1.22 ** (1/n)
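
Putting the pieces together, here is a minimal, illustrative (1+1)-ES loop in Python/NumPy. It uses a common per-iteration variant of the 1/5 rule (grow \(\sigma\) by a on success, shrink by a^(1/4) on failure, which equilibrates at a 1-in-5 success rate); the function name and constants are illustrative, not taken from any library.

import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, iters=2000):
    """Minimal (1+1)-ES with a per-iteration 1/5 success rule (sketch)."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    n = x.size
    a = 1.22 ** (1.0 / n)                    # per-iteration adaptation factor (assumed)
    for _ in range(iters):
        y = x + sigma * np.random.randn(n)   # isotropic Gaussian mutation
        fy = f(y)
        if fy < fx:                          # success: accept and enlarge the step size
            x, fx, sigma = y, fy, sigma * a
        else:                                # failure: reject and shrink it more gently
            sigma /= a ** 0.25               # grow-on-1 / shrink-on-4 balances at p = 1/5
    return x, fx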

2. The (μ, λ)-ES

The population variant: at each generation, \(\lambda\) offspring are sampled from \(\mu\) parents, and only the best \(\mu\) of the offspring (no parents) survive. With \(\mu/\lambda \approx 1/4\) and weighted recombination, the population carries enough information to adapt the step size self-adaptively: each individual stores its own \(\sigma\) as part of its genome, and selection on fitness implicitly selects on step sizes too.

The (μ + λ) variant keeps parents and offspring together in the selection pool — more elitist, slower exploration. The (μ, λ) variant deliberately lets parents die: this is essential when noise, dynamic landscapes, or aggressive step-size adaptation would otherwise leave the search stuck.
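
As a concrete illustration of comma selection with self-adaptive step sizes, here is a small sketch; the names and the log-normal mutation rate tau are illustrative assumptions, not a canonical recipe.

import numpy as np

def mu_comma_lambda_es(f, x0, sigma0=1.0, mu=5, lam=20, gens=200):
    """(mu, lambda)-ES sketch: each individual carries its own step size."""
    n = len(x0)
    tau = 1.0 / np.sqrt(n)                                # self-adaptation rate (assumed)
    pop = [(np.array(x0, dtype=float), sigma0)] * mu
    for _ in range(gens):
        offspring = []
        for _ in range(lam):
            x, s = pop[np.random.randint(mu)]             # pick a random parent
            s_new = s * np.exp(tau * np.random.randn())   # mutate the step size first ...
            x_new = x + s_new * np.random.randn(n)        # ... then the parameters with it
            offspring.append((x_new, s_new))
        offspring.sort(key=lambda ind: f(ind[0]))         # comma selection: parents are discarded
        pop = offspring[:mu]
    return pop[0]

Selection on fitness alone favours offspring whose \(\sigma\) happened to produce good steps, so the step sizes adapt without any explicit rule.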

3. From Isotropic to Anisotropic Search

An isotropic Gaussian \(\sigma^2 \mathbf{I}\) explores all directions equally. On an ill-conditioned function — one whose level sets are elongated ellipsoids — this is catastrophically inefficient. The remedy: replace \(\sigma^2 \mathbf{I}\) by a general covariance matrix \(\mathbf{C}\) and sample

\[ \mathbf{x}^{(k)} \;=\; \mathbf{m} + \sigma\,\mathcal{N}(\mathbf{0},\mathbf{C}) \]

The challenge is to learn \(\mathbf{C}\) from the search history, cheaply and stably. CMA-ES solves it.
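
Sampling from \(\mathcal{N}(\mathbf{m}, \sigma^2\mathbf{C})\) is itself cheap once \(\mathbf{C}\) is factorised. A sketch using the eigendecomposition \(\mathbf{C} = \mathbf{B}\mathbf{D}^2\mathbf{B}^\top\) (the same factorisation CMA-ES uses below; a Cholesky factor would work equally well):

import numpy as np

def sample_population(m, sigma, C, lam):
    """Draw lam candidates from N(m, sigma^2 C) via C = B D^2 B^T (sketch)."""
    eigvals, B = np.linalg.eigh(C)               # C is symmetric positive definite
    D = np.sqrt(np.maximum(eigvals, 0.0))        # axis lengths of the search ellipsoid
    Z = np.random.randn(lam, len(m))             # z ~ N(0, I), one row per candidate
    return m + sigma * (Z * D) @ B.T             # x = m + sigma * B D z, row-wise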

4. CMA-ES, the Algorithm

State at generation \(t\): mean \(\mathbf{m}\in\mathbb{R}^n\), step size \(\sigma > 0\), covariance \(\mathbf{C}\in\mathbb{R}^{n\times n}\), evolution paths \(\mathbf{p}_\sigma, \mathbf{p}_c \in \mathbb{R}^n\).

Sample:

\[ \mathbf{x}^{(k)} \;=\; \mathbf{m} + \sigma\,\mathbf{B}\mathbf{D}\,\mathbf{z}^{(k)},\qquad \mathbf{C} = \mathbf{B}\mathbf{D}^2\mathbf{B}^\top \]

Sort and recombine:

\[ \mathbf{m}_{t+1} \;=\; \sum_{i=1}^{\mu} w_i\,\mathbf{x}^{(i:\lambda)},\qquad \sum w_i = 1,\; w_i \propto \log(\lambda/2 + 1/2) - \log i \]

where \(\mathbf{x}^{(i:\lambda)}\) is the \(i\)-th best offspring. The log-weighting concentrates updates on the very best samples and is one of CMA-ES’s signature design choices.
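
For concreteness, the weights and the variance-effective selection mass \(\mu_\mathrm{eff}\) (used in the path updates below) take only a few lines to compute; \(\lambda = 12\) here is an assumed example value.

import numpy as np

lam = 12
mu = lam // 2
w = np.log(lam / 2 + 0.5) - np.log(np.arange(1, mu + 1))   # log-rank weights, positive for i <= mu
w /= w.sum()                                               # normalise: sum(w) = 1
mu_eff = 1.0 / np.sum(w ** 2)                              # variance-effective selection mass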

Update evolution paths:

\[ \mathbf{p}_\sigma \leftarrow (1-c_\sigma)\mathbf{p}_\sigma + \sqrt{c_\sigma(2-c_\sigma)\mu_\mathrm{eff}}\,\mathbf{C}^{-1/2}\,\frac{\mathbf{m}_{t+1} - \mathbf{m}_t}{\sigma} \]

\[ \mathbf{p}_c \leftarrow (1-c_c)\mathbf{p}_c + h_\sigma \sqrt{c_c(2-c_c)\mu_\mathrm{eff}}\,\frac{\mathbf{m}_{t+1} - \mathbf{m}_t}{\sigma} \]

where \(h_\sigma \in \{0, 1\}\) is a stall indicator that switches the second term off while \(\|\mathbf{p}_\sigma\|\) is unusually long, so that a temporarily mis-adapted step size cannot distort \(\mathbf{C}\).

Update covariance:

\[ \mathbf{C} \leftarrow (1 - c_1 - c_\mu)\mathbf{C} + c_1\,\mathbf{p}_c \mathbf{p}_c^\top + c_\mu \sum_{i=1}^\mu w_i\, \mathbf{y}^{(i:\lambda)} \mathbf{y}^{(i:\lambda)\top} \]

where \(\mathbf{y}^{(i:\lambda)} = (\mathbf{x}^{(i:\lambda)} - \mathbf{m}_t)/\sigma\) are the selected mutation steps. The \(c_1\) term is the rank-one update (from the path), the \(c_\mu\) term the rank-\(\mu\) update (from the current population).

Update step size:

\[ \sigma \leftarrow \sigma \exp\!\Bigl(\tfrac{c_\sigma}{d_\sigma}\Bigl(\frac{\|\mathbf{p}_\sigma\|}{\mathbb{E}\|\mathcal{N}(\mathbf{0},\mathbf{I})\|} - 1\Bigr)\Bigr) \]

The intuition: if successive \(\mathbf{m}\)-updates point in correlated directions (\(\|\mathbf{p}_\sigma\|\) large), the step size is too small; if they cancel out (\(\|\mathbf{p}_\sigma\|\) small), too large. The covariance update similarly aligns the search distribution with the directions of consistent improvement.
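
The whole algorithm fits comfortably on a page. Below is a compact, illustrative implementation of the equations above in Python/NumPy; it omits termination criteria, restarts, and the small correction applied when \(h_\sigma = 0\), and is a teaching sketch rather than the reference pycma implementation.

import numpy as np

def cma_es(f, m0, sigma0, max_gens=500):
    """Compact CMA-ES following the update equations above (teaching sketch)."""
    m = np.array(m0, dtype=float)
    n = m.size
    sigma = float(sigma0)

    # Selection and recombination parameters (default recipe, see section 6)
    lam = 4 + int(3 * np.log(n))
    mu = lam // 2
    w = np.log(lam / 2 + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_eff = 1.0 / np.sum(w ** 2)

    # Adaptation constants
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + 2 * max(0.0, np.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma
    c_c = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)
    c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))   # E||N(0, I)||

    p_sigma = np.zeros(n)
    p_c = np.zeros(n)
    C = np.eye(n)

    for gen in range(max_gens):
        # Sample: x_k = m + sigma * B D z_k  with  C = B D^2 B^T
        eigvals, B = np.linalg.eigh(C)
        D = np.sqrt(np.maximum(eigvals, 1e-20))
        Z = np.random.randn(lam, n)
        Y = (Z * D) @ B.T                            # y_k = B D z_k ~ N(0, C)
        X = m + sigma * Y
        order = np.argsort([f(x) for x in X])        # rank by fitness (minimisation)
        Y_sel = Y[order[:mu]]                        # the mu best steps

        # Recombine: new mean, and the scaled mean shift (m_{t+1} - m_t) / sigma
        y_w = w @ Y_sel
        m = m + sigma * y_w

        # Evolution paths
        C_inv_sqrt = B @ np.diag(1.0 / D) @ B.T
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * (C_inv_sqrt @ y_w))
        h_sigma = float(np.linalg.norm(p_sigma)
                        / np.sqrt(1 - (1 - c_sigma) ** (2 * (gen + 1)))
                        < (1.4 + 2 / (n + 1)) * chi_n)
        p_c = (1 - c_c) * p_c + h_sigma * np.sqrt(c_c * (2 - c_c) * mu_eff) * y_w

        # Covariance update: rank-one term from p_c, rank-mu term from the selected steps
        rank_mu = (Y_sel * w[:, None]).T @ Y_sel
        C = (1 - c_1 - c_mu) * C + c_1 * np.outer(p_c, p_c) + c_mu * rank_mu

        # Step-size update via the length of p_sigma
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m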

5. CMA-ES as Information-Geometric Optimisation

The 2010s yielded a clean theoretical reading of CMA-ES via the Information-Geometric Optimisation framework of Ollivier, Arnold, Auger & Hansen (IGO, 2011). View the search distribution as a point on the statistical manifold of multivariate Gaussians and perform natural-gradient ascent on a rank-based smoothing of the objective. CMA-ES's update rules emerge as natural-gradient steps: in particular, the rank-\(\mu\) covariance update is precisely the natural-gradient update on the Gaussian family, with the Fisher information supplying the metric.

This connects evolutionary computation to natural-gradient methods in deep learning (NES, Wierstra et al. 2014; the K-FAC family) and explains the algorithm's robustness: it is invariant under rank-preserving (monotone) transformations of the fitness and, given correspondingly transformed initial conditions, under affine transformations of the search space.

6. Default Hyperparameters

The remarkable feature of CMA-ES is that it has essentially no tunable hyperparameters — everything is derived from \(n\), the problem dimension. The standard recipe (Hansen 2016 tutorial):

  • \(\lambda = 4 + \lfloor 3 \ln n \rfloor\)
  • \(\mu = \lfloor \lambda / 2 \rfloor\)
  • \(c_\sigma = (\mu_\mathrm{eff} + 2)/(n + \mu_\mathrm{eff} + 5)\)
  • \(c_c = (4 + \mu_\mathrm{eff}/n)/(n + 4 + 2\mu_\mathrm{eff}/n)\)
  • \(c_1 \approx 2/((n+1.3)^2 + \mu_\mathrm{eff})\)
  • \(c_\mu \approx \min(1 - c_1,\,2(\mu_\mathrm{eff} - 2 + 1/\mu_\mathrm{eff})/((n+2)^2 + \mu_\mathrm{eff}))\)

The user supplies only the initial mean \(\mathbf{m}_0\) and step size \(\sigma_0\) (typically a fifth of the parameter range) and lets the algorithm run. On benchmarks like BBOB / COCO, CMA-ES consistently outperforms non-CMA evolutionary methods and stays competitive with Bayesian optimisation in dimensions up to a few hundred.
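
As a usage illustration, and assuming the cma_es sketch from section 4 is in scope, a 10-dimensional Rosenbrock run needs nothing beyond \(\mathbf{m}_0\) and \(\sigma_0\):

import numpy as np

def rosenbrock(x):
    """Classic ill-conditioned, non-separable test function (minimum at the all-ones vector)."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

x_best = cma_es(rosenbrock, m0=np.zeros(10), sigma0=0.5)   # every other constant is derived from n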

7. Variants & Extensions

  • sep-CMA-ES & LM-CMA: diagonal or low-rank covariance for \(n \gtrsim 100\), where storing and factorising the full \(\mathbf{C}\) is too expensive.
  • BIPOP- & IPOP-CMA-ES: restart with growing or alternating population sizes to escape local optima. Standard reference algorithms on the BBOB benchmarks.
  • Active CMA-ES: the \(\mu\) worst samples actively shrink \(\mathbf{C}\) along their directions. Faster convergence on smooth problems.
  • CMA-ES with surrogate models: lq-CMA-ES (Hansen 2019), saACM-ES — a cheap surrogate model (linear-quadratic in lq-CMA-ES) replaces part of the true fitness evaluations. The current state of the art for very expensive objectives (1–5 minutes per evaluation).

8. Natural Evolution Strategies

NES (Wierstra, Schaul, Glasmachers, Sun, Peters, Schmidhuber 2008–2014) is the explicit natural-gradient version of ES. It parameterises a search distribution \(\pi_\theta\) and ascends

\[ J(\theta) \;=\; \mathbb{E}_{\mathbf{x} \sim \pi_\theta}[\,f(\mathbf{x})\,] \]

by computing the natural gradient \(\tilde\nabla J = \mathbf{F}^{-1}\nabla J\), with \(\mathbf{F}\) the Fisher information of \(\pi_\theta\). For Gaussian \(\pi_\theta\), NES recovers most of CMA-ES; for other distributions (Cauchy, mixtures), it generalises the ES paradigm. NES is the bridge from ES to the modern deep-RL flavour of ES (Module 6).
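
A minimal sketch with an isotropic Gaussian search distribution makes the natural-gradient step concrete; the fitness shaping and the xNES-style coordinates below are illustrative choices, not the only ones used in the NES papers.

import numpy as np

def nes_isotropic(f, mu0, sigma0=1.0, lam=50, eta_mu=1.0, eta_sigma=0.1, gens=300):
    """NES sketch with pi_theta = N(mu, sigma^2 I), maximising f (illustrative)."""
    mu = np.array(mu0, dtype=float)
    sigma = float(sigma0)
    n = mu.size
    for _ in range(gens):
        Z = np.random.randn(lam, n)                    # z_k ~ N(0, I), x_k = mu + sigma * z_k
        fit = np.array([f(mu + sigma * z) for z in Z])
        u = (fit - fit.mean()) / (fit.std() + 1e-12)   # baseline / fitness shaping (assumed choice)
        # Natural-gradient steps for the Gaussian family (Fisher-preconditioned log-likelihood trick)
        mu = mu + eta_mu * sigma * (u @ Z) / lam                                      # mean update
        sigma = sigma * np.exp(eta_sigma / (2 * n) * (u @ ((Z ** 2).sum(axis=1) - n)) / lam)
    return mu, sigma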