Module 6 · Evolution as Policy Optimisation

Evolutionary Reinforcement Learning

Reinforcement learning is policy optimisation: find \(\theta\) maximising \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\). The standard tool is the policy gradient, which estimates \(\nabla J\) via REINFORCE / actor-critic and applies SGD. Evolution strategies are an alternative: estimate \(\nabla J\) via finite-difference-style perturbations of \(\theta\) and apply the same SGD step. The 2017 OpenAI ES paper (Salimans, Ho, Chen, Sidor, Sutskever) showed that on Atari and MuJoCo this alternative is competitive with deep RL while being far easier to scale on commodity CPU clusters.

1. The Score-Function Estimator, Reframed

Smooth the objective by adding Gaussian noise to \(\theta\). The smoothed objective is

\[ \tilde J(\theta) \;=\; \mathbb{E}_{\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})}[\,J(\theta + \sigma\,\boldsymbol{\epsilon})\,] \]

Its gradient admits a finite-difference-like Monte Carlo estimator:

\[ \nabla_\theta \tilde J(\theta) \;=\; \frac{1}{\sigma}\,\mathbb{E}_{\boldsymbol{\epsilon}}\!\bigl[\,J(\theta + \sigma\,\boldsymbol{\epsilon})\,\boldsymbol{\epsilon}\,\bigr] \]

Estimate this by sampling \(\lambda\) perturbations, running one rollout per perturbation, and weighting each perturbation by its return. This is the OpenAI ES gradient: a degenerate special case of NES (Module 2) with an isotropic Gaussian, no covariance adaptation, and no step-size adaptation.
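As a sanity check on the estimator, here is a minimal NumPy sketch (the objective J, the sample count lam, and all constants are illustrative placeholders, not taken from the paper):

import numpy as np

def es_gradient(J, theta, sigma=0.1, lam=1000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed objective.

    J     : callable mapping a parameter vector to a scalar return
    theta : current parameter vector, shape (N,)
    """
    eps = rng.standard_normal((lam, theta.size))            # lambda perturbation directions
    returns = np.array([J(theta + sigma * e) for e in eps])
    # (1 / (lambda * sigma)) * sum_i J(theta + sigma * eps_i) * eps_i
    return (returns @ eps) / (lam * sigma)

# quadratic check: grad of -||theta||^2 at (1, -2, 0.5) is (-2, 4, -1);
# the estimate is noisy but should land near that vector for large lam
theta = np.array([1.0, -2.0, 0.5])
print(es_gradient(lambda th: -np.sum(th ** 2), theta, sigma=0.05, lam=50_000))

With raw-return weighting the estimate is noisy and converges slowly, which is exactly what the mirrored sampling and rank shaping of the next section address.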

2. OpenAI ES, the Algorithm

hyperparams: sigma, learning_rate alpha, lambda workers
state: theta in R^N

repeat:
    broadcast a single integer seed s_t to all workers
    on each worker i = 1..lambda:
        epsilon_i ~ N(0, I)              # generated locally from seed s_t + i
        F_i      = rollout(theta + sigma * epsilon_i)
        send back (i, F_i)               # ONE float per worker per generation

    on master, with all (i, F_i):
        # rank-shape returns to centred utilities u_i in [-0.5, 0.5]
        u_i  = rank_utilities([F_1, ..., F_lambda])
        grad = (1 / (lambda * sigma)) * sum_i u_i * epsilon_i
        theta = theta + alpha * grad

Three small choices make this work at scale:

  • Mirrored sampling. For each \(\boldsymbol{\epsilon}\), also evaluate \(-\boldsymbol{\epsilon}\). Roughly halves gradient variance for the price of one extra rollout per perturbation.
  • Rank-based fitness shaping. Replace raw returns with ranks (e.g. linear weights or NES centred-rank weights). Eliminates outlier sensitivity and makes the algorithm invariant to rescaling of the fitness.
  • Shared random seed. Workers regenerate \(\boldsymbol{\epsilon}_i\) from a master seed and only return scalar returns. Communication is \(O(\lambda)\) floats per generation regardless of network size, which is the entire reason ES scales linearly to thousands of CPU cores.
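The loop above can be simulated in one process to see the three choices working together; a minimal sketch with hypothetical names (rollout stands in for an environment episode, and nothing here is the authors' code):

import numpy as np

def centred_ranks(returns):
    # rank-based fitness shaping: map returns to utilities in [-0.5, 0.5]
    ranks = np.empty(len(returns))
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

def es_generation(theta, rollout, sigma=0.02, alpha=0.01, lam=8, seed=0):
    # theta: float parameter vector
    # --- worker side: regenerate noise from the shared seed, return scalars only ---
    scores = []
    for i in range(lam):
        eps = np.random.default_rng(seed + i).standard_normal(theta.size)
        scores.append((rollout(theta + sigma * eps),        # F(theta + sigma * eps)
                       rollout(theta - sigma * eps)))       # mirrored evaluation
    # --- master side: rebuild the same noise, rank-shape, take the SGD step ---
    u = centred_ranks(np.array(scores).ravel()).reshape(lam, 2)
    grad = np.zeros_like(theta)
    for i in range(lam):
        eps = np.random.default_rng(seed + i).standard_normal(theta.size)
        grad += (u[i, 0] - u[i, 1]) * eps                   # each mirrored pair contributes once
    return theta + alpha * grad / (lam * sigma)

In a real deployment only the scalar pairs cross the network; the noise is reconstructed identically on both sides from seed + i.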

3. ES vs Policy Gradient: The Trade-Off

Policy gradient and ES estimate the same expected gradient via different sources of randomness. PG samples noise inside the rollout (action stochasticity); ES samples noise over the parameters. The variance comparison is roughly:

\[ \mathrm{Var}_{\mathrm{PG}} \;\propto\; T,\qquad \mathrm{Var}_{\mathrm{ES}} \;\propto\; N/\sigma^2 \]

where \(T\) is rollout length and \(N\) is the parameter count. ES wins when \(T \gg N/\sigma^2\): long episodes, sparse rewards, no per-step credit-assignment signal, or a policy small enough that \(N\) is modest. PG wins when \(N\) is large and dense per-step rewards exist.
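One back-of-envelope way to read these scalings (a sketch, treating the return as roughly constant over the perturbation, not a formal bound): the single-sample ES estimator multiplies one scalar return by an \(N\)-dimensional noise vector, so its second moment picks up a factor of \(N/\sigma^2\); the single-trajectory REINFORCE estimator is a sum of \(T\) per-step score terms, so its variance grows with the horizon.

\[ \hat g_{\mathrm{ES}} \;=\; \frac{1}{\sigma}\,J(\theta+\sigma\boldsymbol{\epsilon})\,\boldsymbol{\epsilon} \;\;\Rightarrow\;\; \mathbb{E}\bigl[\|\hat g_{\mathrm{ES}}\|^2\bigr] \;\approx\; \frac{J(\theta)^2}{\sigma^2}\,\mathbb{E}\bigl[\|\boldsymbol{\epsilon}\|^2\bigr] \;=\; \frac{J(\theta)^2\,N}{\sigma^2}, \qquad \hat g_{\mathrm{PG}} \;=\; R(\tau)\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t) \]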

Other ES advantages: no need for action-value estimation, no off-policy corrections, no replay buffer. ES is also invariant to action-space discretisation and frame-skip, where PG can be surprisingly sensitive to both.

4. Augmented Random Search (ARS)

Mania, Guy & Recht (2018) showed that on the MuJoCo continuous-control benchmarks, an even simpler ES, ARS, beats both deep PG and OpenAI ES. ARS uses:

  • Linear policies (not deep networks).
  • Mirrored sampling, top-\(b\) truncation (use only the best \(b\) of \(\lambda\) pairs in the gradient estimate).
  • Per-feature observation normalisation by online mean/std.
  • Rescaling of the update by the standard deviation of the returns collected that iteration.

The result: linear policies match or exceed published deep-RL benchmarks on MuJoCo. The modest takeaway: deep policies are often unnecessary; the immodest one: published deep-RL results were systematically over-fitted to particular seeds and hyperparameters, and a much simpler baseline reveals it.
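A minimal sketch of one ARS-style update for a linear policy (all names and constants are illustrative, rollout is a placeholder for an environment evaluation, and the running observation normalisation is assumed to live inside it):

import numpy as np

def ars_update(M, rollout, sigma=0.03, alpha=0.02, lam=16, b=8,
               rng=np.random.default_rng(0)):
    """One ARS-style step on a linear policy matrix M (action = M @ normalised obs)."""
    deltas = [rng.standard_normal(M.shape) for _ in range(lam)]
    pairs = [(rollout(M + sigma * d), rollout(M - sigma * d), d) for d in deltas]
    # top-b truncation: keep the directions whose better side scored highest
    pairs = sorted(pairs, key=lambda p: max(p[0], p[1]), reverse=True)[:b]
    # rescale the step by the std of the returns actually used
    sigma_R = np.std([r for p in pairs for r in p[:2]]) + 1e-8
    step = sum((r_plus - r_minus) * d for r_plus, r_minus, d in pairs)
    return M + (alpha / (b * sigma_R)) * step

Top-\(b\) truncation and the \(\sigma_R\) rescaling are the two extras that distinguish this from the plain mirrored-ES step.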

5. PEPG & xNES Variants

Schaul et al.'s NES family (Module 2) gives a principled middle ground. xNES (exponential NES) does full-rank covariance updates in a parameterisation that keeps the matrix positive-definite by construction. PEPG (Sehnke et al. 2010) uses parameter-exploring policy gradients with diagonal covariance, a sweet spot between OpenAI ES (no covariance) and CMA-ES (full covariance). For policy networks of \(N \approx 10^3{-}10^4\) parameters, PEPG and xNES often outperform OpenAI ES at the same wall-clock cost.
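To make the diagonal-covariance middle ground concrete, here is a sketch of a score-function update for a factorised Gaussian over parameters, the structure PEPG-style methods exploit (this is the generic estimator with hypothetical names and constants, not the exact PEPG update rule):

import numpy as np

def diagonal_gaussian_step(J, mu, sigma, lr_mu=0.05, lr_sigma=0.02, lam=50,
                           rng=np.random.default_rng(0)):
    """One update of a factorised-Gaussian search distribution N(mu, diag(sigma^2))."""
    eps = rng.standard_normal((lam, mu.size))
    thetas = mu + sigma * eps                          # lam candidate parameter vectors
    f = np.array([J(th) for th in thetas])
    f = (f - f.mean()) / (f.std() + 1e-8)              # baseline + rescale for variance reduction
    # score-function gradients of log N(theta; mu, diag(sigma^2)):
    #   d/d mu_i    = eps_i / sigma_i
    #   d/d sigma_i = (eps_i^2 - 1) / sigma_i
    grad_mu = (f[:, None] * eps / sigma).mean(axis=0)
    grad_sigma = (f[:, None] * (eps ** 2 - 1) / sigma).mean(axis=0)
    return mu + lr_mu * grad_mu, np.maximum(sigma + lr_sigma * grad_sigma, 1e-8)

Each coordinate's \(\sigma_i\) now adapts independently, which is the extra expressive power over isotropic OpenAI ES without the \(O(N^2)\) cost of a full covariance.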

6. Hybrids: ERL, CERL, P3S

The cleanest current frontier: combine evolution and gradients in a single agent.

  • ERL (Khadka & Tumer 2018): an evolutionary outer loop maintains a population of actors; a single off-policy actor-critic learner (DDPG-style) trains alongside on a shared replay buffer filled by every population rollout. Periodically the gradient-trained actor is injected into the population. Marries ES exploration with PG sample-efficiency; a skeleton of the loop follows this list.
  • CERL (Khadka et al. 2019): extends ERL to a portfolio of gradient-based learners with different hyperparameters (e.g. discount horizons), allocating compute to whichever learners are currently most promising, all sharing the population's replay buffer.
  • P3S (Jung et al. 2020): population-guided parallel policy search: a population of gradient-trained agents searches in parallel, with the current best agent periodically guiding the others' updates.
  • EvoRL via QD-RL: use MAP-Elites (Module 5) as an outer loop wrapping any policy-gradient method. The QD archive provides a curriculum of increasingly diverse seed policies for fine-tuning.
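A schematic of the ERL outer loop referenced above, with rollout and learner_step as placeholders for the environment and the off-policy actor-critic (everything here is an illustrative skeleton, not the authors' code):

import numpy as np

def erl_loop(rollout, learner_step, pop_size=10, elites=2, noise=0.05,
             sync_every=10, generations=100, dim=8, rng=np.random.default_rng(0)):
    """Schematic ERL: evolutionary population plus one gradient learner.

    rollout(theta) -> (return, transitions); learner_step(theta, buffer) -> theta.
    """
    population = [rng.standard_normal(dim) for _ in range(pop_size)]
    learner = rng.standard_normal(dim)
    buffer = []                                              # shared replay buffer
    for gen in range(1, generations + 1):
        fitness = []
        for theta in population:
            ret, transitions = rollout(theta)
            buffer.extend(transitions)                       # every rollout feeds the buffer
            fitness.append(ret)
        learner = learner_step(learner, buffer)              # PG learner trains off-policy
        order = np.argsort(fitness)[::-1]
        parents = [population[i] for i in order[:elites]]    # truncation selection
        population = parents + [
            parents[rng.integers(elites)] + noise * rng.standard_normal(dim)
            for _ in range(pop_size - elites)]               # mutate elites to refill
        if gen % sync_every == 0:
            population[-1] = learner.copy()                  # inject the gradient-trained actor
    return max(population, key=lambda th: rollout(th)[0])

# toy usage: quadratic "return", empty transitions, learner that decays towards the optimum
# best = erl_loop(lambda th: (-np.sum(th ** 2), []), lambda th, buf: 0.9 * th)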

7. Practical Notes

  • Antithetic / mirrored sampling is essentially mandatory. The roughly halved variance comes almost for free.
  • Use rank-based fitness shaping rather than raw returns. The algorithm becomes invariant to monotonic transformations of the reward signal.
  • Sigma needs to be tuned. Too small and the algorithm cannot escape local optima; too large and the signal-to-noise ratio of the gradient estimate degrades. Adaptive \(\sigma\) (NES-style self-adaptation, or the 1/5th success rule) helps; see the sketch after this list.
  • ES does not need a replay buffer. This is a feature when memory is tight, but it also means ES cannot exploit off-policy data, which is why hybrids dominate when sample-efficiency matters.
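As one concrete instance of adaptive \(\sigma\), a minimal (1+1)-ES with the 1/5th success rule (hypothetical objective and constants; NES instead adapts \(\sigma\) by following its own gradient):

import numpy as np

def one_plus_one_es(J, theta, sigma=0.5, steps=2000, c=0.82,
                    rng=np.random.default_rng(0)):
    """(1+1)-ES: accept improving perturbations, adapt sigma by the 1/5th success rule."""
    best = J(theta)
    for _ in range(steps):
        candidate = theta + sigma * rng.standard_normal(theta.size)
        score = J(candidate)
        if score > best:
            theta, best = candidate, score
            sigma /= c              # success: widen the search (c < 1)
        else:
            sigma *= c ** 0.25      # failure: shrink; equilibrium at roughly 1/5 success rate
    return theta, sigma

# e.g. one_plus_one_es(lambda th: -np.sum(th ** 2), np.ones(5)) converges towards the origin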