Chapter 19: Reinforcement Learning
Reinforcement learning formalises the problem of an agent learning through interaction. We derive every key equation from first principles: the Bellman equation for policy evaluation, the optimal Bellman operator, Q-learning's off-policy TD update, and the policy gradient theorem via the log-derivative trick.
1. Markov Decision Processes
A Markov Decision Process is a tuple \( (\mathcal{S}, \mathcal{A}, P, R, \gamma) \) where:
- \( \mathcal{S} \): state space
- \( \mathcal{A} \): action space
- \( P(s' \mid s, a) \): transition probability kernel
- \( R(s, a, s') \in \mathbb{R} \): reward function (often abbreviated \( R(s,a) \))
- \( \gamma \in [0,1) \): discount factor
At each time step \( t \), the agent observes state \( s_t \), selects action \( a_t \sim \pi(\cdot \mid s_t) \), receives reward \( r_t \), and transitions to \( s_{t+1} \sim P(\cdot \mid s_t, a_t) \). The goal is to maximise the expected discounted return:
\[ J(\pi) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]. \]
The Markov property states that \( P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) \), meaning the future depends only on the current state and action, not the full history.
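To make these objects concrete, here is a minimal sketch of sampling from an MDP. The two-state example below (its states, actions, transition probabilities, and rewards) is entirely made up for illustration:

```python
import random

# A hypothetical two-state MDP encoded as nested dicts:
# P[s][a] is a list of (next_state, probability) pairs, R[s][a] a scalar reward.
P = {
    "s0": {"stay": [("s0", 0.9), ("s1", 0.1)], "go": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 0.5, "go": 0.0}}
gamma = 0.9

def step(s, a, rng=random):
    """Sample s' ~ P(. | s, a) and return (reward, next_state)."""
    states, probs = zip(*P[s][a])
    s_next = rng.choices(states, weights=probs)[0]
    return R[s][a], s_next

# Roll out a fixed policy ("always go") and accumulate the discounted return G_0.
s, G, discount = "s0", 0.0, 1.0
for t in range(50):
    r, s = step(s, "go")
    G += discount * r
    discount *= gamma
```

The Markov property is baked into `step`: the sampled next state depends only on the current `(s, a)` pair, never on how the agent got there.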
Figure: the agent-environment interaction loop.
2. Value Functions
The state-value function under policy \( \pi \) is the expected discounted return starting from state \( s \):
\[ V^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right]. \]
The action-value function (Q-function) conditions on both state and action:
\[ Q^\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a \right]. \]
The relationship between the two is \( V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a) \).
3. Bellman Equation: Full Derivation
We derive the Bellman equation by expanding the definition of \( V^\pi(s) \) and exploiting the Markov property. Start from:
\[ V^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right]. \]
Peel off the first reward:
\[ V^\pi(s) = \mathbb{E}_\pi\!\left[ r_0 + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s \right]. \]
Condition on the first action \( a_0 \sim \pi(\cdot \mid s) \) and next state \( s_1 \sim P(\cdot \mid s, a_0) \):
\[ V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma\, \mathbb{E}_\pi\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s' \right] \right). \]
By the Markov property, the inner expectation is a fresh copy of the original problem started at \( s' \), i.e. exactly \( V^\pi(s') \):
\[ V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]. \]
This is the Bellman expectation equation. It expresses V at the current state as a weighted sum over actions and next states of immediate reward plus discounted future value.
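The Bellman expectation equation is also an algorithm: applying its right-hand side repeatedly to an arbitrary initial \( V \) converges to \( V^\pi \) (iterative policy evaluation). A sketch on a small randomly generated MDP (the sizes, transitions, and rewards below are illustrative, not from the text):

```python
import numpy as np

# Iterative policy evaluation on a small synthetic MDP.
# P[a, s, s'] are transition probabilities, R[s, a] expected rewards,
# pi[s, a] a fixed stochastic policy.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # rows sum to 1
R = rng.standard_normal((n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform policy

V = np.zeros(n_states)
for _ in range(500):  # repeatedly apply the Bellman expectation operator
    Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q(s,a) = R(s,a) + γ Σ_s' P V
    V_new = np.einsum("sa,sa->s", pi, Q)           # V(s)   = Σ_a π(a|s) Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new
```

At convergence, `V` satisfies the Bellman expectation equation to numerical precision.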
Optimal Bellman Equation
The optimal value function \( V^*(s) = \max_\pi V^\pi(s) \) satisfies the Bellman optimality equation:
\[ V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]. \]
The corresponding optimal Q-function satisfies:
\[ Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]. \]
The optimal policy is then \( \pi^*(s) = \operatorname{argmax}_a Q^*(s,a) \). The Bellman optimality operator \( \mathcal{T}^* \) is a contraction with factor \( \gamma \) under the \( \ell^\infty \) norm, guaranteeing a unique fixed point.
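The contraction property can be observed directly: in value iteration, the sup-norm gap between successive iterates shrinks by at least a factor of \( \gamma \) per sweep. A sketch on a small synthetic MDP (sizes and values are illustrative):

```python
import numpy as np

# Value iteration: repeatedly apply the Bellman optimality operator T*
# and record the sup-norm distance between successive iterates.
n_states, n_actions, gamma = 4, 2, 0.8
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s']
R = rng.standard_normal((n_states, n_actions))

def bellman_opt(V):
    """(T* V)(s) = max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ]."""
    return np.max(R + gamma * np.einsum("ast,t->sa", P, V), axis=1)

V = np.zeros(n_states)
gaps = []
for _ in range(60):
    V_next = bellman_opt(V)
    gaps.append(np.max(np.abs(V_next - V)))
    V = V_next
# contraction: gaps[k+1] <= γ * gaps[k] for every k
```

Since \( V_{k+1} = \mathcal{T}^* V_k \), each gap is \( \|\mathcal{T}^* V_{k+1} - \mathcal{T}^* V_k\|_\infty \le \gamma \|V_{k+1} - V_k\|_\infty \), which is exactly what the recorded `gaps` show.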
4. Q-Learning
Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that directly approximates \( Q^* \) without knowing the model \( P \). After observing transition \( (s, a, r, s') \), the update is:
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]. \]
The TD error \( \delta_t = r + \gamma \max_{a'} Q(s',a') - Q(s,a) \) measures the discrepancy between the current Q-estimate and the bootstrapped target. Q-learning is off-policy because the max over \( a' \) follows the greedy policy regardless of which policy collected the data.
Tabular Q-Learning Algorithm
- Initialise \( Q(s,a) = 0 \) for all \( s,a \)
- For each episode, reset \( s \leftarrow s_0 \)
- Choose \( a \) via \( \varepsilon \)-greedy: with probability \( \varepsilon \) act randomly, else take \( \operatorname{argmax}_a Q(s,a) \)
- Take \( a \), observe \( r, s' \)
- Update: \( Q(s,a) \mathrel{+}= \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)] \)
- Set \( s \leftarrow s' \); repeat until terminal
Convergence Guarantee
Q-learning converges to \( Q^* \) with probability 1 provided: (1) all state-action pairs are visited infinitely often, (2) the learning rate satisfies \( \sum_t \alpha_t = \infty \) and \( \sum_t \alpha_t^2 < \infty \), and (3) rewards are bounded. The proof uses stochastic approximation theory and the contraction property of \( \mathcal{T}^* \).
5. Policy Gradient Theorem & REINFORCE
Instead of learning a value function and deriving a policy, policy gradient methods directly parameterise \( \pi_\theta \) and optimise \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G_0] \) by gradient ascent. We derive \( \nabla_\theta J(\theta) \) using the log-derivative trick.
Write the objective as an expectation over trajectories \( \tau = (s_0,a_0,\ldots) \), with \( \rho \) the initial-state distribution:
\[ J(\theta) = \int p_\theta(\tau)\, G(\tau)\, d\tau, \qquad p_\theta(\tau) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t). \]
Take the gradient:
\[ \nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, G(\tau)\, d\tau. \]
Apply the log-derivative trick, \( \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) \):
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, G(\tau) \right]. \]
Expand the log-probability of a trajectory (the \( \rho \) and transition terms vanish under \( \nabla_\theta \) because they don't depend on \( \theta \)):
\[ \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t). \]
Therefore:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(\tau) \right]. \]
This is the REINFORCE estimator. The causal form uses the return-to-go \( G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k \) in place of the full return (past rewards cannot be influenced by current parameters):
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]. \]
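A minimal REINFORCE sketch, assuming a hypothetical two-armed bandit (a single-state MDP, so the return is just the immediate reward) with a softmax policy over logits \( \theta \); the arm means and step size below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])   # arm 1 pays more on average
theta = np.zeros(2)                 # policy logits
alpha = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    G = true_means[a] + 0.1 * rng.standard_normal()   # noisy reward = return
    # For a softmax policy, ∇_θ log π(a) = one_hot(a) - probs.
    grad_log = -probs
    grad_log[a] += 1.0
    theta += alpha * grad_log * G   # REINFORCE ascent step
```

After training, the policy concentrates on the higher-paying arm; note the update only ever uses sampled actions and returns, never the model.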
Variance Reduction: Baselines
Subtracting a baseline \( b(s_t) \) (e.g. \( V^\pi(s_t) \)) from \( G_t \) does not bias the gradient (the expected baseline term is zero by the likelihood ratio argument) but drastically reduces variance:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \right]. \]
6. Actor-Critic & PPO
Actor-Critic
Replace the Monte-Carlo return \( G_t \) with a bootstrapped TD estimate using a learned critic \( V_w(s) \):
\[ \hat{A}_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t). \]
The actor updates \( \theta \) via the policy gradient; the critic updates \( w \) by minimising the squared TD error \( \big( r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \big)^2 \).
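A one-step actor-critic sketch with tabular actor and critic, on a hypothetical two-state, two-action MDP (transitions, rewards, and step sizes below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))   # P[a, s, s']
R = np.array([[0.0, 1.0], [1.0, 0.0]])             # R[s, a]: one good action per state
theta = np.zeros((n_s, n_a))                       # actor: softmax logits per state
V = np.zeros(n_s)                                  # critic: tabular values
alpha_actor, alpha_critic = 0.05, 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = 0
for _ in range(5000):
    probs = softmax(theta[s])
    a = rng.choice(n_a, p=probs)
    s_next = rng.choice(n_s, p=P[a, s])
    r = R[s, a]
    td_error = r + gamma * V[s_next] - V[s]        # δ_t, used as advantage estimate
    V[s] += alpha_critic * td_error                # critic: semi-gradient TD(0)
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += alpha_actor * td_error * grad_log  # actor: δ_t ∇ log π(a|s)
    s = s_next
```

The same TD error drives both updates: as a target error for the critic and as a low-variance stand-in for \( G_t - b(s_t) \) in the actor's gradient.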
PPO Clipped Objective
PPO (Schulman et al., 2017) prevents destructively large policy updates using a clipped surrogate:
\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\big( r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon \big)\, \hat{A}_t \Big) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}. \]
The clip prevents the probability ratio \( r_t(\theta) \) from being rewarded for moving beyond \( [1-\varepsilon, 1+\varepsilon] \), bounding the effective trust region without solving a constrained optimisation problem.
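The clipped surrogate is a one-liner; a small worked example makes the asymmetry visible (the ratio and advantage values below are arbitrary):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Elementwise min(r*A, clip(r, 1-eps, 1+eps)*A), averaged over the batch."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.5, 1.0, 1.5])
adv = np.ones(3)
# per-sample terms: min(0.5, 0.8) = 0.5, min(1.0, 1.0) = 1.0, min(1.5, 1.2) = 1.2
obj = ppo_clip_objective(ratio, adv)   # (0.5 + 1.0 + 1.2) / 3 = 0.9
```

With a positive advantage, a ratio above \( 1+\varepsilon \) earns no extra credit (its gradient vanishes); with a negative advantage, the `min` keeps the pessimistic branch, so the objective never understates a harmful update.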
Python Simulation: Q-Learning GridWorld
We train a tabular Q-learning agent on a 5×5 gridworld with an obstacle, then visualise the learned value function as a heatmap and the greedy policy as directional arrows.
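A sketch of the experiment is below. The exact layout is an assumption: start at (0,0), goal at (4,4) with reward +1, one obstacle at (2,2), and a small per-step cost of −0.01 (added here to encourage short paths and systematic exploration); the plotting itself (e.g. an image heatmap plus arrow glyphs) is omitted, but the `V` and `arrows` arrays hold exactly the data to draw.

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma, alpha, eps = 5, 0.95, 0.1, 0.1
GOAL, OBSTACLE = (4, 4), (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
Q = np.zeros((N, N, 4))

def step(s, a):
    """Deterministic grid dynamics: bounce off walls and the obstacle."""
    dr, dc = ACTIONS[a]
    nxt = (s[0] + dr, s[1] + dc)
    if not (0 <= nxt[0] < N and 0 <= nxt[1] < N) or nxt == OBSTACLE:
        nxt = s
    r = 1.0 if nxt == GOAL else -0.01
    return r, nxt, nxt == GOAL

for _ in range(2000):                # episodes
    s = (0, 0)
    for _ in range(200):             # step cap per episode
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
        r, s2, done = step(s, a)
        target = r if done else r + gamma * np.max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])   # tabular Q-learning update
        s = s2
        if done:
            break

V = Q.max(axis=2)                                    # value heatmap data
arrows = np.array(["^", "v", "<", ">"])[Q.argmax(axis=2)]  # greedy policy arrows
```

After training, following the greedy arrows from the start reaches the goal, and `V` decays smoothly with distance from it.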