Chapter 18: Large Language Models

Large language models are Transformer-based autoregressive models trained on massive text corpora. They exhibit surprising capabilities that emerge at scale: abilities not present in small models and not explicitly programmed. This chapter covers the mathematical foundations, training pipeline, and evaluation of LLMs.

1. Autoregressive Language Modelling

A language model defines a probability distribution over sequences. Using the chain rule of probability:

\[ P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}) \]

Training minimises the cross-entropy loss (equivalently, maximises log-likelihood):

\[ \mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}) \]

Perplexity = \(\exp(\mathcal{L})\) measures how "surprised" the model is on average. A perplexity of 20 means the model is roughly as uncertain as choosing uniformly among 20 tokens.
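As a quick numeric check of these definitions, the loss and perplexity can be evaluated directly; the per-token probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical probabilities P(x_t | x_<t) that a model assigned to the
# correct next token at each of T = 4 positions.
token_probs = np.array([0.25, 0.10, 0.50, 0.05])

# Average cross-entropy (negative log-likelihood per token).
loss = -np.mean(np.log(token_probs))

# Perplexity = exp(loss): the effective branching factor.
perplexity = np.exp(loss)

# Sanity check: uniform uncertainty over 20 tokens gives perplexity 20.
uniform_ppl = np.exp(-np.mean(np.log(np.full(4, 1 / 20))))

print(loss, perplexity, uniform_ppl)
```

Note that perplexity is dominated by the low-probability tokens: the single 0.05 prediction hurts more than the 0.50 prediction helps.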

2. GPT Architecture (Decoder-Only Transformer)

GPT uses a decoder-only Transformer: the full Transformer decoder of Chapter 17, but with the cross-attention layer removed. A causal mask ensures each position only attends to previous tokens, maintaining the autoregressive property.
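The causal mask can be sketched in a few lines of NumPy (toy dimensions, random scores): adding \(-\infty\) above the diagonal before the softmax forces every position to place zero attention weight on future positions.

```python
import numpy as np

T = 5  # toy sequence length (assumption)
rng = np.random.default_rng(0)

# Additive causal mask: 0 where attention is allowed (j <= i),
# -inf where position i would attend to a future position j > i.
mask = np.triu(np.full((T, T), -np.inf), k=1)

scores = rng.standard_normal((T, T))  # raw attention scores (toy)
masked = scores + mask                # future positions become -inf

# Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.triu(weights, k=1).sum())  # 0.0: no attention to the future
```

Row \(i\) of `weights` is a valid distribution over positions \(0, \ldots, i\) only, which is exactly the autoregressive constraint.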

GPT (Decoder-only)

  • Causal (left-to-right) attention mask
  • Trained on next-token prediction
  • Natural for generation tasks
  • Examples: GPT-2, GPT-3, GPT-4, LLaMA, Gemini

BERT (Encoder-only)

  • Bidirectional: attends to all positions
  • Trained on masked language modelling (MLM)
  • Strong for understanding/classification
  • Examples: BERT, RoBERTa, DeBERTa

BERT's Masked Language Modelling: randomly mask 15% of tokens and predict them using bidirectional context. This is a denoising objective: \(P(x_{\rm masked} \mid x_{\rm unmasked})\). BERT cannot generate text naturally (it is trained to fill in masks, not to predict left-to-right); GPT cannot use future context for understanding.
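The MLM corruption step can be sketched as follows. This is a simplified version: real BERT also sometimes substitutes a random token or keeps the original instead of always writing `[MASK]`.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat and looked at the dog".split()

# Choose 15% of positions to mask (at least one).
n_mask = max(1, int(0.15 * len(tokens)))
mask_idx = rng.choice(len(tokens), size=n_mask, replace=False)

# Corrupted input seen by the model, and the targets it must recover.
corrupted = [("[MASK]" if i in mask_idx else tok)
             for i, tok in enumerate(tokens)]
targets = {int(i): tokens[i] for i in mask_idx}

print(corrupted)
print(targets)
```

The model receives `corrupted` and is trained to predict `targets` at the masked positions, attending to tokens on both sides.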

3. Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") showed that language model loss follows a power law in the number of parameters \(N\), training tokens \(D\), and compute \(C\):

\[ L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \]

Chinchilla finding: for a fixed compute budget, it is better to train a smaller model on more data than a larger model on fewer tokens. Optimal allocation: \(D \approx 20 \times N\) tokens.
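Combining \(D \approx 20N\) with the common training-compute approximation \(C \approx 6ND\) FLOPs (an assumption not stated above, but standard in the scaling-laws literature) gives a closed form for the compute-optimal allocation:

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal parameter count N and token count D for a FLOP
    budget C, under the approximations C ~ 6*N*D and D ~ 20*N."""
    # C = 6*N*(20*N) = 120*N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e21)
print(f"N ~ {n:.3g} params, D ~ {d:.3g} tokens")
```

For a budget of \(10^{21}\) FLOPs this yields roughly a 2.9B-parameter model trained on about 58B tokens, rather than a larger model trained on fewer tokens.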

3.1 Emergent Abilities

Some capabilities (e.g., multi-step arithmetic, chain-of-thought reasoning, in-context learning) appear suddenly above a threshold model size: they are absent in small models and present in large ones without explicit training. This is called emergence and is an active area of research.

4. Training Pipeline: Pre-training, Fine-tuning, RLHF

  1. Pre-training: Train on massive text corpus \(\mathcal{D}_{\rm pre}\) by minimising cross-entropy. Produces a model \(\pi_{\rm PT}\) that can complete text.
  2. Supervised Fine-tuning (SFT): Fine-tune on high-quality demonstrations \(\{(prompt_i, response_i)\}\) of desired behaviour.
  3. Reward Model Training: Collect human preference data: pairs of responses ranked by quality. Train a reward model \(r_\phi(x, y)\) using the Bradley-Terry preference model:
    \[ P(y_1 \succ y_2) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2)) \]
  4. RLHF with PPO: Optimise the language model as a policy to maximise the reward model score, subject to a KL penalty from the SFT policy:
    \[ \mathcal{L}_{\rm RLHF}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x,y) - \beta\,\mathrm{KL}(\pi_\theta(\cdot|x) \| \pi_{\rm SFT}(\cdot|x))\right] \]
    The KL term prevents the model from exploiting the reward model with nonsensical outputs that score highly.
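Step 3's Bradley-Terry objective corresponds to a logistic loss on reward differences. A minimal sketch, with made-up reward-model scores for three comparison pairs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_loss(r_preferred, r_rejected):
    """Negative log-likelihood of the Bradley-Terry model:
    mean of -log sigma(r(x, y_w) - r(x, y_l)) over comparison pairs."""
    return float(np.mean(-np.log(sigmoid(r_preferred - r_rejected))))

# Hypothetical scalar rewards r_phi(x, y) on three prompt/response pairs.
r_w = np.array([2.0, 0.5, 1.2])   # scores of the preferred responses
r_l = np.array([0.5, 0.4, -0.3])  # scores of the rejected responses

print(bradley_terry_loss(r_w, r_l))
```

Minimising this loss pushes the reward margin \(r_\phi(x, y_w) - r_\phi(x, y_l)\) up on every pair, which is exactly what makes \(r_\phi\) usable as the reward in step 4.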

5. Tokenisation: Byte-Pair Encoding (BPE)

LLMs operate on tokens, not characters. BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens:

  1. Start with character vocabulary.
  2. Count all adjacent pairs in the corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat until vocabulary reaches target size (e.g., 50,257 for GPT-2).

BPE balances vocabulary size and sequence length. Common words become single tokens; rare words decompose into sub-word tokens, maintaining coverage without an infinite vocabulary.
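The merge loop above can be sketched on a toy corpus. The word frequencies and the `bpe_merges` helper are illustrative, not GPT-2's actual tokeniser:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer. `words` maps a word (tuple of symbols) to its
    corpus frequency; repeatedly merge the most frequent adjacent pair."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges, vocab = bpe_merges(corpus, 3)
print(merges)
```

After a few merges the frequent stem "low" becomes a single token, while the rarer suffixes "er" and "est" remain decomposed, which is the vocabulary/sequence-length trade-off described above.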

6. In-Context Learning & Prompting

A surprising property of large LMs: they can learn new tasks from just a few examples in the prompt, without any gradient updates. Given \(k\) examples in context, the model predicts the next output:

\[ P(y_{\rm test} \mid x_{\rm test},\, (x_1,y_1),\ldots,(x_k,y_k)) \]

Chain-of-Thought (CoT)

Prompting with intermediate reasoning steps dramatically improves performance on multi-step problems. The model generates a scratchpad before the final answer.

Zero-shot Prompting

Large enough models can follow instructions without any examples, simply from the instruction text. Instruction-tuning amplifies this ability.

7. GPT Architecture Diagram

[Diagram: token embeddings plus positional encodings (PE) feed a stack of L Transformer decoder blocks. Each block applies masked multi-head self-attention under a causal (upper-triangular) mask, Add & LayerNorm, a position-wise FFN, and Add & LayerNorm again, connected by residual paths (pre-norm placement is optional). A final Linear + Softmax head outputs P(next token | context).]

8. Python: Tiny Character-Level Transformer

We implement a 2-layer character-level Transformer from scratch in NumPy, train it on a small text, and visualise: training loss, attention patterns, perplexity, scaling law curves, character frequency (BPE motivation), and temperature sampling effects.
