Chapter 18: Large Language Models

Large language models are Transformer-based autoregressive models trained on massive text corpora. They exhibit surprising capabilities that emerge at scale: abilities not present in small models and not explicitly programmed. This chapter covers the mathematical foundations, training pipeline, and evaluation of LLMs.

1. Autoregressive Language Modelling

A language model defines a probability distribution over sequences. Using the chain rule of probability:

\[ P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}) \]

Training minimises the cross-entropy loss (equivalently, maximises log-likelihood):

\[ \mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}) \]

Perplexity = \(\exp(\mathcal{L})\) measures how "surprised" the model is on average. A perplexity of 20 means the model is roughly as uncertain as choosing uniformly among 20 tokens.
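As a quick numeric check of these definitions, the loss and perplexity can be evaluated directly; the per-token probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical probabilities P(x_t | x_<t) that a model assigned to the
# correct next token at each of T = 4 positions.
token_probs = np.array([0.25, 0.10, 0.50, 0.05])

# Average cross-entropy (negative log-likelihood per token).
loss = -np.mean(np.log(token_probs))

# Perplexity = exp(loss): the effective branching factor.
perplexity = np.exp(loss)

# Sanity check: uniform uncertainty over 20 tokens gives perplexity 20.
uniform_ppl = np.exp(-np.mean(np.log(np.full(4, 1 / 20))))

print(loss, perplexity, uniform_ppl)
```

Note that perplexity is dominated by the low-probability tokens: the single 0.05 prediction hurts more than the 0.50 prediction helps.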

2. GPT Architecture (Decoder-Only Transformer)

GPT uses a decoder-only Transformer: the full Transformer decoder of Chapter 17, but with the cross-attention layer removed. A causal mask ensures each position only attends to previous tokens, maintaining the autoregressive property.
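The causal mask can be sketched in a few lines of NumPy (toy dimensions, random scores): adding \(-\infty\) above the diagonal before the softmax forces every position to place zero attention weight on future positions.

```python
import numpy as np

T = 5  # toy sequence length (assumption)
rng = np.random.default_rng(0)

# Additive causal mask: 0 where attention is allowed (j <= i),
# -inf where position i would attend to a future position j > i.
mask = np.triu(np.full((T, T), -np.inf), k=1)

scores = rng.standard_normal((T, T))  # raw attention scores (toy)
masked = scores + mask                # future positions become -inf

# Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.triu(weights, k=1).sum())  # 0.0: no attention to the future
```

Row \(i\) of `weights` is a valid distribution over positions \(0, \ldots, i\) only, which is exactly the autoregressive constraint.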

GPT (Decoder-only)

  • Causal (left-to-right) attention mask
  • Trained on next-token prediction
  • Natural for generation tasks
  • Examples: GPT-2, GPT-3, GPT-4, LLaMA, Gemini

BERT (Encoder-only)

  • Bidirectional: attends to all positions
  • Trained on masked language modelling (MLM)
  • Strong for understanding/classification
  • Examples: BERT, RoBERTa, DeBERTa

BERT's Masked Language Modelling: randomly mask 15% of tokens and predict them using bidirectional context. This is a denoising objective: \(P(x_{\rm masked} \mid x_{\rm unmasked})\). BERT cannot generate text naturally (it is trained to fill in masks, not to predict left-to-right); GPT cannot use future context for understanding.
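The MLM corruption step can be sketched as follows. This is a simplified version: real BERT also sometimes substitutes a random token or keeps the original instead of always writing `[MASK]`.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat and looked at the dog".split()

# Choose 15% of positions to mask (at least one).
n_mask = max(1, int(0.15 * len(tokens)))
mask_idx = rng.choice(len(tokens), size=n_mask, replace=False)

# Corrupted input seen by the model, and the targets it must recover.
corrupted = [("[MASK]" if i in mask_idx else tok)
             for i, tok in enumerate(tokens)]
targets = {int(i): tokens[i] for i in mask_idx}

print(corrupted)
print(targets)
```

The model receives `corrupted` and is trained to predict `targets` at the masked positions, attending to tokens on both sides.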

3. Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") showed that language model loss follows a power law in the number of parameters \(N\), training tokens \(D\), and compute \(C\):

\[ L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty \]

Chinchilla finding: for a fixed compute budget, it is better to train a smaller model on more data than a larger model on fewer tokens. Optimal allocation: \(D \approx 20 \times N\) tokens.
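Combining \(D \approx 20N\) with the common training-compute approximation \(C \approx 6ND\) FLOPs (an assumption not stated above, but standard in the scaling-laws literature) gives a closed form for the compute-optimal allocation:

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal parameter count N and token count D for a FLOP
    budget C, under the approximations C ~ 6*N*D and D ~ 20*N."""
    # C = 6*N*(20*N) = 120*N^2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e21)
print(f"N ~ {n:.3g} params, D ~ {d:.3g} tokens")
```

For a budget of \(10^{21}\) FLOPs this yields roughly a 2.9B-parameter model trained on about 58B tokens, rather than a larger model trained on fewer tokens.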

3.1 Emergent Abilities

Some capabilities (e.g., multi-step arithmetic, chain-of-thought reasoning, in-context learning) appear suddenly above a threshold model size: they are absent in small models and present in large ones without explicit training. This is called emergence and is an active area of research.

4. Training Pipeline: Pre-training, Fine-tuning, RLHF

  1. Pre-training: Train on massive text corpus \(\mathcal{D}_{\rm pre}\) by minimising cross-entropy. Produces a model \(\pi_{\rm PT}\) that can complete text.
  2. Supervised Fine-tuning (SFT): Fine-tune on high-quality demonstrations \(\{(prompt_i, response_i)\}\) of desired behaviour.
  3. Reward Model Training: Collect human preference data: pairs of responses ranked by quality. Train a reward model \(r_\phi(x, y)\) using the Bradley-Terry preference model:
    \[ P(y_1 \succ y_2) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2)) \]
  4. RLHF with PPO: Optimise the language model as a policy to maximise the reward model score, subject to a KL penalty from the SFT policy:
    \[ \mathcal{L}_{\rm RLHF}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x,y) - \beta\,\mathrm{KL}(\pi_\theta(\cdot|x) \| \pi_{\rm SFT}(\cdot|x))\right] \]
    The KL term prevents the model from exploiting the reward model with nonsensical outputs that score highly.
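Step 3's Bradley-Terry objective corresponds to a logistic loss on reward differences. A minimal sketch, with made-up reward-model scores for three comparison pairs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_loss(r_preferred, r_rejected):
    """Negative log-likelihood of the Bradley-Terry model:
    mean of -log sigma(r(x, y_w) - r(x, y_l)) over comparison pairs."""
    return float(np.mean(-np.log(sigmoid(r_preferred - r_rejected))))

# Hypothetical scalar rewards r_phi(x, y) on three prompt/response pairs.
r_w = np.array([2.0, 0.5, 1.2])   # scores of the preferred responses
r_l = np.array([0.5, 0.4, -0.3])  # scores of the rejected responses

print(bradley_terry_loss(r_w, r_l))
```

Minimising this loss pushes the reward margin \(r_\phi(x, y_w) - r_\phi(x, y_l)\) up on every pair, which is exactly what makes \(r_\phi\) usable as the reward in step 4.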

5. Tokenisation: Byte-Pair Encoding (BPE)

LLMs operate on tokens, not characters. BPE builds a vocabulary by iteratively merging the most frequent pair of adjacent tokens:

  1. Start with character vocabulary.
  2. Count all adjacent pairs in the corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat until vocabulary reaches target size (e.g., 50,257 for GPT-2).

BPE balances vocabulary size and sequence length. Common words become single tokens; rare words decompose into sub-word tokens, maintaining coverage without an infinite vocabulary.
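The merge loop above can be sketched on a toy corpus. The word frequencies and the `bpe_merges` helper are illustrative, not GPT-2's actual tokeniser:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer. `words` maps a word (tuple of symbols) to its
    corpus frequency; repeatedly merge the most frequent adjacent pair."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges, vocab = bpe_merges(corpus, 3)
print(merges)
```

After a few merges the frequent stem "low" becomes a single token, while the rarer suffixes "er" and "est" remain decomposed, which is the vocabulary/sequence-length trade-off described above.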

6. In-Context Learning & Prompting

A surprising property of large LMs: they can learn new tasks from just a few examples in the prompt, without any gradient updates. Given \(k\) examples in context, the model predicts the next output:

\[ P(y_{\rm test} \mid x_{\rm test},\, (x_1,y_1),\ldots,(x_k,y_k)) \]

Chain-of-Thought (CoT)

Prompting with intermediate reasoning steps dramatically improves performance on multi-step problems. The model generates a scratchpad before the final answer.

Zero-shot Prompting

Large enough models can follow instructions without any examples, simply from the instruction text. Instruction-tuning amplifies this ability.

7. GPT Architecture Diagram

[Diagram: token embeddings plus positional encodings (PE) feed a stack of L Transformer decoder blocks. Each block applies masked multi-head self-attention under a causal (upper-triangular) mask, Add & LayerNorm, a position-wise FFN, and Add & LayerNorm again, connected by residual paths (pre-norm placement is optional). A final Linear + Softmax head outputs P(next token | context).]

8. Python: Tiny Character-Level Transformer

We implement a 2-layer character-level Transformer from scratch in NumPy, train it on a small text, and visualise: training loss, attention patterns, perplexity, scaling law curves, character frequency (BPE motivation), and temperature sampling effects.
