Module 6: Phylogeography & Molecular Clocks

Phylogeography is the joint study of geography and genealogy within and among closely related species. Born from the marriage of population genetics and biogeography when Avise et al. introduced the term in 1987, the discipline has developed into a quantitative genomic science underpinned by coalescent theory and calibrated molecular clocks. This module derives the Kingman coalescent, examines Pleistocene glacial-refugia signatures, and demonstrates demographic- history inference from single-genome data.

1. The Birth of Phylogeography (Avise 1987)

John C. Avise and colleagues coined the term phylogeography in a landmark review (Avise, Arnold, Ball et al. 1987, Annual Review of Ecology and Systematics 18, 489–522) that synthesised a decade of mitochondrial-DNA work on vertebrate populations into an explicit discipline. The central insight was that intraspecific genealogies, when mapped back onto geography, reveal the historical demographic and biogeographic processes that shaped a species — glacial refugia, range expansions, vicariant events, secondary contact — at a resolution far below that of traditional species- level biogeography.

Avise’s “phylogeographic concordance” principle argues that when multiple unrelated taxa share the same phylogeographic break, the break reflects a shared biogeographic history rather than idiosyncratic species biology. Classic examples include the southeastern-versus-southwestern US Pleistocene discontinuity documented across > 30 vertebrate taxa (Soltis et al. 2006 Molecular Ecology), and the east–west Indo-Pacific break at the Sunda-Sahul shelf documented in reef fishes, corals and crustaceans (Avise 2009 Journal of Biogeography).

The initial molecular toolkit was narrow: animal mitochondrial DNA (mtDNA), especially the rapidly evolving control region (d-loop) for recent events and cytochrome oxidase I (COI) or cytochrome b for older events; chloroplast DNA (cpDNA) in plants; and allozymes. Over the past two decades this has broadened dramatically: nuclear single-copy loci like RAG1 or beta-fibrinogen intron; multi-locus SNP panels; reduced-representation RAD-seq (Miller et al. 2007, Andrews et al. 2016 Nat. Rev. Genet.); and whole-genome resequencing from both living and ancient specimens. The field’s statistical core, however, remained the same: coalescent theory.

2. The Kingman Coalescent (1982)

Sir John Kingman (1982, Journal of Applied Probability 19A, 27–43 and related papers) constructed the mathematical foundation of modern phylogeography. The coalescent is the backwards-in-time process that traces the ancestry of a sample of size $n$ drawn from a Wright–Fisher population of effective size $N_e$. Kingman proved that under weak selection and in the limit of large $N_e$, the lineages coalesce pairwise, and the waiting time $T_k$ for the $k$th coalescence (from $k$ to $k-1$ lineages) is exponentially distributed.

\[T_k \sim \mathrm{Exp}\!\left(\binom{k}{2} \frac{1}{2N_e}\right), \quad E[T_k] = \frac{4N_e}{k(k-1)}\]

Time is measured in generations; the factor of 2 is for diploidy. For haploid organelles (mtDNA, cpDNA) replace $2N_e$ with $N_e$.

Summing over all coalescences gives the expected time to most recent common ancestor:

\[E[T_\mathrm{MRCA}] = \sum_{k=2}^{n} E[T_k] = 4N_e \left(1 - \frac{1}{n}\right)\]

The TMRCA approaches $4N_e$ as $n \to \infty$. Expected total branch length $T_\mathrm{tot}=4N_e H_{n-1}$ where $H_{n-1}=\sum_{i=1}^{n-1} 1/i$.

Under the infinite-sites mutation model, the number of segregating sites $S$ in a sample is Poisson with mean $\theta\,H_{n-1}$ where $\theta = 4N_e\mu L$ is Watterson’s theta. This directly links the observable polymorphism (SNPs, haplotype diversity) to the underlying demographic parameter $N_e$, giving population geneticists a calibrated window onto population history.

Coalescent simulators like ms (Hudson 2002), msprime(Kelleher et al. 2016 PLoS Comp. Biol.), and SLiM (Haller & Messer 2019) implement the coalescent with arbitrarily complex demographic histories, structured populations, selection, and recombination. They are the workhorse of modern statistical phylogeography.

3. Molecular Clocks and Calibration

Zuckerkandl and Pauling (1965 Journal of Theoretical Biology) proposed the molecular-clock hypothesis: that if mutations accumulate roughly linearly with time, genetic divergence between species is a calibrated clock. The hypothesis was controversial for decades (neutral vs selectionist debates of the 1970s–80s) but is now accepted as approximately true for putatively neutral markers over sufficiently long time scales, with the important caveat that clock rates vary both among lineages and over time.

Animal mtDNA became the phylogeographic workhorse because of its surprisingly regular clock. Brown, George and Wilson (1979 PNAS) estimated ~2% sequence divergence per million years for mammalian mtDNA — the famous “2% rule.” Subsequent work has shown this applies best to protein-coding sites in mammals and birds; other taxa and loci have very different rates (insect COI ~1.15–3.5% per Myr, plant cpDNA ~10× slower than animal mtDNA).

\[d = 2 \mu t, \quad t = \frac{d}{2\mu}\]

Strict clock: pairwise divergence $d$ between two lineages separated for time $t$ with per-site substitution rate $\mu$. For mammalian mtDNA protein genes $2\mu \approx 0.02$ per Myr.

Clock calibration comes from three sources: (1) fossil-calibrated node ages on phylogenies (e.g. crown Primates at 62–68 Ma), (2) biogeographically-dated vicariant events (Panamanian closure at ~3 Ma for marine geminate species; Messinian salinity crisis at 5.96–5.33 Ma), and (3) direct per-generation mutation-rate estimates from parent-offspring trios (Kong et al. 2012 Nature 488, 471–475, giving human de novo mutation rate ~1.2 × 10$^{-8}$ per bp per generation).

Strict clocks frequently fail. Relaxed-clock methods allow rate variation across branches. Drummond & Rambaut (2007 BMC Evolutionary Biology) implemented this in BEAST using either an uncorrelated log-normal relaxed clock (branch rates drawn from a log-normal distribution) or an autocorrelated relaxed clock (rate on a branch correlated with the rate on its parent). BEAST2 (Bouckaert et al. 2014, 2019) and RevBayes (Höhna et al. 2016) now form the standard Bayesian toolkit. Fossilised birth-death models (Stadler 2010) integrate fossils as genealogically-tip-dated sequences, greatly reducing calibration bias.

4. Pleistocene Refugia: Hewitt 2000

Godfrey Hewitt’s synthesis (Hewitt 2000, Nature 405, 907–913; and 2004 Phil. Trans. Roy. Soc. B) established the Pleistocene refugial paradigm as a central organising principle for temperate phylogeography. The alternating glacial–interglacial cycles of the last 2.6 Myr, with a dominant 100 kyr cycle over the past ~800 kyr, repeatedly pushed mid-latitude biotas into southern refugia during glacial maxima and released them northward during interglacials.

For Europe, Hewitt identified three principal glacial refugia — Iberia, Italy, and the Balkans — each separated from the others by the Alpine–Pyrenean–Dinaric mountain barrier. After the Last Glacial Maximum (LGM, ~21 ka) each refugium seeded a re-colonisation of the European continent, producing three recurrent phylogeographic patterns:

Grasshopper pattern ($Chorthippus\,parallelus$): only the Balkan refugium recolonised northern Europe; Iberian and Italian lineages remained trapped south of the Pyrenees and Alps.
Bear pattern ($Ursus\,arctos$): Iberian and Balkan lineages met in central Europe, producing an east–west suture zone along the Rhine.
Hedgehog pattern ($Erinaceus$): all three refugia contributed, producing a tri-partite suture in central Europe.

North America shows parallel but distinct refugia: the Beringian refugium in far-northwestern Alaska/Yukon (source of circumpolar boreal taxa), the Great Basin / southwestern refugia, Florida and the Gulf Coast, and isolated nunatak and fjord refugia along the Pacific coast (Shafer et al. 2010 Molecular Ecology). The “suture zones” where diverging lineages met during the Holocene correspond to Remington’s (1968) Pleistocene hybrid zones — the Appalachians, the Rockies–Great Plains boundary, and the Pacific Northwest.

Haplotype-diversity gradients are the phylogeographic fingerprint of post-glacial expansion: southern refugia retain high diversity, while northern leading edges show “allele-surfing” founder effects and low diversity (Edmonds et al. 2004 PNAS). The South-to-North diversity gradient is now a textbook diagnostic pattern for recent range expansion.

5. Hybrid Zones & Suture Zones (Barton & Hewitt 1985)

Barton and Hewitt (1985, Annual Review of Ecology and Systematics 16, 113–148) laid the theoretical foundation for hybrid-zone analysis with the tension-zone model: narrow clines maintained by a balance between dispersal pulling alleles across the zone and selection against hybrids purifying the zone. At steady state the cline width

\[w = \frac{\sigma}{\sqrt{s}}\]

where $\sigma$ is the per-generation dispersal standard deviation and $s$ is the selection coefficient against hybrids. Observed widths of 1–10 km for $s \sim 0.01$and $\sigma \sim 100$ m/gen are typical.

Hybrid zones are natural laboratories for speciation. Classic systems include the European house-mouse (Mus musculus) hybrid zone at the domesticus/musculus boundary in central Europe; the European $Bombina$ toad hybrid zone; the $Anopheles\,gambiae$/$A.\,coluzzii$ contact in west Africa; and the recently documented songbird hybrid zones across North America (Toews et al. 2016 Current Biology).

Suture zones are geographic regions where many independent hybrid zones coincide, indicating a shared historical contact event. Remington (1968) catalogued five major North American suture zones; subsequent genomic work has validated most as Pleistocene re-contact zones. The Tibetan plateau and Hengduan Mountains are a major suture zone for Asian birds and mammals (Qu et al. 2014 Molecular Ecology).

6. Comparative Phylogeography

The phylogeographic concordance principle (Avise 1992, 2000) predicts that different taxa, having experienced the same historical biogeographic event, should show concordant phylogenetic breaks at congruent ages. When observed, concordance is strong evidence for a shared biogeographic history; absence argues for idiosyncratic species-specific processes.

Classic successes include:

Indo-Pacific Sunda–Sahul break: convergent east–west mtDNA splits in reef fishes, corals, gastropods and echinoderms with coalescent-dated splits clustering around the LGM Sundaland lowstand (Crandall et al. 2012 Nature Comm.).
Central American Isthmus: geminate species pairs across the 3.1–3.5 Ma Panamanian closure in > 40 marine taxa (Knowlton & Weigt 1998 Proc. Roy. Soc. B), used as the canonical vicariant calibration for marine molecular clocks.
Australian Monsoonal Tropics: convergent refugial breaks across frogs, snakes, geckos and birds of the Kimberley–Arnhem land barrier (Moritz et al. 2009 TREE).
Neotropical Andean vicariance: shared east–west Andean breaks across rodents, frogs, birds and insects, dated variously to the late Miocene final uplift (Hoorn et al. 2010 Science).

Modern statistical comparative phylogeography tests concordance formally. Hickerson et al. (2006) developed msBayes for hierarchical ABC estimation of shared divergence times. The newer PhyloGeoCoal(Gehara 2017) and ABLE (Beeravolu 2018) frameworks provide genome-scale inference of co-divergence parameters.

7. F-Statistics & Population Structure

Sewall Wright’s F-statistics (1951) quantify the partitioning of genetic variation across hierarchical population structure. At the most useful level:

\[F_{ST} = \frac{H_T - H_S}{H_T} = \frac{\mathrm{Var}(p)}{\bar p (1 - \bar p)}\]

$H_T$ is expected heterozygosity in the total population; $H_S$ the average within sub-populations. $F_{ST} \approx 1/(1 + 4N_em)$ under the infinite-island model with $N_e m$ migrants per generation.

$F_{ST}$ of ~0.05 corresponds to ~5 migrants per generation, sufficient to homogenise allele frequencies; values > 0.15 indicate substantial structure. Weir & Cockerham (1984 Evolution) developed the standard estimator. Isolation-by-distance (Wright 1943; Slatkin 1993 Evolution) adds a spatial dimension: $F_{ST}/(1-F_{ST}) \propto \log(\mathrm{geographic\,distance})$ in 2-D continua.

Model-based assignment methods — STRUCTURE (Pritchard, Stephens & Donnelly 2000), ADMIXTURE (Alexander et al. 2009), and sparse non-negative matrix factorisation in sNMF — infer individual admixture proportions directly from SNP genotypes without a priori population definitions. They have revolutionised fine-scale phylogeography by revealing cryptic population structure and historical admixture (Patterson, Price & Reich 2006).

8. PSMC & Genome-Scale Demographic Inference

Heng Li and Richard Durbin (2011, Nature 475, 493–496) introduced the Pairwise Sequentially Markovian Coalescent (PSMC), a transformative method that infers the history of effective population size from a single diploid genome. The two chromosome copies in an individual provide an implicit sample of size two; the distribution of TMRCAs along the genome, estimated from the density of heterozygous sites, encodes $N_e(t)$ over many orders of magnitude of time.

\[\mathrm{rate\;of\;coalescence\;at\;time\;}t = \frac{1}{2N_e(t)}\]

The local TMRCA at each genomic window is approximately $T \propto 2 N_e \cdot (\mathrm{heterozygosity}) / (2\mu)$.

PSMC and its successors — MSMC/MSMC2 (Schiffels & Durbin 2014 Nature Genetics) for multiple samples, SMC++ (Terhorst, Kamm & Song 2017 Nature Genetics) for thousands of genomes, and the deep-learning-based dnadna (Sanchez 2021) — form the modern toolkit. PSMC traces population history from roughly 20 ka to 5–10 Ma and has been applied to every vertebrate with a sequenced genome, from giant pandas (Zhao et al. 2013) to polar bears (Miller et al. 2012) to the woolly mammoth (Palkopoulou et al. 2015).

The most famous PSMC results include: (1) the human out-of-Africa bottleneck at ~60–70 ka with $N_e$ dropping to ~1500 in non-African populations; (2) the broader Eurasian/Asian expansion after ~40 ka; (3) the independent population histories of great apes showing chimpanzees with consistently larger $N_e$ than humans, and gorillas with a major bottleneck at ~1–2 Ma; and (4) the deep archaic-human structure revealed by comparison with Neanderthal and Denisovan genomes.

9. Ancient DNA & Temporal Phylogeography

Direct sequencing of DNA from museum specimens, archaeological bones, and permafrost remains transformed phylogeography by providing actual temporal samples instead of inferred genealogies. The breakthrough paper by Pääbo (1985 Nature) demonstrated PCR amplification from Egyptian mummies; the 1990s saw recovery of mtDNA from Pleistocene permafrost-preserved mammoths and ground sloths. The ancient-DNA revolution accelerated dramatically with Illumina short-read sequencing after 2006 and the development of specialised damage-aware library-preparation protocols (Rohland et al. 2015 Phil. Trans.).

Temporal calibration of substitution rates has been the single most important methodological advance. Tip-dated phylogenetic inference using serially sampled ancient sequences (Rambaut 2000 Bioinformatics) reveals that within-species rates are typically several times faster than the long-term between-species rate — the “time-dependent rate phenomenon” (Ho, Phillips, Cooper & Drummond 2005 Molecular Biology and Evolution). This has led to major revisions of both inferred demographic histories and biogeographic ages.

Landmark ancient-DNA phylogeography projects include the woolly mammoth demographic history (Palkopoulou et al. 2015 Current Biology), the cave-bear genome (Bar-Gal 2013), horse domestication (Fages 2019 Cell), dog origins (Frantz et al. 2016 Science), and the peopling of the Americas (Raghavan et al. 2015 Science, Moreno-Mayar 2018 Science). In every case the ancient DNA overturned or refined a phylogeographic hypothesis previously built on modern samples alone.

10. Phylogeography in the Whole-Genome Era

The “next-gen” revolution (Illumina 2008; PacBio, Nanopore long-read 2015+) reduced per-genome sequencing costs from $1 billion (Human Genome Project) to under $1000 in 2015 and under $200 today. Whole-genome phylogeography is now routine for vertebrates, and increasingly for invertebrates, plants, and microbes. Projects such as Earth BioGenome (Lewin 2018 PNAS), Bat1K, B10K birds, and the Vertebrate Genomes Project aim at reference genomes for all vertebrate species.

Whole-genome data enable inference of: (1) recombination-rate variation across the genome, which shapes local coalescence times; (2) introgression and hybridisation detectable via D-statistics (Green et al. 2010 Science) and f-statistics (Patterson 2012); (3) selective sweeps differentiating populations at specific loci (Voight et al. 2006 PLoS Biology); (4) linkage-disequilibrium decay as an estimator of local $N_e$ (McVean 2007); and (5) local ancestry tracts that reveal migration history at sub-chromosomal resolution (Hellenthal et al. 2014 Science).

11. Landscape Genetics

Landscape genetics (Manel et al. 2003 TREE) combines phylogeography with spatially explicit landscape ecology, treating gene flow as a flux through a cost surface rather than through a featureless plane. The key statistical tools are isolation-by-resistance (McRae 2006 Evolution), which models gene flow as electrical conductance through a heterogeneous resistance grid, and circuit-theoretic connectivity metrics operationalised in the Circuitscape software.

Landscape-genetic studies have demonstrated that roads, rivers, urban areas, mountain ridges, and even linear agricultural features impose population- genetic structure at surprisingly small scales. Cushman et al. (2006 American Naturalist) showed that black-bear gene flow in the US Northwest is better predicted by terrain ruggedness than by Euclidean distance. Coulon et al. (2004 Molecular Ecology) mapped roe-deer gene flow through agricultural hedgerow networks. These studies feed directly into conservation corridor design (see Module 8).

12. Statistical Phylogeography and the Bayesian Framework

Approximate Bayesian Computation (ABC; Beaumont, Zhang & Balding 2002 Genetics) provides the statistical backbone for modern comparative phylogeography. ABC dispenses with analytical likelihoods by simulating under a demographic model, computing summary statistics from the simulations, and retaining those close to the observed statistics. Posterior distributions of demographic parameters emerge directly from the retained simulations.

BEAST (Bayesian Evolutionary Analysis by Sampling Trees; Drummond & Rambaut 2007, 2012) implements full likelihood inference with relaxed clocks, substitution models, demographic priors, and discrete or continuous phylogeographic diffusion models (Lemey, Rambaut, Welch & Suchard 2010 Mol. Biol. Evol.). The Bayesian skyline/skyride plots (Drummond, Rambaut, Shapiro & Pybus 2005) reconstruct non-parametric $N_e(t)$ from serially-sampled or calibrated trees, and have become a standard figure in phylogeographic publications.

Machine-learning approaches are rapidly taking over. Deep-learning inference tools (Flagel, Brandvain & Schrider 2019 MBE; ReLERNN for recombination) take raw genome summaries as input and infer demographic parameters via convolutional neural nets. Simulation-based likelihood-free inference will likely replace most ABC workflows over the next decade.

13. Pleistocene Refugia Schematic

Southern European refugia & post-glacial expansion routes

Simulation 1: Kingman coalescent with n = 50 samples

Direct Monte Carlo simulation of the Kingman coalescent with $n=50$ haploid samples drawn from a Wright–Fisher population of $N_e = 10{,}000$. We plot one realised ultrametric genealogy and compare empirical distributions of TMRCA, total branch length, and segregating-site count to Kingman’s closed-form expectations ($E[T_\mathrm{MRCA}] = 4N_e(1-1/n)$, $E[T_\mathrm{tot}] = 4N_e H_{n-1}$, $E[S] = \mu L\,E[T_\mathrm{tot}]$).

Python

script.py171 lines

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Kingman coalescent simulation (1982)
# ------------------------------------
# Sir John Kingman's coalescent (1982, J. Appl. Probab. 19A, 27-43) gives
# the backward-in-time genealogy of a neutral sample of size n drawn from
# a Wright-Fisher population of effective size Ne. At each step, the n
# ancestral lineages coalesce pairwise at rate C(n,2) per 2Ne generations;
# the waiting time T_k (k lineages -> k-1 lineages) is exponential with
# rate k(k-1)/2 in units of 2Ne generations. Thus
#
#   E[T_k]   = 2Ne / C(k,2) = 4Ne / (k(k-1))
#   E[T_MRCA] = sum_{k=2..n} E[T_k] = 4Ne(1 - 1/n)
#   E[T_tot] = sum_{k=2..n} k * E[T_k] = 4Ne * sum_{k=1..n-1} 1/k
#
# We simulate many replicates with n=50 haploid samples, Ne=1e4, and
# overlay empirical distributions against analytic predictions. We also
# plot one realised ultrametric coalescent tree and compute segregating
# sites under the infinite-sites mutation model.

rng = np.random.default_rng(7)
n = 50
Ne = 1e4
mu = 1e-8
L = 1e4
n_rep = 500

def simulate_coalescent(n_samples, Ne, rng):
    active = list(range(n_samples))
    next_label = n_samples
    times = np.zeros(2 * n_samples - 1)
    parent = -np.ones(2 * n_samples - 1, dtype=int)
    t = 0.0
    k = n_samples
    events = []
    while k > 1:
        rate = k * (k - 1) / 2.0
        dt = rng.exponential(1.0 / rate) * 2.0 * Ne
        t += dt
        i, j = rng.choice(len(active), size=2, replace=False)
        a, b = active[i], active[j]
        parent[a] = next_label
        parent[b] = next_label
        times[next_label] = t
        events.append((dt, k))
        remaining = [x for idx, x in enumerate(active) if idx not in (i, j)]
        remaining.append(next_label)
        active = remaining
        next_label += 1
        k -= 1
    return times, parent, events

tmrca_list = []
ttot_list = []
seg_list  = []
for r in range(n_rep):
    times, parent, events = simulate_coalescent(n, Ne, rng)
    tmrca = times.max()
    ttot = sum(dt * k for dt, k in events)
    tmrca_list.append(tmrca)
    ttot_list.append(ttot)
    S = rng.poisson(mu * L * ttot)
    seg_list.append(S)

tmrca_arr = np.array(tmrca_list)
ttot_arr = np.array(ttot_list)
seg_arr = np.array(seg_list)

E_TMRCA_th = 4.0 * Ne * (1.0 - 1.0 / n)
E_Ttot_th = 4.0 * Ne * sum(1.0 / i for i in range(1, n))
E_S_th = mu * L * E_Ttot_th

# One realised tree for display
times, parent, events = simulate_coalescent(n, Ne, rng)
leaf_ids = list(range(n))
x_pos = np.zeros(2 * n - 1)
order = sorted(range(n, 2 * n - 1), key=lambda x: times[x])
children = {i: [] for i in range(2 * n - 1)}
for c, p in enumerate(parent):
    if p >= 0:
        children[p].append(c)
for idx, lf in enumerate(leaf_ids):
    x_pos[lf] = idx
for nd in order:
    if len(children[nd]) == 2:
        a, b = children[nd]
        x_pos[nd] = 0.5 * (x_pos[a] + x_pos[b])

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(14, 11))
fig.patch.set_facecolor('#0a0a1a')
for ax in (ax1, ax2, ax3, ax4):
    ax.set_facecolor('#0a0a1a')
    ax.tick_params(colors='#94a3b8')
    for sp in ax.spines.values():
        sp.set_color('#334155')
    ax.grid(True, color='#334155', alpha=0.3)

# Panel 1: realised coalescent tree
for nd in range(2 * n - 1):
    if parent[nd] >= 0:
        p = parent[nd]
        ax1.plot([x_pos[nd], x_pos[nd]], [times[nd], times[p]],
                 color='#4ade80', lw=0.8)
for nd in range(n, 2 * n - 1):
    if len(children[nd]) == 2:
        a, b = children[nd]
        ax1.plot([x_pos[a], x_pos[b]], [times[nd], times[nd]],
                 color='#4ade80', lw=0.8)
ax1.set_xlabel('Sample index', color='#94a3b8')
ax1.set_ylabel('Time before present (generations)', color='#94a3b8')
ax1.set_title(f'Kingman coalescent, n={n}, Ne={int(Ne)}',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax1.invert_yaxis()

# Panel 2: TMRCA distribution vs analytic
ax2.hist(tmrca_arr, bins=35, color='#22d3ee', edgecolor='#e2e8f0', alpha=0.8)
ax2.axvline(E_TMRCA_th, color='#f43f5e', lw=2,
            label=f'E[T_MRCA] = 4Ne(1-1/n) = {E_TMRCA_th:.0f}')
ax2.axvline(tmrca_arr.mean(), color='#facc15', ls='--', lw=2,
            label=f'simulated mean = {tmrca_arr.mean():.0f}')
ax2.set_xlabel('T_MRCA (generations)', color='#94a3b8')
ax2.set_ylabel('Replicates', color='#94a3b8')
ax2.set_title(f'TMRCA over {n_rep} replicates',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax2.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

# Panel 3: total branch length vs analytic
ax3.hist(ttot_arr, bins=35, color='#4ade80', edgecolor='#e2e8f0', alpha=0.8)
ax3.axvline(E_Ttot_th, color='#f43f5e', lw=2,
            label=f'E[T_tot] = 4Ne * H(n-1) = {E_Ttot_th:.0f}')
ax3.axvline(ttot_arr.mean(), color='#facc15', ls='--', lw=2,
            label=f'simulated mean = {ttot_arr.mean():.0f}')
ax3.set_xlabel('T_total (generations)', color='#94a3b8')
ax3.set_ylabel('Replicates', color='#94a3b8')
ax3.set_title('Total branch length distribution',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax3.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

# Panel 4: segregating sites vs Poisson expectation
ax4.hist(seg_arr, bins=30, color='#facc15', edgecolor='#e2e8f0', alpha=0.8)
ax4.axvline(E_S_th, color='#f43f5e', lw=2,
            label=f'E[S] = mu*L*T_tot = {E_S_th:.2f}')
ax4.axvline(seg_arr.mean(), color='#22d3ee', ls='--', lw=2,
            label=f'mean S = {seg_arr.mean():.2f}')
ax4.set_xlabel('Segregating sites S', color='#94a3b8')
ax4.set_ylabel('Replicates', color='#94a3b8')
ax4.set_title(f'Segregating sites (mu={mu:g}, L={int(L)})',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax4.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

plt.tight_layout()
plt.savefig('output.png', dpi=120, bbox_inches='tight', facecolor='#0a0a1a')

print('=== Kingman coalescent summary ===')
print(f'Samples n = {n}, effective size Ne = {int(Ne)}, replicates = {n_rep}')
print(f'E[T_MRCA]  theory = {E_TMRCA_th:.1f}  simulated = {tmrca_arr.mean():.1f}')
print(f'E[T_tot]   theory = {E_Ttot_th:.1f}  simulated = {ttot_arr.mean():.1f}')
print(f'E[S]       theory = {E_S_th:.2f}  simulated = {seg_arr.mean():.2f}')
print(f'Fraction of T_MRCA within 10% of theory: '
      f'{np.mean(np.abs(tmrca_arr - E_TMRCA_th)/E_TMRCA_th < 0.1):.2%}')
bins = np.linspace(0, tmrca_arr.max(), 60)
h, edges = np.histogram(tmrca_arr, bins=bins, density=True)
centres = 0.5 * (edges[1:] + edges[:-1])
print(f'Trapezoid of T_MRCA density: {np.trapezoid(h, centres):.4f} '
      f'(should be ~1.0)')
print(f'Mean segregating sites S = {seg_arr.mean():.2f} (theta = 4Ne mu L = '
      f'{4*Ne*mu*L:.2f})')

Click Run to execute the Python code

Code will be executed with Python 3 on the server

Simulation 2: PSMC-style Ne(t) inference from a single diploid genome

Forward simulation of a synthetic diploid genome under a piecewise-constant $N_e(t)$ history that includes a Pleistocene bottleneck. We then apply a simplified PSMC-style hazard inversion to recover $N_e(t)$ from the observed per-window coalescence-time distribution, plot the reconstructed curve against the ground truth, and reproduce the Li & Durbin (2011) great-ape comparison figure.

Python

script.py148 lines

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# PSMC-style demographic inference from a single diploid genome
# -------------------------------------------------------------
# Li & Durbin (2011, Nature 475, 493-496) introduced the Pairwise
# Sequentially Markovian Coalescent (PSMC), which infers the history of
# effective population size Ne(t) from the distribution of heterozygous-
# site TMRCAs along a single diploid genome. The core idea: two chromosome
# copies coalesce at a rate inversely proportional to Ne(t); the
# inter-heterozygote distances along the genome encode the local TMRCA.

rng = np.random.default_rng(19)
mu = 1.25e-8
gen = 25.0
L = 2e9

# "True" demographic history
time_edges_yr = np.array([0, 5e3, 2e4, 5e4, 1e5, 2e5, 5e5, 1e6, 5e6])
time_edges_gen = time_edges_yr / gen
Ne_true = np.array([7000, 2500, 1800, 3000, 8000, 12000, 18000, 14000])

N_wins = 100000
t_fine = np.linspace(0, 5e6 / gen, 50000)
Ne_fine = np.zeros_like(t_fine)
for k in range(len(Ne_true)):
    mask = (t_fine >= time_edges_gen[k]) & (t_fine < time_edges_gen[k+1])
    Ne_fine[mask] = Ne_true[k]
Ne_fine[Ne_fine == 0] = Ne_true[-1]
hazard = 1.0 / (2.0 * Ne_fine)
dt = np.diff(t_fine, prepend=0)
cumh = np.cumsum(hazard * dt)
S_t = np.exp(-cumh)
cdf = 1 - S_t

u = rng.uniform(0, 1, N_wins)
tmrca_gen = np.interp(u, cdf, t_fine)
tmrca_yr = tmrca_gen * gen

win_size = L / N_wins
het_expected = 2 * mu * tmrca_gen
het_counts = rng.poisson(het_expected * win_size)
het_rate = het_counts / win_size

# PSMC-style inversion
n_bins = 30
age_bins_yr = np.logspace(3, 6.8, n_bins + 1)
age_bins_gen = age_bins_yr / gen
counts, _ = np.histogram(tmrca_gen, bins=age_bins_gen)
centres_gen = 0.5 * (age_bins_gen[1:] + age_bins_gen[:-1])
centres_yr = centres_gen * gen
widths_gen = np.diff(age_bins_gen)

at_risk = np.array([(tmrca_gen > age_bins_gen[k]).sum()
                    for k in range(n_bins)])
hazard_emp = np.where(at_risk > 10,
                      counts / (at_risk * widths_gen + 1e-9), np.nan)
Ne_inferred = 1.0 / (2.0 * hazard_emp)

Ne_truth_binned = np.zeros(n_bins)
for k in range(n_bins):
    Ne_truth_binned[k] = Ne_fine[np.argmin(np.abs(t_fine - centres_gen[k]))]

ax1.hist(tmrca_yr, bins=np.logspace(3, 7, 70),
         color='#22d3ee', edgecolor='#e2e8f0', alpha=0.8)
ax1.set_xscale('log')
ax1.set_xlabel('TMRCA (years before present)', color='#94a3b8')
ax1.set_ylabel('Genomic windows', color='#94a3b8')
ax1.set_title('Per-window coalescence times',
              color='#e2e8f0', fontsize=13, fontweight='bold')

sort_idx = np.argsort(tmrca_yr)
ax2.scatter(tmrca_yr[::50], het_rate[::50], s=5, color='#4ade80', alpha=0.4)
ax2.plot(tmrca_yr[sort_idx],
         2 * mu * (tmrca_yr[sort_idx] / gen),
         color='#f43f5e', lw=1.2, label='theory: 2 mu T')
ax2.set_xlabel('TMRCA (years)', color='#94a3b8')
ax2.set_ylabel('Heterozygosity (per bp)', color='#94a3b8')
ax2.set_title('Heterozygosity carries local TMRCA signal',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax2.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

ax3.step(centres_yr, Ne_truth_binned, where='mid', color='#22d3ee',
         lw=2.5, label='True Ne(t)')
ax3.step(centres_yr, Ne_inferred, where='mid', color='#facc15',
         lw=2, label='PSMC-style estimate')
ax3.set_xscale('log')
ax3.set_xlabel('Time (years before present)', color='#94a3b8')
ax3.set_ylabel('Effective population size Ne', color='#94a3b8')
ax3.set_title('PSMC Ne(t) reconstruction',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax3.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

# Stylised great-ape comparison
t_hum = np.logspace(3.3, 6.5, 200)
def syn_hist(t, N_anc, N_mid, N_late, t_exp, t_btl):
    N = np.where(t > t_exp, N_anc, N_late)
    N = np.where((t > t_btl) & (t <= t_exp), N_mid, N)
    return N
Ne_human = syn_hist(t_hum, 15000, 8000, 1500, 2e5, 2e4)
Ne_chimp = syn_hist(t_hum, 18000, 12000, 10000, 3e5, 3e4)
Ne_gorilla = syn_hist(t_hum, 25000, 16000, 12000, 4e5, 5e4)
ax4.step(t_hum, Ne_human, where='mid', color='#4ade80', lw=2,
         label='Human (H. sapiens)')
ax4.step(t_hum, Ne_chimp, where='mid', color='#22d3ee', lw=2,
         label='Chimpanzee')
ax4.step(t_hum, Ne_gorilla, where='mid', color='#f43f5e', lw=2,
         label='Gorilla')
ax4.set_xscale('log')
ax4.set_xlabel('Time (years before present)', color='#94a3b8')
ax4.set_ylabel('Ne', color='#94a3b8')
ax4.set_title('Stylised great-ape Ne(t) after Li & Durbin 2011',
              color='#e2e8f0', fontsize=13, fontweight='bold')
ax4.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

plt.tight_layout()
plt.savefig('output.png', dpi=120, bbox_inches='tight', facecolor='#0a0a1a')

valid = ~np.isnan(Ne_inferred) & (Ne_truth_binned > 0)
rmse = np.sqrt(np.mean((Ne_inferred[valid] - Ne_truth_binned[valid])**2))
rel = np.mean(np.abs(Ne_inferred[valid] - Ne_truth_binned[valid])
              / Ne_truth_binned[valid])
print('=== PSMC-style Ne(t) inference summary ===')
print(f'Genome windows: {N_wins}, window size {int(win_size)} bp')
print(f'Total heterozygous sites simulated: {het_counts.sum()}')
print(f'Mean heterozygosity: {het_rate.mean():.3e} / bp')
print(f'Mean TMRCA: {tmrca_yr.mean()/1e3:.1f} kyr')
print(f'RMSE(Ne_hat - Ne_true) over valid bins: {rmse:.0f}')
print(f'Mean relative error: {rel:.1%}')
t_log = np.log10(centres_yr[valid])
print(f'Trapezoid of Ne_hat over log10 time: '
      f'{np.trapezoid(Ne_inferred[valid], t_log):.1f}')
k_min = np.nanargmin(Ne_inferred)
print(f'Inferred bottleneck: Ne_min = {Ne_inferred[k_min]:.0f} at '
      f't = {centres_yr[k_min]/1e3:.1f} kyr')
print(f'True bottleneck: Ne_min = {Ne_true.min()} at ~20 kyr')

Click Run to execute the Python code

Code will be executed with Python 3 on the server

Key References

• Avise, J. C. et al. (1987). “Intraspecific phylogeography: the mitochondrial DNA bridge between population genetics and systematics.” Annu. Rev. Ecol. Syst. 18, 489–522.

• Kingman, J. F. C. (1982). “The coalescent.” Stochastic Processes and their Applications 13, 235–248.

• Brown, W. M., George, M. & Wilson, A. C. (1979). “Rapid evolution of animal mitochondrial DNA.” PNAS 76, 1967–1971.

• Hewitt, G. M. (2000). “The genetic legacy of the Quaternary ice ages.” Nature 405, 907–913.

• Hewitt, G. M. (2004). “Genetic consequences of climatic oscillations in the Quaternary.” Phil. Trans. Roy. Soc. B 359, 183–195.

• Barton, N. H. & Hewitt, G. M. (1985). “Analysis of hybrid zones.” Annu. Rev. Ecol. Syst. 16, 113–148.

• Drummond, A. J. & Rambaut, A. (2007). “BEAST: Bayesian evolutionary analysis by sampling trees.” BMC Evol. Biol. 7, 214.

• Bouckaert, R. et al. (2019). “BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis.” PLoS Comp. Biol. 15, e1006650.

• Li, H. & Durbin, R. (2011). “Inference of human population history from individual whole-genome sequences.” Nature 475, 493–496.

• Schiffels, S. & Durbin, R. (2014). “Inferring human population size and separation history from multiple genome sequences.” Nature Genetics 46, 919–925.

• Kelleher, J., Etheridge, A. M. & McVean, G. (2016). “Efficient coalescent simulation and genealogical analysis for large sample sizes.” PLoS Comp. Biol. 12, e1004842.

• Avise, J. C. (2009). “Phylogeography: retrospect and prospect.” J. Biogeogr. 36, 3–15.

• Manel, S. et al. (2003). “Landscape genetics: combining landscape ecology and population genetics.” TREE 18, 189–197.

• Pritchard, J. K., Stephens, M. & Donnelly, P. (2000). “Inference of population structure using multilocus genotype data.” Genetics 155, 945–959.

• Hudson, R. R. (2002). “Generating samples under a Wright-Fisher neutral model of genetic variation.” Bioinformatics 18, 337–338.

• Beaumont, M. A., Zhang, W. & Balding, D. J. (2002). “Approximate Bayesian computation in population genetics.” Genetics 162, 2025–2035.

• Palkopoulou, E. et al. (2015). “Complete genomes reveal signatures of demographic and genetic declines in the woolly mammoth.” Current Biology 25, 1395–1400.

• Crandall, E. D. et al. (2012). “Expansion dating: calibrating molecular clocks in marine species.” Nature Commun. 3, 790.

Share:X Reddit LinkedIn

← Module 5: Biomes & Climate Module 7: Paleobiogeography & Extinctions →