Module 4: First Cells & LUCA

Somewhere between prebiotic chemistry and the three-domain tree of life, the first cell emerged. This module examines how self-assembling lipid vesicles (Szostak protocells) could grow, divide, and encapsulate genetic material; how a split in membrane chemistry separates the Archaea from the Bacteria; how Weiss et al.'s 2016 comparative genomics reconstructed LUCA as an anaerobic, thermophilic, hydrogen-dependent autotroph; and how the genetic code itself reflects historical rather than purely functional logic.

[Figure: from proto-RNA to a protocell (C10-C14 fatty-acid bilayer with encapsulated RNA) to LUCA (~3.8-3.5 Gya) and the three-domain tree (Woese 1990). Bacteria: ester-linked fatty-acid lipids. Archaea: ether-linked isoprenoid lipids. Eukarya: chimeric (both), with endosymbiosis (mitochondrion ~2.1 Gya).]

4.1 Protocells: Szostak Vesicles

Jack Szostak's laboratory at Harvard has for two decades pursued a minimalist protocell: a fatty-acid vesicle containing replicating genetic material, capable of growth and division under non-biological conditions. The model uses short-chain (\(\text{C}_{10}\)-\(\text{C}_{14}\)) fatty acids, which:

  • Self-assemble into bilayer vesicles above their critical aggregation concentration
  • Are permeable to small solutes including activated nucleotides
  • Grow by incorporating fatty-acid monomers from solution
  • Divide by mechanical shear (Zhu & Szostak 2009): elongated vesicles pinch off
  • Permit non-enzymatic template copying of encapsulated RNA

Osmotic growth

A key Szostak result: vesicles containing more encapsulated polymer grow faster, stealing fatty acids from their empty neighbours. The underlying mechanism is osmotic:

\[ \Delta\Pi = i RT \Delta c_\text{solute} \]

The internal osmotic pressure stretches the bilayer, lowering its surface pressure and making incorporation of additional fatty acid energetically favourable. Content-rich vesicles literally grow at the expense of empty ones: a primitive form of Darwinian selection without genetic information.
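As a worked example of the van 't Hoff relation above, the following minimal sketch converts a concentration excess into a pressure difference. Here \(i\) is the van 't Hoff factor, \(R\) the gas constant and \(T\) the absolute temperature. The 10 mM excess, the temperature of 298.15 K and the function name are illustrative assumptions, not values from the Szostak experiments.

```python
R = 8.314  # gas constant, J mol^-1 K^-1


def osmotic_pressure_difference(delta_c_molar, temperature_k=298.15, vant_hoff_i=1.0):
    """Return the van 't Hoff osmotic pressure difference in pascals.

    delta_c_molar is the solute concentration excess inside the vesicle in mol/L.
    """
    delta_c = delta_c_molar * 1000.0  # convert mol/L to mol/m^3
    return vant_hoff_i * R * temperature_k * delta_c


if __name__ == "__main__":
    # Assumed example: 10 mM more solute inside the vesicle than outside.
    pressure = osmotic_pressure_difference(0.010)
    print(f"{pressure / 1000:.1f} kPa")  # prints "24.8 kPa"
```

Even a modest concentration excess therefore produces a pressure of tens of kilopascals across the bilayer, which is the driving force invoked in the growth argument.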

Division mechanics

Budin & Szostak (2011) showed that cycles of fatty-acid addition first produce long filaments of interconnected vesicles; gentle shear forces then divide these into daughter vesicles with little content loss. This prefigures modern cell division without any protein machinery (FtsZ / tubulin).

4.2 The Archaeal-Bacterial Membrane Divide

One of the most puzzling features of the tree of life: Bacteria and Eukarya use ester-linked straight-chain fatty acids built on sn-glycerol-3-phosphate, while Archaea use ether-linked isoprenoid chains on the opposite stereoisomer (sn-glycerol-1-phosphate). If LUCA already had a membrane, one lineage would have had to totally rebuild it.

Lane-Martin model of the divide

Lane & Martin (2012) and Sojo et al. (2014) argue that LUCA had a leaky inorganic (FeS) membrane; genuine lipid membranes were invented independently in the Archaea and Bacteria after they had left the vent system. This would also explain why the stereoisomer of the glycerol head group differs: each lineage invented its own glycerophosphate synthase.

\[ \text{Archaea: sn-G1P} + 2\,\text{GGPP} \longrightarrow \text{archaeol (ether-linked isoprenoid)} \]
\[ \text{Bacteria: sn-G3P} + 2\,\text{Acyl-CoA} \longrightarrow \text{phosphatidic acid (ester-linked FA)} \]

4.3 Reconstructing LUCA

The Last Universal Common Ancestor is a conceptual node on the tree, not an organism we can sequence. But we can infer its genome by identifying gene families present in both Bacteria and Archaea and shared by widely diverged lineages, subject to the caveat that horizontal gene transfer (HGT) is rampant.

Weiss 2016: 355 LUCA genes

Madeline Weiss and collaborators applied strict phylogenetic criteria to 6.1 million prokaryotic protein-coding genes, looking for clusters that (a) appear in both a Bacterium and an Archaeon, and (b) have tree topology consistent with vertical inheritance from LUCA rather than HGT. They found 355 gene families, painting a coherent physiological portrait:

  • Thermophilic: reverse gyrase and heat-shock machinery
  • Anaerobic: no oxygen-dependent enzymes; instead Fe-Fe and Ni-Fe hydrogenases
  • Hydrogen-dependent autotroph: Wood-Ljungdahl CO2 fixation
  • Na+/H+ antiporters: consistent with the leaky-vent transition
  • Methanogenic-like: key enzymes of the methyl branch

This portrait is essentially a vent chemoautotroph, providing independent support for the alkaline-vent cradle hypothesis of Module 3.
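The two filtering criteria described above can be sketched as a simple predicate over gene families. This is a toy illustration under stated assumptions, not the Weiss et al. pipeline: the `GeneFamily` record, the `domains_monophyletic` flag (used here as a stand-in for "tree topology consistent with vertical inheritance"), the `min_taxa_per_domain` parameter and all example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneFamily:
    name: str
    bacterial_taxa: set         # bacterial groups in which the family occurs
    archaeal_taxa: set          # archaeal groups in which the family occurs
    domains_monophyletic: bool  # assumed proxy for vertical inheritance


def is_luca_candidate(family, min_taxa_per_domain=1):
    """Apply criterion (a), presence in both domains, then (b), tree topology."""
    in_both_domains = (
        len(family.bacterial_taxa) >= min_taxa_per_domain
        and len(family.archaeal_taxa) >= min_taxa_per_domain
    )
    return in_both_domains and family.domains_monophyletic


# Invented toy data, for illustration only.
families = [
    GeneFamily("family_A", {"Clostridia"}, {"Methanococci"}, True),
    GeneFamily("family_B", {"Clostridia", "Bacilli"}, set(), True),
    GeneFamily("family_C", {"Bacilli"}, {"Halobacteria"}, False),
]

candidates = [f.name for f in families if is_luca_candidate(f)]
print(candidates)  # ['family_A']
```

In this toy data, `family_B` fails criterion (a) because it is absent from Archaea, and `family_C` fails criterion (b) because its tree mixes the two domains, the pattern expected after HGT.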

Caveats

The Weiss reconstruction has been criticised (Gogarten, Martin 2017) for potentially biasing against gene families with broad distributions that look HGT-like but are actually inherited. A less restrictive analysis (Moody et al. 2024) produced ~2600 genes, implying a more sophisticated LUCA, more like a modern prokaryote than a proto-cell. The debate is active; both views converge on LUCA being an anaerobic autotroph, but disagree on sophistication.

4.4 Two or Three Domains? Eukaryogenesis Preview

Carl Woese's 1977 discovery of Archaea as a third domain (from rRNA phylogeny) was a landmark. But more recent work suggests eukaryotes nest within Archaea rather than branching separately:

  • Eocyte hypothesis (Lake 1988): eukaryotes branch from within the archaea, alongside the group now called the TACK clade
  • Asgard archaea (Spang 2015; Zaremba-Niedzwiedzka 2017): “Lokiarchaeota” and relatives possess eukaryote-like actin, GTPases, and ESCRT machinery
  • 2-domain tree: Bacteria + (Archaea including eukaryotes)

We treat eukaryogenesis fully in Module 7. Here we note only that the close Archaea-Eukarya relationship reinforces the importance of LUCA as the organism from which all three branches inherited a common chemiosmotic, hydrogen-dependent physiology.

4.5 Origin of the Genetic Code

The canonical 64-codon-to-20-amino-acid mapping is nearly universal (tiny variants in mitochondria, ciliates, Mycoplasma). Three competing explanations for its structure:

Frozen accident (Crick 1968)

Once a code achieves global usage, changing any codon assignment is lethal in almost all contexts: a flood of misfolded proteins. The code is “frozen” not because it is optimal but because alternatives are inaccessible. Crick noted that the specific mapping was historical contingency.

Stereochemical theory

Woese and others proposed direct stereochemical affinity between RNA triplets and cognate amino acids: the codon's chemical shape has some preferential binding to the amino acid it encodes. Aptamer experiments (Yarus, Majerfeld) isolated RNA sequences that bind specific amino acids, and these sequences preferentially contain the cognate codon (7 of 8 times for arginine). The evidence is strong but not universal.

Co-evolutionary theory

Wong (1975): the code expanded as amino acid biosynthesis expanded. Early amino acids (glycine, alanine, aspartate, glutamate) have multiple codons; later-evolving ones (tryptophan, histidine, cysteine, methionine) have fewer. Codons assigned to biosynthetic precursors are near those assigned to products, so the code records metabolic history.
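The codon counts behind this argument can be read directly off the standard code table. The sketch below tallies synonymous codons per amino acid. The table string is the standard genetic code in TCAG order (NCBI translation table 1), while the grouping into "early" and "late" simply follows the examples named above.

```python
from collections import Counter

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of codons assigned to each amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in AMINO_ACIDS if aa != "*")

early = {"Gly": "G", "Ala": "A", "Asp": "D", "Glu": "E"}
late = {"Trp": "W", "His": "H", "Cys": "C", "Met": "M"}

for label, group in (("early", early), ("late", late)):
    counts = ", ".join(f"{name}={degeneracy[code]}" for name, code in group.items())
    print(f"{label}: {counts}")
# early: Gly=4, Ala=4, Asp=2, Glu=2
# late: Trp=1, His=2, Cys=2, Met=1
```

The counts show a tendency rather than a strict rule: aspartate and glutamate have two codons each, the same as histidine and cysteine, while only tryptophan and methionine are confined to a single codon.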

Error-minimisation optimality

To measure how resistant the code is to single-base errors, we compute \(\langle(\phi_a - \phi_{a'})^2\rangle\) over all one-step neighbours in codon space, where \(\phi\) is a physicochemical property (hydrophobicity, polarity, size) and \(a\), \(a'\) are the amino acids encoded before and after the error. Freeland & Hurst (1998) found that, out of a million randomly generated alternative codes, only about one was more robust than the canonical code: hence “one in a million”. Our Python simulation below demonstrates this robustness quantitatively.

4.6 From Ribozymes to the Modern Ribosome

The transition from RNA-world catalysis to protein-based metabolism required invention of the translation machinery, which is itself a ribozyme. Modern ribosomes contain ~54 proteins and ~3 rRNAs, but the catalytic heart (peptidyl transferase centre) is pure RNA. Several strands of evidence suggest translation evolved stepwise:

  • The 23S PTC can be reconstructed as an ancestral hairpin with only \(\sim 110\) nucleotides (Bokov & Steinberg 2009)
  • Early tRNAs were likely minihelices pairing an anticodon stem with a CCA-acceptor end
  • First tRNA aminoacylation may have used direct stereochemical recognition before aaRS proteins existed
  • The ribosome decoding centre (16S) and PTC (23S) are among the most conserved macromolecules in biology

A plausible path: self-aminoacylating ribozymes → template-directed peptide synthesis on RNA scaffolds → invention of the ribosome → encoded protein world. Once encoded proteins existed, they rapidly displaced RNA catalysts in almost all contexts except those already “frozen” (spliceosome, ribosome, RNase P, telomerase).

4.7 The Minimal Genome & Syn3.0

Craig Venter's group engineered Mycoplasma mycoides JCVI-syn3.0 in 2016: a bacterium with just 473 genes, the smallest genome of any autonomously reproducing cell. Roughly 149 of its genes have unknown function yet are essential for viability, suggesting the minimal cell still contains molecular biology we do not understand. The natural minimum genome (Mycoplasma genitalium) has 525 genes; even obligate endosymbionts like Carsonella ruddii (160 genes) cannot survive outside their host.

This sets a lower bound on LUCA: the 355 Weiss genes are a subset of a much larger ancestral genome of perhaps 1000-2000 genes. LUCA was a real, metabolically capable organism, not an abstract fuzzy precursor.

Python: Protocell Growth & LUCA Gene Set

Panel 1 models fatty-acid vesicle growth as a function of encapsulated RNA content, showing the Szostak selection dynamic: content-rich vesicles grow faster via osmotic amplification. Panel 2 visualises the Weiss 2016 decomposition of the 355 LUCA gene families into functional categories, weighted strongly toward translation, cofactor biosynthesis, and the Wood-Ljungdahl pathway: the footprint of a vent-dwelling hydrogen-dependent autotroph.
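A minimal sketch of the Panel 1 dynamic follows, as a toy model rather than a fitted simulation. It assumes spherical vesicles, arbitrary units, a van 't Hoff pressure proportional to internal RNA concentration, a fixed total membrane area shared between vesicles, and a linear transfer rule in which each vesicle gains area in proportion to how far its pressure exceeds the area-weighted mean. The function names and all parameter values are hypothetical. The gene-category breakdown of Panel 2 is not sketched here because it depends on the published data.

```python
import math


def sphere_volume(area):
    """Volume of a sphere with the given surface area."""
    return area ** 1.5 / (6.0 * math.sqrt(math.pi))


def simulate_competition(rna_content, area0=1.0, rate=0.05, steps=200, dt=0.1):
    """Euler-integrate membrane areas of vesicles competing for fatty acid.

    rna_content: encapsulated RNA amount per vesicle (arbitrary units).
    Returns the final membrane area of each vesicle.
    """
    areas = [area0] * len(rna_content)
    for _ in range(steps):
        # van 't Hoff: excess pressure scales with internal concentration n / V.
        pressures = [n / sphere_volume(a) for n, a in zip(rna_content, areas)]
        total_area = sum(areas)
        mean_pressure = sum(p * a for p, a in zip(pressures, areas)) / total_area
        # Vesicles above the mean pressure take up lipid; those below lose it.
        # Subtracting the area-weighted mean keeps the total area constant.
        areas = [
            a + dt * rate * a * (p - mean_pressure)
            for a, p in zip(areas, pressures)
        ]
    return areas


if __name__ == "__main__":
    contents = [0.0, 1.0, 2.0]  # empty, moderately loaded, heavily loaded
    for n, a in zip(contents, simulate_competition(contents)):
        print(f"RNA content {n:.1f}: final area {a:.3f}")
```

In this toy setting the empty vesicle shrinks towards zero while the loaded vesicles divide the membrane between them, with the most heavily loaded one ending up largest.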


Python: Genetic Code Error-Robustness

We compute the mean squared hydrophobicity change under single-base mutations for the canonical genetic code, and compare it to 2000 random permutation codes. The result reproduces the Freeland-Hurst finding: the canonical code is significantly more robust than almost all randomly-assembled alternatives, strong evidence of optimisation (natural or otherwise) rather than pure frozen accident.
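A minimal sketch of this computation is given here under stated assumptions: hydrophobicity is taken from the Kyte-Doolittle hydropathy scale, substitutions to or from stop codons are skipped, and each random code is made by shuffling which amino acid occupies which block of synonymous codons, with stop codons left in place. These choices, and all function names, are assumptions of this sketch rather than a reproduction of the Freeland-Hurst procedure, which used polar requirement.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CANONICAL = dict(zip(CODONS, AMINO_ACIDS))

# Kyte-Doolittle hydropathy index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def code_cost(code):
    """Mean squared hydropathy change over single-base substitutions
    between sense codons (changes to or from stop codons are skipped)."""
    total, count = 0.0, 0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                other = code[codon[:pos] + base + codon[pos + 1:]]
                if other == "*":
                    continue
                total += (HYDROPATHY[aa] - HYDROPATHY[other]) ** 2
                count += 1
    return total / count


def random_code(rng):
    """Shuffle amino acids among the synonymous codon blocks; stops stay put."""
    aas = sorted(set(AMINO_ACIDS) - {"*"})
    shuffled = aas[:]
    rng.shuffle(shuffled)
    swap = dict(zip(aas, shuffled))
    swap["*"] = "*"
    return {codon: swap[aa] for codon, aa in CANONICAL.items()}


def robustness(n_random=2000, seed=0):
    """Return (canonical cost, mean random cost, fraction of random codes
    that have a lower cost than the canonical code)."""
    rng = random.Random(seed)
    canonical = code_cost(CANONICAL)
    costs = [code_cost(random_code(rng)) for _ in range(n_random)]
    better = sum(c < canonical for c in costs)
    return canonical, sum(costs) / n_random, better / n_random


if __name__ == "__main__":
    canonical, mean_random, fraction_better = robustness()
    print(f"canonical cost:        {canonical:.2f}")
    print(f"mean random cost:      {mean_random:.2f}")
    print(f"fraction more robust:  {fraction_better:.4f}")
```

Because the shuffle preserves the block structure of the code, the comparison isolates the effect of which amino acids sit next to each other in codon space. The fraction reported depends on the property scale and the sample size, so it should not be expected to match the published figure.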


References

  1. Szostak, J.W., Bartel, D.P. & Luisi, P.L. (2001). Synthesizing life. Nature, 409, 387-390.
  2. Zhu, T.F. & Szostak, J.W. (2009). Coupled growth and division of model protocell membranes. Journal of the American Chemical Society, 131, 5705-5713.
  3. Weiss, M.C. et al. (2016). The physiology and habitat of the last universal common ancestor. Nature Microbiology, 1, 16116.
  4. Moody, E.R.R. et al. (2024). The nature of the last universal common ancestor and its impact on the early Earth system. Nature Ecology & Evolution.
  5. Woese, C.R., Kandler, O. & Wheelis, M.L. (1990). Towards a natural system of organisms. PNAS, 87, 4576-4579.
  6. Spang, A. et al. (2015). Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature, 521, 173-179.
  7. Zaremba-Niedzwiedzka, K. et al. (2017). Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature, 541, 353-358.
  8. Lane, N. & Martin, W.F. (2012). The origin of membrane bioenergetics. Cell, 151, 1406-1416.
  9. Crick, F.H.C. (1968). The origin of the genetic code. Journal of Molecular Biology, 38, 367-379.
  10. Freeland, S.J. & Hurst, L.D. (1998). The genetic code is one in a million. Journal of Molecular Evolution, 47, 238-248.
  11. Wong, J.T. (1975). A co-evolution theory of the genetic code. PNAS, 72, 1909-1912.
  12. Yarus, M., Widmann, J. & Knight, R. (2009). RNA-amino acid binding: a stereochemical era for the genetic code. Journal of Molecular Evolution, 69, 406-429.