Module 3

Genome Assembly & Comparative Genomics

Assembly reconstructs a genome from shotgun reads using overlap-layout-consensus (for long reads) or de Bruijn graphs (for short reads). Long-read sequencing (ONT, PacBio HiFi) has transformed assembly quality. Comparative genomics then aligns assembled genomes to reveal synteny and rearrangements.

1. De Bruijn Graphs

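A de Bruijn graph is built from k-mers: every distinct (k−1)-mer is a node, and each k-mer contributes an edge X → Y from its (k−1)-mer prefix X to its (k−1)-mer suffix Y, so X and Y overlap by k−2 bases. Assembly then becomes the problem of finding an Eulerian path, that is, a path that traverses every edge exactly once.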

\[ \text{Nodes}: (k-1)\text{-mers}\quad \text{Edges}: k\text{-mers},\ X \to Y\ \text{if}\ X[1:] = Y[:-1] \]

Short-read assemblers (SPAdes, ABySS, Velvet) build de Bruijn graphs, collapse bubbles from sequencing errors, resolve repeats with paired-end information, and emit contigs. Pevzner 2001 demonstrated the approach; modern variants use succinct graph representations for human-scale assemblies.
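The construction described in this section can be sketched in a few lines of Python. This is a minimal illustration, not how SPAdes, ABySS, or Velvet are implemented: it assumes error-free reads, keeps one edge per distinct k-mer, ignores reverse complements and paired-end information, and assumes an Eulerian path exists. The function names (build_de_bruijn, eulerian_path, spell) and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Adjacency lists: each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:          # assumption: one edge per distinct k-mer
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at a node with one more outgoing than incoming edge, if there is one.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg.get(u, 0) == 1),
        next(iter(graph)),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit node
    return path[::-1]


def spell(path):
    """Join the nodes of a path into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads sampled from the made-up sequence ATGGCGTGCAAT.
reads = ["ATGGCG", "GGCGTG", "CGTGCA", "TGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)  # ATGGCGTGCAAT
```

In this toy case no (k−1)-mer repeats, so the graph is a single unbranched path and the reconstruction is unique. With repeats longer than k−1, several Eulerian paths can exist, which is where the repeat-resolution steps mentioned above come in.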

2. Long-Read Assembly (OLC)

Long reads (ONT 10–100 kb; PacBio HiFi 15–25 kb at 99%+ accuracy) favour overlap-layout-consensus: align all reads pairwise, build an overlap graph, then compute a consensus sequence along each path. Flye, Canu, hifiasm, and Verkko are current tools. Telomere-to-telomere human assemblies (T2T; Nurk et al. 2022, Science) now span each chromosome end to end, closing the roughly 8% of the genome that earlier assemblies had left unresolved.
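The overlap and layout steps can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions: reads are error-free (so no consensus step is needed), overlaps are exact suffix-prefix matches found by brute force, and the layout greedily merges the best-overlapping pair. Real long-read assemblers use approximate alignment and far more scalable overlap detection; the function names and reads here are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if < min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):     # all ordered pairs (overlap step)
            n = overlap(a, b, min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:                         # nothing left to merge
            break
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_n:])  # layout: join on the overlap
    return reads


# Toy reads sampled from the made-up sequence ATGGCGTGCAATTCGA.
reads = ["ATGGCGTG", "GCGTGCAA", "GCAATTCGA"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTGCAATTCGA']
```

The all-pairs comparison is quadratic in the number of reads, which is one reason production tools index reads before aligning them.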

3. Assembly Quality & N50

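N50 is the contig length such that 50% of the assembly resides in contigs of that length or longer. A higher N50 means the assembly is split into fewer, larger contigs, which is generally better, although misjoined contigs can also inflate it. BUSCO scores assess completeness by checking for expected single-copy orthologs. Taken together, the two metrics summarise the contiguity and gene-level completeness of an assembly; neither measures base-level or structural correctness.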
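The N50 definition translates directly into code. The sketch below assumes the input is simply a list of contig lengths; the function name n50 and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Smallest length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached 50% of the total assembly size
            return length
    return 0                       # empty input


lengths = [100, 80, 50, 30, 20, 20]   # total = 300
result = n50(lengths)
print(result)  # 80, because 100 + 80 = 180 >= 150
```

Splitting the 100-unit contig into two 50-unit pieces leaves the total unchanged but lowers N50 to 50, which is why the metric is read as a measure of contiguity.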

4. Comparative Genomics

Once assembled, genomes are compared with whole-genome aligners such as MUMmer (exact-match-based), LAST, and minimap2. Synteny blocks reveal conserved gene order; rearrangement breakpoints mark genome evolution events. Tools like Mauve, progressiveCactus, and SynTracker integrate whole-clade alignment. Pan-genome construction adds presence/absence variation to the reference picture.
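To make the idea of exact-match anchors and synteny blocks concrete, here is a small sketch. It is not MUMmer's algorithm (which uses suffix-based indexes to find maximal matches); it assumes that k-mers occurring exactly once in each sequence are acceptable anchors, only considers the forward strand, and groups anchors into a block only when they are adjacent in both sequences. All names and sequences are hypothetical.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def anchors(a, b, k):
    """Positions (in a, in b) of k-mers that are unique in both sequences."""
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    return sorted((ua[m], ub[m]) for m in ua.keys() & ub.keys())


def collinear_blocks(pairs):
    """Merge runs of anchors that advance by one base in both sequences.

    Returns (start_in_a, start_in_b, number_of_anchors) per block.
    """
    blocks = []
    for pa, pb in pairs:
        if blocks and blocks[-1][-1] == (pa - 1, pb - 1):
            blocks[-1].append((pa, pb))
        else:
            blocks.append([(pa, pb)])
    return [(blk[0][0], blk[0][1], len(blk)) for blk in blocks]


# Two toy "genomes" carrying the same two segments in swapped order.
x, y = "ACGTTGCA", "GGATCCTA"
genome_a = x + y
genome_b = y + x
blocks = collinear_blocks(anchors(genome_a, genome_b, k=4))
print(blocks)  # [(0, 8, 5), (8, 0, 5)]
```

Each block is a conserved segment, and the jump in the second coordinate between the two blocks marks the rearrangement breakpoint. Anchors that span the junction between segments are absent from the other genome, so they never extend a block across the breakpoint.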

Key References

• Pevzner, P. A. et al. (2001). “An Eulerian path approach to DNA fragment assembly.” Proc. Natl. Acad. Sci., 98, 9748–9753.

• Bankevich, A. et al. (2012). “SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.” J. Comput. Biol., 19, 455–477.

• Nurk, S. et al. (2022). “The complete sequence of a human genome.” Science, 376, 44–53.

• Marcais, G. & Kingsford, C. (2011). “A fast, lock-free approach for efficient parallel counting of k-mers.” Bioinformatics, 27, 764–770.
