Genome Assembly & Annotation
Reconstructing genomes from reads and decoding the functional elements within
Assembly Strategies: De Novo vs. Reference-Guided
Genome assembly is the computational process of reconstructing the original genomic sequence from the short (or long) sequencing reads produced by sequencing platforms. This task is analogous to assembling a jigsaw puzzle with millions of tiny overlapping pieces -- many of which may look similar due to repetitive sequences.
De Novo Assembly
Assembling a genome without any prior reference sequence. Required for novel organisms, highly divergent genomes, or when the reference is incomplete.
- No reference bias -- discovers all variants and structural features
- Computationally intensive (requires large RAM and processing time)
- Quality depends heavily on read length, coverage, and insert sizes
- Two main algorithmic paradigms: OLC and de Bruijn graphs
- Tools: SPAdes, Flye, Hifiasm, Canu, MEGAHIT
Reference-Guided Assembly
Reads are aligned (mapped) to an existing reference genome. Variants and consensus sequences are called relative to this reference.
- Much faster and less memory-intensive than de novo
- Subject to reference bias -- novel sequences may be missed
- Cannot detect insertions larger than the read/insert size
- Standard for resequencing projects (e.g., 1000 Genomes, GWAS)
- Tools: BWA-MEM2, Bowtie2, minimap2, HISAT2
In practice, many modern genome projects use a hybrid approach, combining de novo assembly with reference-guided scaffolding and polishing. For example, a de novo assembly from PacBio HiFi reads may be scaffolded using Hi-C data and then polished with short Illumina reads for base-level accuracy correction.
Assembly Algorithms: OLC & De Bruijn Graphs
Overlap-Layout-Consensus (OLC)
The OLC paradigm dates to Sanger-era assembly and underpins most modern long-read assemblers. It operates in three stages:
- Overlap: All pairwise overlaps between reads are computed. For $N$ reads, this requires $O(N^2)$ comparisons in the naive case, though efficient seed-and-extend algorithms reduce this. An overlap graph is constructed where each node represents a read and edges represent sufficiently long overlaps (a minimal sketch of this stage follows the list below).
- Layout: The overlap graph is simplified by removing transitive edges (if read A overlaps B and B overlaps C, and A also overlaps C, the A-C edge is redundant). The simplified graph ideally forms a set of linear paths corresponding to contiguous genomic segments.
- Consensus: For each path in the layout, a multiple sequence alignment of the constituent reads is computed, and the consensus base at each position is determined by majority voting (or a more sophisticated probabilistic model).
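To make the Overlap stage concrete, here is a minimal Python sketch that computes exact suffix-prefix overlaps between reads and builds a naive overlap graph. It assumes error-free reads and exact matching with a fixed minimum overlap length; real assemblers use indexed seed-and-extend alignment with error tolerance. Function names here are illustrative, not from any particular assembler.

```python
from itertools import permutations

def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    start = 0
    while True:
        # Jump to the next occurrence of b's first min_len bases within a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):     # rest of a matches the start of b
            return len(a) - start
        start += 1

def build_overlap_graph(reads: dict[str, str], min_len: int) -> dict:
    """Naive O(N^2) all-vs-all overlap computation (the 'Overlap' stage)."""
    graph = {}
    for (id_a, a), (id_b, b) in permutations(reads.items(), 2):
        olen = suffix_prefix_overlap(a, b, min_len)
        if olen >= min_len:
            graph[(id_a, id_b)] = olen  # edge: a's suffix overlaps b's prefix
    return graph

reads = {"r1": "GATTACAGATT", "r2": "ACAGATTCTTG", "r3": "TTCTTGACCA"}
print(build_overlap_graph(reads, min_len=5))
# {('r1', 'r2'): 7, ('r2', 'r3'): 6}
```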
OLC assemblers include Celera Assembler (used for the first Sanger-era human genome assembly by Celera Genomics), Canu, and Flye (optimized for long noisy reads from PacBio and ONT).
De Bruijn Graph Approach
For short-read data, the OLC approach is impractical due to the enormous number of reads (billions) and the $O(N^2)$ pairwise comparison cost. The de Bruijn graph approach (named after Dutch mathematician Nicolaas Govert de Bruijn) solves this elegantly by decomposing reads into overlapping subsequences of length $k$, called k-mers.
De Bruijn Graph Construction
- Each read of length $L$ is decomposed into $L - k + 1$ overlapping k-mers
- Each unique $(k-1)$-mer becomes a node in the graph
- Each k-mer becomes a directed edge from the node matching its first $k-1$ bases to the node matching its last $k-1$ bases (consecutive k-mers in a read thus trace a walk through the graph)
- The genome sequence corresponds to an Eulerian path through the graph -- a path that traverses every edge (k-mer) exactly once. Eulerian paths can be found in linear time, whereas the read-overlap formulation leads to the NP-hard Hamiltonian path problem; this is what makes the de Bruijn approach scale to billions of reads (a minimal construction is sketched after the next paragraph)
The choice of $k$ is critical: too small a $k$ produces ambiguous paths (many k-mers will occur at multiple genomic loci), while too large a $k$ fragments the graph (reads with errors will produce unique k-mers not connected to the main graph). Many assemblers test multiple values or use iterative approaches. Popular de Bruijn graph assemblers include SPAdes, MEGAHIT, Velvet, and ABySS.
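Here is a minimal sketch of de Bruijn graph construction under the idealized assumptions above (error-free reads, $(k-1)$-mers as nodes, k-mers as edges). Production assemblers add error correction, tip and bubble removal, and compact graph representations.

```python
from collections import defaultdict

def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Each k-mer in a read contributes an edge from its (k-1)-prefix to its
    (k-1)-suffix; repeated edges record k-mer multiplicity (coverage).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):      # L - k + 1 k-mers per read
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for prefix, suffixes in sorted(de_bruijn_graph(reads, k=4).items()):
    print(prefix, "->", ",".join(suffixes))
```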
K-mer Coverage
The k-mer coverage $C_k$ relates to the read coverage $C$ as follows:

$$C_k = C \cdot \frac{L - k + 1}{L}$$

Where $L$ is the read length and $k$ is the k-mer size. For typical Illumina data with $L = 150$ and $k = 31$, $C_k = C \cdot 120/150 = 0.8C$. K-mer frequency histograms are invaluable for estimating genome size, heterozygosity, and repeat content prior to assembly.
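The sketch below shows how a k-mer frequency histogram supports a crude genome-size estimate: total k-mer instances divided by the histogram peak (which approximates $C_k$) gives the number of distinct genomic k-mer positions. This is a toy version of what tools like Jellyfish and GenomeScope do; it assumes a haploid genome and error-free reads, and real data would require excluding the low-multiplicity error peak.

```python
from collections import Counter

def kmer_histogram(reads: list[str], k: int):
    """Count k-mers, then build the multiplicity histogram."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    hist = Counter(counts.values())  # multiplicity -> number of distinct k-mers
    return counts, hist

def estimate_genome_size(counts: Counter, hist: Counter) -> float:
    """Genome size (in k-mer positions) ~ total k-mer instances / peak coverage.

    The modal multiplicity approximates C_k; with sequencing errors, the
    error peak at multiplicity 1 must be excluded before taking the mode.
    """
    peak = max(hist, key=hist.get)   # modal multiplicity, a proxy for C_k
    return sum(counts.values()) / peak
```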
Assembly Quality Metrics
Evaluating the quality of a genome assembly is crucial for determining its suitability for downstream analyses. Several standard metrics are used, each capturing a different aspect of assembly quality.
Contigs and Scaffolds
A contig (contiguous sequence) is a continuous stretch of assembled sequence without any gaps. A scaffold is formed by ordering and orienting contigs using additional information -- such as paired-end/mate-pair read distances, Hi-C chromatin interaction data, or optical mapping -- with gaps between contigs represented by stretches of N's (unknown bases).
N50: The Key Assembly Statistic
The N50 is the most widely reported assembly contiguity metric. It is defined as follows:

$$\mathrm{N50} = \min \left\{ \ell : \sum_{L_i \geq \ell} L_i \geq \frac{1}{2} \sum_i L_i \right\}$$

In other words, sort all contigs (or scaffolds) by length in descending order. Starting from the longest, accumulate their lengths. The N50 is the length of the contig at which the cumulative sum first reaches or exceeds 50% of the total assembly size. An assembly with a higher N50 is generally more contiguous.
Related metrics include:
- L50: The minimum number of contigs whose cumulative length reaches 50% of the total assembly. Lower L50 indicates fewer, larger contigs.
- NG50: Like N50 but normalized to the expected genome size rather than the total assembly size, enabling fairer comparisons between assemblies of different completeness.
- auN (area under the Nx curve): A single statistic that integrates over all Nx values from 0 to 100, providing a more comprehensive contiguity measure than N50 alone.
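These contiguity statistics are simple to compute. The sketch below implements Nx/Lx (and NGx/LGx when an expected genome size is supplied), following the sort-and-accumulate definition above.

```python
def nx_stats(lengths: list[int], x: float = 50,
             genome_size: int | None = None):
    """Return (Nx, Lx); pass genome_size to get (NGx, LGx) instead.

    Sort contig lengths in descending order and accumulate until x% of the
    total assembly size (or of genome_size) is reached; Nx is that contig's
    length and Lx is how many contigs it took.
    """
    lengths = sorted(lengths, reverse=True)
    target = (genome_size or sum(lengths)) * x / 100
    cumulative = 0
    for i, length in enumerate(lengths, start=1):
        cumulative += length
        if cumulative >= target:
            return length, i
    return None, None  # assembly shorter than target (possible for NGx)

contigs = [500, 400, 300, 200, 100]  # total = 1500, half = 750
print(nx_stats(contigs))             # (400, 2): 500 + 400 = 900 >= 750
```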
GC Content
GC content is a fundamental sequence composition metric used to assess assembly quality and detect contamination:

$$\mathrm{GC\%} = \frac{G + C}{A + T + G + C} \times 100$$

Where $A$, $T$, $G$, and $C$ are the counts of each base. Unexpected peaks in a GC distribution plot (or a GC-versus-coverage blob plot) may indicate contamination from foreign organisms. The human genome has ~41% GC, while E. coli has ~51% and Plasmodium falciparum has ~19%.
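Computing GC content per contig is a quick first-pass contamination screen. A minimal sketch (the sequences are toy examples):

```python
def gc_content(seq: str) -> float:
    """GC percentage, ignoring N's from scaffold gaps."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return 100 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

# Contigs whose GC is far from the genome-wide value warrant inspection.
assembly = {"contig_1": "ATGCGCGCTA", "contig_2": "ATATATATGC"}
for name, seq in assembly.items():
    print(f"{name}\t{gc_content(seq):.1f}% GC")
```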
BUSCO: Completeness Assessment
Beyond contiguity, BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses assembly completeness by searching for a curated set of genes expected to be present in single copy in a given taxonomic lineage. BUSCO results are reported as percentages of Complete (single-copy + duplicated), Fragmented, and Missing orthologs. A high-quality mammalian genome assembly typically shows >95% complete BUSCOs.
| Metric | Measures | Ideal Value | Limitation |
|---|---|---|---|
| N50 / NG50 | Contiguity | As large as possible (ideally chromosome-scale) | Does not measure correctness |
| L50 | Fragmentation | As small as possible | Sensitive to total assembly size |
| BUSCO | Gene completeness | >95% complete | Lineage-specific; biased toward conserved genes |
| QV (consensus quality) | Base-level accuracy | QV50+ (1 error per 100 kb) | Requires independent data for estimation |
Gene Prediction & Structural Annotation
Once a genome is assembled, the next critical step is genome annotation -- identifying the locations and structures of all functional elements, particularly protein-coding genes. Annotation can be divided into structural annotation (defining the physical boundaries of genes and their sub-features) and functional annotation (assigning biological roles to predicted genes).
Open Reading Frame (ORF) Finding
The simplest approach to gene identification is scanning for open reading frames (ORFs) -- stretches of DNA that begin with a start codon (ATG) and end with a stop codon (TAA, TAG, or TGA) without any intervening stop codons in the same reading frame. For each strand, there are three possible reading frames, giving six total frames to scan.
In prokaryotes, where genes are typically intron-free and occupy a high fraction of the genome (~87% coding in E. coli), ORF finding is relatively straightforward -- long ORFs (>300 bp or 100 codons) are likely to be real genes. In eukaryotes, the presence of introns, alternative splicing, and the vast non-coding genome makes simple ORF finding insufficient.
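A six-frame ORF scanner is straightforward to sketch. The version below assumes ATG-initiated ORFs and the standard stop codons, reporting ORFs above a length cutoff; it ignores alternative start codons and, of course, introns.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_len: int = 300):
    """Yield (strand, frame, start, end, orf) for ATG...stop ORFs in six frames.

    Coordinates are 0-based on the scanned strand; min_len is in nucleotides
    (300 nt = 100 codons, the classic prokaryotic cutoff).
    """
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                     # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:  # stop codon included
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None                  # close and keep scanning

# Toy sequence with min_len lowered so the short ORF is reported:
for orf in find_orfs("CCATGAAATTTGGGTAACC", min_len=9):
    print(orf)  # ('+', 2, 2, 17, 'ATGAAATTTGGGTAA')
```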
Ab Initio Gene Prediction
Ab initio (from first principles) gene predictors use statistical models trained on known gene structures to identify genes based on sequence features alone. These programs employ Hidden Markov Models (HMMs) or generalized HMMs (GHMMs) that model the architecture of eukaryotic genes, including:
- Promoter signals: TATA box (consensus TATAAA, ~25 bp upstream of TSS), Inr element, CpG islands
- Splice sites: Donor site (5' splice site: GT at intron start), acceptor site (3' splice site: AG at intron end), branch point sequence
- Codon usage bias: Non-uniform frequency of synonymous codons in coding regions
- Start/stop codons: Kozak consensus sequence context around the ATG
- Polyadenylation signal: AATAAA ~10-30 bp upstream of poly(A) site
Popular ab initio predictors include Augustus, GeneMark, SNAP, and GlimmerHMM. While powerful, ab initio methods alone typically achieve ~60-80% sensitivity and can produce false positives, so they are best used in combination with evidence-based methods.
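To illustrate the HMM idea behind these predictors, here is a deliberately tiny two-state model (GC-rich "coding" vs AT-rich "noncoding") decoded with the Viterbi algorithm. All probabilities are invented for illustration; real GHMM gene finders model splice sites, codon position, and state durations, not just base composition.

```python
import math

STATES = ("coding", "noncoding")
TRANS = {"coding":    {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT  = {"coding":    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
         "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(seq: str) -> list[str]:
    """Most probable state path (log space to avoid numerical underflow)."""
    v = [{s: math.log(0.5) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (v[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][base]))
            ptr[s] = prev
        v.append(scores)
        back.append(ptr)
    # Traceback from the best final state.
    state = max(STATES, key=v[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

print(viterbi("ATATATGCGCGCGCATATAT"))
```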
Homology-Based & Evidence-Based Annotation
Homology-based methods align known proteins, ESTs (expressed sequence tags), or RNA-seq transcripts from the same or related species to the assembled genome. Regions showing significant similarity to known genes are strong candidates for gene models. Tools like GeneWise, Exonerate, and SPALN perform spliced alignment of proteins to genomic DNA.
Modern evidence-based annotation pipelines integrate multiple lines of evidence -- ab initio predictions, protein homology, RNA-seq alignments, and other experimental data -- using combinatorial approaches. The MAKER pipeline and BRAKER (combining GeneMark-ET with Augustus, trained on RNA-seq evidence) are widely used eukaryotic genome annotation systems. NCBI's Eukaryotic Genome Annotation Pipeline and Ensembl's Genebuild pipeline are the standard reference annotation systems.
Structure of a Eukaryotic Gene
A typical protein-coding gene in eukaryotes contains the following structural elements (from 5' to 3' on the sense strand):
- Promoter region: Upstream regulatory elements (enhancers, silencers, TATA box, CpG island)
- 5' UTR: Untranslated region between the transcription start site and the start codon
- Exons: Coding sequences (and portions of UTRs); average ~170 bp in humans
- Introns: Non-coding intervening sequences; average ~6,000 bp in humans; removed by splicing
- 3' UTR: Untranslated region between the stop codon and poly(A) signal
- Polyadenylation signal: AATAAA directing mRNA cleavage and poly(A) tail addition
Sequence Similarity Search & Functional Annotation
The BLAST Algorithm
The Basic Local Alignment Search Tool (BLAST), developed by Altschul et al. (1990), is among the most widely used tools in bioinformatics. BLAST identifies regions of local similarity between a query sequence and a database, providing statistical assessments of the significance of matches. Different BLAST programs handle various query/database combinations:
| Program | Query | Database | Application |
|---|---|---|---|
| blastn | Nucleotide | Nucleotide | Finding similar DNA sequences, primer design |
| blastp | Protein | Protein | Identifying homologous proteins |
| blastx | Nucleotide (translated) | Protein | Finding protein homologs from DNA queries |
| tblastn | Protein | Nucleotide (translated) | Finding coding regions in unannotated genomes |
| tblastx | Nucleotide (translated) | Nucleotide (translated) | Comparing unannotated genomic sequences |
BLAST E-Value
The E-value (Expect value) is the most important statistical measure in BLAST output. It represents the number of alignments with a score equal to or better than the observed score that would be expected by chance, given the database size:

$$E = K m n\, e^{-\lambda S}$$
Where:
- $K$ and $\lambda$ are statistical parameters dependent on the scoring matrix and gap penalties
- $m$ = query sequence length
- $n$ = total database size (in residues or bases)
- $S$ = raw alignment score
Typical E-value thresholds: E < $10^{-5}$ for confident homology; E < $10^{-50}$ for very strong matches. An E-value of 0.01 means that one hit of this quality is expected by chance for every 100 database searches.
The bit score ($S'$) is a normalized version of the raw score that is independent of the scoring system, making scores comparable across searches:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

In bit-score units the E-value simplifies to $E = mn \cdot 2^{-S'}$, with no scoring-system parameters remaining.
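A worked example tying the two formulas together: computing $E$ from the raw score, converting to bits, and recovering the same E-value via $E = mn \cdot 2^{-S'}$. The $K$ and $\lambda$ values below are roughly those of ungapped BLOSUM62 but should be treated as placeholders; real values depend on the scoring matrix and gap penalties.

```python
import math

def evalue(raw_score: float, m: int, n: int, K: float, lam: float) -> float:
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score: float, K: float, lam: float) -> float:
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def evalue_from_bits(bits: float, m: int, n: int) -> float:
    """In bit units the scoring parameters cancel: E = m * n * 2^(-S')."""
    return m * n * 2 ** (-bits)

# Illustrative numbers only (K and lambda depend on the scoring system):
K, lam = 0.13, 0.32
S, m, n = 120, 300, 2_000_000_000
bits = bit_score(S, K, lam)
print(f"bit score = {bits:.1f}, E = {evalue_from_bits(bits, m, n):.2e}")
print(f"direct      E = {evalue(S, m, n, K, lam):.2e}")  # same value
```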
Functional Annotation: GO & KEGG
Once genes are structurally predicted, functional annotation assigns biological meaning by integrating multiple data sources:
- Gene Ontology (GO): A standardized vocabulary organized into three hierarchical ontologies: Biological Process (e.g., GO:0006915 "apoptotic process"), Molecular Function (e.g., GO:0003700 "DNA-binding transcription factor activity"), and Cellular Component (e.g., GO:0005634 "nucleus"). GO terms are assigned based on homology, domain architecture, and experimental evidence codes.
- KEGG (Kyoto Encyclopedia of Genes and Genomes): Maps genes to metabolic and signaling pathways. KEGG Orthology (KO) assignments link genes to functional modules and pathway maps, enabling pathway completeness assessment.
- InterPro / Pfam: Protein domain and family classification using HMM profiles. InterProScan searches against multiple domain databases simultaneously.
- EC Numbers: Enzyme Commission classification for catalytic functions (e.g., EC 2.7.1.1 for hexokinase).
Comparative Genomics Foundations
Genome annotation is greatly enhanced by comparative genomics -- comparing the assembled genome to related species. Conserved regions (identified by multi-species alignments using tools like LASTZ, MUMmer, or Progressive Cactus) are more likely to be functional. The PhastCons and PhyloP scores quantify evolutionary conservation at each nucleotide position using phylogenetic models, and are widely used to prioritize candidate functional elements and variants of clinical significance.