Genome Assembly & Annotation
Reconstructing genomes from reads and decoding the functional elements within
Assembly Strategies: De Novo vs. Reference-Guided
Genome assembly is the computational process of reconstructing the original genomic sequence from the short (or long) sequencing reads produced by sequencing platforms. This task is analogous to assembling a jigsaw puzzle with millions of tiny overlapping pieces -- many of which may look similar due to repetitive sequences.
De Novo Assembly
Assembling a genome without any prior reference sequence. Required for novel organisms, highly divergent genomes, or when the reference is incomplete.
- No reference bias -- discovers all variants and structural features
- Computationally intensive (requires large RAM and processing time)
- Quality depends heavily on read length, coverage, and insert sizes
- Two main algorithmic paradigms: OLC and de Bruijn graphs
- Tools: SPAdes, Flye, Hifiasm, Canu, MEGAHIT
Reference-Guided Assembly
Reads are aligned (mapped) to an existing reference genome. Variants and consensus sequences are called relative to this reference.
- Much faster and less memory-intensive than de novo
- Subject to reference bias -- novel sequences may be missed
- Cannot detect insertions larger than the read/insert size
- Standard for resequencing projects (e.g., 1000 Genomes, GWAS)
- Tools: BWA-MEM2, Bowtie2, minimap2, HISAT2
In practice, many modern genome projects use a hybrid approach, combining de novo assembly with reference-guided scaffolding and polishing. For example, a de novo assembly from PacBio HiFi reads may be scaffolded using Hi-C data and then polished with short Illumina reads for base-level accuracy correction.
Assembly Algorithms: OLC & De Bruijn Graphs
Overlap-Layout-Consensus (OLC)
The OLC paradigm dates to Sanger-era assembly and underpins most modern long-read assemblers. It operates in three stages:
- Overlap: All pairwise overlaps between reads are computed. For $N$ reads, this requires $O(N^2)$ comparisons in the naive case, though efficient seed-and-extend algorithms reduce this. An overlap graph is constructed where each node represents a read and edges represent sufficiently long overlaps (a minimal sketch of this stage follows the list below).
- Layout: The overlap graph is simplified by removing transitive edges (if read A overlaps B and B overlaps C, and A also overlaps C, the A-C edge is redundant). The simplified graph ideally forms a set of linear paths corresponding to contiguous genomic segments.
- Consensus: For each path in the layout, a multiple sequence alignment of the constituent reads is computed, and the consensus base at each position is determined by majority voting (or a more sophisticated probabilistic model).
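To make the Overlap stage concrete, here is a minimal Python sketch that computes exact suffix-prefix overlaps between reads and builds a naive overlap graph. It assumes error-free reads and exact matching with a fixed minimum overlap length; real assemblers use indexed seed-and-extend alignment with error tolerance. Function names here are illustrative, not from any particular assembler.

```python
from itertools import permutations

def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    start = 0
    while True:
        # Jump to the next occurrence of b's first min_len bases within a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):     # rest of a matches the start of b
            return len(a) - start
        start += 1

def build_overlap_graph(reads: dict[str, str], min_len: int) -> dict:
    """Naive O(N^2) all-vs-all overlap computation (the 'Overlap' stage)."""
    graph = {}
    for (id_a, a), (id_b, b) in permutations(reads.items(), 2):
        olen = suffix_prefix_overlap(a, b, min_len)
        if olen >= min_len:
            graph[(id_a, id_b)] = olen  # edge: a's suffix overlaps b's prefix
    return graph

reads = {"r1": "GATTACAGATT", "r2": "ACAGATTCTTG", "r3": "TTCTTGACCA"}
print(build_overlap_graph(reads, min_len=5))
# {('r1', 'r2'): 7, ('r2', 'r3'): 6}
```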
OLC assemblers include Celera Assembler (used for the first Sanger-era human genome assembly by Celera Genomics), Canu, and Flye (optimized for long noisy reads from PacBio and ONT).
De Bruijn Graph Approach
For short-read data, the OLC approach is impractical due to the enormous number of reads (billions) and the $O(N^2)$ pairwise comparison cost. The de Bruijn graph approach (named after Dutch mathematician Nicolaas Govert de Bruijn) solves this elegantly by decomposing reads into overlapping subsequences of length $k$, called k-mers.
De Bruijn Graph Construction
- Each read of length $L$ is decomposed into $L - k + 1$ overlapping k-mers
- Each unique $(k-1)$-mer becomes a node in the graph
- Each k-mer becomes a directed edge from the node matching its first $k-1$ bases to the node matching its last $k-1$ bases (consecutive k-mers in a read thus trace a walk through the graph)
- The genome sequence corresponds to an Eulerian path through the graph -- a path that traverses every edge (k-mer) exactly once. Eulerian paths can be found in linear time, whereas the read-overlap formulation leads to the NP-hard Hamiltonian path problem; this is what makes the de Bruijn approach scale to billions of reads (a minimal construction is sketched after the next paragraph)
The choice of $k$ is critical: too small a $k$ produces ambiguous paths (many k-mers will occur at multiple genomic loci), while too large a $k$ fragments the graph (reads with errors will produce unique k-mers not connected to the main graph). Many assemblers test multiple values or use iterative approaches. Popular de Bruijn graph assemblers include SPAdes, MEGAHIT, Velvet, and ABySS.
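Here is a minimal sketch of de Bruijn graph construction under the idealized assumptions above (error-free reads, $(k-1)$-mers as nodes, k-mers as edges). Production assemblers add error correction, tip and bubble removal, and compact graph representations.

```python
from collections import defaultdict

def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Each k-mer in a read contributes an edge from its (k-1)-prefix to its
    (k-1)-suffix; repeated edges record k-mer multiplicity (coverage).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):      # L - k + 1 k-mers per read
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for prefix, suffixes in sorted(de_bruijn_graph(reads, k=4).items()):
    print(prefix, "->", ",".join(suffixes))
```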
K-mer Coverage
The k-mer coverage $C_k$ relates to the read coverage $C$ as follows:

$$C_k = C \cdot \frac{L - k + 1}{L}$$

Where $L$ is the read length and $k$ is the k-mer size. For typical Illumina data with $L = 150$ and $k = 31$, $C_k = C \cdot 120/150 = 0.8C$. K-mer frequency histograms are invaluable for estimating genome size, heterozygosity, and repeat content prior to assembly.
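The sketch below shows how a k-mer frequency histogram supports a crude genome-size estimate: total k-mer instances divided by the histogram peak (which approximates $C_k$) gives the number of distinct genomic k-mer positions. This is a toy version of what tools like Jellyfish and GenomeScope do; it assumes a haploid genome and error-free reads, and real data would require excluding the low-multiplicity error peak.

```python
from collections import Counter

def kmer_histogram(reads: list[str], k: int):
    """Count k-mers, then build the multiplicity histogram."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    hist = Counter(counts.values())  # multiplicity -> number of distinct k-mers
    return counts, hist

def estimate_genome_size(counts: Counter, hist: Counter) -> float:
    """Genome size (in k-mer positions) ~ total k-mer instances / peak coverage.

    The modal multiplicity approximates C_k; with sequencing errors, the
    error peak at multiplicity 1 must be excluded before taking the mode.
    """
    peak = max(hist, key=hist.get)   # modal multiplicity, a proxy for C_k
    return sum(counts.values()) / peak
```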
Assembly Quality Metrics
Evaluating the quality of a genome assembly is crucial for determining its suitability for downstream analyses. Several standard metrics are used, each capturing a different aspect of assembly quality.
Contigs and Scaffolds
A contig (contiguous sequence) is a continuous stretch of assembled sequence without any gaps. A scaffold is formed by ordering and orienting contigs using additional information -- such as paired-end/mate-pair read distances, Hi-C chromatin interaction data, or optical mapping -- with gaps between contigs represented by stretches of N's (unknown bases).
N50: The Key Assembly Statistic
The N50 is the most widely reported assembly contiguity metric. It is defined as follows:

$$\mathrm{N50} = \min \left\{ \ell : \sum_{L_i \geq \ell} L_i \geq \frac{1}{2} \sum_i L_i \right\}$$

In other words, sort all contigs (or scaffolds) by length in descending order. Starting from the longest, accumulate their lengths. The N50 is the length of the contig at which the cumulative sum first reaches or exceeds 50% of the total assembly size. An assembly with a higher N50 is generally more contiguous.
Related metrics include:
- L50: The minimum number of contigs whose cumulative length reaches 50% of the total assembly. Lower L50 indicates fewer, larger contigs.
- NG50: Like N50 but normalized to the expected genome size rather than the total assembly size, enabling fairer comparisons between assemblies of different completeness.
- auN (area under the Nx curve): A single statistic that integrates over all Nx values from 0 to 100, providing a more comprehensive contiguity measure than N50 alone.
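These contiguity statistics are simple to compute. The sketch below implements Nx/Lx (and NGx/LGx when an expected genome size is supplied), following the sort-and-accumulate definition above.

```python
def nx_stats(lengths: list[int], x: float = 50,
             genome_size: int | None = None):
    """Return (Nx, Lx); pass genome_size to get (NGx, LGx) instead.

    Sort contig lengths in descending order and accumulate until x% of the
    total assembly size (or of genome_size) is reached; Nx is that contig's
    length and Lx is how many contigs it took.
    """
    lengths = sorted(lengths, reverse=True)
    target = (genome_size or sum(lengths)) * x / 100
    cumulative = 0
    for i, length in enumerate(lengths, start=1):
        cumulative += length
        if cumulative >= target:
            return length, i
    return None, None  # assembly shorter than target (possible for NGx)

contigs = [500, 400, 300, 200, 100]  # total = 1500, half = 750
print(nx_stats(contigs))             # (400, 2): 500 + 400 = 900 >= 750
```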
GC Content
GC content is a fundamental sequence composition metric used to assess assembly quality and detect contamination:

$$\mathrm{GC\%} = \frac{G + C}{A + T + G + C} \times 100$$

Where $A$, $T$, $G$, and $C$ are the counts of each base. Unexpected peaks in a GC distribution plot (or a GC-versus-coverage blob plot) may indicate contamination from foreign organisms. The human genome has ~41% GC, while E. coli has ~51% and Plasmodium falciparum has ~19%.
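Computing GC content per contig is a quick first-pass contamination screen. A minimal sketch (the sequences are toy examples):

```python
def gc_content(seq: str) -> float:
    """GC percentage, ignoring N's from scaffold gaps."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return 100 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

# Contigs whose GC is far from the genome-wide value warrant inspection.
assembly = {"contig_1": "ATGCGCGCTA", "contig_2": "ATATATATGC"}
for name, seq in assembly.items():
    print(f"{name}\t{gc_content(seq):.1f}% GC")
```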
BUSCO: Completeness Assessment
Beyond contiguity, BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses assembly completeness by searching for a curated set of genes expected to be present in single copy in a given taxonomic lineage. BUSCO results are reported as percentages of Complete (single-copy + duplicated), Fragmented, and Missing orthologs. A high-quality mammalian genome assembly typically shows >95% complete BUSCOs.
| Metric | Measures | Ideal Value | Limitation |
|---|---|---|---|
| N50 / NG50 | Contiguity | As large as possible (ideally chromosome-scale) | Does not measure correctness |
| L50 | Fragmentation | As small as possible | Sensitive to total assembly size |
| BUSCO | Gene completeness | >95% complete | Lineage-specific; biased toward conserved genes |
| QV (consensus quality) | Base-level accuracy | QV50+ (1 error per 100 kb) | Requires independent data for estimation |
Gene Prediction & Structural Annotation
Once a genome is assembled, the next critical step is genome annotation -- identifying the locations and structures of all functional elements, particularly protein-coding genes. Annotation can be divided into structural annotation (defining the physical boundaries of genes and their sub-features) and functional annotation (assigning biological roles to predicted genes).
Open Reading Frame (ORF) Finding
The simplest approach to gene identification is scanning for open reading frames (ORFs) -- stretches of DNA that begin with a start codon (ATG) and end with a stop codon (TAA, TAG, or TGA) without any intervening stop codons in the same reading frame. For each strand, there are three possible reading frames, giving six total frames to scan.
In prokaryotes, where genes are typically intron-free and occupy a high fraction of the genome (~87% coding in E. coli), ORF finding is relatively straightforward -- long ORFs (>300 bp or 100 codons) are likely to be real genes. In eukaryotes, the presence of introns, alternative splicing, and the vast non-coding genome makes simple ORF finding insufficient.
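A six-frame ORF scanner is straightforward to sketch. The version below assumes ATG-initiated ORFs and the standard stop codons, reporting ORFs above a length cutoff; it ignores alternative start codons and, of course, introns.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_len: int = 300):
    """Yield (strand, frame, start, end, orf) for ATG...stop ORFs in six frames.

    Coordinates are 0-based on the scanned strand; min_len is in nucleotides
    (300 nt = 100 codons, the classic prokaryotic cutoff).
    """
    for strand, s in (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                     # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:  # stop codon included
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None                  # close and keep scanning

# Toy sequence with min_len lowered so the short ORF is reported:
for orf in find_orfs("CCATGAAATTTGGGTAACC", min_len=9):
    print(orf)  # ('+', 2, 2, 17, 'ATGAAATTTGGGTAA')
```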
Ab Initio Gene Prediction
Ab initio (from first principles) gene predictors use statistical models trained on known gene structures to identify genes based on sequence features alone. These programs employ Hidden Markov Models (HMMs) or generalized HMMs (GHMMs) that model the architecture of eukaryotic genes, including:
- Promoter signals: TATA box (consensus TATAAA, ~25 bp upstream of TSS), Inr element, CpG islands
- Splice sites: Donor site (5' splice site: GT at intron start), acceptor site (3' splice site: AG at intron end), branch point sequence
- Codon usage bias: Non-uniform frequency of synonymous codons in coding regions
- Start/stop codons: Kozak consensus sequence context around the ATG
- Polyadenylation signal: AATAAA ~10-30 bp upstream of poly(A) site
Popular ab initio predictors include Augustus, GeneMark, SNAP, and GlimmerHMM. While powerful, ab initio methods alone typically achieve ~60-80% sensitivity and can produce false positives, so they are best used in combination with evidence-based methods.
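To illustrate the HMM idea behind these predictors, here is a deliberately tiny two-state model (GC-rich "coding" vs AT-rich "noncoding") decoded with the Viterbi algorithm. All probabilities are invented for illustration; real GHMM gene finders model splice sites, codon position, and state durations, not just base composition.

```python
import math

STATES = ("coding", "noncoding")
TRANS = {"coding":    {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT  = {"coding":    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
         "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

def viterbi(seq: str) -> list[str]:
    """Most probable state path (log space to avoid numerical underflow)."""
    v = [{s: math.log(0.5) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (v[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][base]))
            ptr[s] = prev
        v.append(scores)
        back.append(ptr)
    # Traceback from the best final state.
    state = max(STATES, key=v[-1].get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

print(viterbi("ATATATGCGCGCGCATATAT"))
```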
Homology-Based & Evidence-Based Annotation
Homology-based methods align known proteins, ESTs (expressed sequence tags), or RNA-seq transcripts from the same or related species to the assembled genome. Regions showing significant similarity to known genes are strong candidates for gene models. Tools like GeneWise, Exonerate, and SPALN perform spliced alignment of proteins to genomic DNA.
Modern evidence-based annotation pipelines integrate multiple lines of evidence -- ab initio predictions, protein homology, RNA-seq alignments, and other experimental data -- using combinatorial approaches. The MAKER pipeline and BRAKER (combining GeneMark-ET with Augustus, trained on RNA-seq evidence) are widely used eukaryotic genome annotation systems. NCBI's Eukaryotic Genome Annotation Pipeline and Ensembl's Genebuild pipeline are the standard reference annotation systems.
Structure of a Eukaryotic Gene
A typical protein-coding gene in eukaryotes contains the following structural elements (from 5' to 3' on the sense strand):
- Promoter region: Upstream regulatory elements (enhancers, silencers, TATA box, CpG island)
- 5' UTR: Untranslated region between the transcription start site and the start codon
- Exons: Coding sequences (and portions of UTRs); average ~170 bp in humans
- Introns: Non-coding intervening sequences; average ~6,000 bp in humans; removed by splicing
- 3' UTR: Untranslated region between the stop codon and poly(A) signal
- Polyadenylation signal: AATAAA directing mRNA cleavage and poly(A) tail addition
Sequence Similarity Search & Functional Annotation
The BLAST Algorithm
The Basic Local Alignment Search Tool (BLAST), developed by Altschul et al. (1990), is among the most widely used tools in bioinformatics. BLAST identifies regions of local similarity between a query sequence and a database, providing statistical assessments of the significance of matches. Different BLAST programs handle various query/database combinations:
| Program | Query | Database | Application |
|---|---|---|---|
| blastn | Nucleotide | Nucleotide | Finding similar DNA sequences, primer design |
| blastp | Protein | Protein | Identifying homologous proteins |
| blastx | Nucleotide (translated) | Protein | Finding protein homologs from DNA queries |
| tblastn | Protein | Nucleotide (translated) | Finding coding regions in unannotated genomes |
| tblastx | Nucleotide (translated) | Nucleotide (translated) | Comparing unannotated genomic sequences |
BLAST E-Value
The E-value (Expect value) is the most important statistical measure in BLAST output. It represents the number of alignments with a score equal to or better than the observed score that would be expected by chance, given the database size:

$$E = K m n\, e^{-\lambda S}$$
Where:
- $K$ and $\lambda$ are statistical parameters dependent on the scoring matrix and gap penalties
- $m$ = query sequence length
- $n$ = total database size (in residues or bases)
- $S$ = raw alignment score
Typical E-value thresholds: E < $10^{-5}$ for confident homology; E < $10^{-50}$ for very strong matches. An E-value of 0.01 means that one hit of this quality is expected by chance for every 100 database searches.
The bit score ($S'$) is a normalized version of the raw score that is independent of the scoring system, making scores comparable across searches:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

In bit-score units the E-value simplifies to $E = mn \cdot 2^{-S'}$, with no scoring-system parameters remaining.
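A worked example tying the two formulas together: computing $E$ from the raw score, converting to bits, and recovering the same E-value via $E = mn \cdot 2^{-S'}$. The $K$ and $\lambda$ values below are roughly those of ungapped BLOSUM62 but should be treated as placeholders; real values depend on the scoring matrix and gap penalties.

```python
import math

def evalue(raw_score: float, m: int, n: int, K: float, lam: float) -> float:
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score: float, K: float, lam: float) -> float:
    """Normalized bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def evalue_from_bits(bits: float, m: int, n: int) -> float:
    """In bit units the scoring parameters cancel: E = m * n * 2^(-S')."""
    return m * n * 2 ** (-bits)

# Illustrative numbers only (K and lambda depend on the scoring system):
K, lam = 0.13, 0.32
S, m, n = 120, 300, 2_000_000_000
bits = bit_score(S, K, lam)
print(f"bit score = {bits:.1f}, E = {evalue_from_bits(bits, m, n):.2e}")
print(f"direct      E = {evalue(S, m, n, K, lam):.2e}")  # same value
```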
Functional Annotation: GO & KEGG
Once genes are structurally predicted, functional annotation assigns biological meaning by integrating multiple data sources:
- Gene Ontology (GO): A standardized vocabulary organized into three hierarchical ontologies: Biological Process (e.g., GO:0006915 "apoptotic process"), Molecular Function (e.g., GO:0003700 "DNA-binding transcription factor activity"), and Cellular Component (e.g., GO:0005634 "nucleus"). GO terms are assigned based on homology, domain architecture, and experimental evidence codes.
- KEGG (Kyoto Encyclopedia of Genes and Genomes): Maps genes to metabolic and signaling pathways. KEGG Orthology (KO) assignments link genes to functional modules and pathway maps, enabling pathway completeness assessment.
- InterPro / Pfam: Protein domain and family classification using HMM profiles. InterProScan searches against multiple domain databases simultaneously.
- EC Numbers: Enzyme Commission classification for catalytic functions (e.g., EC 2.7.1.1 for hexokinase).
Comparative Genomics Foundations
Genome annotation is greatly enhanced by comparative genomics -- comparing the assembled genome to related species. Conserved regions (identified by multi-species alignments using tools like LASTZ, MUMmer, or Progressive Cactus) are more likely to be functional. The PhastCons and PhyloP scores quantify evolutionary conservation at each nucleotide position using phylogenetic models, and are widely used to prioritize candidate functional elements and variants of clinical significance.