Genomics, Proteomics, Transcriptomics & Metabolomics
The Omics Revolution — From Genomes to Systems Biology
Course Overview
The omics sciences represent one of the most transformative revolutions in modern biology. By studying entire classes of biological molecules simultaneously — genomes, transcriptomes, proteomes, and metabolomes — researchers can now understand living systems at an unprecedented scale and resolution.
This comprehensive course covers 20 detailed chapters spanning four major omics disciplines and their integration through systems biology and bioinformatics. From DNA sequencing technologies and RNA-Seq to mass spectrometry-based proteomics and metabolic profiling, you will learn the principles, technologies, and computational methods that drive modern biomedical research.
Course Structure
Part I: Genomics
Genome organization, DNA sequencing technologies (Sanger, NGS, third-generation), assembly, annotation, and functional genomics
4 chapters — from genome architecture to GWAS and comparative genomics
Part II: Transcriptomics
Gene expression analysis, microarray technology, RNA-Seq pipelines, single-cell and spatial transcriptomics
4 chapters — from microarrays to 10x Genomics and Visium spatial data
Part III: Proteomics
Protein separation, mass spectrometry (MALDI, ESI, tandem MS), quantitative approaches, interactomics
4 chapters — from 2D-PAGE to SILAC, TMT labeling, and Y2H screens
Part IV: Metabolomics
Metabolic profiling, NMR and mass spectrometry platforms, flux analysis, clinical applications
4 chapters — from targeted/untargeted metabolomics to biomarker discovery
Part V: Multi-Omics Integration
Systems biology, bioinformatics tools, machine learning approaches, precision medicine applications
4 chapters — from network analysis to AI-driven biomarker discovery
Mathematics in Omics — Equations & Derivations
Modern omics sciences rely heavily on quantitative methods. Below are the 12 core mathematical frameworks — with step-by-step derivations — that underpin genomics, transcriptomics, proteomics, and metabolomics analysis.
1. Smith-Waterman Algorithm (Local Alignment)
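The Smith-Waterman algorithm finds the optimal local alignment between two sequences using dynamic programming. A scoring matrix H is filled cell by cell, where s(a_i, b_j) is the substitution score and d is the (linear) gap penalty:
$$H(i,j) = \max\{\,0,\; H(i-1,j-1) + s(a_i, b_j),\; H(i-1,j) - d,\; H(i,j-1) - d\,\}, \qquad H(i,0) = H(0,j) = 0$$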
Key differences from global alignment:
- The zero floor ensures negative-scoring regions are discarded (local vs global)
- Traceback begins at the highest-scoring cell (not bottom-right) and stops at the first cell with score 0
- Scoring matrices (BLOSUM62, PAM250) hold log-odds scores derived from observed evolutionary substitution frequencies
- Affine gap penalties (open + extend) are more biologically realistic than linear penalties
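The fill and traceback rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production aligner: the match/mismatch scores stand in for a real substitution matrix such as BLOSUM62, the gap penalty is linear rather than affine, and all default parameter values are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    match/mismatch/gap are illustrative defaults, not values from the course text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # zero floor: restart the alignment
                          H[i - 1][j - 1] + s,    # substitution
                          H[i - 1][j] - gap,      # gap in b
                          H[i][j - 1] - gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `smith_waterman("TTTACGTTT", "GGACGGG")` recovers the shared `ACG` core and ignores the unrelated flanking bases, which is the behavior the zero floor is meant to produce.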
2. Needleman-Wunsch Algorithm (Global Alignment)
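Unlike Smith-Waterman, the Needleman-Wunsch algorithm aligns sequences end-to-end (globally). The recurrence has no zero floor, and initialization penalizes leading gaps:
$$F(i,j) = \max\{\,F(i-1,j-1) + s(a_i, b_j),\; F(i-1,j) - d,\; F(i,j-1) - d\,\}$$
Initialization and traceback:
$$F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d$$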
Traceback starts at F(m,n) (bottom-right corner) and proceeds to F(0,0). Time and space complexity are both O(mn). Compared to Smith-Waterman: global alignment is preferred for closely related sequences of similar length, while local alignment excels at finding conserved domains within divergent sequences.
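A compact Python sketch of the global case is shown here for comparison with the local algorithm. As before, the simple match/mismatch scoring and the linear gap penalty are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Scoring parameters are illustrative defaults, not values from the course text.
    """
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):       # leading gaps are penalized
        F[i][0] = -i * gap
    for j in range(1, n + 1):
        F[0][j] = -j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # no zero floor here
                          F[i - 1][j] - gap,
                          F[i][j - 1] - gap)

    # Traceback from the bottom-right corner back to F(0,0).
    i, j = m, n
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Both aligned strings always have the same length, because every column of a global alignment pairs a residue with either a residue or a gap.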
3. Fold Change (Differential Expression)
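Fold change quantifies the magnitude of expression change between conditions. The log₂ transformation makes the scale symmetric around zero:
$$\log_2 \mathrm{FC} = \log_2\!\left(\frac{x_{\text{treatment}}}{x_{\text{control}}}\right)$$
where x is the (normalized) expression level of the gene in each condition.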
Interpreting log₂FC values:
- log₂FC = +1: 2-fold upregulation (treatment is 2x control)
- log₂FC = -1: 2-fold downregulation (treatment is 0.5x control)
- |log₂FC| > 1: Common effect-size threshold (at least a 2-fold change in either direction), usually combined with an adjusted p-value cutoff
In volcano plots, log₂FC is plotted on the x-axis against -log₁₀(p-value) on the y-axis, allowing simultaneous visualization of effect size and statistical significance.
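As a small worked example, the following Python sketch computes log₂FC and the corresponding volcano-plot coordinates for a single gene. The `pseudocount` parameter is an assumption added for illustration (it avoids division by zero for unexpressed genes) and is not part of the definition above.

```python
import math

def log2_fold_change(treatment, control, pseudocount=0.0):
    """log2 of the treatment/control expression ratio.

    pseudocount is a hypothetical convenience parameter, not from the course text.
    """
    return math.log2((treatment + pseudocount) / (control + pseudocount))

def volcano_coordinates(treatment, control, p_value):
    """Return (x, y) = (log2FC, -log10 p) for one gene."""
    return log2_fold_change(treatment, control), -math.log10(p_value)

# A gene at 200 units in treatment and 100 in control is 2-fold up (log2FC = +1);
# swapping the conditions gives log2FC = -1, illustrating the symmetry.
up = log2_fold_change(200, 100)
down = log2_fold_change(100, 200)
```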
4. Phred Quality Score
Phred scores quantify the probability of a base call error (P_e) in DNA sequencing data. The score is a logarithmic transformation:
$$Q = -10 \log_{10} P_e \quad\Longleftrightarrow\quad P_e = 10^{-Q/10}$$
Common quality thresholds:
- Q20: P_e = 0.01 (99% accuracy, 1 error per 100 bases)
- Q30: P_e = 0.001 (99.9% accuracy, 1 error per 1,000 bases — Illumina standard)
- Q40: P_e = 0.0001 (99.99% accuracy, 1 error per 10,000 bases)
Quality scores are encoded as ASCII characters in FASTQ files; in the common Phred+33 encoding, the character code minus 33 gives Q. Q30 is a widely used benchmark for high-confidence variant calling and downstream analysis.
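The conversions can be sketched in Python as follows. The decoder assumes the Phred+33 ASCII offset; older files that use a different offset would need the `offset` argument changed.

```python
import math

def phred_to_error_prob(q):
    """P_e = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p_error):
    """Q = -10 * log10(P_e)."""
    return -10 * math.log10(p_error)

def decode_fastq_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    offset=33 assumes Phred+33 encoding.
    """
    return [ord(ch) - offset for ch in quality_string]

scores = decode_fastq_quality("I5?+!")   # -> [40, 20, 30, 10, 0]
```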
5. BLAST E-value (Statistical Significance of Alignments)
The BLAST E-value is the expected number of alignments with a score S or better that would occur by chance in a database search. It derives from the Karlin-Altschul statistics for local alignment:
$$E = K\,m\,n\,e^{-\lambda S}$$
where:
- K, λ: Karlin-Altschul statistical parameters (depend on scoring matrix and gap penalties)
- m: Length of query sequence
- n: Total size of the database (in residues)
- S: Raw alignment score
E < 1e-5 is commonly considered significant. E-value scales with database size — the same alignment will have a larger E-value when searched against a bigger database.
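The dependence on score and database size can be checked numerically with a short Python sketch. The values used for λ and K below are illustrative assumptions of the magnitude reported for gapped protein searches; real values depend on the scoring matrix and gap penalties and should be taken from the BLAST output.

```python
import math

def blast_evalue(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Assumed, illustrative parameters (not taken from the course text).
LAMBDA, K = 0.267, 0.041

e_small_db = blast_evalue(100, 300, 1e8, LAMBDA, K)
e_large_db = blast_evalue(100, 300, 2e8, LAMBDA, K)   # same hit, database twice as large
e_better_hit = blast_evalue(150, 300, 1e8, LAMBDA, K)
```

Doubling the database size doubles the E-value of the same alignment, while raising the raw score lowers it exponentially.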
6. Shannon Entropy (Sequence Conservation)
Shannon entropy measures the information content (uncertainty) at each position in a multiple sequence alignment. For a column with symbol frequencies p_i over an alphabet of k symbols:
$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$
The information content R at a position is the difference from maximum entropy:
$$R = \log_2 k - H$$
For DNA (k=4), max entropy is 2 bits. For proteins (k=20), max is ~4.32 bits. Highly conserved positions have low entropy (high information content) and appear as tall letters in sequence logos. This is fundamental to motif discovery and binding site analysis.
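A minimal Python sketch of the per-column calculation is given here. It assumes gap-free columns (a gap character would simply be counted as another symbol) and applies no small-sample correction.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column given as a string of symbols."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_content(column, alphabet_size=4):
    """R = log2(k) - H; alphabet_size=4 for DNA, 20 for proteins."""
    return math.log2(alphabet_size) - column_entropy(column)

conserved = information_content("AAAA")   # fully conserved DNA column: 2 bits
variable = information_content("ACGT")    # all four bases equally frequent: 0 bits
```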
7. Mass-to-Charge Ratio (Mass Spectrometry)
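In electrospray ionization (ESI) mass spectrometry, analytes acquire multiple charges. The observed m/z for a molecule of molecular mass M carrying z protons is:
$$\frac{m}{z} = \frac{M + z\,m_H}{z}$$
where m_H = 1.00728 Da (proton mass). For two adjacent charge states z and z+1, with observed peaks (m/z)_z > (m/z)_{z+1}:
$$z = \frac{(m/z)_{z+1} - m_H}{(m/z)_{z} - (m/z)_{z+1}}, \qquad M = z\,\bigl[(m/z)_{z} - m_H\bigr]$$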
Charge state deconvolution uses the envelope of multiply charged peaks to determine the true molecular mass M. This is essential for intact protein mass measurement in top-down proteomics.
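The two-peak calculation can be sketched in Python as follows. The 16,950 Da mass is a hypothetical value chosen for the example, and real deconvolution software fits the whole charge envelope rather than a single pair of peaks.

```python
PROTON_MASS = 1.00728  # Da

def mz(mass, charge):
    """Observed m/z for a molecule of the given mass carrying `charge` protons."""
    return (mass + charge * PROTON_MASS) / charge

def deconvolve_adjacent(mz_z, mz_z_plus_1):
    """Infer (z, M) from two adjacent peaks.

    mz_z is the peak at charge z (higher m/z); mz_z_plus_1 is the peak at
    charge z+1 (lower m/z).
    """
    z = round((mz_z_plus_1 - PROTON_MASS) / (mz_z - mz_z_plus_1))
    mass = z * (mz_z - PROTON_MASS)
    return z, mass

# Simulate the 10+ and 11+ peaks of a hypothetical 16,950 Da protein,
# then recover the charge and mass from the peak positions alone.
peak_10 = mz(16950.0, 10)
peak_11 = mz(16950.0, 11)
charge, mass = deconvolve_adjacent(peak_10, peak_11)
```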
8. Benjamini-Hochberg FDR Correction
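When testing thousands of genes simultaneously, multiple testing correction is essential. The Benjamini-Hochberg procedure controls the false discovery rate (FDR) at level α:
$$k^{*} = \max\Bigl\{\,k : p_{(k)} \le \frac{k}{m}\,\alpha \Bigr\}$$
where p_(k) is the k-th smallest of the m p-values.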
Procedure step-by-step:
- Rank all m p-values from smallest to largest
- Find the largest rank k where p_(k) ≤ (k/m)·α
- Reject all hypotheses with rank ≤ k
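- The adjusted p-value (q-value) for the hypothesis at rank i is the smallest value of p_(j) · m/j over all ranks j ≥ i, capped at 1; taking the minimum over higher ranks keeps the adjusted values monotone in rank
BH is less conservative than Bonferroni (α/m) and is the standard correction in RNA-Seq differential expression analysis. An FDR of 0.05 means that, on average, about 5% of the results called significant are expected to be false positives.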
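The procedure can be sketched in Python as below. This is a plain-Python illustration under the assumptions stated in the text; established statistics packages provide tested implementations that should be preferred in practice.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted_p_values, rejected_flags) in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p_(j) * m / j over all ranks j >= the current rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected

p = [0.039, 0.001, 0.6, 0.008, 0.041]
q, significant = benjamini_hochberg(p)
```

In this example only the two smallest p-values (0.001 and 0.008) pass at α = 0.05. The value 0.039 has a raw adjusted value of 0.065, but the monotonicity step lowers it to 0.05125, the adjusted value of the next rank.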
9. Principal Component Analysis (PCA)
PCA reduces high-dimensional omics data to a lower-dimensional representation while preserving maximum variance. Given a centered data matrix X (n samples × p features), compute the covariance matrix:
$$C = \frac{1}{n-1}\,X^{\top} X$$
Eigenvalue decomposition of the covariance matrix yields principal components (eigenvectors v_k with eigenvalues λ_k):
$$C\,v_k = \lambda_k\,v_k$$
The proportion of variance explained by the k-th principal component is:
$$\frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}$$
PC1 captures the direction of greatest variance, PC2 the next orthogonal direction, etc. PCA is ubiquitous in omics for quality control (detecting batch effects), sample clustering, and dimensionality reduction prior to downstream analysis.
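The steps above translate directly into NumPy, as in the sketch below. It assumes NumPy is available and follows the covariance-eigendecomposition route described in the text; for real omics matrices with far more features than samples, an SVD of the centered data is usually the more practical route to the same components.

```python
import numpy as np

def pca(X, n_components=2):
    """Return (scores, explained_variance_ratio) for the top components.

    X has shape (n_samples, n_features).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    scores = Xc @ eigvecs[:, :n_components]     # project samples onto the PCs
    return scores, explained[:n_components]

# Toy data lying on a line: PC1 should capture essentially all the variance.
scores, explained = pca([[1, 2], [2, 4], [3, 6], [4, 8]])
```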
10. Differential Expression — DESeq2 Model
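RNA-Seq count data is overdispersed (variance > mean), so DESeq2 models counts with a negative binomial distribution:
$$K_{ij} \sim \mathrm{NB}(\mu_{ij},\, \alpha_i)$$
where K_ij is the count for gene i in sample j, with mean μ_ij and gene-specific dispersion α_i. The mean is modeled via a log-linear generalized linear model:
$$\mu_{ij} = s_j\,q_{ij}, \qquad \log_2 q_{ij} = \sum_{r} x_{jr}\,\beta_{ir}$$
Here s_j is the sample-specific size factor (normalization for sequencing depth), x_jr are the design-matrix entries for sample j, and β_ir are the log₂ fold-change coefficients for gene i.
The NB variance relates to mean and dispersion:
$$\mathrm{Var}(K_{ij}) = \mu_{ij} + \alpha_i\,\mu_{ij}^{2}$$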
DESeq2 uses empirical Bayes shrinkage to stabilize dispersion estimates across genes. Significance is assessed via the Wald test: z = β/SE(β), testing whether the log₂ fold change β differs significantly from zero.
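The variance relation and the Wald statistic can be illustrated numerically with a short Python sketch. This only evaluates the formulas above; it is not DESeq2's implementation and omits size-factor estimation, dispersion shrinkage, and model fitting. The two-sided p-value assumes the Wald statistic is compared to a standard normal distribution.

```python
import math

def nb_variance(mu, alpha):
    """Negative binomial variance: Var = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2

def wald_test(beta, se):
    """Return (z, two_sided_p) for H0: beta = 0, using a standard normal reference."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p

# With mean 100 and dispersion 0.1 the variance is 1100, eleven times the
# Poisson variance of 100, which is the overdispersion the NB model captures.
var = nb_variance(100, 0.1)
z, p = wald_test(beta=1.5, se=0.5)
```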
11. Michaelis-Menten in Metabolomics (Pathway Flux)
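Metabolic flux analysis relies on enzyme kinetics to model pathway throughput. The Michaelis-Menten equation describes the rate of an enzymatic reaction at steady state:
$$v = \frac{V_{\max}\,[S]}{K_m + [S]}$$
Derived from the steady-state assumption on enzyme-substrate complex [ES]:
$$\frac{d[ES]}{dt} = k_1[E][S] - (k_{-1} + k_2)[ES] = 0 \;\;\Rightarrow\;\; [ES] = \frac{[E]_T\,[S]}{K_m + [S]}, \qquad K_m = \frac{k_{-1} + k_2}{k_1}, \qquad V_{\max} = k_2\,[E]_T$$
where k_1 and k_-1 are the rate constants for formation and dissociation of ES, k_2 is the catalytic rate constant, [E]_T = [E] + [ES] is the total enzyme concentration, and v = k_2[ES].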
In flux balance analysis (FBA), the stoichiometric matrix S and steady-state constraint S·v = 0 are used with linear programming to predict metabolic fluxes. Michaelis-Menten kinetics constrains individual reaction rates within these network models, bridging metabolomics measurements to systems-level pathway analysis.
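Both ideas can be illustrated with a small Python sketch: a Michaelis-Menten rate function, and a check of the steady-state constraint S·v = 0 on a toy pathway. The three-reaction pathway (uptake, A → B, export) and all numeric values are hypothetical; an actual FBA additionally optimizes an objective with a linear programming solver, which is not shown here.

```python
def michaelis_menten(substrate, vmax, km):
    """Reaction rate v = Vmax * [S] / (Km + [S])."""
    return vmax * substrate / (km + substrate)

def steady_state_residual(stoich, fluxes):
    """Return S.v, one entry per metabolite; all zeros means steady state."""
    return [sum(coef * v for coef, v in zip(row, fluxes)) for row in stoich]

# At [S] = Km the rate is exactly half of Vmax.
half_rate = michaelis_menten(substrate=5.0, vmax=10.0, km=5.0)

# Hypothetical toy network. Rows: metabolites A, B.
# Columns: uptake (-> A), conversion (A -> B), export (B ->).
S = [[1, -1, 0],
     [0, 1, -1]]
balanced = steady_state_residual(S, [2, 2, 2])     # every flux equal
unbalanced = steady_state_residual(S, [2, 1, 1])   # A accumulates
```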
12. Network Topology Metrics (Interactomics)
Protein-protein interaction (PPI) networks and gene regulatory networks are analyzed using graph-theoretic measures. For a network with N nodes and adjacency matrix A:
Degree of node i (number of interactions):
$$k_i = \sum_{j=1}^{N} A_{ij}$$
Clustering coefficient (local connectivity):
$$C_i = \frac{2\,e_i}{k_i\,(k_i - 1)}$$
where e_i is the number of edges among the neighbors of node i.
Betweenness centrality (bottleneck importance):
$$B(i) = \sum_{s \ne i \ne t} \frac{\sigma_{st}(i)}{\sigma_{st}}$$
where σ_st is the total number of shortest paths from s to t, and σ_st(i) is the number passing through i. Biological networks are often described as approximately scale-free (P(k) ~ k^(-γ)), meaning a few hub proteins have many interactions. High-betweenness nodes are often essential genes and promising drug targets.
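The three metrics can be computed on a small undirected network with the pure-Python sketch below. The four-node graph is a hypothetical example, each unordered pair (s, t) is counted once in the betweenness sum, and the straightforward all-pairs approach is meant for illustration only; large interactomes call for a dedicated graph library.

```python
from collections import deque
from itertools import combinations

def degree(adj, i):
    """Number of neighbors of node i; adj maps each node to a set of neighbors."""
    return len(adj[i])

def clustering_coefficient(adj, i):
    """C_i = 2 * e_i / (k_i * (k_i - 1)), defined as 0 when k_i < 2."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    e = sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])
    return 2 * e / (k * (k - 1))

def _bfs_counts(adj, source):
    """Breadth-first search returning distances and shortest-path counts."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], sigma[v] = dist[u] + 1, 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, i):
    """Sum over unordered pairs s, t (both != i) of sigma_st(i) / sigma_st."""
    info = {s: _bfs_counts(adj, s) for s in adj}
    dist_i, sigma_i = info[i]
    total = 0.0
    for s, t in combinations([n for n in adj if n != i], 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s or i not in dist_s:
            continue  # s and t (or s and i) are not connected
        if dist_s[i] + dist_i[t] == dist_s[t]:   # i lies on a shortest s-t path
            total += sigma_s[i] * sigma_i[t] / sigma_s[t]
    return total

# Hypothetical network: hub A binds B, C and D; B and C also bind each other.
ppi = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
hub_betweenness = betweenness(ppi, "A")
```

In this example every shortest path to D runs through the hub A, so A has the highest betweenness while D, a leaf, has none.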
Summary of Key Omics Equations
| # | Equation | Omics Domain | Application |
|---|---|---|---|
| 1 | Smith-Waterman | Genomics | Local sequence alignment |
| 2 | Needleman-Wunsch | Genomics | Global sequence alignment |
| 3 | Fold Change (log₂FC) | Transcriptomics | Differential expression |
| 4 | Phred Quality Score | Genomics | Base call accuracy |
| 5 | BLAST E-value | Genomics | Alignment significance |
| 6 | Shannon Entropy | Genomics | Sequence conservation |
| 7 | Mass-to-Charge Ratio | Proteomics | Mass spectrometry |
| 8 | Benjamini-Hochberg FDR | All Omics | Multiple testing correction |
| 9 | PCA | All Omics | Dimensionality reduction |
| 10 | DESeq2 (Negative Binomial) | Transcriptomics | RNA-Seq DE analysis |
| 11 | Michaelis-Menten | Metabolomics | Pathway flux analysis |
| 12 | Network Topology | Interactomics | PPI network analysis |
What You Will Learn
Genomics
- Genome structure & chromatin organization
- NGS & third-generation sequencing
- De novo assembly & gene prediction
- GWAS & functional annotation
Transcriptomics
- Microarray design & normalization
- RNA-Seq: alignment, quantification, DE
- scRNA-seq clustering & trajectory
- Spatial transcriptomics methods
Proteomics
- 2D-PAGE & chromatography separation
- MALDI-TOF & ESI tandem MS
- Label-free & isotope-labeling quantification
- Protein-protein interaction networks
Metabolomics
- Targeted vs untargeted approaches
- NMR & LC-MS/GC-MS platforms
- Metabolic flux analysis (MFA)
- Clinical biomarker discovery