Module 8
Machine Learning in Genomics
Deep learning has reshaped several bioinformatics problems: variant-effect prediction, regulatory genomics, molecular property prediction, and protein design. DNA foundation models (DNABERT, Enformer, Nucleotide Transformer, HyenaDNA) and protein language models (ESM, ProGen) now serve as pretrained backbones that are fine-tuned or queried directly for these tasks.
1. Variant-Effect Prediction
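Classical tools (SIFT, PolyPhen-2, CADD) score variant deleteriousness from evolutionary conservation and structural features. Deep-learning predictors learn directly from sequence context: SpliceAI (Jaganathan 2019) uses a deep residual CNN over flanking DNA to predict splicing changes, Enformer (Avsec 2021) combines convolution with attention to predict regulatory impact, and AlphaMissense (Cheng 2023) builds on the AlphaFold architecture to score missense consequences from protein sequence. AlphaMissense scores all 71 M possible human missense variants and classifies 89% of them as either likely benign or likely pathogenic.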
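Most sequence-based predictors share one scoring pattern: run the model on the reference sequence, run it again with the alternate allele substituted, and report the difference. The sketch below illustrates that pattern only. The `toy_score` function and its 4-mer motif are hypothetical stand-ins for a trained model, not the scoring used by any of the tools named above.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) one-hot array."""
    idx = [BASES.index(b) for b in seq]
    return np.eye(4)[idx]

# Hypothetical stand-in for a trained model: a +1/-1 match matrix
# for the 4-mer motif "TGAC". A real predictor would replace toy_score.
MOTIF = "TGAC"
PWM = np.where(one_hot(MOTIF) == 1, 1.0, -1.0)

def toy_score(seq):
    """Best motif match over all windows of the sequence."""
    x = one_hot(seq)
    k = len(MOTIF)
    return max(float((x[i:i + k] * PWM).sum()) for i in range(len(seq) - k + 1))

def variant_effect(ref_seq, pos, alt):
    """Score(alt) - score(ref) for a substitution at 0-based position pos."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return toy_score(alt_seq) - toy_score(ref_seq)

ref = "AATGACGG"
disrupt = variant_effect(ref, 3, "T")  # G>T inside the motif (TGAC -> TTAC)
neutral = variant_effect(ref, 7, "A")  # G>A outside the motif

print(f"motif-disrupting variant: {disrupt:+.1f}")
print(f"variant outside motif:    {neutral:+.1f}")
```

A negative difference means the alternate allele weakens the predicted signal. Real tools apply the same ref-versus-alt comparison to far richer outputs, such as splice-site probabilities or predicted expression tracks.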
2. DNA Language Models
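DNABERT (Ji 2021) applied BERT to k-mer-tokenised DNA. Nucleotide Transformer (Dalla-Torre 2023) scales to 2.5 B parameters, with its multispecies model trained on 850 genomes. Enformer extends the receptive field to 100 kb by stacking attention layers on a convolutional stem, enabling long-range enhancer–gene link prediction. HyenaDNA (Nguyen, Poli et al. 2023) replaces attention with Hyena operators (implicit long convolutions, closely related to state-space models) to reach a context of up to one million bases at single-nucleotide resolution, approaching whole-gene regulatory windows.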
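To make the tokenisation and context-length trade-off concrete, here is a minimal sketch. The function names are illustrative and do not come from any of the libraries above. It splits a sequence into overlapping k-mers and counts the query-key pairs that one full self-attention layer would compare at several context lengths, which is the quadratic cost that motivates sub-quadratic operators for very long inputs.

```python
def kmer_tokenize(seq, k=6, stride=1):
    """Split a DNA string into k-mers; stride=1 gives overlapping tokens,
    stride=k gives non-overlapping tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def attention_pairs(n_tokens):
    """Query-key pairs compared by one full self-attention layer."""
    return n_tokens * n_tokens

tokens = kmer_tokenize("ATGCGTACGT", k=6)
print(tokens)

# Token counts and attention cost at three illustrative context lengths.
for context_bp in (512, 100_000, 1_000_000):
    n_tokens = context_bp - 6 + 1  # overlapping 6-mers
    print(f"{context_bp:>9} bp -> {n_tokens:>9} tokens, "
          f"{attention_pairs(n_tokens):.2e} attention pairs")
```

Moving from a 512 bp window to a 1 Mb window multiplies the number of attention pairs by several million, which is why million-base models avoid dense attention.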
3. Protein Language Models
ESM-1b (Rives 2020) was trained on 250 M protein sequences; its successor ESM-2 (Lin 2023) scales to 15 B parameters, and its representations encode structural and functional information without an explicit multiple sequence alignment (MSA). ESMFold (Lin 2023) predicts structures from single sequences, trading some accuracy for a large speed-up over AlphaFold2. Applications include zero-shot variant-effect prediction, directed evolution, and de novo protein design (inverse folding with ESM-IF, generative sequence design with ProGen).
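Zero-shot effect prediction with a protein language model is commonly framed as a log-likelihood ratio: how much less (or more) probable the model finds the mutant residue than the wild-type residue at the same position. The sketch below shows that calculation only. The `logits` array is a hypothetical placeholder for the per-position output of a real model, and the sequence is made up for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def zero_shot_score(logits, wt_seq, pos, mut_aa):
    """log p(mutant) - log p(wild type) at 0-based position pos."""
    log_probs = log_softmax(logits)
    wt_idx = AMINO_ACIDS.index(wt_seq[pos])
    mut_idx = AMINO_ACIDS.index(mut_aa)
    return float(log_probs[pos, mut_idx] - log_probs[pos, wt_idx])

# Hypothetical model output: uniform logits with a +5 bonus on the
# wild-type residue. A real workflow would take these from a trained model.
wt = "MKTAYIAK"
logits = np.zeros((len(wt), len(AMINO_ACIDS)))
for i, aa in enumerate(wt):
    logits[i, AMINO_ACIDS.index(aa)] += 5.0

score = zero_shot_score(logits, wt, 2, "P")  # T3P in 1-based notation
print(f"T3P log-likelihood ratio: {score:.2f}")
```

Strongly negative scores flag substitutions the model considers unlikely, which tends to correlate with loss of function. No labelled variant data is needed, hence "zero-shot".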
4. Graph Neural Networks for Chemistry
GNNs treat molecules as graphs with atoms as nodes and bonds as edges. Message-passing layers update each atom's representation by aggregating information from its neighbours, and a readout step pools the atom representations into a fixed-size molecular fingerprint. Applications: ADMET prediction (solubility, toxicity), binding-affinity prediction, reaction yield prediction, retrosynthesis planning. Chemprop (Yang 2019), SchNet, and MEGNet are widely used reference models. Pretrained foundation models for chemistry (MolFormer, Uni-Mol) are increasingly competitive with them.
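A minimal NumPy sketch of message passing followed by a sum readout is given below. It uses random, untrained weights and a simple atom-level update chosen for illustration. It is not the scheme used by Chemprop, SchNet, or MEGNet, and the toy molecule (the C-C-O heavy-atom skeleton of ethanol) and feature encoding are assumptions.

```python
import numpy as np

# Toy molecular graph: heavy atoms of ethanol, C-C-O.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
# One-hot atom features: [is_carbon, is_oxygen].
atom_features = np.array([[1, 0],
                          [1, 0],
                          [0, 1]], dtype=float)

def message_passing_step(h, adj, weight):
    """Each atom sums its own state with its neighbours' states,
    then applies a linear map and ReLU."""
    aggregated = h + adj @ h
    return np.maximum(aggregated @ weight, 0.0)

def readout(h):
    """Sum over atoms: a fixed-size vector regardless of molecule size."""
    return h.sum(axis=0)

rng = np.random.default_rng(0)
hidden = 4
w1 = rng.normal(size=(2, hidden))       # untrained illustrative weights
w2 = rng.normal(size=(hidden, hidden))

def fingerprint_of(features, adj):
    """Two rounds of message passing followed by the sum readout."""
    h = message_passing_step(features, adj, w1)
    h = message_passing_step(h, adj, w2)
    return readout(h)

fingerprint = fingerprint_of(atom_features, adjacency)
print(fingerprint.shape)

# Reordering the atoms must not change the fingerprint.
perm = [2, 0, 1]
permuted = fingerprint_of(atom_features[perm], adjacency[np.ix_(perm, perm)])
print(np.allclose(fingerprint, permuted))
```

After two rounds, each atom's state reflects atoms up to two bonds away. Because both the aggregation and the readout are sums, the result does not depend on atom ordering, which is the property that makes the vector usable as a molecular fingerprint.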
Key References
• Avsec, Z. et al. (2021). “Effective gene expression prediction from sequence by integrating long-range interactions.” Nat. Methods, 18, 1196–1203.
• Cheng, J. et al. (2023). “Accurate proteome-wide missense variant effect prediction with AlphaMissense.” Science, 381, eadg7492.
• Lin, Z. et al. (2023). “Evolutionary-scale prediction of atomic-level protein structure.” Science, 379, 1123–1130.
• Nguyen, E., Poli, M. et al. (2023). “HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution.” NeurIPS.