Module 8

Machine Learning in Genomics

Deep learning has reshaped several bioinformatics problems: variant-effect prediction, regulatory genomics, molecular property prediction, and protein design. DNA foundation models (DNABERT, Enformer, Nucleotide Transformer, HyenaDNA) and protein language models (ESM, ProGen) are becoming the shared starting point for these tasks.

1. Variant-Effect Prediction

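Classical tools (SIFT, PolyPhen-2, CADD) score variant deleteriousness from evolutionary conservation, structural features, and functional annotations. Deep-learning predictors (SpliceAI, Jaganathan 2019; Enformer, Avsec 2021; AlphaMissense, Cheng 2023) learn from sequence context instead: SpliceAI applies deep residual convolutions to predict splicing, Enformer adds attention over long DNA context to predict regulatory impact, and AlphaMissense builds on AlphaFold to score missense consequences. AlphaMissense provides pathogenicity scores for all 71 M possible human missense variants and is reported to outperform earlier predictors on clinical benchmarks.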
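To make the conservation idea concrete, here is a minimal sketch of a per-position conservation score computed from a multiple sequence alignment. It is not the algorithm used by SIFT, PolyPhen-2, or CADD; the toy alignment, the function name `column_conservation`, and the entropy-based score are assumptions chosen for illustration.

```python
import math
from collections import Counter

def column_conservation(column, alphabet_size=20):
    """Return 1 - normalised Shannon entropy for one alignment column.

    1.0 means the column is fully conserved; values near 0 mean the
    residues are spread almost uniformly across the alphabet.
    """
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)

# Hypothetical toy alignment: four homologous protein fragments of equal length.
alignment = ["MKVL", "MKIL", "MRVL", "MKAL"]

# Transpose the alignment so each string holds the residues seen at one position.
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
scores = [column_conservation(col) for col in columns]

for pos, (col, score) in enumerate(zip(columns, scores), start=1):
    print(f"position {pos}: residues={col} conservation={score:.3f}")
```

Under this toy score, a substitution at a fully conserved position (1 or 4) would be flagged as more suspicious than one at the variable position 3. Real tools combine such signals with many other features.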

2. DNA Language Models

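DNABERT (Ji 2021) applied BERT to k-mer-tokenised DNA. Nucleotide Transformer (Dalla-Torre 2023) scales to 2.5 B parameters, with its largest multispecies model trained on 850 genomes. Enformer extends the receptive field to about 100 kb by stacking attention layers on a convolutional stem, enabling long-range enhancer–gene link prediction. HyenaDNA (Nguyen 2023) replaces attention with Hyena long-convolution operators, a subquadratic approach related to state-space models, to reach million-base context at single-nucleotide resolution, approaching whole-gene regulatory windows.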
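The k-mer tokenisation step can be sketched in a few lines. This is an illustrative helper, not code from any of the models above: the function name `kmer_tokenize`, the `stride` parameter, and the toy sequence are assumptions. A stride of 1 yields overlapping k-mers, while a stride equal to k yields non-overlapping ones.

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA string into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped in this sketch.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

seq = "ACGTACGTAC"  # hypothetical 10-bp example
overlapping = kmer_tokenize(seq, k=3, stride=1)
non_overlapping = kmer_tokenize(seq, k=3, stride=3)

# Map each distinct k-mer to an integer id, as a model vocabulary would.
vocab = {kmer: idx for idx, kmer in enumerate(sorted(set(overlapping)))}
token_ids = [vocab[kmer] for kmer in overlapping]

print(overlapping)
print(non_overlapping)
print(token_ids)
```

Overlapping tokens keep single-base resolution but make the token sequence almost as long as the DNA itself; non-overlapping tokens shorten the sequence by a factor of k, which is one way to fit more bases into a fixed attention window.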

Simulation: Context-Length & Foundation Models

Python


import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; set before importing pyplot
import matplotlib.pyplot as plt

# All numbers below are stylised for illustration, not measured benchmark results.

# Receptive field (bp) vs long-range regulatory prediction accuracy:
# a saturating curve in log10(context).
context = np.array([100, 500, 1000, 5000, 10000, 50000, 100000, 200000])
acc = 0.45 + 0.35 * np.tanh((np.log10(context) - 2) / 1.5)

# Model labels with approximate context lengths
models = ['BERT (~512)', 'DNABERT (~512)', 'Enformer (100 kb)',
          'Nucleotide Transformer (12 kb)', 'HyenaDNA (1 M)']
params_M = [110, 100, 250, 500, 7]  # approximate parameters in millions (not plotted)
perf = [0.56, 0.62, 0.80, 0.72, 0.65]  # illustrative scores only

# Two panels on a dark background
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), facecolor='#0a0a1a')
for ax in (ax1, ax2):
    ax.set_facecolor('#111827')
    ax.tick_params(colors='#cbd5e1')
    for spine in ax.spines.values():
        spine.set_color('#334155')
    ax.grid(True, color='#334155', alpha=0.3)

# Left panel: accuracy as a function of receptive field (log x-axis)
ax1.semilogx(context, acc, color='#2dd4bf', lw=2.6, marker='o')
ax1.axvline(100000, color='#fbbf24', ls='--', label='Enformer context')
ax1.set_xlabel('Receptive field (bp)', color='#cbd5e1')
ax1.set_ylabel('Enhancer-gene link accuracy', color='#cbd5e1')
ax1.set_title('Context length matters for regulatory DNA',
              color='#5eead4', fontweight='bold')
ax1.legend(facecolor='#1e293b', edgecolor='#334155', labelcolor='#cbd5e1')

# Right panel: one bar per model, annotated with its score
ax2.bar(models, perf, color='#2dd4bf', edgecolor='#5eead4')
for i, v in enumerate(perf):
    ax2.text(i, v + 0.01, f'{v:.2f}', ha='center', color='#5eead4')
ax2.set_ylabel('Benchmark Pearson r', color='#cbd5e1')
ax2.set_title('DNA foundation-model benchmarks (stylised)',
              color='#5eead4', fontweight='bold')
plt.setp(ax2.get_xticklabels(), rotation=15, ha='right')

plt.tight_layout()
plt.savefig('output.png', dpi=120, bbox_inches='tight', facecolor='#0a0a1a')
print('Enformer (Avsec 2021) extended DNA context to 100 kb')
print('HyenaDNA (Nguyen 2023) extended to 1 M bp with Hyena long-convolution operators')


3. Protein Language Models

The ESM family (Rives 2020; ESM-2, Lin 2023) is trained with masked-language modelling on large protein sequence databases (250M sequences in Rives 2020); the largest ESM-2 model has 15B parameters and encodes structural and functional information without an explicit MSA. ESMFold (Lin 2023) predicts structures from single sequences, trading some accuracy for a large speed-up over AlphaFold2. Applications include zero-shot variant-effect prediction, directed evolution, and de novo protein design (ESM-IF for inverse folding, ProGen for generative design).
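Zero-shot effect prediction is often framed as a log-odds comparison: mask a position, ask the model for amino-acid probabilities, and compare the mutant residue with the wild-type one. The sketch below shows only that arithmetic. The `toy_probs` table is a hypothetical stand-in for real language-model output, and `masked_marginal_score` is an illustrative name, not an ESM API.

```python
import math

# Hypothetical per-position amino-acid probabilities, standing in for what a
# protein language model might return when each position is masked in turn.
toy_probs = [
    {"M": 0.90, "L": 0.05, "V": 0.05},
    {"K": 0.60, "R": 0.30, "E": 0.10},
    {"V": 0.50, "I": 0.40, "P": 0.10},
]

def masked_marginal_score(wild_type, position, mutant_residue, probs):
    """Log-odds of the mutant vs the wild-type residue at a 1-based position.

    Negative values mean the model finds the mutant less likely than wild type.
    """
    dist = probs[position - 1]
    wt_residue = wild_type[position - 1]
    return math.log(dist[mutant_residue]) - math.log(dist[wt_residue])

wild_type = "MKV"  # hypothetical three-residue protein
variants = [(2, "R"), (3, "I"), (3, "P")]
variant_scores = {
    f"{wild_type[pos - 1]}{pos}{mut}": masked_marginal_score(wild_type, pos, mut, toy_probs)
    for pos, mut in variants
}

for name, score in variant_scores.items():
    print(f"{name}: {score:.3f}")
```

With these made-up probabilities, the conservative substitution V3I scores closest to zero and V3P scores lowest. With a real model, the probability table would come from a forward pass over the masked sequence.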

4. Graph Neural Networks for Chemistry

GNNs treat molecules as graphs whose nodes are atoms and whose edges are bonds. Message-passing layers update each atom from its neighbours' features, and a readout (pooling) step then combines the atom vectors into a fixed-size molecular fingerprint. Applications include ADMET prediction (solubility, toxicity), binding-affinity prediction, reaction-yield prediction, and retrosynthesis planning. Chemprop (Yang 2019), SchNet, and MEGNet are widely used reference models. Foundation models for chemistry (MolFormer, Uni-Mol) are increasingly competitive on these tasks.
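A single message-passing step followed by a sum readout can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the Chemprop, SchNet, or MEGNet architecture: the three-atom graph, the two-column atom features, the random untrained weights, and the name `message_passing_layer` are all hypothetical.

```python
import numpy as np

# Hypothetical toy molecule: a 3-atom chain 0-1-2 (heavy atoms C-C-O).
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)

# One-hot atom features: columns = [is_carbon, is_oxygen].
features = np.array([[1, 0],
                     [1, 0],
                     [0, 1]], dtype=float)

# Random, untrained weights; a real model would learn these.
rng = np.random.default_rng(0)
hidden = 4
w_self = rng.normal(size=(2, hidden))
w_neigh = rng.normal(size=(2, hidden))

def message_passing_layer(h, adj, w_self, w_neigh):
    """Update each atom from its own features plus the sum of its neighbours'."""
    messages = adj @ h  # row i holds the summed features of atom i's neighbours
    return np.maximum(0.0, h @ w_self + messages @ w_neigh)  # ReLU

atom_states = message_passing_layer(features, adjacency, w_self, w_neigh)

# Sum readout: one fixed-size vector per molecule, whatever the atom count.
fingerprint = atom_states.sum(axis=0)
print(fingerprint.shape)
```

Because the readout is a sum over atoms, relabelling the atoms leaves the fingerprint unchanged, which is the property that lets one network handle molecules of any size and atom ordering.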

Key References

• Avsec, Z. et al. (2021). “Effective gene expression prediction from sequence by integrating long-range interactions.” Nat. Methods, 18, 1196–1203.

• Cheng, J. et al. (2023). “Accurate proteome-wide missense variant effect prediction with AlphaMissense.” Science, 381, eadg7492.

• Lin, Z. et al. (2023). “Evolutionary-scale prediction of atomic-level protein structure.” Science, 379, 1123–1130.

• Nguyen, E., Poli, M. et al. (2023). “HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution.” NeurIPS.
