BIO-7810 · Graduate Research Course · 48 Lectures / 16 weeks
Introduction to Bioinformatics
Sequence · Structure · Networks · Machine Learning
A rigorous graduate course covering the mathematical and algorithmic foundations of modern bioinformatics. From dynamic programming on sequences to deep learning for protein structure prediction, students develop both theoretical fluency and practical computational skills across genomics, transcriptomics, proteomics, and systems biology.
Lectures
48
16 weeks
Prerequisites
Lin. Alg.
Prob., Python
Track
CompBio
MSc / PhD Y1
Hands-on
25+
labs & sims
About This Course
Bioinformatics is the discipline that turns biological data into biological understanding through algorithms, statistics, and software. A single modern experiment — a mammalian genome, a single-cell RNA-seq atlas, an AlphaFold prediction run — can generate terabytes of data whose interpretation depends entirely on computational methods. This course treats the foundational algorithms (Needleman-Wunsch, Smith-Waterman, BLAST, de Bruijn assembly, profile HMMs, Viterbi, neighbour-joining, phylogenetic ML) with the same rigour as the modern deep-learning methods (Evoformer in AlphaFold2, DNA language models, graph neural networks).
Cross-links: Biochemistry,Molecular Biology,Statistics,Machine Learning,Cytoskeleton,Biogeography.
Key Equations
Needleman-Wunsch Recurrence
\( F(i,j) = \max\{F(i-1,j-1)+s, F(i-1,j)-d, F(i,j-1)-d\} \)
Smith-Waterman (local)
\( H(i,j) = \max\{0, H(i-1,j-1)+s, H(i-1,j)-d, H(i,j-1)-d\} \)
BLAST E-value (Karlin-Altschul)
\( E = K\,mn\,e^{-\lambda S} \)
Log-odds Score (BLOSUM/PAM)
\( s(a,b) = \log\!\left(p_{ab}/q_a q_b\right) \)
Viterbi on Profile HMM
\( v_k^M(i) = \max_{k'}\,v_{k'}(i-1)\,a_{k'k}\,e_k^M(x_i) \)
Negative-Binomial (DESeq2)
\( K_{ij} \sim \text{NB}(\mu_{ij},\alpha_i),\ \mu_{ij}=s_j\,q_{ij} \)
Nine Modules
M0
Biological Foundations & Data Ecosystems
Central dogma, genome architecture, biological databases (NCBI, UniProt, PDB, Ensembl). Data formats: FASTA, FASTQ, SAM/BAM, VCF, GFF3.
M1
Sequence Alignment & Dynamic Programming
Needleman-Wunsch, Smith-Waterman, BLAST heuristics. Affine gap penalties, Hirschberg linear-space, Karlin-Altschul E-values.
M2
Substitution Matrices & MSA
PAM and BLOSUM log-odds derivation, profile HMMs, ClustalW, MUSCLE, MAFFT, T-Coffee. HMMER and sequence family modelling.
M3
Genome Assembly & Comparative Genomics
De Bruijn graphs, overlap-layout-consensus, long-read assembly (ONT, PacBio). Synteny, rearrangements, whole-genome alignment.
M4
Transcriptomics & Gene Expression
RNA-seq pipeline: STAR/HISAT2 alignment, Salmon/kallisto quantification, DESeq2/edgeR differential expression. Single-cell RNA-seq and pseudotime.
M5
Protein Structure & AlphaFold2
Homology modelling, threading, ab initio. AlphaFold2 Evoformer + IPA structure module. Binding-site prediction, docking, confidence metrics (pLDDT, PAE).
M6
Phylogenetics & Molecular Evolution
Distance, parsimony, maximum likelihood and Bayesian phylogenetic inference. Molecular clocks, coalescent theory, ancestral sequence reconstruction.
M7
Systems Biology & Networks
Protein-protein interaction networks, gene regulatory networks, metabolic flux analysis. Topology, community detection, Boolean & ODE models.
M8
Machine Learning in Genomics
Variant effect prediction, DNA language models (DNABERT, Enformer), graph neural networks for molecular property prediction, foundation models.
Algorithmic Core — Complexity
| Algorithm | Problem | Time | Tool |
|---|---|---|---|
| Needleman-Wunsch | Global pairwise alignment | O(mn) | Biopython, EMBOSS Needle |
| Smith-Waterman | Local pairwise alignment | O(mn) | EMBOSS Water, SSW |
| BLAST | Heuristic DB search | O(n) avg | NCBI BLAST+ |
| Viterbi HMM | Gene prediction / profile | O(nK²) | HMMER, Augustus |
| de Bruijn Assembly | Genome assembly | O(n) | SPAdes, Velvet |
| Neighbor-Joining | Phylogenetic tree | O(n³) | PHYLIP, FastME |
| PageRank / centrality | Network hubs | O(V+E) | NetworkX, igraph |
Smith-Waterman in Python / NumPy
import numpy as np
from itertools import product
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
"""Local alignment with simple gap penalty."""
m, n = len(seq1), len(seq2)
H = np.zeros((m+1, n+1), dtype=np.int32)
for i, j in product(range(1, m+1), range(1, n+1)):
score = match if seq1[i-1] == seq2[j-1] else mismatch
H[i,j] = max(0,
H[i-1, j-1] + score, # diagonal
H[i-1, j] + gap, # up: gap in seq2
H[i, j-1] + gap, # left: gap in seq1
)
# Traceback from max cell
i0, j0 = divmod(H.argmax(), n+1)
a1, a2 = [], []
while H[i0, j0] > 0:
score_d = match if seq1[i0-1] == seq2[j0-1] else mismatch
if H[i0,j0] == H[i0-1,j0-1] + score_d:
a1.append(seq1[i0-1]); a2.append(seq2[j0-1]); i0 -= 1; j0 -= 1
elif H[i0,j0] == H[i0-1,j0] + gap:
a1.append(seq1[i0-1]); a2.append('-'); i0 -= 1
else:
a1.append('-'); a2.append(seq2[j0-1]); j0 -= 1
return ''.join(reversed(a1)), ''.join(reversed(a2)), int(H.max())
a1, a2, s = smith_waterman("ACGTACGT", "TACGTTCG")
print(f"Score: {s}\n{a1}\n{a2}")Core References
- [1] Durbin, Eddy, Krogh & Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
- [2] Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- [3] Love, M. I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
- [4] Bankevich, A. et al. (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comp. Biol., 19, 455–477.
- [5] Altschul, S. F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.
- [6] Yang, Z., Ji, Y. et al. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language. Bioinformatics, 37, 2112–2120.
- [7] Barabási, A.-L. & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.