BIO-7810 · Graduate Research Course · 48 Lectures / 16 weeks

Introduction to Bioinformatics

Sequence · Structure · Networks · Machine Learning

A rigorous graduate course covering the mathematical and algorithmic foundations of modern bioinformatics. From dynamic programming on sequences to deep learning for protein structure prediction, students develop both theoretical fluency and practical computational skills across genomics, transcriptomics, proteomics, and systems biology.

Lectures

48

16 weeks

Prerequisites

Lin. Alg.

Prob., Python

Track

CompBio

MSc / PhD Y1

Hands-on

25+

labs & sims

About This Course

Bioinformatics is the discipline that turns biological data into biological understanding through algorithms, statistics, and software. A single modern experiment — a mammalian genome, a single-cell RNA-seq atlas, an AlphaFold prediction run — can generate terabytes of data whose interpretation depends entirely on computational methods. This course treats the foundational algorithms (Needleman-Wunsch, Smith-Waterman, BLAST, de Bruijn assembly, profile HMMs, Viterbi, neighbour-joining, phylogenetic ML) with the same rigour as the modern deep-learning methods (Evoformer in AlphaFold2, DNA language models, graph neural networks).

Cross-links: Biochemistry · Molecular Biology · Statistics · Machine Learning · Cytoskeleton · Biogeography.

Key Equations

Needleman-Wunsch Recurrence

\( F(i,j) = \max\{F(i-1,j-1)+s, F(i-1,j)-d, F(i,j-1)-d\} \)

Smith-Waterman (local)

\( H(i,j) = \max\{0, H(i-1,j-1)+s, H(i-1,j)-d, H(i,j-1)-d\} \)

BLAST E-value (Karlin-Altschul)

\( E = K\,mn\,e^{-\lambda S} \)
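As a numeric sketch of the Karlin-Altschul formula above: E grows only linearly in the search-space size \(mn\) but shrinks exponentially in the raw score \(S\). The K and λ values below are illustrative defaults often quoted for gapped BLOSUM62 searches, not parameters taken from this course text; real searches use values fitted to the exact scoring system.

```python
import math

def blast_evalue(S, m, n, K=0.041, lam=0.267):
    """Karlin-Altschul E-value: expected number of alignments with
    score >= S between a query of length m and a database of length n.
    K and lam are illustrative (typical gapped BLOSUM62 values)."""
    return K * m * n * math.exp(-lam * S)

# Doubling the score S shrinks E exponentially,
# while doubling the database size n only doubles it.
E1 = blast_evalue(S=50, m=300, n=1_000_000)
E2 = blast_evalue(S=100, m=300, n=1_000_000)
```

This is why a fixed E-value threshold becomes a stricter score threshold as databases grow.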

Log-odds Score (BLOSUM/PAM)

\( s(a,b) = \log\!\left(p_{ab}/q_a q_b\right) \)
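The log-odds equation above can be evaluated directly: a residue pair observed in trusted alignments more often than the chance expectation \(q_a q_b\) scores positive, a rarer-than-chance pair scores negative. The frequencies and the half-bit scaling below are made-up illustrative numbers, not actual BLOSUM counts.

```python
import math

def log_odds(p_ab, q_a, q_b, base=2, scale=2):
    """s(a,b) = scale * log_base(p_ab / (q_a * q_b)), rounded to the
    nearest integer as published substitution matrices are.
    scale=2, base=2 mimics BLOSUM's half-bit units."""
    return round(scale * math.log(p_ab / (q_a * q_b), base))

# Pair seen ~4x more often than chance predicts -> positive score:
s_match = log_odds(p_ab=0.02, q_a=0.07, q_b=0.07)
# Pair seen less often than chance predicts -> negative score:
s_rare = log_odds(p_ab=0.001, q_a=0.05, q_b=0.05)
```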

Viterbi on Profile HMM

\( v_k^M(i) = \max_{k'}\,v_{k'}(i-1)\,a_{k'k}\,e_k^M(x_i) \)

Negative-Binomial (DESeq2)

\( K_{ij} \sim \text{NB}(\mu_{ij},\alpha_i),\ \mu_{ij}=s_j\,q_{ij} \)
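The mean-dispersion parameterization above implies variance \(\mu + \alpha\mu^2\): the \(\alpha\mu^2\) term is the biological overdispersion a Poisson model misses. A small simulation sketch (not DESeq2 itself) makes this visible; note NumPy's sampler uses an \((n, p)\) parameterization, so we convert via \(n = 1/\alpha\), \(p = n/(n+\mu)\).

```python
import numpy as np

def nb_counts(mu, alpha, size, rng):
    """Sample counts with NB mean mu and dispersion alpha, so that
    variance = mu + alpha * mu**2. Converts to numpy's (n, p)
    parameterization: n = 1/alpha, p = n / (n + mu)."""
    n = 1.0 / alpha
    p = n / (n + mu)
    return rng.negative_binomial(n, p, size)

rng = np.random.default_rng(0)
counts = nb_counts(mu=100.0, alpha=0.1, size=200_000, rng=rng)
# A Poisson model would predict variance == mean == 100;
# with alpha = 0.1 the NB variance is 100 + 0.1 * 100**2 = 1100.
print(counts.mean(), counts.var())
```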

Nine Modules

M0

Biological Foundations & Data Ecosystems

Central dogma, genome architecture, biological databases (NCBI, UniProt, PDB, Ensembl). Data formats: FASTA, FASTQ, SAM/BAM, VCF, GFF3.

NCBI · UniProt · Biopython
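FASTA, the simplest of the formats listed, is just a `>` header line followed by wrapped sequence lines. A minimal parser sketch makes the structure concrete; in practice Biopython's `SeqIO.parse` is the standard tool and handles the edge cases.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines.
    Minimal sketch of the format: '>' starts a record, subsequent
    non-empty lines are sequence chunks to be concatenated."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

records = list(parse_fasta([
    ">seq1 demo",
    "ACGT",
    "ACGT",
    ">seq2",
    "GGGA",
]))
# records == [("seq1 demo", "ACGTACGT"), ("seq2", "GGGA")]
```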

M1

Sequence Alignment & Dynamic Programming

Needleman-Wunsch, Smith-Waterman, BLAST heuristics. Affine gap penalties, Hirschberg linear-space, Karlin-Altschul E-values.

DP · BLAST · Core

M2

Substitution Matrices & MSA

PAM and BLOSUM log-odds derivation, profile HMMs, ClustalW, MUSCLE, MAFFT, T-Coffee. HMMER and sequence family modelling.

BLOSUM · HMMER · MAFFT

M3

Genome Assembly & Comparative Genomics

De Bruijn graphs, overlap-layout-consensus, long-read assembly (ONT, PacBio). Synteny, rearrangements, whole-genome alignment.

de Bruijn · SPAdes · MUMmer

M4

Transcriptomics & Gene Expression

RNA-seq pipeline: STAR/HISAT2 alignment, Salmon/kallisto quantification, DESeq2/edgeR differential expression. Single-cell RNA-seq and pseudotime.

RNA-seq · scRNA-seq · Trending

M5

Protein Structure & AlphaFold2

Homology modelling, threading, ab initio. AlphaFold2 Evoformer + IPA structure module. Binding-site prediction, docking, confidence metrics (pLDDT, PAE).

AlphaFold2 · PyMOL · Deep Learning

M6

Phylogenetics & Molecular Evolution

Distance, parsimony, maximum likelihood and Bayesian phylogenetic inference. Molecular clocks, coalescent theory, ancestral sequence reconstruction.

IQ-TREE · BEAST2 · PAML
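Distance-based inference starts from corrected pairwise distances. The standard Jukes-Cantor correction, \(d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)\) (a textbook formula, not quoted in this text), converts the observed fraction of differing sites \(p\) into an evolutionary distance that accounts for multiple substitutions at the same site:

```python
import math

def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance between two aligned DNA sequences:
    d = -(3/4) * ln(1 - (4/3) * p), with p the fraction of mismatched
    sites. Assumes equal base frequencies and one substitution rate."""
    assert len(seq1) == len(seq2)
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:
        return float("inf")   # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)

# The correction inflates the distance relative to raw p (0.2 here),
# since repeated hits at one site hide earlier substitutions:
d = jukes_cantor("ACGTACGTAC", "ACGTTCGTAA")
```

Matrices of such distances feed directly into neighbour-joining or FastME.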

M7

Systems Biology & Networks

Protein-protein interaction networks, gene regulatory networks, metabolic flux analysis. Topology, community detection, Boolean & ODE models.

Cytoscape · COBRA · Systems

M8

Machine Learning in Genomics

Variant effect prediction, DNA language models (DNABERT, Enformer), graph neural networks for molecular property prediction, foundation models.

Transformers · GNN · AI/ML

Algorithmic Core — Complexity

Algorithm             | Problem                   | Time     | Tool
----------------------|---------------------------|----------|-------------------------
Needleman-Wunsch      | Global pairwise alignment | O(mn)    | Biopython, EMBOSS Needle
Smith-Waterman        | Local pairwise alignment  | O(mn)    | EMBOSS Water, SSW
BLAST                 | Heuristic DB search       | O(n) avg | NCBI BLAST+
Viterbi HMM           | Gene prediction / profile | O(nK²)   | HMMER, Augustus
de Bruijn assembly    | Genome assembly           | O(n)     | SPAdes, Velvet
Neighbor-Joining      | Phylogenetic tree         | O(n³)    | PHYLIP, FastME
PageRank / centrality | Network hubs              | O(V+E)   | NetworkX, igraph
Needleman-WunschGlobal pairwise alignmentO(mn)Biopython, EMBOSS Needle
Smith-WatermanLocal pairwise alignmentO(mn)EMBOSS Water, SSW
BLASTHeuristic DB searchO(n) avgNCBI BLAST+
Viterbi HMMGene prediction / profileO(nK²)HMMER, Augustus
de Bruijn AssemblyGenome assemblyO(n)SPAdes, Velvet
Neighbor-JoiningPhylogenetic treeO(n³)PHYLIP, FastME
PageRank / centralityNetwork hubsO(V+E)NetworkX, igraph
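The linear-time assembly entry deserves a concrete picture: in a de Bruijn graph the nodes are (k-1)-mers, each k-mer in a read contributes an edge, and assembly reduces to finding an Eulerian path. A toy sketch (error-free reads, a greedy walk in place of a full Eulerian-path routine; real assemblers like SPAdes add error correction, multiple k, and graph simplification):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: for every k-mer in every read,
    add an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedy walk consuming one edge at a time; each step appends the
    last base of the next node. Sufficient for this branch-free toy."""
    g = {u: list(vs) for u, vs in graph.items()}
    contig, node = start, start
    while g.get(node):
        node = g[node].pop(0)
        contig += node[-1]
    return contig

reads = ["ACGTAC", "GTACGG", "TACGGA"]   # overlapping reads of ACGTACGGA
g = de_bruijn(reads, k=4)
print(walk(g, "ACG"))                    # -> "ACGTACGGA"
```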

Smith-Waterman in Python / NumPy

import numpy as np
from itertools import product

def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment with simple gap penalty."""
    m, n = len(seq1), len(seq2)
    H = np.zeros((m+1, n+1), dtype=np.int32)

    for i, j in product(range(1, m+1), range(1, n+1)):
        score = match if seq1[i-1] == seq2[j-1] else mismatch
        H[i,j] = max(0,
            H[i-1, j-1] + score,    # diagonal
            H[i-1, j]   + gap,      # up: gap in seq2
            H[i,   j-1] + gap,      # left: gap in seq1
        )

    # Traceback from max cell
    i0, j0 = divmod(H.argmax(), n+1)
    a1, a2 = [], []
    while H[i0, j0] > 0:
        score_d = match if seq1[i0-1] == seq2[j0-1] else mismatch
        if H[i0,j0] == H[i0-1,j0-1] + score_d:
            a1.append(seq1[i0-1]); a2.append(seq2[j0-1]); i0 -= 1; j0 -= 1
        elif H[i0,j0] == H[i0-1,j0] + gap:
            a1.append(seq1[i0-1]); a2.append('-'); i0 -= 1
        else:
            a1.append('-'); a2.append(seq2[j0-1]); j0 -= 1
    return ''.join(reversed(a1)), ''.join(reversed(a2)), int(H.max())

a1, a2, s = smith_waterman("ACGTACGT", "TACGTTCG")
print(f"Score: {s}\n{a1}\n{a2}")

Core References

  • [1] Durbin, Eddy, Krogh & Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
  • [2] Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
  • [3] Love, M. I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
  • [4] Bankevich, A. et al. (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comp. Biol., 19, 455–477.
  • [5] Altschul, S. F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.
  • [6] Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37, 2112–2120.
  • [7] Barabási, A.-L. & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.