BIO-7810 · Graduate Research Course · 48 Lectures / 16 weeks

Introduction to Bioinformatics

Sequence · Structure · Networks · Machine Learning

A rigorous graduate course covering the mathematical and algorithmic foundations of modern bioinformatics. From dynamic programming on sequences to deep learning for protein structure prediction, students develop both theoretical fluency and practical computational skills across genomics, transcriptomics, proteomics, and systems biology.

Lectures

48

16 weeks

Prerequisites

Lin. Alg.

Prob., Python

Track

CompBio

MSc / PhD Y1

Hands-on

25+

labs & sims

About This Course

Bioinformatics is the discipline that turns biological data into biological understanding through algorithms, statistics, and software. A single modern experiment — a mammalian genome, a single-cell RNA-seq atlas, an AlphaFold prediction run — can generate terabytes of data whose interpretation depends entirely on computational methods. This course treats the foundational algorithms (Needleman-Wunsch, Smith-Waterman, BLAST, de Bruijn assembly, profile HMMs, Viterbi, neighbour-joining, phylogenetic ML) with the same rigour as the modern deep-learning methods (Evoformer in AlphaFold2, DNA language models, graph neural networks).

Cross-links: Biochemistry · Molecular Biology · Statistics · Machine Learning · Cytoskeleton · Biogeography.

Key Equations

Needleman-Wunsch Recurrence

\( F(i,j) = \max\{F(i-1,j-1)+s, F(i-1,j)-d, F(i,j-1)-d\} \)

Smith-Waterman (local)

\( H(i,j) = \max\{0, H(i-1,j-1)+s, H(i-1,j)-d, H(i,j-1)-d\} \)

BLAST E-value (Karlin-Altschul)

\( E = K\,mn\,e^{-\lambda S} \)
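As a numeric sketch of the Karlin-Altschul formula above: E grows only linearly in the search-space size \(mn\) but shrinks exponentially in the raw score \(S\). The K and λ values below are illustrative defaults often quoted for gapped BLOSUM62 searches, not parameters taken from this course text; real searches use values fitted to the exact scoring system.

```python
import math

def blast_evalue(S, m, n, K=0.041, lam=0.267):
    """Karlin-Altschul E-value: expected number of alignments with
    score >= S between a query of length m and a database of length n.
    K and lam are illustrative (typical gapped BLOSUM62 values)."""
    return K * m * n * math.exp(-lam * S)

# Doubling the score S shrinks E exponentially,
# while doubling the database size n only doubles it.
E1 = blast_evalue(S=50, m=300, n=1_000_000)
E2 = blast_evalue(S=100, m=300, n=1_000_000)
```

This is why a fixed E-value threshold becomes a stricter score threshold as databases grow.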

Log-odds Score (BLOSUM/PAM)

\( s(a,b) = \log\!\left(p_{ab}/q_a q_b\right) \)
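The log-odds equation above can be evaluated directly: a residue pair observed in trusted alignments more often than the chance expectation \(q_a q_b\) scores positive, a rarer-than-chance pair scores negative. The frequencies and the half-bit scaling below are made-up illustrative numbers, not actual BLOSUM counts.

```python
import math

def log_odds(p_ab, q_a, q_b, base=2, scale=2):
    """s(a,b) = scale * log_base(p_ab / (q_a * q_b)), rounded to the
    nearest integer as published substitution matrices are.
    scale=2, base=2 mimics BLOSUM's half-bit units."""
    return round(scale * math.log(p_ab / (q_a * q_b), base))

# Pair seen ~4x more often than chance predicts -> positive score:
s_match = log_odds(p_ab=0.02, q_a=0.07, q_b=0.07)
# Pair seen less often than chance predicts -> negative score:
s_rare = log_odds(p_ab=0.001, q_a=0.05, q_b=0.05)
```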

Viterbi on Profile HMM

\( v_k^M(i) = \max_{k'}\,v_{k'}(i-1)\,a_{k'k}\,e_k^M(x_i) \)

Negative-Binomial (DESeq2)

\( K_{ij} \sim \text{NB}(\mu_{ij},\alpha_i),\ \mu_{ij}=s_j\,q_{ij} \)
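The mean-dispersion parameterization above implies variance \(\mu + \alpha\mu^2\): the \(\alpha\mu^2\) term is the biological overdispersion a Poisson model misses. A small simulation sketch (not DESeq2 itself) makes this visible; note NumPy's sampler uses an \((n, p)\) parameterization, so we convert via \(n = 1/\alpha\), \(p = n/(n+\mu)\).

```python
import numpy as np

def nb_counts(mu, alpha, size, rng):
    """Sample counts with NB mean mu and dispersion alpha, so that
    variance = mu + alpha * mu**2. Converts to numpy's (n, p)
    parameterization: n = 1/alpha, p = n / (n + mu)."""
    n = 1.0 / alpha
    p = n / (n + mu)
    return rng.negative_binomial(n, p, size)

rng = np.random.default_rng(0)
counts = nb_counts(mu=100.0, alpha=0.1, size=200_000, rng=rng)
# A Poisson model would predict variance == mean == 100;
# with alpha = 0.1 the NB variance is 100 + 0.1 * 100**2 = 1100.
print(counts.mean(), counts.var())
```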

Nine Modules

M0

Biological Foundations & Data Ecosystems

Central dogma, genome architecture, biological databases (NCBI, UniProt, PDB, Ensembl). Data formats: FASTA, FASTQ, SAM/BAM, VCF, GFF3.

NCBI · UniProt · Biopython
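FASTA, the simplest of the formats listed, is just a `>` header line followed by wrapped sequence lines. A minimal parser sketch makes the structure concrete; in practice Biopython's `SeqIO.parse` is the standard tool and handles the edge cases.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines.
    Minimal sketch of the format: '>' starts a record, subsequent
    non-empty lines are sequence chunks to be concatenated."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

records = list(parse_fasta([
    ">seq1 demo",
    "ACGT",
    "ACGT",
    ">seq2",
    "GGGA",
]))
# records == [("seq1 demo", "ACGTACGT"), ("seq2", "GGGA")]
```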

M1

Sequence Alignment & Dynamic Programming

Needleman-Wunsch, Smith-Waterman, BLAST heuristics. Affine gap penalties, Hirschberg linear-space, Karlin-Altschul E-values.

DP · BLAST · Core

M2

Substitution Matrices & MSA

PAM and BLOSUM log-odds derivation, profile HMMs, ClustalW, MUSCLE, MAFFT, T-Coffee. HMMER and sequence family modelling.

BLOSUM · HMMER · MAFFT

M3

Genome Assembly & Comparative Genomics

De Bruijn graphs, overlap-layout-consensus, long-read assembly (ONT, PacBio). Synteny, rearrangements, whole-genome alignment.

de Bruijn · SPAdes · MUMmer

M4

Transcriptomics & Gene Expression

RNA-seq pipeline: STAR/HISAT2 alignment, Salmon/kallisto quantification, DESeq2/edgeR differential expression. Single-cell RNA-seq and pseudotime.

RNA-seq · scRNA-seq · Trending

M5

Protein Structure & AlphaFold2

Homology modelling, threading, ab initio. AlphaFold2 Evoformer + IPA structure module. Binding-site prediction, docking, confidence metrics (pLDDT, PAE).

AlphaFold2 · PyMOL · Deep Learning

M6

Phylogenetics & Molecular Evolution

Distance, parsimony, maximum likelihood and Bayesian phylogenetic inference. Molecular clocks, coalescent theory, ancestral sequence reconstruction.

IQ-TREE · BEAST2 · PAML
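Distance-based inference starts from corrected pairwise distances. The standard Jukes-Cantor correction, \(d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)\) (a textbook formula, not quoted in this text), converts the observed fraction of differing sites \(p\) into an evolutionary distance that accounts for multiple substitutions at the same site:

```python
import math

def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance between two aligned DNA sequences:
    d = -(3/4) * ln(1 - (4/3) * p), with p the fraction of mismatched
    sites. Assumes equal base frequencies and one substitution rate."""
    assert len(seq1) == len(seq2)
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:
        return float("inf")   # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)

# The correction inflates the distance relative to raw p (0.2 here),
# since repeated hits at one site hide earlier substitutions:
d = jukes_cantor("ACGTACGTAC", "ACGTTCGTAA")
```

Matrices of such distances feed directly into neighbour-joining or FastME.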

M7

Systems Biology & Networks

Protein-protein interaction networks, gene regulatory networks, metabolic flux analysis. Topology, community detection, Boolean & ODE models.

Cytoscape · COBRA · Systems

M8

Machine Learning in Genomics

Variant effect prediction, DNA language models (DNABERT, Enformer), graph neural networks for molecular property prediction, foundation models.

Transformers · GNN · AI/ML

Algorithmic Core — Complexity

Algorithm             | Problem                   | Time     | Tool
----------------------|---------------------------|----------|-------------------------
Needleman-Wunsch      | Global pairwise alignment | O(mn)    | Biopython, EMBOSS Needle
Smith-Waterman        | Local pairwise alignment  | O(mn)    | EMBOSS Water, SSW
BLAST                 | Heuristic DB search       | O(n) avg | NCBI BLAST+
Viterbi HMM           | Gene prediction / profile | O(nK²)   | HMMER, Augustus
de Bruijn assembly    | Genome assembly           | O(n)     | SPAdes, Velvet
Neighbor-Joining      | Phylogenetic tree         | O(n³)    | PHYLIP, FastME
PageRank / centrality | Network hubs              | O(V+E)   | NetworkX, igraph
Needleman-WunschGlobal pairwise alignmentO(mn)Biopython, EMBOSS Needle
Smith-WatermanLocal pairwise alignmentO(mn)EMBOSS Water, SSW
BLASTHeuristic DB searchO(n) avgNCBI BLAST+
Viterbi HMMGene prediction / profileO(nK²)HMMER, Augustus
de Bruijn AssemblyGenome assemblyO(n)SPAdes, Velvet
Neighbor-JoiningPhylogenetic treeO(n³)PHYLIP, FastME
PageRank / centralityNetwork hubsO(V+E)NetworkX, igraph
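The linear-time assembly entry deserves a concrete picture: in a de Bruijn graph the nodes are (k-1)-mers, each k-mer in a read contributes an edge, and assembly reduces to finding an Eulerian path. A toy sketch (error-free reads, a greedy walk in place of a full Eulerian-path routine; real assemblers like SPAdes add error correction, multiple k, and graph simplification):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: for every k-mer in every read,
    add an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedy walk consuming one edge at a time; each step appends the
    last base of the next node. Sufficient for this branch-free toy."""
    g = {u: list(vs) for u, vs in graph.items()}
    contig, node = start, start
    while g.get(node):
        node = g[node].pop(0)
        contig += node[-1]
    return contig

reads = ["ACGTAC", "GTACGG", "TACGGA"]   # overlapping reads of ACGTACGGA
g = de_bruijn(reads, k=4)
print(walk(g, "ACG"))                    # -> "ACGTACGGA"
```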

Smith-Waterman in Python / NumPy

import numpy as np
from itertools import product

def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment with simple gap penalty."""
    m, n = len(seq1), len(seq2)
    H = np.zeros((m+1, n+1), dtype=np.int32)

    for i, j in product(range(1, m+1), range(1, n+1)):
        score = match if seq1[i-1] == seq2[j-1] else mismatch
        H[i,j] = max(0,
            H[i-1, j-1] + score,    # diagonal
            H[i-1, j]   + gap,      # up: gap in seq2
            H[i,   j-1] + gap,      # left: gap in seq1
        )

    # Traceback from max cell
    i0, j0 = divmod(H.argmax(), n+1)
    a1, a2 = [], []
    while H[i0, j0] > 0:
        score_d = match if seq1[i0-1] == seq2[j0-1] else mismatch
        if H[i0,j0] == H[i0-1,j0-1] + score_d:
            a1.append(seq1[i0-1]); a2.append(seq2[j0-1]); i0 -= 1; j0 -= 1
        elif H[i0,j0] == H[i0-1,j0] + gap:
            a1.append(seq1[i0-1]); a2.append('-'); i0 -= 1
        else:
            a1.append('-'); a2.append(seq2[j0-1]); j0 -= 1
    return ''.join(reversed(a1)), ''.join(reversed(a2)), int(H.max())

a1, a2, s = smith_waterman("ACGTACGT", "TACGTTCG")
print(f"Score: {s}\n{a1}\n{a2}")

Core References

  • [1] Durbin, Eddy, Krogh & Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
  • [2] Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
  • [3] Love, M. I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
  • [4] Bankevich, A. et al. (2012). SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comp. Biol., 19, 455–477.
  • [5] Altschul, S. F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402.
  • [6] Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. (2021). DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37, 2112–2120.
  • [7] Barabási, A.-L. & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.