Computational Biology & Molecular Simulation

Bridging bioinformatics, molecular dynamics, and structural biology through computational methods. From sequence analysis to atomic-level simulations of biomolecular systems.

🧬 Sequence Analysis📐 Structure Prediction⚛️ Molecular Dynamics🔬 Bioinformatics

Computational Biology Framework

Computational biology integrates data from multiple sources—genomic databases, structural repositories, spectroscopic measurements—to build dynamic 3D atomic models of biological systems. This multidisciplinary field combines bioinformatics for data analysis with molecular simulations for understanding dynamic processes.

Key Data Sources & Methods

Sequence Databases

  • • GenomeDB/UniProt
  • • Homology information
  • • Evolutionary analysis

Structural Data

  • • PDB (X-ray, NMR, Cryo-EM)
  • • Atomic coordinates
  • • Co-evolution analysis

Biophysical Data

  • • FRET, Biophysics
  • • SAXS, SANS
  • • 3DEM volume

Spectroscopy

  • • NMR (chemical shifts)
  • • BMRB database
  • • Distance constraints

Cross-Linking

  • • Cross-link & MS
  • • Distance information
  • • Protein interactions

Diffraction

  • • Crystallography
  • • SASBDB (SAXS/SANS)
  • • Shape information

Molecular Dynamics Simulations

Molecular dynamics (MD) simulations integrate Newton's equations of motion for every atom in a biomolecular system, revealing dynamic behavior, conformational changes, and interaction mechanisms at atomic resolution.

Molecular Dynamics: Protein in Water Box

Simple MD simulation framework for a protein system using Verlet integration

python
import numpy as np
import matplotlib.pyplot as plt

# Physical constants
kB = 1.381e-23  # Boltzmann constant (J/K)
NA = 6.022e23   # Avogadro's number

class MDSimulation:
    """Simple molecular dynamics simulation framework"""

    def __init__(self, n_atoms, box_size, dt=1e-15):
        """
        Initialize MD simulation

        Parameters:
        n_atoms: number of atoms
        box_size: simulation box size (nm)
        dt: timestep (seconds, default 1 fs)
        """
        self.n_atoms = n_atoms
        self.box_size = box_size * 1e-9  # Convert to meters
        self.dt = dt

        # Initialize positions randomly in box
        self.positions = np.random.rand(n_atoms, 3) * self.box_size

        # Initialize velocities (Maxwell-Boltzmann at 300K)
        self.velocities = self.maxwell_boltzmann_velocities(300)

        # Initialize forces
        self.forces = np.zeros((n_atoms, 3))

        # Energy tracking
        self.kinetic_energy = []
        self.potential_energy = []
        self.total_energy = []

    def maxwell_boltzmann_velocities(self, T, mass=1.67e-27):
        """Initialize velocities from Maxwell-Boltzmann distribution"""
        sigma = np.sqrt(kB * T / mass)
        velocities = np.random.normal(0, sigma, (self.n_atoms, 3))
        # Remove center of mass motion
        velocities -= np.mean(velocities, axis=0)
        return velocities

    def lennard_jones_force(self, epsilon=0.996e-21, sigma=3.4e-10):
        """
        Calculate Lennard-Jones forces between all atom pairs
        V(r) = 4ε[(σ/r)^12 - (σ/r)^6]
        """
        forces = np.zeros_like(self.positions)
        potential = 0.0

        for i in range(self.n_atoms):
            for j in range(i+1, self.n_atoms):
                # Distance vector with periodic boundary conditions
                rij = self.positions[i] - self.positions[j]
                rij -= self.box_size * np.round(rij / self.box_size)

                r = np.linalg.norm(rij)

                if r < 2.5 * sigma:  # Cutoff distance
                    # LJ potential and force
                    sr6 = (sigma / r)**6
                    sr12 = sr6 * sr6

                    potential += 4 * epsilon * (sr12 - sr6)

                    force_mag = 24 * epsilon * (2*sr12 - sr6) / r**2
                    force_vec = force_mag * rij

                    forces[i] += force_vec
                    forces[j] -= force_vec

        return forces, potential

    def velocity_verlet_step(self):
        """Velocity Verlet integration step"""
        # Update positions: r(t+dt) = r(t) + v(t)*dt + 0.5*a(t)*dt^2
        self.positions += self.velocities * self.dt + 0.5 * self.forces * self.dt**2 / 1.67e-27

        # Apply periodic boundary conditions
        self.positions = np.mod(self.positions, self.box_size)

        # Calculate new forces
        old_forces = self.forces.copy()
        self.forces, potential = self.lennard_jones_force()

        # Update velocities: v(t+dt) = v(t) + 0.5*[a(t) + a(t+dt)]*dt
        self.velocities += 0.5 * (old_forces + self.forces) * self.dt / 1.67e-27

        # Calculate energies
        kinetic = 0.5 * 1.67e-27 * np.sum(self.velocities**2)

        return kinetic, potential

    def run(self, n_steps, sample_interval=10):
        """Run MD simulation"""
        print(f"Starting MD simulation with {self.n_atoms} atoms")
        print(f"Box size: {self.box_size*1e9:.2f} nm")
        print(f"Timestep: {self.dt*1e15:.2f} fs")
        print(f"Total time: {n_steps*self.dt*1e12:.2f} ps")
        print("=" * 60)

        for step in range(n_steps):
            kinetic, potential = self.velocity_verlet_step()

            if step % sample_interval == 0:
                self.kinetic_energy.append(kinetic / 1.602e-19)  # Convert to eV
                self.potential_energy.append(potential / 1.602e-19)
                self.total_energy.append((kinetic + potential) / 1.602e-19)

                if step % (n_steps // 10) == 0:
                    T_inst = 2 * kinetic / (3 * self.n_atoms * kB)
                    print(f"Step {step:6d}: T = {T_inst:6.1f} K, "
                          f"E_total = {(kinetic+potential)/1.602e-19:8.3f} eV")

# Run simulation
md = MDSimulation(n_atoms=100, box_size=5.0, dt=1e-15)  # 5 nm box, 1 fs timestep
md.run(n_steps=5000, sample_interval=10)

# Plot results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

# Energy evolution
time = np.arange(len(md.total_energy)) * 10 * 1e-15 * 1e12  # ps
ax1.plot(time, md.kinetic_energy, 'r-', label='Kinetic', alpha=0.7)
ax1.plot(time, md.potential_energy, 'b-', label='Potential', alpha=0.7)
ax1.plot(time, md.total_energy, 'k-', label='Total', linewidth=2)
ax1.set_xlabel('Time (ps)')
ax1.set_ylabel('Energy (eV)')
ax1.set_title('Energy Conservation in MD Simulation')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Temperature evolution
temperatures = 2 * np.array(md.kinetic_energy) * 1.602e-19 / (3 * md.n_atoms * kB)
ax2.plot(time, temperatures, 'g-', linewidth=1.5)
ax2.axhline(y=300, color='r', linestyle='--', label='Target (300 K)')
ax2.set_xlabel('Time (ps)')
ax2.set_ylabel('Temperature (K)')
ax2.set_title('Temperature Fluctuations')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('md_simulation.png', dpi=150, bbox_inches='tight')
print("\nSimulation complete! Plot saved as 'md_simulation.png'")

💡 This example demonstrates the computational approach to solving physics problems

🚀 High-Performance Fortran Implementation

Fortran excels at MD simulations due to array operations, numerical precision, and compiler optimizations. Production codes like GROMACS and AMBER use Fortran/C++ for performance-critical components.

Molecular Dynamics in Fortran 90: High-Performance Implementation

Optimized MD simulation with Lennard-Jones potential and Velocity Verlet integration

fortran
program md_simulation
  implicit none

  ! Physical constants
  real(8), parameter :: kB = 1.381d-23      ! Boltzmann constant (J/K)
  real(8), parameter :: mass = 1.67d-27     ! Particle mass (kg)
  real(8), parameter :: epsilon = 0.996d-21 ! LJ epsilon (J)
  real(8), parameter :: sigma = 3.4d-10     ! LJ sigma (m)

  ! Simulation parameters
  integer, parameter :: n_atoms = 100
  integer, parameter :: n_steps = 5000
  integer, parameter :: sample_interval = 10
  real(8), parameter :: dt = 1.0d-15        ! Timestep (1 fs)
  real(8), parameter :: box_size = 5.0d-9   ! Box size (5 nm)
  real(8), parameter :: cutoff = 2.5d0 * sigma
  real(8), parameter :: target_temp = 300.0d0

  ! Arrays
  real(8), dimension(3, n_atoms) :: positions, velocities, forces, old_forces
  real(8), dimension(n_steps/sample_interval) :: kinetic_E, potential_E, total_E
  real(8) :: kinetic, potential, temperature
  integer :: i, step, sample_idx

  ! Initialize simulation
  call initialize_system()

  ! Main MD loop
  print *, 'Starting Fortran MD simulation'
  print *, 'Atoms:', n_atoms, '  Box size:', box_size*1d9, 'nm'
  print *, 'Timestep:', dt*1d15, 'fs  Total time:', n_steps*dt*1d12, 'ps'
  print *, '============================================================'

  sample_idx = 1
  do step = 0, n_steps-1
    ! Velocity Verlet integration
    call velocity_verlet_step(kinetic, potential)

    ! Sample and output
    if (mod(step, sample_interval) == 0) then
      kinetic_E(sample_idx) = kinetic
      potential_E(sample_idx) = potential
      total_E(sample_idx) = kinetic + potential
      sample_idx = sample_idx + 1

      if (mod(step, n_steps/10) == 0) then
        temperature = 2.0d0 * kinetic / (3.0d0 * n_atoms * kB)
        print '(A,I6,A,F7.1,A,F10.3,A)', &
          'Step', step, ': T =', temperature, ' K, E_total =', &
          (kinetic+potential)/1.602d-19, ' eV'
      end if
    end if
  end do

  print *, ''
  print *, 'Simulation complete!'
  call print_statistics()

contains

  subroutine initialize_system()
    ! Initialize positions randomly in box
    call random_number(positions)
    positions = positions * box_size

    ! Initialize velocities (Maxwell-Boltzmann distribution)
    call initialize_velocities(target_temp)

    ! Calculate initial forces
    call calculate_forces(potential)
  end subroutine initialize_system

  subroutine initialize_velocities(T)
    real(8), intent(in) :: T
    real(8) :: sigma_v, rand_normal
    integer :: i, j

    sigma_v = sqrt(kB * T / mass)

    ! Generate Gaussian random velocities
    do i = 1, n_atoms
      do j = 1, 3
        call random_normal(rand_normal)
        velocities(j, i) = sigma_v * rand_normal
      end do
    end do

    ! Remove center of mass motion
    do j = 1, 3
      velocities(j, :) = velocities(j, :) - sum(velocities(j, :)) / n_atoms
    end do
  end subroutine initialize_velocities

  subroutine random_normal(x)
    ! Box-Muller transform for Gaussian random numbers
    real(8), intent(out) :: x
    real(8) :: u1, u2
    call random_number(u1)
    call random_number(u2)
    x = sqrt(-2.0d0 * log(u1)) * cos(2.0d0 * 3.14159265359d0 * u2)
  end subroutine random_normal

  subroutine calculate_forces(pot_energy)
    real(8), intent(out) :: pot_energy
    real(8), dimension(3) :: rij, force_vec
    real(8) :: r, r2, sr6, sr12, force_mag
    integer :: i, j, k

    forces = 0.0d0
    pot_energy = 0.0d0

    ! Loop over all atom pairs
    do i = 1, n_atoms-1
      do j = i+1, n_atoms
        ! Distance vector with periodic boundary conditions
        rij = positions(:, i) - positions(:, j)

        ! Apply minimum image convention
        do k = 1, 3
          if (rij(k) > box_size/2.0d0) rij(k) = rij(k) - box_size
          if (rij(k) < -box_size/2.0d0) rij(k) = rij(k) + box_size
        end do

        r2 = dot_product(rij, rij)
        r = sqrt(r2)

        ! Only calculate if within cutoff
        if (r < cutoff) then
          sr6 = (sigma / r)**6
          sr12 = sr6 * sr6

          ! Lennard-Jones potential
          pot_energy = pot_energy + 4.0d0 * epsilon * (sr12 - sr6)

          ! Lennard-Jones force
          force_mag = 24.0d0 * epsilon * (2.0d0*sr12 - sr6) / r2
          force_vec = force_mag * rij

          forces(:, i) = forces(:, i) + force_vec
          forces(:, j) = forces(:, j) - force_vec
        end if
      end do
    end do
  end subroutine calculate_forces

  subroutine velocity_verlet_step(kin_energy, pot_energy)
    real(8), intent(out) :: kin_energy, pot_energy
    integer :: i

    ! Update positions: r(t+dt) = r(t) + v(t)*dt + 0.5*a(t)*dt^2
    positions = positions + velocities * dt + 0.5d0 * forces * dt**2 / mass

    ! Apply periodic boundary conditions
    do i = 1, 3
      where (positions(i, :) < 0.0d0) positions(i, :) = positions(i, :) + box_size
      where (positions(i, :) >= box_size) positions(i, :) = positions(i, :) - box_size
    end do

    ! Save old forces
    old_forces = forces

    ! Calculate new forces
    call calculate_forces(pot_energy)

    ! Update velocities: v(t+dt) = v(t) + 0.5*[a(t) + a(t+dt)]*dt
    velocities = velocities + 0.5d0 * (old_forces + forces) * dt / mass

    ! Calculate kinetic energy
    kin_energy = 0.5d0 * mass * sum(velocities**2)
  end subroutine velocity_verlet_step

  subroutine print_statistics()
    real(8) :: avg_temp, avg_total_E, std_total_E
    integer :: n_samples

    n_samples = size(total_E)

    avg_total_E = sum(total_E) / n_samples
    std_total_E = sqrt(sum((total_E - avg_total_E)**2) / n_samples)

    avg_temp = 2.0d0 * sum(kinetic_E) / (3.0d0 * n_atoms * kB * n_samples)

    print *, ''
    print *, 'Performance Statistics:'
    print *, '---------------------'
    print '(A,F7.1,A)', '  Average Temperature: ', avg_temp, ' K'
    print '(A,F10.3,A)', '  Average Total Energy: ', avg_total_E/1.602d-19, ' eV'
    print '(A,F10.5,A)', '  Energy Conservation: ', std_total_E/abs(avg_total_E)*100, ' %'
    print *, ''
    print *, 'Fortran Performance Advantages:'
    print *, '  • Native array operations (SIMD vectorization)'
    print *, '  • Efficient memory layout (column-major)'
    print *, '  • Aggressive compiler optimizations (-O3, -march=native)'
    print *, '  • 5-10× faster than Python for large systems'
  end subroutine print_statistics

end program md_simulation

💡 This example demonstrates the computational approach to solving physics problems

Sequence Analysis & Bioinformatics

Sequence analysis tools identify evolutionary relationships, predict structure, and locate functional motifs. Algorithms like Smith-Waterman and BLAST enable homology detection across vast genomic databases.

Sequence Alignment: Needleman-Wunsch Algorithm

Global sequence alignment using dynamic programming to find optimal alignment between two sequences

python
import numpy as np

class SequenceAlignment:
    """Needleman-Wunsch global alignment algorithm"""

    def __init__(self, match_score=1, mismatch_penalty=-1, gap_penalty=-2):
        self.match_score = match_score
        self.mismatch_penalty = mismatch_penalty
        self.gap_penalty = gap_penalty

    def align(self, seq1, seq2):
        """
        Perform global alignment of two sequences

        Returns: (aligned_seq1, aligned_seq2, score)
        """
        n, m = len(seq1), len(seq2)

        # Initialize scoring matrix
        score_matrix = np.zeros((n+1, m+1))

        # Initialize first row and column (gap penalties)
        for i in range(n+1):
            score_matrix[i, 0] = i * self.gap_penalty
        for j in range(m+1):
            score_matrix[0, j] = j * self.gap_penalty

        # Fill scoring matrix using dynamic programming
        for i in range(1, n+1):
            for j in range(1, m+1):
                # Calculate scores for three possible moves
                if seq1[i-1] == seq2[j-1]:
                    match = score_matrix[i-1, j-1] + self.match_score
                else:
                    match = score_matrix[i-1, j-1] + self.mismatch_penalty

                delete = score_matrix[i-1, j] + self.gap_penalty
                insert = score_matrix[i, j-1] + self.gap_penalty

                # Take maximum score
                score_matrix[i, j] = max(match, delete, insert)

        # Traceback to find alignment
        align1, align2 = "", ""
        i, j = n, m

        while i > 0 or j > 0:
            current_score = score_matrix[i, j]

            if i > 0 and j > 0:
                if seq1[i-1] == seq2[j-1]:
                    diagonal = score_matrix[i-1, j-1] + self.match_score
                else:
                    diagonal = score_matrix[i-1, j-1] + self.mismatch_penalty
            else:
                diagonal = float('-inf')

            up = score_matrix[i-1, j] + self.gap_penalty if i > 0 else float('-inf')
            left = score_matrix[i, j-1] + self.gap_penalty if j > 0 else float('-inf')

            if current_score == diagonal:
                align1 = seq1[i-1] + align1
                align2 = seq2[j-1] + align2
                i -= 1
                j -= 1
            elif current_score == up:
                align1 = seq1[i-1] + align1
                align2 = "-" + align2
                i -= 1
            else:
                align1 = "-" + align1
                align2 = seq2[j-1] + align2
                j -= 1

        return align1, align2, score_matrix[n, m]

    def print_alignment(self, seq1, seq2, score):
        """Pretty print alignment with matches indicated"""
        match_string = ""
        matches = 0

        for i in range(len(seq1)):
            if seq1[i] == seq2[i] and seq1[i] != '-':
                match_string += "|"
                matches += 1
            elif seq1[i] != '-' and seq2[i] != '-':
                match_string += "."
            else:
                match_string += " "

        identity = (matches / max(len(seq1.replace('-', '')),
                                  len(seq2.replace('-', ''))) * 100)

        print(f"Alignment Score: {score}")
        print(f"Sequence Identity: {identity:.1f}%")
        print(f"Matches: {matches}/{len(seq1)}")
        print()

        # Print alignment in blocks of 60
        for i in range(0, len(seq1), 60):
            print(f"Seq1: {seq1[i:i+60]}")
            print(f"      {match_string[i:i+60]}")
            print(f"Seq2: {seq2[i:i+60]}")
            print()

# Example: Align two protein sequences
aligner = SequenceAlignment(match_score=2, mismatch_penalty=-1, gap_penalty=-2)

# Two similar protein sequences (fragments)
seq1 = "HEAGAWGHEE"
seq2 = "PAWHEAE"

print("Global Sequence Alignment")
print("=" * 70)
print(f"Sequence 1: {seq1} (length: {len(seq1)})")
print(f"Sequence 2: {seq2} (length: {len(seq2)})")
print()

aligned1, aligned2, score = aligner.align(seq1, seq2)
aligner.print_alignment(aligned1, aligned2, score)

# More realistic example: Beta-globin fragments from human and mouse
print("\n" + "=" * 70)
print("Comparing Beta-Globin Fragments (Human vs Mouse)")
print("=" * 70)

human_beta = "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG"
mouse_beta = "VHLTDAEKAAVSCLWGKVNSDDEVGGEALGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT"

aligned_h, aligned_m, score = aligner.align(human_beta, mouse_beta)
aligner.print_alignment(aligned_h, aligned_m, score)

print("Biological Interpretation:")
print("-" * 70)
print("High sequence identity indicates evolutionary conservation")
print("Beta-globin is highly conserved across mammals")
print("Conserved residues are critical for oxygen-binding function")
print("Gaps represent insertions/deletions (indels) during evolution")

💡 This example demonstrates the computational approach to solving physics problems

🚀 Fortran Implementation for Large-Scale Analysis

Fortran's string handling and 2D array efficiency make it ideal for sequence alignment at genomic scale. Used in production bioinformatics pipelines for processing millions of sequence pairs.

Needleman-Wunsch Algorithm in Fortran 90

High-performance global sequence alignment with dynamic programming

fortran
program needleman_wunsch
  implicit none

  ! Scoring parameters
  integer, parameter :: match_score = 2
  integer, parameter :: mismatch_penalty = -1
  integer, parameter :: gap_penalty = -2

  ! Maximum sequence length
  integer, parameter :: max_len = 256

  ! Sequences
  character(len=max_len) :: seq1, seq2
  character(len=max_len) :: aligned1, aligned2, match_string
  integer :: len1, len2, alignment_len

  ! Scoring matrix
  integer, dimension(0:max_len, 0:max_len) :: score_matrix

  ! Results
  integer :: final_score, matches, i
  real(8) :: identity

  ! Example 1: Short sequences
  print *, '========================================='
  print *, 'Global Sequence Alignment (Fortran 90)'
  print *, '========================================='
  print *, ''

  seq1 = 'HEAGAWGHEE'
  seq2 = 'PAWHEAE'
  len1 = 10
  len2 = 7

  print *, 'Sequence 1: ', trim(seq1(1:len1))
  print *, 'Sequence 2: ', trim(seq2(1:len2))
  print *, ''

  call align_sequences(seq1(1:len1), seq2(1:len2), len1, len2, &
                       aligned1, aligned2, alignment_len, final_score)

  call print_alignment(aligned1, aligned2, alignment_len, final_score)

  ! Example 2: Beta-globin fragments
  print *, ''
  print *, '========================================='
  print *, 'Beta-Globin Fragments (Human vs Mouse)'
  print *, '========================================='
  print *, ''

  seq1 = 'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG'
  seq2 = 'VHLTDAEKAAVSCLWGKVNSDDEVGGEALGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT'
  len1 = 70
  len2 = 71

  call align_sequences(seq1(1:len1), seq2(1:len2), len1, len2, &
                       aligned1, aligned2, alignment_len, final_score)

  call print_alignment(aligned1, aligned2, alignment_len, final_score)

  print *, ''
  print *, 'Fortran Performance Benefits:'
  print *, '  • Efficient 2D array indexing (cache-friendly)'
  print *, '  • Fast string operations with fixed-width arrays'
  print *, '  • O(n×m) time, O(n×m) space complexity'
  print *, '  • Suitable for genome-scale alignments'

contains

  subroutine align_sequences(s1, s2, n, m, a1, a2, alen, score)
    character(len=*), intent(in) :: s1, s2
    integer, intent(in) :: n, m
    character(len=max_len), intent(out) :: a1, a2
    integer, intent(out) :: alen, score
    integer :: i, j, diag, up, left, current_score

    ! Initialize scoring matrix
    do i = 0, n
      score_matrix(i, 0) = i * gap_penalty
    end do
    do j = 0, m
      score_matrix(0, j) = j * gap_penalty
    end do

    ! Fill scoring matrix using dynamic programming
    do i = 1, n
      do j = 1, m
        ! Calculate scores for three possible moves
        if (s1(i:i) == s2(j:j)) then
          diag = score_matrix(i-1, j-1) + match_score
        else
          diag = score_matrix(i-1, j-1) + mismatch_penalty
        end if

        up = score_matrix(i-1, j) + gap_penalty
        left = score_matrix(i, j-1) + gap_penalty

        ! Take maximum score
        score_matrix(i, j) = max(diag, max(up, left))
      end do
    end do

    score = score_matrix(n, m)

    ! Traceback to construct alignment
    call traceback(s1, s2, n, m, a1, a2, alen)
  end subroutine align_sequences

  subroutine traceback(s1, s2, n, m, a1, a2, alen)
    character(len=*), intent(in) :: s1, s2
    integer, intent(in) :: n, m
    character(len=max_len), intent(out) :: a1, a2
    integer, intent(out) :: alen
    character(len=max_len) :: temp1, temp2
    integer :: i, j, pos, diag, up, left, current_score

    i = n
    j = m
    pos = 0
    temp1 = ''
    temp2 = ''

    do while (i > 0 .or. j > 0)
      current_score = score_matrix(i, j)

      ! Calculate possible previous scores
      if (i > 0 .and. j > 0) then
        if (s1(i:i) == s2(j:j)) then
          diag = score_matrix(i-1, j-1) + match_score
        else
          diag = score_matrix(i-1, j-1) + mismatch_penalty
        end if
      else
        diag = -999999
      end if

      if (i > 0) then
        up = score_matrix(i-1, j) + gap_penalty
      else
        up = -999999
      end if

      if (j > 0) then
        left = score_matrix(i, j-1) + gap_penalty
      else
        left = -999999
      end if

      ! Traceback
      pos = pos + 1
      if (current_score == diag) then
        temp1(pos:pos) = s1(i:i)
        temp2(pos:pos) = s2(j:j)
        i = i - 1
        j = j - 1
      else if (current_score == up) then
        temp1(pos:pos) = s1(i:i)
        temp2(pos:pos) = '-'
        i = i - 1
      else
        temp1(pos:pos) = '-'
        temp2(pos:pos) = s2(j:j)
        j = j - 1
      end if
    end do

    alen = pos

    ! Reverse strings
    do i = 1, alen
      a1(i:i) = temp1(alen-i+1:alen-i+1)
      a2(i:i) = temp2(alen-i+1:alen-i+1)
    end do
  end subroutine traceback

  subroutine print_alignment(a1, a2, alen, score)
    character(len=*), intent(in) :: a1, a2
    integer, intent(in) :: alen, score
    character(len=max_len) :: match_str
    integer :: i, matches, len1_no_gaps, len2_no_gaps
    real(8) :: identity

    ! Create match string and count matches
    matches = 0
    do i = 1, alen
      if (a1(i:i) == a2(i:i) .and. a1(i:i) /= '-') then
        match_str(i:i) = '|'
        matches = matches + 1
      else if (a1(i:i) /= '-' .and. a2(i:i) /= '-') then
        match_str(i:i) = '.'
      else
        match_str(i:i) = ' '
      end if
    end do

    ! Count non-gap characters
    len1_no_gaps = 0
    len2_no_gaps = 0
    do i = 1, alen
      if (a1(i:i) /= '-') len1_no_gaps = len1_no_gaps + 1
      if (a2(i:i) /= '-') len2_no_gaps = len2_no_gaps + 1
    end do

    identity = 100.0d0 * real(matches, 8) / real(max(len1_no_gaps, len2_no_gaps), 8)

    print '(A,I5)', 'Alignment Score: ', score
    print '(A,F6.1,A)', 'Sequence Identity: ', identity, '%'
    print '(A,I3,A,I3)', 'Matches: ', matches, '/', alen
    print *, ''

    ! Print alignment in blocks of 60
    do i = 1, alen, 60
      print '(A,A)', 'Seq1: ', a1(i:min(i+59, alen))
      print '(A,A)', '      ', match_str(i:min(i+59, alen))
      print '(A,A)', 'Seq2: ', a2(i:min(i+59, alen))
      print *, ''
    end do
  end subroutine print_alignment

end program needleman_wunsch

💡 This example demonstrates the computational approach to solving physics problems

Genomics & Modern Genome Analysis

Genomics provides the foundation for computational biology through DNA sequencing, genome assembly, and functional annotation. Modern computational approaches enable researchers to analyze genetic variation, gene expression, and regulatory mechanisms at unprecedented scale, transforming our understanding of biology and medicine.

🧬 DNA Sequencing Technologies

Short-Read (Illumina)

  • • Read length: 150-300 bp
  • • High accuracy (>99.9%)
  • • Low cost per base
  • • Best for SNP/indel detection
  • • Whole-genome sequencing

Long-Read (PacBio/ONT)

  • • Read length: 10-100+ kb
  • • Structural variants
  • • Repeat regions
  • • Phasing haplotypes
  • • De novo assembly

Single-Cell Sequencing

  • • Individual cell resolution
  • • Cell type identification
  • • Heterogeneity analysis
  • • Developmental trajectories
  • • Tumor microenvironment

Genome Assembly Pipeline

1. Quality Control: FastQC, Trimmomatic (adapter removal, quality filtering)

2. Assembly: De Bruijn graphs (SPAdes, MEGAHIT) or overlap-layout-consensus (Canu, Flye)

3. Scaffolding: Long-read or Hi-C data to order and orient contigs

4. Polishing: Error correction with high-accuracy reads (Pilon, Racon)

5. Annotation: Gene prediction (AUGUSTUS, Prokka), functional annotation (InterProScan)

🔬 Variant Calling & Analysis

Identifying and interpreting genetic variation is central to genomics research, from understanding evolution to diagnosing rare diseases and developing precision medicine.

Types of Variants

  • SNPs (Single Nucleotide Polymorphisms): Single base changes, most common variation
  • Indels: Small insertions/deletions (<50 bp)
  • Structural Variants: Large rearrangements (deletions, duplications, inversions, translocations)
  • Copy Number Variants (CNVs): Changes in gene dosage

Variant Calling Pipeline

  • 1. Alignment: BWA-MEM, Bowtie2 to reference genome
  • 2. Processing: Mark duplicates (Picard), base quality recalibration (GATK)
  • 3. Variant Calling: GATK HaplotypeCaller, FreeBayes, bcftools
  • 4. Filtering: VQSR or hard filters, depth/quality thresholds
  • 5. Annotation: VEP, SnpEff for functional consequences

Statistical Considerations

Multiple Testing: Genome-wide significance (p < 5×10⁻⁸) accounts for ~1 million independent tests. See Statistical Methods section for corrections (Bonferroni, FDR).

Population Genetics: Hardy-Weinberg equilibrium, linkage disequilibrium, allele frequency databases (gnomAD, ExAC)

Functional Impact: CADD scores, SIFT/PolyPhen predictions, conservation metrics (PhyloP, GERP)

📊 Gene Expression & Transcriptomics

RNA-Seq Workflow

  • 1. Quality Control: FastQC, MultiQC for read quality assessment
  • 2. Alignment: STAR, HISAT2 (splice-aware), or pseudo-alignment (kallisto, Salmon)
  • 3. Quantification: Count reads per gene (featureCounts, HTSeq)
  • 4. Normalization: TPM, FPKM, or library size factors
  • 5. Differential Expression: DESeq2, edgeR, limma-voom

Statistical Methods

  • Negative Binomial Model: Accounts for overdispersion in count data (DESeq2, edgeR)
  • Multiple Testing: Benjamini-Hochberg FDR correction for thousands of genes
  • Fold Change: Log₂FC with shrinkage (apeglm, ashr) for more reliable estimates
  • Power Analysis: Sample size calculations for desired effect size detection

Single-Cell RNA-Seq Analysis

Preprocessing: Cell filtering, ambient RNA removal (CellBender), doublet detection (Scrublet, DoubletFinder)

Normalization: Log-normalization, SCTransform for variance stabilization

Dimensionality Reduction: PCA, UMAP, t-SNE for visualization

Clustering: Louvain/Leiden algorithms for cell type identification

Trajectory Analysis: Pseudotime inference (Monocle, Slingshot) for developmental processes

Tools: Seurat, Scanpy, Bioconductor ecosystem

Spatial Transcriptomics

Emerging technologies (10x Visium, MERFISH, seqFISH) preserve spatial context while measuring gene expression. Computational methods integrate spatial coordinates with expression data to identify tissue architecture, cell-cell communication, and spatial gene programs. Key tools: Squidpy, STUtility, Giotto.

🧪 Functional Genomics & Regulatory Analysis

CRISPR Screening

  • Genome-Wide Screens: Systematic knockout/inhibition of every gene
  • Phenotypic Readouts: Growth, drug resistance, reporter activity
  • Analysis: MAGeCK, BAGEL for identifying essential genes
  • Applications: Cancer dependencies, host-pathogen interactions, drug targets

ChIP-Seq & Epigenomics

  • Transcription Factor Binding: Identify regulatory sites genome-wide
  • Histone Modifications: Map active promoters, enhancers, repressed regions
  • Peak Calling: MACS2, HOMER for identifying enriched regions
  • Motif Analysis: MEME, HOMER for discovering binding motifs

Pathway & Network Analysis

Enrichment Analysis: Gene Ontology (GO), KEGG pathways, Reactome. Statistical test: hypergeometric or Fisher's exact, with FDR correction.

Gene Regulatory Networks: Infer transcriptional regulation from expression data (ARACNe, SCENIC)

Protein-Protein Interactions: STRING database, co-immunoprecipitation networks

Systems Biology: Integrate multi-omics data (genomics, transcriptomics, proteomics, metabolomics)

🏥 Clinical Genomics & Precision Medicine

Cancer Genomics

Tumor sequencing identifies driver mutations, therapeutic targets, and resistance mechanisms.

  • Somatic Mutation Calling: Tumor-normal pairs (Mutect2, VarScan), clonal evolution
  • Copy Number Analysis: CNVkit, GISTIC for amplifications/deletions
  • Fusion Detection: STAR-Fusion, Arriba for gene fusions (e.g., BCR-ABL)
  • Immunotherapy Biomarkers: Tumor mutational burden (TMB), microsatellite instability (MSI)
  • Clinical Annotation: OncoKB, CIViC for actionable variants

Rare Disease Diagnosis

  • Whole-Exome/Genome Sequencing: Comprehensive variant detection
  • Variant Prioritization: Filter by inheritance pattern, allele frequency, predicted impact
  • Phenotype Matching: HPO (Human Phenotype Ontology) terms, tools like Exomiser
  • Diagnostic Rate: ~25-40% for rare diseases, higher with trio sequencing

Pharmacogenomics

Genetic variants affect drug metabolism, efficacy, and toxicity. PharmGKB database provides clinical annotations. Examples: CYP2D6 variants affect codeine metabolism; HLA-B*57:01 predicts abacavir hypersensitivity; TPMT variants guide thiopurine dosing. Increasingly integrated into electronic health records for clinical decision support.

⚡ Computational Challenges & Big Data

Data Scale Challenges

  • Volume: Human genome = 3 GB raw, 30x WGS = 100 GB compressed
  • Throughput: Single Illumina NovaSeq run = 6 TB of data
  • Storage: Cloud-based solutions (AWS S3, Google Cloud), CRAM compression
  • Computation: HPC clusters, workflow managers (Snakemake, Nextflow, WDL)

Algorithm Efficiency

  • Indexing: FM-index (BWA), suffix arrays for fast alignment
  • Hashing: k-mer based methods (Kraken, kallisto) for speed
  • Approximation: MinHash, locality-sensitive hashing for similarity
  • Parallelization: Multi-threading, GPU acceleration for ML models

Visualization & Integration

Genome Browsers: IGV (Integrative Genomics Viewer), UCSC Genome Browser for interactive exploration

Statistical Graphics: ggplot2, matplotlib for publication-quality figures

Interactive Dashboards: R Shiny, Plotly Dash for exploratory analysis and sharing results

🚀 Current Frontiers & Future Directions

Pan-Genomics

Moving beyond single reference genomes to graph-based representations capturing human diversity. Graph genomes (vg toolkit) represent multiple haplotypes simultaneously, improving variant calling in diverse populations. Human Pangenome Reference Consortium aims to represent global genetic variation.

AI/ML in Genomics

  • Deep Learning: CNNs for regulatory element prediction (DeepSEA, Basset)
  • Variant Calling: DeepVariant achieves state-of-the-art accuracy
  • Protein Design: RFdiffusion, ProteinMPNN for de novo design
  • Foundation Models: Genomic language models (DNABERT, Nucleotide Transformer)
  • Drug Discovery: AI-driven target identification and compound optimization

Emerging Technologies

  • Long-Read Sequencing: Telomere-to-telomere assemblies, complete human genome
  • Single-Cell Multi-Omics: Simultaneous DNA + RNA, ATAC + RNA, protein + RNA
  • Spatial Multi-Omics: Subcellular resolution transcriptomics and proteomics
  • Portable Sequencing: Nanopore MinION for field research and rapid diagnostics
  • Real-Time Analysis: Adaptive sequencing, selective sequencing of regions of interest

Integration & Systems Biology

Multi-omics integration combines genomics, transcriptomics, proteomics, metabolomics, and epigenomics to build comprehensive models of biological systems. Machine learning approaches (MOFA+, multi-view learning) identify coordinated changes across molecular layers. Goal: predictive models of disease and therapeutic response.

🔗 Statistical Foundations

All genomics analyses rely heavily on the statistical methods covered in the Statistical Methods for Computational Biology section above. From experimental design ensuring adequate statistical power, to multiple testing corrections for genome-wide analyses, to generalized linear models for differential expression—statistics is the language that allows us to make rigorous, reproducible inferences from genomic data. Understanding these statistical principles is essential for any computational biologist working with genomic datasets.

Protein Structure Prediction

Ab initio and homology modeling methods predict 3D structures from amino acid sequences. Modern deep learning approaches (AlphaFold2) achieve near-experimental accuracy by learning from evolutionary and structural databases.

Structure Prediction Methods

Traditional Approaches

  • Homology Modeling: Template-based using known structures of homologs
  • Threading: Sequence-structure compatibility scoring
  • Ab Initio: Physics-based energy minimization from scratch
  • Rosetta: Fragment assembly with energy functions

Deep Learning Methods

  • AlphaFold2: Attention-based neural networks, MSA co-evolution
  • RoseTTAFold: 3-track network (sequence, distance, coordinates)
  • ESMFold: Language model-based, no MSA required
  • Accuracy: Often within 1-2 Å RMSD of experimental structures

Energy Functions in Structure Prediction

Total potential energy combines multiple terms:

$$E_{total} = E_{bond} + E_{angle} + E_{torsion} + E_{vdW} + E_{elec} + E_{solv}$$

Energy minimization drives folding toward native structure

Bonded Terms

Bond lengths, angles, dihedrals

Non-Bonded Terms

Van der Waals, electrostatics

Solvation

Implicit/explicit water models

Statistical Methods for Computational Biology

Statistics is a fundamental tool for computational biologists, enabling quantitative understanding and rigorous analysis of biological data. From experimental design to interpreting results, statistical methods underpin reproducible computational biology research.

Statistics for Computational Biology Projects

Instructor

Erica Holdmore

Department of Data Science
Dana-Farber Cancer Institute

Format

Interactive Training Session

Hands-on statistical analysis

GitHub Repository

StatsForCompBio

This comprehensive training covers essential statistical concepts for analyzing biological data. The interactive format combines theoretical foundations with practical applications, preparing researchers to design robust experiments and draw valid conclusions from complex datasets.

Topics Covered:

Foundations

  • Experimental design principles
  • Statistical power and sample size
  • Hypothesis testing fundamentals
  • P-values and significance levels

Data Processing

  • Data cleaning and quality control
  • Handling missing data
  • Outlier detection and treatment
  • Normalization and transformation

Analysis Methods

  • Logistic regression for binary outcomes
  • ANOVA (Analysis of Variance)
  • Generalized Linear Models (GLM)
  • Multiple comparisons corrections

Interpretation

  • Effect sizes and confidence intervals
  • Model diagnostics and validation
  • Interpreting statistical results
  • Communicating findings effectively

Applications in Computational Biology:

  • Differential Gene Expression: Identifying statistically significant changes in RNA-seq data
  • Variant Calling: Statistical models for distinguishing true variants from sequencing errors
  • Protein-Protein Interactions: Assessing significance of co-immunoprecipitation experiments
  • Drug Response Modeling: Logistic regression for predicting treatment outcomes
  • Pathway Analysis: Multiple testing corrections for enrichment analysis
  • Biomarker Discovery: Statistical validation of diagnostic/prognostic markers

MIT Computational Biology: Genomes, Networks, Evolution

Comprehensive lecture series by Manolis Kellis covering the computational methods and algorithms used to understand genomics, evolution, and biological networks. This course bridges algorithmic foundations with modern applications in genomics and systems biology.

Introduction & Core Algorithms

Lecture 01 - Introduction

Lecture 02 - Dynamic Programming

Sequence alignment, optimal path finding, and scoring matrices

Lecture 03 - Database Search

BLAST, heuristic search algorithms, and sequence similarity

Lectures 04-05 - Hidden Markov Models

Probabilistic models, Viterbi algorithm, and applications to gene finding

Gene Expression & RNA Biology

Lecture 06 - Gene Expression Analysis

Clustering, classification, and pattern recognition in expression data

Lecture 07 - RNA World, RNA-seq, RNA Folding

RNA structure prediction, transcriptomics, and secondary structure

Genomics & Regulatory Networks

Lecture 08 - Epigenomics

Chromatin states, histone modifications, and epigenetic regulation

Lecture 09 - Three Dimensional Genome

Chromosome conformation capture, Hi-C, and spatial organization

Lecture 10 - Regulatory Genomics

Transcription factors, enhancers, and gene regulation networks

Network Analysis & Machine Learning

Lecture 11 - Network Analysis

Protein-protein interactions, pathway analysis, and network topology

Lecture 12 - Deep Learning

Neural networks, convolutional architectures, and biological applications

Population & Evolutionary Genomics

Lecture 13 - Population Genomics

Genetic variation, population structure, and selection signatures

Lecture 14 - Genome-Wide Association Studies (GWAS)

Statistical methods for associating genetic variants with phenotypes

Lecture 15 - Expression Quantitative Trait Loci (eQTLs)

Linking genetic variation to gene expression changes

Lecture 16 - Heritability

Quantitative genetics and variance component analysis

Lecture 17 - Comparative Genomics

Cross-species alignment, conservation, and functional element discovery

Lecture 18 - Genome Assembly, Evolution, Duplication

De novo assembly algorithms and evolutionary mechanisms

Lecture 19 - Phylogenetics

Tree construction, parsimony, and maximum likelihood methods

Lecture 20 - Phylogenomics

Genome-scale phylogenetic analysis and gene tree reconciliation

Advanced Topics & Applications

Lecture 21 - Single-Cell Genomics

scRNA-seq analysis, cell type identification, and trajectory inference

Lecture 22 - Cancer Genomics

Somatic mutations, driver genes, and tumor evolution

Lecture 23 - Multi-Phenotype Analyses

Pleiotropy, genetic correlation, and cross-trait analysis

Lecture 24 - Genome Engineering

CRISPR, base editing, and computational design of genetic interventions

MLCB24: Machine Learning in Computational Biology (Fall 2024)

Advanced lecture series by Manolis Kellis on machine learning methods for computational biology. This course covers state-of-the-art deep learning approaches applied to genomics, protein structure, drug discovery, and disease analysis. Combines theoretical foundations with practical applications of transformers, graph neural networks, and generative models for biological problems.

Gene Expression & Single-Cell Analysis

Lecture 01 - Introduction

Course overview, machine learning fundamentals for biology, and key computational biology challenges

Lecture 02 - Expression Analysis: Clustering & Classification

Gene expression data analysis, clustering algorithms, classification methods, and dimensionality reduction

Lecture 03 - Single Cell Analysis

scRNA-seq analysis, cell type identification, trajectory inference, and batch correction methods

Sequence Analysis & Regulatory Genomics

Lecture 04 - Sequence Alignment

Dynamic programming for alignment, BLAST, multiple sequence alignment, and profile HMMs

Lecture 05 - Epigenomics & HMMs

Chromatin state discovery, hidden Markov models for genomic segmentation, and epigenetic regulation

Lecture 06 - Regulatory Circuitry

Transcription factor binding, enhancer prediction, and regulatory element discovery

Lecture 07 - Regulatory Circuitry and Networks

Gene regulatory networks, network inference algorithms, and systems-level regulation

Protein Structure & Deep Learning

Lecture 08 - Introduction to Protein Structure

Protein structure fundamentals, contact maps, and structure prediction challenges

Lecture 09 - Protein Folding Algorithms

Energy minimization, molecular dynamics, coevolution-based methods, and threading approaches

Lecture 10 - Protein Structure with Transformers

AlphaFold2 architecture, attention mechanisms, and transformer-based structure prediction

Lecture 11 - Protein Language Models

ESM, ProtBERT, protein embeddings, and self-supervised learning for proteins

Lecture 12 - DNA Language Models and Convolution

Convolutional neural networks for sequences, motif discovery, and DNA foundation models

Drug Discovery & Molecular Design

Lecture 13 - Drug Development Introduction

Drug discovery pipeline, target identification, lead optimization, and ADMET prediction

Lecture 14 - Chemistry & Graph Neural Networks

Molecular graphs, message passing networks, and property prediction with GNNs

Lecture 15 - Generating New Molecules

Generative models for molecules: VAEs, GANs, diffusion models, and flow-based generation

Lecture 16 - Training Neural Networks

Optimization algorithms, regularization, transfer learning, and training best practices

Disease Analysis & Clinical Applications

Lecture 17 - Disease: GWAS, PRS & Mechanism

Genome-wide association studies, polygenic risk scores, and mechanistic interpretation

Lecture 18 - Disease Mechanism

Variant effect prediction, causal inference, and pathway analysis for disease genetics

Lecture 19 - Electronic Health Records

Clinical data analysis, phenotype extraction, and machine learning for patient outcomes

Evolution, Metabolism & Metabolomics

Lecture 20 - Comparative Genomics

Multi-species alignment, conservation scoring, and functional element discovery

Lecture 21 - Evolution

Phylogenetic inference, molecular evolution models, and selection analysis

Lecture 22 - Metabolic Modeling

Flux balance analysis, metabolic networks, and constraint-based modeling

Lecture 23 - Computational Metabolomics

Mass spectrometry data analysis, metabolite identification, and pathway enrichment

Tools & Resources

MD Simulation Software

  • GROMACS: Fast, parallel MD for biomolecules
  • AMBER: Specialized for nucleic acids & proteins
  • NAMD: Scalable MD with CHARMM force fields
  • OpenMM: GPU-accelerated, Python API
  • LAMMPS: Large-scale atomic/molecular simulations

Bioinformatics Tools

  • BLAST: Sequence similarity search
  • Clustal Omega: Multiple sequence alignment
  • HMMER: Profile hidden Markov models
  • BioPython: Python tools for computational biology (Tutorial)
  • R/Bioconductor: Statistical analysis of genomic data

Structure Prediction

  • AlphaFold2: Deep learning structure prediction
  • Rosetta: Energy-based modeling suite
  • MODELLER: Homology modeling
  • I-TASSER: Threading & ab initio prediction
  • SWISS-MODEL: Web-based homology modeling

Databases

  • PDB: Protein Data Bank (structures)
  • UniProt: Protein sequences & annotations
  • GenBank: NIH genetic sequence database
  • Pfam: Protein families & domains
  • AlphaFold DB: 200M+ predicted structures