Computational Biology & Molecular Simulation
Bridging bioinformatics, molecular dynamics, and structural biology through computational methods. From sequence analysis to atomic-level simulations of biomolecular systems.
Computational Biology Framework
Computational biology integrates data from multiple sources—genomic databases, structural repositories, spectroscopic measurements—to build dynamic 3D atomic models of biological systems. This multidisciplinary field combines bioinformatics for data analysis with molecular simulations for understanding dynamic processes.
Key Data Sources & Methods
Sequence Databases
- • GenomeDB/UniProt
- • Homology information
- • Evolutionary analysis
Structural Data
- • PDB (X-ray, NMR, Cryo-EM)
- • Atomic coordinates
- • Co-evolution analysis
Biophysical Data
- • FRET, Biophysics
- • SAXS, SANS
- • 3DEM volume
Spectroscopy
- • NMR (chemical shifts)
- • BMRB database
- • Distance constraints
Cross-Linking
- • Cross-link & MS
- • Distance information
- • Protein interactions
Diffraction
- • Crystallography
- • SASBDB (SAXS/SANS)
- • Shape information
Molecular Dynamics Simulations
Molecular dynamics (MD) simulations integrate Newton's equations of motion for every atom in a biomolecular system, revealing dynamic behavior, conformational changes, and interaction mechanisms at atomic resolution.
Molecular Dynamics: Protein in Water Box
Simple MD simulation framework for a protein system using Verlet integration
import numpy as np
import matplotlib.pyplot as plt
# Physical constants
kB = 1.381e-23 # Boltzmann constant (J/K)
NA = 6.022e23 # Avogadro's number
class MDSimulation:
"""Simple molecular dynamics simulation framework"""
def __init__(self, n_atoms, box_size, dt=1e-15):
"""
Initialize MD simulation
Parameters:
n_atoms: number of atoms
box_size: simulation box size (nm)
dt: timestep (seconds, default 1 fs)
"""
self.n_atoms = n_atoms
self.box_size = box_size * 1e-9 # Convert to meters
self.dt = dt
# Initialize positions randomly in box
self.positions = np.random.rand(n_atoms, 3) * self.box_size
# Initialize velocities (Maxwell-Boltzmann at 300K)
self.velocities = self.maxwell_boltzmann_velocities(300)
# Initialize forces
self.forces = np.zeros((n_atoms, 3))
# Energy tracking
self.kinetic_energy = []
self.potential_energy = []
self.total_energy = []
def maxwell_boltzmann_velocities(self, T, mass=1.67e-27):
"""Initialize velocities from Maxwell-Boltzmann distribution"""
sigma = np.sqrt(kB * T / mass)
velocities = np.random.normal(0, sigma, (self.n_atoms, 3))
# Remove center of mass motion
velocities -= np.mean(velocities, axis=0)
return velocities
def lennard_jones_force(self, epsilon=0.996e-21, sigma=3.4e-10):
"""
Calculate Lennard-Jones forces between all atom pairs
V(r) = 4ε[(σ/r)^12 - (σ/r)^6]
"""
forces = np.zeros_like(self.positions)
potential = 0.0
for i in range(self.n_atoms):
for j in range(i+1, self.n_atoms):
# Distance vector with periodic boundary conditions
rij = self.positions[i] - self.positions[j]
rij -= self.box_size * np.round(rij / self.box_size)
r = np.linalg.norm(rij)
if r < 2.5 * sigma: # Cutoff distance
# LJ potential and force
sr6 = (sigma / r)**6
sr12 = sr6 * sr6
potential += 4 * epsilon * (sr12 - sr6)
force_mag = 24 * epsilon * (2*sr12 - sr6) / r**2
force_vec = force_mag * rij
forces[i] += force_vec
forces[j] -= force_vec
return forces, potential
def velocity_verlet_step(self):
"""Velocity Verlet integration step"""
# Update positions: r(t+dt) = r(t) + v(t)*dt + 0.5*a(t)*dt^2
self.positions += self.velocities * self.dt + 0.5 * self.forces * self.dt**2 / 1.67e-27
# Apply periodic boundary conditions
self.positions = np.mod(self.positions, self.box_size)
# Calculate new forces
old_forces = self.forces.copy()
self.forces, potential = self.lennard_jones_force()
# Update velocities: v(t+dt) = v(t) + 0.5*[a(t) + a(t+dt)]*dt
self.velocities += 0.5 * (old_forces + self.forces) * self.dt / 1.67e-27
# Calculate energies
kinetic = 0.5 * 1.67e-27 * np.sum(self.velocities**2)
return kinetic, potential
def run(self, n_steps, sample_interval=10):
"""Run MD simulation"""
print(f"Starting MD simulation with {self.n_atoms} atoms")
print(f"Box size: {self.box_size*1e9:.2f} nm")
print(f"Timestep: {self.dt*1e15:.2f} fs")
print(f"Total time: {n_steps*self.dt*1e12:.2f} ps")
print("=" * 60)
for step in range(n_steps):
kinetic, potential = self.velocity_verlet_step()
if step % sample_interval == 0:
self.kinetic_energy.append(kinetic / 1.602e-19) # Convert to eV
self.potential_energy.append(potential / 1.602e-19)
self.total_energy.append((kinetic + potential) / 1.602e-19)
if step % (n_steps // 10) == 0:
T_inst = 2 * kinetic / (3 * self.n_atoms * kB)
print(f"Step {step:6d}: T = {T_inst:6.1f} K, "
f"E_total = {(kinetic+potential)/1.602e-19:8.3f} eV")
# Run simulation
md = MDSimulation(n_atoms=100, box_size=5.0, dt=1e-15) # 5 nm box, 1 fs timestep
md.run(n_steps=5000, sample_interval=10)
# Plot results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
# Energy evolution
time = np.arange(len(md.total_energy)) * 10 * 1e-15 * 1e12 # ps
ax1.plot(time, md.kinetic_energy, 'r-', label='Kinetic', alpha=0.7)
ax1.plot(time, md.potential_energy, 'b-', label='Potential', alpha=0.7)
ax1.plot(time, md.total_energy, 'k-', label='Total', linewidth=2)
ax1.set_xlabel('Time (ps)')
ax1.set_ylabel('Energy (eV)')
ax1.set_title('Energy Conservation in MD Simulation')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Temperature evolution
temperatures = 2 * np.array(md.kinetic_energy) * 1.602e-19 / (3 * md.n_atoms * kB)
ax2.plot(time, temperatures, 'g-', linewidth=1.5)
ax2.axhline(y=300, color='r', linestyle='--', label='Target (300 K)')
ax2.set_xlabel('Time (ps)')
ax2.set_ylabel('Temperature (K)')
ax2.set_title('Temperature Fluctuations')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('md_simulation.png', dpi=150, bbox_inches='tight')
print("\nSimulation complete! Plot saved as 'md_simulation.png'")💡 This example demonstrates the computational approach to solving physics problems
🚀 High-Performance Fortran Implementation
Fortran excels at MD simulations due to array operations, numerical precision, and compiler optimizations. Production codes like GROMACS and AMBER use Fortran/C++ for performance-critical components.
Molecular Dynamics in Fortran 90: High-Performance Implementation
Optimized MD simulation with Lennard-Jones potential and Velocity Verlet integration
program md_simulation
implicit none
! Physical constants
real(8), parameter :: kB = 1.381d-23 ! Boltzmann constant (J/K)
real(8), parameter :: mass = 1.67d-27 ! Particle mass (kg)
real(8), parameter :: epsilon = 0.996d-21 ! LJ epsilon (J)
real(8), parameter :: sigma = 3.4d-10 ! LJ sigma (m)
! Simulation parameters
integer, parameter :: n_atoms = 100
integer, parameter :: n_steps = 5000
integer, parameter :: sample_interval = 10
real(8), parameter :: dt = 1.0d-15 ! Timestep (1 fs)
real(8), parameter :: box_size = 5.0d-9 ! Box size (5 nm)
real(8), parameter :: cutoff = 2.5d0 * sigma
real(8), parameter :: target_temp = 300.0d0
! Arrays
real(8), dimension(3, n_atoms) :: positions, velocities, forces, old_forces
real(8), dimension(n_steps/sample_interval) :: kinetic_E, potential_E, total_E
real(8) :: kinetic, potential, temperature
integer :: i, step, sample_idx
! Initialize simulation
call initialize_system()
! Main MD loop
print *, 'Starting Fortran MD simulation'
print *, 'Atoms:', n_atoms, ' Box size:', box_size*1d9, 'nm'
print *, 'Timestep:', dt*1d15, 'fs Total time:', n_steps*dt*1d12, 'ps'
print *, '============================================================'
sample_idx = 1
do step = 0, n_steps-1
! Velocity Verlet integration
call velocity_verlet_step(kinetic, potential)
! Sample and output
if (mod(step, sample_interval) == 0) then
kinetic_E(sample_idx) = kinetic
potential_E(sample_idx) = potential
total_E(sample_idx) = kinetic + potential
sample_idx = sample_idx + 1
if (mod(step, n_steps/10) == 0) then
temperature = 2.0d0 * kinetic / (3.0d0 * n_atoms * kB)
print '(A,I6,A,F7.1,A,F10.3,A)', &
'Step', step, ': T =', temperature, ' K, E_total =', &
(kinetic+potential)/1.602d-19, ' eV'
end if
end if
end do
print *, ''
print *, 'Simulation complete!'
call print_statistics()
contains
subroutine initialize_system()
! Initialize positions randomly in box
call random_number(positions)
positions = positions * box_size
! Initialize velocities (Maxwell-Boltzmann distribution)
call initialize_velocities(target_temp)
! Calculate initial forces
call calculate_forces(potential)
end subroutine initialize_system
subroutine initialize_velocities(T)
real(8), intent(in) :: T
real(8) :: sigma_v, rand_normal
integer :: i, j
sigma_v = sqrt(kB * T / mass)
! Generate Gaussian random velocities
do i = 1, n_atoms
do j = 1, 3
call random_normal(rand_normal)
velocities(j, i) = sigma_v * rand_normal
end do
end do
! Remove center of mass motion
do j = 1, 3
velocities(j, :) = velocities(j, :) - sum(velocities(j, :)) / n_atoms
end do
end subroutine initialize_velocities
subroutine random_normal(x)
! Box-Muller transform for Gaussian random numbers
real(8), intent(out) :: x
real(8) :: u1, u2
call random_number(u1)
call random_number(u2)
x = sqrt(-2.0d0 * log(u1)) * cos(2.0d0 * 3.14159265359d0 * u2)
end subroutine random_normal
subroutine calculate_forces(pot_energy)
real(8), intent(out) :: pot_energy
real(8), dimension(3) :: rij, force_vec
real(8) :: r, r2, sr6, sr12, force_mag
integer :: i, j, k
forces = 0.0d0
pot_energy = 0.0d0
! Loop over all atom pairs
do i = 1, n_atoms-1
do j = i+1, n_atoms
! Distance vector with periodic boundary conditions
rij = positions(:, i) - positions(:, j)
! Apply minimum image convention
do k = 1, 3
if (rij(k) > box_size/2.0d0) rij(k) = rij(k) - box_size
if (rij(k) < -box_size/2.0d0) rij(k) = rij(k) + box_size
end do
r2 = dot_product(rij, rij)
r = sqrt(r2)
! Only calculate if within cutoff
if (r < cutoff) then
sr6 = (sigma / r)**6
sr12 = sr6 * sr6
! Lennard-Jones potential
pot_energy = pot_energy + 4.0d0 * epsilon * (sr12 - sr6)
! Lennard-Jones force
force_mag = 24.0d0 * epsilon * (2.0d0*sr12 - sr6) / r2
force_vec = force_mag * rij
forces(:, i) = forces(:, i) + force_vec
forces(:, j) = forces(:, j) - force_vec
end if
end do
end do
end subroutine calculate_forces
subroutine velocity_verlet_step(kin_energy, pot_energy)
real(8), intent(out) :: kin_energy, pot_energy
integer :: i
! Update positions: r(t+dt) = r(t) + v(t)*dt + 0.5*a(t)*dt^2
positions = positions + velocities * dt + 0.5d0 * forces * dt**2 / mass
! Apply periodic boundary conditions
do i = 1, 3
where (positions(i, :) < 0.0d0) positions(i, :) = positions(i, :) + box_size
where (positions(i, :) >= box_size) positions(i, :) = positions(i, :) - box_size
end do
! Save old forces
old_forces = forces
! Calculate new forces
call calculate_forces(pot_energy)
! Update velocities: v(t+dt) = v(t) + 0.5*[a(t) + a(t+dt)]*dt
velocities = velocities + 0.5d0 * (old_forces + forces) * dt / mass
! Calculate kinetic energy
kin_energy = 0.5d0 * mass * sum(velocities**2)
end subroutine velocity_verlet_step
subroutine print_statistics()
real(8) :: avg_temp, avg_total_E, std_total_E
integer :: n_samples
n_samples = size(total_E)
avg_total_E = sum(total_E) / n_samples
std_total_E = sqrt(sum((total_E - avg_total_E)**2) / n_samples)
avg_temp = 2.0d0 * sum(kinetic_E) / (3.0d0 * n_atoms * kB * n_samples)
print *, ''
print *, 'Performance Statistics:'
print *, '---------------------'
print '(A,F7.1,A)', ' Average Temperature: ', avg_temp, ' K'
print '(A,F10.3,A)', ' Average Total Energy: ', avg_total_E/1.602d-19, ' eV'
print '(A,F10.5,A)', ' Energy Conservation: ', std_total_E/abs(avg_total_E)*100, ' %'
print *, ''
print *, 'Fortran Performance Advantages:'
print *, ' • Native array operations (SIMD vectorization)'
print *, ' • Efficient memory layout (column-major)'
print *, ' • Aggressive compiler optimizations (-O3, -march=native)'
print *, ' • 5-10× faster than Python for large systems'
end subroutine print_statistics
end program md_simulation💡 This example demonstrates the computational approach to solving physics problems
Sequence Analysis & Bioinformatics
Sequence analysis tools identify evolutionary relationships, predict structure, and locate functional motifs. Algorithms like Smith-Waterman and BLAST enable homology detection across vast genomic databases.
Sequence Alignment: Needleman-Wunsch Algorithm
Global sequence alignment using dynamic programming to find optimal alignment between two sequences
import numpy as np
class SequenceAlignment:
"""Needleman-Wunsch global alignment algorithm"""
def __init__(self, match_score=1, mismatch_penalty=-1, gap_penalty=-2):
self.match_score = match_score
self.mismatch_penalty = mismatch_penalty
self.gap_penalty = gap_penalty
def align(self, seq1, seq2):
"""
Perform global alignment of two sequences
Returns: (aligned_seq1, aligned_seq2, score)
"""
n, m = len(seq1), len(seq2)
# Initialize scoring matrix
score_matrix = np.zeros((n+1, m+1))
# Initialize first row and column (gap penalties)
for i in range(n+1):
score_matrix[i, 0] = i * self.gap_penalty
for j in range(m+1):
score_matrix[0, j] = j * self.gap_penalty
# Fill scoring matrix using dynamic programming
for i in range(1, n+1):
for j in range(1, m+1):
# Calculate scores for three possible moves
if seq1[i-1] == seq2[j-1]:
match = score_matrix[i-1, j-1] + self.match_score
else:
match = score_matrix[i-1, j-1] + self.mismatch_penalty
delete = score_matrix[i-1, j] + self.gap_penalty
insert = score_matrix[i, j-1] + self.gap_penalty
# Take maximum score
score_matrix[i, j] = max(match, delete, insert)
# Traceback to find alignment
align1, align2 = "", ""
i, j = n, m
while i > 0 or j > 0:
current_score = score_matrix[i, j]
if i > 0 and j > 0:
if seq1[i-1] == seq2[j-1]:
diagonal = score_matrix[i-1, j-1] + self.match_score
else:
diagonal = score_matrix[i-1, j-1] + self.mismatch_penalty
else:
diagonal = float('-inf')
up = score_matrix[i-1, j] + self.gap_penalty if i > 0 else float('-inf')
left = score_matrix[i, j-1] + self.gap_penalty if j > 0 else float('-inf')
if current_score == diagonal:
align1 = seq1[i-1] + align1
align2 = seq2[j-1] + align2
i -= 1
j -= 1
elif current_score == up:
align1 = seq1[i-1] + align1
align2 = "-" + align2
i -= 1
else:
align1 = "-" + align1
align2 = seq2[j-1] + align2
j -= 1
return align1, align2, score_matrix[n, m]
def print_alignment(self, seq1, seq2, score):
"""Pretty print alignment with matches indicated"""
match_string = ""
matches = 0
for i in range(len(seq1)):
if seq1[i] == seq2[i] and seq1[i] != '-':
match_string += "|"
matches += 1
elif seq1[i] != '-' and seq2[i] != '-':
match_string += "."
else:
match_string += " "
identity = (matches / max(len(seq1.replace('-', '')),
len(seq2.replace('-', ''))) * 100)
print(f"Alignment Score: {score}")
print(f"Sequence Identity: {identity:.1f}%")
print(f"Matches: {matches}/{len(seq1)}")
print()
# Print alignment in blocks of 60
for i in range(0, len(seq1), 60):
print(f"Seq1: {seq1[i:i+60]}")
print(f" {match_string[i:i+60]}")
print(f"Seq2: {seq2[i:i+60]}")
print()
# Example: Align two protein sequences
aligner = SequenceAlignment(match_score=2, mismatch_penalty=-1, gap_penalty=-2)
# Two similar protein sequences (fragments)
seq1 = "HEAGAWGHEE"
seq2 = "PAWHEAE"
print("Global Sequence Alignment")
print("=" * 70)
print(f"Sequence 1: {seq1} (length: {len(seq1)})")
print(f"Sequence 2: {seq2} (length: {len(seq2)})")
print()
aligned1, aligned2, score = aligner.align(seq1, seq2)
aligner.print_alignment(aligned1, aligned2, score)
# More realistic example: Beta-globin fragments from human and mouse
print("\n" + "=" * 70)
print("Comparing Beta-Globin Fragments (Human vs Mouse)")
print("=" * 70)
human_beta = "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG"
mouse_beta = "VHLTDAEKAAVSCLWGKVNSDDEVGGEALGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT"
aligned_h, aligned_m, score = aligner.align(human_beta, mouse_beta)
aligner.print_alignment(aligned_h, aligned_m, score)
print("Biological Interpretation:")
print("-" * 70)
print("High sequence identity indicates evolutionary conservation")
print("Beta-globin is highly conserved across mammals")
print("Conserved residues are critical for oxygen-binding function")
print("Gaps represent insertions/deletions (indels) during evolution")💡 This example demonstrates the computational approach to solving physics problems
🚀 Fortran Implementation for Large-Scale Analysis
Fortran's string handling and 2D array efficiency make it ideal for sequence alignment at genomic scale. Used in production bioinformatics pipelines for processing millions of sequence pairs.
Needleman-Wunsch Algorithm in Fortran 90
High-performance global sequence alignment with dynamic programming
program needleman_wunsch
implicit none
! Scoring parameters
integer, parameter :: match_score = 2
integer, parameter :: mismatch_penalty = -1
integer, parameter :: gap_penalty = -2
! Maximum sequence length
integer, parameter :: max_len = 256
! Sequences
character(len=max_len) :: seq1, seq2
character(len=max_len) :: aligned1, aligned2, match_string
integer :: len1, len2, alignment_len
! Scoring matrix
integer, dimension(0:max_len, 0:max_len) :: score_matrix
! Results
integer :: final_score, matches, i
real(8) :: identity
! Example 1: Short sequences
print *, '========================================='
print *, 'Global Sequence Alignment (Fortran 90)'
print *, '========================================='
print *, ''
seq1 = 'HEAGAWGHEE'
seq2 = 'PAWHEAE'
len1 = 10
len2 = 7
print *, 'Sequence 1: ', trim(seq1(1:len1))
print *, 'Sequence 2: ', trim(seq2(1:len2))
print *, ''
call align_sequences(seq1(1:len1), seq2(1:len2), len1, len2, &
aligned1, aligned2, alignment_len, final_score)
call print_alignment(aligned1, aligned2, alignment_len, final_score)
! Example 2: Beta-globin fragments
print *, ''
print *, '========================================='
print *, 'Beta-Globin Fragments (Human vs Mouse)'
print *, '========================================='
print *, ''
seq1 = 'VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG'
seq2 = 'VHLTDAEKAAVSCLWGKVNSDDEVGGEALGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT'
len1 = 70
len2 = 71
call align_sequences(seq1(1:len1), seq2(1:len2), len1, len2, &
aligned1, aligned2, alignment_len, final_score)
call print_alignment(aligned1, aligned2, alignment_len, final_score)
print *, ''
print *, 'Fortran Performance Benefits:'
print *, ' • Efficient 2D array indexing (cache-friendly)'
print *, ' • Fast string operations with fixed-width arrays'
print *, ' • O(n×m) time, O(n×m) space complexity'
print *, ' • Suitable for genome-scale alignments'
contains
subroutine align_sequences(s1, s2, n, m, a1, a2, alen, score)
character(len=*), intent(in) :: s1, s2
integer, intent(in) :: n, m
character(len=max_len), intent(out) :: a1, a2
integer, intent(out) :: alen, score
integer :: i, j, diag, up, left, current_score
! Initialize scoring matrix
do i = 0, n
score_matrix(i, 0) = i * gap_penalty
end do
do j = 0, m
score_matrix(0, j) = j * gap_penalty
end do
! Fill scoring matrix using dynamic programming
do i = 1, n
do j = 1, m
! Calculate scores for three possible moves
if (s1(i:i) == s2(j:j)) then
diag = score_matrix(i-1, j-1) + match_score
else
diag = score_matrix(i-1, j-1) + mismatch_penalty
end if
up = score_matrix(i-1, j) + gap_penalty
left = score_matrix(i, j-1) + gap_penalty
! Take maximum score
score_matrix(i, j) = max(diag, max(up, left))
end do
end do
score = score_matrix(n, m)
! Traceback to construct alignment
call traceback(s1, s2, n, m, a1, a2, alen)
end subroutine align_sequences
subroutine traceback(s1, s2, n, m, a1, a2, alen)
character(len=*), intent(in) :: s1, s2
integer, intent(in) :: n, m
character(len=max_len), intent(out) :: a1, a2
integer, intent(out) :: alen
character(len=max_len) :: temp1, temp2
integer :: i, j, pos, diag, up, left, current_score
i = n
j = m
pos = 0
temp1 = ''
temp2 = ''
do while (i > 0 .or. j > 0)
current_score = score_matrix(i, j)
! Calculate possible previous scores
if (i > 0 .and. j > 0) then
if (s1(i:i) == s2(j:j)) then
diag = score_matrix(i-1, j-1) + match_score
else
diag = score_matrix(i-1, j-1) + mismatch_penalty
end if
else
diag = -999999
end if
if (i > 0) then
up = score_matrix(i-1, j) + gap_penalty
else
up = -999999
end if
if (j > 0) then
left = score_matrix(i, j-1) + gap_penalty
else
left = -999999
end if
! Traceback
pos = pos + 1
if (current_score == diag) then
temp1(pos:pos) = s1(i:i)
temp2(pos:pos) = s2(j:j)
i = i - 1
j = j - 1
else if (current_score == up) then
temp1(pos:pos) = s1(i:i)
temp2(pos:pos) = '-'
i = i - 1
else
temp1(pos:pos) = '-'
temp2(pos:pos) = s2(j:j)
j = j - 1
end if
end do
alen = pos
! Reverse strings
do i = 1, alen
a1(i:i) = temp1(alen-i+1:alen-i+1)
a2(i:i) = temp2(alen-i+1:alen-i+1)
end do
end subroutine traceback
subroutine print_alignment(a1, a2, alen, score)
character(len=*), intent(in) :: a1, a2
integer, intent(in) :: alen, score
character(len=max_len) :: match_str
integer :: i, matches, len1_no_gaps, len2_no_gaps
real(8) :: identity
! Create match string and count matches
matches = 0
do i = 1, alen
if (a1(i:i) == a2(i:i) .and. a1(i:i) /= '-') then
match_str(i:i) = '|'
matches = matches + 1
else if (a1(i:i) /= '-' .and. a2(i:i) /= '-') then
match_str(i:i) = '.'
else
match_str(i:i) = ' '
end if
end do
! Count non-gap characters
len1_no_gaps = 0
len2_no_gaps = 0
do i = 1, alen
if (a1(i:i) /= '-') len1_no_gaps = len1_no_gaps + 1
if (a2(i:i) /= '-') len2_no_gaps = len2_no_gaps + 1
end do
identity = 100.0d0 * real(matches, 8) / real(max(len1_no_gaps, len2_no_gaps), 8)
print '(A,I5)', 'Alignment Score: ', score
print '(A,F6.1,A)', 'Sequence Identity: ', identity, '%'
print '(A,I3,A,I3)', 'Matches: ', matches, '/', alen
print *, ''
! Print alignment in blocks of 60
do i = 1, alen, 60
print '(A,A)', 'Seq1: ', a1(i:min(i+59, alen))
print '(A,A)', ' ', match_str(i:min(i+59, alen))
print '(A,A)', 'Seq2: ', a2(i:min(i+59, alen))
print *, ''
end do
end subroutine print_alignment
end program needleman_wunsch💡 This example demonstrates the computational approach to solving physics problems
Genomics & Modern Genome Analysis
Genomics provides the foundation for computational biology through DNA sequencing, genome assembly, and functional annotation. Modern computational approaches enable researchers to analyze genetic variation, gene expression, and regulatory mechanisms at unprecedented scale, transforming our understanding of biology and medicine.
🧬 DNA Sequencing Technologies
Short-Read (Illumina)
- • Read length: 150-300 bp
- • High accuracy (>99.9%)
- • Low cost per base
- • Best for SNP/indel detection
- • Whole-genome sequencing
Long-Read (PacBio/ONT)
- • Read length: 10-100+ kb
- • Structural variants
- • Repeat regions
- • Phasing haplotypes
- • De novo assembly
Single-Cell Sequencing
- • Individual cell resolution
- • Cell type identification
- • Heterogeneity analysis
- • Developmental trajectories
- • Tumor microenvironment
Genome Assembly Pipeline
1. Quality Control: FastQC, Trimmomatic (adapter removal, quality filtering)
2. Assembly: De Bruijn graphs (SPAdes, MEGAHIT) or overlap-layout-consensus (Canu, Flye)
3. Scaffolding: Long-read or Hi-C data to order and orient contigs
4. Polishing: Error correction with high-accuracy reads (Pilon, Racon)
5. Annotation: Gene prediction (AUGUSTUS, Prokka), functional annotation (InterProScan)
🔬 Variant Calling & Analysis
Identifying and interpreting genetic variation is central to genomics research, from understanding evolution to diagnosing rare diseases and developing precision medicine.
Types of Variants
- SNPs (Single Nucleotide Polymorphisms): Single base changes, most common variation
- Indels: Small insertions/deletions (<50 bp)
- Structural Variants: Large rearrangements (deletions, duplications, inversions, translocations)
- Copy Number Variants (CNVs): Changes in gene dosage
Variant Calling Pipeline
- 1. Alignment: BWA-MEM, Bowtie2 to reference genome
- 2. Processing: Mark duplicates (Picard), base quality recalibration (GATK)
- 3. Variant Calling: GATK HaplotypeCaller, FreeBayes, bcftools
- 4. Filtering: VQSR or hard filters, depth/quality thresholds
- 5. Annotation: VEP, SnpEff for functional consequences
Statistical Considerations
Multiple Testing: Genome-wide significance (p < 5×10⁻⁸) accounts for ~1 million independent tests. See Statistical Methods section for corrections (Bonferroni, FDR).
Population Genetics: Hardy-Weinberg equilibrium, linkage disequilibrium, allele frequency databases (gnomAD, ExAC)
Functional Impact: CADD scores, SIFT/PolyPhen predictions, conservation metrics (PhyloP, GERP)
📊 Gene Expression & Transcriptomics
RNA-Seq Workflow
- 1. Quality Control: FastQC, MultiQC for read quality assessment
- 2. Alignment: STAR, HISAT2 (splice-aware), or pseudo-alignment (kallisto, Salmon)
- 3. Quantification: Count reads per gene (featureCounts, HTSeq)
- 4. Normalization: TPM, FPKM, or library size factors
- 5. Differential Expression: DESeq2, edgeR, limma-voom
Statistical Methods
- Negative Binomial Model: Accounts for overdispersion in count data (DESeq2, edgeR)
- Multiple Testing: Benjamini-Hochberg FDR correction for thousands of genes
- Fold Change: Log₂FC with shrinkage (apeglm, ashr) for more reliable estimates
- Power Analysis: Sample size calculations for desired effect size detection
Single-Cell RNA-Seq Analysis
Preprocessing: Cell filtering, ambient RNA removal (CellBender), doublet detection (Scrublet, DoubletFinder)
Normalization: Log-normalization, SCTransform for variance stabilization
Dimensionality Reduction: PCA, UMAP, t-SNE for visualization
Clustering: Louvain/Leiden algorithms for cell type identification
Trajectory Analysis: Pseudotime inference (Monocle, Slingshot) for developmental processes
Tools: Seurat, Scanpy, Bioconductor ecosystem
Spatial Transcriptomics
Emerging technologies (10x Visium, MERFISH, seqFISH) preserve spatial context while measuring gene expression. Computational methods integrate spatial coordinates with expression data to identify tissue architecture, cell-cell communication, and spatial gene programs. Key tools: Squidpy, STUtility, Giotto.
🧪 Functional Genomics & Regulatory Analysis
CRISPR Screening
- Genome-Wide Screens: Systematic knockout/inhibition of every gene
- Phenotypic Readouts: Growth, drug resistance, reporter activity
- Analysis: MAGeCK, BAGEL for identifying essential genes
- Applications: Cancer dependencies, host-pathogen interactions, drug targets
ChIP-Seq & Epigenomics
- Transcription Factor Binding: Identify regulatory sites genome-wide
- Histone Modifications: Map active promoters, enhancers, repressed regions
- Peak Calling: MACS2, HOMER for identifying enriched regions
- Motif Analysis: MEME, HOMER for discovering binding motifs
Pathway & Network Analysis
Enrichment Analysis: Gene Ontology (GO), KEGG pathways, Reactome. Statistical test: hypergeometric or Fisher's exact, with FDR correction.
Gene Regulatory Networks: Infer transcriptional regulation from expression data (ARACNe, SCENIC)
Protein-Protein Interactions: STRING database, co-immunoprecipitation networks
Systems Biology: Integrate multi-omics data (genomics, transcriptomics, proteomics, metabolomics)
🏥 Clinical Genomics & Precision Medicine
Cancer Genomics
Tumor sequencing identifies driver mutations, therapeutic targets, and resistance mechanisms.
- • Somatic Mutation Calling: Tumor-normal pairs (Mutect2, VarScan), clonal evolution
- • Copy Number Analysis: CNVkit, GISTIC for amplifications/deletions
- • Fusion Detection: STAR-Fusion, Arriba for gene fusions (e.g., BCR-ABL)
- • Immunotherapy Biomarkers: Tumor mutational burden (TMB), microsatellite instability (MSI)
- • Clinical Annotation: OncoKB, CIViC for actionable variants
Rare Disease Diagnosis
- Whole-Exome/Genome Sequencing: Comprehensive variant detection
- Variant Prioritization: Filter by inheritance pattern, allele frequency, predicted impact
- Phenotype Matching: HPO (Human Phenotype Ontology) terms, tools like Exomiser
- Diagnostic Rate: ~25-40% for rare diseases, higher with trio sequencing
Pharmacogenomics
Genetic variants affect drug metabolism, efficacy, and toxicity. PharmGKB database provides clinical annotations. Examples: CYP2D6 variants affect codeine metabolism; HLA-B*57:01 predicts abacavir hypersensitivity; TPMT variants guide thiopurine dosing. Increasingly integrated into electronic health records for clinical decision support.
⚡ Computational Challenges & Big Data
Data Scale Challenges
- Volume: Human genome = 3 GB raw, 30x WGS = 100 GB compressed
- Throughput: Single Illumina NovaSeq run = 6 TB of data
- Storage: Cloud-based solutions (AWS S3, Google Cloud), CRAM compression
- Computation: HPC clusters, workflow managers (Snakemake, Nextflow, WDL)
Algorithm Efficiency
- Indexing: FM-index (BWA), suffix arrays for fast alignment
- Hashing: k-mer based methods (Kraken, kallisto) for speed
- Approximation: MinHash, locality-sensitive hashing for similarity
- Parallelization: Multi-threading, GPU acceleration for ML models
Visualization & Integration
Genome Browsers: IGV (Integrative Genomics Viewer), UCSC Genome Browser for interactive exploration
Statistical Graphics: ggplot2, matplotlib for publication-quality figures
Interactive Dashboards: R Shiny, Plotly Dash for exploratory analysis and sharing results
🚀 Current Frontiers & Future Directions
Pan-Genomics
Moving beyond single reference genomes to graph-based representations capturing human diversity. Graph genomes (vg toolkit) represent multiple haplotypes simultaneously, improving variant calling in diverse populations. Human Pangenome Reference Consortium aims to represent global genetic variation.
AI/ML in Genomics
- • Deep Learning: CNNs for regulatory element prediction (DeepSEA, Basset)
- • Variant Calling: DeepVariant achieves state-of-the-art accuracy
- • Protein Design: RFdiffusion, ProteinMPNN for de novo design
- • Foundation Models: Genomic language models (DNABERT, Nucleotide Transformer)
- • Drug Discovery: AI-driven target identification and compound optimization
Emerging Technologies
- • Long-Read Sequencing: Telomere-to-telomere assemblies, complete human genome
- • Single-Cell Multi-Omics: Simultaneous DNA + RNA, ATAC + RNA, protein + RNA
- • Spatial Multi-Omics: Subcellular resolution transcriptomics and proteomics
- • Portable Sequencing: Nanopore MinION for field research and rapid diagnostics
- • Real-Time Analysis: Adaptive sequencing, selective sequencing of regions of interest
Integration & Systems Biology
Multi-omics integration combines genomics, transcriptomics, proteomics, metabolomics, and epigenomics to build comprehensive models of biological systems. Machine learning approaches (MOFA+, multi-view learning) identify coordinated changes across molecular layers. Goal: predictive models of disease and therapeutic response.
🔗 Statistical Foundations
All genomics analyses rely heavily on the statistical methods covered in the Statistical Methods for Computational Biology section above. From experimental design ensuring adequate statistical power, to multiple testing corrections for genome-wide analyses, to generalized linear models for differential expression—statistics is the language that allows us to make rigorous, reproducible inferences from genomic data. Understanding these statistical principles is essential for any computational biologist working with genomic datasets.
Protein Structure Prediction
Ab initio and homology modeling methods predict 3D structures from amino acid sequences. Modern deep learning approaches (AlphaFold2) achieve near-experimental accuracy by learning from evolutionary and structural databases.
Structure Prediction Methods
Traditional Approaches
- Homology Modeling: Template-based using known structures of homologs
- Threading: Sequence-structure compatibility scoring
- Ab Initio: Physics-based energy minimization from scratch
- Rosetta: Fragment assembly with energy functions
Deep Learning Methods
- AlphaFold2: Attention-based neural networks, MSA co-evolution
- RoseTTAFold: 3-track network (sequence, distance, coordinates)
- ESMFold: Language model-based, no MSA required
- Accuracy: Often within 1-2 Å RMSD of experimental structures
Energy Functions in Structure Prediction
Total potential energy combines multiple terms:
Energy minimization drives folding toward native structure
Bonded Terms
Bond lengths, angles, dihedrals
Non-Bonded Terms
Van der Waals, electrostatics
Solvation
Implicit/explicit water models
Statistical Methods for Computational Biology
Statistics is a fundamental tool for computational biologists, enabling quantitative understanding and rigorous analysis of biological data. From experimental design to interpreting results, statistical methods underpin reproducible computational biology research.
Statistics for Computational Biology Projects
Instructor
Erica Holdmore
Department of Data Science
Dana-Farber Cancer Institute
Format
Interactive Training Session
Hands-on statistical analysis
GitHub Repository
This comprehensive training covers essential statistical concepts for analyzing biological data. The interactive format combines theoretical foundations with practical applications, preparing researchers to design robust experiments and draw valid conclusions from complex datasets.
Topics Covered:
Foundations
- Experimental design principles
- Statistical power and sample size
- Hypothesis testing fundamentals
- P-values and significance levels
Data Processing
- Data cleaning and quality control
- Handling missing data
- Outlier detection and treatment
- Normalization and transformation
Analysis Methods
- Logistic regression for binary outcomes
- ANOVA (Analysis of Variance)
- Generalized Linear Models (GLM)
- Multiple comparisons corrections
Interpretation
- Effect sizes and confidence intervals
- Model diagnostics and validation
- Interpreting statistical results
- Communicating findings effectively
Applications in Computational Biology:
- Differential Gene Expression: Identifying statistically significant changes in RNA-seq data
- Variant Calling: Statistical models for distinguishing true variants from sequencing errors
- Protein-Protein Interactions: Assessing significance of co-immunoprecipitation experiments
- Drug Response Modeling: Logistic regression for predicting treatment outcomes
- Pathway Analysis: Multiple testing corrections for enrichment analysis
- Biomarker Discovery: Statistical validation of diagnostic/prognostic markers
MIT Computational Biology: Genomes, Networks, Evolution
Comprehensive lecture series by Manolis Kellis covering the computational methods and algorithms used to understand genomics, evolution, and biological networks. This course bridges algorithmic foundations with modern applications in genomics and systems biology.
Introduction & Core Algorithms
Lecture 01 - Introduction
Lecture 02 - Dynamic Programming
Sequence alignment, optimal path finding, and scoring matrices
Lecture 03 - Database Search
BLAST, heuristic search algorithms, and sequence similarity
Lectures 04-05 - Hidden Markov Models
Probabilistic models, Viterbi algorithm, and applications to gene finding
Gene Expression & RNA Biology
Lecture 06 - Gene Expression Analysis
Clustering, classification, and pattern recognition in expression data
Lecture 07 - RNA World, RNA-seq, RNA Folding
RNA structure prediction, transcriptomics, and secondary structure
Genomics & Regulatory Networks
Lecture 08 - Epigenomics
Chromatin states, histone modifications, and epigenetic regulation
Lecture 09 - Three Dimensional Genome
Chromosome conformation capture, Hi-C, and spatial organization
Lecture 10 - Regulatory Genomics
Transcription factors, enhancers, and gene regulation networks
Network Analysis & Machine Learning
Lecture 11 - Network Analysis
Protein-protein interactions, pathway analysis, and network topology
Lecture 12 - Deep Learning
Neural networks, convolutional architectures, and biological applications
Population & Evolutionary Genomics
Lecture 13 - Population Genomics
Genetic variation, population structure, and selection signatures
Lecture 14 - Genome-Wide Association Studies (GWAS)
Statistical methods for associating genetic variants with phenotypes
Lecture 15 - Expression Quantitative Trait Loci (eQTLs)
Linking genetic variation to gene expression changes
Lecture 16 - Heritability
Quantitative genetics and variance component analysis
Lecture 17 - Comparative Genomics
Cross-species alignment, conservation, and functional element discovery
Lecture 18 - Genome Assembly, Evolution, Duplication
De novo assembly algorithms and evolutionary mechanisms
Lecture 19 - Phylogenetics
Tree construction, parsimony, and maximum likelihood methods
Lecture 20 - Phylogenomics
Genome-scale phylogenetic analysis and gene tree reconciliation
Advanced Topics & Applications
Lecture 21 - Single-Cell Genomics
scRNA-seq analysis, cell type identification, and trajectory inference
Lecture 22 - Cancer Genomics
Somatic mutations, driver genes, and tumor evolution
Lecture 23 - Multi-Phenotype Analyses
Pleiotropy, genetic correlation, and cross-trait analysis
Lecture 24 - Genome Engineering
CRISPR, base editing, and computational design of genetic interventions
MLCB24: Machine Learning in Computational Biology (Fall 2024)
Advanced lecture series by Manolis Kellis on machine learning methods for computational biology. This course covers state-of-the-art deep learning approaches applied to genomics, protein structure, drug discovery, and disease analysis. Combines theoretical foundations with practical applications of transformers, graph neural networks, and generative models for biological problems.
Gene Expression & Single-Cell Analysis
Lecture 01 - Introduction
Course overview, machine learning fundamentals for biology, and key computational biology challenges
Lecture 02 - Expression Analysis: Clustering & Classification
Gene expression data analysis, clustering algorithms, classification methods, and dimensionality reduction
Lecture 03 - Single Cell Analysis
scRNA-seq analysis, cell type identification, trajectory inference, and batch correction methods
Sequence Analysis & Regulatory Genomics
Lecture 04 - Sequence Alignment
Dynamic programming for alignment, BLAST, multiple sequence alignment, and profile HMMs
Lecture 05 - Epigenomics & HMMs
Chromatin state discovery, hidden Markov models for genomic segmentation, and epigenetic regulation
Lecture 06 - Regulatory Circuitry
Transcription factor binding, enhancer prediction, and regulatory element discovery
Lecture 07 - Regulatory Circuitry and Networks
Gene regulatory networks, network inference algorithms, and systems-level regulation
Protein Structure & Deep Learning
Lecture 08 - Introduction to Protein Structure
Protein structure fundamentals, contact maps, and structure prediction challenges
Lecture 09 - Protein Folding Algorithms
Energy minimization, molecular dynamics, coevolution-based methods, and threading approaches
Lecture 10 - Protein Structure with Transformers
AlphaFold2 architecture, attention mechanisms, and transformer-based structure prediction
Lecture 11 - Protein Language Models
ESM, ProtBERT, protein embeddings, and self-supervised learning for proteins
Lecture 12 - DNA Language Models and Convolution
Convolutional neural networks for sequences, motif discovery, and DNA foundation models
Drug Discovery & Molecular Design
Lecture 13 - Drug Development Introduction
Drug discovery pipeline, target identification, lead optimization, and ADMET prediction
Lecture 14 - Chemistry & Graph Neural Networks
Molecular graphs, message passing networks, and property prediction with GNNs
Lecture 15 - Generating New Molecules
Generative models for molecules: VAEs, GANs, diffusion models, and flow-based generation
Lecture 16 - Training Neural Networks
Optimization algorithms, regularization, transfer learning, and training best practices
Disease Analysis & Clinical Applications
Lecture 17 - Disease: GWAS, PRS & Mechanism
Genome-wide association studies, polygenic risk scores, and mechanistic interpretation
Lecture 18 - Disease Mechanism
Variant effect prediction, causal inference, and pathway analysis for disease genetics
Lecture 19 - Electronic Health Records
Clinical data analysis, phenotype extraction, and machine learning for patient outcomes
Evolution, Metabolism & Metabolomics
Lecture 20 - Comparative Genomics
Multi-species alignment, conservation scoring, and functional element discovery
Lecture 21 - Evolution
Phylogenetic inference, molecular evolution models, and selection analysis
Lecture 22 - Metabolic Modeling
Flux balance analysis, metabolic networks, and constraint-based modeling
Lecture 23 - Computational Metabolomics
Mass spectrometry data analysis, metabolite identification, and pathway enrichment
Tools & Resources
MD Simulation Software
- • GROMACS: Fast, parallel MD for biomolecules
- • AMBER: Specialized for nucleic acids & proteins
- • NAMD: Scalable MD with CHARMM force fields
- • OpenMM: GPU-accelerated, Python API
- • LAMMPS: Large-scale atomic/molecular simulations
Bioinformatics Tools
- • BLAST: Sequence similarity search
- • Clustal Omega: Multiple sequence alignment
- • HMMER: Profile hidden Markov models
- • BioPython: Python tools for computational biology (Tutorial)
- • R/Bioconductor: Statistical analysis of genomic data
Structure Prediction
- • AlphaFold2: Deep learning structure prediction
- • Rosetta: Energy-based modeling suite
- • MODELLER: Homology modeling
- • I-TASSER: Threading & ab initio prediction
- • SWISS-MODEL: Web-based homology modeling
Databases
- • PDB: Protein Data Bank (structures)
- • UniProt: Protein sequences & annotations
- • GenBank: NIH genetic sequence database
- • Pfam: Protein families & domains
- • AlphaFold DB: 200M+ predicted structures