Part V: Multi-Omics Integration | Chapter 19

Machine Learning in Omics

From classical statistical learning to deep neural networks for biological discovery and clinical prediction

19.1 Supervised Learning for Omics

Supervised learning algorithms learn a mapping from input features (e.g., gene expression profiles) to output labels (e.g., disease status, drug response) using labeled training data. In omics, the defining challenge is the "large p, small n" problem: datasets typically contain thousands to millions of features ($p$) but only tens to hundreds of samples ($n$). This high dimensionality creates severe risks of overfitting and demands careful regularization, feature selection, and validation strategies.

Support Vector Machines (SVMs)

SVMs find the maximum-margin hyperplane that separates classes in feature space. For linearly separable data with labels $y_i \in \{-1, +1\}$ and features $\mathbf{x}_i \in \mathbb{R}^p$, the soft-margin SVM solves:

SVM Primal Objective

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Where $\mathbf{w}$ is the weight vector, $b$ is the bias, $\xi_i$ are slack variables allowing misclassification, and $C > 0$ controls the trade-off between margin width and training error. The kernel trick replaces inner products with $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$, implicitly mapping data into a higher-dimensional space and enabling non-linear decision boundaries. Common kernels include the radial basis function (RBF): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$.
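A minimal scikit-learn sketch of an RBF-kernel SVM on an omics-style matrix (samples × genes). The data here are random and purely illustrative; scaling inside a pipeline keeps preprocessing out of the validation folds.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # n = 100 samples, p = 5000 features
y = rng.integers(0, 2, size=100)   # binary labels, e.g., disease status

# C trades margin width against training error; gamma sets the RBF width
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.2f}")
```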

Random Forests

Random forests are ensemble methods that aggregate predictions from many decision trees, each trained on a bootstrap sample of the data with a random subset of features at each split. They handle high-dimensional data naturally, provide built-in feature importance measures, and are relatively robust to hyperparameter choices. For classification, the predicted class is the majority vote; for regression, it is the average prediction. Out-of-bag (OOB) error provides an unbiased estimate of generalization performance without a separate validation set.
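A short sketch of a random forest with the OOB estimate and impurity-based feature ranking, reusing the illustrative `X`, `y` from the SVM example above.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)                                   # X, y from the SVM sketch
print(f"OOB accuracy: {rf.oob_score_:.2f}")    # no separate validation set needed

# Impurity-based importances can be used for preliminary biomarker ranking
top = rf.feature_importances_.argsort()[::-1][:10]
print("Top 10 features by importance:", top)
```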

Neural Networks

Feedforward neural networks (multilayer perceptrons) learn non-linear mappings through compositions of affine transformations and non-linear activation functions. A layer computes $\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$, where $\sigma$ is an activation function (ReLU, sigmoid, tanh). For multi-class classification with $K$ classes, the output layer uses the softmax function:

Softmax Function

$$\text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad k = 1, \dots, K$$

Training minimizes the cross-entropy loss:

$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}$$

Where $y_{ik}$ is the one-hot encoded true label and $\hat{p}_{ik}$ is the predicted probability from softmax. Backpropagation computes gradients, and optimizers (Adam, SGD with momentum) update weights iteratively.
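As a concrete check of these two definitions, here is a minimal NumPy sketch; the logits and labels are illustrative.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability; the result is unchanged
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    # -sum_i sum_k y_ik log p_hat_ik, matching the loss above
    return -np.sum(y_onehot * np.log(p_hat + eps))

logits = np.array([[2.0, 0.5, -1.0]])   # one sample, K = 3 classes
y_true = np.array([[1.0, 0.0, 0.0]])    # one-hot true label
print(cross_entropy(y_true, softmax(logits)))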

Comparison of Supervised Methods for Omics

| Method | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| SVM | Effective in high dimensions; kernel flexibility | $O(n^2)$ scaling; limited interpretability | Small-to-medium sample sizes |
| Random Forest | Feature importance; robust; handles mixed types | Correlated features share importance | Exploratory analysis; biomarker ranking |
| Neural Networks | Arbitrary non-linearities; representation learning | Data-hungry; black-box; overfitting risk | Large datasets; multi-modal inputs |
| Logistic Regression | Interpretable coefficients; probabilistic output | Linear decision boundary; needs regularization | Baseline model; clinical risk scores |

19.2 Unsupervised Learning & Dimensionality Reduction

Unsupervised methods discover structure in omics data without predefined labels. They are essential for exploratory analysis: identifying patient subtypes, revealing batch effects, detecting co-regulated gene modules, and visualizing high-dimensional data in two or three dimensions.

Principal Component Analysis (PCA)

PCA finds orthogonal directions of maximum variance by computing the eigendecomposition of the covariance matrix. Given a centered data matrix $X \in \mathbb{R}^{n \times p}$ (rows are samples):

PCA Eigenvalue Decomposition

$$\Sigma = \frac{1}{n-1} X^\top X = V \Lambda V^\top$$

Where $\Sigma$ is the $p \times p$ sample covariance matrix, $V = [\mathbf{v}_1, \dots, \mathbf{v}_p]$ contains the eigenvectors (principal components), and $\Lambda = \text{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_1 \geq \lambda_2 \geq \dots$ holds the eigenvalues (variance explained). The proportion of variance explained by the first $k$ components is $\sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i$. In practice, the singular value decomposition (SVD) of $X$ is preferred for numerical stability.
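A NumPy sketch of PCA via the SVD route the text recommends; the data matrix is illustrative. The key identity is $\lambda_i = s_i^2 / (n-1)$, relating singular values of the centered matrix to covariance eigenvalues.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 2000))  # illustrative data
Xc = X - X.mean(axis=0)                                # center each feature

# Thin SVD: Xc = U S Vt; the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = S**2 / (Xc.shape[0] - 1)                 # lambda_i of the covariance
explained = eigenvalues / eigenvalues.sum()

scores = Xc @ Vt[:2].T                                 # project onto first 2 PCs
print(f"Variance explained by PC1+PC2: {explained[:2].sum():.2%}")
```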

t-SNE & UMAP

While PCA preserves global variance structure, non-linear methods are better at revealing local neighborhood relationships in complex datasets, particularly single-cell data.

t-SNE

t-distributed Stochastic Neighbor Embedding converts high-dimensional pairwise distances into conditional probabilities, using Gaussian kernels in the high-dimensional space and a Student-t distribution in the low-dimensional embedding, then minimizes the KL divergence between the two distributions. The perplexity parameter (typically 5–50) balances local vs. global structure. Note: t-SNE does not preserve global distances; cluster separations and sizes in the embedding are not meaningful.

UMAP

Uniform Manifold Approximation and Projection is grounded in Riemannian geometry and algebraic topology. It constructs a fuzzy simplicial complex from the high-dimensional data and optimizes a low-dimensional representation that preserves its topological structure. UMAP is faster than t-SNE, preserves global structure better, and supports embedding new data points. Default parameters: n_neighbors=15, min_dist=0.1.
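A usage sketch for both embeddings, assuming scikit-learn and the umap-learn package are installed; parameters are the defaults quoted above, and `X` is the illustrative matrix from Section 19.1.

```python
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

emb_tsne = TSNE(perplexity=30, random_state=0).fit_transform(X)
emb_umap = umap.UMAP(n_neighbors=15, min_dist=0.1,
                     random_state=0).fit_transform(X)
# A fitted UMAP model can embed new points via its .transform() method
```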

Clustering Methods

Clustering assigns samples or features to groups based on similarity. In omics, clustering identifies patient subtypes, cell populations, or co-expressed gene modules.

  • k-means: Partitions data into $k$ clusters by iteratively minimizing the within-cluster sum of squares: $\min \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$. Requires specifying $k$ (use silhouette scores or the gap statistic for selection; see the sketch after this list). Fast but assumes spherical clusters.
  • Hierarchical clustering: Builds a tree (dendrogram) of nested clusters using agglomerative (bottom-up) or divisive (top-down) approaches. Linkage criteria (Ward, complete, average) determine merge distances. Widely used for heatmaps of gene expression. Does not require specifying $k$ a priori.
  • Leiden / Louvain: Graph-based community detection used extensively in single-cell analysis. Constructs a k-nearest-neighbor graph and optimizes modularity. Leiden improves upon Louvain by guaranteeing well-connected communities. Resolution parameter controls granularity.
  • Gaussian Mixture Models (GMM): Probabilistic clustering assuming data arise from a mixture of Gaussians. Fit via EM algorithm. Provides soft cluster assignments (probabilities). BIC or AIC for model selection.
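A sketch of the silhouette-based choice of $k$ mentioned in the k-means item, again on the illustrative `X`.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)   # higher = tighter, better-separated clusters
    if score > best_score:
        best_k, best_score = k, score
print(f"Best k by silhouette: {best_k} (score {best_score:.2f})")
```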

19.3 Feature Selection & Regularization

With omics datasets containing thousands of features and relatively few samples, feature selection is critical for building interpretable and generalizable models. Reducing the feature space mitigates the curse of dimensionality, reduces computational cost, and improves model performance by eliminating noisy or redundant variables.

LASSO Regression ($\ell_1$ Regularization)

The Least Absolute Shrinkage and Selection Operator adds an $\ell_1$ penalty to the ordinary least squares objective, inducing sparsity by shrinking many coefficients exactly to zero:

LASSO Objective

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1$$

Where $\lambda > 0$ is the regularization strength. As $\lambda$ increases, more coefficients become zero, yielding a sparser model. The non-differentiability of the $\ell_1$ norm at zero is what produces exact sparsity. LASSO can select at most $\min(n, p)$ features and tends to arbitrarily select one from groups of correlated features.
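A sketch using LassoCV, which selects $\lambda$ (called `alpha` in scikit-learn) by cross-validation; the continuous outcome `y_cont` is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

y_cont = np.random.default_rng(0).normal(size=X.shape[0])  # illustrative outcome
lasso = LassoCV(cv=5, random_state=0).fit(X, y_cont)
selected = np.flatnonzero(lasso.coef_)                     # nonzero coefficients
print(f"lambda = {lasso.alpha_:.4f}; {selected.size} features selected")
```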

Elastic Net

Elastic net combines $\ell_1$ and $\ell_2$ penalties to overcome LASSO's limitations with correlated features:

Elastic Net Objective

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2$$

The $\ell_2$ penalty encourages grouping of correlated features (selecting all or none together), while the $\ell_1$ penalty maintains sparsity. The mixing parameter $\alpha = \lambda_1 / (\lambda_1 + \lambda_2)$ balances between ridge ($\alpha = 0$) and LASSO ($\alpha = 1$).
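A sketch with ElasticNetCV; note that scikit-learn calls the text's mixing parameter $\alpha$ `l1_ratio` and uses `alpha` for the overall penalty strength. `X` and `y_cont` are the illustrative arrays from the LASSO sketch.

```python
from sklearn.linear_model import ElasticNetCV

enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5,
                    random_state=0).fit(X, y_cont)
print(f"best l1_ratio = {enet.l1_ratio_}, alpha = {enet.alpha_:.4f}")
```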

Other Feature Selection Approaches

Mutual Information-Based Selection

Ranks features by their mutual information with the target variable, $I(X_j; Y)$. Captures non-linear dependencies unlike correlation-based methods. mRMR (minimum Redundancy Maximum Relevance) extends this by also penalizing redundancy among selected features: $\max \left[ I(X_j; Y) - \frac{1}{|S|} \sum_{X_s \in S} I(X_j; X_s) \right]$.
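A sketch of the basic MI ranking (without the mRMR redundancy term), using the illustrative `X`, `y`.

```python
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X, y, random_state=0)  # estimated I(X_j; Y) per feature
ranking = mi.argsort()[::-1]
print("Top 10 features by MI:", ranking[:10])
```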

Recursive Feature Elimination (RFE)

Wrapper method that iteratively trains a model (e.g., SVM or RF), ranks features by importance, removes the least important, and repeats until the desired number of features remains. RFE-CV uses cross-validation to select the optimal feature subset size. Computationally expensive, but accounts for feature interactions during selection.
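A sketch of RFE-CV with a linear SVM (whose coefficients supply the feature ranking); `step=0.1` drops 10% of the remaining features per round.

```python
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

selector = RFECV(SVC(kernel="linear"), step=0.1, cv=5)
selector.fit(X, y)                       # illustrative X, y from Section 19.1
print(f"Optimal number of features: {selector.n_features_}")
```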

19.4 Cross-Validation & Model Evaluation

Reliable estimation of model performance is crucial in omics, where small sample sizes make overfitting a constant threat. Cross-validation (CV) provides an unbiased estimate of how well a model generalizes to unseen data by systematically partitioning the dataset into training and validation folds.

Cross-Validation Strategies

  • k-Fold CV: Data split into $k$ folds; each fold serves as validation once while remaining $k-1$ folds are used for training. Typically $k = 5$ or $k = 10$. Bias-variance trade-off: larger $k$ reduces bias but increases variance and computation.
  • Leave-One-Out CV (LOOCV): Special case where $k = n$. Nearly unbiased but high variance. Practical only for small datasets or computationally cheap models.
  • Stratified CV: Ensures each fold has approximately the same class distribution as the full dataset. Essential for imbalanced classes (common in clinical omics: few disease cases vs. many controls).
  • Nested CV: Outer loop estimates generalization performance; inner loop tunes hyperparameters. Prevents optimistic bias from tuning on the test fold. Critical for honest reporting in omics studies (sketched after this list).
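A nested-CV sketch: GridSearchCV tunes the SVM's $C$ in the inner loop, and cross_val_score estimates generalization in the outer loop. The grid and the illustrative `X`, `y` are assumptions for demonstration.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.svm import SVC

inner = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]},
                     cv=StratifiedKFold(5))          # hyperparameter tuning
outer_scores = cross_val_score(inner, X, y,
                               cv=StratifiedKFold(5))  # honest performance estimate
print(f"Nested CV accuracy: {outer_scores.mean():.2f} ± {outer_scores.std():.2f}")
```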

Evaluation Metrics

The choice of evaluation metric depends on the clinical context. Accuracy can be misleading for imbalanced datasets; a model predicting "no cancer" for every patient achieves 99% accuracy if only 1% have cancer.

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Overall correctness; misleading if classes imbalanced |
| Precision (PPV) | $\frac{TP}{TP + FP}$ | Of predicted positives, how many are true? |
| Recall (Sensitivity) | $\frac{TP}{TP + FN}$ | Of actual positives, how many were detected? |
| F1 Score | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ | Harmonic mean; balances precision and recall |
| AUC-ROC | $\int_0^1 \text{TPR}(t) \, d\text{FPR}(t)$ | Threshold-independent; probability that a random positive ranks higher than a random negative |
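The table's metrics computed with scikit-learn on a small illustrative example; note that AUC-ROC takes predicted probabilities rather than hard labels.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3]   # predicted P(y = 1)

print(f"accuracy  = {accuracy_score(y_true, y_pred):.2f}")
print(f"precision = {precision_score(y_true, y_pred):.2f}")
print(f"recall    = {recall_score(y_true, y_pred):.2f}")
print(f"F1        = {f1_score(y_true, y_pred):.2f}")
print(f"AUC-ROC   = {roc_auc_score(y_true, y_prob):.2f}")
```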

Common Pitfalls in Omics ML

  • Data leakage: Feature selection or normalization performed before the train/test split allows information from the test set to influence the model, producing overly optimistic estimates (a leakage-free pipeline is sketched after this list).
  • Confounding batch effects: If disease status is correlated with processing batch, a model may learn batch signatures rather than biological signal.
  • No external validation: Performance assessed only on the same cohort is inflated. Independent validation cohorts are essential for clinical translatability.
  • Publication bias: Reporting only the best model from many trials produces the "winner's curse"—true performance is likely lower.
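A sketch of the leakage-free pattern: scaling and feature selection live inside a Pipeline, so both are re-fit on the training portion of every CV fold rather than on the full dataset. The `k=100` cutoff and the illustrative `X`, `y` are assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=100)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)   # no test-fold information leaks
```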

19.5 Deep Learning for Omics

Deep learning has achieved remarkable successes in omics, particularly for tasks involving raw sequence data, images, and large-scale multi-modal datasets where hand-crafted features are insufficient. The key advantage is automatic feature learning: deep networks learn hierarchical representations directly from data.

Autoencoders for Denoising & Imputation

Autoencoders learn compressed representations (bottleneck) of input data through an encoder-decoder architecture. Variational autoencoders (VAEs) add a probabilistic framework, enforcing a structured latent space. In single-cell analysis, scVI uses a deep generative model to account for library size, batch effects, and dropout noise simultaneously.

Variational Autoencoder (VAE) Loss

$$\mathcal{L}_{\text{VAE}} = -\,\mathbb{E}_{q(\mathbf{z}|\mathbf{x})} \left[ \log p(\mathbf{x}|\mathbf{z}) \right] + D_{\text{KL}}\bigl(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\bigr)$$

The first term is the reconstruction loss (how well the decoder recovers the input). The KL divergence term regularizes the latent space to approximate the prior $p(\mathbf{z}) = \mathcal{N}(0, I)$, ensuring smooth and interpretable latent representations. Minimizing this loss is equivalent to maximizing the evidence lower bound (ELBO).
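A PyTorch sketch of this loss, assuming a Gaussian decoder (mean squared error reconstruction) and the standard closed-form KL term; `mu` and `logvar` are the encoder's outputs parameterizing $q(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mu, e^{\text{logvar}})$.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    # Reconstruction term: Gaussian likelihood -> mean squared error
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```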

CNNs for Sequence Data

Convolutional neural networks excel at detecting local patterns (motifs) in DNA and protein sequences. DeepBind and DeepSEA pioneered the use of 1D CNNs for predicting transcription factor binding sites, chromatin accessibility, and variant effects from sequence alone. The convolution operation scans learned filters across the sequence, and deeper layers capture higher-order combinatorial patterns.
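A PyTorch sketch in the spirit of DeepBind-style models: a 1D convolution scans motif-length filters over one-hot-encoded DNA (4 channels). The architecture and all dimensions here are illustrative, not those of the published models.

```python
import torch
import torch.nn as nn

class MotifCNN(nn.Module):
    def __init__(self, n_filters=64, motif_len=15):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=motif_len)  # motif scanner
        self.pool = nn.AdaptiveMaxPool1d(1)   # strongest match per filter
        self.fc = nn.Linear(n_filters, 1)     # binding-probability logit

    def forward(self, x):                     # x: (batch, 4, seq_len)
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)
        return self.fc(h)

model = MotifCNN()
logits = model(torch.randn(8, 4, 200))        # 8 random stand-in sequences
```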

Transformers & Protein Language Models

Transformer architectures, originally developed for natural language processing, have revolutionized protein structure prediction (AlphaFold2) and protein function annotation. Models such as ESM (Evolutionary Scale Modeling) and ProtTrans are trained on hundreds of millions of protein sequences using masked language modeling. They learn contextual embeddings that capture evolutionary and structural information, enabling zero-shot prediction of mutation effects, secondary structure, and protein-protein interactions.

Model Interpretation: SHAP & Attention

Interpretability is essential for biological discovery and clinical trust. SHAP (SHapley Additive exPlanations) values assign each feature an additive contribution to a prediction, based on cooperative game theory. For deep learning models, attention weights reveal which input positions (e.g., amino acid residues, genomic loci) the model focuses on when making predictions. Integrated gradients and saliency maps provide alternative attribution methods.
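A brief usage sketch assuming the `shap` package is installed; TreeExplainer works with tree ensembles such as the random forest fitted in Section 19.1.

```python
import shap

explainer = shap.TreeExplainer(rf)       # rf from the random-forest sketch
shap_values = explainer.shap_values(X)   # per-sample, per-feature contributions
shap.summary_plot(shap_values, X)        # global feature-importance view
```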

Transfer Learning in Omics

Transfer learning leverages knowledge from a model pre-trained on a large dataset to improve performance on a smaller target dataset. This is particularly valuable in omics where labeled clinical datasets are expensive to generate.

  • Domain adaptation: Pre-train on cell lines, fine-tune on patient samples
  • Cross-species transfer: Pre-train on mouse data, adapt to human
  • Foundation models: Large pre-trained models (Geneformer for single-cell, ESM for proteins) serve as general-purpose feature extractors that can be fine-tuned for specific downstream tasks (a freeze-and-fine-tune sketch follows this list)
  • Multi-task learning: Jointly predict multiple related outcomes (e.g., drug sensitivity across cancer types) to share statistical strength
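A hypothetical PyTorch sketch of the freeze-and-fine-tune pattern: the pre-trained encoder's weights are frozen and only a small task head is trained. The `nn.Sequential` encoder here is a placeholder standing in for a real pre-trained model, not a library class.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(2000, 256), nn.ReLU())  # stand-in for a pre-trained model
for p in encoder.parameters():
    p.requires_grad = False           # freeze pre-trained weights

head = nn.Linear(256, 2)              # new task-specific classifier
model = nn.Sequential(encoder, head)

# Only the head's parameters are handed to the optimizer
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
```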

19.6 Overfitting, Regularization & Benchmarking

Overfitting occurs when a model captures noise in the training data rather than the underlying biological signal, resulting in excellent training performance but poor generalization. In omics, the extreme $p \gg n$ regime makes this an ever-present concern. Regularization techniques constrain model complexity to combat overfitting.

Regularization Techniques

| Technique | Mechanism | Application |
| --- | --- | --- |
| $\ell_1$ (LASSO) | Drives weights to zero (sparsity) | Feature selection in linear models |
| $\ell_2$ (Ridge / weight decay) | Shrinks weights toward zero | All models; prevents extreme weights |
| Dropout | Randomly zeroes neuron activations during training | Neural networks; implicit ensemble |
| Early stopping | Halts training when validation loss stops improving | Neural networks; gradient boosting |
| Batch normalization | Normalizes layer inputs; smooths loss landscape | Deep networks; stabilizes training |
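Two of these techniques combined in one scikit-learn sketch: `alpha` is the $\ell_2$ weight-decay strength, and `early_stopping` holds out a validation fraction to halt training. Hyperparameter values are illustrative.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(128,), alpha=1e-3,   # L2 penalty
                    early_stopping=True, validation_fraction=0.1,
                    max_iter=500, random_state=0)
mlp.fit(X, y)   # illustrative X, y from Section 19.1
```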

Benchmarking Best Practices

Rigorous benchmarking is essential for comparing ML methods in omics. The DREAM challenges and MAQC consortium have established gold standards for evaluation. Key principles include:

  • Use standardized datasets: Published benchmarks with known ground truth enable fair comparison across studies
  • Report multiple metrics: AUC-ROC, AUC-PR, F1, and calibration plots provide complementary views of performance
  • Statistical comparison: Use paired tests (Wilcoxon signed-rank) or corrected resampled t-tests across CV folds to assess the significance of performance differences (see the sketch after this list)
  • Include baselines: Always compare to simple baselines (logistic regression, random classifier) to quantify the value added by complex models
  • Report computational cost: Training time, memory, and inference speed matter for clinical deployment
  • Code and data availability: Full reproducibility requires sharing preprocessing code, model architectures, hyperparameters, and random seeds
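A sketch of the paired Wilcoxon signed-rank comparison across CV folds, reusing the pipeline from Section 19.4 and the random forest from Section 19.1; passing the same integer `cv` yields identical folds, so the scores are paired.

```python
from scipy.stats import wilcoxon
from sklearn.model_selection import cross_val_score

scores_a = cross_val_score(pipe, X, y, cv=10)   # pipeline from Section 19.4
scores_b = cross_val_score(rf, X, y, cv=10)     # random-forest baseline
stat, p = wilcoxon(scores_a, scores_b)          # paired, non-parametric test
print(f"Wilcoxon p-value: {p:.3f}")
```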