Part II: Transcriptomics | Chapter 7

RNA-Seq: Principles & Analysis

Next-generation sequencing of the transcriptome: from library preparation to differential expression and pathway analysis

1. RNA-Seq Library Preparation

RNA-Seq (RNA sequencing) uses next-generation sequencing (NGS) to survey the transcriptome at unprecedented resolution and dynamic range. The preparation of a sequencing library from total RNA involves multiple biochemical steps, each of which can introduce biases that must be understood and accounted for in downstream analysis. The quality of the starting RNA material is critical; RNA integrity is typically assessed using the RNA Integrity Number (RIN) from an Agilent Bioanalyzer, with RIN values above 7 generally considered acceptable for most RNA-Seq protocols.

Library Preparation Steps

1
RNA Selection / Depletion

Ribosomal RNA (rRNA) constitutes 80-90% of total cellular RNA. Two strategies enrich for informative transcripts: Poly(A) selection uses oligo(dT) beads to capture polyadenylated mRNAs, which is efficient but loses non-polyadenylated transcripts (histones, many lncRNAs, degraded RNA). rRNA depletion uses biotinylated probes complementary to rRNA to subtract it, retaining both mRNA and non-coding RNA. This is preferred for degraded samples (e.g., FFPE tissue) and when non-coding RNA is of interest.

2
Fragmentation

RNA is fragmented to sizes compatible with sequencing (typically 200-500 nt). Chemical fragmentation using divalent cations (Mg2+, Zn2+) at elevated temperature provides relatively uniform fragment distribution. Enzymatic fragmentation with RNase III is an alternative. Fragmentation before reverse transcription (RNA fragmentation) produces more uniform coverage than post-cDNA fragmentation, as it avoids the 3' bias caused by oligo(dT) priming.

3
Reverse Transcription & Second-Strand Synthesis

RNA fragments are reverse-transcribed into first-strand cDNA using random hexamer primers (preferred for uniform coverage) or oligo(dT) primers. Second-strand cDNA is then synthesized. For strand-specific (stranded) libraries, second-strand synthesis incorporates dUTP instead of dTTP, and the dU-containing strand is later degraded by uracil-DNA glycosylase (UDG), preserving strand information. Stranded protocols enable disambiguation of overlapping sense and antisense transcripts.

4
Adapter Ligation & PCR Amplification

Sequencing adapters are ligated to both ends of the double-stranded cDNA fragments after end-repair and A-tailing. Adapters contain sequences for flow cell binding (P5/P7), sequencing primer binding, and sample-specific index sequences for multiplexing. Limited PCR amplification (8-15 cycles) enriches adapter-ligated fragments, but introduces GC bias and duplicate reads. PCR-free protocols or unique molecular identifiers (UMIs) mitigate amplification artifacts.

Illumina Sequencing Workflow

The Illumina sequencing-by-synthesis (SBS) platform dominates RNA-Seq. Library fragments hybridize to complementary oligos on the flow cell surface and undergo bridge amplification to form clonal clusters. During sequencing, fluorescently labeled reversible terminators are incorporated one nucleotide at a time. After each incorporation event, the flow cell is imaged to identify the incorporated base, and the terminator and fluorophore are chemically cleaved to allow the next cycle. Modern Illumina platforms (NovaSeq 6000, NovaSeq X Plus) generate 1-20 billion paired-end reads per run with read lengths of 100-300 bp, providing sufficient depth for most transcriptomic applications.

2. Quality Control & Read Alignment

Quality Control with FastQC

FastQC provides a comprehensive quality assessment of raw sequencing reads. Key metrics include: per-base sequence quality (Phred scores should be >28 across most positions), per-sequence quality scores, per-base sequence content (should be uniform after the first 10-15 bases, where random hexamer priming bias is common), GC content distribution (should approximate normal), sequence duplication levels, overrepresented sequences (may indicate adapter contamination or rRNA), and adapter content. Reads failing quality thresholds are trimmed or filtered using tools such as Trimmomatic, fastp, or Cutadapt before alignment.

Splice-Aware Alignment

Unlike DNA sequencing reads, RNA-Seq reads can span exon-exon junctions. Standard DNA aligners (e.g., BWA, Bowtie2) cannot handle these split alignments efficiently. Dedicated splice-aware aligners use different strategies to identify junction-spanning reads.

Major RNA-Seq Alignment Tools

ToolAlgorithmKey FeaturesSpeed
STARSuffix array-based seed search + stitchingHigh sensitivity for novel junctions; 2-pass mode improves junction detectionVery fast (~30 GB RAM)
HISAT2Hierarchical graph FM index (HGFM)Memory-efficient; supports SNP-aware alignment with graph genomesFast (~8 GB RAM)
minimap2Minimizer-based chainingDesigned for long reads (PacBio, ONT); also handles short readsVery fast

Pseudoalignment: Salmon & Kallisto

Pseudoalignment methods bypass the computationally expensive step of determining the exact genomic position of each read. Instead, they determine the set of transcripts compatible with each read using k-mer or fragment-level equivalence classes. Kallistouses a de Bruijn graph-based index of the transcriptome to rapidly assign reads to equivalence classes, then applies the expectation-maximization (EM) algorithm to estimate transcript abundances. Salmon uses a similar approach with quasi-mapping and additionally models sequence-specific and GC content biases, fragment length distributions, and positional biases. Both tools are orders of magnitude faster than alignment-based methods (processing 30 million reads in 2-5 minutes) and produce transcript-level quantifications directly.

3. Read Quantification & Normalization

Gene-Level Quantification

For alignment-based workflows, reads are counted against annotated gene models. featureCounts (from the Subread package) and HTSeq-countassign aligned reads to genomic features (genes, exons) based on overlap with the annotation file (GTF/GFF). Key parameters include the counting mode (union, intersection-strict, intersection-nonempty), how multi-mapping reads are handled (typically discarded or fractionally assigned), and whether strand information is used. For pseudoalignment outputs, the tximport R package aggregates transcript-level estimates to the gene level while accounting for changes in transcript length across samples.

DESeq2 Size Factors

DESeq2 uses the median-of-ratios method to estimate sample-specific size factors that account for differences in sequencing depth and RNA composition. This normalization is more robust than total count normalization because it is not influenced by a few very highly expressed genes.

The DESeq2 size factor for sample \(j\) is computed as: $$s_j = \text{median}_i \left( \frac{K_{ij}}{\left(\prod_{v=1}^{m} K_{iv}\right)^{1/m}} \right)$$ where \(K_{ij}\) is the raw count of gene \(i\) in sample \(j\), and \(m\) is the total number of samples. The denominator represents the geometric mean of counts for gene \(i\) across all samples, creating a pseudo-reference sample. Genes with a geometric mean of zero (unexpressed in any sample) are excluded from the calculation. The normalized count is then: $$q_{ij} = \frac{K_{ij}}{s_j}$$

TMM Normalization (edgeR)

The Trimmed Mean of M-values (TMM) method, implemented in edgeR, calculates a scaling factor by taking a weighted trimmed mean of the log-ratios between each sample and a reference sample. The trimming removes genes with extreme log-fold-changes (typically the top and bottom 30%) and extreme absolute expression levels (top and bottom 5%), which are most likely to violate the assumption that most genes are not differentially expressed.

Normalization Method Comparison

MethodPackageApproachAssumption
Size factorsDESeq2Median of ratios to pseudo-referenceMost genes not DE
TMMedgeRTrimmed mean of log-ratiosMost genes not DE
Upper quartileedgeR75th percentile of non-zero countsRobust to outliers
TPMVariousLength + depth normalizationWithin-sample comparison

4. Differential Expression with DESeq2

RNA-Seq count data are discrete and overdispersed relative to the Poisson distribution. The negative binomial (NB) distribution is the standard model for RNA-Seq counts because it accommodates both the mean-variance relationship inherent in count data and the biological variability between replicates. DESeq2 is the most widely used tool for differential expression analysis of RNA-Seq data.

The Negative Binomial Model

DESeq2 models the count \(K_{ij}\) for gene \(i\) in sample \(j\) as: $$K_{ij} \sim \text{NB}(\mu_{ij}, \alpha_i)$$ where: $$\mu_{ij} = s_j \cdot q_{ij}$$ and \(s_j\) is the size factor for sample \(j\) and \(q_{ij}\) is the true (normalized) expression. The negative binomial distribution has probability mass function: $$P(K = k) = \binom{k + r - 1}{k} p^r (1-p)^k$$ In DESeq2's parameterization with mean \(\mu\) and dispersion \(\alpha\): $$\text{Var}(K_{ij}) = \mu_{ij} + \alpha_i \mu_{ij}^2$$ The dispersion parameter \(\alpha_i\) captures the biological coefficient of variation squared. When \(\alpha_i = 0\), the NB reduces to the Poisson distribution.

DESeq2 estimates gene-wise dispersions using a three-step approach: (1) maximum likelihood estimation of gene-wise dispersions, (2) fitting a trend line (dispersion-mean relationship) across all genes, and (3) shrinking gene-wise estimates toward the trend using an empirical Bayes procedure. This approach is critical for experiments with small sample sizes, where individual gene dispersion estimates are imprecise.

Wald Test for Differential Expression

The log2 fold change \(\beta_i\) for gene \(i\) is estimated via a generalized linear model (GLM) with NB family: $$\log_2(\mu_{ij}) = \sum_r x_{jr} \beta_{ir}$$ where \(x_{jr}\) are design matrix elements. The Wald statistic tests whether the estimated log2 fold change differs significantly from zero: $$W_i = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}$$ Under the null hypothesis, \(W_i \sim N(0,1)\). The p-value is calculated from the standard normal distribution: $$p_i = 2 \cdot \Phi\left(-|W_i|\right)$$ DESeq2 also applies log2 fold change shrinkage (apeglm, ashr, or normal prior) to reduce noise in LFC estimates for genes with low counts or high variability, which is essential for ranking genes and for visualization.

Likelihood Ratio Test (LRT)

For complex experimental designs with multiple factors or multiple levels, the Likelihood Ratio Test (LRT) is preferred over the Wald test. The LRT compares the fit of a full model (including the factor of interest) to a reduced model (without that factor). The test statistic follows a chi-squared distribution with degrees of freedom equal to the difference in parameters between the two models. The LRT is particularly useful for time-course experiments and designs with interaction terms.

5. Gene Set Enrichment & Pathway Analysis

While individual gene-level analysis identifies specific differentially expressed genes, gene set enrichment analysis (GSEA) and pathway analysis provide a higher-level biological interpretation by testing whether predefined sets of functionally related genes show coordinated changes in expression. This approach increases statistical power by aggregating modest individual gene effects and provides mechanistic insight into the biological processes affected by the experimental condition.

Gene Set Enrichment Analysis (GSEA)

GSEA, developed by Subramanian et al. (2005), uses all genes in the experiment rather than only those exceeding an arbitrary significance cutoff. Genes are ranked by their expression change (e.g., signal-to-noise ratio, log2 fold change, or Wald statistic). The algorithm then walks down the ranked list, increasing a running sum when a gene in the set is encountered and decreasing it when a gene not in the set is encountered.

The GSEA enrichment score (ES) for a gene set \(S\) is computed as follows. Given \(N\) total genes ranked by correlation \(r_j\), the running sum is: $$ES(S) = \max_{1 \leq j \leq N} \left| \sum_{\substack{g_j \in S \\ j' \leq j}} \frac{|r_{j'}|^p}{N_R} - \sum_{\substack{g_j \notin S \\ j' \leq j}} \frac{1}{N - |S|} \right|$$ where: $$N_R = \sum_{g_j \in S} |r_j|^p$$ and \(p\) is a weighting exponent (typically \(p = 1\) for weighted enrichment). The maximum deviation from zero of the running sum is the enrichment score. Statistical significance is assessed by permuting phenotype labels (preferred) or gene labels, computing ES for each permutation, and deriving empirical p-values. The normalized enrichment score (NES) accounts for gene set size differences.

Common Gene Set Databases

MSigDB (Molecular Signatures Database)

Curated collection including Hallmark gene sets (50 well-defined biological states/processes), C2 (curated pathways from KEGG, Reactome, BioCarta), C5 (Gene Ontology terms), and C7 (immunologic signatures). Over 33,000 gene sets for human.

Gene Ontology (GO)

Hierarchical classification of gene functions into three domains: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Over-representation analysis (ORA) with hypergeometric test or Fisher's exact test identifies enriched GO terms among differentially expressed genes.

KEGG Pathways

Kyoto Encyclopedia of Genes and Genomes provides manually curated pathway maps representing molecular interaction networks for metabolism, signaling, disease, and other processes. Pathway topology-based methods (SPIA, clipper) incorporate network structure.

Reactome

Open-source, peer-reviewed pathway database with detailed molecular-level representations of biological pathways and reactions. Supports pathway enrichment analysis and visualization through its web interface and R packages.

6. Visualization & Exploratory Analysis

Effective visualization of RNA-Seq results is essential for quality assessment, hypothesis generation, and communication of findings. Multiple complementary visualizations address different aspects of the data.

Essential RNA-Seq Visualizations

Plot TypePurposeKey Information
PCA PlotSample-level exploratory analysisReveals sample groupings, outliers, and batch effects; PC1 often captures the primary biological variable
Volcano PlotDifferential expression overviewx-axis: log2FC; y-axis: -log10(padj); identifies genes with large and significant changes
MA PlotExpression vs. fold changex-axis: mean expression; y-axis: log2FC; reveals expression-dependent trends; DESeq2 provides shrunken LFC version
HeatmapGene expression patternsClustered rows (genes) and columns (samples); uses variance-stabilized or rlog-transformed counts; reveals co-expression modules
Dispersion PlotModel diagnosticsGene-wise dispersion vs. mean expression; final (shrunken) estimates should follow the fitted trend

Principal Component Analysis (PCA)

PCA reduces the high-dimensional gene expression matrix to a low-dimensional representation that captures the maximum variance in the data. For RNA-Seq data, PCA is typically performed on variance-stabilized counts (vst or rlog transformation in DESeq2) to avoid dominance by highly expressed genes. The resulting PCA plot, typically showing PC1 vs. PC2, reveals the main sources of variation in the experiment. Ideally, the primary biological variable (e.g., treatment vs. control) drives the separation along PC1, while secondary sources of variation (e.g., batch, sex) are captured by later components.

Chapter Summary

  • โ—RNA-Seq library preparation involves RNA selection, fragmentation, reverse transcription, adapter ligation, and amplification, each step introducing potential biases.
  • โ—Splice-aware aligners (STAR, HISAT2) and pseudoaligners (Salmon, Kallisto) offer complementary approaches for mapping reads to the transcriptome.
  • โ—DESeq2 models count data with the negative binomial distribution and uses empirical Bayes shrinkage for robust dispersion estimation.
  • โ—The Wald test evaluates whether log2 fold changes differ significantly from zero; the LRT is preferred for complex multi-factor designs.
  • โ—GSEA and pathway analysis provide functional interpretation by testing for coordinated changes in gene sets rather than individual genes.
  • โ—PCA, volcano plots, MA plots, and heatmaps are essential complementary visualizations for RNA-Seq analysis.