Part II: Transcriptomics | Chapter 8

Single-Cell & Spatial Transcriptomics

Resolving transcriptomic heterogeneity at single-cell resolution and mapping gene expression in spatial context

1. The Need for Single-Cell Resolution

Bulk RNA-Seq measures the average gene expression across millions of cells, obscuring the substantial heterogeneity that exists within seemingly homogeneous populations. This averaging effect is particularly problematic in tissues composed of multiple cell types (e.g., the brain contains hundreds of neuronal subtypes, glia, and vascular cells), in tumors exhibiting intratumoral heterogeneity, and in dynamic processes such as cellular differentiation where individual cells occupy different states along a continuum.

Simpson's Paradox in Bulk RNA-Seq

Bulk measurements can produce misleading results when cell composition changes between conditions. For example, if a treatment causes depletion of cell type A (which highly expresses gene X), bulk RNA-Seq would report gene X as downregulated, even if its expression within each cell type is unchanged. This compositional confounding is a form of Simpson's paradox that can only be resolved at single-cell resolution. Furthermore, bimodal gene expression distributions (where a gene is either fully ON or fully OFF in individual cells) appear as intermediate expression in bulk assays, masking the discrete, switch-like regulatory mechanisms governing many biological decisions.

Single-cell RNA sequencing (scRNA-seq) addresses these limitations by profiling the transcriptome of individual cells. Since its first application by Tang et al. in 2009, the technology has undergone remarkable scaling: early studies profiled dozens of cells, while modern high-throughput platforms routinely capture tens of thousands to millions of cells per experiment. This scaling has been driven by innovations in microfluidics, combinatorial barcoding, and droplet-based cell partitioning.

Key Applications of scRNA-seq

Cell Atlas Projects: The Human Cell Atlas aims to create comprehensive reference maps of all human cell types and states using scRNA-seq data.

Tumor Heterogeneity: Identifying drug-resistant subclones, characterizing the tumor microenvironment, and predicting therapy response.

Developmental Biology: Mapping cell fate decisions, lineage hierarchies, and signaling gradients during embryogenesis.

Immunology: Characterizing immune cell diversity, clonal expansion of T cells, and immune responses to infection or vaccination.

2. Single-Cell RNA-Seq Technologies

Droplet-Based Methods

Droplet-based platforms encapsulate individual cells in nanoliter-scale aqueous droplets along with barcoded gel beads in an oil emulsion. Each gel bead carries millions of copies of an oligonucleotide containing: (1) a PCR handle for downstream amplification, (2) a cell barcode (16 nt in 10x Genomics) that uniquely identifies the cell, (3) a unique molecular identifier (UMI, 12 nt) for PCR duplicate identification, and (4) a poly(dT) sequence for mRNA capture.

Major scRNA-seq Platforms

PlatformTypeThroughputCoverageSensitivity
10x ChromiumDroplet500-10,000 cells3' or 5' end~1,500-3,000 genes/cell
Drop-seqDroplet1,000-10,000 cells3' end~500-1,500 genes/cell
Smart-seq2Plate96-384 cellsFull-length~4,000-8,000 genes/cell
MARS-seqPlate (FACS)1,000-10,000 cells3' end~1,000-3,000 genes/cell
inDropDroplet1,000-10,000 cells3' end~1,000-2,500 genes/cell

Unique Molecular Identifiers (UMIs)

UMIs are short random nucleotide sequences (typically 8-12 nt) attached to each captured mRNA molecule before PCR amplification. Because each original mRNA molecule receives a distinct UMI, all PCR duplicates of the same molecule share the same UMI + cell barcode + gene identity combination. During analysis, reads sharing the same UMI within the same cell barcode are collapsed to a single count, providing an accurate digital measure of absolute transcript numbers. This eliminates PCR amplification bias, which is particularly severe in scRNA-seq due to the extremely low input amounts (5-20 pg of total RNA per cell).

The probability of a UMI collision (two different molecules receiving the same UMI) can be estimated by the birthday problem approximation: $$P(\text{collision}) \approx 1 - e^{-n^2 / (2 \cdot 4^k)}$$ where \(n\) is the number of captured mRNA molecules per gene per cell (typically 1-100) and \(k\) is the UMI length in nucleotides. For a 12-nt UMI (\(4^{12} = 16{,}777{,}216\) possible sequences) and \(n = 100\) molecules: $$P(\text{collision}) \approx 1 - e^{-100^2 / (2 \times 16{,}777{,}216)} \approx 0.0003$$ This extremely low collision rate ensures that UMI counting accurately reflects true molecular counts.

Doublet Rate Estimation

A major quality concern in droplet-based scRNA-seq is the formation of doublets (or multiplets), where two or more cells are captured in the same droplet and assigned the same cell barcode. Doublets appear as artificial hybrid transcriptomes and can confound clustering and cell type identification.

The doublet rate follows Poisson loading statistics. If \(\lambda\) is the mean number of cells per droplet: $$P(\text{doublet}) = \frac{P(k \geq 2 | k \geq 1)}{P(k \geq 1)} = \frac{1 - P(0) - P(1)}{1 - P(0)}$$ For 10x Genomics: $$\text{Doublet rate} \approx 0.8\% \times \frac{N_{cells}}{1000}$$ At typical loading of 10,000 cells, the expected doublet rate is approximately 8%. Computational tools such as Scrublet and DoubletFinder use simulated doublets to identify and remove putative doublets from the dataset.

3. scRNA-seq Analysis Pipeline

The standard scRNA-seq analysis workflow begins with raw sequencing reads and proceeds through preprocessing, quality filtering, normalization, feature selection, dimensionality reduction, clustering, and biological interpretation. Two major software ecosystems dominate the field: Seurat (R/Bioconductor) and Scanpy (Python/AnnData). Both implement similar algorithmic pipelines but differ in their data structures and syntax.

Preprocessing: Cell Ranger

For 10x Genomics data, Cell Ranger performs demultiplexing, barcode processing, alignment (using STAR), and UMI counting to produce a cell-by-gene count matrix. It distinguishes cell-containing droplets from empty droplets using the EmptyDrops algorithm, which tests whether the RNA profile of each barcode differs significantly from the ambient RNA background (the set of barcodes with very low total UMI counts).

Quality Filtering

Quality control removes low-quality cells and potential artifacts. Three primary metrics are used:

QC MetricLow-Quality SignatureTypical ThresholdBiological Interpretation
nFeature (genes detected)Very low or very high200-5000 genesLow: empty droplet or dying cell; High: doublet
nCount (total UMIs)Very low or very high500-25,000 UMIsLow: poor capture; High: doublet
percent.mt (mitochondrial %)High (>10-20%)<10-20%High: cell lysis (cytoplasmic mRNA lost, mito mRNA retained)

Normalization & Batch Correction

scRNA-seq count data are normalized to remove cell-to-cell differences in sequencing depth. Seurat uses log-normalization: divide by total counts per cell, multiply by a scale factor (default 10,000), and log-transform. scran uses pool-based size factors computed by deconvolution of summed counts from cell pools, which is more robust for zero-inflated data. SCTransform(implemented in Seurat v3+) uses regularized negative binomial regression to directly model the relationship between mean expression and variance, performing normalization, variance stabilization, and feature selection in a single step.

When integrating data from multiple batches, experiments, or technologies, batch correction methods such as Harmony, Scanorama, BBKNN, and Seurat CCA/RPCA integration align shared cell populations across batches while preserving biological differences. Harmony operates in PCA space, iteratively adjusting the principal component embeddings to remove batch effects while retaining biologically meaningful variation.

4. Dimensionality Reduction & Clustering

scRNA-seq datasets typically have thousands of cells and 2,000-5,000 highly variable genes. Direct analysis of this high-dimensional space is computationally expensive and statistically unreliable due to the curse of dimensionality. Dimensionality reduction and graph-based clustering are therefore central to scRNA-seq analysis.

PCA for Initial Reduction

Principal Component Analysis is typically the first dimensionality reduction step, reducing the data from ~2,000 highly variable genes to 10-50 principal components. The elbow plot (variance explained vs. PC number) or the JackStraw procedure (in Seurat) helps determine the optimal number of PCs to retain. These PCs serve as input for both non-linear visualization methods and graph-based clustering.

t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE converts pairwise similarities between data points in high-dimensional space to probabilities, then seeks a low-dimensional (typically 2D) embedding that preserves these relationships. It excels at revealing local cluster structure but does not reliably preserve global relationships (i.e., distances between clusters are not meaningful). The perplexity parameter (typically 30-50) controls the effective number of neighbors and significantly affects the visualization.

UMAP (Uniform Manifold Approximation and Projection)

UMAP has largely replaced t-SNE as the preferred visualization method because it better preserves global structure, is faster, and provides a more consistent embedding across runs. UMAP is based on Riemannian geometry and algebraic topology, constructing a high-dimensional graph representation and then optimizing a low-dimensional embedding to match this topology.

UMAP minimizes the cross-entropy between the high-dimensional and low-dimensional fuzzy topological representations. The cost function is: $$C = \sum_{i \neq j} \left[ v_{ij} \log\left(\frac{v_{ij}}{w_{ij}}\right) + (1 - v_{ij}) \log\left(\frac{1 - v_{ij}}{1 - w_{ij}}\right) \right]$$ where \(v_{ij}\) represents the membership strength of the edge between points \(i\) and \(j\) in the high-dimensional fuzzy simplicial set: $$v_{ij} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$$ with \(\rho_i\) being the distance to the nearest neighbor and \(\sigma_i\) chosen such that each point has approximately \(k\) effective neighbors (controlled by the n_neighbors parameter). In the low-dimensional space: $$w_{ij} = \left(1 + a \|y_i - y_j\|^{2b}\right)^{-1}$$ Key parameters: n_neighbors (typically 15-30) controls local vs. global structure balance; min_dist (typically 0.1-0.5) controls how tightly points are packed.

Graph-Based Clustering: Leiden & Louvain

Both algorithms operate on a k-nearest neighbor (KNN) graph constructed in PC space, where cells are nodes and edges connect cells with similar expression profiles (weighted by Jaccard similarity in the shared nearest neighbor variant). The Louvain algorithmpartitions the graph by iteratively optimizing modularity, a measure of how densely connected nodes within communities are compared to a random graph. The Leiden algorithm (an improvement over Louvain) guarantees well-connected communities by introducing a refinement phase that avoids arbitrarily poorly connected clusters. The resolution parameter controls cluster granularity: higher values yield more clusters.

Cell Type Annotation Strategies

1.
Marker Gene Approach: Identify differentially expressed marker genes for each cluster (Wilcoxon rank-sum test) and match against known cell-type markers from literature or databases (CellMarker, PanglaoDB).
2.
Reference-Based Transfer: Transfer cell type labels from annotated reference datasets using tools like SingleR, scmap, or Seurat label transfer (based on anchor identification in shared PCA space).
3.
Machine Learning Classifiers: Train supervised models (e.g., CellTypist, scANVI) on large reference atlases to automatically classify cells in query datasets with high accuracy and scalability.

5. Trajectory Analysis & RNA Velocity

Biological processes such as differentiation, activation, and cell cycle progression are continuous rather than discrete. Trajectory analysis (pseudotime inference) orders cells along these continuous processes based on their transcriptomic similarity, reconstructing the temporal dynamics of gene expression without requiring time-series experiments.

Pseudotime Methods

Monocle 3 constructs a principal graph (tree or graph structure) through the UMAP embedding using the SimplePPT algorithm and assigns each cell a pseudotime value based on its geodesic distance from a specified root cell. Slingshot fits principal curves through cluster centroids in reduced-dimensional space, handling bifurcating trajectories by identifying lineage-specific curves. PAGA (Partition-based Graph Abstraction, in Scanpy) provides a coarse-grained map of connectivity between cell clusters, which can guide more detailed trajectory inference.

RNA Velocity

RNA velocity, introduced by La Manno et al. (2018), infers the future transcriptional state of each cell by exploiting the ratio of unspliced (nascent) to spliced (mature) mRNA molecules. Standard scRNA-seq captures both spliced and unspliced reads, distinguished by the presence of intronic sequences. The key insight is that for a gene being upregulated, unspliced mRNA accumulates before the corresponding increase in spliced mRNA, and vice versa for downregulation.

The RNA velocity model describes the temporal dynamics of spliced (\(s\)) and unspliced (\(u\)) mRNA for each gene: $$\frac{du}{dt} = \alpha(t) - \beta u$$ $$\frac{ds}{dt} = \beta u - \gamma s$$ where \(\alpha(t)\) is the transcription rate, \(\beta\) is the splicing rate, and \(\gamma\) is the degradation rate. At steady state (\(du/dt = 0, ds/dt = 0\)): $$s_{steady} = \frac{\beta}{\gamma} u_{steady} = \frac{\alpha}{\gamma}$$ The RNA velocity for each gene in each cell is estimated as the deviation from this steady-state relationship: $$v_s = \frac{ds}{dt} = \beta u - \gamma s$$ Positive velocity indicates the gene is being upregulated; negative velocity indicates downregulation. The velocity vector for each cell is projected onto the UMAP or t-SNE embedding to visualize the predicted direction of transcriptional change, producing the characteristic arrow plots.

Shannon Entropy for Cell State Diversity

Shannon entropy can quantify the transcriptomic diversity within a cell population or the differentiation potential of individual cells. Higher entropy indicates a more uniform distribution of gene expression across many genes (characteristic of progenitor/stem cells), while lower entropy indicates specialization (characteristic of differentiated cells).

For a cell \(c\) with expression profile normalized to proportions \(p_i = x_i / \sum_j x_j\): $$H(c) = -\sum_{i=1}^{G} p_i \log_2(p_i)$$ where \(G\) is the number of genes. The maximum entropy \(H_{max} = \log_2(G)\) occurs when all genes are equally expressed. The relative entropy (or specificity) is: $$S(c) = 1 - \frac{H(c)}{H_{max}}$$ Similarly, mutual information between gene expression and cell type can identify genes most informative for distinguishing cell types: $$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x,y)}{p(x)p(y)}$$

6. Spatial Transcriptomics

While scRNA-seq provides single-cell resolution, it requires tissue dissociation, which destroys spatial context. Spatial transcriptomics technologies measure gene expression while preserving the original tissue architecture, enabling the study of cell-cell communication, tissue organization, and spatial gene expression gradients. The field has expanded rapidly, with Nature Methods naming spatially resolved transcriptomics the 2020 Method of the Year.

Spatial Transcriptomics Technologies

10x Visium

Sequencing-based approach where tissue sections are placed on slides containing arrays of barcoded oligonucleotides at ~55 micrometer spot diameter (capturing 1-10 cells per spot). mRNA is captured, reverse-transcribed with spatial barcodes, and sequenced. Provides whole-transcriptome data with near-cellular resolution. Visium HD achieves 2-micrometer resolution through smaller bin sizes.

MERFISH (Multiplexed Error-Robust FISH)

Imaging-based approach using combinatorial FISH with error-robust binary barcoding. Each RNA species is assigned a unique binary barcode across multiple rounds of hybridization and imaging. Achieves subcellular resolution for hundreds to thousands of genes simultaneously. Developed by Xiaowei Zhuang's group and commercialized by Vizgen (MERSCOPE platform).

seqFISH / seqFISH+

Sequential fluorescence in situ hybridization uses multiple rounds of probe hybridization, imaging, and probe removal to assign barcodes to individual RNA molecules. seqFISH+ can profile over 10,000 genes at single-molecule resolution in individual cells within intact tissue sections. Similar in principle to MERFISH but uses a different barcoding strategy.

Slide-seq / Slide-seqV2

Uses a dense array of DNA-barcoded beads (10 micrometer diameter) on a rubber-coated glass surface. Tissue sections are transferred onto the bead array, and mRNA is captured and sequenced with spatial barcode information. Achieves near-cellular resolution across the entire transcriptome.

Deconvolution & Integration

For sequencing-based spatial methods where each spot contains multiple cells, deconvolution methods estimate the cell-type composition of each spot by integrating spatial data with a reference scRNA-seq dataset. Tools such as cell2location, RCTD (Robust Cell Type Decomposition), SPOTlight, and Tangram use different statistical frameworks (Bayesian models, NMF, optimal transport) to map single-cell reference profiles onto spatial spots. These methods effectively achieve virtual single-cell resolution from multi-cell spatial measurements.

Multi-Omics Integration

Emerging technologies combine single-cell transcriptomics with other modalities: CITE-seq simultaneously measures mRNA and surface protein (ADTs) via oligonucleotide-conjugated antibodies; SHARE-seqand 10x Multiome jointly profile chromatin accessibility (ATAC-seq) and gene expression; scBS-seq combines single-cell bisulfite sequencing with transcriptomics. Computational integration methods like WNN (Weighted Nearest Neighbor) in Seurat v4 and MOFA+ identify shared and modality-specific sources of variation, providing a comprehensive view of cellular identity and regulation.

7. Key Concepts & Chapter Summary

Comparing scRNA-seq Approaches

PropertyDroplet-BasedPlate-BasedSpatial
Cell throughputHigh (10^3-10^6)Low (10^2-10^3)Medium-High
Sensitivity (genes/cell)1,500-3,0004,000-8,000Variable (100-10,000)
Transcript coverage3' or 5' endFull-lengthTargeted or whole
Spatial informationLostLostPreserved
Cost per cell$0.01-0.10$1-10Variable

Chapter Summary

  • โ—Single-cell RNA-seq resolves transcriptomic heterogeneity invisible to bulk methods, revealing rare cell types, continuous transitions, and cell-to-cell variability.
  • โ—Droplet-based methods (10x Chromium) offer high throughput with 3'/5' end coverage, while plate-based methods (Smart-seq2) provide full-length transcripts at lower throughput.
  • โ—UMIs eliminate PCR amplification bias by enabling digital counting of original mRNA molecules; doublet detection is essential for data quality.
  • โ—The standard analysis pipeline encompasses QC filtering, normalization, PCA, UMAP visualization, graph-based clustering (Leiden/Louvain), and cell type annotation.
  • โ—RNA velocity leverages the ratio of unspliced to spliced mRNA to predict future cell states, adding temporal directionality to snapshot data.
  • โ—Spatial transcriptomics (Visium, MERFISH, seqFISH) preserves tissue context; deconvolution methods integrate scRNA-seq references to achieve virtual single-cell spatial resolution.
  • โ—Multi-omics integration (CITE-seq, Multiome, WNN) combines transcriptomics with protein, chromatin, and epigenomic measurements for comprehensive cellular profiling.