Part I ยท Chapter 2

Probability & Statistics for ML

Probability theory formalises uncertainty. Machine learning is fundamentally about inference under uncertainty: given data, what can we infer about the underlying process? This chapter derives the core machinery from first principles.

1. Probability Axioms

A probability space \((\Omega, \mathcal{F}, P)\) consists of a sample space \(\Omega\), a \(\sigma\)-algebra \(\mathcal{F}\) of events, and a probability measure \(P\) satisfying Kolmogorov's three axioms:

\[ \text{(K1)}\quad P(A) \geq 0 \quad \forall A \in \mathcal{F} \]
\[ \text{(K2)}\quad P(\Omega) = 1 \]
\[ \text{(K3)}\quad P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for pairwise disjoint } A_i \]

From these axioms we derive the complement rule \(P(A^c) = 1 - P(A)\), monotonicity, and the inclusion-exclusion principle \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).

2. Bayes' Theorem โ€” Full Derivation

The joint probability of events \(A\) and \(B\) can be factored in two ways using the definition of conditional probability \(P(A|B) = P(A \cap B)/P(B)\):

\[ P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A) \]

Rearranging immediately gives Bayes' theorem:

\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \]

In the ML context we write \(A = \theta\) (parameters) and \(B = \mathcal{D}\) (observed data):

\[ \underbrace{P(\theta | \mathcal{D})}_{\text{posterior}} = \frac{\overbrace{P(\mathcal{D}|\theta)}^{\text{likelihood}}\; \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(\mathcal{D})}_{\text{evidence}}} \]

The denominator (marginal likelihood) follows from the law of total probability:

\[ P(\mathcal{D}) = \int P(\mathcal{D}|\theta)\,P(\theta)\,d\theta \]
PriorP(ฮธ)Belief before dataLikelihoodP(D|ฮธ)How well ฮธ explainsthe observed data DBayesupdatePosteriorP(ฮธ|D) โˆ P(D|ฮธ)P(ฮธ)Updated belief afterobserving data D

Bayesian inference updates a prior belief through the likelihood of observed data to form a posterior.

3. Common Distributions

Gaussian (Normal)

The Gaussian PDF is derived by requiring maximum entropy among all distributions with fixed mean and variance. For \(X \sim \mathcal{N}(\mu, \sigma^2)\):

\[ p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

Bernoulli

Models a binary outcome \(X \in \{0,1\}\) with success probability \(\phi\):

\[ P(X = x) = \phi^x (1-\phi)^{1-x}, \quad \mathbb{E}[X] = \phi, \quad \mathrm{Var}(X) = \phi(1-\phi) \]

Categorical & Poisson

Categorical generalises Bernoulli to \(K\) classes: \(P(X = k) = \phi_k\) with \(\sum_k \phi_k = 1\). The Poisson distribution models count data:

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad \mathbb{E}[X] = \lambda, \quad \mathrm{Var}(X) = \lambda \]

4. Expectation, Variance, and Covariance

The expectation is a linear functional: \(\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\). Variance quantifies spread:

\[ \mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]

The covariance matrix of a random vector \(\mathbf{x} \in \mathbb{R}^d\) encodes pairwise linear dependencies:

\[ \boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{x}) = \mathbb{E}\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^\top\right], \quad \boldsymbol{\mu} = \mathbb{E}[\mathbf{x}] \]

\(\boldsymbol{\Sigma}\) is always symmetric and positive semi-definite. The multivariate Gaussian is \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) with PDF \(p(\mathbf{x}) \propto \exp(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}))\).

5. Maximum Likelihood Estimation (MLE)

Given i.i.d. samples \(\{x_1, \ldots, x_n\}\) from \(p(x;\theta)\), the MLE maximises the likelihood (equivalently, the log-likelihood for numerical stability):

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i; \theta) \]

Derivation for the Gaussian

The log-likelihood for \(\mathcal{N}(\mu, \sigma^2)\) is:

\[ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \]

Setting \(\partial \ell / \partial \mu = 0\):

\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \quad \Longrightarrow \quad \hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} \]

Setting \(\partial \ell / \partial \sigma^2 = 0\) (letting \(v = \sigma^2\)):

\[ \frac{\partial \ell}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_i(x_i-\mu)^2 = 0 \quad \Longrightarrow \quad \hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2 \]

Note: \(\hat{\sigma}^2_{\mathrm{MLE}}\) is biased โ€” it divides by \(n\) not \(n-1\). The unbiased estimator uses the Bessel correction: \(s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2\).

6. MAP Estimation & Conjugate Priors

MAP incorporates a prior \(P(\theta)\) and maximises the log-posterior:

\[ \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \Big[\log P(\mathcal{D}|\theta) + \log P(\theta)\Big] \]

For a Gaussian likelihood with Gaussian prior \(\mu \sim \mathcal{N}(\mu_0, \tau^2)\), the posterior is also Gaussian (conjugacy). Setting the gradient to zero yields the precision-weighted average:

\[ \hat{\mu}_{\mathrm{MAP}} = \frac{\frac{\mu_0}{\tau^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}} \]

As \(n \to \infty\), the MAP converges to the MLE (\(\bar{x}\)). With small \(n\) the prior dominates โ€” a strong prior shrinks the estimate toward \(\mu_0\). MAP with a Gaussian prior is exactly equivalent to L2 (Ridge) regularisation.

Python: MLE vs MAP for Gaussian Estimation

We simulate how the MLE converges to the true parameter as \(n\) grows, contrast it with MAP under strong and weak priors, and visualise how the posterior sharpens with more data.

Python
script.py100 lines

Click Run to execute the Python code

Code will be executed with Python 3 on the server

Key Takeaways

  • โœ“ Kolmogorov axioms underpin all probability; Bayes' theorem follows directly from the definition of conditional probability.
  • โœ“ MLE for Gaussian yields the sample mean and (biased) sample variance in closed form via log-likelihood differentiation.
  • โœ“ MAP with a Gaussian prior produces a precision-weighted combination of prior and data, equivalent to L2 regularisation.
  • โœ“ Conjugate priors make the posterior analytically tractable; the Gaussian is its own conjugate for the mean.
  • โœ“ With enough data, both MLE and MAP converge โ€” the prior's influence diminishes as \(n \to \infty\).