Part I · Chapter 2

Probability & Statistics for ML

Probability theory formalises uncertainty. Machine learning is fundamentally about inference under uncertainty: given data, what can we infer about the underlying process? This chapter derives the core machinery from first principles.

1. Probability Axioms

A probability space \((\Omega, \mathcal{F}, P)\) consists of a sample space \(\Omega\), a \(\sigma\)-algebra \(\mathcal{F}\) of events, and a probability measure \(P\) satisfying Kolmogorov's three axioms:

\[ \text{(K1)}\quad P(A) \geq 0 \quad \forall A \in \mathcal{F} \]

\[ \text{(K2)}\quad P(\Omega) = 1 \]

\[ \text{(K3)}\quad P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{for pairwise disjoint } A_i \]

From these axioms we derive the complement rule \(P(A^c) = 1 - P(A)\), monotonicity, and the inclusion-exclusion principle \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).
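
These derived rules can be checked exactly on a small finite sample space. The sketch below assumes a fair six-sided die with a uniform measure; the events `A` and `B` and the helper `prob` are illustrative choices, not part of the chapter.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an illustrative choice).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Uniform probability measure on OMEGA: |event| / |OMEGA|."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})   # "the roll is even"
B = frozenset({4, 5, 6})   # "the roll is at least 4"

# K1 (non-negativity) and K2 (normalisation) hold for this measure.
axioms_ok = all(prob(frozenset({w})) >= 0 for w in OMEGA) and prob(OMEGA) == 1

# Rules derived from the axioms.
complement_ok = prob(OMEGA - A) == 1 - prob(A)
incl_excl_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)
monotone_ok = prob(A & B) <= prob(A)      # A ∩ B ⊆ A implies P(A ∩ B) ≤ P(A)

print(prob(A | B))  # 2/3
```

Using `Fraction` keeps every probability exact, so the equalities are tested without floating-point tolerance.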

2. Bayes' Theorem — Full Derivation

The joint probability of events \(A\) and \(B\) can be factored in two ways using the definition of conditional probability \(P(A|B) = P(A \cap B)/P(B)\):

\[ P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A) \]

Rearranging immediately gives Bayes' theorem:

\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \]

In the ML context we write \(A = \theta\) (parameters) and \(B = \mathcal{D}\) (observed data):

\[ \underbrace{P(\theta | \mathcal{D})}_{\text{posterior}} = \frac{\overbrace{P(\mathcal{D}|\theta)}^{\text{likelihood}}\; \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(\mathcal{D})}_{\text{evidence}}} \]

The denominator (marginal likelihood) follows from the law of total probability:

\[ P(\mathcal{D}) = \int P(\mathcal{D}|\theta)\,P(\theta)\,d\theta \]

Bayesian inference updates a prior belief through the likelihood of observed data to form a posterior.
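
As a minimal sketch of this update, assume a coin whose bias \(\theta\) can take only three candidate values, so the evidence integral becomes a sum. The candidate values, the uniform prior, and the observed counts are all hypothetical.

```python
from fractions import Fraction

# Hypothetical setup: the coin bias theta is one of three candidate values.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in thetas}   # uniform prior P(theta)

heads, tails = 3, 1                            # observed data D (assumed)

def likelihood(theta):
    """P(D | theta) for one specific flip sequence with these counts."""
    return theta ** heads * (1 - theta) ** tails

# Evidence P(D) via the law of total probability (a sum, since theta is discrete).
evidence = sum(likelihood(t) * prior[t] for t in thetas)

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = {t: likelihood(t) * prior[t] / evidence for t in thetas}

for t in thetas:
    print(f"theta={t}: prior={prior[t]}, posterior={posterior[t]}")
```

The posterior mass moves toward \(\theta = 3/4\), the candidate that best explains three heads in four flips, while still summing to one.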

3. Common Distributions

Gaussian (Normal)

The Gaussian is the maximum-entropy distribution among all continuous distributions on \(\mathbb{R}\) with a given mean and variance, which is one way to derive its PDF. For \(X \sim \mathcal{N}(\mu, \sigma^2)\):

\[ p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
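
A quick numerical sanity check of this density, assuming illustrative values \(\mu = 3\) and \(\sigma^2 = 2.25\): a Riemann sum on a fine grid should recover total mass 1, mean \(\mu\), and variance \(\sigma^2\). The function name `gaussian_pdf` is a hypothetical helper.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x (scalar or array)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

mu, sigma2 = 3.0, 2.25            # illustrative parameters
sigma = np.sqrt(sigma2)

# Grid covering mu ± 10 sigma; the tails beyond that are negligible.
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
dx = x[1] - x[0]
p = gaussian_pdf(x, mu, sigma2)

total_mass = np.sum(p) * dx                      # should be close to 1
mean_num = np.sum(x * p) * dx                    # should be close to mu
var_num = np.sum((x - mean_num) ** 2 * p) * dx   # should be close to sigma2

print(total_mass, mean_num, var_num)
```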

Bernoulli

Models a binary outcome \(X \in \{0,1\}\) with success probability \(\phi\):

\[ P(X = x) = \phi^x (1-\phi)^{1-x}, \quad \mathbb{E}[X] = \phi, \quad \mathrm{Var}(X) = \phi(1-\phi) \]
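
The mean and variance can be confirmed by summing over the two outcomes directly. This sketch assumes \(\phi = 3/10\) purely for illustration.

```python
from fractions import Fraction

def bernoulli_pmf(x, phi):
    """P(X = x) = phi^x (1 - phi)^(1 - x) for x in {0, 1}."""
    return phi ** x * (1 - phi) ** (1 - x)

phi = Fraction(3, 10)   # illustrative success probability

# Expectation and variance by direct enumeration of the support {0, 1}.
mean = sum(x * bernoulli_pmf(x, phi) for x in (0, 1))
var = sum((x - mean) ** 2 * bernoulli_pmf(x, phi) for x in (0, 1))

print(mean, var)  # 3/10 21/100
```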

Categorical & Poisson

Categorical generalises Bernoulli to \(K\) classes: \(P(X = k) = \phi_k\) with \(\sum_k \phi_k = 1\). The Poisson distribution models count data:

\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad \mathbb{E}[X] = \lambda, \quad \mathrm{Var}(X) = \lambda \]
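
The equality of mean and variance for the Poisson can be checked by truncating the infinite sums. The sketch assumes \(\lambda = 4\) and truncates at \(k = 59\), beyond which the terms are negligible for this rate.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^{-lam} / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 4.0             # illustrative rate
ks = range(60)        # truncation point; the tail mass is negligible for lam = 4

total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(total, mean, var)
```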

Companion Video Walkthroughs

Six short tutorials covering the discrete and continuous distributions used throughout this course — each walks through the PMF/PDF, mean and variance, and worked examples. Recommended if you want a slower, intuition-first treatment alongside the derivations above.

Series: Applied Statistics · Probability for ML

Bernoulli & Binomial Variables
Binomial Distribution
Geometric Distribution
Poisson Distribution
Normal Distribution
Standardising the Normal Distribution

4. Expectation, Variance, and Covariance

The expectation is a linear functional: \(\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\). Variance quantifies spread:

\[ \mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
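
The second equality follows by expanding the square and applying linearity of expectation. Writing \(\mu = \mathbb{E}[X]\) as shorthand:

\[ \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - 2\mu\,\mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - \mu^2 \]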

The covariance matrix of a random vector \(\mathbf{x} \in \mathbb{R}^d\) encodes pairwise linear dependencies:

\[ \boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{x}) = \mathbb{E}\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^\top\right], \quad \boldsymbol{\mu} = \mathbb{E}[\mathbf{x}] \]

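\(\boldsymbol{\Sigma}\) is always symmetric and positive semi-definite. When \(\boldsymbol{\Sigma}\) is positive definite (and therefore invertible), the multivariate Gaussian \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) has PDF \(p(\mathbf{x}) \propto \exp(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}))\), with normalising constant \((2\pi)^{-d/2}\,|\boldsymbol{\Sigma}|^{-1/2}\).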
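
A small sketch of the empirical counterpart: estimate \(\boldsymbol{\Sigma}\) from samples and confirm that the estimate is symmetric with non-negative eigenvalues. The mixing matrix `A` and the sample size are arbitrary assumptions used only to generate correlated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: x = A z with z ~ N(0, I), so Cov(x) = A A^T.
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
x = rng.normal(size=(5000, 2)) @ A.T

# Empirical version of E[(x - mu)(x - mu)^T], dividing by n.
mu_hat = x.mean(axis=0)
centered = x - mu_hat
sigma_hat = centered.T @ centered / len(x)

# eigvalsh is intended for symmetric matrices and returns real eigenvalues.
eigvals = np.linalg.eigvalsh(sigma_hat)

print(sigma_hat)
print("eigenvalues:", eigvals)
```

The manual estimate should agree with `np.cov(x, rowvar=False, bias=True)`, which uses the same divide-by-\(n\) convention.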

5. Maximum Likelihood Estimation (MLE)

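Given i.i.d. samples \(\{x_1, \ldots, x_n\}\) from \(p(x;\theta)\), the MLE maximises the likelihood \(\prod_{i=1}^{n} p(x_i;\theta)\). Because the logarithm is strictly increasing, maximising the log-likelihood gives the same \(\hat{\theta}\), and a sum of logs is easier to differentiate and numerically more stable than a product of many small numbers: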

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i; \theta) \]
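
To make the arg max concrete, the sketch below evaluates the log-likelihood of a Bernoulli model on a grid of candidate parameters and compares the maximiser with the closed-form answer (the sample mean). The data and the grid resolution are made up for illustration.

```python
import math

# Hypothetical binary observations: 6 successes out of 8 trials.
data = [1, 0, 1, 1, 0, 1, 1, 1]

def log_likelihood(phi, xs):
    """Sum of log p(x_i; phi) under a Bernoulli(phi) model."""
    return sum(x * math.log(phi) + (1 - x) * math.log(1 - phi) for x in xs)

# Candidate parameters strictly inside (0, 1) so both logs are defined.
grid = [i / 1000 for i in range(1, 1000)]

# Grid-search MLE: the candidate with the largest log-likelihood.
phi_grid = max(grid, key=lambda phi: log_likelihood(phi, data))

# Closed-form Bernoulli MLE: the sample mean.
phi_closed_form = sum(data) / len(data)

print(phi_grid, phi_closed_form)
```

Grid search is only practical in one or two dimensions; it is used here because it makes the definition visible without any calculus.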

Derivation for the Gaussian

The log-likelihood for \(\mathcal{N}(\mu, \sigma^2)\) is:

\[ \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \]

Setting \(\partial \ell / \partial \mu = 0\):

\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \quad \Longrightarrow \quad \hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x} \]

Setting \(\partial \ell / \partial \sigma^2 = 0\) (letting \(v = \sigma^2\)) and then substituting \(\mu = \hat{\mu}_{\mathrm{MLE}} = \bar{x}\):

\[ \frac{\partial \ell}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_i(x_i-\mu)^2 = 0 \quad \Longrightarrow \quad \hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2 \]

Note: \(\hat{\sigma}^2_{\mathrm{MLE}}\) is biased, with \(\mathbb{E}[\hat{\sigma}^2_{\mathrm{MLE}}] = \frac{n-1}{n}\sigma^2\), because it divides by \(n\) rather than \(n-1\). The unbiased estimator uses Bessel's correction: \(s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2\).
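
The bias is easy to see in simulation. This sketch draws many small samples (\(n = 5\)) from a Gaussian with an assumed true variance of 2.25 and averages both estimators over the trials; the seed, sample size, and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2_true = 2.25          # assumed true variance
n, trials = 5, 100_000      # small n makes the bias clearly visible

# Each row is one independent sample of size n.
samples = rng.normal(loc=3.0, scale=np.sqrt(sigma2_true), size=(trials, n))

var_mle = samples.var(axis=1, ddof=0)      # divides by n
var_bessel = samples.var(axis=1, ddof=1)   # divides by n - 1

mean_var_mle = var_mle.mean()
mean_var_bessel = var_bessel.mean()
expected_mle = (n - 1) / n * sigma2_true   # theoretical mean of the MLE

print(f"average MLE variance:    {mean_var_mle:.3f} (theory {expected_mle:.3f})")
print(f"average Bessel variance: {mean_var_bessel:.3f} (true {sigma2_true:.3f})")
```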

6. MAP Estimation & Conjugate Priors

MAP incorporates a prior \(P(\theta)\) and maximises the log-posterior:

\[ \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \Big[\log P(\mathcal{D}|\theta) + \log P(\theta)\Big] \]

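For a Gaussian likelihood with known variance \(\sigma^2\) and a Gaussian prior \(\mu \sim \mathcal{N}(\mu_0, \tau^2)\), the posterior over \(\mu\) is also Gaussian (conjugacy). Setting the gradient of the log-posterior to zero yields a precision-weighted average of the prior mean and the sample mean: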

\[ \hat{\mu}_{\mathrm{MAP}} = \frac{\frac{\mu_0}{\tau^2} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}} \]
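
A sketch of the intermediate steps, assuming the likelihood variance \(\sigma^2\) is known and collecting all terms that do not depend on \(\mu\) into a constant \(C\):

\[ \log P(\mu | \mathcal{D}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 - \frac{1}{2\tau^2}(\mu - \mu_0)^2 + C \]

\[ \frac{\partial}{\partial \mu}\log P(\mu | \mathcal{D}) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) - \frac{1}{\tau^2}(\mu - \mu_0) = 0 \quad \Longrightarrow \quad \mu\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right) = \frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\tau^2} \]

Dividing both sides by \(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\) gives \(\hat{\mu}_{\mathrm{MAP}}\).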

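As \(n \to \infty\), the MAP estimate converges to the MLE (\(\bar{x}\)). With small \(n\) the prior dominates: a strong prior (small \(\tau^2\)) shrinks the estimate toward \(\mu_0\). In linear regression, MAP estimation with a zero-mean isotropic Gaussian prior on the weights is equivalent to L2 (Ridge) regularisation with penalty strength \(\lambda = \sigma^2/\tau^2\).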
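
One common reading of the L2 connection is linear regression with Gaussian noise of variance \(\sigma^2\) and a zero-mean Gaussian prior of variance \(\tau^2\) on each weight. Under those assumptions, the sketch below checks that the Ridge closed form with \(\lambda = \sigma^2/\tau^2\) is a stationary point of the negative log-posterior. The synthetic data, dimensions, and variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linear-regression data (all values are illustrative).
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
sigma2 = 0.25    # noise variance of the Gaussian likelihood
tau2 = 2.0       # variance of the zero-mean Gaussian prior on each weight
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge closed form with penalty lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior_grad(w):
    """Gradient of ||y - Xw||^2 / (2 sigma2) + ||w||^2 / (2 tau2)."""
    return -(X.T @ (y - X @ w)) / sigma2 + w / tau2

# The MAP estimate is where this gradient vanishes.
grad_at_ridge = neg_log_posterior_grad(w_ridge)

# Ordinary least squares (the MLE) for comparison: no shrinkage.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("ridge / MAP weights:", w_ridge)
print("OLS / MLE weights:  ", w_ols)
print("gradient norm at ridge solution:", np.linalg.norm(grad_at_ridge))
```

A larger \(\tau^2\) (weaker prior) gives a smaller \(\lambda\), so the Ridge solution moves toward the least-squares solution.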

Python: MLE vs MAP for Gaussian Estimation

We simulate how the MLE converges to the true parameter as \(n\) grows, contrast it with MAP under strong and weak priors, and visualise how the posterior sharpens with more data.


import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
TRUE_MU = 3.0
TRUE_SIGMA = 1.5

# ── MLE: closed-form for Gaussian ──
def mle_gaussian(data):
    mu_mle = np.mean(data)
    sigma2_mle = np.mean((data - mu_mle) ** 2)
    return mu_mle, np.sqrt(sigma2_mle)

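# ── MAP for the mean under a conjugate Normal prior N(mu_0, tau2) ──
# The likelihood variance is treated as known (TRUE_SIGMA ** 2).
def map_gaussian_mu(data, mu_0, tau2):
    n = len(data)
    sigma2 = TRUE_SIGMA ** 2
    # Precision-weighted average of the prior mean and the sample mean
    mu_map = (mu_0 / tau2 + n * np.mean(data) / sigma2) / (1.0 / tau2 + n / sigma2)
    return mu_map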

ns = [3, 5, 10, 20, 50, 100, 300]
mu_mle_vals, sigma_mle_vals, mu_map_strong, mu_map_weak = [], [], [], []

for n in ns:
    data = rng.normal(TRUE_MU, TRUE_SIGMA, n)
    mu_mle, sigma_mle = mle_gaussian(data)
    mu_mle_vals.append(mu_mle)
    sigma_mle_vals.append(sigma_mle)
    mu_map_strong.append(map_gaussian_mu(data, mu_0=0.0, tau2=0.5))
    mu_map_weak.append(map_gaussian_mu(data, mu_0=0.0, tau2=10.0))

fig, axes = plt.subplots(1, 3, figsize=(15, 5), facecolor='#0f0f1a')

# Plot 1: MLE mu estimate vs n
ax = axes[0]
ax.set_facecolor('#0f0f1a')
ax.semilogx(ns, mu_mle_vals, 'o-', color='#a78bfa', linewidth=2, markersize=6, label='MLE estimate')
ax.axhline(TRUE_MU, color='#34d399', linestyle='--', linewidth=1.5, label='True mu=3.0')
ax.fill_between(ns, TRUE_MU - TRUE_SIGMA, TRUE_MU + TRUE_SIGMA, alpha=0.1, color='#34d399')
ax.set_xlabel('Sample size n', color='#e2e8f0')
ax.set_ylabel('Estimated mu', color='#e2e8f0')
ax.set_title('MLE of Mean vs Sample Size', color='#c4b5fd', fontsize=11, fontweight='bold')
ax.legend(facecolor='#1e1b4b', labelcolor='#e2e8f0', edgecolor='#4c1d95', fontsize=8)
ax.tick_params(colors='#94a3b8')
for spine in ax.spines.values():
    spine.set_edgecolor('#334155')

# Plot 2: MLE vs MAP (strong vs weak prior)
ax = axes[1]
ax.set_facecolor('#0f0f1a')
ax.semilogx(ns, mu_mle_vals, 'o-', color='#a78bfa', linewidth=2, markersize=5, label='MLE')
ax.semilogx(ns, mu_map_strong, 's--', color='#f472b6', linewidth=2, markersize=5, label='MAP (strong prior tau^2=0.5)')
ax.semilogx(ns, mu_map_weak, '^:', color='#fb923c', linewidth=2, markersize=5, label='MAP (weak prior tau^2=10)')
ax.axhline(TRUE_MU, color='#34d399', linestyle='--', linewidth=1.5, label='True mu')
ax.axhline(0.0, color='#64748b', linestyle=':', linewidth=1, label='Prior mean')
ax.set_xlabel('Sample size n', color='#e2e8f0')
ax.set_ylabel('Estimated mu', color='#e2e8f0')
ax.set_title('MLE vs MAP: Prior Effect', color='#c4b5fd', fontsize=11, fontweight='bold')
ax.legend(facecolor='#1e1b4b', labelcolor='#e2e8f0', edgecolor='#4c1d95', fontsize=7)
ax.tick_params(colors='#94a3b8')
for spine in ax.spines.values():
    spine.set_edgecolor('#334155')

# Plot 3: Posterior distribution over mu for n=5, n=20, and n=100
ax = axes[2]
ax.set_facecolor('#0f0f1a')
mu_range = np.linspace(-1, 6, 400)
TAU2 = 1.0

for n, c, ls in [(5, '#f472b6', '-'), (20, '#fb923c', '--'), (100, '#a78bfa', ':')]:
    data_n = rng.normal(TRUE_MU, TRUE_SIGMA, n)
    sigma2_known = TRUE_SIGMA ** 2
    prec_prior = 1.0 / TAU2
    prec_likelihood = n / sigma2_known
    prec_post = prec_prior + prec_likelihood
    mu_post = (0.0 * prec_prior + np.mean(data_n) * prec_likelihood) / prec_post
    sigma2_post = 1.0 / prec_post
    posterior = np.exp(-0.5 * (mu_range - mu_post) ** 2 / sigma2_post) / np.sqrt(2 * np.pi * sigma2_post)
    ax.plot(mu_range, posterior, color=c, linewidth=2, linestyle=ls, label=f'n={n}, post mean={mu_post:.2f}')

prior = np.exp(-0.5 * mu_range ** 2 / TAU2) / np.sqrt(2 * np.pi * TAU2)
ax.plot(mu_range, prior, color='#64748b', linewidth=1.5, linestyle='-.', label='Prior N(0,1)')
ax.axvline(TRUE_MU, color='#34d399', linestyle='--', linewidth=1.5, label='True mu')
ax.set_xlabel('mu', color='#e2e8f0')
ax.set_ylabel('Posterior density', color='#e2e8f0')
ax.set_title('Posterior Sharpens with More Data', color='#c4b5fd', fontsize=11, fontweight='bold')
ax.legend(facecolor='#1e1b4b', labelcolor='#e2e8f0', edgecolor='#4c1d95', fontsize=7)
ax.tick_params(colors='#94a3b8')
for spine in ax.spines.values():
    spine.set_edgecolor('#334155')

plt.suptitle('MLE vs MAP for Gaussian Parameter Estimation', color='#ddd6fe', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('output.png', dpi=130, bbox_inches='tight', facecolor='#0f0f1a')
print('Saved output.png')
print(f'MLE with n=300: mu={mu_mle_vals[-1]:.4f} (true {TRUE_MU})')
print(f'MAP strong prior n={ns[0]}: mu={mu_map_strong[0]:.4f}')
print(f'MAP strong prior n={ns[-1]}: mu={mu_map_strong[-1]:.4f}')


Key Takeaways

✓ Kolmogorov axioms underpin all probability; Bayes' theorem follows directly from the definition of conditional probability.
✓ MLE for Gaussian yields the sample mean and (biased) sample variance in closed form via log-likelihood differentiation.
✓ MAP with a Gaussian prior produces a precision-weighted combination of prior and data; in linear regression the same prior on the weights corresponds to L2 regularisation.
✓ Conjugate priors make the posterior analytically tractable; a Gaussian prior on the mean is conjugate to a Gaussian likelihood with known variance.
✓ With enough data, MLE and MAP converge to the same estimate: the prior's influence diminishes as \(n \to \infty\).
