ML for Science/Part I: Foundations

Classification & Logistic Regression

From the logistic function to softmax — probabilistic classification for scientific discovery

Introduction

Classification is the task of predicting a discrete label from input features. While linear regression predicts continuous values, many scientific problems are inherently categorical: is this cell cancerous or healthy? Is this signal a gravitational wave or noise? What type of galaxy is this?

Logistic regression provides a principled probabilistic framework for classification. It models the probability of each class and learns by maximizing the likelihood of the observed labels. Despite its name, logistic regression is a classification algorithm, not a regression method.

Key Topics

  • 1. The Logistic (Sigmoid) Function
  • 2. Binary Logistic Regression Model
  • 3. Cross-Entropy Loss Derivation
  • 4. Gradient Descent and Newton's Method (IRLS)
  • 5. Softmax for Multi-Class Classification
  • 6. Decision Boundaries
  • 7. Regularized Logistic Regression
  • 8. Python Simulation
  • 9. Handling Class Imbalance
  • 10. Applications in Science
  • 11. Evaluation Metrics
  • 12. Bayesian Logistic Regression

1. The Logistic (Sigmoid) Function

The sigmoid function maps any real number to the interval $(0, 1)$, making it ideal for modeling probabilities:

$$\boxed{\sigma(z) = \frac{1}{1 + e^{-z}}}$$

Key Properties

  • Range: $\sigma(z) \in (0, 1)$ for all $z \in \mathbb{R}$
  • Symmetry: $\sigma(-z) = 1 - \sigma(z)$
  • Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
  • Limits: $\lim_{z \to -\infty}\sigma(z) = 0$, $\lim_{z \to \infty}\sigma(z) = 1$
  • Center: $\sigma(0) = 0.5$

Derivation of the derivative: Let $\sigma = (1 + e^{-z})^{-1}$. Then:

$$\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \cdot (1 - \sigma(z))$$

This self-referential derivative makes gradient computations remarkably elegant.
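These properties are easy to verify numerically. The sketch below checks the symmetry identity and compares the self-referential derivative against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
s = sigmoid(z)

# Symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - s)

# Derivative: sigma(z)(1 - sigma(z)) vs. a central finite difference
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(s * (1 - s), numeric, atol=1e-8)

# Center: sigma(0) = 0.5
assert sigmoid(0.0) == 0.5
```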

Log-Odds (Logit) Interpretation

The inverse of the sigmoid is the logit function:

$$\text{logit}(p) = \log\frac{p}{1-p} = z$$

This means logistic regression models the log-odds as a linear function of the features: $\log\frac{p(y=1|\mathbf{x})}{p(y=0|\mathbf{x})} = \mathbf{w}^T\mathbf{x} + b$. Increasing a feature by one unit changes the log-odds by its coefficient.
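As a worked example of this reading (with a hypothetical coefficient $w_j = 0.7$): a one-unit increase in the feature multiplies the odds by $e^{0.7} \approx 2.01$, regardless of the baseline log-odds:

```python
import numpy as np

# Hypothetical fitted coefficient for one feature
w_j = 0.7

# Odds before and after a one-unit increase, at several baseline log-odds z
z = np.array([-2.0, 0.0, 3.0])
p0 = 1 / (1 + np.exp(-z))
p1 = 1 / (1 + np.exp(-(z + w_j)))
ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))

# The odds ratio is exp(w_j) ~ 2.01 at every baseline
assert np.allclose(ratio, np.exp(w_j))
```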

2. Binary Logistic Regression Model

For binary classification with labels $y_i \in \{0, 1\}$, the model is:

$$p(y_i = 1 | \mathbf{x}_i; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}_i}}$$

This can be written compactly as:

$$p(y_i | \mathbf{x}_i; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}_i)^{y_i}(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))^{1-y_i}$$

Why Not Linear Regression for Classification?

Using linear regression ($p = \mathbf{w}^T\mathbf{x}$) for classification has critical problems:

  • Predictions can be outside $[0, 1]$, violating probability axioms
  • Minimizing squared error with binary labels is not well-motivated
  • Outliers in one class can shift the decision boundary toward the other class
  • The squared-error objective does not arise from a principled probabilistic model of binary outcomes

3. Cross-Entropy Loss Derivation

Maximum Likelihood Estimation

Assuming i.i.d. data, the log-likelihood is:

$$\ell(\mathbf{w}) = \sum_{i=1}^{n} \left[y_i \log \sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))\right]$$

The cross-entropy loss (negative log-likelihood) to minimize is:

$$\boxed{\mathcal{L}(\mathbf{w}) = -\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]}$$

where $\hat{p}_i = \sigma(\mathbf{w}^T\mathbf{x}_i)$.

Information-Theoretic Interpretation

The cross-entropy between the true distribution $q$ and model distribution $p$ is:

$$H(q, p) = -\sum_x q(x) \log p(x)$$

This can be decomposed as:

$$H(q, p) = H(q) + D_{\text{KL}}(q \| p)$$

Since the entropy $H(q)$ of the true distribution is constant, minimizing cross-entropy is equivalent to minimizing $D_{\text{KL}}(q \| p)$, the KL divergence between the true distribution and the model.

Deriving the Gradient

Let $z_i = \mathbf{w}^T\mathbf{x}_i$. The gradient of the loss with respect to $\mathbf{w}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -\sum_{i=1}^{n}\left[y_i \frac{\sigma'(z_i)}{\sigma(z_i)} - (1-y_i)\frac{\sigma'(z_i)}{1-\sigma(z_i)}\right]\mathbf{x}_i$$

Using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -\sum_{i=1}^{n}\left[y_i(1-\sigma(z_i)) - (1-y_i)\sigma(z_i)\right]\mathbf{x}_i$$

Simplifying:

$$\boxed{\nabla_{\mathbf{w}}\mathcal{L} = \sum_{i=1}^{n}(\hat{p}_i - y_i)\mathbf{x}_i = \mathbf{X}^T(\hat{\mathbf{p}} - \mathbf{y})}$$

This has the same form as the linear regression gradient! The residuals $(\hat{p}_i - y_i)$ are now probability residuals rather than value residuals.
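The boxed gradient can be verified against a finite-difference approximation of the loss; the sketch below does this on random synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w):
    """Cross-entropy (negative log-likelihood)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient: X^T (p_hat - y)
grad = X.T @ (sigmoid(X @ w) - y)

# Central finite differences along each coordinate
eps = 1e-6
num = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(grad, num, atol=1e-4)
```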

4. Gradient Descent for Logistic Regression

The Update Rule

Gradient descent updates the weights iteratively:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w}^{(t)})$$

Substituting the gradient:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \sum_{i=1}^{n}(\sigma(\mathbf{w}^{(t)T}\mathbf{x}_i) - y_i)\mathbf{x}_i$$

where $\eta > 0$ is the learning rate.
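A minimal gradient-descent fit on synthetic data (the gradient is averaged over the batch, which simply rescales $\eta$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic_gd(X, y, eta=0.1, n_iter=2000):
    """Batch gradient descent on the mean cross-entropy loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # averaged gradient
        w -= eta * grad
    return w

# Toy data: the label depends (noisily) on the sign of x1 + x2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

w = fit_logistic_gd(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```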

Convexity of Cross-Entropy

The Hessian of the cross-entropy loss is:

$$\mathbf{H} = \nabla^2\mathcal{L} = \mathbf{X}^T\mathbf{S}\mathbf{X}$$

where $\mathbf{S} = \text{diag}(\hat{p}_1(1-\hat{p}_1), \ldots, \hat{p}_n(1-\hat{p}_n))$ is a diagonal matrix of sigmoid derivatives. Since $\hat{p}_i \in (0,1)$, all diagonal entries of $\mathbf{S}$ are positive, making $\mathbf{H}$ positive semi-definite. The loss is therefore convex, so gradient descent with a suitably small learning rate converges to a global optimum (unique when $\mathbf{X}$ has full column rank or a regularizer is added).

Newton's Method (IRLS)

For faster convergence, we can use Newton's method, which uses second-order information:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\mathbf{X}^T\mathbf{S}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T(\hat{\mathbf{p}}^{(t)} - \mathbf{y})$$

This is equivalent to Iteratively Reweighted Least Squares (IRLS). Rearranging:

$$\mathbf{w}^{(t+1)} = (\mathbf{X}^T\mathbf{S}^{(t)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{S}^{(t)}\mathbf{z}^{(t)}$$

where $\mathbf{z}^{(t)} = \mathbf{X}\mathbf{w}^{(t)} + (\mathbf{S}^{(t)})^{-1}(\mathbf{y} - \hat{\mathbf{p}}^{(t)})$ is the "working response". Each step solves a weighted least-squares problem.
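A sketch of the Newton/IRLS update as written above. The tiny `jitter` term is an implementation convenience (not part of the formula) that keeps the Hessian invertible:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_irls(X, y, n_iter=10, jitter=1e-8):
    """Newton's method / IRLS for logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        s = p * (1 - p)                            # diagonal of S
        H = X.T @ (s[:, None] * X) + jitter * np.eye(X.shape[1])
        w = w - np.linalg.solve(H, X.T @ (p - y))  # Newton step
    return w

# Synthetic data generated from a known weight vector
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
w_true = np.array([0.5, 1.5, -1.0])
y = (rng.random(300) < sigmoid(X @ w_true)).astype(float)

w = fit_irls(X, y)
print(w)  # typically lands near w_true after a handful of iterations
```

Each iteration costs a linear solve, but far fewer iterations are needed than for plain gradient descent.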

5. Softmax for Multi-Class Classification

For $K$ classes, we generalize the sigmoid to the softmax function:

$$\boxed{p(y = k | \mathbf{x}; \mathbf{W}) = \text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}}$$

where $z_k = \mathbf{w}_k^T\mathbf{x}$ and $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_K]$.

Properties of Softmax

  • Valid distribution: $\sum_{k=1}^{K} \text{softmax}(\mathbf{z})_k = 1$ and all entries are positive
  • Reduces to sigmoid: For $K=2$, $\text{softmax}(z_1, z_2)_1 = \sigma(z_1 - z_2)$
  • Translation invariance: $\text{softmax}(\mathbf{z} + c\mathbf{1}) = \text{softmax}(\mathbf{z})$ for any scalar $c$
  • Temperature scaling: $\text{softmax}(\mathbf{z}/T)$ becomes sharper as $T \to 0$ and uniform as $T \to \infty$
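Each of these properties can be checked directly:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift for stability (uses translation invariance)
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
p = softmax(z)

# Valid distribution: positive entries summing to 1
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)

# Translation invariance
assert np.allclose(softmax(z + 10.0), p)

# K = 2 reduces to the sigmoid of the difference
z2 = np.array([1.3, -0.4])
assert np.isclose(softmax(z2)[0], 1 / (1 + np.exp(-(z2[0] - z2[1]))))

# Low temperature sharpens the distribution
assert softmax(z / 0.1).max() > p.max()
```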

Multi-Class Cross-Entropy Loss

Using one-hot encoded labels $\mathbf{y}_i \in \{0,1\}^K$ with $\sum_k y_{ik} = 1$:

$$\mathcal{L}(\mathbf{W}) = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log p(y_i = k | \mathbf{x}_i; \mathbf{W})$$

The gradient with respect to the weight vector for class $k$ is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \sum_{i=1}^{n}(\hat{p}_{ik} - y_{ik})\mathbf{x}_i$$

Again, the gradient has the intuitive form of (prediction - truth) times feature.

Numerical Stability: Log-Sum-Exp Trick

Computing softmax naively causes overflow. The log-sum-exp trick subtracts the maximum:

$$\log \sum_{k=1}^{K} e^{z_k} = m + \log \sum_{k=1}^{K} e^{z_k - m}, \quad m = \max_k z_k$$

This ensures all exponents are non-positive, preventing overflow.
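A stable log-softmax built on this trick; the inputs below would overflow a naive `exp`:

```python
import numpy as np

def log_softmax(z):
    """Stable log-softmax via the log-sum-exp trick."""
    m = np.max(z)
    lse = m + np.log(np.sum(np.exp(z - m)))  # exponents are all <= 0
    return z - lse

z = np.array([1000.0, 1001.0, 999.0])  # naive np.exp(z) overflows to inf
lp = log_softmax(z)
assert np.isfinite(lp).all()
assert np.isclose(np.exp(lp).sum(), 1.0)
```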

6. Decision Boundaries

Linear Decision Boundary

The decision boundary for binary logistic regression is where $p(y=1|\mathbf{x}) = 0.5$, which occurs when:

$$\sigma(\mathbf{w}^T\mathbf{x}) = 0.5 \implies \mathbf{w}^T\mathbf{x} = 0$$

This is a hyperplane in feature space. In 2D, the boundary is a line with normal vector $\mathbf{w}$. Points on the positive side are classified as class 1; points on the negative side as class 0.

If a separate bias term $b$ is used (so the boundary is $\mathbf{w}^T\mathbf{x} + b = 0$), the distance from the origin to the boundary is $|b|/\|\mathbf{w}\|$. The confidence of predictions increases with distance from the boundary.

Multi-Class Decision Regions

For $K$-class softmax, the decision boundary between classes $j$ and $k$ is where:

$$(\mathbf{w}_j - \mathbf{w}_k)^T\mathbf{x} = 0$$

The decision regions are convex polytopes formed by the intersection of half-spaces. Each class region is convex, which limits the expressiveness of linear classifiers.

Nonlinear Boundaries via Feature Engineering

To achieve nonlinear boundaries, we can transform the input features. For example, adding polynomial features $\phi(\mathbf{x}) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$ gives quadratic decision boundaries. The logistic model becomes:

$$p(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\phi(\mathbf{x}))$$

The boundary $\mathbf{w}^T\phi(\mathbf{x}) = 0$ is now a conic section (circle, ellipse, hyperbola) in the original $\mathbf{x}$ space.
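A sketch of this idea on synthetic data: points labeled by whether they fall inside the unit circle are not linearly separable in $\mathbf{x}$, but they are in the quadratic feature space:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def quad_features(X):
    """phi(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# Class 1 = inside the unit circle; a linear boundary in x cannot separate this
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 2))
y = (np.hypot(X[:, 0], X[:, 1]) < 1.0).astype(float)

# Gradient descent in the lifted feature space
Phi = quad_features(X)
w = np.zeros(Phi.shape[1])
for _ in range(3000):
    w -= 0.1 * Phi.T @ (sigmoid(Phi @ w) - y) / len(y)

acc = np.mean((sigmoid(Phi @ w) > 0.5) == y)
print(f"accuracy with quadratic features: {acc:.2f}")
```

The learned boundary $\mathbf{w}^T\phi(\mathbf{x}) = 0$ traces out an approximate circle in the original coordinates.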

7. Regularized Logistic Regression

Just as with linear regression, we can add L2 or L1 penalties to prevent overfitting:

L2-Regularized Logistic Regression

$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = -\sum_{i=1}^{n}\left[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right] + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

The gradient becomes:

$$\nabla\mathcal{L}_{\text{reg}} = \mathbf{X}^T(\hat{\mathbf{p}} - \mathbf{y}) + \lambda\mathbf{w}$$

The Bayesian interpretation: this is MAP estimation with a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \lambda^{-1}\mathbf{I})$. Larger $\lambda$ means a stronger prior belief that weights should be small.
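A quick illustration (synthetic data, hypothetical $\lambda$ values) that the penalized gradient shrinks the fitted weights:

```python
import numpy as np

def reg_grad(w, X, y, lam):
    """Gradient of the L2-regularized loss: X^T (p_hat - y) + lam * w."""
    p = 1 / (1 + np.exp(-(X @ w)))
    return X.T @ (p - y) + lam * w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 1 / (1 + np.exp(-X @ np.array([2.0, -1.0, 0.5])))).astype(float)

def fit(lam, eta=0.1, n_iter=3000):
    w = np.zeros(3)
    for _ in range(n_iter):
        w -= eta * reg_grad(w, X, y, lam) / len(y)
    return w

# Larger lambda -> smaller weight norm (a stronger Gaussian prior)
print(np.linalg.norm(fit(0.0)), np.linalg.norm(fit(10.0)))
```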

8. Python Simulation: Logistic Regression from Scratch

This simulation implements binary and multi-class logistic regression using gradient descent, demonstrating convergence and decision boundary analysis.

Logistic Regression & Softmax Classification

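As a minimal stand-in for the full simulation, the sketch below trains binary and multi-class (softmax) logistic regression by gradient descent on synthetic data, using the gradients derived above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # stability shift
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

# ---- Binary logistic regression via gradient descent ----
n = 300
Xb = rng.normal(size=(n, 2))
yb = (Xb[:, 0] - 0.5 * Xb[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(float)

wb = np.zeros(2)
for _ in range(2000):
    wb -= 0.5 * Xb.T @ (sigmoid(Xb @ wb) - yb) / n  # grad = X^T (p - y)

acc_bin = np.mean((sigmoid(Xb @ wb) > 0.5) == yb)
print(f"binary accuracy: {acc_bin:.3f}")

# ---- Multi-class softmax regression on three Gaussian clusters ----
K = 3
means = np.array([[0, 2], [2, -1], [-2, -1]])
Xm = np.vstack([rng.normal(m, 0.8, size=(100, 2)) for m in means])
ym = np.repeat(np.arange(K), 100)
Y = np.eye(K)[ym]  # one-hot labels

W = np.zeros((2, K))
for _ in range(2000):
    P = softmax(Xm @ W)
    W -= 0.5 * Xm.T @ (P - Y) / len(ym)  # grad per class: X^T (p_k - y_k)

acc_multi = np.mean(softmax(Xm @ W).argmax(axis=1) == ym)
print(f"multi-class accuracy: {acc_multi:.3f}")
```

Both fits use the same (prediction − truth) gradient structure; only the link function and label encoding change.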

9. Handling Class Imbalance

In many scientific applications, classes are highly imbalanced (e.g., rare particle events, anomalous signals). Standard logistic regression will be biased toward the majority class.

Weighted Cross-Entropy

Assign higher weight to the minority class in the loss:

$$\mathcal{L}_{\text{weighted}} = -\sum_{i=1}^{n}\left[w_+ y_i\log\hat{p}_i + w_-(1-y_i)\log(1-\hat{p}_i)\right]$$

Common choices: $w_+ = n/(2n_+)$ and $w_- = n/(2n_-)$, where $n_+$ and $n_-$ are the counts of positive and negative examples.

Focal Loss

Focal loss (Lin et al., 2017) down-weights easy examples and focuses on hard ones:

$$\mathcal{L}_{\text{focal}} = -\sum_{i=1}^{n}\alpha_i(1-\hat{p}_{t,i})^\gamma\log\hat{p}_{t,i}$$

where $\hat{p}_{t,i}$ is the predicted probability for the true class and $\gamma \geq 0$ is the focusing parameter. When $\gamma = 0$, this reduces to standard cross-entropy. With $\gamma = 2$ (typical), a well-classified example with $\hat{p}_t = 0.9$ has its loss reduced by a factor of $(1-0.9)^2 = 0.01$ relative to the standard loss.
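A direct implementation of the focal loss, with checks of the two limiting behaviors just described:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss; p is the predicted P(y=1), y in {0, 1}."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class weighting
    return -np.sum(a_t * (1 - p_t) ** gamma * np.log(p_t + eps))

p = np.array([0.9, 0.3, 0.6])
y = np.array([1, 0, 1])

# gamma = 0 recovers the (alpha-weighted) cross-entropy
ce = -np.sum(np.where(y == 1, 0.25, 0.75) * np.log(np.where(y == 1, p, 1 - p)))
assert np.isclose(focal_loss(p, y, gamma=0.0), ce)

# At gamma = 2, a confident correct prediction (p_t = 0.9) is down-weighted
# by the factor (1 - 0.9)^2 = 0.01 relative to plain cross-entropy
easy = focal_loss(np.array([0.9]), np.array([1]), alpha=1.0, gamma=2.0)
assert np.isclose(easy, 0.01 * -np.log(0.9), rtol=1e-6)
```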

10. Applications in Science

Medical Diagnostics

Classifying medical images (X-rays, histology slides) as normal vs. pathological. The probabilistic output gives clinicians a confidence score for each diagnosis.

Particle Physics

Distinguishing signal events (e.g., Higgs boson decays) from background in collider experiments. Logistic regression provides a baseline before moving to boosted decision trees or neural networks.

Galaxy Morphology

Classifying galaxies as spiral, elliptical, or irregular from survey images. Softmax regression on extracted features provides interpretable multi-class probabilities.

Protein Function Prediction

Predicting protein functional classes from sequence features. Regularized logistic regression handles the high-dimensional, sparse feature space.

11. Evaluation Metrics

Confusion Matrix

For binary classification, the confusion matrix contains four counts:

  • True Positives (TP): Correctly predicted positive
  • True Negatives (TN): Correctly predicted negative
  • False Positives (FP): Incorrectly predicted positive (Type I error)
  • False Negatives (FN): Incorrectly predicted negative (Type II error)

From these we derive key metrics:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$
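These definitions translate directly into code; the small worked example below has TP = 3, FP = 1, FN = 1:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 from confusion-matrix counts."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])
prec, rec, f1 = binary_metrics(y_true, y_pred)
# TP = 3, FP = 1, FN = 1, so precision = recall = F1 = 0.75
```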

ROC and AUC

The Receiver Operating Characteristic (ROC) curve plots True Positive Rate vs. False Positive Rate as the classification threshold varies:

$$\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}$$

The Area Under the Curve (AUC) summarizes overall discriminative power:

  • $\text{AUC} = 1.0$: Perfect classifier
  • $\text{AUC} = 0.5$: Random guessing
  • $\text{AUC} < 0.5$: Worse than random (predictions are inverted)

AUC has the probabilistic interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.
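The rank interpretation gives a direct, threshold-free way to compute AUC (a sketch; ties are counted as one half):

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC as P(a random positive scores higher than a random negative)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return wins.mean()

scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
print(auc_rank(scores, labels))  # 8 of 9 positive-negative pairs are ordered correctly
```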

Calibration

A classifier is calibrated if its predicted probabilities match empirical frequencies: among all examples where the model predicts $p = 0.7$, approximately 70% should actually be positive.

The Expected Calibration Error (ECE) measures miscalibration:

$$\text{ECE} = \sum_{m=1}^{M}\frac{|B_m|}{n}|\text{acc}(B_m) - \text{conf}(B_m)|$$

where $B_m$ are probability bins. Calibration is crucial in scientific applications where predicted uncertainties inform downstream decisions (e.g., medical diagnosis).
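A binned ECE implementation, checked on synthetic data whose labels are drawn from the predicted probabilities themselves (so the predictions are perfectly calibrated up to sampling noise):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width probability bins (binary case)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(probs)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()   # mean predicted probability in bin
            acc = labels[mask].mean()   # empirical positive rate in bin
            ece += mask.sum() / n * abs(acc - conf)
    return ece

rng = np.random.default_rng(5)
p = rng.uniform(size=20000)
y = (rng.random(20000) < p).astype(float)  # labels sampled from the predictions
print(expected_calibration_error(p, y))    # small for calibrated predictions
```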

12. Bayesian Logistic Regression

Instead of finding a single point estimate $\mathbf{w}^*$, the Bayesian approach maintains a full posterior distribution over weights, providing natural uncertainty quantification.

Laplace Approximation

Since the posterior has no closed form, we approximate it as a Gaussian centered at the MAP estimate:

$$p(\mathbf{w}|\mathcal{D}) \approx \mathcal{N}(\mathbf{w}_{\text{MAP}}, \mathbf{H}^{-1})$$

where $\mathbf{H} = -\nabla^2\log p(\mathbf{w}|\mathcal{D})|_{\mathbf{w}_{\text{MAP}}} = \mathbf{X}^T\mathbf{S}\mathbf{X} + \lambda\mathbf{I}$ is the Hessian of the negative log-posterior at the MAP estimate.

Predictive distribution for a new point $\mathbf{x}_*$:

$$p(y_* = 1|\mathbf{x}_*, \mathcal{D}) \approx \sigma\left(\frac{\mathbf{w}_{\text{MAP}}^T\mathbf{x}_*}{\sqrt{1 + \pi\mathbf{x}_*^T\mathbf{H}^{-1}\mathbf{x}_*/8}}\right)$$

This "softens" predictions near the decision boundary, reflecting parameter uncertainty. Points far from training data get predictions closer to 0.5 (uncertain), while points near many training examples get confident predictions.

Summary

  • Sigmoid: $\sigma(z) = 1/(1+e^{-z})$ maps real numbers to probabilities
  • Cross-entropy: The natural loss for probabilistic classification, derived from maximum likelihood
  • Gradient: $\nabla\mathcal{L} = \mathbf{X}^T(\hat{\mathbf{p}} - \mathbf{y})$ — same elegant form as linear regression
  • Softmax: Generalizes sigmoid to multi-class, with $\text{softmax}(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}$
  • Convexity: Cross-entropy is convex, so every local minimum is global; L2 regularization makes the optimum unique
  • Decision boundary: Linear in feature space; nonlinear with feature engineering