Logistic Regression
Logistic regression is the canonical probabilistic binary classifier. Despite its name, it is a classification model. In this section we derive the sigmoid function from log-odds, the cross-entropy loss from Bernoulli MLE, and an elegant closed-form gradient.
1. The Sigmoid Function: Derived from Log-Odds
We model the log-odds (logit) of the probability \(P(y=1|\mathbf{x})\) as a linear function of the features:
\[
\log\frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = \mathbf{w}^\top\mathbf{x}
\]
Solving for \(p = P(y=1|\mathbf{x})\):
\[
p = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}} = \sigma(\mathbf{w}^\top\mathbf{x})
\]
Useful properties of the sigmoid:
- \(\sigma(z) \in (0,1)\): always a valid probability
- \(\sigma(-z) = 1 - \sigma(z)\): symmetric around 0.5
- \(\sigma'(z) = \sigma(z)(1-\sigma(z))\): clean derivative, crucial for backpropagation
The sigmoid maps any real-valued score z to a probability. The decision boundary is at z = 0 (p = 0.5).
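These properties are easy to check numerically. A small sketch (the `sigmoid` helper and the test grid are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
s = sigmoid(z)

# Property 1: output is always a valid probability
assert np.all((s > 0) & (s < 1))

# Property 2: symmetry, sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - s)

# Property 3: sigma'(z) = sigma(z)(1 - sigma(z)), checked by central differences
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, s * (1 - s), atol=1e-6)

# Decision boundary: sigma(0) = 0.5 exactly
assert sigmoid(np.array([0.0]))[0] == 0.5
```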
2. Cross-Entropy Loss: Derived from Bernoulli MLE
For binary labels \(y_i \in \{0,1\}\), the predicted probability is \(\hat{p}_i = \sigma(\mathbf{w}^\top\mathbf{x}_i)\). The Bernoulli likelihood for one example is:
\[
P(y_i \mid \mathbf{x}_i; \mathbf{w}) = \hat{p}_i^{\,y_i}(1-\hat{p}_i)^{1-y_i}
\]
Assuming i.i.d. data, the log-likelihood is:
\[
\ell(\mathbf{w}) = \sum_{i=1}^{n}\left[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]
\]
Maximising the log-likelihood is equivalent to minimising the negative log-likelihood, which is the binary cross-entropy loss:
\[
\mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]
\]
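As a sketch, this loss can be computed directly from the formula (the `bce_loss` helper and the clipping constant `eps` are my own additions, there to keep the logarithms finite):

```python
import numpy as np

def bce_loss(y, p_hat, eps=1e-12):
    """Binary cross-entropy: the averaged negative Bernoulli log-likelihood.
    eps clips probabilities away from 0 and 1 so log() never sees zero."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1.0, 0.0, 1.0, 1.0])
p_hat = np.array([0.9, 0.1, 0.8, 0.6])
loss = bce_loss(y, p_hat)
# Confident correct predictions give low loss; a perfect predictor gives ~0
```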
3. Gradient Derivation
We derive the gradient \(\nabla_{\mathbf{w}}\mathcal{L}\) step by step. Let \(z_i = \mathbf{w}^\top\mathbf{x}_i\) and \(\hat{p}_i = \sigma(z_i)\).
Step 1: Derivative with respect to \(z_i\):
\[
\frac{\partial}{\partial z_i}\Bigl[y_i\log\hat p_i + (1-y_i)\log(1-\hat p_i)\Bigr] = y_i(1-\hat p_i) - (1-y_i)\hat p_i = y_i - \hat p_i
\]
Using \(\partial \log\hat p_i / \partial z_i = 1 - \hat p_i\) and \(\partial \log(1-\hat p_i)/\partial z_i = -\hat p_i\) (from \(\sigma' = \sigma(1-\sigma)\)).
Step 2: Chain rule to \(\mathbf{w}\), using \(\partial z_i / \partial \mathbf{w} = \mathbf{x}_i\):
\[
\frac{\partial}{\partial \mathbf{w}}\Bigl[y_i\log\hat p_i + (1-y_i)\log(1-\hat p_i)\Bigr] = (y_i - \hat p_i)\,\mathbf{x}_i
\]
Step 3: Sum over all examples (with a sign flip, since the loss is the negative log-likelihood):
\[
\nabla_{\mathbf{w}}\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(\hat p_i - y_i)\,\mathbf{x}_i = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{p}} - \mathbf{y})
\]
The gradient has the same elegant form as the OLS gradient: the transposed design matrix times a residual vector. The gradient descent update is \(\mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{X}^\top(\hat{\mathbf{p}} - \mathbf{y})/n\).
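One way to gain confidence in this derivation is to compare the analytic gradient \(\mathbf{X}^\top(\hat{\mathbf{p}} - \mathbf{y})/n\) against finite differences of the loss on random data. A minimal check (the random data and helper functions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Binary cross-entropy at weights w."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

# Analytic gradient from the derivation: X^T (p_hat - y) / n
grad = X.T @ (sigmoid(X @ w) - y) / len(y)

# Central finite differences, one coordinate at a time
h = 1e-6
numeric = np.array([(loss(w + h * e, X, y) - loss(w - h * e, X, y)) / (2 * h)
                    for e in np.eye(3)])
assert np.allclose(grad, numeric, atol=1e-6)
```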
4. Newton's Method / IRLS
Newton's method uses the Hessian for second-order updates. The Hessian of the logistic loss is:
\[
\mathbf{H} = \frac{1}{n}\mathbf{X}^\top\mathbf{W}\mathbf{X}, \qquad \mathbf{W} = \operatorname{diag}\bigl(\hat p_i(1-\hat p_i)\bigr)
\]
The Newton update is:
\[
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\nabla_{\mathbf{w}}\mathcal{L} = \bigl(\mathbf{X}^\top\mathbf{W}\mathbf{X}\bigr)^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{z}
\]
where \(\mathbf{z} = \mathbf{X}\mathbf{w}^{(t)} + \mathbf{W}^{-1}(\mathbf{y} - \hat{\mathbf{p}})\) is the adjusted response. This is exactly Iteratively Reweighted Least Squares (IRLS): solving a weighted OLS problem at each step. Near the optimum, Newton's method converges quadratically, versus the linear rate of gradient descent.
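A compact sketch of IRLS along these lines (the `irls` helper, toy data, and fixed iteration count are my own assumptions; a production solver would add a convergence test and guard against separable data, where the weights diverge):

```python
import numpy as np

def irls(X, y, n_iter=10):
    """Newton's method for logistic regression as IRLS: each step solves
    a weighted least-squares problem with weights p_hat * (1 - p_hat)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        W = p * (1 - p)                   # diagonal of the weight matrix
        z = X @ w + (y - p) / W          # adjusted response
        # Solve the weighted normal equations (X^T W X) w = X^T W z
        w = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return w

# Toy data: Bernoulli labels from a known linear logit (noisy, non-separable)
rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
true_w = np.array([0.5, 2.0, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

w = irls(X, y)
# At the optimum the gradient X^T(p_hat - y) vanishes
```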
5. Multi-Class: Softmax Derivation
For \(K\) classes, generalise logistic regression to model log-odds relative to a reference class. The class probabilities are given by the softmax function:
\[
P(y=k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^\top\mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^\top\mathbf{x}}}
\]
Softmax is derived by requiring that all \(K\) probabilities sum to 1 and are proportional to exponentiated scores. For \(K=2\), softmax reduces to sigmoid. The loss is the multi-class cross-entropy (categorical NLL):
\[
\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\hat p_{ik}
\]
where \(y_{ik} = 1\) if example \(i\) belongs to class \(k\), and 0 otherwise.
The gradient for class \(k\) is \(\nabla_{\mathbf{w}_k}\mathcal{L} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{p}}_k - \mathbf{y}_k)\) where \(\hat{\mathbf{p}}_k\) is the vector of predicted probabilities for class \(k\).
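A brief numerical illustration of two of these facts: softmax rows sum to 1, and the \(K=2\) case reduces to a sigmoid of the score difference (the helper and test values are illustrative):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, with the max-subtraction trick for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[1.3, -0.7]])     # two class scores for one example
p = softmax(z)

# Probabilities sum to 1 by construction
assert np.isclose(p.sum(), 1.0)

# K = 2: softmax equals the sigmoid of the score difference z_1 - z_2
sig = 1.0 / (1.0 + np.exp(-(z[0, 0] - z[0, 1])))
assert np.isclose(p[0, 0], sig)
```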
Python: Logistic Regression from Scratch
We implement gradient descent for logistic regression from scratch, plot the decision boundary on 2D data, and show convergence of the cross-entropy loss.
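The runnable environment is not reproduced here, so the following is a minimal sketch of such an implementation on toy two-blob data (the data, learning rate, and iteration count are illustrative choices, and the decision-boundary plot is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on binary cross-entropy.
    X should include a bias column; returns weights and the loss history."""
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        losses.append(-np.mean(y * np.log(p + 1e-12)
                               + (1 - y) * np.log(1 - p + 1e-12)))
        w -= lr * X.T @ (p - y) / n      # gradient step: eta * X^T(p_hat - y)/n
    return w, losses

# Toy 2D data: two Gaussian blobs, one per class
rng = np.random.default_rng(42)
A = rng.normal([-2, -2], 1.0, size=(100, 2))
B = rng.normal([2, 2], 1.0, size=(100, 2))
X = np.c_[np.ones(200), np.vstack([A, B])]   # prepend a bias column
y = np.r_[np.zeros(100), np.ones(100)]

w, losses = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
# The cross-entropy loss should fall steadily as training proceeds
```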
Key Takeaways
- The sigmoid arises naturally from modelling the log-odds as a linear function of features.
- Cross-entropy loss is the negative Bernoulli log-likelihood; minimising it is equivalent to MLE.
- The gradient \(X^\top(\hat{p} - y)/n\) has the same elegant residual form as OLS, derived via \(\sigma' = \sigma(1-\sigma)\).
- IRLS (Newton's method) converges quadratically by solving a sequence of weighted least squares problems.
- Softmax generalises sigmoid to \(K\) classes and is the output layer of most neural network classifiers.