Linear Algebra for ML
Vectors, matrices, eigendecomposition, and the Singular Value Decomposition: the geometric and algebraic backbone of every machine learning algorithm.
1. Vectors and Matrices
A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered tuple of real numbers. The inner product (dot product) of two vectors is
\[
\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\theta,
\]
which encodes both the product of magnitudes and the angle \(\theta\) between the vectors. A matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) represents a linear map from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). The \((i,j)\) entry of the matrix product \(\mathbf{C} = \mathbf{AB}\) is
\[
C_{ij} = \sum_{k} A_{ik} B_{kj}.
\]
Key identities used throughout ML:
- Transpose: \((\mathbf{AB})^\top = \mathbf{B}^\top \mathbf{A}^\top\)
- Inverse: \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) when both inverses exist
- Trace: \(\mathrm{tr}(\mathbf{AB}) = \mathrm{tr}(\mathbf{BA})\) (cyclic property)
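These identities are easy to check numerically. A minimal NumPy sketch (the random matrix shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Transpose of a product reverses the order: (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)

# Trace is cyclic: tr(AB) = tr(BA) (both products are square here)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Inverse of a product reverses the order, when both inverses exist
C = rng.standard_normal((3, 3))
D = rng.standard_normal((3, 3))
assert np.allclose(np.linalg.inv(C @ D), np.linalg.inv(D) @ np.linalg.inv(C))
print("all identities verified")
```

Note that random square matrices are invertible with probability 1, which is why the inverse identity can be tested this way.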
2. Determinant and Invertibility
The determinant \(\det(\mathbf{A})\) measures the signed volume scaling factor of the linear transformation. For a \(2 \times 2\) matrix:
\[
\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.
\]
A square matrix is invertible if and only if \(\det(\mathbf{A}) \neq 0\), equivalently if and only if its columns are linearly independent. The inverse satisfies \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\).
Key property: \(\det(\mathbf{AB}) = \det(\mathbf{A})\det(\mathbf{B})\). Singular matrices (zero determinant) collapse at least one dimension to zero volume and therefore cannot be inverted.
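Both the \(2 \times 2\) formula and the product rule can be confirmed with NumPy; the matrices below are illustrative:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[0.0, 1.0],
              [2.0, 0.5]])

# 2x2 determinant: ad - bc = 2*3 - 1*1 = 5
assert np.isclose(np.linalg.det(A), 5.0)

# Multiplicativity: det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# A matrix with linearly dependent columns is singular: determinant 0
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
assert np.isclose(np.linalg.det(S), 0.0)
print("determinant properties verified")
```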
3. Eigendecomposition
An eigenvector of \(\mathbf{A} \in \mathbb{R}^{n \times n}\) is a non-zero vector that is only scaled (not rotated) by \(\mathbf{A}\):
\[
\mathbf{A}\mathbf{q}_i = \lambda_i \mathbf{q}_i, \qquad \mathbf{q}_i \neq \mathbf{0}.
\]
The eigenvalue \(\lambda_i\) gives the scale factor. To find eigenvalues we solve the characteristic polynomial \(\det(\mathbf{A} - \lambda \mathbf{I}) = 0\).
If \(\mathbf{A}\) has \(n\) linearly independent eigenvectors collected as columns of a matrix \(\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_n]\), then:
\[
\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1}, \qquad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n).
\]
For symmetric matrices \(\mathbf{A} = \mathbf{A}^\top\) (e.g., covariance matrices), the eigenvectors can be chosen orthonormal (\(\mathbf{Q}^{-1} = \mathbf{Q}^\top\)), so:
\[
\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top = \sum_{i=1}^{n} \lambda_i \mathbf{q}_i \mathbf{q}_i^\top.
\]
This spectral decomposition shows that \(\mathbf{A}\) is a sum of rank-1 projections weighted by eigenvalues: the foundation of PCA.
Figure: the linear map \(\mathbf{A}\) stretches the unit circle into an ellipse along the eigenvector directions, with the eigenvalues giving the stretch factors.
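The spectral decomposition of a small symmetric matrix can be verified numerically. A sketch with NumPy's `eigh`, which is specialised for symmetric matrices (the matrix is illustrative):

```python
import numpy as np

# A symmetric matrix, e.g. a small covariance-like matrix
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# eigh returns real eigenvalues and an orthonormal eigenvector matrix Q
lam, Q = np.linalg.eigh(A)

# Each column of Q is only scaled by A: A q_i = lambda_i q_i
for i in range(2):
    assert np.allclose(A @ Q[:, i], lam[i] * Q[:, i])

# Q is orthogonal, and A = Q diag(lam) Q^T (spectral decomposition)
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)

# Equivalently, A is a sum of rank-1 projections weighted by eigenvalues
rank1_sum = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(2))
assert np.allclose(rank1_sum, A)
print("spectral decomposition verified")
```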
4. Rank and Null Space
The rank of \(\mathbf{A} \in \mathbb{R}^{m \times n}\) is the dimension of its column space (range): \(\mathrm{rank}(\mathbf{A}) = \dim(\mathrm{col}(\mathbf{A}))\). By the rank-nullity theorem:
\[
\mathrm{rank}(\mathbf{A}) + \dim(\mathrm{null}(\mathbf{A})) = n.
\]
The null space \(\mathrm{null}(\mathbf{A}) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\}\) contains all vectors that \(\mathbf{A}\) maps to zero, which is critical for understanding the solution spaces of linear systems \(\mathbf{A}\mathbf{x} = \mathbf{b}\).
A matrix is positive definite (PD) if \(\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0\) for all \(\mathbf{x} \neq \mathbf{0}\), equivalently all eigenvalues are positive. PD matrices arise naturally as Hessians at strict local minima and as covariance matrices. The matrix \(\mathbf{X}^\top \mathbf{X}\) is always positive semi-definite, and PD when \(\mathbf{X}\) has full column rank.
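These facts about rank, null space, and definiteness can be checked directly; a sketch with NumPy (the matrices are illustrative):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # full column rank: rank 2

# Rank-nullity with n = 2 columns: rank 2 means a trivial null space
assert np.linalg.matrix_rank(X) == 2

# X^T X is symmetric PSD; full column rank makes it PD (all eigenvalues > 0)
G = X.T @ X
assert np.all(np.linalg.eigvalsh(G) > 0)

# A rank-deficient matrix has a non-trivial null space
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # dependent columns: rank 1
assert np.linalg.matrix_rank(S) == 1
v = np.array([2.0, -1.0])           # S v = 0, so v spans null(S)
assert np.allclose(S @ v, 0.0)
print("rank / null space / PD checks passed")
```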
5. Singular Value Decomposition (SVD)
The SVD generalises eigendecomposition to rectangular matrices. For \(\mathbf{A} \in \mathbb{R}^{m \times n}\) with \(m \geq n\), the SVD is:
\[
\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top,
\]
where \(\mathbf{U} \in \mathbb{R}^{m \times n}\) has orthonormal columns (left singular vectors), \(\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)\) with \(\sigma_1 \geq \cdots \geq \sigma_n \geq 0\), and \(\mathbf{V} \in \mathbb{R}^{n \times n}\) is orthogonal (right singular vectors).
Derivation from \(\mathbf{A}^\top \mathbf{A}\)
Consider the symmetric PSD matrix \(\mathbf{A}^\top \mathbf{A} \in \mathbb{R}^{n \times n}\). By the spectral theorem it has an orthonormal eigenbasis:
\[
\mathbf{A}^\top \mathbf{A}\,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad \lambda_i \geq 0, \qquad \mathbf{v}_i^\top \mathbf{v}_j = \delta_{ij}.
\]
Define the singular values \(\sigma_i = \sqrt{\lambda_i}\) and the left singular vectors \(\mathbf{u}_i = \mathbf{A}\mathbf{v}_i / \sigma_i\) (for \(\sigma_i > 0\)). Then:
\[
\mathbf{A}\mathbf{v}_i = \sigma_i \mathbf{u}_i \quad \Longrightarrow \quad \mathbf{A} = \sum_{i} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top.
\]
The \(\mathbf{u}_i\) are orthonormal: \(\mathbf{u}_i^\top \mathbf{u}_j = (\mathbf{A}\mathbf{v}_i)^\top(\mathbf{A}\mathbf{v}_j)/(\sigma_i\sigma_j) = \mathbf{v}_i^\top \mathbf{A}^\top \mathbf{A} \mathbf{v}_j / (\sigma_i \sigma_j) = \lambda_j \delta_{ij} / (\sigma_i \sigma_j) = \delta_{ij}\).
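This construction can be carried out numerically, building the SVD from the eigendecomposition of \(\mathbf{A}^\top\mathbf{A}\). A sketch assuming a random full-rank matrix so that every \(\sigma_i > 0\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))     # m >= n, full column rank almost surely

# Eigendecompose the symmetric PSD matrix A^T A
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]       # sort eigenvalues in decreasing order
lam, V = lam[order], V[:, order]

sigma = np.sqrt(lam)                # singular values sigma_i = sqrt(lambda_i)
U = (A @ V) / sigma                 # left singular vectors u_i = A v_i / sigma_i

# The u_i are orthonormal, and A = U Sigma V^T reconstructs A exactly
assert np.allclose(U.T @ U, np.eye(3), atol=1e-8)
assert np.allclose(U @ np.diag(sigma) @ V.T, A)
print("SVD reconstructed from A^T A")
```

In practice one calls `np.linalg.svd` directly, which is numerically more stable than forming \(\mathbf{A}^\top\mathbf{A}\) explicitly; the sketch above only mirrors the derivation.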
Eckart-Young Theorem (Best Low-Rank Approximation)
The rank-\(k\) truncated SVD \(\mathbf{A}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top\) solves:
\[
\mathbf{A}_k = \arg\min_{\mathrm{rank}(\mathbf{B}) \leq k} \|\mathbf{A} - \mathbf{B}\|_F, \qquad \|\mathbf{A} - \mathbf{A}_k\|_F^2 = \sum_{i > k} \sigma_i^2.
\]
This is the theoretical foundation of PCA, image compression, and latent-factor models.
Python: SVD on a Data Matrix
We construct a low-rank data matrix, compute its full SVD, analyse the singular value spectrum, and show how few components capture most of the variance: the core idea behind PCA and data compression.
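The original runnable snippet is not included here; a sketch of what such an analysis might look like (the matrix sizes, true rank, and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Construct a noisy rank-3 data matrix: 100 samples, 20 features
n, d, k_true = 100, 20, 3
A = rng.standard_normal((n, k_true)) @ rng.standard_normal((k_true, d))
A += 0.05 * rng.standard_normal((n, d))   # small additive noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The singular value spectrum drops sharply after the true rank
energy = np.cumsum(s**2) / np.sum(s**2)
print("leading singular values:", np.round(s[:6], 3))
print("fraction of variance in 3 components: %.4f" % energy[2])

# Eckart-Young: the truncated SVD is the best rank-k approximation,
# with squared Frobenius error equal to the sum of discarded sigma_i^2
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err**2, np.sum(s[k:]**2))
```

With a sharp spectrum like this, three components recover nearly all of the variance, which is exactly why truncated SVD compresses low-rank data so effectively.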
Key Takeaways
- ✓ Matrix multiplication composes linear transformations; the inner product encodes angle and length.
- ✓ Eigendecomposition \(A = Q\Lambda Q^{-1}\) diagonalises a matrix in its own eigenbasis; symmetric matrices have orthonormal eigenvectors.
- ✓ The SVD \(A = U\Sigma V^\top\) extends eigendecomposition to rectangular matrices and yields the best low-rank approximation.
- ✓ The rank-nullity theorem governs the existence and uniqueness of solutions to linear systems; PD matrices guarantee unique solutions.
- ✓ Truncated SVD with \(k\) components achieves optimal Frobenius-norm compression by retaining the \(k\) largest singular values.