Optimization Methods
Finding minima of functions: gradient-based methods for smooth convex problems, and metaheuristics for rugged non-convex landscapes. The mathematical engine behind machine learning, physics simulations, and engineering design.
Convexity: The Central Concept
A function \(f\) is convex if \(f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)\) for all \(\lambda \in [0,1]\). For twice-differentiable \(f\), this is equivalent to the Hessian \(H = \nabla^2 f \succeq 0\) (positive semidefinite) everywhere.
Convex: Every Local Min = Global Min
Gradient descent is guaranteed to find the global optimum (given a suitable step size). Convergence rate depends on the condition number.
Non-Convex: Multiple Local Minima
Gradient methods get stuck. Need global search: simulated annealing, genetic algorithms, random restarts.
1. Gradient Descent
The simplest first-order method: move in the direction of steepest descent.
\(x_{k+1} = x_k - \alpha \nabla f(x_k)\)
Learning rate \(\alpha\) controls step size. Too large: diverge. Too small: slow convergence.
Convergence Rate
For \(L\)-smooth, \(\mu\)-strongly convex functions with condition number \(\kappa = L/\mu\):
\(f(x_k) - f(x^*) \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} (f(x_0) - f(x^*))\)
Linear convergence. Optimal step size: \(\alpha = 2/(L + \mu)\). Ill-conditioned problems (large \(\kappa\)) converge slowly; this is the zig-zag problem.
Gradient Descent with Step Size Effects
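A minimal sketch of the step-size effect on an illustrative ill-conditioned quadratic (the test function, step sizes, and iteration count are choices for this demo, not fixed by the text):

```python
import numpy as np

def gradient_descent(grad, x0, alpha, steps=100):
    """Fixed-step gradient descent; returns the full iterate history."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad(xs[-1]))
    return np.array(xs)

# Ill-conditioned quadratic f(x) = 0.5*(x1^2 + 10*x2^2): L = 10, mu = 1, kappa = 10
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x0 = [10.0, 1.0]

for alpha in (0.01, 0.18, 0.21):
    xs = gradient_descent(grad, x0, alpha)
    print(f"alpha = {alpha}: ||x_100|| = {np.linalg.norm(xs[-1]):.3e}")
```

Stability requires \(\alpha < 2/L = 0.2\) here: 0.18 converges (zig-zagging in the stiff coordinate), 0.21 diverges, and 0.01 is stable but slow.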
2. Newton's Method for Optimization
Use second-order (curvature) information to take better steps. At each iteration, minimize a local quadratic model of \(f\):
\(x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\)
Quadratic convergence near the minimum, but each step requires computing the Hessian and solving a linear system with it, at \(O(n^3)\) cost.
Newton vs Gradient Descent: Convergence Speed
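A sketch of the speed comparison on an assumed strictly convex quartic test function (the function, learning rate, and tolerance are illustrative choices):

```python
import numpy as np

# Strictly convex test function f(x) = sum(x_i^2 + x_i^4), minimum at the origin
grad = lambda x: 2 * x + 4 * x**3
hess = lambda x: np.diag(2 + 12 * x**2)

def iters_to_tol(step_fn, x0, tol=1e-8, max_iter=10_000):
    """Count iterations until the gradient norm drops below tol."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        if np.linalg.norm(grad(x)) < tol:
            return k
        x = step_fn(x)
    return max_iter

gd_step = lambda x: x - 0.05 * grad(x)                         # fixed learning rate
newton_step = lambda x: x - np.linalg.solve(hess(x), grad(x))  # curvature-aware step

x0 = [2.0, -1.5]
print("GD iterations:    ", iters_to_tol(gd_step, x0))
print("Newton iterations:", iters_to_tol(newton_step, x0))
```

Newton reaches the tolerance in a handful of iterations (quadratic convergence), while fixed-step gradient descent needs on the order of a hundred (linear convergence).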
3. Conjugate Gradient for Optimization
A brilliant compromise: uses only gradient information (like steepest descent) but chooses conjugate directions that avoid the zig-zag problem. For an \(n\)-dimensional quadratic, CG converges in at most \(n\) steps (in exact arithmetic).
Algorithm (Fletcher-Reeves)
- \(d_0 = -\nabla f(x_0)\) (initial direction = steepest descent)
- Line search: \(\alpha_k = \arg\min_\alpha f(x_k + \alpha d_k)\)
- \(x_{k+1} = x_k + \alpha_k d_k\)
- \(\beta_{k+1} = \|\nabla f(x_{k+1})\|^2 / \|\nabla f(x_k)\|^2\)
- \(d_{k+1} = -\nabla f(x_{k+1}) + \beta_{k+1} d_k\) (conjugate direction)
Conjugate Gradient Optimization
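The Fletcher-Reeves steps above can be sketched on a quadratic \(f(x) = \tfrac{1}{2} x^\top A x - b^\top x\), where the exact line search has the closed form \(\alpha_k = -g_k^\top d_k / (d_k^\top A d_k)\) (the random SPD matrix and dimension are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)       # symmetric positive definite
b = rng.standard_normal(n)

x = np.zeros(n)
g = A @ x - b                     # gradient of the quadratic
d = -g                            # initial direction = steepest descent
for k in range(n):
    alpha = -(g @ d) / (d @ A @ d)       # exact line search on a quadratic
    x = x + alpha * d
    g_new = A @ x - b
    beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves coefficient
    d = -g_new + beta * d                # conjugate direction
    g = g_new

print("residual after n steps:", np.linalg.norm(A @ x - b))
```

After exactly \(n = 8\) steps the residual is at machine-precision level, illustrating finite termination on quadratics.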
4. Simulated Annealing
Inspired by metallurgical annealing: occasionally accept uphill moves to escape local minima. The probability of accepting a worse solution decreases with a temperature schedule \(T(k)\) that cools over time.
\(P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta f < 0 \\ e^{-\Delta f / T} & \text{if } \Delta f \geq 0 \end{cases}\)
Metropolis criterion. At high T, most moves accepted (exploration). At low T, only downhill (exploitation).
Simulated Annealing for Non-Convex Optimization
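A minimal sketch of the Metropolis loop on an assumed 1-D multimodal test function (the cooling rate, proposal width, and iteration count are illustrative):

```python
import math
import random

random.seed(0)

# 1-D Rastrigin-style function: many local minima, global minimum f(0) = 0
def f(x):
    return x**2 + 10 * (1 - math.cos(2 * math.pi * x))

x = 4.3                          # start far from the global minimum
fx = f(x)
best_x, best_f = x, fx
T = 5.0                          # initial temperature
for k in range(20_000):
    x_new = x + random.gauss(0, 0.5)          # random neighbor proposal
    f_new = f(x_new)
    delta = f_new - fx
    # Metropolis criterion: always accept downhill, sometimes uphill
    if delta < 0 or random.random() < math.exp(-delta / T):
        x, fx = x_new, f_new
        if fx < best_f:
            best_x, best_f = x, fx
    T *= 0.9995                               # geometric cooling schedule

print(f"best x = {best_x:.4f}, f = {best_f:.4f}")
```

At the hot start, uphill moves of size \(\Delta f \approx T\) are accepted with probability \(\approx e^{-1}\), letting the walk hop between basins; as \(T\) shrinks the search freezes into a low basin.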
5. Genetic Algorithms
Evolution-inspired optimization: maintain a population of candidate solutions. Apply selection (survival of the fittest), crossover (recombination), and mutation (random perturbation) to evolve toward the optimum.
Genetic Algorithm for Non-Convex Optimization
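A sketch of the three operators on the same kind of multimodal test function (population size, tournament selection, blend crossover, and mutation parameters are illustrative choices, not a canonical GA):

```python
import math
import random

random.seed(1)

# 1-D Rastrigin-style function: many local minima, global minimum f(0) = 0
def f(x):
    return x**2 + 10 * (1 - math.cos(2 * math.pi * x))

POP, GENS, LO, HI = 60, 150, -5.12, 5.12
pop = [random.uniform(LO, HI) for _ in range(POP)]

def tournament(pop):
    # Selection: the fitter of two random individuals becomes a parent
    a, b = random.choice(pop), random.choice(pop)
    return a if f(a) < f(b) else b

for _ in range(GENS):
    new_pop = [min(pop, key=f)]                  # elitism: keep the best
    while len(new_pop) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        w = random.random()
        child = w * p1 + (1 - w) * p2            # blend crossover
        if random.random() < 0.2:                # mutation
            child += random.gauss(0, 0.5)
        new_pop.append(min(max(child, LO), HI))  # clip to bounds
    pop = new_pop

best = min(pop, key=f)
print(f"best x = {best:.4f}, f(best) = {f(best):.4f}")
```

Elitism makes the best fitness monotonically non-increasing; mutation keeps enough diversity to jump between basins that crossover alone cannot reach.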
6. Practical Optimization with SciPy
scipy.optimize: The Swiss Army Knife
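A short sketch of the two main entry points, using the standard Rosenbrock and Rastrigin test functions as illustrative problems:

```python
import numpy as np
from scipy.optimize import minimize, differential_evolution

# Smooth problem: Rosenbrock via BFGS (quasi-Newton; gradient estimated numerically)
rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
res = minimize(rosen, x0=[-1.2, 1.0], method='BFGS')
print(res.x)          # near [1, 1]

# Non-convex global problem: 2-D Rastrigin via differential evolution over box bounds
rastrigin = lambda x: 10 * len(x) + sum(xi**2 - 10 * np.cos(2 * np.pi * xi) for xi in x)
res_global = differential_evolution(rastrigin, bounds=[(-5.12, 5.12)] * 2, seed=0)
print(res_global.x)   # near [0, 0]
```

The same `minimize` interface switches methods via the `method` string, matching the table below; `differential_evolution` needs only bounds, no gradients.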
Method Selection Guide
| Problem Type | Best Method | scipy Function |
|---|---|---|
| Smooth, convex, small n | Newton / BFGS | minimize(method='BFGS') |
| Smooth, convex, large n | L-BFGS-B / CG | minimize(method='L-BFGS-B') |
| Constrained | SLSQP / trust-constr | minimize(method='SLSQP') |
| Non-convex, global | Differential evolution | differential_evolution() |
| No derivatives available | Nelder-Mead / Powell | minimize(method='Nelder-Mead') |
| Least squares | Levenberg-Marquardt | least_squares() |