Part VI: Optimization (Chapter 6 of 6)

Optimization Methods

Finding minima of functions: gradient-based methods for smooth convex problems, and metaheuristics for rugged non-convex landscapes. The mathematical engine behind machine learning, physics simulations, and engineering design.

Convexity: The Central Concept

A function \(f\) is convex if \(f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)\) for all \(\lambda \in [0,1]\). For twice-differentiable \(f\), this is equivalent to the Hessian \(H = \nabla^2 f \succeq 0\) being positive semidefinite everywhere.
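The definition can be checked numerically; a minimal sketch using a quadratic \(f(x) = x^\top A x\), where \(A\) is constructed to be positive semidefinite (the matrix and sample points are arbitrary illustrative choices):

```python
import numpy as np

# f(x) = x^T A x with A positive semidefinite is convex.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M.T @ M  # A = M^T M is positive semidefinite by construction

def f(x):
    return x @ A @ x

# The Hessian of f is 2A; check positive semidefiniteness via eigenvalues
eigvals = np.linalg.eigvalsh(2 * A)
print("Hessian PSD:", bool(np.all(eigvals >= -1e-12)))

# Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) at sample points
x, y = rng.standard_normal(3), rng.standard_normal(3)
for lam in np.linspace(0, 1, 11):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-12
print("Convexity inequality holds at 11 sample points")
```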

Convex: Every Local Min = Global Min

Gradient descent is guaranteed to find the optimum. The convergence rate depends on the condition number.

Non-Convex: Multiple Local Minima

Gradient methods can get stuck in a local minimum. Global search strategies are needed: simulated annealing, genetic algorithms, random restarts.

1. Gradient Descent

The simplest first-order method: move in the direction of steepest descent.

\(x_{k+1} = x_k - \alpha \nabla f(x_k)\)

Learning rate \(\alpha\) controls step size. Too large: diverge. Too small: slow convergence.

Convergence Rate

For \(L\)-smooth, \(\mu\)-strongly convex functions with condition number \(\kappa = L/\mu\):

\(f(x_k) - f(x^*) \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} (f(x_0) - f(x^*))\)

Linear convergence. Optimal step size: \(\alpha = 2/(L + \mu)\). Ill-conditioned problems (large \(\kappa\)) converge slowly -- the zig-zag problem.

Gradient Descent with Step Size Effects

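The step-size tradeoff shows up clearly on a small quadratic; a minimal sketch, where the diagonal Hessian (giving \(L = 10\), \(\mu = 1\)) and the three step sizes are illustrative choices:

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T diag(1, 10) x, so L = 10, mu = 1, kappa = 10.
H = np.diag([1.0, 10.0])

def grad(x):
    return H @ x

def gradient_descent(alpha, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Too small: slow.  alpha = 2/(L+mu): fast.  alpha > 2/L: diverges.
for alpha in [0.01, 2 / 11, 0.21]:
    x = gradient_descent(alpha)
    print(f"alpha={alpha:.3f}  final distance to optimum = {np.linalg.norm(x):.3e}")
```

With `alpha = 0.21 > 2/L = 0.2`, the iterate in the stiff coordinate is multiplied by \(|1 - 0.21 \cdot 10| = 1.1\) each step and blows up, which is exactly the divergence warned about above.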

2. Newton's Method for Optimization

Use second-order (curvature) information to take better steps. At each iteration, minimize a local quadratic model of \(f\):

\(x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\)

Quadratic convergence near the minimum, but each iteration requires computing and inverting the Hessian, which costs \(O(n^3)\).

Newton vs Gradient Descent: Convergence Speed

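On a quadratic the contrast is starkest: Newton's local quadratic model is exact, so one step lands on the minimizer, while gradient descent needs many. A minimal sketch (the matrix \(H\) and vector \(b\) are arbitrary illustrative choices):

```python
import numpy as np

# f(x) = 0.5 x^T H x - b^T x, a convex quadratic with gradient H x - b
H = np.array([[10.0, 2.0],
              [2.0, 1.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(H, b)  # exact minimizer

def grad(x):
    return H @ x - b

# Newton step: x - [Hessian]^{-1} grad; exact in one step for a quadratic
x0 = np.zeros(2)
x_newton = x0 - np.linalg.solve(H, grad(x0))
print("Newton error after 1 step :", np.linalg.norm(x_newton - x_star))

# Gradient descent with step 1/L for comparison
L = np.linalg.eigvalsh(H).max()
x_gd = np.zeros(2)
for _ in range(50):
    x_gd = x_gd - (1 / L) * grad(x_gd)
print("GD error after 50 steps   :", np.linalg.norm(x_gd - x_star))
```

Note the sketch solves the linear system `H d = grad` rather than forming the inverse explicitly, which is the standard way to apply the Newton step in practice.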

3. Conjugate Gradient for Optimization

A brilliant compromise: uses only gradient information (like steepest descent) but chooses conjugate directions that avoid the zig-zag problem. For an \(n\)-dimensional quadratic, CG converges in at most \(n\) steps (in exact arithmetic).

Algorithm (Fletcher-Reeves)

  1. \(d_0 = -\nabla f(x_0)\) (initial direction = steepest descent)
  2. Line search: \(\alpha_k = \arg\min_\alpha f(x_k + \alpha d_k)\)
  3. \(x_{k+1} = x_k + \alpha_k d_k\)
  4. \(\beta_{k+1} = \|\nabla f(x_{k+1})\|^2 / \|\nabla f(x_k)\|^2\)
  5. \(d_{k+1} = -\nabla f(x_{k+1}) + \beta_{k+1} d_k\) (conjugate direction)

Conjugate Gradient Optimization

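The five steps above can be sketched directly for a quadratic, where the exact line search has the closed form \(\alpha_k = -g_k^\top d_k / (d_k^\top A d_k)\) (the \(3 \times 3\) system is an illustrative choice):

```python
import numpy as np

# Fletcher-Reeves CG on the quadratic f(x) = 0.5 x^T A x - b^T x
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(A, b)

x = np.zeros(3)
g = A @ x - b          # gradient of f
d = -g                 # step 1: initial direction = steepest descent
for k in range(3):     # an n-dimensional quadratic needs at most n steps
    alpha = -(g @ d) / (d @ A @ d)    # step 2: exact line search
    x = x + alpha * d                 # step 3: update iterate
    g_new = A @ x - b
    beta = (g_new @ g_new) / (g @ g)  # step 4: Fletcher-Reeves beta
    d = -g_new + beta * d             # step 5: conjugate direction
    g = g_new
    print(f"step {k + 1}: error = {np.linalg.norm(x - x_star):.2e}")
```

The error drops to machine precision by the third step, matching the at-most-\(n\)-steps guarantee.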

4. Simulated Annealing

Inspired by metallurgical annealing: occasionally accept uphill moves to escape local minima. The probability of accepting a worse solution decreases with a temperature schedule \(T(k)\) that cools over time.

\(P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta f < 0 \\ e^{-\Delta f / T} & \text{if } \Delta f \geq 0 \end{cases}\)

This is the Metropolis criterion. At high \(T\), most moves are accepted (exploration); at low \(T\), only downhill moves (exploitation).

Simulated Annealing for Non-Convex Optimization

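A minimal sketch on a one-dimensional multimodal function; the Rastrigin-style objective, Gaussian proposal width, and geometric cooling schedule are all illustrative choices:

```python
import math
import random

def f(x):
    # Rastrigin-style objective: many local minima, global minimum 0 at x = 0
    return x * x - 10 * math.cos(2 * math.pi * x) + 10

random.seed(0)
x = 4.5                  # start near a poor local minimum
fx = f(x)
best_x, best_f = x, fx
T = 10.0                 # initial temperature
for k in range(5000):
    x_new = x + random.gauss(0, 0.5)   # propose a random neighbor
    f_new = f(x_new)
    df = f_new - fx
    # Metropolis criterion: accept downhill always, uphill with prob e^{-df/T}
    if df < 0 or random.random() < math.exp(-df / T):
        x, fx = x_new, f_new
        if fx < best_f:
            best_x, best_f = x, fx
    T *= 0.999           # geometric cooling schedule

print(f"best x = {best_x:.4f}, f = {best_f:.4f}")
```

Early on (high \(T\)) the chain hops freely between basins; as \(T\) cools it settles into one of the deepest basins near the origin, something plain gradient descent from \(x = 4.5\) could never do.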

5. Genetic Algorithms

Evolution-inspired optimization: maintain a population of candidate solutions. Apply selection (survival of the fittest), crossover (recombination), and mutation (random perturbation) to evolve toward the optimum.

Genetic Algorithm for Non-Convex Optimization

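A minimal sketch of a real-valued GA; tournament selection, blend crossover, Gaussian mutation, and elitism are illustrative design choices among many variants:

```python
import math
import random

def f(x):
    # Rastrigin-style fitness (lower is better); global minimum 0 at origin
    return sum(xi * xi - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

random.seed(1)
POP, DIM, GENS = 60, 2, 200

def tournament(pop, k=3):
    # Selection: best of k randomly chosen individuals
    return min(random.sample(pop, k), key=f)

pop = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(POP)]
for gen in range(GENS):
    new_pop = [min(pop, key=f)]                    # elitism: keep the best
    while len(new_pop) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        a = random.random()                        # blend crossover
        child = [a * u + (1 - a) * v for u, v in zip(p1, p2)]
        if random.random() < 0.2:                  # mutation
            i = random.randrange(DIM)
            child[i] += random.gauss(0, 0.3)
        new_pop.append(child)
    pop = new_pop

best = min(pop, key=f)
print(f"best = {[round(c, 4) for c in best]}, f(best) = {f(best):.4f}")
```

Selection pulls the population toward fit regions, crossover recombines coordinates of good parents, and mutation keeps injecting diversity so the search does not collapse prematurely.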

6. Practical Optimization with SciPy

scipy.optimize: The Swiss Army Knife

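A minimal sketch exercising two of these entry points, assuming SciPy is installed: BFGS for a smooth local problem (the built-in Rosenbrock function) and differential evolution for a global non-convex one (a Rastrigin-style objective chosen for illustration):

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize, rosen

# Local smooth optimization: BFGS on the Rosenbrock function (min at (1, 1))
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='BFGS')
print("BFGS :", res.x, res.fun)

# Global non-convex optimization: differential evolution on Rastrigin
def rastrigin(x):
    return np.sum(x**2 - 10 * np.cos(2 * np.pi * x) + 10)

res_de = differential_evolution(rastrigin, bounds=[(-5.12, 5.12)] * 2, seed=0)
print("DE   :", res_de.x, res_de.fun)
```

Both calls return an `OptimizeResult` whose `x`, `fun`, and `success` fields make it easy to swap methods from the table below without changing surrounding code.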

Method Selection Guide

Problem Type              | Best Method            | scipy Function
Smooth, convex, small n   | Newton / BFGS          | minimize(method='BFGS')
Smooth, convex, large n   | L-BFGS-B / CG          | minimize(method='L-BFGS-B')
Constrained               | SLSQP / trust-constr   | minimize(method='SLSQP')
Non-convex, global        | Differential evolution | differential_evolution()
No derivatives available  | Nelder-Mead / Powell   | minimize(method='Nelder-Mead')
Least squares             | Levenberg-Marquardt    | least_squares()