Part VI: Optimization (Chapter 6 of 6)

Optimization Methods

Finding minima of functions: gradient-based methods for smooth convex problems, and metaheuristics for rugged non-convex landscapes. The mathematical engine behind machine learning, physics simulations, and engineering design.

Convexity: The Central Concept

A function \(f\) is convex if \(f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)\) for all \(\lambda \in [0,1]\). For twice-differentiable \(f\), this is equivalent to the Hessian \(H = \nabla^2 f \succeq 0\) being positive semidefinite everywhere.
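The definition can be checked numerically; a minimal sketch using a quadratic \(f(x) = x^\top A x\), where \(A\) is constructed to be positive semidefinite (the matrix and sample points are arbitrary illustrative choices):

```python
import numpy as np

# f(x) = x^T A x with A positive semidefinite is convex.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M.T @ M  # A = M^T M is positive semidefinite by construction

def f(x):
    return x @ A @ x

# The Hessian of f is 2A; check positive semidefiniteness via eigenvalues
eigvals = np.linalg.eigvalsh(2 * A)
print("Hessian PSD:", bool(np.all(eigvals >= -1e-12)))

# Check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) at sample points
x, y = rng.standard_normal(3), rng.standard_normal(3)
for lam in np.linspace(0, 1, 11):
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-12
print("Convexity inequality holds at 11 sample points")
```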

Convex: Every Local Min = Global Min

Gradient descent is guaranteed to find the optimum. The convergence rate depends on the condition number.

Non-Convex: Multiple Local Minima

Gradient methods can get stuck in a local minimum. Global search strategies are needed: simulated annealing, genetic algorithms, random restarts.

1. Gradient Descent

The simplest first-order method: move in the direction of steepest descent.

\(x_{k+1} = x_k - \alpha \nabla f(x_k)\)

Learning rate \(\alpha\) controls step size. Too large: diverge. Too small: slow convergence.

Convergence Rate

For \(L\)-smooth, \(\mu\)-strongly convex functions with condition number \(\kappa = L/\mu\):

\(f(x_k) - f(x^*) \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} (f(x_0) - f(x^*))\)

Linear convergence. Optimal step size: \(\alpha = 2/(L + \mu)\). Ill-conditioned problems (large \(\kappa\)) converge slowly -- the zig-zag problem.

Gradient Descent with Step Size Effects

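The step-size tradeoff shows up clearly on a small quadratic; a minimal sketch, where the diagonal Hessian (giving \(L = 10\), \(\mu = 1\)) and the three step sizes are illustrative choices:

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T diag(1, 10) x, so L = 10, mu = 1, kappa = 10.
H = np.diag([1.0, 10.0])

def grad(x):
    return H @ x

def gradient_descent(alpha, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Too small: slow.  alpha = 2/(L+mu): fast.  alpha > 2/L: diverges.
for alpha in [0.01, 2 / 11, 0.21]:
    x = gradient_descent(alpha)
    print(f"alpha={alpha:.3f}  final distance to optimum = {np.linalg.norm(x):.3e}")
```

With `alpha = 0.21 > 2/L = 0.2`, the iterate in the stiff coordinate is multiplied by \(|1 - 0.21 \cdot 10| = 1.1\) each step and blows up, which is exactly the divergence warned about above.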

2. Newton's Method for Optimization

Use second-order (curvature) information to take better steps. At each iteration, minimize a local quadratic model of \(f\):

\(x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\)

Quadratic convergence near the minimum, but each iteration requires computing and inverting the Hessian, which costs \(O(n^3)\).

Newton vs Gradient Descent: Convergence Speed

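On a quadratic the contrast is starkest: Newton's local quadratic model is exact, so one step lands on the minimizer, while gradient descent needs many. A minimal sketch (the matrix \(H\) and vector \(b\) are arbitrary illustrative choices):

```python
import numpy as np

# f(x) = 0.5 x^T H x - b^T x, a convex quadratic with gradient H x - b
H = np.array([[10.0, 2.0],
              [2.0, 1.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(H, b)  # exact minimizer

def grad(x):
    return H @ x - b

# Newton step: x - [Hessian]^{-1} grad; exact in one step for a quadratic
x0 = np.zeros(2)
x_newton = x0 - np.linalg.solve(H, grad(x0))
print("Newton error after 1 step :", np.linalg.norm(x_newton - x_star))

# Gradient descent with step 1/L for comparison
L = np.linalg.eigvalsh(H).max()
x_gd = np.zeros(2)
for _ in range(50):
    x_gd = x_gd - (1 / L) * grad(x_gd)
print("GD error after 50 steps   :", np.linalg.norm(x_gd - x_star))
```

Note the sketch solves the linear system `H d = grad` rather than forming the inverse explicitly, which is the standard way to apply the Newton step in practice.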

3. Conjugate Gradient for Optimization

A brilliant compromise: uses only gradient information (like steepest descent) but chooses conjugate directions that avoid the zig-zag problem. For an \(n\)-dimensional quadratic, CG converges in at most \(n\) steps (in exact arithmetic).

Algorithm (Fletcher-Reeves)

  1. \(d_0 = -\nabla f(x_0)\) (initial direction = steepest descent)
  2. Line search: \(\alpha_k = \arg\min_\alpha f(x_k + \alpha d_k)\)
  3. \(x_{k+1} = x_k + \alpha_k d_k\)
  4. \(\beta_{k+1} = \|\nabla f(x_{k+1})\|^2 / \|\nabla f(x_k)\|^2\)
  5. \(d_{k+1} = -\nabla f(x_{k+1}) + \beta_{k+1} d_k\) (conjugate direction)

Conjugate Gradient Optimization

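The five steps above can be sketched directly for a quadratic, where the exact line search has the closed form \(\alpha_k = -g_k^\top d_k / (d_k^\top A d_k)\) (the \(3 \times 3\) system is an illustrative choice):

```python
import numpy as np

# Fletcher-Reeves CG on the quadratic f(x) = 0.5 x^T A x - b^T x
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(A, b)

x = np.zeros(3)
g = A @ x - b          # gradient of f
d = -g                 # step 1: initial direction = steepest descent
for k in range(3):     # an n-dimensional quadratic needs at most n steps
    alpha = -(g @ d) / (d @ A @ d)    # step 2: exact line search
    x = x + alpha * d                 # step 3: update iterate
    g_new = A @ x - b
    beta = (g_new @ g_new) / (g @ g)  # step 4: Fletcher-Reeves beta
    d = -g_new + beta * d             # step 5: conjugate direction
    g = g_new
    print(f"step {k + 1}: error = {np.linalg.norm(x - x_star):.2e}")
```

The error drops to machine precision by the third step, matching the at-most-\(n\)-steps guarantee.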

4. Simulated Annealing

Inspired by metallurgical annealing: occasionally accept uphill moves to escape local minima. The probability of accepting a worse solution decreases with a temperature schedule \(T(k)\) that cools over time.

\(P(\text{accept}) = \begin{cases} 1 & \text{if } \Delta f < 0 \\ e^{-\Delta f / T} & \text{if } \Delta f \geq 0 \end{cases}\)

This is the Metropolis criterion. At high \(T\), most moves are accepted (exploration); at low \(T\), only downhill moves (exploitation).

Simulated Annealing for Non-Convex Optimization

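A minimal sketch on a one-dimensional multimodal function; the Rastrigin-style objective, Gaussian proposal width, and geometric cooling schedule are all illustrative choices:

```python
import math
import random

def f(x):
    # Rastrigin-style objective: many local minima, global minimum 0 at x = 0
    return x * x - 10 * math.cos(2 * math.pi * x) + 10

random.seed(0)
x = 4.5                  # start near a poor local minimum
fx = f(x)
best_x, best_f = x, fx
T = 10.0                 # initial temperature
for k in range(5000):
    x_new = x + random.gauss(0, 0.5)   # propose a random neighbor
    f_new = f(x_new)
    df = f_new - fx
    # Metropolis criterion: accept downhill always, uphill with prob e^{-df/T}
    if df < 0 or random.random() < math.exp(-df / T):
        x, fx = x_new, f_new
        if fx < best_f:
            best_x, best_f = x, fx
    T *= 0.999           # geometric cooling schedule

print(f"best x = {best_x:.4f}, f = {best_f:.4f}")
```

Early on (high \(T\)) the chain hops freely between basins; as \(T\) cools it settles into one of the deepest basins near the origin, something plain gradient descent from \(x = 4.5\) could never do.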

5. Genetic Algorithms

Evolution-inspired optimization: maintain a population of candidate solutions. Apply selection (survival of the fittest), crossover (recombination), and mutation (random perturbation) to evolve toward the optimum.

Genetic Algorithm for Non-Convex Optimization

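A minimal sketch of a real-valued GA; tournament selection, blend crossover, Gaussian mutation, and elitism are illustrative design choices among many variants:

```python
import math
import random

def f(x):
    # Rastrigin-style fitness (lower is better); global minimum 0 at origin
    return sum(xi * xi - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

random.seed(1)
POP, DIM, GENS = 60, 2, 200

def tournament(pop, k=3):
    # Selection: best of k randomly chosen individuals
    return min(random.sample(pop, k), key=f)

pop = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(POP)]
for gen in range(GENS):
    new_pop = [min(pop, key=f)]                    # elitism: keep the best
    while len(new_pop) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        a = random.random()                        # blend crossover
        child = [a * u + (1 - a) * v for u, v in zip(p1, p2)]
        if random.random() < 0.2:                  # mutation
            i = random.randrange(DIM)
            child[i] += random.gauss(0, 0.3)
        new_pop.append(child)
    pop = new_pop

best = min(pop, key=f)
print(f"best = {[round(c, 4) for c in best]}, f(best) = {f(best):.4f}")
```

Selection pulls the population toward fit regions, crossover recombines coordinates of good parents, and mutation keeps injecting diversity so the search does not collapse prematurely.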

6. Practical Optimization with SciPy

scipy.optimize: The Swiss Army Knife

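A minimal sketch exercising two of these entry points, assuming SciPy is installed: BFGS for a smooth local problem (the built-in Rosenbrock function) and differential evolution for a global non-convex one (a Rastrigin-style objective chosen for illustration):

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize, rosen

# Local smooth optimization: BFGS on the Rosenbrock function (min at (1, 1))
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='BFGS')
print("BFGS :", res.x, res.fun)

# Global non-convex optimization: differential evolution on Rastrigin
def rastrigin(x):
    return np.sum(x**2 - 10 * np.cos(2 * np.pi * x) + 10)

res_de = differential_evolution(rastrigin, bounds=[(-5.12, 5.12)] * 2, seed=0)
print("DE   :", res_de.x, res_de.fun)
```

Both calls return an `OptimizeResult` whose `x`, `fun`, and `success` fields make it easy to swap methods from the table below without changing surrounding code.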

Method Selection Guide

Problem Type              | Best Method            | scipy Function
Smooth, convex, small n   | Newton / BFGS          | minimize(method='BFGS')
Smooth, convex, large n   | L-BFGS-B / CG          | minimize(method='L-BFGS-B')
Constrained               | SLSQP / trust-constr   | minimize(method='SLSQP')
Non-convex, global        | Differential evolution | differential_evolution()
No derivatives available  | Nelder-Mead / Powell   | minimize(method='Nelder-Mead')
Least squares             | Levenberg-Marquardt    | least_squares()