3. Unconstrained Optimization
a. Introduction to Unconstrained Optimization Problems
An unconstrained optimization problem is the fundamental building block of optimization. It involves
finding the minimum or maximum of a function without any restrictions on the values of the decision
variables. The standard form is:
Minimize f(x) for x ∈ Rⁿ
or
Maximize f(x) for x ∈ Rⁿ
where f: Rⁿ → R is the objective function and x is the vector of decision variables.
Why is it important?
1. Theoretical Foundation: The optimality conditions (gradient = 0, Hessian positive definite)
are derived for unconstrained problems. These conditions form the basis for understanding
more complex constrained problems.
2. Algorithmic Core: The algorithms developed for unconstrained problems (like Gradient
Descent and Newton's Method) are often adapted and extended to handle constraints via
penalty methods, barrier methods, or Lagrange multipliers.
3. Direct Applications: Many problems can be naturally formulated as unconstrained, such as:
o Least Squares Regression: Minimizing the sum of squared errors.
o Maximum Likelihood Estimation: Maximizing the likelihood function to find model
parameters.
o Neural Network Training: Minimizing a loss function (e.g., Mean Squared Error,
Cross-Entropy) with respect to the network's weights and biases.
The goal is to find a local minimizer x*, a point such that f(x*) ≤ f(x) for all x in a neighborhood
around x*. If this holds for all x ∈ Rⁿ, then x* is a global minimizer.
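To make the least-squares application above concrete, here is a minimal sketch (with made-up data and illustrative names) that minimizes a sum of squared errors over R² and inspects the gradient at the solution:
python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: points scattered around the line y = 2x + 1
rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 1.0, 20)
y_data = 2.0 * x_data + 1.0 + 0.05 * rng.standard_normal(20)

def sse(w):
    # Objective f(w): sum of squared errors of the line y = w[0]*x + w[1]
    return np.sum((w[0] * x_data + w[1] - y_data) ** 2)

# Unconstrained: w ranges over all of R^2
res = minimize(sse, x0=np.array([0.0, 0.0]))
print(res.x)    # close to [2, 1]
print(res.jac)  # gradient at the minimizer, approximately zero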
b. One-Dimensional Optimization Algorithms (Line Search Methods)
These algorithms find the minimum of a function of a single variable, f(α). They are crucial as a sub-
problem in multidimensional optimization, where we often need to find the best step size α to take
in a given direction p (i.e., minimize f(xₖ + αpₖ)).
1. Golden Section Search
Purpose: A robust, derivative-free method for finding a minimum of a unimodal function (a
function with a single minimum in the interval) within a specified interval [a, b].
Intuition: It works by recursively narrowing the interval that contains the minimum by
comparing function values at interior points. The key is to choose these interior points in
such a way that one of them can be reused in the next iteration, minimizing the number of
function evaluations.
The Golden Ratio: The algorithm uses the golden ratio φ = (1 + √5)/2 ≈ 1.618 to determine
the interior points.
o c = b - (b - a)/φ
o d = a + (b - a)/φ
Algorithm Steps:
1. Start with an interval [a, b] known to contain a minimum.
2. Calculate two interior points: c and d.
3. If f(c) < f(d), the minimum must lie in [a, d]. Set b = d.
4. Else, the minimum lies in [c, b]. Set a = c.
5. Repeat until the interval length is sufficiently small.
Diagram: Golden Section Search Iteration
Pros & Cons:
o ✅ Pros: Very reliable; always converges for unimodal functions; doesn't require
derivatives.
o ❌ Cons: Convergence is linear (slow); only for 1D problems; requires a bracketing
interval.
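A minimal sketch of the steps above, assuming f is unimodal on [a, b] (function names and tolerance are illustrative):
python
import math

def golden_section_search(f, a, b, tol=1e-6):
    # Repeatedly narrow [a, b]; one interior point (and its value) is reused each iteration
    phi = (1 + math.sqrt(5)) / 2
    c, d = b - (b - a) / phi, a + (b - a) / phi
    fc, fd = f(c), f(d)
    while (b - a) > tol:
        if fc < fd:              # minimum lies in [a, d]
            b, d, fd = d, c, fc  # old c becomes the new d
            c = b - (b - a) / phi
            fc = f(c)            # only one new function evaluation
        else:                    # minimum lies in [c, b]
            a, c, fc = c, d, fd  # old d becomes the new c
            d = a + (b - a) / phi
            fd = f(d)
    return (a + b) / 2

# Example: minimum of (x - 2)^2 on [0, 5]
print(golden_section_search(lambda x: (x - 2) ** 2, 0.0, 5.0))  # ≈ 2.0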
2. Brent's Method
Purpose: A more sophisticated 1D minimization algorithm that combines the reliability of
bracketing (like Golden Section) with the speed of interpolation-based steps, without
requiring derivatives.
Intuition: For minimization, Brent's method combines two techniques:
1. Successive Parabolic Interpolation: Fits a parabola through the three best points and
jumps to its minimum. Very fast when the function is smooth. (Inverse quadratic
interpolation plays the analogous role in Brent's root-finding method.)
2. Golden Section Search: Used as a reliable fallback when the interpolation step would be
slow or unreliable (e.g., when the points are nearly collinear).
How it works: The algorithm tries the fast interpolation step first. If the predicted minimum
is within the current bracketing interval and promises a good convergence rate, it uses that
prediction. Otherwise, it takes a Golden Section step to ensure progress.
Pros & Cons:
o ✅ Pros: Superfast convergence (superlinear) for well-behaved functions; reliable due
to the fallback mechanism; considered one of the best general-purpose 1D
minimizers.
o ❌ Cons: More complex to implement; logic for switching methods is heuristic.
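In practice Brent's method is rarely hand-coded; SciPy's minimize_scalar exposes it directly. A small usage sketch (the function and bracket below are illustrative):
python
from scipy.optimize import minimize_scalar

f = lambda x: (x - 2.0) ** 2 + 1.0

# method='brent': parabolic interpolation with golden-section fallback steps
res = minimize_scalar(f, bracket=(0, 1, 4), method='brent')
print(res.x, res.fun, res.nfev)  # ≈ 2.0, 1.0, and the number of function evaluations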
c. Multidimensional Optimization Algorithms
These are the workhorses for most practical problems. They iteratively generate a sequence of
points x₀, x₁, x₂, ... that (hopefully) converges to a local minimizer x*.
General Iterative Scheme:
xₖ₊₁ = xₖ + αₖ pₖ
where:
pₖ is the search direction.
αₖ is the step length or learning rate, found via a 1D line search (e.g., Golden Section, Brent)
along the direction pₖ.
The choice of the search direction pₖ defines the algorithm.
1. Method of Steepest Descent
Search Direction: pₖ = -∇f(xₖ)
o The most obvious choice: move in the direction where the function decreases most
rapidly locally.
Algorithm: At each iteration, calculate the gradient and take a step in the negative gradient
direction: xₖ₊₁ = xₖ - αₖ ∇f(xₖ).
Diagram: Path of Steepest Descent
The path often exhibits a "zig-zag" pattern, especially in narrow valleys, leading to slow convergence.
Pros & Cons:
o ✅ Pros: Very simple to understand and implement; guaranteed convergence under
mild conditions; only requires first derivatives (gradient).
o ❌ Cons: Convergence is very slow (linear convergence rate); performance is poor for
ill-conditioned problems (where the Hessian has eigenvalues of very different
magnitudes); zig-zag behavior is inefficient.
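A minimal steepest-descent loop with a 1D line search for αₖ (the test function, gradient, and tolerances below are illustrative):
python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop when the gradient is nearly zero
            break
        p = -g                         # search direction: negative gradient
        # Line search: minimize phi(alpha) = f(x + alpha * p) along p
        alpha = minimize_scalar(lambda a: f(x + a * p)).x
        x = x + alpha * p
    return x

# Example: an elongated quadratic bowl, where the zig-zag behaviour shows up
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent(f, grad, [3.0, 1.0]))  # ≈ [0, 0]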
2. Newton's Method
Search Direction: pₖ = - (H(xₖ))⁻¹ ∇f(xₖ)
o This direction is obtained by minimizing a quadratic approximation (second-order
Taylor expansion) of f at xₖ.
Algorithm: xₖ₊₁ = xₖ - (H(xₖ))⁻¹ ∇f(xₖ)
Intuition: Newton's method uses both gradient (slope) and Hessian (curvature) information.
It doesn't just go downhill; it assumes the function looks like a bowl and jumps directly to the
bottom of that bowl.
Diagram: Newton's Method Step
Pros & Cons:
o ✅ Pros: Extremely fast convergence (quadratic convergence rate) near the optimum
if the initial guess is good.
o ❌ Cons: Requires calculation and inversion of the Hessian matrix H, which is
computationally expensive O(n³) for n variables; requires second derivatives; if the
Hessian is not positive definite, the direction pₖ might not be a descent direction;
can diverge if started far from a minimum.
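A minimal Newton iteration, solving H p = -∇f rather than explicitly inverting H (the 2D Rosenbrock derivatives below are written out by hand; no damping or line search is applied):
python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)  # solve H p = -g instead of forming H^{-1}
        x = x + p                          # full Newton step (can misbehave far from x*)
    return x

# Example: 2D Rosenbrock function, minimizer at (1, 1)
grad = lambda x: np.array([-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1 - x[0]),
                           200.0 * (x[1] - x[0] ** 2)])
hess = lambda x: np.array([[1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                           [-400.0 * x[0], 200.0]])
print(newton_method(grad, hess, [-1.2, 1.0]))  # ≈ [1, 1]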
3. Conjugate Gradient Method
Purpose: A method designed to accelerate convergence for large problems where storing or
inverting the Hessian (like in Newton's method) is too expensive. It's often used for very
high-dimensional problems (e.g., n > 10,000).
Intuition: Instead of using the negative gradient (which can lead to zig-zagging), it chooses a
search direction that is conjugate to the previous search direction. This means that each new
direction is chosen to not spoil the minimization achieved by the previous direction.
Search Direction: The direction is updated iteratively:
pₖ = -∇f(xₖ) + βₖ pₖ₋₁
where βₖ is a scalar (e.g., computed by the Fletcher-Reeves or Polak-Ribière formula) that
ensures conjugacy.
Pros & Cons:
o ✅ Pros: Faster convergence than Steepest Descent; doesn't require storing a matrix
(like the Hessian), only the previous gradient and direction; more memory efficient
than Newton's method.
o ❌ Cons: Convergence is slower than Newton's method (superlinear at best for general
functions; on a quadratic it terminates in at most n steps with exact line search); can be more
sensitive to the accuracy of the line search.
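A minimal nonlinear CG sketch using the Fletcher-Reeves formula for βₖ (restarts and line-search safeguards are omitted; the test function is illustrative):
python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves_cg(f, grad, x0, tol=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = minimize_scalar(lambda a: f(x + a * p)).x  # 1D line search
        x = x + alpha * p
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves: ||g_{k+1}||^2 / ||g_k||^2
        p = -g_new + beta * p             # new direction, conjugate to the previous one
        g = g_new
    return x

# Example: a 2D quadratic; with exact line searches CG finishes in at most 2 iterations
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2 + x[0] * x[1]
grad = lambda x: np.array([2.0 * x[0] + x[1], 20.0 * x[1] + x[0]])
print(fletcher_reeves_cg(f, grad, [3.0, 1.0]))  # ≈ [0, 0]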
d. Implementation using NumPy and SciPy
The scipy.optimize module provides implementations for all these algorithms, saving you from
writing them from scratch.
Key Function: scipy.optimize.minimize(fun, x0, method, jac, hess, ...)
fun: The objective function to minimize.
x0: Initial guess (ndarray).
method: The algorithm to use.
jac: (Optional) Function to compute the Jacobian (gradient).
hess: (Optional) Function to compute the Hessian matrix.
Code Examples:
python
import numpy as np
from scipy.optimize import minimize

# Define the Rosenbrock function (a classic test problem)
def rosen(x):
    return np.sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

x0 = np.array([-1.2, 1.0])  # Initial guess

# 1. Using Nelder-Mead (Simplex, derivative-free)
result_nm = minimize(rosen, x0, method='Nelder-Mead')
print("Nelder-Mead result:", result_nm.x)

# 2. Using BFGS (a Quasi-Newton method that approximates the Hessian)
result_bfgs = minimize(rosen, x0, method='BFGS')
print("BFGS result:", result_bfgs.x)

# 3. Using Newton-CG (uses Newton's direction with CG to solve H p = -∇f)
# Newton-CG requires the gradient; the Hessian (or Hessian-vector products) is optional
def rosen_der(x):
    # Analytical gradient of the Rosenbrock function
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm - xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1 - xm)
    der[0] = -400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0])
    der[-1] = 200*(x[-1] - x[-2]**2)
    return der

result_ncg = minimize(rosen, x0, method='Newton-CG', jac=rosen_der)
print("Newton-CG result:", result_ncg.x)
PART 2: HIGH-MARK ESSAY QUESTIONS & ANSWERS
Essay Question 1 (25 Marks)
"Compare and contrast the Method of Steepest Descent, Newton's Method, and the Conjugate
Gradient method for multidimensional unconstrained optimization. Your answer must detail the
mathematical formulation of the search direction for each, their convergence properties,
computational requirements, and practical suitability. Use diagrams where appropriate to illustrate
the behavior of each algorithm."
Answer:
The choice of algorithm for unconstrained optimization is a trade-off between computational cost,
convergence speed, and robustness. The Method of Steepest Descent, Newton's Method, and the
Conjugate Gradient method represent key points on this spectrum, each with distinct advantages
and limitations.
I. Method of Steepest Descent: The Simple Workhorse
The search direction for Steepest Descent is intuitively derived: at each point xₖ, it moves in the
direction of the negative gradient pₖ = -∇f(xₖ). This is the direction of the locally steepest decrease in
the function value.
Mathematically, the update is xₖ₊₁ = xₖ - αₖ ∇f(xₖ), where αₖ is found via a line search. Its
convergence is guaranteed under mild conditions but is linear. This rate is often unacceptably slow,
especially for ill-conditioned problems where the Hessian matrix has a high condition number (a
large ratio of its largest to smallest eigenvalue). In such landscapes, like the Rosenbrock banana
function, the path of Steepest Descent exhibits a characteristic zig-zag pattern (as shown in Diagram
1), as each new gradient direction is orthogonal to the previous one, leading to inefficient progress.
Computationally, it is cheap per iteration, requiring only the calculation of the gradient vector O(n). It
does not require second derivatives or matrix storage. Its practicality is limited to small, simple
problems or as a component in more complex algorithms where a cheap, rough step is needed.
II. Newton's Method: The Second-Order Benchmark
Newton's Method uses a more sophisticated search direction: pₖ = - (H(xₖ))⁻¹ ∇f(xₖ). This direction is
derived by minimizing the second-order Taylor series approximation of f at xₖ. It doesn't just go
downhill; it assumes the function is quadratic and jumps directly to the estimated minimum of that
quadratic.
The convergence rate of Newton's Method is quadratic near a local minimizer, meaning the number
of correct digits roughly doubles with each iteration. This is dramatically faster than the linear rate of
Steepest Descent. However, this speed comes at a high computational cost. Each iteration requires:
1. Computation of the Hessian matrix H (O(n²) operations).
2. Solving the linear system H p = -∇f (O(n³) operations for factorization).
This makes it prohibitively expensive for high-dimensional problems. Furthermore, it is less
robust. If the initial guess is poor or the Hessian is not positive definite, the method may not
converge. It requires careful implementation to handle indefinite Hessians.
III. Conjugate Gradient Method: The Smart Middle Ground
The Conjugate Gradient (CG) method seeks to overcome the limitations of the previous two. Its
search direction is updated as pₖ = -∇f(xₖ) + βₖ pₖ₋₁. The scalar βₖ (calculated by, e.g., the Fletcher-
Reeves formula) ensures that the new direction is conjugate to the previous one, meaning it
preserves the minimization achieved in previous steps and prevents the zig-zagging of Steepest
Descent.
For quadratic functions in n dimensions, CG converges in at most n iterations with exact line
searches (in exact arithmetic it behaves like a direct method). For
general non-linear functions, its convergence is superlinear. While slower than Newton's quadratic
rate, it is significantly faster than linear convergence.
Computationally, CG is very efficient. It requires only O(n) storage (it needs to store the previous
gradient and direction vectors, not the Hessian) and O(n) operations per iteration. This makes it the
algorithm of choice for very large-scale problems (e.g., n > 10,000) where storing an n x n Hessian is
impossible. Its main drawback is that it can be more sensitive to the accuracy of the line search
compared to other methods.
IV. Summary and Practical Suitability
Method | Search Direction pₖ | Convergence Rate | Cost per Iteration | Storage | Best For
Steepest Descent | -∇f(xₖ) | Linear | O(n) | O(n) | Simple, small problems; educational purposes.
Newton's Method | -H⁻¹(xₖ) ∇f(xₖ) | Quadratic | O(n³) | O(n²) | Medium-sized problems (n < 1000) with good Hessians.
Conjugate Gradient | -∇f(xₖ) + βₖ pₖ₋₁ | Superlinear | O(n) | O(n) | Very large-scale problems (n very large).
In practice, Quasi-Newton methods like BFGS and L-BFGS, which build an approximation of the
Hessian to achieve a superlinear convergence rate without the cost of calculating the exact Hessian,
are often the most popular choice for general-purpose, medium-sized optimization. They effectively
bridge the gap between the robustness of Steepest Descent and the speed of Newton's Method.
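As a quick illustration of that quasi-Newton route, switching SciPy's solver is a one-line change (Rosenbrock example, using SciPy's built-in rosen helpers):
python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])

# L-BFGS-B stores only a few recent vectors instead of a full n x n Hessian
# approximation, which keeps memory at O(n) and scales to very large problems.
result = minimize(rosen, x0, method='L-BFGS-B', jac=rosen_der)
print(result.x)  # ≈ [1, 1]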
Essay Question 2 (25 Marks)
"The line search is a critical sub-problem in many multidimensional optimization algorithms.
Explain the role of line search, and critically evaluate the Golden Section Search and Brent's
algorithms for solving it. Discuss the circumstances under which one would be preferred over the
other, and how their performance impacts the overall efficiency of the multidimensional
optimizer."
Answer:
Multidimensional optimization algorithms like Steepest Descent and Conjugate Gradient follow a
two-step iterative process: 1) choose a search direction pₖ, and 2) determine how far to move in that
direction. This second step, finding the step length αₖ that (approximately) minimizes φ(α) = f(xₖ +
αpₖ), is known as the line search. Its effective solution is paramount to the overall performance and
robustness of the optimization algorithm.
I. The Role of Line Search
The purpose of the line search is not necessarily to find the exact minimizer along the ray, but to find
a step length α that provides a sufficient decrease in the objective function. A good line search
ensures global convergence (convergence from any starting point) and prevents the algorithm from
taking overly large steps that cause divergence or overly small steps that lead to glacial progress. It
transforms a direction-finding algorithm into a complete, convergent minimization routine.
II. Golden Section Search: The Reliable Bracketer
Golden Section Search (GSS) is a derivative-free, bracketing method designed to find the minimum of
a unimodal function within an interval [a, b]. Its core principle is to recursively narrow the interval of
uncertainty by comparing function values at two interior points, c and d, chosen based on the golden
ratio.
GSS is celebrated for its robustness and guaranteed convergence. It will always find a minimum
within the bracketing interval, and its performance is predictable. It does not require any derivative
information, making it applicable to non-smooth functions.
However, this robustness comes at the cost of speed. GSS has a linear convergence rate, meaning
the interval of uncertainty shrinks by a constant factor (1/φ ≈ 0.618) per iteration, so roughly 30
iterations are needed to reduce its length by a factor of 10⁻⁶. Each iteration requires only one new
function evaluation. For expensive objective functions f(x), this can be a significant bottleneck for the overall
optimizer.
III. Brent's Method: The Hybrid Accelerator
Brent's method is a sophisticated hybrid algorithm that aims to achieve the reliability of bracketing
with the speed of polynomial interpolation. For minimization it combines two techniques:
1. Successive Parabolic Interpolation: Fits a parabola through the three best points to predict the minimum.
2. Golden Section Search: Used as a reliable fallback to guarantee progress.
(Inverse quadratic interpolation plays the analogous role in Brent's root-finding method.)
Brent's method intelligently switches between these techniques. It uses the parabolic interpolation
step when the predicted point is within the current bracket and promises fast convergence. If the
interpolation step is deemed unreliable (e.g., the points are nearly collinear, or the prediction is
outside the bracket), it defaults to a step of Golden Section Search.
This strategy gives Brent's method a superlinear convergence rate for well-behaved functions, often
requiring far fewer function evaluations than GSS to achieve the same accuracy. It is widely regarded
as one of the best general-purpose 1D minimizers in practice.
The main drawback is its complexity. The logic for switching between methods is heuristic and more
complicated to implement correctly. However, for the end-user, this complexity is hidden within
robust implementations like scipy.optimize.minimize_scalar(method='brent').
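To make the comparison concrete, both methods are available through minimize_scalar and report their function-evaluation counts; a small illustrative check (the test function is made up):
python
from scipy.optimize import minimize_scalar

phi = lambda a: (a - 2.0) ** 2 + 1.0  # a smooth, unimodal stand-in for f(xk + a*pk)

res_golden = minimize_scalar(phi, bracket=(0, 1, 4), method='golden')
res_brent = minimize_scalar(phi, bracket=(0, 1, 4), method='brent')

# For smooth functions, Brent's method typically needs far fewer evaluations
print('golden:', res_golden.x, 'nfev =', res_golden.nfev)
print('brent :', res_brent.x, 'nfev =', res_brent.nfev)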
IV. Impact on Multidimensional Optimization and Preference
The choice of line search algorithm directly impacts the efficiency of the multidimensional optimizer:
Number of Function Evaluations: This is often the dominant computational cost. A faster line
search like Brent's reduces the total number of f(x) calls dramatically.
Quality of the Step: A more accurate line search (finding a better α) can lead to better search
directions in subsequent iterations, especially for algorithms like Conjugate Gradient which
are sensitive to the line search accuracy.
Circumstances for Preference:
Use Golden Section Search when the objective function is noisy, non-differentiable, or has
discontinuities. Its reliance only on function values and its robustness make it a safe, albeit
slower, choice. It is also useful for validating results from more complex methods.
Use Brent's Method for the vast majority of smooth, well-behaved objective functions. Its
superior speed makes it the default choice in most scientific libraries
(scipy.optimize.minimize_scalar, for example, uses Brent's method by default).
The reduction in function evaluations leads to a significantly more efficient overall
optimization process.
In conclusion, while the multidimensional search direction (e.g., gradient, Newton, conjugate) gets
more attention, the line search is the silent workhorse that ensures progress. The choice between a
robust but slow method like Golden Section and a fast, intelligent hybrid like Brent's is a classic trade-
off between reliability and efficiency, with Brent's method being the preferred choice for most
modern scientific computing applications involving smooth functions.