Newton’s Method

Definition:
Newton’s method is an iterative optimization algorithm used to find critical points (e.g., minima, maxima, or saddle points) of a differentiable function. It leverages second-order Taylor approximations and the Hessian matrix to refine estimates of the optimal solution.

For unconstrained optimization of a scalar-valued function f(x), Newton’s method updates the parameter x as:

    x_{k+1} = x_k - H(x_k)^{-1} ∇f(x_k)

where:

  • x_k: Current estimate of the solution.
  • ∇f(x_k): Gradient (first derivative) of f at x_k.
  • H(x_k): Hessian matrix (second derivative) of f at x_k.
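
As a quick illustration, here is a minimal sketch of a single Newton update in Python with NumPy; the quadratic test function and its derivatives below are assumptions chosen for the example, not part of the method itself.

    import numpy as np

    # Illustrative objective (an assumption): f(x) = x0^2 + 3*x1^2 + x0*x1
    def grad(x):
        # Gradient ∇f(x)
        return np.array([2 * x[0] + x[1], 6 * x[1] + x[0]])

    def hess(x):
        # Hessian H(x); constant because f is quadratic
        return np.array([[2.0, 1.0],
                         [1.0, 6.0]])

    x_k = np.array([1.0, 1.0])                     # current estimate x_k
    step = np.linalg.solve(hess(x_k), grad(x_k))   # solve H(x_k) d = ∇f(x_k) instead of inverting
    x_next = x_k - step                            # x_{k+1} = x_k - H(x_k)^{-1} ∇f(x_k)
    print(x_next)                                  # lands exactly on the minimizer [0, 0], since f is quadratic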

Key Concepts

  1. Taylor Approximation:
    Newton’s method is based on approximating f around a point x_k using a second-order Taylor expansion:

      f(x) ≈ f(x_k) + ∇f(x_k)^T (x - x_k) + (1/2) (x - x_k)^T H(x_k) (x - x_k)

    The minimum of this quadratic approximation (assuming H(x_k) is positive definite) occurs at:

      x = x_k - H(x_k)^{-1} ∇f(x_k)

    A short derivation is sketched after this list.

  2. Gradient and Hessian:

    • The gradient points in the direction of steepest ascent; its negative gives the steepest descent direction.
    • The Hessian provides curvature information, refining the step direction and size.
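
To see why the minimum of the quadratic model takes this form, set the gradient of the model to zero; the notation matches the formulas above.

    q(x) = f(x_k) + ∇f(x_k)^T (x - x_k) + (1/2) (x - x_k)^T H(x_k) (x - x_k)

    ∇q(x) = ∇f(x_k) + H(x_k) (x - x_k) = 0
    ⇒  x - x_k = -H(x_k)^{-1} ∇f(x_k)
    ⇒  x = x_k - H(x_k)^{-1} ∇f(x_k)

which is exactly the Newton update.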

Algorithm

  1. Initialize: Choose an initial guess x_0.
  2. Iterate: Repeat until convergence:

       x_{k+1} = x_k - H(x_k)^{-1} ∇f(x_k)

  3. Convergence Criterion: Stop when:

       ||∇f(x_k)|| < ε

    where ε is a small tolerance value.
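
The loop above can be written compactly in Python with NumPy. This is a minimal sketch under the assumption that grad(x) and hess(x) are supplied as callables and that the Hessian remains invertible; it has no line search or other safeguards.

    import numpy as np

    def newtons_method(grad, hess, x0, tol=1e-8, max_iter=50):
        # Minimal Newton iteration: x_{k+1} = x_k - H(x_k)^{-1} ∇f(x_k).
        # Stops when ||∇f(x_k)|| < tol or after max_iter steps.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:       # convergence criterion
                break
            d = np.linalg.solve(hess(x), g)   # solve H(x) d = ∇f(x) instead of forming the inverse
            x = x - d                         # Newton step
        return x

    # Usage with the same illustrative quadratic as earlier (an assumption for the example):
    grad = lambda x: np.array([2 * x[0] + x[1], 6 * x[1] + x[0]])
    hess = lambda x: np.array([[2.0, 1.0], [1.0, 6.0]])
    print(newtons_method(grad, hess, x0=[5.0, -3.0]))   # converges to [0, 0]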

Advantages

  1. Quadratic Convergence:

    • Near a local minimum with a positive definite Hessian, the error is roughly squared at each iteration, which is much faster than the linear rate of gradient descent.
  2. Accurate Steps:

    • Incorporates curvature information, resulting in better step directions and sizes.

Disadvantages

  1. Computational Cost:

    • Computing, storing, and inverting (or factorizing) the Hessian is expensive for high-dimensional problems: O(n²) storage and O(n³) time per iteration for n parameters.
  2. Not Always Stable:

    • If the Hessian is not positive definite, the Newton step may not be a descent direction, and the iteration can diverge or move toward a saddle point or maximum.
  3. Initialization Sensitivity:

    • Poor initial guesses can lead to slow convergence or divergence.

Variants of Newton’s Method

  1. Modified Newton’s Method:

    • Adds a damping factor α to improve stability:

        x_{k+1} = x_k - α H(x_k)^{-1} ∇f(x_k)

      where α ∈ (0, 1] is the step size.
  2. Quasi-Newton Methods:

    • Approximate the Hessian (or its inverse) iteratively from gradient information to reduce computational cost; see the usage sketch after this list. Examples:
      • BFGS (Broyden-Fletcher-Goldfarb-Shanno).
      • L-BFGS (Limited-memory BFGS) for large-scale problems.
  3. Hessian-Free Optimization:

    • Avoids explicit computation of the Hessian by relying only on Hessian-vector products inside matrix-free solvers such as conjugate gradient.
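
As a concrete example of the quasi-Newton variants above, here is a minimal sketch using SciPy’s built-in BFGS and L-BFGS-B implementations on the Rosenbrock test function; the choice of test function, starting point, and options is an assumption for illustration.

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([-1.2, 1.0])   # common starting point for the Rosenbrock function

    # BFGS builds a dense approximation of the inverse Hessian from gradient differences.
    res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

    # L-BFGS-B keeps only a limited history of updates, suitable for large-scale problems.
    res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

    print(res_bfgs.x, res_bfgs.nit)    # both should converge near the minimizer [1, 1]
    print(res_lbfgs.x, res_lbfgs.nit)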

Comparison to Gradient Descent

Feature              Newton’s Method                             Gradient Descent
Convergence Speed    Quadratic near the minimum                  Linear
Step Size            Adaptive (from the Hessian)                 Fixed or decaying
Computational Cost   High (Hessian computation and inversion)    Low
Applicability        Requires twice-differentiable functions     First-order differentiable

Newton’s method is powerful for optimization, particularly for problems where second-order information is computationally feasible. For large-scale problems, approximations like quasi-Newton methods are often used to balance efficiency and accuracy.