Definition:
The BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm is a quasi-Newton optimization method used to minimize a differentiable scalar-valued function $f : \mathbb{R}^n \to \mathbb{R}$. It iteratively approximates the inverse of the Hessian matrix, avoiding the computational cost of explicitly calculating or inverting the Hessian.

BFGS updates an approximation of the inverse Hessian matrix using information from gradients at successive iterations. It is widely used for unconstrained optimization problems and is robust and efficient for many applications.


Key Idea

  1. Avoid Explicit Hessian Computation:
    Instead of directly computing the Hessian matrix $\nabla^2 f(x_k)$, BFGS iteratively builds an approximation $H_k$ of its inverse (contrasted with the full Newton step just after this list).

  2. Quasi-Newton Update:
    The inverse Hessian approximation is updated based on gradient differences and step sizes between iterations.

  3. Line Search:
    BFGS typically incorporates a line search to determine a suitable step size for each iteration.
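To make points 1 and 2 concrete, the only difference between a full Newton step and a BFGS step is where the curvature information comes from:

$$
p_k^{\text{Newton}} = -\left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k),
\qquad
p_k^{\text{BFGS}} = -H_k\, \nabla f(x_k),
$$

where $H_k$ is built from gradient observations alone.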


Algorithm

  1. Initialization:

    • Start with an initial guess $x_0$.
    • Initialize the inverse Hessian approximation as $H_0 = I$ (the identity matrix).
    • Compute the initial gradient $g_0 = \nabla f(x_0)$.
  2. Iterative Updates:
    For $k = 0, 1, 2, \dots$:

    • Compute the search direction:
      $p_k = -H_k \nabla f(x_k)$

    • Perform a line search to find a step size $\alpha_k$ that sufficiently reduces $f$:
      $x_{k+1} = x_k + \alpha_k p_k$

    • Compute the step vector and gradient difference:
      $s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$

    • Update the inverse Hessian approximation $H_{k+1}$:
      $H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top, \qquad \rho_k = \dfrac{1}{y_k^\top s_k}$

  3. Convergence:
    Stop when the gradient norm $\|\nabla f(x_k)\|$ falls below a predefined threshold (a minimal code sketch of this loop appears after this list).
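A minimal NumPy sketch of this loop, assuming a simple Armijo backtracking line search; the function name bfgs_minimize, the Armijo constants, and the tolerances are illustrative choices rather than a standard interface:

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=100):
    """Minimal BFGS sketch following the steps above."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    I = np.eye(n)
    H = I.copy()                               # H_0 = I
    g = grad(x)                                # g_0 = grad f(x_0)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:            # convergence: gradient norm below threshold
            break
        p = -H @ g                             # search direction p_k = -H_k g_k
        # Backtracking line search enforcing the Armijo sufficient-decrease condition.
        alpha, c, shrink = 1.0, 1e-4, 0.5
        while f(x + alpha * p) > f(x) + c * alpha * (g @ p) and alpha > 1e-12:
            alpha *= shrink
        x_new = x + alpha * p                  # x_{k+1} = x_k + alpha_k p_k
        g_new = grad(x_new)
        s = x_new - x                          # s_k: step vector
        y = g_new - g                          # y_k: gradient difference
        sy = float(y @ s)
        if sy > 1e-10:                         # curvature condition keeps H positive definite
            rho = 1.0 / sy
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x
```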


Key Formulas

  1. Search Direction:

     $p_k = -H_k \nabla f(x_k)$

  2. Hessian Inverse Update Rule:

     $H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top, \qquad \rho_k = \dfrac{1}{y_k^\top s_k}$

    where:

    • $s_k = x_{k+1} - x_k$: Change in position.
    • $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$: Change in gradient.
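A useful sanity check on this update: with $\rho_k = 1/(y_k^\top s_k)$, the new approximation automatically satisfies the secant condition $H_{k+1} y_k = s_k$, i.e. it reproduces the most recently observed relationship between a step and the change in gradient:

$$
H_{k+1} y_k
= \left(I - \rho_k s_k y_k^\top\right) H_k \underbrace{\left(y_k - \rho_k y_k\, s_k^\top y_k\right)}_{=\,0}
+ \rho_k s_k\, s_k^\top y_k
= s_k .
$$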

Advantages of BFGS

  1. Efficiency:

    • Avoids the cost of explicit Hessian computation and inversion.
    • Updates the inverse Hessian approximation in $O(n^2)$ operations per iteration.
  2. Robustness:

    • Converges quickly for well-behaved functions.
    • Suitable for both small and medium-sized problems.
  3. Superlinear Convergence:

    • Converges faster than gradient descent near the minimum due to better curvature approximation.

Limitations of BFGS

  1. Memory Requirements:

    • Storing and updating the dense $n \times n$ inverse Hessian approximation requires $O(n^2)$ memory, which can be prohibitive for very high-dimensional problems.
  2. Line Search Dependency:

    • Relies on an effective line search to ensure sufficient decrease in the objective function.
  3. Ill-Conditioned Problems:

    • May struggle when the Hessian is ill-conditioned, though modifications like damped BFGS help mitigate this (see the note after this list).
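The last two limitations are related: the update only preserves a positive definite $H_{k+1}$ when the curvature condition $y_k^\top s_k > 0$ holds, and a line search that enforces the Wolfe curvature condition guarantees exactly this:

$$
\nabla f(x_k + \alpha_k p_k)^\top p_k \;\ge\; c_2\, \nabla f(x_k)^\top p_k,
\quad 0 < c_2 < 1
\;\;\Longrightarrow\;\;
y_k^\top s_k \;\ge\; (c_2 - 1)\,\alpha_k\, \nabla f(x_k)^\top p_k \;>\; 0 .
$$

When the line search cannot guarantee this, the update is typically skipped or damped, which is what damped BFGS formalizes.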

Variants

  1. Limited-Memory BFGS (L-BFGS):

    • Stores only the last $m$ vector pairs $(s_i, y_i)$ rather than the full matrix, reducing memory requirements to $O(mn)$.
    • Suitable for high-dimensional problems (e.g., training machine learning models).
  2. Damped BFGS:

    • Modifies the update rule to ensure positive definiteness of $H_{k+1}$, improving stability.
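Both BFGS and its limited-memory variant are available off the shelf; for instance, SciPy's scipy.optimize.minimize exposes them as method="BFGS" and method="L-BFGS-B". A small usage sketch, with an arbitrary convex quadratic chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective (not from the text): a simple convex quadratic.
def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return np.array([2.0 * x[0], 20.0 * x[1]])

x0 = np.array([3.0, -4.0])

res_bfgs = minimize(f, x0, jac=grad, method="BFGS")       # dense inverse-Hessian approximation
res_lbfgs = minimize(f, x0, jac=grad, method="L-BFGS-B")  # limited-memory variant

print(res_bfgs.x, res_bfgs.nit)    # solution and iteration count
print(res_lbfgs.x, res_lbfgs.nit)
```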

Example

Problem: Minimize a smooth function $f(x)$ starting from an initial guess $x_0$ (the first iteration is shown symbolically below; a runnable version with an assumed concrete objective follows this list).

  1. Gradient:

     $g_0 = \nabla f(x_0)$

  2. Initialization:

    • Start with $x_0$.
    • Set $H_0 = I$.
  3. First Iteration:

    • Compute search direction:
      $p_0 = -H_0 g_0 = -g_0$

    • Perform line search to find $\alpha_0$:
      choose $\alpha_0$ so that $f(x_0 + \alpha_0 p_0)$ decreases sufficiently

    • Update position:
      $x_1 = x_0 + \alpha_0 p_0$

  4. Update $H_0$ to $H_1$ using $s_0 = x_1 - x_0$ and $y_0 = \nabla f(x_1) - \nabla f(x_0)$.
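Since no concrete objective is specified above, here is one way to run the bfgs_minimize sketch from the Algorithm section end to end; the Rosenbrock function and the starting point are assumed purely for illustration:

```python
import numpy as np

# Assumed test objective (illustrative): the two-dimensional Rosenbrock function.
def f(v):
    x, y = v
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def grad(v):
    x, y = v
    return np.array([
        -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2),
        200.0 * (y - x ** 2),
    ])

x0 = np.array([-1.2, 1.0])             # a common starting point for this function
x_star = bfgs_minimize(f, grad, x0, tol=1e-8, max_iter=200)
print(x_star)                          # should approach the minimizer [1.0, 1.0]
```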


Comparison with Other Methods

| Method           | Hessian Use      | Memory Cost | Convergence Speed       | Use Case                  |
|------------------|------------------|-------------|-------------------------|---------------------------|
| Gradient Descent | None             | Low         | Linear                  | Simple/large problems     |
| Newton's Method  | Explicit Hessian | High        | Quadratic (near optima) | Small problems            |
| BFGS             | Approx. Hessian  | Medium      | Superlinear             | Medium-sized problems     |
| L-BFGS           | Approx. Hessian  | Low         | Superlinear             | High-dimensional problems |

Conclusion

The BFGS algorithm is a powerful and versatile tool for unconstrained optimization. It balances the accuracy of second-order methods with the efficiency of avoiding explicit Hessian computation, making it a cornerstone in optimization techniques for machine learning and scientific computing.