- categories: Data Science, Theorem, Optimization
Setting
Consider the problem of minimizing a convex function over a domain $\Theta \subseteq \mathbb{R}^d$:
$$\min_{\theta \in \Theta} f(\theta) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[F(\theta; \xi)\right],$$
where $\theta \in \mathbb{R}^d$, and $F(\theta; \xi)$ is a stochastic estimate of the objective function.
Stochastic Gradient Descent (SGD) updates the parameter as:
$$\theta_{t+1} = \theta_t - \eta_t \nabla F(\theta_t; \xi_t),$$
where $\eta_t$ is the learning rate and $\xi_t$ is a sample from the data distribution.
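As a concrete illustration of this update rule, here is a minimal NumPy sketch on a synthetic least-squares objective; the objective, the data-generating function `sample`, and the step-size constant are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_true = rng.normal(size=d)          # hypothetical ground-truth parameter

def sample():
    """Draw one data point (x, y); stands in for xi ~ D."""
    x = rng.normal(size=d)
    y = x @ theta_true + 0.1 * rng.normal()
    return x, y

def stochastic_grad(theta, x, y):
    """Gradient of the per-sample loss F(theta; xi) = 0.5 * (x^T theta - y)^2."""
    return (x @ theta - y) * x

theta = np.zeros(d)
for t in range(1, 10_001):
    x, y = sample()                      # xi_t: sample from the data distribution
    eta = 0.1 / np.sqrt(t)               # eta_t: diminishing learning rate (illustrative choice)
    theta = theta - eta * stochastic_grad(theta, x, y)   # SGD update

print("distance to theta_true:", np.linalg.norm(theta - theta_true))
```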
Convergence Theorem (Convex Case)
If $f$ is convex and Lipschitz continuous with constant $L$, and the stochastic gradients have bounded variance $\sigma^2$, then SGD satisfies the following convergence guarantee under appropriate step sizes $\eta_t$:
- Step Size (Learning Rate): Choose a diminishing learning rate $\eta_t = \frac{c}{\sqrt{t}}$, where $c > 0$ is a constant.
- Convergence Result: The expected function value converges as
  $$\mathbb{E}\left[f(\bar{\theta}_T)\right] - f(\theta^*) \leq O\!\left(\frac{1}{\sqrt{T}}\right),$$
  where $\bar{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ is the average iterate, and $\theta^*$ is the optimal solution.
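The sketch below illustrates this guarantee on an absolute-loss regression problem, chosen only because it is convex and Lipschitz but not strongly convex; the data model, the constant $c$, and the horizon $T$ are assumptions made for the example. The averaged iterate $\bar{\theta}_T$ is the quantity the bound above refers to.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
theta_star = rng.normal(size=d)          # optimum of the illustrative problem

def sample():
    x = rng.normal(size=d)
    return x, x @ theta_star             # noiseless targets, so f(theta_star) = 0

def stochastic_subgrad(theta, x, y):
    """Subgradient of the absolute loss |x^T theta - y| (convex, Lipschitz)."""
    return np.sign(x @ theta - y) * x

T, c = 50_000, 0.5                       # horizon and step-size constant (illustrative)
theta = np.zeros(d)
theta_bar = np.zeros(d)                  # running average: bar{theta}_t = (1/t) * sum of iterates
for t in range(1, T + 1):
    x, y = sample()
    eta = c / np.sqrt(t)                 # eta_t = c / sqrt(t)
    theta = theta - eta * stochastic_subgrad(theta, x, y)
    theta_bar += (theta - theta_bar) / t

# Monte Carlo estimate of f(bar{theta}_T) - f(theta_star); expect it to shrink roughly like 1/sqrt(T).
gap = np.mean([abs(x @ theta_bar - y) for x, y in (sample() for _ in range(5_000))])
print("suboptimality of averaged iterate:", gap)
```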
Key Assumptions
- Convexity: $f$ is convex, i.e., for all $\theta_1, \theta_2 \in \Theta$:
  $$f(\theta_1) \geq f(\theta_2) + \nabla f(\theta_2)^\top (\theta_1 - \theta_2).$$
- Lipschitz Continuity: $\|\nabla f(\theta)\| \leq L$ for all $\theta \in \Theta$.
- Bounded Variance: The stochastic gradients have bounded variance:
  $$\mathbb{E}\left[\|\nabla F(\theta; \xi) - \nabla f(\theta)\|^2\right] \leq \sigma^2.$$
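These assumptions can be spot-checked numerically for a given model. The sketch below does so for an illustrative logistic-loss objective (not taken from the text above): it evaluates the first-order convexity inequality at two random points, the norm of the full gradient, and the empirical variance of the stochastic gradients.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Illustrative objective: logistic loss f(theta) = E[log(1 + exp(-y * x^T theta))]
def sample():
    x = rng.uniform(-1.0, 1.0, size=d)   # bounded features keep gradients bounded
    y = 1.0 if x[0] + 0.1 * rng.normal() > 0 else -1.0
    return x, y

def stochastic_grad(theta, x, y):
    return -y * x / (1.0 + np.exp(y * (x @ theta)))

def full_grad(theta, n=20_000):
    return np.mean([stochastic_grad(theta, *sample()) for _ in range(n)], axis=0)

def f(theta, n=20_000):
    return np.mean([np.log1p(np.exp(-y * (x @ theta))) for x, y in (sample() for _ in range(n))])

theta1, theta2 = rng.normal(size=d), rng.normal(size=d)

# Convexity: f(theta1) - [f(theta2) + grad f(theta2)^T (theta1 - theta2)] should be >= 0.
print("convexity gap:", f(theta1) - (f(theta2) + full_grad(theta2) @ (theta1 - theta2)))

# Lipschitz continuity: ||grad f(theta)|| should stay below some constant L.
print("gradient norm:", np.linalg.norm(full_grad(theta1)))

# Bounded variance: E||grad F(theta; xi) - grad f(theta)||^2 should stay below some sigma^2.
g_bar = full_grad(theta1)
var = np.mean([np.linalg.norm(stochastic_grad(theta1, *sample()) - g_bar) ** 2
               for _ in range(20_000)])
print("gradient variance:", var)
```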
Convergence Theorem (Strongly Convex Case)
If $f$ is strongly convex with constant $\mu$ and the same assumptions on Lipschitz continuity and variance hold, SGD achieves a faster convergence rate:
- Step Size: Use a diminishing step size $\eta_t = \frac{1}{\mu t}$, with $\mu > 0$ the strong convexity constant.
- Convergence Result: The expected function value satisfies
  $$\mathbb{E}\left[f(\theta_T)\right] - f(\theta^*) \leq \frac{C}{T},$$
  where $C$ depends on the initial distance to the optimum, the Lipschitz constant $L$, and the variance $\sigma^2$.
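A minimal sketch of the strongly convex case, assuming a simple $\mu$-strongly convex quadratic with additive Gaussian gradient noise (both choices are illustrative): with $\eta_t = 1/(\mu t)$, the final suboptimality gap should shrink roughly like $C/T$ as the horizon grows.

```python
import numpy as np

rng = np.random.default_rng(3)
d, mu, sigma = 5, 1.0, 1.0               # dimension, strong convexity, noise level (illustrative)
theta_star = rng.normal(size=d)

def f(theta):
    # f(theta) = (mu/2) * ||theta - theta_star||^2, a mu-strongly convex quadratic
    return 0.5 * mu * np.linalg.norm(theta - theta_star) ** 2

def stochastic_grad(theta):
    # true gradient plus zero-mean noise with per-coordinate variance sigma^2
    return mu * (theta - theta_star) + sigma * rng.normal(size=d)

def run_sgd(T):
    theta = np.zeros(d)
    for t in range(1, T + 1):
        eta = 1.0 / (mu * t)             # eta_t = 1 / (mu * t)
        theta = theta - eta * stochastic_grad(theta)
    return f(theta)                      # equals f(theta_T) - f(theta_star), since f(theta_star) = 0

for T in (1_000, 10_000, 100_000):
    print(f"T = {T:>7d}   gap = {run_sgd(T):.6f}")
```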
Interpretation
- For convex functions, the convergence is sublinear ($O(1/\sqrt{T})$).
- For strongly convex functions, the convergence improves to $O(1/T)$ in expectation ($\mathbb{E}\|\theta_T - \theta^*\|^2 = O(1/T)$ for the iterates).
- The diminishing learning rate balances convergence speed against suppression of the noise introduced by the stochastic gradients.
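To illustrate the last point, the sketch below compares a constant step size with the diminishing schedule $\eta_t = 1/(\mu t)$ on the same kind of noisy quadratic used above; the specific constants are arbitrary. The constant step converges quickly at first but stalls at a noise floor, while the diminishing step keeps reducing the gap.

```python
import numpy as np

rng = np.random.default_rng(4)
d, mu, sigma, T = 5, 1.0, 1.0, 50_000    # illustrative constants
theta_star = rng.normal(size=d)

def stochastic_grad(theta):
    # gradient of (mu/2) * ||theta - theta_star||^2 plus zero-mean noise
    return mu * (theta - theta_star) + sigma * rng.normal(size=d)

def final_gap(step_rule):
    theta = np.zeros(d)
    for t in range(1, T + 1):
        theta = theta - step_rule(t) * stochastic_grad(theta)
    return 0.5 * mu * np.linalg.norm(theta - theta_star) ** 2

print("constant    eta = 0.1      :", final_gap(lambda t: 0.1))             # stalls at a noise floor
print("diminishing eta = 1/(mu*t) :", final_gap(lambda t: 1.0 / (mu * t)))  # keeps shrinking, ~ C/T
```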