- categories: Data Science, Theorem, Optimization
Setting
Consider the problem of minimizing a convex function over a domain $\Theta \subseteq \mathbb{R}^d$:
$$\min_{\theta \in \Theta} f(\theta) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[F(\theta; \xi)\right],$$
where $\theta \in \mathbb{R}^d$, and $F(\theta; \xi)$ is a stochastic estimate of the objective function.
Stochastic Gradient Descent (SGD) updates the parameter as:
$$\theta_{t+1} = \theta_t - \eta_t \nabla F(\theta_t; \xi_t),$$
where $\eta_t$ is the learning rate and $\xi_t$ is a sample from the data distribution.
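As a concrete illustration of this update rule, here is a minimal NumPy sketch on a synthetic least-squares objective; the objective, the data-generating function `sample`, and the step-size constant are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_true = rng.normal(size=d)          # hypothetical ground-truth parameter

def sample():
    """Draw one data point (x, y); stands in for xi ~ D."""
    x = rng.normal(size=d)
    y = x @ theta_true + 0.1 * rng.normal()
    return x, y

def stochastic_grad(theta, x, y):
    """Gradient of the per-sample loss F(theta; xi) = 0.5 * (x^T theta - y)^2."""
    return (x @ theta - y) * x

theta = np.zeros(d)
for t in range(1, 10_001):
    x, y = sample()                      # xi_t: sample from the data distribution
    eta = 0.1 / np.sqrt(t)               # eta_t: diminishing learning rate (illustrative choice)
    theta = theta - eta * stochastic_grad(theta, x, y)   # SGD update

print("distance to theta_true:", np.linalg.norm(theta - theta_true))
```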
Convergence Theorem (Convex Case)
If $f$ is convex and Lipschitz continuous with constant $L$, and the stochastic gradients have bounded variance $\sigma^2$, then SGD satisfies the following convergence guarantee under appropriate step sizes $\eta_t$:
- Step Size (Learning Rate): Choose a diminishing learning rate $\eta_t = \frac{c}{\sqrt{t}}$, where $c > 0$ is a constant.
- Convergence Result: The expected function value converges as
  $$\mathbb{E}\left[f(\bar{\theta}_T)\right] - f(\theta^*) \leq O\!\left(\frac{1}{\sqrt{T}}\right),$$
  where $\bar{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ is the average iterate, and $\theta^*$ is the optimal solution.
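The sketch below illustrates this guarantee on an absolute-loss regression problem, chosen only because it is convex and Lipschitz but not strongly convex; the data model, the constant $c$, and the horizon $T$ are assumptions made for the example. The averaged iterate $\bar{\theta}_T$ is the quantity the bound above refers to.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
theta_star = rng.normal(size=d)          # optimum of the illustrative problem

def sample():
    x = rng.normal(size=d)
    return x, x @ theta_star             # noiseless targets, so f(theta_star) = 0

def stochastic_subgrad(theta, x, y):
    """Subgradient of the absolute loss |x^T theta - y| (convex, Lipschitz)."""
    return np.sign(x @ theta - y) * x

T, c = 50_000, 0.5                       # horizon and step-size constant (illustrative)
theta = np.zeros(d)
theta_bar = np.zeros(d)                  # running average: bar{theta}_t = (1/t) * sum of iterates
for t in range(1, T + 1):
    x, y = sample()
    eta = c / np.sqrt(t)                 # eta_t = c / sqrt(t)
    theta = theta - eta * stochastic_subgrad(theta, x, y)
    theta_bar += (theta - theta_bar) / t

# Monte Carlo estimate of f(bar{theta}_T) - f(theta_star); expect it to shrink roughly like 1/sqrt(T).
gap = np.mean([abs(x @ theta_bar - y) for x, y in (sample() for _ in range(5_000))])
print("suboptimality of averaged iterate:", gap)
```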
Key Assumptions
- Convexity: $f$ is convex, i.e., for all $\theta_1, \theta_2 \in \Theta$:
  $$f(\theta_1) \geq f(\theta_2) + \nabla f(\theta_2)^\top (\theta_1 - \theta_2).$$
- Lipschitz Continuity: $\|\nabla f(\theta)\| \leq L$ for all $\theta \in \Theta$.
- Bounded Variance: The stochastic gradients have bounded variance:
  $$\mathbb{E}\left[\|\nabla F(\theta; \xi) - \nabla f(\theta)\|^2\right] \leq \sigma^2.$$
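These assumptions can be spot-checked numerically for a given model. The sketch below does so for an illustrative logistic-loss objective (not taken from the text above): it evaluates the first-order convexity inequality at two random points, the norm of the full gradient, and the empirical variance of the stochastic gradients.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Illustrative objective: logistic loss f(theta) = E[log(1 + exp(-y * x^T theta))]
def sample():
    x = rng.uniform(-1.0, 1.0, size=d)   # bounded features keep gradients bounded
    y = 1.0 if x[0] + 0.1 * rng.normal() > 0 else -1.0
    return x, y

def stochastic_grad(theta, x, y):
    return -y * x / (1.0 + np.exp(y * (x @ theta)))

def full_grad(theta, n=20_000):
    return np.mean([stochastic_grad(theta, *sample()) for _ in range(n)], axis=0)

def f(theta, n=20_000):
    return np.mean([np.log1p(np.exp(-y * (x @ theta))) for x, y in (sample() for _ in range(n))])

theta1, theta2 = rng.normal(size=d), rng.normal(size=d)

# Convexity: f(theta1) - [f(theta2) + grad f(theta2)^T (theta1 - theta2)] should be >= 0.
print("convexity gap:", f(theta1) - (f(theta2) + full_grad(theta2) @ (theta1 - theta2)))

# Lipschitz continuity: ||grad f(theta)|| should stay below some constant L.
print("gradient norm:", np.linalg.norm(full_grad(theta1)))

# Bounded variance: E||grad F(theta; xi) - grad f(theta)||^2 should stay below some sigma^2.
g_bar = full_grad(theta1)
var = np.mean([np.linalg.norm(stochastic_grad(theta1, *sample()) - g_bar) ** 2
               for _ in range(20_000)])
print("gradient variance:", var)
```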
Convergence Theorem (Strongly Convex Case)
If $f$ is strongly convex with constant $\mu$ and the same assumptions on Lipschitz continuity and variance hold, SGD achieves a faster convergence rate:
- Step Size: Use a diminishing step size $\eta_t = \frac{1}{\mu t}$, with $\mu > 0$ the strong convexity constant.
- Convergence Result: The expected function value satisfies
  $$\mathbb{E}\left[f(\theta_T)\right] - f(\theta^*) \leq \frac{C}{T},$$
  where $C$ depends on the initial distance to the optimum, the Lipschitz constant $L$, and the variance $\sigma^2$.
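A minimal sketch of the strongly convex case, assuming a simple $\mu$-strongly convex quadratic with additive Gaussian gradient noise (both choices are illustrative): with $\eta_t = 1/(\mu t)$, the final suboptimality gap should shrink roughly like $C/T$ as the horizon grows.

```python
import numpy as np

rng = np.random.default_rng(3)
d, mu, sigma = 5, 1.0, 1.0               # dimension, strong convexity, noise level (illustrative)
theta_star = rng.normal(size=d)

def f(theta):
    # f(theta) = (mu/2) * ||theta - theta_star||^2, a mu-strongly convex quadratic
    return 0.5 * mu * np.linalg.norm(theta - theta_star) ** 2

def stochastic_grad(theta):
    # true gradient plus zero-mean noise with per-coordinate variance sigma^2
    return mu * (theta - theta_star) + sigma * rng.normal(size=d)

def run_sgd(T):
    theta = np.zeros(d)
    for t in range(1, T + 1):
        eta = 1.0 / (mu * t)             # eta_t = 1 / (mu * t)
        theta = theta - eta * stochastic_grad(theta)
    return f(theta)                      # equals f(theta_T) - f(theta_star), since f(theta_star) = 0

for T in (1_000, 10_000, 100_000):
    print(f"T = {T:>7d}   gap = {run_sgd(T):.6f}")
```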
Interpretation
- For convex functions, the convergence is sublinear ($O(1/\sqrt{T})$).
- For strongly convex functions, the convergence improves to $O(1/T)$ in expectation ($\mathbb{E}\|\theta_T - \theta^*\|^2 = O(1/T)$ for the iterates).
- The diminishing learning rate balances convergence speed against suppression of the noise introduced by the stochastic gradients.
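To illustrate the last point, the sketch below compares a constant step size with the diminishing schedule $\eta_t = 1/(\mu t)$ on the same kind of noisy quadratic used above; the specific constants are arbitrary. The constant step converges quickly at first but stalls at a noise floor, while the diminishing step keeps reducing the gap.

```python
import numpy as np

rng = np.random.default_rng(4)
d, mu, sigma, T = 5, 1.0, 1.0, 50_000    # illustrative constants
theta_star = rng.normal(size=d)

def stochastic_grad(theta):
    # gradient of (mu/2) * ||theta - theta_star||^2 plus zero-mean noise
    return mu * (theta - theta_star) + sigma * rng.normal(size=d)

def final_gap(step_rule):
    theta = np.zeros(d)
    for t in range(1, T + 1):
        theta = theta - step_rule(t) * stochastic_grad(theta)
    return 0.5 * mu * np.linalg.norm(theta - theta_star) ** 2

print("constant    eta = 0.1      :", final_gap(lambda t: 0.1))             # stalls at a noise floor
print("diminishing eta = 1/(mu*t) :", final_gap(lambda t: 1.0 / (mu * t)))  # keeps shrinking, ~ C/T
```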