Setting

Consider the problem of minimizing a convex function $f$ over a domain $\mathcal{X} \subseteq \mathbb{R}^d$:

$$\min_{x \in \mathcal{X}} f(x), \qquad f(x) = \mathbb{E}_{\xi}\big[F(x, \xi)\big],$$

where $f : \mathcal{X} \to \mathbb{R}$, and $F(x, \xi)$ is a stochastic estimate of the objective function.

Stochastic Gradient Descent (SGD) updates the parameter as:

$$x_{t+1} = x_t - \eta_t \nabla F(x_t, \xi_t),$$

where $\eta_t$ is the learning rate and $\xi_t$ is a sample from the data distribution.
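
As a minimal sketch of this update, assuming an illustrative least-squares estimate $F(x, \xi) = \tfrac{1}{2}(a^\top x - b)^2$ with samples $\xi = (a, b)$ (this loss and the function name are assumptions made for illustration, not part of the setting above):

```python
import numpy as np

def sgd_step(x, sample, lr):
    """One SGD update: x_{t+1} = x_t - eta_t * grad F(x_t, xi_t).

    Assumes the illustrative least-squares loss F(x, (a, b)) = 0.5 * (a @ x - b)**2,
    whose stochastic gradient is (a @ x - b) * a.
    """
    a, b = sample
    grad = (a @ x - b) * a   # stochastic gradient estimate at the current iterate
    return x - lr * grad     # step along the negative gradient with step size eta_t
```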


Convergence Theorem (Convex Case)

If $f$ is convex and Lipschitz continuous with constant $L$, and the stochastic gradients have bounded variance $\sigma^2$, then SGD satisfies the following convergence guarantee under appropriate step sizes $\eta_t$:

  1. Step Size (Learning Rate):
    Choose a diminishing learning rate $\eta_t = \frac{c}{\sqrt{t}}$, where $c > 0$ is a constant.

  2. Convergence Result:
    The expected function value of the averaged iterate converges as:

    $$\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \le O\!\left(\frac{1}{\sqrt{T}}\right),$$

    where $\bar{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ is the average iterate and $x^*$ is the optimal solution (a runnable sketch of this scheme follows).
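
A small runnable sketch of the convex-case recipe, assuming the same illustrative least-squares loss; the sampler `draw`, the constant `c`, and the horizon `T` are hypothetical choices, not prescribed by the theorem:

```python
import numpy as np

def sgd_convex(draw, dim, c=0.1, T=10_000, rng=None):
    """Averaged SGD with the diminishing step size eta_t = c / sqrt(t)."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(dim)
    x_bar = np.zeros(dim)                 # running average of the iterates
    for t in range(1, T + 1):
        a, b = draw(rng)
        grad = (a @ x - b) * a            # stochastic gradient of 0.5*(a @ x - b)**2
        x = x - (c / np.sqrt(t)) * grad   # diminishing learning rate
        x_bar += (x - x_bar) / t          # online mean: x_bar_T = (1/T) * sum_t x_t
    return x_bar

# Usage: noisy linear data with a known optimum x* = [1, -2].
def draw(rng, x_star=np.array([1.0, -2.0])):
    a = rng.standard_normal(2)
    return a, a @ x_star + 0.1 * rng.standard_normal()

x_bar = sgd_convex(draw, dim=2)           # should approach x* as T grows
```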


Key Assumptions

  1. Convexity:
    $f$ is convex, i.e., for all $x, y \in \mathcal{X}$:
    $$f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle.$$

  2. Lipschitz Continuity:
    $|f(x) - f(y)| \le L\,\|x - y\|$ for all $x, y \in \mathcal{X}$ (equivalently, $\|\nabla f(x)\| \le L$).

  3. Bounded Variance:
    The stochastic gradients have bounded variance:
    $$\mathbb{E}\big[\|\nabla F(x, \xi) - \nabla f(x)\|^2\big] \le \sigma^2.$$
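
To make assumption 3 concrete, one can estimate the gradient noise numerically at a fixed point; the sampler and loss below are the same illustrative least-squares setup as in the earlier sketches, and the sample size is an arbitrary choice:

```python
import numpy as np

def empirical_grad_variance(x, draw, n=10_000, rng=None):
    """Monte Carlo estimate of E[ ||grad F(x, xi) - grad f(x)||^2 ] at a point x."""
    rng = rng or np.random.default_rng(1)
    grads = np.array([(a @ x - b) * a for a, b in (draw(rng) for _ in range(n))])
    mean_grad = grads.mean(axis=0)        # sample average approximates grad f(x)
    return np.mean(np.sum((grads - mean_grad) ** 2, axis=1))
```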


Convergence Theorem (Strongly Convex Case)

If $f$ is strongly convex with constant $\mu$ and the same assumptions on Lipschitz continuity and variance hold, SGD achieves a faster convergence rate:

  1. Step Size:
    Use a diminishing step size $\eta_t = \frac{1}{\mu t}$, with $\mu > 0$ the strong convexity constant.

  2. Convergence Result:
    The expected function value satisfies:

    $$\mathbb{E}\big[f(x_T)\big] - f(x^*) \le \frac{C}{T},$$

    where $C$ depends on the initial distance to the optimum, the Lipschitz constant $L$, and the variance $\sigma^2$ (a short sketch of this schedule follows the list).
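
A short sketch of the strongly convex schedule, using an $\ell_2$-regularized least-squares loss so that the objective is $\mu$-strongly convex; the regularizer, sampler, and default constants are assumptions made for the example:

```python
import numpy as np

def sgd_strongly_convex(draw, dim, mu=0.1, T=10_000, rng=None):
    """SGD with eta_t = 1/(mu*t) on F(x, (a, b)) = 0.5*(a @ x - b)**2 + 0.5*mu*||x||^2."""
    rng = rng or np.random.default_rng(2)
    x = np.zeros(dim)
    for t in range(1, T + 1):
        a, b = draw(rng)
        grad = (a @ x - b) * a + mu * x   # stochastic gradient of the regularized loss
        x = x - grad / (mu * t)           # diminishing step size 1/(mu * t)
    return x
```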


Interpretation

  1. For convex functions, the convergence is sublinear ($O(1/\sqrt{T})$).
  2. For strongly convex functions, the convergence improves to $O(1/T)$ in expectation (and $\mathbb{E}\|x_T - x^*\|^2 = O(1/T)$ for the iterates); the gradient noise prevents a truly linear rate.
  3. The diminishing learning rate balances convergence speed with noise suppression from the stochastic gradient, as the bound below makes explicit.
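
One standard way to see point 3 (quoted here as a common textbook bound for SGD with a constant step $\eta$ and bounded second moment $\mathbb{E}\|g_t\|^2 \le G^2$, not something derived in this note) is the decomposition

$$\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \;\le\; \underbrace{\frac{\|x_1 - x^*\|^2}{2\eta T}}_{\text{optimization term}} \;+\; \underbrace{\frac{\eta G^2}{2}}_{\text{noise floor}}.$$

The first term shrinks with more iterations, while the second is a floor proportional to the step size; a diminishing schedule $\eta_t$ drives the floor to zero at the cost of slower initial progress, which is exactly the trade-off referenced above.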