Stochastic Gradient Descent (SGD)

Definition:
Stochastic Gradient Descent is an iterative optimization algorithm used to minimize a loss function. Unlike batch gradient descent, which uses the entire dataset to compute the gradient, SGD updates the model parameters using only a single randomly chosen sample (or a small batch of samples) at each step.

Objective:
Minimize a loss function $L(\theta)$, where $\theta$ represents the parameters of the model, using the update rule:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t; x_i, y_i)$$

where:

  • $\eta$ is the learning rate (step size).
  • $\nabla L(\theta_t; x_i, y_i)$ is the gradient of the loss with respect to $\theta$, computed on the randomly chosen sample $(x_i, y_i)$.

Key Difference in SGD

In batch gradient descent, the gradient is computed over the entire dataset:

$$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla L(\theta; x_i, y_i)$$

where $N$ is the total number of samples.

In SGD, the gradient is computed for a single sample (or a small mini-batch):

$$\nabla L(\theta) \approx \nabla L(\theta; x_i, y_i)$$

where $i$ is randomly selected from $\{1, \dots, N\}$.
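
As a rough illustration of this difference (a sketch, not from the original text; the per-sample gradient below assumes a squared-error linear model), the batch gradient averages over every sample while the SGD estimate uses one randomly drawn sample:

```python
import numpy as np

def grad_loss(theta, x, y):
    # Per-sample gradient of the squared error (y - theta @ x)^2,
    # i.e. -2 * (y - theta @ x) * x  (assumed model, illustration only).
    return -2.0 * (y - theta @ x) * x

def batch_gradient(theta, X, Y):
    # Batch gradient descent: average per-sample gradients over the whole dataset.
    return np.mean([grad_loss(theta, x, y) for x, y in zip(X, Y)], axis=0)

def sgd_gradient(theta, X, Y, rng):
    # SGD: gradient of a single randomly selected sample.
    i = rng.integers(len(X))
    return grad_loss(theta, X[i], Y[i])
```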


Algorithm

  1. Initialize: Set initial parameters $\theta_0$ randomly. Choose a learning rate $\eta$.

  2. Repeat (for a fixed number of epochs or until convergence):

    • Shuffle the dataset.
    • For each sample $(x_i, y_i)$ (or mini-batch $B$):
      • Compute the gradient of the loss for $(x_i, y_i)$:

        $$g_t = \nabla L(\theta_t; x_i, y_i)$$

        or for a mini-batch:

        $$g_t = \frac{1}{|B|} \sum_{(x_j, y_j) \in B} \nabla L(\theta_t; x_j, y_j)$$

      • Update the parameters:

        $$\theta_{t+1} = \theta_t - \eta \, g_t$$

  3. Terminate: Stop when convergence criteria are met (e.g., no significant improvement in $L(\theta)$). A minimal code sketch of this loop follows below.
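
Below is a minimal sketch of this loop on a synthetic linear-regression problem with a mean-squared-error loss; the dataset, learning rate, batch size, and epoch count are all illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative only).
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)     # 1. Initialize parameters
eta = 0.01              #    and choose a learning rate
batch_size = 32

for epoch in range(20):                       # 2. Repeat for a fixed number of epochs
    perm = rng.permutation(len(X))            #    shuffle the dataset
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        residual = yb - Xb @ theta
        grad = -2.0 * Xb.T @ residual / len(idx)   # mini-batch MSE gradient
        theta -= eta * grad                        # parameter update

print(theta)  # should end up close to true_theta
```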


Variants of SGD

  1. Mini-Batch Gradient Descent:
    Combines the benefits of both batch and stochastic methods by computing the gradient on small batches of data.

    • Batch size typically ranges from 32 to 512.
  2. SGD with Momentum:
    Adds a “momentum” term to the update to accelerate convergence and reduce oscillations:

    $$v_{t+1} = \gamma v_t + \eta \, \nabla L(\theta_t)$$
    $$\theta_{t+1} = \theta_t - v_{t+1}$$

    where $\gamma$ is the momentum coefficient (e.g., 0.9); see the sketch after this list.

  3. Adaptive Learning Rate Methods:

    • Adagrad: Adjusts the learning rate based on past gradients.
    • RMSprop: Scales learning rates by a moving average of gradient magnitudes.
    • Adam: Combines momentum and adaptive learning rates for robust updates.
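
As a minimal sketch of the momentum update from item 2 above (the function name and the example values are illustrative), one step computes the velocity and then moves the parameters against it:

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, eta=0.01, gamma=0.9):
    # v_{t+1} = gamma * v_t + eta * grad ;  theta_{t+1} = theta_t - v_{t+1}
    velocity = gamma * velocity + eta * grad
    theta = theta - velocity
    return theta, velocity

# One step with arbitrary illustrative values.
theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
grad = np.array([0.5, -0.2])
theta, velocity = sgd_momentum_step(theta, velocity, grad)
```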

Advantages

  1. Efficiency:
    Faster updates compared to batch gradient descent, especially for large datasets.

  2. Online Learning:
    Can update the model as new data arrives, making it suitable for streaming or online learning.

  3. Scalability:
    Requires less memory and computational resources since only a subset of data is used for each update.

  4. Escapes Local Minima:
    The noisy updates of SGD can help escape shallow local minima in non-convex optimization.


Disadvantages

  1. Noisy Convergence:
    The randomness in gradient estimates can cause oscillations around the minimum.

  2. Sensitivity to Learning Rate:
    A poorly chosen learning rate can lead to divergence or slow convergence.

  3. Tuning Challenges:
    Requires careful tuning of hyperparameters like learning rate, mini-batch size, and momentum.

Example

Consider minimizing the mean squared error $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \theta^\top x_i)^2$ for a linear regression model.

  1. Gradient for Sample $i$:

     $$\nabla L(\theta; x_i, y_i) = -2 \, (y_i - \theta^\top x_i) \, x_i$$

  2. SGD Update:

     $$\theta_{t+1} = \theta_t + 2 \eta \, (y_i - \theta^\top x_i) \, x_i$$
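
A small numeric check of these two formulas, using made-up values for the sample and parameters:

```python
import numpy as np

theta = np.array([0.5, -0.5])   # current parameters (illustrative)
x_i = np.array([1.0, 2.0])      # one sample's features
y_i = 3.0                       # its target
eta = 0.1

residual = y_i - theta @ x_i            # 3.0 - (-0.5) = 3.5
grad = -2.0 * residual * x_i            # [-7.0, -14.0]
theta_next = theta - eta * grad         # [1.2, 0.9]
print(theta_next)
```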


Tips for Effective Use

  1. Learning Rate Scheduling:
    Reduce the learning rate over time to stabilize convergence (e.g., exponential decay); a short sketch follows after this list.

  2. Batch Normalization:
    Normalize mini-batches to improve stability and speed up convergence.

  3. Regularization:
    Add penalties like $L_1$ or $L_2$ regularization to avoid overfitting.

  4. Early Stopping:
    Monitor the validation loss and stop training when it stops improving.
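
As an example of Tip 1, the following is a minimal sketch of an exponential-decay schedule; the starting rate and decay factor are illustrative choices, and other schedules (step decay, cosine decay) are equally common:

```python
def exponential_decay(eta0, decay_rate, epoch):
    # eta_t = eta0 * decay_rate ** epoch  (one common schedule among several)
    return eta0 * decay_rate ** epoch

# Example: start at 0.1 and shrink the rate by 5% each epoch.
for epoch in range(5):
    print(epoch, round(exponential_decay(0.1, 0.95, epoch), 5))
```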