Stochastic Gradient Descent (SGD)

Definition:
Stochastic Gradient Descent is an iterative optimization algorithm used to minimize a loss function. Unlike batch gradient descent, which uses the entire dataset to compute the gradient, SGD updates the model parameters using only a single randomly chosen sample (or a small batch of samples) at each step.

Objective:
Minimize a loss function $L(\theta)$, where $\theta$ represents the parameters of the model, using the update rule:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t; x_i, y_i)$$

where:

  • $\eta$ is the learning rate (step size).
  • $\nabla L(\theta_t; x_i, y_i)$ is the gradient of the loss with respect to $\theta$, computed on the randomly chosen sample $(x_i, y_i)$.

Key Difference in SGD

In batch gradient descent, the gradient is computed over the entire dataset:

$$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla L(\theta; x_i, y_i)$$

where $N$ is the total number of samples.

In SGD, the gradient is computed for a single sample (or a small mini-batch):

$$\nabla L(\theta) \approx \nabla L(\theta; x_i, y_i)$$

where $i$ is randomly selected from $\{1, \dots, N\}$.
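
As a rough illustration of this difference (a sketch, not from the original text; the per-sample gradient below assumes a squared-error linear model), the batch gradient averages over every sample while the SGD estimate uses one randomly drawn sample:

```python
import numpy as np

def grad_loss(theta, x, y):
    # Per-sample gradient of the squared error (y - theta @ x)^2,
    # i.e. -2 * (y - theta @ x) * x  (assumed model, illustration only).
    return -2.0 * (y - theta @ x) * x

def batch_gradient(theta, X, Y):
    # Batch gradient descent: average per-sample gradients over the whole dataset.
    return np.mean([grad_loss(theta, x, y) for x, y in zip(X, Y)], axis=0)

def sgd_gradient(theta, X, Y, rng):
    # SGD: gradient of a single randomly selected sample.
    i = rng.integers(len(X))
    return grad_loss(theta, X[i], Y[i])
```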


Algorithm

  1. Initialize: Set initial parameters $\theta_0$ randomly. Choose a learning rate $\eta$.

  2. Repeat (for a fixed number of epochs or until convergence):

    • Shuffle the dataset.
    • For each sample $(x_i, y_i)$ (or mini-batch $B$):
      • Compute the gradient of the loss for $(x_i, y_i)$:

        $$g_t = \nabla L(\theta_t; x_i, y_i)$$

        or for a mini-batch:

        $$g_t = \frac{1}{|B|} \sum_{(x_j, y_j) \in B} \nabla L(\theta_t; x_j, y_j)$$

      • Update the parameters:

        $$\theta_{t+1} = \theta_t - \eta \, g_t$$

  3. Terminate: Stop when convergence criteria are met (e.g., no significant improvement in $L(\theta)$). A minimal code sketch of this loop follows below.
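
Below is a minimal sketch of this loop on a synthetic linear-regression problem with a mean-squared-error loss; the dataset, learning rate, batch size, and epoch count are all illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (illustrative only).
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)     # 1. Initialize parameters
eta = 0.01              #    and choose a learning rate
batch_size = 32

for epoch in range(20):                       # 2. Repeat for a fixed number of epochs
    perm = rng.permutation(len(X))            #    shuffle the dataset
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        residual = yb - Xb @ theta
        grad = -2.0 * Xb.T @ residual / len(idx)   # mini-batch MSE gradient
        theta -= eta * grad                        # parameter update

print(theta)  # should end up close to true_theta
```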


Variants of SGD

  1. Mini-Batch Gradient Descent:
    Combines the benefits of both batch and stochastic methods by computing the gradient on small batches of data.

    • Batch size typically ranges from 32 to 512.
  2. SGD with Momentum:
    Adds a “momentum” term to the update to accelerate convergence and reduce oscillations:

    $$v_{t+1} = \gamma v_t + \eta \, \nabla L(\theta_t)$$
    $$\theta_{t+1} = \theta_t - v_{t+1}$$

    where $\gamma$ is the momentum coefficient (e.g., 0.9); see the sketch after this list.

  3. Adaptive Learning Rate Methods:

    • Adagrad: Adjusts the learning rate based on past gradients.
    • RMSprop: Scales learning rates by a moving average of gradient magnitudes.
    • Adam: Combines momentum and adaptive learning rates for robust updates.
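
As a minimal sketch of the momentum update from item 2 above (the function name and the example values are illustrative), one step computes the velocity and then moves the parameters against it:

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, eta=0.01, gamma=0.9):
    # v_{t+1} = gamma * v_t + eta * grad ;  theta_{t+1} = theta_t - v_{t+1}
    velocity = gamma * velocity + eta * grad
    theta = theta - velocity
    return theta, velocity

# One step with arbitrary illustrative values.
theta = np.array([1.0, 1.0])
velocity = np.zeros_like(theta)
grad = np.array([0.5, -0.2])
theta, velocity = sgd_momentum_step(theta, velocity, grad)
```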

Advantages

  1. Efficiency:
    Faster updates compared to batch gradient descent, especially for large datasets.

  2. Online Learning:
    Can update the model as new data arrives, making it suitable for streaming or online learning.

  3. Scalability:
    Requires less memory and computational resources since only a subset of data is used for each update.

  4. Escapes Local Minima:
    The noisy updates of SGD can help escape shallow local minima in non-convex optimization.


Disadvantages

  1. Noisy Convergence:
    The randomness in gradient estimates can cause oscillations around the minimum.

  2. Sensitivity to Learning Rate:
    A poorly chosen learning rate can lead to divergence or slow convergence.

  3. Tuning Challenges:
    Requires careful tuning of hyperparameters like learning rate, mini-batch size, and momentum.

Example

Consider minimizing the mean squared error $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \theta^\top x_i)^2$ for a linear regression model.

  1. Gradient for Sample $i$:

     $$\nabla L(\theta; x_i, y_i) = -2 \, (y_i - \theta^\top x_i) \, x_i$$

  2. SGD Update:

     $$\theta_{t+1} = \theta_t + 2 \eta \, (y_i - \theta^\top x_i) \, x_i$$
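
A small numeric check of these two formulas, using made-up values for the sample and parameters:

```python
import numpy as np

theta = np.array([0.5, -0.5])   # current parameters (illustrative)
x_i = np.array([1.0, 2.0])      # one sample's features
y_i = 3.0                       # its target
eta = 0.1

residual = y_i - theta @ x_i            # 3.0 - (-0.5) = 3.5
grad = -2.0 * residual * x_i            # [-7.0, -14.0]
theta_next = theta - eta * grad         # [1.2, 0.9]
print(theta_next)
```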


Tips for Effective Use

  1. Learning Rate Scheduling:
    Reduce the learning rate over time to stabilize convergence (e.g., exponential decay); a short sketch follows after this list.

  2. Batch Normalization:
    Normalize mini-batches to improve stability and speed up convergence.

  3. Regularization:
    Add penalties like $L_1$ or $L_2$ regularization to avoid overfitting.

  4. Early Stopping:
    Monitor the validation loss and stop training when it stops improving.
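
As an example of Tip 1, the following is a minimal sketch of an exponential-decay schedule; the starting rate and decay factor are illustrative choices, and other schedules (step decay, cosine decay) are equally common:

```python
def exponential_decay(eta0, decay_rate, epoch):
    # eta_t = eta0 * decay_rate ** epoch  (one common schedule among several)
    return eta0 * decay_rate ** epoch

# Example: start at 0.1 and shrink the rate by 5% each epoch.
for epoch in range(5):
    print(epoch, round(exponential_decay(0.1, 0.95, epoch), 5))
```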