- categories: Data Science, Algorithm
Stochastic Gradient Descent (SGD)
Definition:
Stochastic Gradient Descent is an iterative optimization algorithm used to minimize a loss function. Unlike batch gradient descent, which uses the entire dataset to compute the gradient, SGD updates the model parameters using only a single randomly chosen sample (or a small batch of samples) at each step.
Objective:
Minimize a loss function $J(\theta)$, where $\theta$ represents the parameters of the model, using the update rule:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$

where:
- $\eta$ is the learning rate (step size).
- $\nabla_\theta J(\theta)$ is the gradient of the loss with respect to $\theta$.
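For concreteness, here is a minimal one-step sketch of this update rule in NumPy; the parameter values, gradient, and learning rate below are made up for illustration.

```python
import numpy as np

theta = np.array([0.5, -1.2])   # current parameters (illustrative values)
grad = np.array([0.1, -0.4])    # gradient of the loss at theta (illustrative)
eta = 0.01                      # learning rate (step size)

# Update rule: theta <- theta - eta * gradient
theta = theta - eta * grad
print(theta)                    # [ 0.499 -1.196]
```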
Key Difference in SGD
In batch gradient descent, the gradient is computed over the entire dataset:

$$\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta J_i(\theta)$$

where $N$ is the total number of samples and $J_i(\theta)$ is the loss on sample $i$.
In SGD, the gradient is computed for a single sample (or a small mini-batch):

$$\nabla_\theta J(\theta) \approx \nabla_\theta J_i(\theta)$$

where $i$ is randomly selected from $\{1, \dots, N\}$.
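The difference can be sketched in a few lines of NumPy; the squared-error loss and toy data below are assumptions for illustration, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # N = 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])          # toy targets from a known linear model
theta = np.zeros(3)

def grad_single(theta, x_i, y_i):
    """Gradient of the squared error (y_i - theta . x_i)^2 for one sample."""
    return -2.0 * (y_i - x_i @ theta) * x_i

# Batch gradient descent: average the per-sample gradients over the whole dataset
batch_grad = np.mean([grad_single(theta, X[i], y[i]) for i in range(len(X))], axis=0)

# SGD: use the gradient of a single randomly selected sample as a noisy estimate
i = rng.integers(len(X))
sgd_grad = grad_single(theta, X[i], y[i])
```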
Algorithm
- Initialize: Set initial parameters $\theta$ randomly. Choose a learning rate $\eta$.
- Repeat (for a fixed number of epochs or until convergence):
  - Shuffle the dataset.
  - For each sample $(x_i, y_i)$ (or mini-batch $B$):
    - Compute the gradient of the loss for $(x_i, y_i)$ (or $B$):
      $$g = \nabla_\theta J_i(\theta)$$
      or, for a mini-batch,
      $$g = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta J_i(\theta)$$
    - Update the parameters:
      $$\theta \leftarrow \theta - \eta \, g$$
- Terminate: Stop when convergence criteria are met (e.g., no significant improvement in $J(\theta)$).
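A minimal NumPy sketch of this loop, assuming mini-batch updates and a squared-error gradient; the function names and hyperparameter values are illustrative, not from any particular library.

```python
import numpy as np

def sgd(X, y, grad_fn, eta=0.01, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD: shuffle each epoch, then update on every mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # shuffle the dataset
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            g = grad_fn(theta, X[batch], y[batch])      # mini-batch gradient
            theta = theta - eta * g                     # parameter update
    return theta

def mse_grad(theta, Xb, yb):
    """Average gradient of the squared error over a mini-batch."""
    return -2.0 * Xb.T @ (yb - Xb @ theta) / len(Xb)

# Example usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
print(sgd(X, y, mse_grad))   # should land close to [1.0, -2.0, 0.5]
```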
Variants of SGD
- Mini-Batch Gradient Descent:
  Combines the benefits of both batch and stochastic methods by computing the gradient on small batches of data.
  - Batch size typically ranges from 32 to 512.
- SGD with Momentum (see the code sketch after this list):
  Adds a “momentum” term to the update to accelerate convergence and reduce oscillations:
  $$v \leftarrow \beta v + \nabla_\theta J_i(\theta), \qquad \theta \leftarrow \theta - \eta \, v$$
  where $\beta$ is the momentum coefficient (e.g., 0.9).
- Adaptive Learning Rate Methods:
  - Adagrad: Adjusts the learning rate based on past gradients.
  - RMSprop: Scales learning rates by a moving average of gradient magnitudes.
  - Adam: Combines momentum and adaptive learning rates for robust updates.
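A sketch of the momentum update as written above ($v \leftarrow \beta v + g$, $\theta \leftarrow \theta - \eta v$); the gradients fed in here are made up just to show the velocity accumulating.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    """One SGD-with-momentum update: accumulate velocity, then step along it."""
    v = beta * v + grad          # exponentially weighted sum of past gradients
    theta = theta - eta * v      # parameter update along the velocity
    return theta, v

theta, v = np.zeros(3), np.zeros(3)
for grad in [np.ones(3)] * 3:    # made-up constant gradients
    theta, v = sgd_momentum_step(theta, v, grad)
print(v)                         # velocity grows: 1.0 -> 1.9 -> 2.71
```

In practice these variants are typically taken from a framework optimizer (for example, torch.optim.SGD with momentum=0.9, or torch.optim.Adam in PyTorch) rather than implemented by hand.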
Advantages
- Efficiency: Faster updates compared to batch gradient descent, especially for large datasets.
- Online Learning: Can update the model as new data arrives, making it suitable for streaming or online learning.
- Scalability: Requires less memory and computational resources since only a subset of data is used for each update.
- Escapes Local Minima: The noisy updates of SGD can help escape shallow local minima in non-convex optimization.
Disadvantages
- Noisy Convergence: The randomness in gradient estimates can cause oscillations around the minimum.
- Sensitivity to Learning Rate: A poorly chosen learning rate can lead to divergence or slow convergence.
- Tuning Challenges: Requires careful tuning of hyperparameters like the learning rate, mini-batch size, and momentum.
Example
Consider minimizing $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \theta^\top x_i)^2$ (mean squared error) for a linear regression model.
- Gradient for Sample $i$:
  $$\nabla_\theta J_i(\theta) = -2 \, (y_i - \theta^\top x_i) \, x_i$$
- SGD Update:
  $$\theta \leftarrow \theta + 2 \eta \, (y_i - \theta^\top x_i) \, x_i$$
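Putting the example together, a per-sample SGD loop for this linear regression might look like the following; the synthetic data, learning rate, and epoch count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -1.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)    # noisy linear targets

theta = np.zeros(2)
eta = 0.01
for epoch in range(10):
    for i in rng.permutation(len(X)):                  # one sample at a time
        grad_i = -2.0 * (y[i] - X[i] @ theta) * X[i]   # gradient for sample i
        theta = theta - eta * grad_i                   # SGD update
print(theta)   # should end up close to [3.0, -1.5]
```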
Tips for Effective Use
- Learning Rate Scheduling (sketched below): Reduce the learning rate over time to stabilize convergence (e.g., exponential decay).
- Batch Normalization: Normalize activations across mini-batches to improve stability and speed up convergence.
- Regularization: Add penalties like $L_1$ or $L_2$ regularization to avoid overfitting.
- Early Stopping (sketched below): Monitor the validation loss and stop training when it stops improving.
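As a sketch of two of these tips, the snippet below combines an exponential learning-rate decay with early stopping on a validation loss; the decay rate, patience, and the step_fn / val_loss_fn callbacks are hypothetical placeholders, not a fixed recipe.

```python
import numpy as np

def exponential_decay(eta0, k, t):
    """Learning-rate schedule: eta_t = eta0 * exp(-k * t)."""
    return eta0 * np.exp(-k * t)

def train(step_fn, val_loss_fn, eta0=0.1, k=0.05, epochs=100, patience=5):
    """Run epochs of SGD with a decaying learning rate; stop early when the
    validation loss has not improved for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for t in range(epochs):
        step_fn(exponential_decay(eta0, k, t))   # one epoch of SGD at eta_t
        loss = val_loss_fn()                     # evaluate on held-out data
        if loss < best - 1e-6:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:                 # no recent improvement
                break
    return best
```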