Gradient Clipping

Definition:
Gradient clipping is a technique used to prevent exploding gradients during the training of deep neural networks. It caps the gradients computed during backpropagation so that their norm or absolute value does not exceed a specified threshold. This stabilizes training, especially for models with long backpropagation paths, such as recurrent neural networks (RNNs).


Why Gradient Clipping is Needed

  1. Exploding Gradients:

    • During backpropagation, gradients can grow exponentially with the number of layers or time steps (see the short sketch after this list).
    • This leads to excessively large weight updates, causing the model to diverge or oscillate.
  2. Stabilizing Training:

    • Gradient clipping ensures that gradients remain within a manageable range, leading to smoother updates and stable convergence.
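
As a quick illustration of the exploding-gradient point above, the following toy sketch unrolls a linear recurrence in PyTorch; the matrix size, the 1.5 scale factor, and the 30 steps are arbitrary choices for this example.

import torch

# Toy linear recurrence h_t = W h_{t-1}. When the weights scale vectors by
# more than 1, the gradient of the final state with respect to the initial
# state grows geometrically with the number of time steps.
W = 1.5 * torch.eye(4)                  # scales every direction by 1.5 (> 1)
h0 = torch.ones(4, requires_grad=True)  # initial hidden state

h = h0
for _ in range(30):                     # unroll 30 time steps
    h = W @ h

h.sum().backward()
print(h0.grad.norm())  # each component of h0.grad is 1.5**30 ≈ 1.9e5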

Types of Gradient Clipping

  1. Norm Clipping:

    • Rescales the gradient vector g if its norm ‖g‖ exceeds a threshold τ.
    • For a gradient g with ‖g‖ > τ, the clipped gradient is g ← (τ / ‖g‖) · g.
    • Preserves the direction of the gradient while limiting its magnitude.
  2. Value Clipping:

    • Clips each element g_i of the gradient vector to lie within a specified range [−c, c]: g_i ← max(−c, min(c, g_i)).
    • Useful for very large gradients in specific directions.
  3. Global Norm Clipping (used in TensorFlow):

    • Rescales all gradients for all parameters collectively based on their global norm

      ‖g‖_global = sqrt( Σ_k ‖g_k‖² ),

      and clips using the same rule as norm clipping (all three variants are sketched in code after this list).
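
The following is a minimal sketch of the three variants applied to a list of gradient tensors in PyTorch; the function names and the thresholds tau and c are illustrative, and in practice the built-in utilities shown in the example later in this article are preferred.

import torch

def clip_by_norm(grads, tau):
    # Norm clipping: rescale each gradient tensor whose L2 norm exceeds tau,
    # preserving its direction.
    clipped = []
    for g in grads:
        norm = g.norm()
        if norm > tau:
            g = g * (tau / norm)
        clipped.append(g)
    return clipped

def clip_by_value(grads, c):
    # Value clipping: clamp every element into the range [-c, c].
    return [g.clamp(-c, c) for g in grads]

def clip_by_global_norm(grads, tau):
    # Global norm clipping: compute one norm over all gradients and rescale
    # them collectively by the same factor if it exceeds tau.
    global_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if global_norm > tau:
        grads = [g * (tau / global_norm) for g in grads]
    return grads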

How Gradient Clipping Works

  1. Compute Gradients:

    • During backpropagation, compute the gradients g = ∇_θ L of the loss function L with respect to the model parameters θ.
  2. Check Norm/Value:

    • Evaluate whether the gradient norm ‖g‖ or any gradient value exceeds the threshold τ.
  3. Clip:

    • If the norm or value exceeds the threshold, scale the gradient vector or its elements to comply with the threshold.
  4. Update Parameters:

    • Use the clipped gradients to update the model parameters: θ ← θ − η · g_clipped, where η is the learning rate (a worked example follows this list).
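
As a concrete worked example (the numbers are chosen purely for illustration): with threshold τ = 1 and gradient g = (3, 4), the norm is ‖g‖ = 5 > τ, so the clipped gradient is (1/5) · (3, 4) = (0.6, 0.8), and the update becomes θ ← θ − η · (0.6, 0.8).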

Algorithm

  1. Set the clipping threshold τ.
  2. For each mini-batch:
    • Compute gradients g = ∇_θ L.
    • If ‖g‖ > τ:
      • Scale gradients: g ← (τ / ‖g‖) · g.
    • Update parameters using the clipped gradients (see the training-loop sketch after this list).
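
A minimal PyTorch sketch of this loop, assuming that model, criterion, optimizer, and a dataloader are already defined elsewhere; the threshold tau = 1.0 is an illustrative value.

import torch

tau = 1.0  # clipping threshold (hyperparameter)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # compute gradients g
    # If the global norm of g exceeds tau, rescale g so its norm equals tau
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau)
    optimizer.step()  # update parameters with the clipped gradients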

Advantages of Gradient Clipping

  1. Prevents Instability:

    • Avoids exploding gradients, especially in RNNs and very deep networks.
  2. Maintains Learning Rate:

    • Allows the use of larger learning rates without causing divergence.
  3. Simplicity:

    • Easy to implement and integrate into existing training pipelines.

Disadvantages of Gradient Clipping

  1. Information Loss:

    • Value clipping can alter the gradient direction, and an overly small threshold discards magnitude information, potentially slowing convergence.
  2. Requires Tuning:

    • The clipping threshold τ is a hyperparameter that needs careful tuning.

Example in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
 
# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)
 
# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
 
# Example input and target
inputs = torch.randn(16, 10)  # Batch of 16 samples
targets = torch.randn(16, 1)
 
# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)
 
# Backward pass
optimizer.zero_grad()  # clear any previously accumulated gradients
loss.backward()
 
# Gradient clipping: rescale gradients so their global norm is at most 5.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
 
# Update parameters
optimizer.step()
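
For value clipping, PyTorch provides the analogous utility torch.nn.utils.clip_grad_value_; the bound of 0.5 below is an illustrative choice.

# Clamp every gradient element to the range [-0.5, 0.5]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)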

Comparison with Other Techniques

Technique             | Effect                    | Use Case
----------------------|---------------------------|-------------------------------------
Gradient Clipping     | Caps gradient magnitudes  | Exploding gradients, RNNs
Batch Normalization   | Normalizes activations    | Vanishing gradients, deep networks
Weight Regularization | Penalizes large weights   | Overfitting, general regularization