Gradient Clipping

Definition:
Gradient clipping is a technique used to prevent exploding gradients during the training of deep neural networks. It caps the gradients computed during backpropagation so that their norm or absolute value does not exceed a specified threshold. This stabilizes training, especially for models with long backpropagation paths, such as recurrent neural networks (RNNs).


Why Gradient Clipping is Needed

  1. Exploding Gradients:

    • During backpropagation, gradients can grow exponentially with the number of layers or time steps (see the short sketch after this list).
    • This leads to excessively large weight updates, causing the model to diverge or oscillate.
  2. Stabilizing Training:

    • Gradient clipping ensures that gradients remain within a manageable range, leading to smoother updates and stable convergence.
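
As a quick illustration of the exploding-gradient point above, the following toy sketch unrolls a linear recurrence in PyTorch; the matrix size, the 1.5 scale factor, and the 30 steps are arbitrary choices for this example.

import torch

# Toy linear recurrence h_t = W h_{t-1}. When the weights scale vectors by
# more than 1, the gradient of the final state with respect to the initial
# state grows geometrically with the number of time steps.
W = 1.5 * torch.eye(4)                  # scales every direction by 1.5 (> 1)
h0 = torch.ones(4, requires_grad=True)  # initial hidden state

h = h0
for _ in range(30):                     # unroll 30 time steps
    h = W @ h

h.sum().backward()
print(h0.grad.norm())  # each component of h0.grad is 1.5**30 ≈ 1.9e5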

Types of Gradient Clipping

  1. Norm Clipping:

    • Rescales the gradient vector g if its norm ‖g‖ exceeds a threshold τ.
    • For a gradient g with ‖g‖ > τ, the clipped gradient is g ← (τ / ‖g‖) · g.
    • Preserves the direction of the gradient while limiting its magnitude.
  2. Value Clipping:

    • Clips each element g_i of the gradient vector to lie within a specified range [−c, c]: g_i ← max(−c, min(c, g_i)).
    • Useful for very large gradients in specific directions.
  3. Global Norm Clipping (used in TensorFlow):

    • Rescales all gradients for all parameters collectively based on their global norm

      ‖g‖_global = sqrt( Σ_k ‖g_k‖² ),

      and clips using the same rule as norm clipping (all three variants are sketched in code after this list).
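
The following is a minimal sketch of the three variants applied to a list of gradient tensors in PyTorch; the function names and the thresholds tau and c are illustrative, and in practice the built-in utilities shown in the example later in this article are preferred.

import torch

def clip_by_norm(grads, tau):
    # Norm clipping: rescale each gradient tensor whose L2 norm exceeds tau,
    # preserving its direction.
    clipped = []
    for g in grads:
        norm = g.norm()
        if norm > tau:
            g = g * (tau / norm)
        clipped.append(g)
    return clipped

def clip_by_value(grads, c):
    # Value clipping: clamp every element into the range [-c, c].
    return [g.clamp(-c, c) for g in grads]

def clip_by_global_norm(grads, tau):
    # Global norm clipping: compute one norm over all gradients and rescale
    # them collectively by the same factor if it exceeds tau.
    global_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if global_norm > tau:
        grads = [g * (tau / global_norm) for g in grads]
    return grads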

How Gradient Clipping Works

  1. Compute Gradients:

    • During backpropagation, compute the gradients g = ∇_θ L of the loss function L with respect to the model parameters θ.
  2. Check Norm/Value:

    • Evaluate whether the gradient norm ‖g‖ or any gradient value exceeds the threshold τ.
  3. Clip:

    • If the norm or value exceeds the threshold, scale the gradient vector or its elements to comply with the threshold.
  4. Update Parameters:

    • Use the clipped gradients to update the model parameters: θ ← θ − η · g_clipped, where η is the learning rate (a worked example follows this list).
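
As a concrete worked example (the numbers are chosen purely for illustration): with threshold τ = 1 and gradient g = (3, 4), the norm is ‖g‖ = 5 > τ, so the clipped gradient is (1/5) · (3, 4) = (0.6, 0.8), and the update becomes θ ← θ − η · (0.6, 0.8).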

Algorithm

  1. Set the clipping threshold τ.
  2. For each mini-batch:
    • Compute gradients g = ∇_θ L.
    • If ‖g‖ > τ:
      • Scale gradients: g ← (τ / ‖g‖) · g.
    • Update parameters using the clipped gradients (see the training-loop sketch after this list).
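
A minimal PyTorch sketch of this loop, assuming that model, criterion, optimizer, and a dataloader are already defined elsewhere; the threshold tau = 1.0 is an illustrative value.

import torch

tau = 1.0  # clipping threshold (hyperparameter)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # compute gradients g
    # If the global norm of g exceeds tau, rescale g so its norm equals tau
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau)
    optimizer.step()  # update parameters with the clipped gradients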

Advantages of Gradient Clipping

  1. Prevents Instability:

    • Avoids exploding gradients, especially in RNNs and very deep networks.
  2. Maintains Learning Rate:

    • Allows the use of larger learning rates without causing divergence.
  3. Simplicity:

    • Easy to implement and integrate into existing training pipelines.

Disadvantages of Gradient Clipping

  1. Information Loss:

    • Value clipping can alter the gradient direction, and an overly small threshold discards magnitude information, potentially slowing convergence.
  2. Requires Tuning:

    • The clipping threshold τ is a hyperparameter that needs careful tuning.

Example in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
 
# Define a simple model
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)
 
# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
 
# Example input and target
inputs = torch.randn(16, 10)  # Batch of 16 samples
targets = torch.randn(16, 1)
 
# Forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)
 
# Backward pass
optimizer.zero_grad()  # clear any previously accumulated gradients
loss.backward()
 
# Gradient clipping: rescale gradients so their global norm is at most 5.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
 
# Update parameters
optimizer.step()
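
For value clipping, PyTorch provides the analogous utility torch.nn.utils.clip_grad_value_; the bound of 0.5 below is an illustrative choice.

# Clamp every gradient element to the range [-0.5, 0.5]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)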

Comparison with Other Techniques

Technique             | Effect                    | Use Case
----------------------|---------------------------|-------------------------------------
Gradient Clipping     | Caps gradient magnitudes  | Exploding gradients, RNNs
Batch Normalization   | Normalizes activations    | Vanishing gradients, deep networks
Weight Regularization | Penalizes large weights   | Overfitting, general regularization