- categories: Data Science, Challenge
The vanishing gradient and exploding gradient problems are common issues encountered when training deep neural networks. They arise during the Backpropagation Algorithm, where gradients are propagated backward through the network to update the weights. These problems can lead to unstable training, poor convergence, or even complete failure to train.
Vanishing Gradient Problem
Definition:
In the vanishing gradient problem, the gradients of the loss function with respect to the weights in earlier layers become very small as they are propagated backward. This results in negligible updates to the weights of the earlier layers, causing the network to learn very slowly or not at all.
Causes:
- Activation Functions:
  - Non-linear activation functions like the Sigmoid Function or tanh “saturate” for large input values, where their derivatives are close to zero: for the sigmoid, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le 0.25$, and it approaches zero as $|x|$ grows (see the sketch after this list).
- Deep Networks:
  - In deep networks, backpropagation involves repeatedly multiplying gradients by the Jacobian matrices of layer activations. If the Jacobian Matrix has eigenvalues less than 1, gradients shrink exponentially with depth.
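As a quick illustration of the saturation point above, here is a minimal NumPy sketch (NumPy is an assumed dependency, not something this note specifies): the sigmoid derivative peaks at 0.25 and collapses toward zero for inputs of even moderate magnitude.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)), peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")
# x =   0.0  sigmoid'(x) = 0.250000
# x =   2.0  sigmoid'(x) = 0.104994
# x =   5.0  sigmoid'(x) = 0.006648
# x =  10.0  sigmoid'(x) = 0.000045
```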
Exploding Gradient Problem
Definition:
In the exploding gradient problem, the gradients of the loss function with respect to the weights grow very large during backpropagation. This leads to instability, where the weights diverge, causing the loss to oscillate or blow up.
Causes:
- Weight Initialization:
  - Improper weight initialization can lead to gradients with large magnitudes.
- Deep Networks:
  - If the Jacobian Matrix has eigenvalues greater than 1, gradients grow exponentially with depth.
Mathematical Explanation
Consider a loss function $L$ for a deep network with parameters $\theta$. During backpropagation, the gradient of the loss with respect to the parameters $\theta^{(l)}$ in layer $l$ is:
$$\frac{\partial L}{\partial \theta^{(l)}} = \frac{\partial L}{\partial a^{(N)}} \left( \prod_{k=l+1}^{N} \frac{\partial a^{(k)}}{\partial a^{(k-1)}} \right) \frac{\partial a^{(l)}}{\partial \theta^{(l)}}$$
where $a^{(k)}$ are the activations at layer $k$ and $N$ is the number of layers.
If the per-layer factor $\left\lVert \frac{\partial a^{(k)}}{\partial a^{(k-1)}} \right\rVert$ is consistently:
- Small (e.g., $< 1$): gradients diminish as the depth $N - l$ increases, leading to vanishing gradients.
- Large (e.g., $> 1$): gradients grow exponentially as the depth increases, leading to exploding gradients.
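To make the effect of depth concrete, here is a small numerical sketch (NumPy, with hypothetical constant per-layer gradient factors of 0.5 and 1.5) showing how quickly the product vanishes or explodes over 50 layers:

```python
import numpy as np

depth = 50
for factor in [0.5, 1.5]:
    # Repeatedly multiplying the gradient by the same per-layer factor
    grad_scale = np.prod(np.full(depth, factor))
    print(f"per-layer factor {factor}: gradient scale after {depth} layers = {grad_scale:.3e}")
# per-layer factor 0.5: gradient scale after 50 layers = 8.882e-16  (vanishes)
# per-layer factor 1.5: gradient scale after 50 layers = 6.377e+08  (explodes)
```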
Impact on Training
-
Vanishing Gradients:
- Early layers stop learning as their gradients approach zero.
- Network struggles to capture long-term dependencies in sequential tasks (e.g., RNNs).
-
Exploding Gradients:
- Gradients become too large, causing weights to oscillate or diverge.
- Training becomes unstable, and the loss may “explode” to infinity.
Solutions to Vanishing and Exploding Gradients
1. Proper Weight Initialization
- Use methods like Xavier (Glorot) or He initialization to ensure that weights are neither too small nor too large:
  - Xavier Initialization: $W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of inputs and outputs for the layer.
  - He Initialization: $W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right)$, designed for ReLU activations.
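A minimal sketch of applying these schemes, assuming PyTorch as the framework (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

# Xavier (Glorot) initialization: variance scaled by fan_in + fan_out,
# a common default for tanh/sigmoid layers
nn.init.xavier_uniform_(layer.weight)

# He (Kaiming) initialization: variance scaled by fan_in, intended for ReLU layers
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

nn.init.zeros_(layer.bias)
```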
2. Use Activation Functions with Non-Saturating Gradients
- Replace saturating functions like sigmoid or tanh with activation functions like ReLU: $\text{ReLU}(x) = \max(0, x)$, whose derivative is 1 for positive inputs and therefore does not shrink gradients.
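One way to see the difference is to compare the gradient magnitude at the first layer of a deep sigmoid stack versus a deep ReLU stack. This is a sketch assuming PyTorch, with an arbitrary 20-layer network; exact numbers will vary with initialization, but the gap is the point.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation: nn.Module, depth: int = 20, width: int = 64) -> float:
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation]
    model = nn.Sequential(*layers)

    x = torch.randn(32, width)
    model(x).sum().backward()
    # Gradient norm of the very first Linear layer's weights
    return model[0].weight.grad.norm().item()

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid()))  # typically tiny
print("relu:   ", first_layer_grad_norm(nn.ReLU()))     # typically orders of magnitude larger
```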
3. Batch Normalization
- Normalize activations within a mini-batch to stabilize gradient magnitudes:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$
where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, and $\gamma$, $\beta$ are learnable parameters.
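A sketch of inserting batch normalization between layers, again assuming PyTorch (the layer widths are illustrative):

```python
import torch.nn as nn

# Linear -> BatchNorm -> ReLU is a common ordering; BatchNorm re-centres and
# re-scales each feature over the mini-batch before the non-linearity
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
```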
4. Gradient Clipping
- Clip gradients to a maximum norm or value to prevent them from exploding:
$$g \leftarrow g \cdot \frac{\tau}{\lVert g \rVert} \quad \text{if } \lVert g \rVert > \tau$$
where $\tau$ is a threshold.
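A sketch of norm-based clipping inside a training step, assuming PyTorch; the tiny model, loss, and random batch here are stand-ins, not part of the original note:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale the full gradient vector if its norm exceeds the threshold (tau = 1.0 here)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```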
5. Residual Connections (ResNets)
- Add shortcut connections that allow gradients to flow directly to earlier layers:
$$y = F(x) + x$$
where $F(x)$ is the output of a block of layers.
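A minimal residual block sketch (PyTorch, arbitrary widths), where the identity shortcut gives gradients a direct path back to earlier layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        # F(x): a small stack of layers
        self.block = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = F(x) + x: the shortcut term contributes an identity Jacobian,
        # so gradients can flow past the block unattenuated
        return self.block(x) + x

y = ResidualBlock(64)(torch.randn(8, 64))
```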
6. Specialized Architectures
- Use architectures designed to mitigate gradient issues, such as:
- LSTMs or GRUs for sequential tasks (a minimal sketch follows this list).
- Transformers for long-term dependencies.
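For sequential data, a gated recurrent architecture can be dropped in place of a plain RNN. A minimal PyTorch sketch, with arbitrary input/hidden sizes:

```python
import torch
import torch.nn as nn

# LSTM gates regulate how much signal is carried across time steps,
# mitigating vanishing gradients over long sequences
lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

x = torch.randn(8, 100, 32)          # (batch, sequence length, features)
output, (h_n, c_n) = lstm(x)         # output: (8, 100, 64)
```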
Comparison: Vanishing vs. Exploding Gradients
| Feature | Vanishing Gradient | Exploding Gradient |
|---|---|---|
| Cause | Gradients shrink exponentially with depth | Gradients grow exponentially with depth |
| Effect | Slow or no learning in early layers | Instability, divergence of training |
| Common Context | Deep networks, RNNs | Poor initialization, deep networks |
| Solutions | ReLU, BatchNorm, Residual Connections | Gradient Clipping, Proper Initialization |