- categories: Data Science, Algorithm
Definition:
Backpropagation is an algorithm used to train neural networks by computing the gradient of the loss function with respect to the network’s weights and biases. It applies the Chain Rule of calculus to efficiently propagate errors from the output layer back to the earlier layers, enabling gradient-based optimization methods like Stochastic Gradient Descent (SGD).
Steps of Backpropagation
- Forward Pass:
  Compute the network's predictions for the given inputs and calculate the loss.
- Backward Pass (Gradient Computation):
  - Compute the gradient of the loss with respect to the outputs of the last layer.
  - Use the chain rule to propagate gradients back through the network, obtaining the gradients of the loss with respect to the weights and biases of each layer.
- Update Parameters:
  Use the computed gradients to update the weights and biases, typically with a gradient descent step (a minimal sketch follows this list):
  $$W^l \leftarrow W^l - \eta \frac{\partial \mathcal{L}}{\partial W^l}, \qquad b^l \leftarrow b^l - \eta \frac{\partial \mathcal{L}}{\partial b^l},$$
  where $\eta$ is the learning rate.
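As a concrete illustration of the three steps, here is a minimal numpy sketch for a single linear layer trained with a mean-squared-error loss; the toy data and names (`W`, `b`, `eta`) are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # 8 samples, 3 features
y = rng.normal(size=(8, 1))               # regression targets
W = rng.normal(size=(3, 1)) * 0.1         # weights
b = np.zeros((1, 1))                      # bias
eta = 0.1                                 # learning rate

# Forward pass: predictions and loss
y_hat = X @ W + b
loss = np.mean((y_hat - y) ** 2)

# Backward pass: gradients of the loss w.r.t. W and b via the chain rule
d_y_hat = 2 * (y_hat - y) / len(X)        # dL/dy_hat
dW = X.T @ d_y_hat                        # dL/dW
db = d_y_hat.sum(axis=0, keepdims=True)   # dL/db

# Parameter update: one gradient descent step with learning rate eta
W -= eta * dW
b -= eta * db
```

Repeating these three steps over many inputs or batches is exactly the training loop outlined in the sections below.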
Mathematical Formulation
Consider a feedforward network with:
- $L$ layers,
- activation functions $f^l$ for layer $l$,
- weight matrices $W^l$, bias vectors $b^l$, and activations $a^l$.
Forward Pass Equations:
For each layer $l = 1, \dots, L$ (with $a^0 = x$ the input; see the sketch after this list):
- Compute the pre-activation (weighted sum): $z^l = W^l a^{l-1} + b^l$
- Compute the activation: $a^l = f^l(z^l)$
At the output layer:
- Predicted output $\hat{y} = a^L$.
- Loss $\mathcal{L}(\hat{y}, y)$ measures the discrepancy between the predictions $\hat{y}$ and the true labels $y$.
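A possible numpy sketch of this forward pass for a small fully connected network, caching the pre-activations $z^l$ and activations $a^l$ that the backward pass will need; the sigmoid activation and the 3-2-1 layer sizes are illustrative choices, not something fixed by the formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass: return the cached pre-activations z^l and activations a^l."""
    zs, activations = [], [x]        # activations[0] = a^0 = x
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b                # pre-activation: z^l = W^l a^(l-1) + b^l
        a = sigmoid(z)               # activation:     a^l = f^l(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Illustrative 3-2-1 network on a random input
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
biases = [np.zeros((2, 1)), np.zeros((1, 1))]
zs, activations = forward(rng.normal(size=(3, 1)), weights, biases)
y_hat = activations[-1]              # predicted output a^L
```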
Backward Pass (Gradient Computation)
- Output Layer (Last Layer $L$):
  Compute the gradient of the loss with respect to the pre-activation $z^L$:
  $$\delta^L = \nabla_{a^L} \mathcal{L} \odot f'^L(z^L),$$
  where $\odot$ denotes elementwise multiplication and $f'^L$ is the derivative of the activation function.
- Hidden Layers ($l = L-1, \dots, 1$):
  Propagate the error backward through the network using:
  $$\delta^l = \left( (W^{l+1})^\top \delta^{l+1} \right) \odot f'^l(z^l).$$
- Gradients of Weights and Biases:
  Compute the gradients used to update the parameters (see the sketch after this list):
  - Weight gradient: $\dfrac{\partial \mathcal{L}}{\partial W^l} = \delta^l (a^{l-1})^\top$
  - Bias gradient: $\dfrac{\partial \mathcal{L}}{\partial b^l} = \delta^l$
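One way these backward-pass equations might be coded for a sigmoid network with a squared-error loss $\mathcal{L} = \tfrac{1}{2}\lVert a^L - y \rVert^2$, given the cached `zs` and `activations` from a forward pass like the one sketched earlier; the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(zs, activations, weights, y):
    """Backward pass: return dL/dW^l and dL/db^l for every layer."""
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)

    # Output layer: delta^L = (a^L - y) * f'(z^L) for the squared-error loss
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])

    for l in reversed(range(len(weights))):
        grads_W[l] = delta @ activations[l].T    # dL/dW^l = delta^l (a^(l-1))^T
        grads_b[l] = delta                       # dL/db^l = delta^l
        if l > 0:
            # Propagate the error one layer back:
            # delta^l = ((W^(l+1))^T delta^(l+1)) * f'(z^l)
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads_W, grads_b
```

Used together with the `forward` helper from the previous sketch, the returned gradients feed directly into the parameter update step.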
Algorithm Outline
- Initialization: Randomly initialize $W^l$ and $b^l$.
- Forward Pass:
  - Compute activations $a^l$ and predictions $\hat{y}$ for input $x$.
  - Compute the loss $\mathcal{L}(\hat{y}, y)$.
- Backward Pass:
  - Compute $\delta^L$ for the output layer.
  - Propagate errors backward to compute $\delta^l$ for all hidden layers.
  - Compute the gradients $\dfrac{\partial \mathcal{L}}{\partial W^l}$ and $\dfrac{\partial \mathcal{L}}{\partial b^l}$.
- Parameter Update:
  Update $W^l$ and $b^l$ using an optimization algorithm (e.g., gradient descent).
- Iterate: Repeat until convergence (an end-to-end sketch follows this list).
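Putting the outline together, here is a hypothetical end-to-end training loop for a tiny 2-3-1 sigmoid network fitting a single toy target; the layer sizes, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# 1. Initialization: random weights, zero biases for a 2-3-1 network
sizes = [2, 3, 1]
weights = [rng.normal(size=(m, n)) * 0.5 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((m, 1)) for m in sizes[1:]]

x = rng.normal(size=(2, 1))   # toy input
y = np.array([[0.5]])         # toy target
eta = 0.5                     # learning rate

for step in range(200):
    # 2. Forward pass: cache z^l and a^l, then evaluate the loss
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    loss = 0.5 * np.sum((activations[-1] - y) ** 2)

    # 3. Backward pass: propagate delta^l and collect the gradients
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W, grads_b = [None] * len(weights), [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads_W[l] = delta @ activations[l].T
        grads_b[l] = delta
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])

    # 4. Parameter update: plain gradient descent
    weights = [W - eta * gW for W, gW in zip(weights, grads_W)]
    biases = [b - eta * gb for b, gb in zip(biases, grads_b)]

print(f"final loss: {loss:.6f}")   # the loss should have decreased substantially
```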
Key Concepts
- Chain Rule:
  Backpropagation relies on the chain rule to compute gradients efficiently, layer by layer:
  $$\frac{\partial \mathcal{L}}{\partial z^l} = \frac{\partial \mathcal{L}}{\partial z^{l+1}} \cdot \frac{\partial z^{l+1}}{\partial z^l}.$$
- Local Gradients:
  The computation uses only local gradients at each layer, which is far cheaper than differentiating the whole network with respect to every parameter in one monolithic step.
- Computational Graph:
  A neural network can be viewed as a computational graph, and backpropagation systematically computes gradients by traversing that graph in reverse order (see the scalar example after this list).
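As a tiny illustration of traversing a computational graph in reverse, the snippet below accumulates the gradient of a scalar composition from local gradients node by node and checks it against a finite-difference estimate; the particular function and numbers are illustrative, not taken from the text above.

```python
import math

# Graph: u = w * x,  a = tanh(u),  loss = (a - y)^2
w, x, y = 0.3, 1.5, 0.8

# Forward: evaluate each node of the graph
u = w * x
a = math.tanh(u)
loss = (a - y) ** 2

# Backward: multiply local gradients in reverse graph order (chain rule)
dloss_da = 2 * (a - y)      # local gradient of the loss node
da_du = 1 - a ** 2          # local gradient of the tanh node
du_dw = x                   # local gradient of the product node w.r.t. w
dloss_dw = dloss_da * da_du * du_dw

# Sanity check against a finite-difference approximation
eps = 1e-6
numeric = ((math.tanh((w + eps) * x) - y) ** 2 - loss) / eps
print(dloss_dw, numeric)    # the two values should nearly agree
```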
Advantages
- Efficiency: Exploits the chain rule to compute all gradients in $O(n)$ time, where $n$ is the number of parameters.
- Scalability: Works for deep networks with many layers.
- General Applicability: Compatible with various loss functions and activation functions.
Limitations
- Vanishing/Exploding Gradients: Gradients can become too small or too large, especially in deep networks with sigmoidal activations.
- Computational Cost: For very large networks, backpropagation can be computationally expensive.
- Initialization Sensitivity: Poor weight initialization can slow convergence or prevent effective learning.