Definition:
Backpropagation is an algorithm used to train neural networks by computing the gradient of the loss function with respect to the network’s weights and biases. It applies the Chain Rule of calculus to efficiently propagate errors from the output layer back to the earlier layers, enabling gradient-based optimization methods like Stochastic Gradient Descent (SGD).


Steps of Backpropagation

  1. Forward Pass:
    Compute the predictions of the network for given inputs and calculate the loss.

  2. Backward Pass (Gradient Computation):

    • Compute the gradient of the loss with respect to the outputs of each layer (starting from the last layer).
    • Use the chain rule to propagate gradients back through the network to compute the gradients of the loss with respect to the weights and biases of each layer.
  3. Update Parameters:
    Use the computed gradients to update the weights and biases, typically using gradient descent:

    $W^{(l)} \leftarrow W^{(l)} - \eta \, \dfrac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \, \dfrac{\partial \mathcal{L}}{\partial b^{(l)}}$

    where $\eta$ is the learning rate (a small sketch of this step follows the list).
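
A minimal sketch of the update step in NumPy. The layer sizes, the gradient values, and the learning rate are illustrative assumptions; `dW` and `db` stand in for gradients already produced by the backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer with 3 inputs and 2 outputs.
W = rng.normal(size=(2, 3))    # weight matrix W^(l)
b = np.zeros(2)                # bias vector b^(l)

# Stand-ins for gradients computed by the backward pass.
dW = rng.normal(size=(2, 3))   # dL/dW^(l)
db = rng.normal(size=2)        # dL/db^(l)

eta = 0.01                     # learning rate

# Gradient-descent update: step against the gradient.
W -= eta * dW
b -= eta * db
```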


Mathematical Formulation

Consider a feedforward network with:

  • $L$ layers,
  • Activation functions $f^{(l)}$ for layer $l$,
  • Weight matrices $W^{(l)}$, bias vectors $b^{(l)}$, and activations $a^{(l)}$ (with $a^{(0)} = x$, the input).

Forward Pass Equations:

For each layer $l = 1, \dots, L$:

  1. Compute the pre-activation (weighted sum): $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
  2. Compute the activation: $a^{(l)} = f^{(l)}(z^{(l)})$

At the output layer:

  • Predicted output $\hat{y} = a^{(L)}$.
  • Loss $\mathcal{L}(\hat{y}, y)$ measures the discrepancy between predictions $\hat{y}$ and true labels $y$ (a NumPy sketch of the forward pass follows below).
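
A minimal NumPy sketch of these forward-pass equations. The two-layer shape, the ReLU/identity activations, and the squared-error loss are assumptions made for illustration, not part of the formulation above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Illustrative 2-layer network: 3 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = rng.normal(size=3)     # input a^(0)
y = np.array([1.0])        # true label

# z^(l) = W^(l) a^(l-1) + b^(l),   a^(l) = f^(l)(z^(l))
z1 = W1 @ x + b1
a1 = relu(z1)
z2 = W2 @ a1 + b2
y_hat = z2                 # identity activation at the output

loss = 0.5 * np.sum((y_hat - y) ** 2)   # squared-error loss
```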

Backward Pass (Gradient Computation)

  1. Output Layer (Last Layer $L$):
    Compute the gradient of the loss with respect to the pre-activation $z^{(L)}$:

    $\delta^{(L)} = \nabla_{\hat{y}} \mathcal{L} \odot f'^{(L)}(z^{(L)})$

    where $\odot$ denotes elementwise multiplication, and $f'^{(L)}$ is the derivative of the activation function.

  2. Hidden Layers ($l = L-1, \dots, 1$):
    Propagate the error backward through the network using:

    $\delta^{(l)} = \left( (W^{(l+1)})^{\top} \delta^{(l+1)} \right) \odot f'^{(l)}(z^{(l)})$

  3. Gradients of Weights and Biases:
    Compute the gradients for updating parameters (a NumPy sketch follows this list):

    • Weight gradient: $\dfrac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \, (a^{(l-1)})^{\top}$
    • Bias gradient: $\dfrac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$
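
A minimal NumPy sketch of the hidden-layer recursion and the weight/bias gradients. All sizes and values are illustrative, the layer is assumed to use a sigmoid activation, and $\delta^{(l+1)}$ is taken as already computed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: layer l has 4 units, layer l+1 has 2 units,
# and the previous activation a^(l-1) has 3 components.
W_next = rng.normal(size=(2, 4))    # W^(l+1)
delta_next = rng.normal(size=2)     # delta^(l+1), assumed already computed
z = rng.normal(size=4)              # pre-activation z^(l)
a_prev = rng.normal(size=3)         # activation a^(l-1)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
f_prime = sigmoid(z) * (1.0 - sigmoid(z))   # f'(z^(l)) for a sigmoid layer

# delta^(l) = (W^(l+1)^T delta^(l+1)) ⊙ f'(z^(l))
delta = (W_next.T @ delta_next) * f_prime

# dL/dW^(l) = delta^(l) (a^(l-1))^T,   dL/db^(l) = delta^(l)
dW = np.outer(delta, a_prev)
db = delta
```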

Algorithm Outline

  1. Initialization: Randomly initialize $W^{(l)}$ and $b^{(l)}$.

  2. Forward Pass:

    • Compute activations $a^{(l)}$ and predictions $\hat{y}$ for input $x$.
    • Compute the loss $\mathcal{L}(\hat{y}, y)$.
  3. Backward Pass:

    • Compute $\delta^{(L)}$ for the output layer.
    • Propagate errors backward to compute $\delta^{(l)}$ for all hidden layers.
    • Compute gradients $\partial \mathcal{L} / \partial W^{(l)}$ and $\partial \mathcal{L} / \partial b^{(l)}$.
  4. Parameter Update:
    Update $W^{(l)}$ and $b^{(l)}$ using an optimization algorithm (e.g., gradient descent).

  5. Iterate: Repeat until convergence.
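
Putting the outline together, a compact end-to-end sketch in NumPy: a two-layer network with a sigmoid hidden layer and an identity output, trained by full-batch gradient descent with a squared-error loss on a small synthetic regression problem. The data, layer sizes, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic regression problem (illustrative).
X = rng.normal(size=(200, 3))                    # 200 samples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]))[:, None]    # targets, shape (200, 1)

# 1. Initialization: random weights, zero biases.
# Weights are stored as (inputs, outputs), i.e. transposed relative to
# W^(l) in the equations above, so a whole batch can be pushed through as X @ W.
W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.1                                        # learning rate

for epoch in range(500):
    # 2. Forward pass.
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    y_hat = z2                                   # identity output activation
    loss = 0.5 * np.mean((y_hat - y) ** 2)

    # 3. Backward pass (gradients averaged over the batch).
    n = X.shape[0]
    delta2 = (y_hat - y) / n                     # output-layer error
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)   # hidden-layer error
    dW2, db2 = a1.T @ delta2, delta2.sum(axis=0)
    dW1, db1 = X.T @ delta1, delta1.sum(axis=0)

    # 4. Parameter update (gradient descent).
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

# 5. Iterate: the loss should have decreased substantially by now.
print(f"final loss: {loss:.4f}")
```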


Key Concepts

  1. Chain Rule:
    Backpropagation relies on the chain rule to efficiently compute gradients layer by layer:

    $\dfrac{\partial \mathcal{L}}{\partial W^{(l)}} = \dfrac{\partial \mathcal{L}}{\partial z^{(L)}} \cdot \dfrac{\partial z^{(L)}}{\partial z^{(L-1)}} \cdots \dfrac{\partial z^{(l+1)}}{\partial z^{(l)}} \cdot \dfrac{\partial z^{(l)}}{\partial W^{(l)}}$

  2. Local Gradients:
    The computations involve local gradients for each layer, reducing the complexity compared to computing the full gradient in one step.

  3. Computational Graph:
    Neural networks can be viewed as computational graphs, where backpropagation systematically computes gradients by traversing the graph in reverse order.
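
As a concrete (and deliberately tiny) illustration of reverse traversal of a computational graph, the scalar example below is my own; the weights `w1`, `w2` and the tanh nonlinearity are arbitrary choices. Each backward step multiplies the incoming gradient by one local gradient:

```python
import math

# Tiny scalar computational graph:  y_hat = w2 * tanh(w1 * x),
# L = 0.5 * (y_hat - y)^2   (values chosen arbitrarily).
x, y = 1.5, 2.0
w1, w2 = 0.4, -0.7

# Forward pass: record every intermediate node.
z1 = w1 * x
a1 = math.tanh(z1)
y_hat = w2 * a1
L = 0.5 * (y_hat - y) ** 2

# Backward pass: traverse the graph in reverse, multiplying local gradients.
dL_dyhat = y_hat - y                  # dL/dy_hat
dL_dw2   = dL_dyhat * a1              # local gradient of y_hat w.r.t. w2 is a1
dL_da1   = dL_dyhat * w2              # local gradient of y_hat w.r.t. a1 is w2
dL_dz1   = dL_da1 * (1.0 - a1 ** 2)   # d tanh(z)/dz = 1 - tanh(z)^2
dL_dw1   = dL_dz1 * x                 # local gradient of z1 w.r.t. w1 is x

print(dL_dw1, dL_dw2)
```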


Advantages

  1. Efficiency: Exploits the chain rule to compute all gradients in $O(n)$ time, where $n$ is the number of parameters.
  2. Scalability: Works for deep networks with many layers.
  3. General Applicability: Compatible with various loss functions and activation functions.

Limitations

  1. Vanishing/Exploding Gradients: Gradients can become too small or too large, especially in deep networks with sigmoidal activations (see the toy illustration after this list).
  2. Computational Cost: For very large networks, backpropagation can be computationally expensive.
  3. Initialization Sensitivity: Poor weight initialization can slow convergence or prevent effective learning.
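
To make the vanishing-gradient point concrete, a toy illustration of my own: the backward recursion multiplies by $f'(z^{(l)})$ at every layer, and the sigmoid derivative is at most 0.25, so the error signal can shrink geometrically with depth. The depth, width, and weight scale below are arbitrary:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth, width = 30, 16

delta = rng.normal(size=width)                   # error at the output side
for _ in range(depth):
    W = rng.normal(scale=0.5, size=(width, width))
    z = rng.normal(size=width)                   # stand-in pre-activations
    f_prime = sigmoid(z) * (1.0 - sigmoid(z))    # sigmoid derivative <= 0.25
    delta = (W.T @ delta) * f_prime              # backward recursion

print(np.linalg.norm(delta))   # typically vanishingly small after 30 layers
```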