Definition:
Batch Normalization is a technique used in training deep neural networks to stabilize and accelerate convergence. It normalizes the activations of each layer by re-centering and re-scaling them, ensuring that their distributions remain consistent across training iterations.

Introduced by Ioffe and Szegedy in 2015, BatchNorm helps mitigate issues such as internal covariate shift—the change in the distribution of layer inputs during training.


How It Works

For a mini-batch of inputs x_1, …, x_m in a given layer, BatchNorm applies the following steps:

  1. Compute Batch Statistics:

    • Mean: μ_B = (1/m) Σ_i x_i
    • Variance: σ_B² = (1/m) Σ_i (x_i - μ_B)²
  2. Normalize the Inputs:
    Center and scale each input to have zero mean and unit variance:

    x̂_i = (x_i - μ_B) / √(σ_B² + ε)

    where ε is a small constant added for numerical stability (e.g., 1e-5).

  3. Scale and Shift:
    Introduce trainable parameters γ (scale) and β (shift) to restore the network’s ability to represent complex transformations:

    y_i = γ x̂_i + β

The parameters γ and β are learned during training along with the model’s other weights.
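To make these three steps concrete, here is a minimal sketch in PyTorch using plain tensor operations. The function name batch_norm_forward, the shapes, and the default eps are illustrative choices, not part of any library API.

import torch

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); gamma, beta: (features,)
    mu = x.mean(dim=0)                          # Step 1: per-feature mean over the mini-batch
    var = x.var(dim=0, unbiased=False)          # Step 1: per-feature (biased) variance
    x_hat = (x - mu) / torch.sqrt(var + eps)    # Step 2: normalize to zero mean, unit variance
    return gamma * x_hat + beta                 # Step 3: scale and shift

x = torch.randn(32, 10)                         # mini-batch of 32 examples, 10 features
gamma, beta = torch.ones(10), torch.zeros(10)   # common initialization: identity transform
y = batch_norm_forward(x, gamma, beta)
print(y.mean(dim=0), y.var(dim=0, unbiased=False))  # approximately 0 and 1 per feature

In a real layer, gamma and beta would be registered as trainable parameters so the optimizer can update them along with the weights.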


Training vs. Inference

  • During Training:
    • Batch statistics (μ_B and σ_B²) are computed for each mini-batch.
  • During Inference:
    • Use running averages of μ_B and σ_B² (computed over training mini-batches) for normalization, ensuring consistency across test samples.
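In PyTorch this switch is handled by the module’s train/eval modes. A small sketch, assuming a feature dimension of 10 and synthetic inputs whose true mean and standard deviation are 1 and 2:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(10)                 # tracks running_mean / running_var by default

bn.train()                              # training mode: use batch statistics, update running averages
for _ in range(100):
    bn(torch.randn(32, 10) * 2 + 1)     # synthetic activations with mean ~1, variance ~4
print(bn.running_mean[:3], bn.running_var[:3])   # drift toward roughly 1 and 4

bn.eval()                               # inference mode: normalize with the stored running averages
y = bn(torch.randn(4, 10))              # output no longer depends on the other samples in the batch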

Benefits of BatchNorm

  1. Stabilizes Training:

    • Keeps the distribution of layer inputs consistent from batch to batch, reducing sensitivity to weight initialization.
  2. Accelerates Convergence:

    • Enables faster training by allowing higher learning rates.
  3. Improves Generalization:

    • Acts as a form of regularization, reducing the need for other techniques like dropout in some cases.
  4. Reduces Internal Covariate Shift:

    • Normalizes intermediate activations, minimizing changes in input distributions to subsequent layers during training.

Mathematical Representation in Neural Networks

For a layer with input activations x and weights W, the forward pass typically involves the pre-activation:

z = W x

(A bias term is often omitted here, since BatchNorm’s shift β plays the same role.)

Applying BatchNorm to z:

  1. Compute batch statistics (μ_B, σ_B²).
  2. Normalize: ẑ = (z - μ_B) / √(σ_B² + ε)
  3. Scale and shift: y = γ ẑ + β
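These steps can be checked numerically against nn.BatchNorm1d, whose weight and bias attributes correspond to γ and β. A sketch with arbitrary layer sizes:

import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(20, 50)
bn = nn.BatchNorm1d(50)

x = torch.randn(32, 20)
z = linear(x)                                    # pre-activation z = W x (+ bias)

mu = z.mean(dim=0)                               # batch statistics
var = z.var(dim=0, unbiased=False)
z_hat = (z - mu) / torch.sqrt(var + bn.eps)      # normalize
y_manual = bn.weight * z_hat + bn.bias           # scale and shift (gamma = bn.weight, beta = bn.bias)

y_module = bn(z)                                 # the module, in training mode, does the same computation
print(torch.allclose(y_manual, y_module, atol=1e-5))  # should print True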

Effect on Gradient Descent

  1. Gradient Smoothing:

    • BatchNorm makes the optimization landscape smoother by keeping activations well-scaled.
    • This reduces the likelihood of steep or flat regions, making gradient descent more efficient.
  2. Decouples Layers:

    • By normalizing layer inputs, BatchNorm reduces dependencies between parameters in different layers, improving stability.
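One way to see these effects in practice is to train the same small network with and without BatchNorm at a deliberately large learning rate. The sketch below uses a synthetic regression task; the architecture, data, and learning rate are arbitrary, and exact losses will vary, but the BatchNorm variant typically remains stable where the plain network struggles.

import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp(use_bn):
    layers = [nn.Linear(20, 64)]
    if use_bn:
        layers.append(nn.BatchNorm1d(64))
    layers += [nn.ReLU(), nn.Linear(64, 1)]
    return nn.Sequential(*layers)

x = torch.randn(256, 20)                  # synthetic regression data
y_true = x.sum(dim=1, keepdim=True)

for use_bn in (False, True):
    model = make_mlp(use_bn)
    opt = torch.optim.SGD(model.parameters(), lr=0.5)   # deliberately large learning rate
    for _ in range(200):
        loss = nn.functional.mse_loss(model(x), y_true)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"use_bn={use_bn}: final loss {loss.item():.4f}")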

Practical Considerations

  1. Mini-Batch Size:

    • Small batch sizes may result in unstable estimates of μ_B and σ_B². Techniques like Group Normalization or Layer Normalization are alternatives in such cases (see the sketch after this list).
  2. Placement in Architecture:

    • Typically applied after a linear or convolutional layer and before the activation function.
  3. Regularization:

    • While BatchNorm has a regularizing effect, it is often combined with other techniques like Dropout.
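As a quick illustration of the small-batch issue mentioned above, the sketch below estimates the mean and standard deviation of synthetic features from mini-batches of different sizes, then shows nn.GroupNorm as a drop-in layer whose statistics do not depend on batch size. All numbers and shapes are arbitrary.

import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.randn(1000, 64) * 3 + 2             # synthetic features: true mean 2, true std 3

for batch_size in (2, 8, 64):
    batch = data[torch.randperm(len(data))[:batch_size]]
    # Small batches typically give noisier estimates of the true statistics
    print(batch_size, round(batch.mean().item(), 2), round(batch.std().item(), 2))

gn = nn.GroupNorm(num_groups=8, num_channels=64)  # normalizes per sample, within channel groups
out = gn(torch.randn(2, 64, 16, 16))              # works even with a mini-batch of 2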

Variants of Batch Normalization

  1. Layer Normalization:

    • Normalizes across features for each sample instead of across the batch.
    • Useful in RNNs where batch statistics are less meaningful.
  2. Instance Normalization:

    • Normalizes each individual feature map (used in style transfer).
  3. Group Normalization:

    • Divides features into groups and normalizes within each group, suitable for small batch sizes.
  4. Batch Renormalization:

    • Modifies BatchNorm to make it more robust when mini-batch statistics deviate significantly.
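The first three variants have direct counterparts in torch.nn, differing only in which axes the statistics are computed over; a sketch applying each to the same activation tensor (shapes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(4, 32, 16, 16)                        # (batch, channels, height, width)

norms = {
    "batch":    nn.BatchNorm2d(32),                   # over (batch, H, W), per channel
    "layer":    nn.LayerNorm([32, 16, 16]),           # over all features, per sample
    "instance": nn.InstanceNorm2d(32),                # over (H, W), per sample and channel
    "group":    nn.GroupNorm(num_groups=8, num_channels=32),  # per sample, within channel groups
}
for name, layer in norms.items():
    print(name, layer(x).shape)                       # all preserve the input shape

# Batch Renormalization has no built-in torch.nn module; it is usually implemented on top of BatchNorm.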

Advantages and Disadvantages

Aspect         | Advantages                                             | Disadvantages
Stability      | Reduces covariate shift and stabilizes training.       | Requires mini-batches; less effective with small batches.
Efficiency     | Enables faster convergence and higher learning rates.  | Adds computation and memory overhead.
Regularization | Reduces overfitting in some cases.                     | May not fully replace other regularization techniques.

Example in PyTorch

import torch
import torch.nn as nn
 
# Define a model with BatchNorm
class ModelWithBatchNorm(nn.Module):
    def __init__(self):
        super(ModelWithBatchNorm, self).__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
 
    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        return x
 
# Instantiate and test
model = ModelWithBatchNorm()
input_tensor = torch.randn(8, 3, 32, 32)  # Batch of 8 images, 3 channels, 32x32
output = model(input_tensor)