Definition:
The Rectified Linear Unit (ReLU) activation function is defined as:

$$f(x) = \max(0, x)$$
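
A minimal sketch in NumPy (the array `x` below is a hypothetical batch of pre-activations, chosen only for illustration):

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

# Negative inputs are zeroed; positive inputs pass through unchanged.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
```
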
Properties

  1. Piecewise Linear:

    • Linear for $x > 0$.
    • Constant at $0$ for $x \le 0$.
  2. Range:

    • Outputs values in $[0, \infty)$.
  3. Non-linear Activation:

    • Despite being linear for $x > 0$, it introduces non-linearity by zeroing out negative inputs, enabling the network to learn complex mappings.
  4. Sparse Activation:

    • Only neurons with $x > 0$ are activated, leading to efficient representations.
  5. Gradient Behavior:

    • Derivative: $f'(x) = 1$ for $x > 0$ and $f'(x) = 0$ for $x < 0$; the derivative is undefined at $x = 0$, where implementations conventionally use $0$.
    • Gradients are preserved for $x > 0$, mitigating the vanishing gradient problem (see the sketch after this list).

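A short sketch of the gradient computation in NumPy, assuming the common convention of assigning a derivative of $0$ at $x = 0$:

```python
import numpy as np

def relu_grad(x: np.ndarray) -> np.ndarray:
    """Derivative of ReLU: 1 where x > 0, else 0 (0 chosen at x = 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-1.0, 0.0, 3.0])
print(relu_grad(x))  # [0. 0. 1.]
```
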
Advantages

  1. Computational Efficiency:

    • Simple to compute: requires only a comparison with zero (a single max operation).
  2. Mitigation of Vanishing Gradients:

    • Unlike sigmoid or tanh, ReLU retains gradients for positive activations, enabling deeper networks to train more effectively.
  3. Encourages Sparse Representations:

    • Zeroing out negative inputs results in fewer active neurons, improving model interpretability and reducing overfitting (illustrated in the sketch below).
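
To illustrate the sparsity point, a toy sketch assuming zero-mean Gaussian pre-activations, which makes roughly half the units active; the shapes and numbers here are illustrative, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal((1000, 256))  # toy batch of pre-activations
activations = np.maximum(0.0, pre_activations)

# Fraction of units left non-zero after ReLU.
active_fraction = np.mean(activations > 0)
print(f"Active units: {active_fraction:.1%}")  # ~50% for zero-mean inputs
```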

Disadvantages

  1. Dying ReLU Problem:

    • Neurons whose pre-activation is negative for all inputs receive zero gradient and can become permanently inactive during training.
    • Common with poor weight initialization or an overly large learning rate.
  2. Unbounded Outputs:

    • Can lead to exploding activations, especially in deeper layers if not managed with techniques like normalization.
  3. Sensitivity to Initialization:

    • Proper weight initialization (e.g., He initialization) is crucial for preventing gradient issues; a sketch follows this list.
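
A minimal sketch of He (Kaiming) initialization for a ReLU layer in NumPy; the layer sizes are hypothetical:

```python
import numpy as np

def he_init(fan_in: int, fan_out: int, rng: np.random.Generator) -> np.ndarray:
    """He normal initialization: std = sqrt(2 / fan_in), which keeps
    activation variance roughly constant across ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.std())  # close to sqrt(2/512) ~= 0.0625
```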

Visualization:

(Plot of $f(x) = \max(0, x)$: flat at $0$ for $x \le 0$, the identity line $y = x$ for $x > 0$.)

Variants

  1. Leaky ReLU:

    • Allows a small negative slope $\alpha$ for $x \le 0$ to prevent dying neurons: $f(x) = x$ for $x > 0$, $\alpha x$ otherwise (see the sketch after this list).
    • A typical value of $\alpha$ is $0.01$.
  2. Parametric ReLU (PReLU):

    • Generalizes Leaky ReLU by making $\alpha$ a learnable parameter.
  3. Exponential Linear Unit (ELU):

    • Smooths the transition for $x \le 0$: $f(x) = x$ for $x > 0$, $\alpha(e^x - 1)$ otherwise.
    • Provides non-zero gradients for negative inputs and a bounded range for negative outputs.
  4. Scaled Exponential Linear Unit (SELU):

    • A self-normalizing variant designed to keep the mean and variance of activations close to $0$ and $1$, respectively.
  5. Maxout:

    • Outputs the maximum of multiple linear functions: $f(x) = \max_i \left(w_i^\top x + b_i\right)$.

    • Highly flexible but computationally expensive.
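
A hedged NumPy sketch of the variants above; the SELU constants are rounded versions of the published values from Klambauer et al. (2017), and the maxout weights are random placeholders:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    """SELU: scaled ELU with fixed constants (rounded)."""
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W, b):
    """Maxout: element-wise max over k affine maps.
    W has shape (k, d_in, d_out); b has shape (k, d_out)."""
    return np.max(np.einsum('kio,i->ko', W, x) + b, axis=0)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))  # [-0.02  -0.005  0.5    2.   ]
print(elu(x))         # negative inputs squashed toward -alpha
print(selu(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4, 3))  # k = 2 pieces, 4 inputs, 3 outputs
b = rng.standard_normal((2, 3))
print(maxout(x, W, b))              # output has shape (3,)
```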