- categories: Data Science, Definition
Definition:
The Rectified Linear Unit (ReLU) activation function is defined as:
$$f(x) = \max(0, x)$$
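As a quick illustration, a minimal NumPy sketch of this element-wise definition might look like the following; the function name `relu` and the sample array are illustrative assumptions, not part of the definition itself:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # -> [0., 0., 0., 1.5, 3.]
```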
Properties
- Piecewise Linear:
  - Linear for $x > 0$.
  - Constant at $0$ for $x \le 0$.
- Range:
  - Outputs values in $[0, \infty)$.
- Non-linear Activation:
  - Despite being linear for $x > 0$, it introduces non-linearity by zeroing out negative inputs, enabling the network to learn complex mappings.
- Sparse Activation:
  - Only neurons with $x > 0$ are activated, leading to efficient representations.
- Gradient Behavior:
  - Derivative: $f'(x) = 1$ for $x > 0$ and $f'(x) = 0$ for $x < 0$ (undefined at $x = 0$; in practice set to $0$).
  - Gradients are preserved for $x > 0$, mitigating the vanishing gradient problem (see the sketch after this list).
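A minimal sketch of the gradient behavior described above, assuming the common convention of a zero subgradient at $x = 0$; the function name `relu_grad` is illustrative:

```python
import numpy as np

def relu_grad(x: np.ndarray) -> np.ndarray:
    """Subgradient of ReLU: 1 where x > 0, else 0 (the usual choice at x = 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-1.0, 0.0, 2.0])
print(relu_grad(x))  # -> [0., 0., 1.]
```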
Advantages
- Computational Efficiency:
  - Simple to compute: requires only a comparison and a max operation.
- Mitigation of Vanishing Gradients:
  - Unlike sigmoid or tanh, ReLU retains gradients for positive activations, enabling deeper networks to train more effectively.
- Encourages Sparse Representations:
  - Zeroing out negative inputs results in fewer active neurons, which can improve model interpretability and reduce overfitting (see the sketch after this list).
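As an informal illustration of this sparsity, one can count the fraction of zero activations after applying ReLU to random pre-activations; the zero-mean Gaussian inputs below are an assumption standing in for a randomly initialized layer:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal((1024, 256))   # hypothetical pre-activations
activations = np.maximum(0.0, pre_activations)

sparsity = np.mean(activations == 0.0)
print(f"Fraction of inactive (zero) units: {sparsity:.2f}")  # roughly 0.50 for zero-mean inputs
```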
Disadvantages
- Dying ReLU Problem:
  - Neurons whose pre-activations stay at $x \le 0$ receive zero gradients and can become permanently inactive during training.
  - Common with poor initialization or overly large learning rates.
- Unbounded Outputs:
  - Can lead to exploding activations, especially in deeper layers, if not managed with techniques such as normalization.
- Sensitivity to Initialization:
  - Proper weight initialization (e.g., He initialization) is crucial for preventing gradient issues (see the sketch after this list).
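A minimal sketch of He (Kaiming) initialization for a ReLU layer, assuming a plain NumPy dense layer; the layer sizes, batch size, and function name are illustrative:

```python
import numpy as np

def he_init(fan_in: int, fan_out: int, rng: np.random.Generator) -> np.ndarray:
    """He initialization: weights ~ N(0, 2 / fan_in), suited to ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(42)
W = he_init(fan_in=512, fan_out=256, rng=rng)
b = np.zeros(256)

x = rng.standard_normal((32, 512))   # hypothetical batch of inputs
h = np.maximum(0.0, x @ W + b)       # ReLU layer forward pass
print(h.shape, float(h.var()))       # activation variance stays on the order of 1
```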
Visualization:
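A minimal matplotlib sketch that plots ReLU and its derivative (assuming NumPy and matplotlib are available; the axis range and labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4.0, 4.0, 401)
relu = np.maximum(0.0, x)
grad = (x > 0).astype(float)

fig, ax = plt.subplots()
ax.plot(x, relu, label="ReLU(x) = max(0, x)")
ax.plot(x, grad, linestyle="--", label="ReLU'(x)")
ax.set_xlabel("x")
ax.set_ylabel("output")
ax.legend()
plt.show()
```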
Variants
- Leaky ReLU:
  - Allows a small negative slope $\alpha$ for $x \le 0$ to prevent dying neurons: $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ for $x \le 0$ (see the sketch after this list).
  - A typical value of $\alpha$ is $0.01$.
- Parametric ReLU (PReLU):
  - Generalizes Leaky ReLU by making $\alpha$ a learnable parameter.
- Exponential Linear Unit (ELU):
  - Smooths the transition for $x \le 0$: $f(x) = x$ for $x > 0$ and $f(x) = \alpha (e^{x} - 1)$ for $x \le 0$.
  - Provides non-zero gradients for negative inputs and a bounded range for negative outputs.
- Scaled Exponential Linear Unit (SELU):
  - A self-normalizing variant designed to maintain activation mean and variance close to $0$ and $1$, respectively.
- Maxout:
  - Outputs the maximum of multiple linear functions: $f(x) = \max_{i}(w_i^{\top} x + b_i)$.
  - Highly flexible but computationally expensive.
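A minimal NumPy sketch of the piecewise formulas for Leaky ReLU and ELU as given above; the default $\alpha$ values follow common conventions and the function names are illustrative:

```python
import numpy as np

def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def elu(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(leaky_relu(x))  # -> [-0.02, -0.005, 0., 1., 3.]
print(elu(x))         # -> approximately [-0.865, -0.393, 0., 1., 3.]
```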