Definition:
The Rectified Linear Unit (ReLU) activation function is defined as:

$$f(x) = \max(0, x)$$
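
A minimal sketch in NumPy (the array `x` below is a hypothetical batch of pre-activations, chosen only for illustration):

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

# Negative inputs are zeroed; positive inputs pass through unchanged.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
```
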
Properties

  1. Piecewise Linear:

    • Linear for $x > 0$.
    • Constant at $0$ for $x \le 0$.
  2. Range:

    • Outputs values in $[0, \infty)$.
  3. Non-linear Activation:

    • Despite being linear for $x > 0$, it introduces non-linearity by zeroing out negative inputs, enabling the network to learn complex mappings.
  4. Sparse Activation:

    • Only neurons with $x > 0$ are activated, leading to efficient representations.
  5. Gradient Behavior:

    • Derivative: $f'(x) = 1$ for $x > 0$ and $f'(x) = 0$ for $x < 0$; the derivative is undefined at $x = 0$, where implementations conventionally use $0$.
    • Gradients are preserved for $x > 0$, mitigating the vanishing gradient problem (see the sketch after this list).

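A short sketch of the gradient computation in NumPy, assuming the common convention of assigning a derivative of $0$ at $x = 0$:

```python
import numpy as np

def relu_grad(x: np.ndarray) -> np.ndarray:
    """Derivative of ReLU: 1 where x > 0, else 0 (0 chosen at x = 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-1.0, 0.0, 3.0])
print(relu_grad(x))  # [0. 0. 1.]
```
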
Advantages

  1. Computational Efficiency:

    • Simple to compute: requires only a comparison with zero (a single max operation).
  2. Mitigation of Vanishing Gradients:

    • Unlike sigmoid or tanh, ReLU retains gradients for positive activations, enabling deeper networks to train more effectively.
  3. Encourages Sparse Representations:

    • Zeroing out negative inputs results in fewer active neurons, improving model interpretability and reducing overfitting (illustrated in the sketch below).
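
To illustrate the sparsity point, a toy sketch assuming zero-mean Gaussian pre-activations, which makes roughly half the units active; the shapes and numbers here are illustrative, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal((1000, 256))  # toy batch of pre-activations
activations = np.maximum(0.0, pre_activations)

# Fraction of units left non-zero after ReLU.
active_fraction = np.mean(activations > 0)
print(f"Active units: {active_fraction:.1%}")  # ~50% for zero-mean inputs
```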

Disadvantages

  1. Dying ReLU Problem:

    • Neurons whose pre-activation is negative for all inputs receive zero gradient and can become permanently inactive during training.
    • Common with poor weight initialization or an overly large learning rate.
  2. Unbounded Outputs:

    • Can lead to exploding activations, especially in deeper layers if not managed with techniques like normalization.
  3. Sensitivity to Initialization:

    • Proper weight initialization (e.g., He initialization) is crucial for preventing gradient issues; a sketch follows this list.
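
A minimal sketch of He (Kaiming) initialization for a ReLU layer in NumPy; the layer sizes are hypothetical:

```python
import numpy as np

def he_init(fan_in: int, fan_out: int, rng: np.random.Generator) -> np.ndarray:
    """He normal initialization: std = sqrt(2 / fan_in), which keeps
    activation variance roughly constant across ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.std())  # close to sqrt(2/512) ~= 0.0625
```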

Visualization:

(Plot of $f(x) = \max(0, x)$: flat at $0$ for $x \le 0$, the identity line $y = x$ for $x > 0$.)

Variants

  1. Leaky ReLU:

    • Allows a small negative slope $\alpha$ for $x \le 0$ to prevent dying neurons: $f(x) = x$ for $x > 0$, $\alpha x$ otherwise (see the sketch after this list).
    • A typical value of $\alpha$ is $0.01$.
  2. Parametric ReLU (PReLU):

    • Generalizes Leaky ReLU by making $\alpha$ a learnable parameter.
  3. Exponential Linear Unit (ELU):

    • Smooths the transition for $x \le 0$: $f(x) = x$ for $x > 0$, $\alpha(e^x - 1)$ otherwise.
    • Provides non-zero gradients for negative inputs and a bounded range for negative outputs.
  4. Scaled Exponential Linear Unit (SELU):

    • A self-normalizing variant designed to keep the mean and variance of activations close to $0$ and $1$, respectively.
  5. Maxout:

    • Outputs the maximum of multiple linear functions: $f(x) = \max_i \left(w_i^\top x + b_i\right)$.

    • Highly flexible but computationally expensive.
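
A hedged NumPy sketch of the variants above; the SELU constants are rounded versions of the published values from Klambauer et al. (2017), and the maxout weights are random placeholders:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    """SELU: scaled ELU with fixed constants (rounded)."""
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W, b):
    """Maxout: element-wise max over k affine maps.
    W has shape (k, d_in, d_out); b has shape (k, d_out)."""
    return np.max(np.einsum('kio,i->ko', W, x) + b, axis=0)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))  # [-0.02  -0.005  0.5    2.   ]
print(elu(x))         # negative inputs squashed toward -alpha
print(selu(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4, 3))  # k = 2 pieces, 4 inputs, 3 outputs
b = rng.standard_normal((2, 3))
print(maxout(x, W, b))              # output has shape (3,)
```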