- categories: Data Science, Definition
Definition:
The softmax function maps a vector of real numbers into a probability distribution over a finite set of classes. For an input vector $\mathbf{z} = (z_1, \dots, z_K) \in \mathbb{R}^K$, the softmax function is defined as:
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$
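A minimal sketch of this definition in Python with NumPy (the function name `softmax` is illustrative, not taken from a particular library):

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector z to a probability distribution."""
    exp_z = np.exp(z)             # elementwise e^{z_i}
    return exp_z / exp_z.sum()    # normalize so the outputs sum to 1

print(softmax(np.array([0.5, 1.0, -2.0])))   # ≈ [0.366, 0.604, 0.030]
```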
Properties:
- Output Range:
  The softmax outputs are probabilities, so $0 < \text{softmax}(\mathbf{z})_i < 1$ and $\sum_{i=1}^{K} \text{softmax}(\mathbf{z})_i = 1$.
- Exponential Scaling:
  Larger values of $z_i$ have exponentially greater contributions to the numerator, making the softmax function sensitive to relative differences in $\mathbf{z}$.
- Translation Invariance:
  Adding a constant $c$ to all elements of $\mathbf{z}$ does not change the output of the softmax:
  $$\text{softmax}(z_i + c) = \text{softmax}(z_i)$$
- Derivatives:
  For an individual output $\sigma_i = \text{softmax}(\mathbf{z})_i$, the derivative with respect to $z_j$ is:
  $$\frac{\partial \sigma_i}{\partial z_j} = \sigma_i \left(\delta_{ij} - \sigma_j\right)$$
  where $\delta_{ij}$ is the Kronecker delta ($\delta_{ij} = 1$ if $i = j$, and $\delta_{ij} = 0$ otherwise). A numerical check of these properties is sketched after this list.
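A small numerical check of the translation-invariance and derivative properties (a Python/NumPy sketch; the variable names are illustrative):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([0.5, 1.0, -2.0])
s = softmax(z)

# Translation invariance: adding a constant leaves the output unchanged.
assert np.allclose(softmax(z + 10.0), s)

# Jacobian from the derivative formula: J[i, j] = s_i * (delta_ij - s_j).
jacobian = np.diag(s) - np.outer(s, s)

# Each row sums to zero because the outputs are constrained to sum to 1.
assert np.allclose(jacobian.sum(axis=1), 0.0)
```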
Applications:
- Multiclass Classification:
  - In machine learning, softmax is commonly used in the final layer of a neural network to model probabilities over $K$ classes.
  - Given logits (raw scores) $z_1, \dots, z_K$, the predicted probability for class $k$ is:
    $$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
- Cross-Entropy Loss:
  The softmax function is often paired with the cross-entropy loss for classification tasks. Given a one-hot true label $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$, the loss is (see the sketch after this list):
  $$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
- Attention Mechanisms:
  - Softmax is used in attention-based models (e.g., Transformers) to compute attention weights.
  - It ensures the weights sum to 1, so they can be interpreted as probabilities.
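A minimal sketch of the softmax/cross-entropy pairing mentioned above (Python/NumPy; the logits and label are made-up illustrative values):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())       # max-subtraction for stability (see below)
    return exp_z / exp_z.sum()

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for a one-hot label y_true."""
    return -np.sum(y_true * np.log(y_pred + eps))

logits = np.array([2.0, 1.0, 0.1])    # raw scores from a final layer
y_true = np.array([1.0, 0.0, 0.0])    # one-hot label for class 0

probs = softmax(logits)
loss = cross_entropy(y_true, probs)
print(probs, loss)                    # probs ≈ [0.659, 0.242, 0.099], loss ≈ 0.417
```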
Example:
Given an input vector $\mathbf{z} = (z_1, \dots, z_K)$, compute the softmax output step by step:
- Compute the exponentials: $e^{z_1}, \dots, e^{z_K}$.
- Compute the sum: $S = \sum_{j=1}^{K} e^{z_j}$.
- Compute the probabilities: $\text{softmax}(\mathbf{z})_i = e^{z_i} / S$.
A worked numerical version of these steps is sketched below.
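The concrete input values from the original example were not preserved; as an assumed illustration with $\mathbf{z} = (1, 2, 3)$:

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])   # assumed example values, not from the original
exp_z = np.exp(z)               # ≈ [2.718, 7.389, 20.086]
total = exp_z.sum()             # ≈ 30.193
probs = exp_z / total           # ≈ [0.090, 0.245, 0.665]
print(probs, probs.sum())       # the probabilities sum to 1
```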
Comparison with Sigmoid Function:
- Sigmoid: Used for binary classification; outputs a single probability.
- Softmax: Generalizes the sigmoid to multiclass problems; outputs a probability distribution over classes (see the sketch below).
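As a check of this relationship (a sketch, not from the original note): applying softmax to the two logits $(z, 0)$ reproduces the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$ for the first class:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - z.max())
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = 0.7                                    # arbitrary logit
two_class = softmax(np.array([z, 0.0]))    # two-class softmax with logits (z, 0)
assert np.isclose(two_class[0], sigmoid(z))
```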
Numerical Stability:
To avoid numerical overflow when $\mathbf{z}$ has large values, subtract the maximum value from all elements before applying softmax; by translation invariance the result is unchanged:
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i - \max_j z_j}}{\sum_{k=1}^{K} e^{z_k - \max_j z_j}}$$
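A sketch of the numerically stable variant in Python/NumPy (the function name is illustrative):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the max-subtraction trick to avoid overflow."""
    shifted = z - np.max(z)      # largest exponent becomes e^0 = 1
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# A naive softmax would overflow here (e^1000 is inf); the shifted version stays finite.
z = np.array([1000.0, 1000.5, 999.0])
print(stable_softmax(z))         # finite probabilities that sum to 1
```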