Definition:
The softmax function maps a vector of real numbers into a probability distribution over a finite set of classes. For an input vector $z = (z_1, \dots, z_K) \in \mathbb{R}^K$, the softmax function is defined as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K$$
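As a minimal illustrative sketch (not part of the original text; the function name and test vector are assumptions), the definition translates directly into Python with NumPy:

    import numpy as np

    def softmax(z):
        """Map a real-valued vector to a probability distribution (naive form)."""
        exp_z = np.exp(z)           # elementwise exponentials e^{z_i}
        return exp_z / exp_z.sum()  # normalize so the outputs sum to 1

    print(softmax(np.array([1.0, 2.0, 3.0])))  # illustrative input -> ~[0.090 0.245 0.665]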

Properties:

  1. Output Range:
    The softmax outputs are probabilities, so each $\mathrm{softmax}(z)_i \in (0, 1)$ and $\sum_{i=1}^{K} \mathrm{softmax}(z)_i = 1$.

  2. Exponential Scaling:
    Larger values of $z_i$ have exponentially greater contributions to the numerator, making the softmax function sensitive to relative differences between the components of $z$.

  3. Translation Invariance:
    Adding the same constant $c$ to every element of $z$ does not change the output of the softmax, because the factor $e^{c}$ cancels between numerator and denominator:

    $$\mathrm{softmax}(z + c) = \mathrm{softmax}(z)$$

  4. Derivatives:
    For an individual output $\mathrm{softmax}(z)_i$, the derivative with respect to $z_j$ is:

    $$\frac{\partial\, \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i \left( \delta_{ij} - \mathrm{softmax}(z)_j \right)$$

    where $\delta_{ij}$ is the Kronecker delta ($\delta_{ij} = 1$ if $i = j$, and $0$ otherwise). A numerical check of this formula is sketched after this list.
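The derivative formula and the translation-invariance property can be checked numerically; the following Python/NumPy sketch is an illustrative addition (the logits and tolerances are assumptions):

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - z.max())   # max-shift for numerical stability
        return exp_z / exp_z.sum()

    def softmax_jacobian(z):
        """J[i, j] = softmax(z)_i * (delta_ij - softmax(z)_j) = diag(s) - s s^T."""
        s = softmax(z)
        return np.diag(s) - np.outer(s, s)

    z = np.array([0.5, -1.0, 2.0])                    # illustrative logits
    print(np.allclose(softmax(z), softmax(z + 7.3)))  # True: adding a constant changes nothing

    # Central finite differences reproduce the analytic Jacobian.
    eps = 1e-6
    numeric = np.column_stack([
        (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
        for e in np.eye(len(z))
    ])
    print(np.allclose(softmax_jacobian(z), numeric, atol=1e-6))  # True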

Applications:

  1. Multiclass Classification:

    • In machine learning, softmax is commonly used in the final layer of a neural network to model a probability distribution over classes.
    • Given logits (raw scores) $z_1, \dots, z_K$, the predicted probability for class $i$ is:

      $$P(y = i \mid z) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
  2. Cross-Entropy Loss:
    The softmax function is often paired with the cross-entropy loss for classification tasks. Given a true label $y$ (one-hot encoded) and predicted probabilities $\hat{y} = \mathrm{softmax}(z)$:

    $$\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log \hat{y}_i$$

    A short numerical sketch of this pairing follows this list.

  3. Attention Mechanisms:

    • Softmax is used in attention-based models (e.g., Transformers) to compute attention weights.
    • Softmax ensures the weights sum to 1, so they can be read as a probability distribution over the attended positions (see the sketch after this list).
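As an illustrative sketch of the first two applications (the logits, label, and function names are assumptions, not from the original text), softmax and cross-entropy combine as follows in Python with NumPy:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - z.max())
        return exp_z / exp_z.sum()

    logits = np.array([2.0, 0.5, -1.0])  # raw scores for 3 classes (illustrative)
    probs = softmax(logits)              # predicted class probabilities
    true_class = 0                       # assumed ground-truth label
    loss = -np.log(probs[true_class])    # cross-entropy with a one-hot target
    print(probs, loss)

And a minimal sketch of softmax producing attention weights in scaled dot-product attention (shapes and random values are likewise illustrative):

    import numpy as np

    def attention_weights(Q, K):
        """Row-wise softmax of scaled dot-product scores; each row sums to 1."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        exp_s = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return exp_s / exp_s.sum(axis=-1, keepdims=True)

    Q = np.random.randn(4, 8)       # 4 query vectors of dimension 8
    K = np.random.randn(6, 8)       # 6 key vectors of dimension 8
    W = attention_weights(Q, K)
    print(W.shape, W.sum(axis=-1))  # (4, 6); each row sums to ~1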

Example:
Given an input vector $z$, compute the softmax output in three steps (a numerical instance is worked out after the list):

  1. Compute the exponentials: $e^{z_i}$ for each component.
  2. Compute the sum: $\sum_{j} e^{z_j}$.
  3. Compute the probabilities: $\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j} e^{z_j}$.
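The specific numbers of the original example are not given above; as an illustrative instance, assume $z = (1, 2, 3)$:

    $$\begin{aligned}
    e^{1} \approx 2.718, \qquad e^{2} &\approx 7.389, \qquad e^{3} \approx 20.086 \\
    \sum_j e^{z_j} &\approx 2.718 + 7.389 + 20.086 = 30.193 \\
    \mathrm{softmax}(z) &\approx (0.090,\ 0.245,\ 0.665)
    \end{aligned}$$

The three probabilities sum to 1, and the largest logit receives the largest probability.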

Comparison with Sigmoid Function:

  • Sigmoid: Used for binary classification, outputs a single probability.
  • Softmax: Generalizes the sigmoid to multiclass problems and outputs a probability distribution over the classes (the two-class case reduces to a sigmoid, as shown below).
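The relationship can be made precise: for two classes with logits $z_1$ and $z_2$, the first softmax output reduces to a sigmoid of the logit difference:

    $$\mathrm{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$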

Numerical Stability:
To avoid numerical overflow when $z$ has large values, subtract the maximum value from all elements before applying the softmax:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_{j} e^{z_j - \max_k z_k}}$$

By translation invariance this shift does not change the result, but it keeps every exponential in a safe range.
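A minimal Python/NumPy sketch of the stabilized computation (the function name and test values are illustrative assumptions):

    import numpy as np

    def stable_softmax(z):
        """Softmax with the maximum subtracted first, so np.exp never overflows."""
        shifted = z - np.max(z)   # largest shifted value is 0
        exp_z = np.exp(shifted)
        return exp_z / exp_z.sum()

    z = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(z) would overflow to inf
    print(stable_softmax(z))                # ~[0.090 0.245 0.665], equal to softmax([0, 1, 2])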