Definition:
The softmax function maps a vector of real numbers into a probability distribution over a finite set of classes. For an input vector $z = (z_1, \dots, z_K) \in \mathbb{R}^K$, the softmax function is defined as:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K$$
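As a minimal illustrative sketch (not part of the original text; the function name and test vector are assumptions), the definition translates directly into Python with NumPy:

    import numpy as np

    def softmax(z):
        """Map a real-valued vector to a probability distribution (naive form)."""
        exp_z = np.exp(z)           # elementwise exponentials e^{z_i}
        return exp_z / exp_z.sum()  # normalize so the outputs sum to 1

    print(softmax(np.array([1.0, 2.0, 3.0])))  # illustrative input -> ~[0.090 0.245 0.665]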

Properties:

  1. Output Range:
    The softmax outputs are probabilities, so each $\mathrm{softmax}(z)_i \in (0, 1)$ and $\sum_{i=1}^{K} \mathrm{softmax}(z)_i = 1$.

  2. Exponential Scaling:
    Larger values of $z_i$ have exponentially greater contributions to the numerator, making the softmax function sensitive to relative differences between the components of $z$.

  3. Translation Invariance:
    Adding the same constant $c$ to every element of $z$ does not change the output of the softmax, because the factor $e^{c}$ cancels between numerator and denominator:

    $$\mathrm{softmax}(z + c) = \mathrm{softmax}(z)$$

  4. Derivatives:
    For an individual output $\mathrm{softmax}(z)_i$, the derivative with respect to $z_j$ is:

    $$\frac{\partial\, \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i \left( \delta_{ij} - \mathrm{softmax}(z)_j \right)$$

    where $\delta_{ij}$ is the Kronecker delta ($\delta_{ij} = 1$ if $i = j$, and $0$ otherwise). A numerical check of this formula is sketched after this list.
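The derivative formula and the translation-invariance property can be checked numerically; the following Python/NumPy sketch is an illustrative addition (the logits and tolerances are assumptions):

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - z.max())   # max-shift for numerical stability
        return exp_z / exp_z.sum()

    def softmax_jacobian(z):
        """J[i, j] = softmax(z)_i * (delta_ij - softmax(z)_j) = diag(s) - s s^T."""
        s = softmax(z)
        return np.diag(s) - np.outer(s, s)

    z = np.array([0.5, -1.0, 2.0])                    # illustrative logits
    print(np.allclose(softmax(z), softmax(z + 7.3)))  # True: adding a constant changes nothing

    # Central finite differences reproduce the analytic Jacobian.
    eps = 1e-6
    numeric = np.column_stack([
        (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
        for e in np.eye(len(z))
    ])
    print(np.allclose(softmax_jacobian(z), numeric, atol=1e-6))  # True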

Applications:

  1. Multiclass Classification:

    • In machine learning, softmax is commonly used in the final layer of a neural network to model a probability distribution over classes.
    • Given logits (raw scores) $z_1, \dots, z_K$, the predicted probability for class $i$ is:

      $$P(y = i \mid z) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
  2. Cross-Entropy Loss:
    The softmax function is often paired with the cross-entropy loss for classification tasks. Given a true label $y$ (one-hot encoded) and predicted probabilities $\hat{y} = \mathrm{softmax}(z)$:

    $$\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^{K} y_i \log \hat{y}_i$$

    A short numerical sketch of this pairing follows this list.

  3. Attention Mechanisms:

    • Softmax is used in attention-based models (e.g., Transformers) to compute attention weights.
    • Softmax ensures the weights sum to 1, so they can be read as a probability distribution over the attended positions (see the sketch after this list).
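As an illustrative sketch of the first two applications (the logits, label, and function names are assumptions, not from the original text), softmax and cross-entropy combine as follows in Python with NumPy:

    import numpy as np

    def softmax(z):
        exp_z = np.exp(z - z.max())
        return exp_z / exp_z.sum()

    logits = np.array([2.0, 0.5, -1.0])  # raw scores for 3 classes (illustrative)
    probs = softmax(logits)              # predicted class probabilities
    true_class = 0                       # assumed ground-truth label
    loss = -np.log(probs[true_class])    # cross-entropy with a one-hot target
    print(probs, loss)

And a minimal sketch of softmax producing attention weights in scaled dot-product attention (shapes and random values are likewise illustrative):

    import numpy as np

    def attention_weights(Q, K):
        """Row-wise softmax of scaled dot-product scores; each row sums to 1."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        exp_s = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return exp_s / exp_s.sum(axis=-1, keepdims=True)

    Q = np.random.randn(4, 8)       # 4 query vectors of dimension 8
    K = np.random.randn(6, 8)       # 6 key vectors of dimension 8
    W = attention_weights(Q, K)
    print(W.shape, W.sum(axis=-1))  # (4, 6); each row sums to ~1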

Example:
Given an input vector $z$, compute the softmax output in three steps (a numerical instance is worked out after the list):

  1. Compute the exponentials: $e^{z_i}$ for each component.
  2. Compute the sum: $\sum_{j} e^{z_j}$.
  3. Compute the probabilities: $\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j} e^{z_j}$.
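The specific numbers of the original example are not given above; as an illustrative instance, assume $z = (1, 2, 3)$:

    $$\begin{aligned}
    e^{1} \approx 2.718, \qquad e^{2} &\approx 7.389, \qquad e^{3} \approx 20.086 \\
    \sum_j e^{z_j} &\approx 2.718 + 7.389 + 20.086 = 30.193 \\
    \mathrm{softmax}(z) &\approx (0.090,\ 0.245,\ 0.665)
    \end{aligned}$$

The three probabilities sum to 1, and the largest logit receives the largest probability.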

Comparison with Sigmoid Function:

  • Sigmoid: Used for binary classification, outputs a single probability.
  • Softmax: Generalizes the sigmoid to multiclass problems and outputs a probability distribution over the classes (the two-class case reduces to a sigmoid, as shown below).
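The relationship can be made precise: for two classes with logits $z_1$ and $z_2$, the first softmax output reduces to a sigmoid of the logit difference:

    $$\mathrm{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$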

Numerical Stability:
To avoid numerical overflow when $z$ has large values, subtract the maximum value from all elements before applying the softmax:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_{j} e^{z_j - \max_k z_k}}$$

By translation invariance this shift does not change the result, but it keeps every exponential in a safe range.
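A minimal Python/NumPy sketch of the stabilized computation (the function name and test values are illustrative assumptions):

    import numpy as np

    def stable_softmax(z):
        """Softmax with the maximum subtracted first, so np.exp never overflows."""
        shifted = z - np.max(z)   # largest shifted value is 0
        exp_z = np.exp(shifted)
        return exp_z / exp_z.sum()

    z = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(z) would overflow to inf
    print(stable_softmax(z))                # ~[0.090 0.245 0.665], equal to softmax([0, 1, 2])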