Definition:
Cross-entropy quantifies the difference between two probability distributions $P$ (the true distribution) and $Q$ (the predicted distribution) over the same set of events $\mathcal{X}$. It is defined as:

    $H(P, Q) = -\sum_{x \in \mathcal{X}} P(x) \log Q(x)$

where $P(x)$ and $Q(x)$ are the probabilities assigned to event $x$ by distributions $P$ and $Q$, respectively.
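As a minimal sketch of this definition, the snippet below computes $H(P, Q)$ for two small discrete distributions with NumPy; the distributions p and q are made up purely for illustration.

    import numpy as np

    def cross_entropy(p, q):
        """Cross-entropy H(P, Q) = -sum_x P(x) * log Q(x), in nats."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return -np.sum(p * np.log(q))

    p = np.array([0.7, 0.2, 0.1])   # true distribution (illustrative)
    q = np.array([0.6, 0.3, 0.1])   # predicted distribution (illustrative)
    print(cross_entropy(p, q))      # ~0.829 nats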

Intuition:

  • Cross-entropy measures how well the predicted distribution $Q$ approximates the true distribution $P$.
  • If $Q = P$, the cross-entropy equals the Shannon entropy $H(P)$, representing the minimum encoding cost.
  • Larger differences between $P$ and $Q$ increase the cross-entropy, signifying greater inefficiency in using $Q$ to encode $P$.
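A quick numerical check of both points, using an illustrative three-outcome distribution: the cross-entropy is smallest when the prediction equals the truth, and grows as the prediction drifts away.

    import numpy as np

    def cross_entropy(p, q):
        return -np.sum(np.asarray(p) * np.log(q))

    p       = np.array([0.7, 0.2, 0.1])   # true distribution (illustrative)
    q_close = np.array([0.6, 0.3, 0.1])   # mild mismatch
    q_far   = np.array([0.2, 0.3, 0.5])   # larger mismatch

    print(cross_entropy(p, p))        # H(P) ~= 0.802 nats (the minimum)
    print(cross_entropy(p, q_close))  # ~= 0.829 nats
    print(cross_entropy(p, q_far))    # ~= 1.437 nats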

Key Properties:

  1. Non-Negativity:

    $H(P, Q) \geq H(P) \geq 0$

    Equality $H(P, Q) = H(P)$ occurs only if $Q = P$.

  2. Relation to Shannon Entropy:

    $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$

    where $D_{\mathrm{KL}}(P \,\|\, Q)$ is the Kullback-Leibler divergence, measuring the extra cost of encoding $P$ using $Q$ (see the numerical check after this list).

  3. Logarithm Base:

    • Base-2 logarithms yield entropy in bits.
    • Natural logarithms (base $e$) give entropy in nats.
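A brief numerical check of the decomposition in property 2 and of the base convention in property 3, using an illustrative pair of distributions:

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])   # illustrative true distribution
    q = np.array([0.6, 0.3, 0.1])   # illustrative predicted distribution

    h_p  = -np.sum(p * np.log(p))        # Shannon entropy H(P), in nats
    h_pq = -np.sum(p * np.log(q))        # cross-entropy H(P, Q), in nats
    d_kl = np.sum(p * np.log(p / q))     # KL divergence D_KL(P || Q), in nats

    print(np.isclose(h_pq, h_p + d_kl))  # True: H(P, Q) = H(P) + D_KL(P || Q)
    print(h_pq / np.log(2))              # the same quantity expressed in bits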

Applications:

  1. Machine Learning:

    • Used as a loss function for classification tasks, especially when predictions are probabilities (e.g., softmax outputs).
    • Binary cross-entropy for binary classification:

      $\mathcal{L} = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$

    • Categorical cross-entropy for multi-class classification (see the sketch after this list):

      $\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$

      where $y_c$ is the one-hot encoded true label, and $\hat{y}_c$ is the predicted probability for class $c$.
  2. Information Theory:

    • Measuring the efficiency of coding when approximating one distribution by another.
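As a minimal sketch of the two loss variants above, written in plain NumPy with made-up labels and predictions rather than any particular framework's API:

    import numpy as np

    def binary_cross_entropy(y, y_hat, eps=1e-12):
        """Mean binary cross-entropy; y in {0, 1}, y_hat in (0, 1)."""
        y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
        """Mean categorical cross-entropy; each row of y_hat sums to 1."""
        y_hat = np.clip(y_hat, eps, 1.0)
        return -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

    # Illustrative data
    y_bin     = np.array([1, 0, 1])
    y_bin_hat = np.array([0.9, 0.2, 0.7])
    print(binary_cross_entropy(y_bin, y_bin_hat))

    y_cat     = np.array([[1, 0, 0], [0, 1, 0]])               # one-hot labels
    y_cat_hat = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]])   # softmax-like outputs
    print(categorical_cross_entropy(y_cat, y_cat_hat))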

Interpretation in Optimization:
Minimizing cross-entropy aligns the predicted probabilities $Q$ with the true probabilities $P$, leading to better classification or approximation.
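A small illustration of this, assuming a softmax parameterization and plain gradient descent on the logits (for softmax outputs, the gradient of the cross-entropy with respect to the logits is simply $Q - P$); the target distribution and learning rate are made up for illustration.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    p = np.array([0.7, 0.2, 0.1])   # illustrative true distribution
    logits = np.zeros(3)            # start from a uniform prediction

    for step in range(2000):
        q = softmax(logits)
        logits -= 0.1 * (q - p)     # gradient of H(P, softmax(logits)) w.r.t. logits

    q = softmax(logits)
    print(q)                        # ~ [0.7, 0.2, 0.1]: Q has aligned with P
    print(-np.sum(p * np.log(q)))   # ~ H(P) ~= 0.802 nats, the minimum achievable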