Kullback-Leibler Divergence (KL Divergence)

Definition:
The Kullback-Leibler (KL) divergence measures how one probability distribution $Q$ (an approximation) differs from a reference distribution $P$ (the true distribution). For discrete distributions over events $x \in \mathcal{X}$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

For continuous distributions:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

where $p$ and $q$ are the probability density functions of $P$ and $Q$.
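
A minimal Python sketch of the discrete formula may help make it concrete; the function name kl_divergence and the example probability vectors are illustrative choices, and the result is in nats because the natural logarithm is used.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    p and q are probability vectors over the same events (non-negative,
    each summing to 1). Terms with p[i] == 0 contribute 0, following the
    usual 0 * log 0 = 0 convention.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # skip zero-probability events of P
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]   # reference ("true") distribution P
q = [0.4, 0.4, 0.2]   # approximating distribution Q

print(kl_divergence(p, q))   # ~0.025 nats
# Cross-check: scipy.stats.entropy(p, q) computes the same quantity.
```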

Intuition:
KL divergence quantifies the inefficiency of approximating $P$ using $Q$.

  • If $P = Q$, then $D_{\mathrm{KL}}(P \,\|\, Q) = 0$.
  • Larger values indicate a greater difference between $P$ and $Q$.
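
Both points can be checked numerically. The sketch below assumes SciPy is available and uses scipy.stats.entropy, which returns the KL divergence (in nats) when given two distributions; the example distributions are illustrative.

```python
from scipy.stats import entropy  # entropy(p, q) == D_KL(P || Q) in nats

p = [0.5, 0.3, 0.2]

# Q identical to P, then progressively further from P.
candidates = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.2, 0.6],
    [0.05, 0.05, 0.9],
]

for q in candidates:
    print(q, "->", round(entropy(p, q), 4))
# Prints 0.0 for the identical Q, then increasingly large values
# as Q moves further away from P.
```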

Key Properties:

  1. Non-Negativity:

    $$D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$$

    Equality occurs only if $P = Q$ almost everywhere.

  2. Asymmetry:

    $$D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P) \quad \text{in general}$$

    KL divergence is not a true distance metric because it is not symmetric and does not satisfy the triangle inequality. (A numerical check of these properties follows this list.)

  3. Additivity for Independent Variables:
    If $P(x, y) = P_1(x)\,P_2(y)$ and $Q(x, y) = Q_1(x)\,Q_2(y)$, then:

    $$D_{\mathrm{KL}}(P \,\|\, Q) = D_{\mathrm{KL}}(P_1 \,\|\, Q_1) + D_{\mathrm{KL}}(P_2 \,\|\, Q_2)$$
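
These properties can be verified numerically on small examples. The sketch below assumes NumPy and SciPy are available (scipy.stats.entropy returns the KL divergence when given two distributions); the distributions themselves are illustrative.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) == D_KL(P || Q)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# 1. Non-negativity.
assert entropy(p, q) >= 0

# 2. Asymmetry: swapping the arguments gives a different value.
print(entropy(p, q), entropy(q, p))   # two different numbers

# 3. Additivity for independent variables: the joint distribution of two
#    independent components is the outer product of the marginals.
p2, q2 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
joint_p = np.outer(p, p2).ravel()     # P(x, y) = P1(x) * P2(y)
joint_q = np.outer(q, q2).ravel()     # Q(x, y) = Q1(x) * Q2(y)
print(np.isclose(entropy(joint_p, joint_q),
                 entropy(p, q) + entropy(p2, q2)))   # True
```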

Applications:

  1. Machine Learning:

    • Used in variational inference to measure the difference between the true posterior distribution and an approximate distribution.
    • Cross-entropy loss includes KL divergence as a component: $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$.
  2. Information Theory:

    • Quantifies the inefficiency of encoding messages drawn from $P$ using a code optimized for $Q$ (see the sketch after this list).
  3. Natural Sciences:

    • Comparing probability distributions in areas like genetics, linguistics, and physics.
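
To make the coding interpretation concrete, the sketch below uses illustrative symbol probabilities and an idealized code that spends $-\log_2 Q(x)$ bits per symbol (integer-length rounding ignored): the average cost of the mismatched code exceeds the entropy of $P$ by exactly $D_{\mathrm{KL}}(P \,\|\, Q)$ in bits.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # true symbol frequencies P
q = np.array([0.25, 0.25, 0.5])   # frequencies the code was optimized for (Q)

entropy_p = -np.sum(p * np.log2(p))    # optimal average bits per symbol
avg_len_q = -np.sum(p * np.log2(q))    # average bits with the Q-optimized code
kl_bits = np.sum(p * np.log2(p / q))   # D_KL(P || Q) in bits

print(entropy_p, avg_len_q, kl_bits)   # 1.5, 1.75, 0.25
# avg_len_q - entropy_p == kl_bits: the extra bits paid per symbol
# for using a code tuned to the wrong distribution.
```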

Relation to Entropy:
KL divergence relates to entropy and cross-entropy:

$$D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$$

where $H(P, Q) = -\sum_x P(x) \log Q(x)$ is the cross-entropy, and $H(P) = -\sum_x P(x) \log P(x)$ is the Shannon entropy of $P$.
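
The identity follows from the definition by splitting the logarithm (shown here for the discrete case):

$$
D_{\mathrm{KL}}(P \,\|\, Q)
  = \sum_x P(x) \log \frac{P(x)}{Q(x)}
  = \underbrace{-\sum_x P(x) \log Q(x)}_{H(P,\,Q)} \;-\; \underbrace{\Bigl(-\sum_x P(x) \log P(x)\Bigr)}_{H(P)}
  = H(P, Q) - H(P).
$$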

Special Case (Binary Variables):
For binary random variables with $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
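
A short check with illustrative values $p = 0.5$ and $q = 0.1$, comparing the closed-form expression above with the general two-outcome sum:

```python
import math

p, q = 0.5, 0.1  # illustrative Bernoulli parameters

# Closed-form binary expression.
binary_kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Same value from the general discrete formula over the two outcomes.
general_kl = sum(pi * math.log(pi / qi)
                 for pi, qi in zip([p, 1 - p], [q, 1 - q]))

print(binary_kl)                             # ~0.511 nats
print(math.isclose(binary_kl, general_kl))   # True
```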