The definition of cross-entropy is as follows:

$$ H(P, Q) = - \sum_{x}P(x)\log Q(x) $$
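As a minimal sketch of this formula, assuming two hypothetical two-state distributions $P$ and $Q$ (the same numbers reappear in the classification example at the end):

```python
import torch

# Hypothetical discrete distributions over the same two states
P = torch.tensor([0.1, 0.9])
Q = torch.tensor([0.2, 0.8])

# H(P, Q) = -sum_x P(x) * log Q(x)
cross_entropy = -(P * torch.log(Q)).sum()
print(cross_entropy.item())  # ≈ 0.3618 nats
```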

The KL divergence is defined as follows:

$$ KL(P\|Q) = \sum_{x}P(x)\log\frac{P(x)}{Q(x)} $$
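The same kind of sketch works for KL divergence, reusing the hypothetical $P$ and $Q$ from above:

```python
import torch

P = torch.tensor([0.1, 0.9])
Q = torch.tensor([0.2, 0.8])

# KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
kl = (P * torch.log(P / Q)).sum()
print(kl.item())  # ≈ 0.0367 nats
```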

To relate the two, we first introduce the definition of information entropy:

$$ S(v) = - \sum_{i}p(v_i)\log p(v_i), $$

where $p(v_i)$ is the probability of state $v_i$. From the perspective of information theory, $S(v)$ is the amount of information required to remove the uncertainty of the system.
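For instance, the entropy of the hypothetical $P = [0.1, 0.9]$ used above:

```python
import torch

P = torch.tensor([0.1, 0.9])

# S(P) = -sum_i p(v_i) * log p(v_i)
entropy = -(P * torch.log(P)).sum()
print(entropy.item())  # ≈ 0.3251 nats
```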

The formula for KL divergence can be further transformed into the following form:

$$ KL(A\|B) = \sum_{i}\left[p_A(v_i)\log p_A(v_i) - p_A(v_i)\log p_B(v_i)\right], $$

where the first term on the right-hand side is the negative entropy of distribution $A$ (that is, $-S_A$), and the second term is the expectation of $-\log p_B(v_i)$ under distribution $A$, i.e., the cross-entropy $H(A, B)$ formalized below. $KL(A\|B)$ describes how different $B$ is from $A$, from the perspective of $A$.

It’s worth noting that $A$ usually stands for the data distribution and $B$ for the theoretical or hypothetical distribution. We formalize the cross-entropy in terms of distributions $A$ and $B$ as follows:

$$ H(A, B) = -\sum_{i}p_A(v_i)\log p_B(v_i). $$

From the definition, we can easily see:

$$ H(A, B) = KL(A\|B) + S_A. $$

If $S_A$ is a constant (as it is when $A$ is the fixed data distribution), then minimizing $H(A, B)$ over $B$ is equivalent to minimizing $KL(A\|B)$.
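This identity is easy to check numerically; a minimal sketch with the same hypothetical distributions:

```python
import torch

A = torch.tensor([0.1, 0.9])
B = torch.tensor([0.2, 0.8])

cross_entropy = -(A * torch.log(B)).sum()  # H(A, B)   ≈ 0.3618
kl = (A * torch.log(A / B)).sum()          # KL(A||B)  ≈ 0.0367
entropy_A = -(A * torch.log(A)).sum()      # S_A       ≈ 0.3251

# H(A, B) = KL(A||B) + S_A
print(torch.allclose(cross_entropy, kl + entropy_A))  # True
```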

Here is an example from machine learning: a binary classification problem with input $x$ and label $y$. Suppose the true distribution is $p(y) = [0.1, 0.9]$. This distribution represents the data, remains fixed, and corresponds to the distribution $A$ above. We use a machine learning model $f$ that outputs $p_f(y \mid x) = [0.2, 0.8]$, corresponding to the distribution $B$ above.

```python
import torch
import torch.nn.functional as F

# KL(y || f) = 0.1 * log(0.1 / 0.2) + 0.9 * log(0.9 / 0.8)
target = torch.tensor([[0.1, 0.9]])  # true distribution A
pred = torch.tensor([[0.2, 0.8]])    # model distribution B
# F.kl_div expects its first argument to be log-probabilities
kl = F.kl_div(torch.log(pred), target, reduction='batchmean', log_target=False)
```
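Running this gives `kl` ≈ 0.0367 nats, matching the hand computation in the comment and the from-scratch sketch above. Note that `reduction='batchmean'` sums over all elements and divides by the batch size (here 1), which matches the mathematical definition of KL divergence per sample.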

References

This content was originally shared by doubllle on Stack Overflow.
