where $p(v_i)$ is the probability of state $v_i$. From the perspective of information theory, $S(v)$ is the amount of information required to remove the uncertainty about the system's state.
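As a quick illustration of this definition (a minimal sketch; the distributions are made up), the entropy of a discrete distribution can be computed directly:

```python
import math

def entropy(p):
    """Shannon entropy S(v) = -sum_i p(v_i) * log p(v_i), natural log."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A uniform distribution is maximally uncertain; a peaked one is not.
print(entropy([0.5, 0.5]))  # log(2) ≈ 0.6931
print(entropy([0.9, 0.1]))  # ≈ 0.3251
```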
$$ KL(A|B) = \sum_{i}p_A(v_i)\log p_A(v_i) - p_A(v_i)\log p_B(v_i), $$where the first term on the right-hand side is the negative entropy of distribution $A$, and the second term is the expectation of $\log p_B(v_i)$ under distribution $A$. $KL(A|B)$ describes how different $B$ is from $A$, measured from the perspective of $A$; note that it is not symmetric in $A$ and $B$.
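To make the asymmetry concrete, here is a small sketch (the two distributions are made up) computing $KL(A|B)$ and $KL(B|A)$ in plain Python:

```python
import math

def kl(p, q):
    """KL(P|Q) = sum_i p_i * (log p_i - log q_i), natural log."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

A = [0.1, 0.9]
B = [0.2, 0.8]
print(kl(A, B))  # ≈ 0.0367
print(kl(B, A))  # ≈ 0.0444, not equal: KL is not symmetric
```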
$$ H(A, B) = -\sum_{i}p_A(v_i)\log p_B(v_i). $$Comparing this with the definition of $KL(A|B)$ above gives $$ H(A, B) = KL(A|B) + S_A. $$If $S_A$ is a constant (the true distribution $A$ is fixed), then minimizing $H(A, B)$ is equivalent to minimizing $KL(A|B)$.
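The identity $H(A, B) = KL(A|B) + S_A$ can be checked numerically (a minimal sketch with made-up distributions):

```python
import math

A = [0.1, 0.9]
B = [0.2, 0.8]

S_A = -sum(p * math.log(p) for p in A)                # entropy of A
H_AB = -sum(p * math.log(q) for p, q in zip(A, B))    # cross-entropy H(A, B)
KL_AB = sum(p * (math.log(p) - math.log(q)) for p, q in zip(A, B))

print(abs(H_AB - (KL_AB + S_A)) < 1e-12)  # True: H(A, B) = KL(A|B) + S_A
```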
Here is an example from machine learning: a binary classification problem with input $x$ and label $y$. Suppose the true label distribution is $p(y) = [0.1, 0.9]$. This distribution represents the true data distribution and remains fixed, corresponding to distribution $A$ above. A machine learning model $f$ outputs $p_{f}(y|x) = [0.2, 0.8]$, corresponding to distribution $B$ above.
import torch
import torch.nn.functional as F

# KL(y|f) = 0.1 * log(0.1/0.2) + 0.9 * log(0.9/0.8) ≈ 0.0367
target = torch.tensor([[0.1, 0.9]])  # true distribution A
pred = torch.tensor([[0.2, 0.8]])    # model distribution B
# F.kl_div expects log-probabilities as its first argument
kl = F.kl_div(torch.log(pred), target, reduction='batchmean', log_target=False)
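As a sanity check on the conventions of `F.kl_div` (log-probabilities as the first argument, `batchmean` reduction), the result can be reproduced directly from the formula; the tensors are repeated here so the snippet runs on its own:

```python
import torch
import torch.nn.functional as F

target = torch.tensor([[0.1, 0.9]])
pred = torch.tensor([[0.2, 0.8]])
kl = F.kl_div(torch.log(pred), target, reduction='batchmean', log_target=False)

# Hand-computed: sum_i p_A(v_i) * log(p_A(v_i) / p_B(v_i))
manual = (target * (target / pred).log()).sum()
print(torch.isclose(kl, manual))  # tensor(True)
```

Note that `batchmean` divides the summed divergence by the batch size (here 1), which is why it matches the plain sum.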
References
This content was originally shared by doubllle on Stack Overflow.
For more details, you can check out the full discussion here.