The core idea of maximum likelihood estimation is this: assume we draw i.i.d. samples $\{x^1, x^2, \cdots, x^m\}$ from the data distribution $P_{\text{data}}(x)$, and let $P_{\theta}(x^i)$ denote the probability that our model assigns to sample $x^i$. The objective is to find the parameters $\theta$ that maximize the likelihood of observing all the samples:

$$ \theta^{*} = \argmax_{\theta} \prod_{i=1}^{m} P_{\theta}(x^i) $$

Now, we will explore the relationship between maximum likelihood estimation and KL divergence.

$$ \begin{aligned}
\theta^{*} &= \argmax_{\theta} \prod_{i=1}^{m} P_{\theta}(x^i)\\
&= \argmax_{\theta} \log \prod_{i=1}^{m} P_{\theta}(x^i)\\
&= \argmax_{\theta} \sum_{i=1}^{m} \log P_{\theta}(x^i)\\
&= \argmax_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta}(x^i)\\
&\approx \argmax_{\theta} E_{x \sim P_{\text{data}}} \left[ \log P_{\theta}(x) \right]\\
&= \argmax_{\theta} \int_{x} P_{\text{data}}(x) \log P_{\theta}(x) dx\\
&= \argmax_{\theta} \int_{x} P_{\text{data}}(x) \log P_{\theta}(x) dx - \underbrace{\int_{x} P_{\text{data}}(x) \log P_{\text{data}}(x) dx}_{\text{not related to } \theta}\\
&= \argmax_{\theta} \int_{x} P_{\text{data}}(x) \log \frac{P_{\theta}(x)}{P_{\text{data}}(x)} dx\\
&= \argmax_{\theta} -1 \cdot \int_{x} P_{\text{data}}(x) \log \frac{P_{\text{data}}(x)}{P_{\theta}(x)} dx\\
&= \argmin_{\theta} KL(P_{\text{data}} || P_{\theta})
\end{aligned} $$

Scaling by $\frac{1}{m}$ does not change the maximizer, and by the law of large numbers the sample average of $\log P_{\theta}(x^i)$ converges to the expectation under $P_{\text{data}}$; subtracting the entropy term is allowed because it does not depend on $\theta$. Hence maximizing the likelihood is (approximately) equivalent to minimizing the KL divergence from $P_{\text{data}}$ to $P_{\theta}$.
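The decomposition above can be checked numerically for a discrete distribution: the expected negative log-likelihood (cross-entropy) equals the data entropy, which is independent of $\theta$, plus $KL(P_{\text{data}} || P_{\theta})$. A minimal sketch, where the three-outcome distributions are made up for illustration:

```python
import math

# Hypothetical discrete "data distribution" over three outcomes.
p_data = [0.5, 0.3, 0.2]

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cross_entropy(p, q):
    # E_{x ~ p}[ -log q(x) ]: the expected negative log-likelihood
    return sum(-pi * math.log(qi) for pi, qi in zip(p, q))

entropy = cross_entropy(p_data, p_data)  # entropy of the data distribution

# A candidate model distribution P_theta.
p_theta = [0.4, 0.4, 0.2]

# cross-entropy = entropy + KL, so minimizing cross-entropy over theta
# is the same as minimizing KL(P_data || P_theta).
print(cross_entropy(p_data, p_theta), entropy + kl(p_data, p_theta))
```

Since the entropy term is constant in $\theta$, any $\theta$ that minimizes the cross-entropy also minimizes the KL divergence, mirroring the final step of the derivation.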

References

https://www.youtube.com/watch?v=67_M2qP5ssY&list=PLJV_el3uVTsNi7PgekEUFsyVllAJXRsP-