when we want to model a ground-truth distribution 𝑃 with a model distribution 𝑄, we can use either of the following as the cost function:
- KL Divergence 𝐷𝐾𝐿(𝑃||𝑄)
- Reverse KL Divergence 𝐷𝐾𝐿(𝑄||𝑃)
as we know from Univariate Entropy:
- 𝐷𝐾𝐿(𝑃||𝑄) = 𝐄𝑋~𝑃[ 𝑙𝑔[𝑃(𝑋)/𝑄(𝑋)] ]
- 𝐷𝐾𝐿(𝑄||𝑃) = 𝐄𝑋~𝑄[ 𝑙𝑔[𝑄(𝑋)/𝑃(𝑋)] ]
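As a minimal sketch of the two definitions above for discrete distributions: the function name `kl_divergence` and the example values of `P` and `Q` are illustrative assumptions, and 𝑙𝑔 is assumed to mean the base-2 logarithm (so results are in bits).

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; p and q are lists of probabilities over the same outcomes."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q) contributes 0
        if q_i == 0.0:
            return math.inf  # p has mass where q has none
        total += p_i * math.log2(p_i / q_i)
    return total

# Hypothetical example distributions over two outcomes.
P = [0.5, 0.5]
Q = [0.25, 0.75]

forward = kl_divergence(P, Q)  # D_KL(P || Q): expectation taken under P
reverse = kl_divergence(Q, P)  # D_KL(Q || P): expectation taken under Q
print(forward, reverse)
```

The two numbers differ, which is why the choice of direction matters when KL is used as a cost function.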
when modeling 𝑃 with:
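- 𝐷𝐾𝐿(𝑃||𝑄) we want 𝑄(𝑋) to be non-zero wherever 𝑃(𝑋) is non-zero. Otherwise, the ratio 𝑃(𝑋)/𝑄(𝑋) blows up and the KL value becomes very large (infinite if 𝑄(𝑋) is exactly zero). Therefore, 𝑄 tries to cover everything that 𝑃 covers
- 𝐷𝐾𝐿(𝑄||𝑃) we want 𝑄(𝑋) to be zero wherever 𝑃(𝑋) is zero. Otherwise, the ratio 𝑄(𝑋)/𝑃(𝑋) blows up and the KL value becomes very large. Therefore, 𝑄 tries to stay inside the region that 𝑃 covers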
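The asymmetry in the two bullets above can be checked on a small discrete case. This is only a sketch: `Q_narrow` and `Q_wide` are hypothetical models chosen to trigger each failure, and base-2 logs are assumed.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for discrete distributions given as lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        if q_i == 0.0:
            return math.inf  # mass under p where q is exactly zero
        total += p_i * math.log2(p_i / q_i)
    return total

P = [0.5, 0.5, 0.0]
Q_narrow = [1.0, 0.0, 0.0]  # zero on outcome 1, where P is non-zero
Q_wide = [0.4, 0.4, 0.2]    # non-zero on outcome 2, where P is zero

# Forward KL punishes the model that misses part of P.
print(kl_divergence(P, Q_narrow), kl_divergence(P, Q_wide))
# Reverse KL punishes the model that puts mass outside P.
print(kl_divergence(Q_narrow, P), kl_divergence(Q_wide, P))
```

Under 𝐷𝐾𝐿(𝑃||𝑄) only `Q_narrow` is infinitely bad, while under 𝐷𝐾𝐿(𝑄||𝑃) only `Q_wide` is.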
Modeling with KL vs Reverse KL - Example
Let’s say the ground truth 𝑃 is a bimodal distribution (the blue curve below) and we want to model it with a single-mode Gaussian distribution (the red curve):
- if 𝐷𝐾𝐿(𝑃||𝑄) is used as the training objective function, we will get a wide Gaussian distribution that overlaps both modes of the ground truth 𝑃 and peaks at the trough between the two modes (diagram a)
- if 𝐷𝐾𝐿(𝑄||𝑃) is used, we will get one of the two local optima, a narrow Gaussian sitting on a single mode of 𝑃 (diagram b or c)
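the problem is that we are using a simple model for a complex ground truth (i.e. high bias, see bias-variance tradeoff). If the model 𝑄 were flexible enough to represent 𝑃 (for example, a two-component mixture here), both objectives would share the same minimizer 𝑄 = 𝑃 and this would largely be a non-issue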
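The example can be reproduced numerically with a rough sketch. Everything concrete here is an assumption made for illustration: 𝑃 is taken to be an equal mixture of N(-3, 1) and N(3, 1), the integrals are approximated by a Riemann sum on [-12, 12], and the Gaussian 𝑄 is fitted by brute-force grid search over (mu, sigma) rather than by gradient descent. Natural logs are used, which does not change where the minimum is.

```python
import math

def log_p(x):
    """Log-density of the assumed bimodal ground truth 0.5*N(-3,1) + 0.5*N(3,1)."""
    a = math.exp(-0.5 * (x + 3.0) ** 2)
    b = math.exp(-0.5 * (x - 3.0) ** 2)
    return math.log(0.5 * (a + b)) - 0.5 * math.log(2.0 * math.pi)

def log_q(x, mu, sigma):
    """Log-density of the single-Gaussian model N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

DX = 0.1
XS = [-12.0 + DX * i for i in range(241)]  # integration grid on [-12, 12]
LOG_P = [log_p(x) for x in XS]

def forward_kl(mu, sigma):
    """Riemann-sum approximation of D_KL(P || Q)."""
    return DX * sum(math.exp(lp) * (lp - log_q(x, mu, sigma))
                    for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """Riemann-sum approximation of D_KL(Q || P)."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_q(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

# Candidate models: mu in [-4, 4], sigma in [0.5, 5], both in steps of 0.25.
GRID = [(-4.0 + 0.25 * i, 0.5 + 0.25 * j) for i in range(33) for j in range(19)]

best_forward = min(GRID, key=lambda ms: forward_kl(*ms))
best_reverse = min(GRID, key=lambda ms: reverse_kl(*ms))
print("forward KL fit (mu, sigma):", best_forward)
print("reverse KL fit (mu, sigma):", best_reverse)
```

Under these assumptions the forward-KL fit should be a wide Gaussian centred between the modes, and the reverse-KL fit a narrow Gaussian on one mode; which of the two modes it lands on is arbitrary, since both are equally good.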
2D Topology
/modeling---kl-divergence-vs-reverse-kl-divergence/kl-diveregence-vs-reverse-kl-divergence.jpeg)
2D Cross Section
/modeling---kl-divergence-vs-reverse-kl-divergence/modeling-with-kl-divergence.png)
/modeling---kl-divergence-vs-reverse-kl-divergence/modeling-with-reverse-kl-divergence.png)