when we want to model 𝑃 with 𝑄, we can use either of the following as the cost function:

  • KL Divergence 𝐷𝐾𝐿(𝑃||𝑄)
  • Reverse KL Divergence 𝐷𝐾𝐿(𝑄||𝑃)

as we know from Univariate Entropy:

  • 𝐷𝐾𝐿(𝑃||𝑄) = 𝐄𝑋~𝑃[ 𝑙𝑔[𝑃(𝑋)/𝑄(𝑋)] ]
  • 𝐷𝐾𝐿(𝑄||𝑃) = 𝐄𝑋~𝑄[ 𝑙𝑔[𝑄(𝑋)/𝑃(𝑋)] ]

when modeling 𝑃 with:

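  • 𝐷𝐾𝐿(𝑃||𝑄) we want 𝑄(𝑋) to be non-zero wherever 𝑃(𝑋) is non-zero. Otherwise, the ratio 𝑃(𝑋)/𝑄(𝑋) blows up and the KL value will be high. Therefore, 𝑄 tries to cover everything that 𝑃 covers, even at the cost of putting mass where 𝑃 has little
  • 𝐷𝐾𝐿(𝑄||𝑃) we want 𝑄(𝑋) to be zero wherever 𝑃(𝑋) is zero. Otherwise, the ratio 𝑄(𝑋)/𝑃(𝑋) blows up and the KL value will be high. Therefore, 𝑄 tries to stay inside what 𝑃 covers, even at the cost of missing part of it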
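
The two cases above can be checked on a small discrete example. This is a minimal sketch, assuming distributions given as plain lists and reading 𝑙𝑔 as log base 2; the helper name kl_divergence and the example values are illustrative and not from the original notes:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions given as lists.

    Terms with p[i] == 0 contribute 0 (the usual 0 * log 0 = 0 convention).
    Returns math.inf if q[i] == 0 at some outcome where p[i] > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

# Ground truth P puts no mass on the third outcome.
p = [0.5, 0.5, 0.0]
q_wide = [0.4, 0.4, 0.2]    # covers everything P covers, plus mass where P is zero
q_narrow = [1.0, 0.0, 0.0]  # stays inside P's support but misses the second outcome

forward_wide = kl_divergence(p, q_wide)      # finite: Q is non-zero wherever P is
forward_narrow = kl_divergence(p, q_narrow)  # infinite: Q is zero where P is not
reverse_wide = kl_divergence(q_wide, p)      # infinite: Q is non-zero where P is zero
reverse_narrow = kl_divergence(q_narrow, p)  # finite: Q is zero wherever P is

print("D_KL(P||Q_wide)   =", forward_wide)
print("D_KL(P||Q_narrow) =", forward_narrow)
print("D_KL(Q_wide||P)   =", reverse_wide)
print("D_KL(Q_narrow||P) =", reverse_narrow)
```

The same pair of candidates is ranked in opposite order by the two objectives, which is the asymmetry the bullets describe.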

Modeling with KL vs Reverse KL - Example

Let’s say the ground truth 𝑃 is a bimodal distribution (the blue curve below) and we want to model it with a single-mode Gaussian distribution (the red curve):

  • if 𝐷𝐾𝐿(𝑃||𝑄) is used as the training objective function, we will get a wide Gaussian distribution that covers both modes of the ground truth 𝑃 and peaks at the trough between them (diagram a)
  • if 𝐷𝐾𝐿(𝑄||𝑃) is used, we will get one of the two local optima, a narrow Gaussian that sits on a single mode of 𝑃 (diagram b or c)
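
The behaviour in this example can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the procedure behind the diagrams: the ground truth is taken to be an equal mixture of N(-3, 1) and N(3, 1), the integrals are approximated by a Riemann sum on a grid, and the fit is a brute-force grid search over the mean and standard deviation instead of gradient-based training. All names and constants are hypothetical. Natural log is used, which changes the KL values by a constant factor but not which candidate wins.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

# Integration grid on [-12, 12].
DX = 0.05
XS = [-12.0 + DX * i for i in range(481)]

# Assumed bimodal ground truth P: 0.5 * N(-3, 1) + 0.5 * N(3, 1).
P = [0.5 * math.exp(log_normal_pdf(x, -3.0, 1.0))
     + 0.5 * math.exp(log_normal_pdf(x, 3.0, 1.0)) for x in XS]
LOG_P = [math.log(p) for p in P]

def forward_and_reverse_kl(mu, sigma):
    """Riemann-sum estimates of (D_KL(P||Q), D_KL(Q||P)) for Q = N(mu, sigma^2)."""
    forward = 0.0
    reverse = 0.0
    for x, p, log_p in zip(XS, P, LOG_P):
        log_q = log_normal_pdf(x, mu, sigma)
        q = math.exp(log_q)
        forward += p * (log_p - log_q) * DX
        reverse += q * (log_q - log_p) * DX
    return forward, reverse

# Candidate single-mode Gaussians: mu in [-4, 4], sigma in [0.5, 4.0].
candidates = [(-4.0 + 0.25 * i, 0.5 + 0.125 * j)
              for i in range(33) for j in range(29)]
scores = {c: forward_and_reverse_kl(*c) for c in candidates}

best_forward = min(candidates, key=lambda c: scores[c][0])
best_reverse = min(candidates, key=lambda c: scores[c][1])

print("forward KL fit (mu, sigma):", best_forward)
print("reverse KL fit (mu, sigma):", best_reverse)
```

Under these assumptions the forward KL fit is centred between the modes with a large standard deviation, while the reverse KL fit sits on one mode with a standard deviation close to that mode's. Which of the two modes the reverse fit lands on is arbitrary here, since both score the same up to rounding.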

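the problem is that we are using a simple model for a complex ground truth (i.e. high bias, see bias-variance tradeoff). If the model 𝑄 has similar complexity to the ground truth 𝑃 (here, if 𝑄 could itself be bimodal), both objectives can be driven close to zero by the same 𝑄 and this becomes a non-issue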

2D Topology

2D Cross Section