When we want to model a distribution 𝑃 with an approximation 𝑄, we can use either of the following as the cost function:

KL Divergence 𝐷_𝐾𝐿(𝑃||𝑄)
Reverse KL Divergence 𝐷_𝐾𝐿(𝑄||𝑃)

𝐷_𝐾𝐿(𝑃||𝑄) = 𝐄_{𝑋~𝑃}[ log( 𝑃(𝑋) / 𝑄(𝑋) ) ]
𝐷_𝐾𝐿(𝑄||𝑃) = 𝐄_{𝑋~𝑄}[ log( 𝑄(𝑋) / 𝑃(𝑋) ) ]
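To make the two definitions concrete, here is a minimal sketch for discrete distributions. The helper name `kl_divergence` and the example vectors `p` and `q` are illustrative assumptions, not taken from the original notes; the natural log is used, so values are in nats.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    with np.errstate(divide="ignore"):  # q(x) = 0 with p(x) > 0 yields inf
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical example distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

forward_kl = kl_divergence(p, q)  # D_KL(P || Q): expectation taken under P
reverse_kl = kl_divergence(q, p)  # D_KL(Q || P): expectation taken under Q

print("forward:", forward_kl, "reverse:", reverse_kl)
```

The two printed values differ, which is a reminder that KL divergence is not symmetric in its arguments.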

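When modeling 𝑃 with:

𝐷_𝐾𝐿(𝑃||𝑄), we want 𝑄(𝑋) to be non-zero wherever 𝑃(𝑋) is non-zero. Otherwise the ratio 𝑃(𝑋)/𝑄(𝑋) blows up and the KL value becomes very large. As a result, 𝑄 tries to cover everything that 𝑃 covers (mass-covering behavior).
𝐷_𝐾𝐿(𝑄||𝑃), we want 𝑄(𝑋) to be zero wherever 𝑃(𝑋) is zero. Otherwise the ratio 𝑄(𝑋)/𝑃(𝑋) blows up and the KL value becomes very large. As a result, 𝑄 avoids regions that 𝑃 does not cover, even if that means ignoring part of 𝑃 (mode-seeking behavior).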
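The zero/non-zero conditions above can be checked on a toy case. This is a sketch under assumed values: `p`, `q_narrow`, and `q_wide` are hypothetical three-outcome distributions chosen so that each direction of KL hits its failure case exactly once.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats; returns inf when q(x) = 0 somewhere p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5, 0.0])         # P has no mass on the third outcome
q_narrow = np.array([1.0, 0.0, 0.0])  # Q drops an outcome that P covers
q_wide = np.array([0.4, 0.4, 0.2])    # Q puts mass where P has none

forward_narrow = kl_divergence(p, q_narrow)  # Q = 0 where P > 0: infinite
reverse_narrow = kl_divergence(q_narrow, p)  # finite, equals log 2
forward_wide = kl_divergence(p, q_wide)      # finite, equals log 1.25
reverse_wide = kl_divergence(q_wide, p)      # Q > 0 where P = 0: infinite

print(forward_narrow, reverse_narrow, forward_wide, reverse_wide)
```

Forward KL tolerates the over-wide `q_wide` but rejects the too-narrow `q_narrow`; reverse KL does the opposite.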

Modeling with KL vs Reverse KL - Example

Let’s say the ground truth 𝑃 is a bimodal distribution (the blue curve below) and we want to model it with a single-mode Gaussian distribution (the red curve):

If 𝐷_𝐾𝐿(𝑃||𝑄) is used as the training objective, we get a wide Gaussian that covers both modes of the ground truth 𝑃 and peaks at the trough between them (diagram a).
If 𝐷_𝐾𝐿(𝑄||𝑃) is used, we get one of two local optima, each a narrow Gaussian fitted to a single mode (diagram b or c).
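The bimodal example can be reproduced numerically. The following is a rough sketch, not the procedure behind the original diagrams: the ground truth is assumed to be an equal mixture of N(-3, 1) and N(3, 1), the integrals are approximated on a fixed grid, and a brute-force grid search over (mu, sigma) stands in for gradient-based training. All function names and ranges are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Integration grid (assumed wide enough to hold almost all of P's mass).
x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

# Hypothetical bimodal ground truth: equal mix of N(-3, 1) and N(3, 1).
p = 0.5 * gaussian_pdf(x, -3.0, 1.0) + 0.5 * gaussian_pdf(x, 3.0, 1.0)

def kl_on_grid(a, b):
    """Riemann-sum approximation of D_KL(a || b) for densities sampled on x."""
    tiny = 1e-300  # guards log(0) if a density underflows
    return float(np.sum(a * (np.log(a + tiny) - np.log(b + tiny))) * dx)

def fit_gaussian(objective):
    """Grid search over (mu, sigma); returns (best value, mu, sigma)."""
    best = (np.inf, None, None)
    for mu in np.linspace(-5.0, 5.0, 101):
        for sigma in np.linspace(0.5, 5.0, 91):
            value = objective(gaussian_pdf(x, mu, sigma))
            if value < best[0]:
                best = (value, mu, sigma)
    return best

forward_fit = fit_gaussian(lambda q: kl_on_grid(p, q))  # minimizes D_KL(P || Q)
reverse_fit = fit_gaussian(lambda q: kl_on_grid(q, p))  # minimizes D_KL(Q || P)

print("forward KL fit: mu=%.2f sigma=%.2f" % forward_fit[1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % reverse_fit[1:])
```

Under these assumptions the forward fit should land near mu = 0 with a large sigma, while the reverse fit should sit on one of the two modes with sigma close to 1. Which mode it picks is an artifact of tie-breaking in the grid search, mirroring the two local optima described above.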

The underlying problem is that we are using a model that is too simple for a complex ground truth (i.e. high bias; see the bias-variance tradeoff). If the model and the ground truth have similar complexity, this is largely a non-issue.

／var／log marcus chiu

Modeling - KL Divergence vs Reverse KL Divergence

2D Topology

2D Cross Section
