Entropy

is a measure of how close a system is to equilibrium (higher entropy means closer to equilibrium, with the system's state more evenly spread out)

Information Entropy

is a measure of the amount of disorder/stochasticity/noise in a distribution
- lower entropy implies the distribution's mass/density is concentrated on a few outcomes
- higher entropy implies the distribution's mass/density is more evenly spread out (closer to a uniform distribution)
is the MINIMAL number of bits needed, on average, to encode the information produced by a:
- stochastic source of data
- stochastic/probability distribution
- random variable
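
To make the "concentrated vs. evenly spread" contrast concrete, here is a minimal sketch that compares two made-up four-outcome distributions. The function name `entropy_bits` and both distributions are illustrative assumptions, not part of the original notes.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing (0 * lg 0 is taken as 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over the same four outcomes.
peaked = [0.97, 0.01, 0.01, 0.01]    # mass concentrated on one outcome
uniform = [0.25, 0.25, 0.25, 0.25]   # mass spread evenly

print(entropy_bits(peaked))   # about 0.24 bits
print(entropy_bits(uniform))  # 2.0
```

The uniform case reaches the maximum possible for four outcomes, lg 4 = 2 bits, while the peaked case needs far fewer bits on average.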

Univariate Entropy (Information Content - Entropy - Cross Entropy - KL Divergence)

Information Content ℎ(𝑋=𝑥) - a measure of the information content of an outcome 𝑥: the minimal number of bits needed to encode 𝑥 under an optimal code

ℎ_𝑃(𝑋=𝑥) = 𝑙𝑔 [1/𝑃(𝑋=𝑥)]

ℎ_𝑃(𝑋=𝑥) = - 𝑙𝑔 𝑃(𝑋=𝑥)

where 𝑥 is drawn from 𝑃 and 𝑙𝑔 is the base-2 logarithm, so the result is in bits
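
As a quick numeric check of the formula above, the following sketch evaluates the information content for a few probabilities. The helper name `information_content_bits` is an assumption made for illustration.

```python
import math

def information_content_bits(p):
    """Information content lg(1/p), in bits, of an outcome that occurs with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

print(information_content_bits(0.5))    # 1.0
print(information_content_bits(0.125))  # 3.0
print(information_content_bits(1.0))    # 0.0
```

Rarer outcomes carry more bits, and a certain outcome (probability 1) carries none.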

Entropy 𝐻_𝑃(𝑃) or 𝐻(𝑃) - the expected value of the information content of a probability distribution 𝑃, i.e. the average length of communicating an event from 𝑃 with the optimal code for that same distribution 𝑃

𝐻_𝑃(𝑃) = 𝐄_𝑋~_𝑃[ ℎ_𝑃(𝑋) ]

𝐻_𝑃(𝑃) = 𝐄_𝑋~_𝑃[ 𝑙𝑔 [1/𝑃(𝑋)] ]

𝐻_𝑃(𝑃) = 𝐄_𝑋~_𝑃[ - 𝑙𝑔 𝑃(𝑋) ]

when 𝑃 is a discrete probability distribution

𝐻_𝑃(𝑃) = - 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥) 𝑙𝑔 𝑃(𝑋=𝑥) ]
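
The "average code length" reading of the discrete formula can be checked on a small case. The sketch below assumes a hypothetical three-symbol source whose probabilities are powers of 1/2, paired with a hand-written prefix code; for such a source the average code length matches the entropy exactly.

```python
import math

# Hypothetical source and a prefix code whose word lengths equal lg(1/P(x)).
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
code = {"a": "0", "b": "10", "c": "11"}

# Entropy from the discrete formula: -sum_x P(x) lg P(x).
entropy = -sum(p * math.log2(p) for p in probs.values())

# Expected number of bits per symbol when using the code above.
avg_code_length = sum(probs[s] * len(code[s]) for s in probs)

print(entropy, avg_code_length)  # 1.5 1.5
```

When the probabilities are not powers of 1/2, a symbol-by-symbol code generally cannot hit the entropy exactly, and the entropy acts as a lower bound on the average length.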

when 𝑃 is a continuous probability distribution with density 𝑝(𝑥) (𝐻(𝑃) is known as continuous/differential entropy)

𝐻_𝑃(𝑃) = - ∫_{𝑠𝑡𝑎𝑟𝑡}^{𝑒𝑛𝑑} 𝑝(𝑥) 𝑙𝑔 𝑝(𝑥) 𝑑𝑥
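
A minimal sketch of the integral above, approximated with a midpoint rule and checked against the known closed form for a Gaussian, ½ 𝑙𝑔(2π𝑒σ²) bits. The function names, the integration range, and the step count are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def differential_entropy_bits(pdf, start, end, steps=100_000):
    """Midpoint-rule approximation of -integral of p(x) lg p(x) dx over [start, end]."""
    dx = (end - start) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(start + (i + 0.5) * dx)
        if p > 0.0:  # skip points where the density underflows to zero
            total -= p * math.log2(p) * dx
    return total

# Standard normal: the tails beyond +/-10 contribute a negligible amount.
numeric = differential_entropy_bits(gaussian_pdf, -10.0, 10.0)
closed_form = 0.5 * math.log2(2.0 * math.pi * math.e)
print(numeric, closed_form)  # both about 2.05 bits

# Unlike discrete entropy, differential entropy can be negative:
# a uniform density of height 2 on [0, 0.5] has entropy lg(0.5) = -1 bit.
narrow_uniform = differential_entropy_bits(lambda x: 2.0, 0.0, 0.5)
print(narrow_uniform)  # about -1.0
```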

Cross Entropy 𝐻_𝑄(𝑃) - the average length of communicating an event from one distribution 𝑃 with the optimal code for another distribution 𝑄

if the distributions 𝑃 and 𝑄 are the SAME then the cross-entropy = entropy of 𝑃 = entropy of 𝑄

𝐻_𝑃(𝑄) = 𝐻_𝑄(𝑃) = 𝐻_𝑃(𝑃) = 𝐻_𝑄(𝑄)

the more the distributions 𝑃 and 𝑄 differ:

the more the cross-entropy of 𝑃 with respect to 𝑄 exceeds the entropy of 𝑃

𝐻_𝑄(𝑃) > 𝐻_𝑃(𝑃)

the more the cross-entropy of 𝑄 with respect to 𝑃 exceeds the entropy of 𝑄

𝐻_𝑃(𝑄) > 𝐻_𝑄(𝑄)

cross-entropy is not symmetric in general: 𝐻_𝑄(𝑃) ≠ 𝐻_𝑃(𝑄)

𝐻_𝑄(𝑃) = 𝐄_𝑋~_𝑃[ ℎ_𝑄(𝑋) ]

𝐻_𝑄(𝑃) = 𝐄_𝑋~_𝑃[ 𝑙𝑔 [1/𝑄(𝑋)] ]

𝐻_𝑄(𝑃) = 𝐄_𝑋~_𝑃[ - 𝑙𝑔 𝑄(𝑋) ]

when 𝑃 and 𝑄 are discrete probability distributions

𝐻_𝑄(𝑃) = - 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥) 𝑙𝑔 𝑄(𝑋=𝑥) ]
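
The discrete formula and the properties listed earlier (equality with entropy when 𝑃 = 𝑄, a penalty when they differ, and asymmetry) can be checked on a small case. The distributions and the name `cross_entropy_bits` below are illustrative assumptions.

```python
import math

def cross_entropy_bits(p, q):
    """H_Q(P) = -sum_x P(x) lg Q(x), for probability lists aligned by outcome.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise math.log2 raises ValueError
    (the cross-entropy is infinite in that case).
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over the same three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

print(cross_entropy_bits(p, p))  # 1.5, the entropy of P
print(cross_entropy_bits(p, q))  # about 1.82: coding P with Q's code costs extra bits
print(cross_entropy_bits(q, q))  # about 0.92, the entropy of Q
print(cross_entropy_bits(q, p))  # about 1.2: differs from H_Q(P), so not symmetric
```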

when 𝑃 and 𝑄 are continuous probability distributions with densities 𝑝(𝑥) and 𝑞(𝑥) (𝐻_𝑄(𝑃) is known as continuous/differential cross entropy)

𝐻_𝑄(𝑃) = - ∫_{𝑠𝑡𝑎𝑟𝑡}^{𝑒𝑛𝑑} 𝑝(𝑥) 𝑙𝑔 𝑞(𝑥) 𝑑𝑥

often interpreted as the (average) negative log-likelihood

𝐻_𝑄(𝑃) = 𝐄_𝑋,𝑌~_𝑃[ - 𝑙𝑔 𝑄(𝑌|𝑋) ] # likelihood function = 𝑄(𝑌|𝑋)
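
One way to see the link to negative log-likelihood: when the expectation is taken over an observed dataset, the average of -𝑙𝑔 𝑄(𝑦|𝑥) equals the average cross-entropy between one-hot label distributions and the model's predictions. The predictions and labels below are made-up values for a hypothetical three-class classifier.

```python
import math

# Hypothetical model outputs Q(Y | X = x_i) for three examples over classes 0..2.
predicted = [
    [0.5, 0.25, 0.25],
    [0.125, 0.75, 0.125],
    [0.25, 0.25, 0.5],
]
labels = [0, 1, 2]  # observed classes y_i

# Average negative log-likelihood of the observed labels, in bits.
nll = -sum(math.log2(q[y]) for q, y in zip(predicted, labels)) / len(labels)

def one_hot(y, n):
    """Empirical target distribution that puts all mass on class y."""
    return [1.0 if k == y else 0.0 for k in range(n)]

# The same quantity written as a cross-entropy against one-hot targets.
cross_entropy = sum(
    -sum(t * math.log2(qk) for t, qk in zip(one_hot(y, len(q)), q) if t > 0)
    for q, y in zip(predicted, labels)
) / len(labels)

print(nll, cross_entropy)  # both about 0.805 bits
```

This is why minimizing cross-entropy loss and maximizing likelihood lead to the same model under this setup.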

Relative Entropy or Kullback-Leibler (KL) Divergence 𝐷_𝐾𝐿(𝑃||𝑄) or 𝐷_𝑄(𝑃) - measures the “distance” between 2 distributions (see: divergence)

𝐷_𝑄(𝑃) = 𝐻_𝑄(𝑃) - 𝐻_𝑃(𝑃)

𝐷_𝑄(𝑃) = 𝐄_𝑋~_𝑃[ 𝑙𝑔[1/𝑄(𝑋)] ] - 𝐄_𝑋~_𝑃[ 𝑙𝑔[1/𝑃(𝑋)] ]

𝐷_𝑄(𝑃) = 𝐄_𝑋~_𝑃[ 𝑙𝑔[1/𝑄(𝑋)] - 𝑙𝑔[1/𝑃(𝑋)] ]

𝐷_𝑄(𝑃) = 𝐄_𝑋~_𝑃[ 𝑙𝑔[𝑃(𝑋)/𝑄(𝑋)] ]
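
The derivation above says the KL divergence can be computed either as cross-entropy minus entropy or directly as the expected log-ratio. A minimal numeric check, with illustrative function names and made-up distributions:

```python
import math

def entropy_bits(p):
    """H_P(P) = -sum_x P(x) lg P(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """H_Q(P) = -sum_x P(x) lg Q(x); assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence_bits(p, q):
    """D_Q(P) = sum_x P(x) lg(P(x) / Q(x)); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over the same three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

from_difference = cross_entropy_bits(p, q) - entropy_bits(p)
direct = kl_divergence_bits(p, q)
print(from_difference, direct)  # both about 0.32 bits
```

The result is the number of extra bits paid per event, on average, for encoding samples from 𝑃 with a code optimized for 𝑄.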

if we have 2 separate probability distributions 𝑃 and 𝑄 over the same random variable 𝑋, we can measure how far apart 𝑃 and 𝑄 are using the KL Divergence

if the distributions 𝑃 and 𝑄 are DIFFERENT then the KL-Divergence > 0

if the distributions 𝑃 and 𝑄 are the SAME then the KL-Divergence = 0

Because the KL Divergence is non-negative and measures the difference between two distributions, it is often conceptualized as some sort of distance measure between these distributions. However, it is not a true distance metric because it is not symmetric:

𝐷_𝑄(𝑃) ≠ 𝐷_𝑃(𝑄)
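
The three properties just listed (zero for identical distributions, positive otherwise, and not symmetric) can be seen on a two-outcome case. The distributions and the name `kl_divergence_bits` are assumptions for illustration.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) = sum_x P(x) lg(P(x) / Q(x)); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical coin distributions: a fair coin and a heavily biased one.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(kl_divergence_bits(fair, fair))    # 0.0: identical distributions
print(kl_divergence_bits(fair, biased))  # about 0.74 bits
print(kl_divergence_bits(biased, fair))  # about 0.53 bits: swapping the arguments changes the value
```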

using KL Divergence for modeling

／var／log marcus chiu
