see: Univariate Entropy (Information Content - Entropy - Cross Entropy - KL Divergence)

Entropy 𝐻(𝑋) = 𝐻_𝑃(𝑋) = - 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥) 𝑙𝑔 𝑃(𝑋=𝑥) ]

Joint Entropy 𝐻(𝑋,𝑌) = 𝐻_𝑃(𝑋,𝑌)

𝐻_𝑃(𝑋,𝑌) = - 𝛴_𝑥∊𝑋 𝛴_𝑦∊𝑌[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ]

joint entropy is just the entropy of the pair (𝑋,𝑌) treated as a single variable 𝑍: substituting 𝑄(𝑍) for 𝑃(𝑋,𝑌) gives back the entropy formula

𝐻_𝑄(𝑍) = - 𝛴_𝑧∊𝑍 [ 𝑄(𝑍=𝑧) 𝑙𝑔 𝑄(𝑍=𝑧) ]

joint entropy is symmetric

𝐻_𝑃(𝑋,𝑌) = 𝐻_𝑃(𝑌,𝑋)
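A minimal sketch of the two points above, assuming a small made-up joint distribution stored as a dict keyed by (x, y). The names `entropy`, `joint_xy` and `joint_yx` are illustrative only and the probabilities are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution P(X=x, Y=y), keyed by (x, y).
joint_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Joint entropy is the ordinary entropy of the pair Z = (X, Y).
h_xy = entropy(joint_xy)

# Swapping the roles of X and Y only relabels the outcomes.
joint_yx = {(y, x): p for (x, y), p in joint_xy.items()}
h_yx = entropy(joint_yx)

print(h_xy, h_yx)  # 1.75 1.75
```

Because the same `entropy` function handles both the univariate and the joint case, no separate joint-entropy routine is needed.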

Conditional Entropy 𝐻(𝑋|𝑌) = 𝐻_𝑃(𝑋|𝑌)

𝐻_𝑃(𝑋|𝑌) = - 𝛴_𝑦∊𝑌𝑃(𝑌=𝑦) 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ]

𝐻_𝑃(𝑋|𝑌) = - 𝛴_𝑥∊𝑋 𝛴_𝑦∊𝑌[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ]

if values of 𝑋 are completely determined by 𝑌 then:

𝐻_𝑃(𝑋|𝑌) = 0

conditional entropy of a variable given itself is zero: 𝐻(𝑋|𝑋) = 0. Entropy measures uncertainty, and there is no uncertainty about the value of 𝑋 once the value of 𝑋 is known:

𝐻_𝑃(𝑋|𝑋) = 0 = - 𝛴_𝑥∊𝑋 𝛴_𝑥′∊𝑋[ 𝑃(𝑋=𝑥,𝑋=𝑥′) 𝑙𝑔 𝑃(𝑋=𝑥|𝑋=𝑥′) ]

if 𝑥=𝑥′ then 𝑙𝑔𝑃(𝑋=𝑥|𝑋=𝑥′) = 𝑙𝑔 1 = 0

if 𝑥≠𝑥′ then 𝑃(𝑋=𝑥,𝑋=𝑥′) = 0
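The conditional entropy formula can be sketched in a few lines, again assuming a hypothetical joint table keyed by (x, y). `conditional_entropy` and both example tables are illustrative names and values, not part of the original notes.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(X|Y) in bits for a dict {(x, y): P(X=x, Y=y)}."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p  # marginal P(Y=y)
    # - sum_x sum_y P(x, y) lg P(x | y), with P(x | y) = P(x, y) / P(y)
    return -sum(p * math.log2(p / p_y[y])
                for (_, y), p in joint.items() if p > 0)

# Hypothetical joint distribution where Y only partly explains X.
joint_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
h_x_given_y = conditional_entropy(joint_xy)

# X completely determined by Y (here X = Y), so no uncertainty remains.
determined = {(0, 0): 0.5, (1, 1): 0.5}
h_determined = conditional_entropy(determined)

print(h_x_given_y, abs(h_determined))
```

Outcomes with zero probability are skipped, following the usual convention that 0 𝑙𝑔 0 = 0.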

Mutual Information or Information Gain 𝐼(𝑋,𝑌) - information shared between the variables

𝐼_𝑃(𝑋,𝑌) = 𝐻_𝑃(𝑋) + 𝐻_𝑃(𝑌) - 𝐻_𝑃(𝑋,𝑌)

this definition works because 𝐻_𝑃(𝑋) + 𝐻_𝑃(𝑌) counts the shared information twice (once as part of 𝑋 and once as part of 𝑌), while 𝐻_𝑃(𝑋,𝑌) counts it only once

equivalently, 𝐼(𝑋,𝑌) measures the reduction in uncertainty about one variable after observing the other:

𝐼_𝑃(𝑋,𝑌) = 𝐻_𝑃(𝑋) - 𝐻_𝑃(𝑋|𝑌) # reduction in uncertainty about 𝑋 after observing 𝑌

𝐼_𝑃(𝑋,𝑌) = 𝐻_𝑃(𝑌) - 𝐻_𝑃(𝑌|𝑋) # reduction in uncertainty about 𝑌 after observing 𝑋

therefore, because conditional entropy is never negative:

𝐼_𝑃(𝑋,𝑌) ≤ 𝑚𝑖𝑛(𝐻_𝑃(𝑋), 𝐻_𝑃(𝑌))

for discrete variables:

𝐼_𝑃(𝑋,𝑌) = 𝛴_𝑥∊𝑋𝛴_𝑦∊𝑌[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 [ 𝑃(𝑋=𝑥,𝑌=𝑦) / [ 𝑃(𝑋=𝑥)𝑃(𝑌=𝑦) ] ] ]
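As a rough numerical check that the entropy-based definition and the double-sum formula agree, here is a sketch on a hypothetical joint table. The helper names `entropy` and `marginals` are assumptions made for this example.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginals(joint):
    """Return (P(X), P(Y)) as dicts from a joint dict {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return p_x, p_y

# Hypothetical joint distribution P(X=x, Y=y).
joint_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
p_x, p_y = marginals(joint_xy)
h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint_xy)

# I(X,Y) = H(X) + H(Y) - H(X,Y)
mi_from_entropies = h_x + h_y - h_xy

# I(X,Y) = sum_x sum_y P(x, y) lg [ P(x, y) / (P(x) P(y)) ]
mi_direct = sum(p * math.log2(p / (p_x[x] * p_y[y]))
                for (x, y), p in joint_xy.items() if p > 0)

print(mi_from_entropies, mi_direct, min(h_x, h_y))
```

The two values should match up to floating-point error, and both stay below the smaller of the two marginal entropies.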

Pointwise Mutual Information 𝐼_𝑃(𝑋=𝑥,𝑌=𝑦) - measures how much more (or less) often events 𝑥 and 𝑦 co-occur than they would if they were independent

𝐼_𝑃(𝑋=𝑥,𝑌=𝑦) = 𝑙𝑔 [ 𝑃(𝑋=𝑥,𝑌=𝑦) / [ 𝑃(𝑋=𝑥)𝑃(𝑌=𝑦) ] ]
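A small sketch of pointwise mutual information on a hypothetical joint table. The function name `pmi` and the numbers are assumptions for illustration; the value is undefined for pairs with zero joint probability, which this sketch does not handle.

```python
import math

def pmi(joint, x, y):
    """Pointwise mutual information in bits for one outcome pair (x, y)."""
    p_x = sum(p for (a, _), p in joint.items() if a == x)  # P(X=x)
    p_y = sum(p for (_, b), p in joint.items() if b == y)  # P(Y=y)
    return math.log2(joint[(x, y)] / (p_x * p_y))

# Hypothetical joint distribution P(X=x, Y=y).
joint_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

pmi_00 = pmi(joint_xy, 0, 0)  # pair occurs more often than under independence
pmi_01 = pmi(joint_xy, 0, 1)  # pair occurs less often than under independence

# Mutual information is the expected value of the pointwise quantity.
mi = sum(p * pmi(joint_xy, x, y) for (x, y), p in joint_xy.items())

print(pmi_00, pmi_01, mi)
```

Individual pointwise values can be negative, but their probability-weighted average, the mutual information, cannot.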

Self Information - mutual information with itself 𝐼_𝑃(𝑋,𝑋)

𝐼_𝑃(𝑋,𝑋) = 𝐻_𝑃(𝑋) - 𝐻_𝑃(𝑋|𝑋)

𝐼_𝑃(𝑋,𝑋) = 𝐻_𝑃(𝑋) # because 𝐻_𝑃(𝑋|𝑋) = 0 see conditional entropy section
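The identity can be checked numerically by building the joint distribution of (𝑋,𝑋), which puts all mass on the diagonal. This is a sketch with a hypothetical marginal `p_x`; the variable names are assumptions.

```python
import math

# Hypothetical marginal distribution P(X=x).
p_x = {0: 0.75, 1: 0.25}

# Joint distribution of (X, X): off-diagonal pairs have probability zero.
joint_xx = {(x, x): p for x, p in p_x.items()}

h_x = -sum(p * math.log2(p) for p in p_x.values() if p > 0)

# I(X,X) = sum P(x, x) lg [ P(x, x) / (P(x) P(x)) ] = sum P(x) lg [ 1 / P(x) ]
mi_xx = sum(p * math.log2(p / (p_x[a] * p_x[b]))
            for (a, b), p in joint_xx.items() if p > 0)

print(h_x, mi_xx)
```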

Variation of Information 𝑉_𝑃(𝑋,𝑌) - gives us a metric, a notion of distance, between different variables. The variation of information between two variables is zero if knowing the value of one tells you the value of the other and increases as they become more independent

𝑉_𝑃(𝑋,𝑌) = 𝐻_𝑃(𝑋,𝑌) - 𝐼_𝑃(𝑋,𝑌)
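To make the two extremes concrete, here is a sketch that evaluates the variation of information for a fully determined pair and for an independent pair. Both tables are hypothetical, and `variation_of_information` is an illustrative helper, not a library function.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def variation_of_information(joint):
    """V(X,Y) = H(X,Y) - I(X,Y) in bits for a dict {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    h_xy = entropy(joint)
    mi = entropy(p_x) + entropy(p_y) - h_xy
    return h_xy - mi

# X = Y: knowing one variable tells you the other.
determined = {(0, 0): 0.5, (1, 1): 0.5}
# Two independent fair bits.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

v_determined = variation_of_information(determined)
v_independent = variation_of_information(independent)

print(v_determined, v_independent)  # 0.0 2.0
```

For the independent pair the mutual information is zero, so the variation of information equals the full joint entropy of two bits.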

Relationship between Entropy, Joint Entropy, & Conditional Entropy

𝐻_𝑃(𝑋,𝑌) ≥ 𝐻_𝑃(𝑋) ≥ 𝐻_𝑃(𝑋|𝑌)

𝐻_𝑃(𝑋,𝑌) ≥ 𝐻_𝑃(𝑌) ≥ 𝐻_𝑃(𝑌|𝑋)

𝐻_𝑃(𝑋,𝑌) = 𝐻_𝑃(𝑋|𝑌) + 𝐻_𝑃(𝑌) = 𝐻_𝑃(𝑌|𝑋) + 𝐻_𝑃(𝑋)

from (Joint Entropy - Entropy) → (Conditional Entropy):

𝐻_𝑃(𝑌|𝑋) = 𝐻_𝑃(𝑋,𝑌) - 𝐻_𝑃(𝑋)

and

𝐻_𝑃(𝑋|𝑌) = 𝐻_𝑃(𝑋,𝑌) - 𝐻_𝑃(𝑌)

𝐻_𝑃(𝑋|𝑌) = - 𝛴_𝑦∊𝑌𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - -𝛴_𝑦∊𝑌[ 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ]

𝐻_𝑃(𝑋|𝑌) = - [ 𝛴_𝑦∊𝑌𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - 𝛴_𝑦∊𝑌[ 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] ]

𝐻_𝑃(𝑋|𝑌) = - [ 𝛴_𝑦∊𝑌[𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] ]

𝐻_𝑃(𝑋|𝑌) = - [𝛴_𝑦∊𝑌[𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦)𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] ] # 𝑃(𝑋=𝑥,𝑌=𝑦) = 𝑃(𝑋=𝑥|𝑌=𝑦)𝑃(𝑌=𝑦)

𝐻_𝑃(𝑋|𝑌) = - [𝛴_𝑦∊𝑌𝑃(𝑌=𝑦) [𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] + 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] - 𝑙𝑔 𝑃(𝑌=𝑦) ] ] # 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) = 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) + 𝑙𝑔 𝑃(𝑌=𝑦)

𝐻_𝑃(𝑋|𝑌) = - [𝛴_𝑦∊𝑌𝑃(𝑌=𝑦) [𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] + 𝑙𝑔 𝑃(𝑌=𝑦) 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) ] - 𝑙𝑔 𝑃(𝑌=𝑦) ] ]

𝐻_𝑃(𝑋|𝑌) = - [𝛴_𝑦∊𝑌𝑃(𝑌=𝑦) [𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] + 𝑙𝑔 𝑃(𝑌=𝑦) - 𝑙𝑔 𝑃(𝑌=𝑦) ] ] # 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) ] = 1

𝐻_𝑃(𝑋|𝑌) = - 𝛴_𝑦∊𝑌𝑃(𝑌=𝑦) 𝛴_𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] # definition of conditional entropy
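The result of the derivation, together with the inequalities at the top of this section, can be sanity-checked numerically. This is a sketch on a hypothetical joint table; all names and numbers are assumptions.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution P(X=x, Y=y).
joint_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint_xy.items():
    p_x[x] += p
    p_y[y] += p

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint_xy)

# Conditional entropy straight from its definition.
h_x_given_y = -sum(p * math.log2(p / p_y[y])
                   for (_, y), p in joint_xy.items() if p > 0)

# Conditional entropy as (Joint Entropy - Entropy).
h_x_given_y_chain = h_xy - h_y

print(h_x_given_y, h_x_given_y_chain)
print(h_xy >= h_x >= h_x_given_y)  # True
```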

KL-Divergence vs Variation of Information

KL Divergence gives us a notion of distance between two distributions over the same variable or set of variables (not symmetric, so not a true metric)

Variation of Information gives us the distance between two jointly distributed variables (symmetric)

KL Divergence is between distributions, Variation of Information between variables within a distribution
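The contrast can be illustrated with a short sketch: KL divergence takes two distributions over the same variable and changes when its arguments are swapped, while variation of information takes one joint distribution and does not care which variable is called 𝑋. All distributions and helper names here are hypothetical.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def kl_divergence(p, q):
    """D(P || Q) in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

def variation_of_information(joint):
    """V(X,Y) = H(X,Y) - I(X,Y) = 2 H(X,Y) - H(X) - H(Y), in bits."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return 2 * entropy(joint) - entropy(p_x) - entropy(p_y)

# Two hypothetical distributions over the same binary variable.
p = {0: 0.5, 1: 0.5}
q = {0: 0.9, 1: 0.1}
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

# One hypothetical joint distribution over two variables.
joint_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
joint_yx = {(y, x): pr for (x, y), pr in joint_xy.items()}
v_xy = variation_of_information(joint_xy)
v_yx = variation_of_information(joint_yx)

print(kl_pq, kl_qp)  # two different values
print(v_xy, v_yx)    # the same value twice
```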

Bringing it All Together

Resources

https://colah.github.io/posts/2015-09-Visual-Information/#fn4

／var／log marcus chiu
