see: Univariate Entropy (Information Content - Entropy - Cross Entropy - KL Divergence)
Entropy 𝐻(𝑋) = 𝐻𝑃(𝑋) = - 𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥) 𝑙𝑔 𝑃(𝑋=𝑥) ]
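As a rough illustration of the entropy formula above, the sketch below computes 𝐻(𝑋) for a few small distributions. It assumes 𝑙𝑔 means the base-2 logarithm (so results are in bits); the function name `entropy` and the example distributions are made up for this illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Outcomes with p == 0 are skipped (the usual 0 * lg 0 = 0 convention).
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])    # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])  # less uncertain than a fair coin
certain = entropy([1.0])           # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The fair coin has the highest entropy of the three and the deterministic variable has none, which matches the reading of entropy as a measure of uncertainty.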
Joint Entropy 𝐻(𝑋,𝑌) = 𝐻𝑃(𝑋,𝑌)
- 𝐻𝑃(𝑋,𝑌) = - 𝛴𝑥∊𝑋 𝛴𝑦∊𝑌[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ]
- joint entropy is ordinary entropy applied to the pair (𝑋,𝑌): treating the pair as a single variable 𝑍 = (𝑋,𝑌) with distribution 𝑄(𝑍) = 𝑃(𝑋,𝑌) gives the entropy formula
- 𝐻𝑄(𝑍) = - 𝛴𝑧∊𝑍 [ 𝑄(𝑍=𝑧) 𝑙𝑔 𝑄(𝑍=𝑧) ]
- joint entropy is symmetric
- 𝐻𝑃(𝑋,𝑌) = 𝐻𝑃(𝑌,𝑋)
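A minimal sketch of joint entropy, assuming 𝑙𝑔 is the base-2 logarithm. The joint table `joint` (weather vs. ground condition) is a hypothetical example, not data from this page. Swapping the roles of 𝑋 and 𝑌 only relabels the cells of the table, so the sum is unchanged.

```python
import math

# Hypothetical joint distribution P(X, Y); keys are (x, y) pairs.
joint = {
    ("sun", "dry"): 0.5,
    ("sun", "wet"): 0.1,
    ("rain", "dry"): 0.1,
    ("rain", "wet"): 0.3,
}

def joint_entropy(p_xy):
    """H(X,Y) = - sum over all (x, y) of P(x,y) lg P(x,y), in bits."""
    return -sum(p * math.log2(p) for p in p_xy.values() if p > 0)

# The same distribution with the variables in the other order: P(Y, X).
swapped = {(y, x): p for (x, y), p in joint.items()}

h_xy = joint_entropy(joint)
h_yx = joint_entropy(swapped)
print(h_xy, h_yx)
```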
Conditional Entropy 𝐻(𝑋|𝑌) = 𝐻𝑃(𝑋|𝑌)
- 𝐻𝑃(𝑋|𝑌) = - 𝛴𝑦∊𝑌𝑃(𝑌=𝑦) 𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ]
- 𝐻𝑃(𝑋|𝑌) = - 𝛴𝑥∊𝑋 𝛴𝑦∊𝑌[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ]
if the value of 𝑋 is completely determined by the value of 𝑌, then:
- 𝐻𝑃(𝑋|𝑌) = 0
the conditional entropy of a variable given itself is zero, 𝐻(𝑋|𝑋) = 0, because entropy measures uncertainty and nothing about 𝑋 remains uncertain once the value of 𝑋 is known:
- 𝐻𝑃(𝑋|𝑋) = - 𝛴𝑥∊𝑋 𝛴𝑥′∊𝑋[ 𝑃(𝑋=𝑥,𝑋=𝑥′) 𝑙𝑔 𝑃(𝑋=𝑥|𝑋=𝑥′) ] = 0
- if 𝑥=𝑥′ then 𝑃(𝑋=𝑥|𝑋=𝑥′) = 1, so 𝑙𝑔 𝑃(𝑋=𝑥|𝑋=𝑥′) = 0
- if 𝑥≠𝑥′ then 𝑃(𝑋=𝑥,𝑋=𝑥′) = 0, so the term vanishes
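The two conditional entropy formulas and the zero cases can be checked numerically. The sketch below assumes base-2 logarithms; the function names and the three toy distributions (`joint`, `determined`, `itself`) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

def marginal_y(p_xy):
    """P(Y=y), obtained by summing the joint table over x."""
    p_y = defaultdict(float)
    for (_, y), p in p_xy.items():
        p_y[y] += p
    return dict(p_y)

def cond_entropy_nested(p_xy):
    """H(X|Y) = - sum_y P(y) sum_x P(x|y) lg P(x|y)."""
    total = 0.0
    for y, py in marginal_y(p_xy).items():
        inner = 0.0
        for (_, y2), p in p_xy.items():
            if y2 == y and p > 0:
                p_x_given_y = p / py
                inner += p_x_given_y * math.log2(p_x_given_y)
        total += py * inner
    return -total

def cond_entropy_joint(p_xy):
    """H(X|Y) = - sum_{x,y} P(x,y) lg P(x|y)."""
    p_y = marginal_y(p_xy)
    return -sum(p * math.log2(p / p_y[y]) for (_, y), p in p_xy.items() if p > 0)

joint = {("sun", "dry"): 0.5, ("sun", "wet"): 0.1,
         ("rain", "dry"): 0.1, ("rain", "wet"): 0.3}
determined = {("a", 0): 0.5, ("b", 1): 0.5}  # X is a function of Y
itself = {("a", "a"): 0.3, ("b", "b"): 0.7}  # the pair (X, X)

h_nested = cond_entropy_nested(joint)
h_joint_form = cond_entropy_joint(joint)
h_determined = cond_entropy_joint(determined)
h_itself = cond_entropy_joint(itself)
print(h_nested, h_joint_form, h_determined, h_itself)
```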
Mutual Information or Information Gain 𝐼(𝑋,𝑌) - information shared between the variables
- 𝐼𝑃(𝑋,𝑌) = 𝐻𝑃(𝑋) + 𝐻𝑃(𝑌) - 𝐻𝑃(𝑋,𝑌)
this definition works because 𝐻𝑃(𝑋) + 𝐻𝑃(𝑌) counts the shared information twice (once in 𝑋 and once in 𝑌), while 𝐻𝑃(𝑋,𝑌) counts it only once
equivalently, 𝐼(𝑋,𝑌) measures the reduction in uncertainty about one variable after observing the other:
- 𝐼𝑃(𝑋,𝑌) = 𝐻𝑃(𝑋) - 𝐻𝑃(𝑋|𝑌) # reduction in uncertainty about 𝑋 after observing 𝑌
- 𝐼𝑃(𝑋,𝑌) = 𝐻𝑃(𝑌) - 𝐻𝑃(𝑌|𝑋) # reduction in uncertainty about 𝑌 after observing 𝑋
therefore, because conditional entropy is never negative for discrete variables:
- 𝐼𝑃(𝑋,𝑌) ≤ 𝑚𝑖𝑛(𝐻𝑃(𝑋), 𝐻𝑃(𝑌))
for discrete variables:
- 𝐼𝑃(𝑋,𝑌) = 𝛴𝑥∊𝑋𝛴𝑦∊𝑌[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 [ 𝑃(𝑋=𝑥,𝑌=𝑦) / [ 𝑃(𝑋=𝑥)𝑃(𝑌=𝑦) ] ] ]
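The different expressions for mutual information should agree on any discrete joint distribution. The sketch below checks this on a hypothetical table, assuming base-2 logarithms; all names (`H`, `marginals`, `mutual_information`, `joint`) are illustrative.

```python
import math
from collections import defaultdict

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(p_xy):
    """Return (P(X), P(Y)) computed from the joint table."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return p_x, p_y

def mutual_information(p_xy):
    """I(X,Y) = sum_{x,y} P(x,y) lg [ P(x,y) / (P(x) P(y)) ]."""
    p_x, p_y = marginals(p_xy)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

joint = {("sun", "dry"): 0.5, ("sun", "wet"): 0.1,
         ("rain", "dry"): 0.1, ("rain", "wet"): 0.3}
p_x, p_y = marginals(joint)

mi_direct = mutual_information(joint)
mi_from_entropies = H(p_x.values()) + H(p_y.values()) - H(joint.values())

# A joint table built as the product of the marginals is independent by construction.
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
mi_independent = mutual_information(independent)

print(mi_direct, mi_from_entropies, mi_independent)
```

The independent table has (numerically) zero mutual information, and the dependent table stays below the smaller of the two marginal entropies.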
Pointwise Mutual Information 𝐼𝑃(𝑋=𝑥,𝑌=𝑦) - measures how much more often the outcomes 𝑥 and 𝑦 co-occur than they would if 𝑋 and 𝑌 were independent
- 𝐼𝑃(𝑋=𝑥,𝑌=𝑦) = 𝑙𝑔 [ 𝑃(𝑋=𝑥,𝑌=𝑦) / [ 𝑃(𝑋=𝑥)𝑃(𝑌=𝑦) ] ]
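A small sketch of pointwise mutual information for single outcome pairs, assuming base-2 logarithms. The probabilities passed in are hypothetical; `pmi` is an illustrative helper name.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information lg [ P(x,y) / (P(x) P(y)) ] for one outcome pair."""
    return math.log2(p_xy / (p_x * p_y))

# Outcomes that co-occur more often than independence predicts (0.5 > 0.6 * 0.6).
pmi_attract = pmi(0.5, 0.6, 0.6)
# Outcomes that co-occur less often than independence predicts (0.1 < 0.6 * 0.4).
pmi_repel = pmi(0.1, 0.6, 0.4)
# Outcomes that co-occur exactly as often as independence predicts.
pmi_neutral = pmi(0.6 * 0.6, 0.6, 0.6)

print(pmi_attract, pmi_repel, pmi_neutral)
```

The sign carries the interpretation: positive when the pair co-occurs more than expected under independence, negative when it co-occurs less, and zero when the two outcomes are independent.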
Self Information - mutual information of a variable with itself, 𝐼𝑃(𝑋,𝑋)
- 𝐼𝑃(𝑋,𝑋) = 𝐻𝑃(𝑋) - 𝐻𝑃(𝑋|𝑋)
- 𝐼𝑃(𝑋,𝑋) = 𝐻𝑃(𝑋) # because 𝐻𝑃(𝑋|𝑋) = 0 (see the Conditional Entropy section)
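The identity 𝐼𝑃(𝑋,𝑋) = 𝐻𝑃(𝑋) can be checked by building the joint distribution of the pair (𝑋,𝑋), which puts all probability on the diagonal. This is a sketch assuming base-2 logarithms; the distribution `p_x` is a made-up example.

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = {"a": 0.5, "b": 0.25, "c": 0.25}

# Joint distribution of (X, X): only pairs with equal values have nonzero mass.
diagonal = {(x, x): p for x, p in p_x.items()}

# Discrete mutual information formula applied to the pair (X, X).
self_information = sum(
    p * math.log2(p / (p_x[x1] * p_x[x2]))
    for (x1, x2), p in diagonal.items() if p > 0
)
h_x = H(p_x.values())
print(self_information, h_x)
```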
Variation of Information 𝑉𝑃(𝑋,𝑌) - a metric (a notion of distance) between variables. It is zero when knowing the value of either variable tells you the value of the other, and it grows as the variables become more independent
- 𝑉𝑃(𝑋,𝑌) = 𝐻𝑃(𝑋,𝑌) - 𝐼𝑃(𝑋,𝑌)
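A minimal sketch of variation of information on two extreme cases, assuming base-2 logarithms: a pair where each variable determines the other, and a pair of independent fair bits. The function name and both toy tables are hypothetical.

```python
import math
from collections import defaultdict

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def variation_of_information(p_xy):
    """V(X,Y) = H(X,Y) - I(X,Y), with I(X,Y) = H(X) + H(Y) - H(X,Y)."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    h_xy = H(p_xy.values())
    mutual_info = H(p_x.values()) + H(p_y.values()) - h_xy
    return h_xy - mutual_info

# Each value of X pairs with exactly one value of Y, and vice versa.
determined = {("a", 0): 0.5, ("b", 1): 0.5}
# Two independent fair bits.
independent = {(x, y): 0.25 for x in ("a", "b") for y in (0, 1)}

vi_determined = variation_of_information(determined)
vi_independent = variation_of_information(independent)
print(vi_determined, vi_independent)
```

For the independent pair the result equals 𝐻𝑃(𝑋) + 𝐻𝑃(𝑌), since no information is shared.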
Relationship between Entropy, Joint Entropy, & Conditional Entropy
- 𝐻𝑃(𝑋,𝑌) ≥ 𝐻𝑃(𝑋) ≥ 𝐻𝑃(𝑋|𝑌)
- 𝐻𝑃(𝑋,𝑌) ≥ 𝐻𝑃(𝑌) ≥ 𝐻𝑃(𝑌|𝑋)
- 𝐻𝑃(𝑋,𝑌) = 𝐻𝑃(𝑋|𝑌) + 𝐻𝑃(𝑌) = 𝐻𝑃(𝑌|𝑋) + 𝐻𝑃(𝑋)
rearranging, conditional entropy is joint entropy minus the entropy of the conditioning variable:
- 𝐻𝑃(𝑌|𝑋) = 𝐻𝑃(𝑋,𝑌) - 𝐻𝑃(𝑋)
- 𝐻𝑃(𝑋|𝑌) = 𝐻𝑃(𝑋,𝑌) - 𝐻𝑃(𝑌)
starting from 𝐻𝑃(𝑋,𝑌) - 𝐻𝑃(𝑌), we recover the definition of conditional entropy:
- 𝐻𝑃(𝑋|𝑌) = - 𝛴𝑦∊𝑌𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - ( - 𝛴𝑦∊𝑌[ 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] )
- 𝐻𝑃(𝑋|𝑌) = - [ 𝛴𝑦∊𝑌𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - 𝛴𝑦∊𝑌[ 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] ]
- 𝐻𝑃(𝑋|𝑌) = - [ 𝛴𝑦∊𝑌[𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥,𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] ]
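- 𝐻𝑃(𝑋|𝑌) = - [𝛴𝑦∊𝑌[𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦)𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) ] - 𝑃(𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] ] # 𝑃(𝑋=𝑥,𝑌=𝑦) = 𝑃(𝑋=𝑥|𝑌=𝑦)𝑃(𝑌=𝑦)
- 𝐻𝑃(𝑋|𝑌) = - [𝛴𝑦∊𝑌𝑃(𝑌=𝑦) [𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] + 𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑌=𝑦) ] - 𝑙𝑔 𝑃(𝑌=𝑦) ] ] # 𝑙𝑔 𝑃(𝑋=𝑥,𝑌=𝑦) = 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) + 𝑙𝑔 𝑃(𝑌=𝑦), then factor out 𝑃(𝑌=𝑦)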
- 𝐻𝑃(𝑋|𝑌) = - [𝛴𝑦∊𝑌𝑃(𝑌=𝑦) [𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] + 𝑙𝑔 𝑃(𝑌=𝑦) 𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) ] - 𝑙𝑔 𝑃(𝑌=𝑦) ] ]
- 𝐻𝑃(𝑋|𝑌) = - [𝛴𝑦∊𝑌𝑃(𝑌=𝑦) [𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] + 𝑙𝑔 𝑃(𝑌=𝑦) - 𝑙𝑔 𝑃(𝑌=𝑦) ] ] # 𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) ] = 1
- 𝐻𝑃(𝑋|𝑌) = - 𝛴𝑦∊𝑌𝑃(𝑌=𝑦) 𝛴𝑥∊𝑋[ 𝑃(𝑋=𝑥|𝑌=𝑦) 𝑙𝑔 𝑃(𝑋=𝑥|𝑌=𝑦) ] # definition of conditional entropy
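The chain rule and the ordering of the three entropies can be sanity-checked on a concrete table. The sketch below assumes base-2 logarithms and uses a hypothetical joint distribution; it computes the conditional entropies directly from their definitions and compares them with the differences of joint and marginal entropies.

```python
import math
from collections import defaultdict

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = {("sun", "dry"): 0.5, ("sun", "wet"): 0.1,
         ("rain", "dry"): 0.1, ("rain", "wet"): 0.3}

# Marginal distributions P(X) and P(Y).
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

h_xy = H(joint.values())
h_x = H(p_x.values())
h_y = H(p_y.values())

# Conditional entropies from the definition: - sum P(x,y) lg P(x|y), and likewise for Y|X.
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in joint.items() if p > 0)

print(h_xy, h_x_given_y + h_y, h_y_given_x + h_x)
```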
KL-Divergence vs Variation of Information
- KL Divergence gives us a distance-like measure between two distributions over the same variable or set of variables (not symmetric, so not a true metric)
- Variation of Information gives us a distance between two jointly distributed variables (symmetric)
in short: KL Divergence compares distributions, while Variation of Information compares variables within one joint distribution
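The contrast can be made concrete with a short sketch, assuming base-2 logarithms. `p` and `q` are two hypothetical distributions over the same binary variable, while `joint` is a hypothetical joint distribution over two variables; the helper names are illustrative.

```python
import math
from collections import defaultdict

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) lg [ P(i) / Q(i) ]; assumes Q(i) > 0 wherever P(i) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variation_of_information(p_xy):
    """V(X,Y) = H(X,Y) - I(X,Y) = 2 H(X,Y) - H(X) - H(Y)."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        p_x[x] += p
        p_y[y] += p
    return 2 * H(p_xy.values()) - H(p_x.values()) - H(p_y.values())

# KL: two distributions over the same variable; the order of arguments matters.
p, q = [0.5, 0.5], [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

# VI: two variables inside one joint distribution; swapping them changes nothing.
joint = {("sun", "dry"): 0.5, ("sun", "wet"): 0.1,
         ("rain", "dry"): 0.1, ("rain", "wet"): 0.3}
swapped = {(y, x): prob for (x, y), prob in joint.items()}
vi_xy = variation_of_information(joint)
vi_yx = variation_of_information(swapped)

print(kl_pq, kl_qp, vi_xy, vi_yx)
```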
Bringing it All Together
[Figure: entropy.png]