Maxent Model - Example

let’s consider a discrete random variable 𝐶 with 2 outcomes: ℎ and 𝑡

𝐏(𝐶=ℎ) = probability of seeing heads
𝐏(𝐶=𝑡) = probability of seeing tails

below is the formula for the entropy of a single random variable (univariate entropy); we want to maximize 𝐻_𝐏(𝐏) subject to the constraints of the model

𝐻_𝐏(𝐏) = 𝛴_𝑥∊𝐶[ - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) ]
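as a minimal sketch, the formula above can be evaluated directly for a coin; the function name `entropy` and the example distributions are illustrative assumptions, not part of the original notes

```python
import math

def entropy(probs):
    """Entropy in nats: sum of -p * ln(p), treating 0 * ln(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative distributions over (heads, tails)
fair = entropy([0.5, 0.5])     # equals ln 2, the largest value for a normalized coin
biased = entropy([0.3, 0.7])   # smaller than the fair coin
certain = entropy([1.0, 0.0])  # no uncertainty at all

print(fair, biased, certain)
```

the fair coin has the highest entropy of the three, and a coin that always lands heads has entropy 0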

below are 3 different models

Model With No Constraints	Model With 1 Constraint	Model With 2 Constraints
NONE; here 𝐏(𝐶) is allowed to be an un-normalized distribution, i.e. 𝐏(𝐶) does not have to be a probability distribution	𝐏(𝐶=ℎ) + 𝐏(𝐶=𝑡) = 1; this constrains 𝐏(𝐶) to be a normalized distribution, i.e. 𝐏(𝐶) is a probability distribution	𝐏(𝐶=ℎ) + 𝐏(𝐶=𝑡) = 1 and 𝐏(𝐶=ℎ) = 0.3
thus there is a 2D plane of possible candidates	thus there is a 1D line of possible candidates	thus there is a single point (0D) as the only possible candidate
𝐻_𝐏(𝐏) is maximized when: 𝐏(𝐶=ℎ) = 1/𝑒 and 𝐏(𝐶=𝑡) = 1/𝑒; this is because each term -𝐏(𝐶=𝑥)𝑙𝑛𝐏(𝐶=𝑥) attains its maximum (whose value is also 1/𝑒) at 𝐏(𝐶=𝑥) = 1/𝑒	𝐻_𝐏(𝐏) is maximized when: 𝐏(𝐶=ℎ) = 1/2 and 𝐏(𝐶=𝑡) = 1/2	𝐻_𝐏(𝐏) is maximized when: 𝐏(𝐶=ℎ) = 0.3 and 𝐏(𝐶=𝑡) = 0.7, which is the only candidate point
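the three maxima in the table can be checked numerically; the sketch below uses a simple grid search over [0, 1], and the helper names (`term`, `objective`, `grid`) are assumptions made for illustration

```python
import math

def term(p):
    """One summand of the entropy formula: -p * ln(p), taken as 0 at p = 0."""
    return -p * math.log(p) if p > 0 else 0.0

def objective(p_h, p_t):
    """Entropy objective for the two outcomes h and t."""
    return term(p_h) + term(p_t)

grid = [i / 1000 for i in range(1001)]  # candidate values 0.000 .. 1.000

# No constraints: P(C=h) and P(C=t) are free, so each term is maximized on its own
best_free = max(grid, key=term)

# One constraint: P(C=t) is forced to equal 1 - P(C=h)
best_normalized = max(grid, key=lambda p: objective(p, 1 - p))

# Two constraints: only one candidate point remains
only_candidate = (0.3, 0.7)

print(best_free, objective(best_free, best_free))
print(best_normalized, objective(best_normalized, 1 - best_normalized))
print(only_candidate, objective(*only_candidate))
```

the grid search lands on the grid point nearest 1/𝑒 for the unconstrained model and on 1/2 for the normalized model, and the maximum value of the objective shrinks with each added constraint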

Why Find Maximum Entropy Model?

maximizing entropy in effect helps us find an estimated distribution model 𝐏ˆ that:

minimizes commitment (which is another way of saying maximizes entropy)
resembles some reference for the true population distribution (in practice, the empirical distribution)

this is what we want in the estimated distribution model 𝐏ˆ

Solution

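the solution is to maximize entropy 𝐻, subject to feature-based constraints:

𝐄_𝐏[𝑓_𝑖] = 𝐄_𝐏ˆ[𝑓_𝑖] ↔ 𝛴_{𝑥∊𝑓_𝑖}𝐏_𝑥 = 𝐶_𝑖

i.e. for each feature 𝑓_𝑖, the probabilities of the outcomes 𝑥 on which the feature fires must sum to a constant 𝐶_𝑖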

adding constraints/features:

lowers maximum entropy
raises the maximum likelihood of the data
brings the distribution model further from the uniform distribution
brings the distribution model closer to the empirical distribution
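the trade-off in this list can be seen on the coin example; the sketch below assumes a hypothetical sample of 3 heads and 7 tails, and compares the maxent model with only the normalization constraint (uniform) against the model that also has the constraint 𝐏(𝐶=ℎ) = 0.3

```python
import math

def entropy(p):
    """Entropy in nats, treating 0 * ln(0) as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def avg_log_likelihood(model, empirical):
    """Average per-flip log-likelihood of the data under the model."""
    return sum(e * math.log(m) for e, m in zip(empirical, model) if e > 0)

empirical = (0.3, 0.7)  # hypothetical data: 3 heads, 7 tails in 10 flips
uniform = (0.5, 0.5)    # maxent model with only the normalization constraint
matched = (0.3, 0.7)    # maxent model after adding the constraint P(C=h) = 0.3

for name, model in [("uniform", uniform), ("matched", matched)]:
    print(name, round(entropy(model), 4), round(avg_log_likelihood(model, empirical), 4))
# uniform 0.6931 -0.6931
# matched 0.6109 -0.6109
```

adding the second constraint lowers the entropy (0.6931 to 0.6109) while raising the average log-likelihood of the data (-0.6931 to -0.6109)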

Maxent - Properties

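maximum entropy models are convex optimization problems

a model 𝐹 is called convex here when:

𝐹(𝛴_𝑖𝑤_𝑖𝑥_𝑖) ≥ 𝛴_𝑖𝑤_𝑖𝐹(𝑥_𝑖) where 𝛴_𝑖𝑤_𝑖 = 1 and every 𝑤_𝑖 ≥ 0

(in standard terminology this inequality defines a concave function; maximizing a concave function over a convex set is a convex optimization problem, hence the name)

this property guarantees a single, global maximum: there are no other local maxima, so a greedy uphill search from any starting point reaches it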
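as a rough numerical sanity check (not a proof), the inequality 𝐹(𝛴_𝑖𝑤_𝑖𝑥_𝑖) ≥ 𝛴_𝑖𝑤_𝑖𝐹(𝑥_𝑖) can be tested for a single entropy term with two random points and random weights; the sample size, seed, and tolerance below are arbitrary assumptions

```python
import math
import random

def term(p):
    """One entropy summand: -p * ln(p), taken as 0 at p = 0."""
    return -p * math.log(p) if p > 0 else 0.0

random.seed(0)
violations = 0
for _ in range(10000):
    x1, x2 = random.random(), random.random()  # two points in [0, 1)
    w = random.random()                        # weights w and 1 - w sum to 1
    lhs = term(w * x1 + (1 - w) * x2)          # F applied to the weighted average
    rhs = w * term(x1) + (1 - w) * term(x2)    # weighted average of F
    if lhs < rhs - 1e-12:                      # small tolerance for rounding error
        violations += 1

print(violations)
```

no sampled triple should violate the inequality, which is consistent with the entropy term having a single peak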

the entropy objective 𝐻_𝐏(𝐏) = 𝛴_𝑥∊𝐶[ - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) ] satisfies this inequality (it is concave in standard terminology):

𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) is convex, so its negation - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) is concave

𝛴_𝑥∊𝐶[ - 𝐏(𝐶=𝑥) 𝑙𝑛 𝐏(𝐶=𝑥) ] is concave (a sum of concave functions is concave)

the feasible region of constrained 𝐻_𝐏(𝐏) is defined by linear constraints, so it is a convex set

the constrained entropy surface therefore has a single, global maximum

the Maximum Likelihood Estimation (MLE) exponential model formulation is the dual of this problem and likewise has a single, global optimum
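the duality can be illustrated on the coin example; the sketch below assumes a one-parameter exponential model 𝐏(𝐶=ℎ) = 𝑒^𝜆 / (𝑒^𝜆 + 1) with a single indicator feature for heads, hypothetical data with 30% heads, and plain gradient ascent with an arbitrary step size and iteration count

```python
import math

def p_heads(lam):
    """Exponential model with one feature that fires on heads."""
    return math.exp(lam) / (math.exp(lam) + 1.0)

empirical_heads = 0.3  # hypothetical empirical frequency of heads
lam = 0.0              # start from the uniform model, P(C=h) = 0.5
learning_rate = 1.0    # arbitrary step size for this sketch

for _ in range(500):
    # Gradient of the average log-likelihood with respect to lam:
    # empirical expectation of the feature minus its model expectation
    gradient = empirical_heads - p_heads(lam)
    lam += learning_rate * gradient

fitted = p_heads(lam)
print(lam, fitted)
```

the maximum likelihood fit converges to 𝐏(𝐶=ℎ) ≈ 0.3, the same distribution the maxent model gives under the constraint 𝐏(𝐶=ℎ) = 0.3, and the gradient vanishes exactly when the feature constraint is met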

Resources

Stanford’s NLP Video Lecture

／var／log marcus chiu
