ML Models - List

ML Model - Comparisons

linear regression

Input/Output

input 𝑥: real-valued features

output 𝑦: Gaussian Distribution

Input/Output Relation

𝑦 = 𝜃₀ + 𝜃₁𝑥₁ + … + 𝜃_𝑛𝑥_𝑛

Model Parameter 𝜃

𝜃 = {𝜃₀, 𝜃₁, …, 𝜃_𝑛}

Solving 𝜃

maximum likelihood estimate

𝜃_𝑀𝐿𝐸 = 𝑎𝑟𝑔𝑚𝑎𝑥_𝜃 𝐏(𝑦|𝑥₁, …, 𝑥_𝑛; 𝜃)

𝜃_𝑀𝐿𝐸 = (𝑋^𝑇𝑋)^-1𝑋^𝑇𝑌

where {𝑋,𝑌} are the training data (𝑋 includes a constant column of 1s for 𝜃₀)
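The closed form above can be sketched for the single-feature case, where 𝑋^𝑇𝑋 is a 2×2 matrix that can be inverted by hand. This is only an illustrative sketch: the function name `fit_line` and the toy data are assumptions, not part of the original notes.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = theta0 + theta1 * x.

    Solves the 2x2 normal equations (X^T X) theta = X^T Y,
    where each row of X is [1, x_i].
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # X^T X = [[n, sx], [sx, sxx]] and X^T Y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; need at least two distinct x values")

    # Apply the explicit inverse of the 2x2 matrix X^T X to X^T Y
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1


# Toy data generated from y = 1 + 2x (an assumption for illustration)
theta0, theta1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta0, theta1)  # 1.0 2.0
```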

Graphical Model

logistic regression

Input/Output

input 𝑥: real-valued features

output 𝑝: Bernoulli Distribution

Input/Output Relation

𝑦 = 𝜃₀ + 𝜃₁𝑥₁ + … + 𝜃_𝑛𝑥_𝑛

𝑝 = 1 / [1 + 𝑒𝑥𝑝(-𝑦)]


𝐏(𝑦=1|𝑥) = 𝑝

𝐏(𝑦=0|𝑥) = 1 - 𝑝

𝑙𝑜𝑔(𝑝/(1-𝑝)) = 𝜃₀ + 𝜃₁𝑥₁ + … + 𝜃_𝑛𝑥_𝑛

𝑙𝑜𝑔(𝑝/(1-𝑝)) = 𝑦

𝑝/(1-𝑝) = 𝑒𝑥𝑝(𝑦)

𝑝 = (1-𝑝)𝑒𝑥𝑝(𝑦)

𝑝 = 𝑒𝑥𝑝(𝑦) - 𝑝·𝑒𝑥𝑝(𝑦)

𝑝 + 𝑝·𝑒𝑥𝑝(𝑦) = 𝑒𝑥𝑝(𝑦)

𝑝 [1 + 𝑒𝑥𝑝(𝑦)] = 𝑒𝑥𝑝(𝑦)

𝑝 = 𝑒𝑥𝑝(𝑦) / [1 + 𝑒𝑥𝑝(𝑦)]

# multiply by 𝑒𝑥𝑝(-𝑦) / 𝑒𝑥𝑝(-𝑦)

𝑝 = 1 / [1 + 𝑒𝑥𝑝(-𝑦)]
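As a quick numerical check of the derivation, the sketch below confirms that the two closed forms of 𝑝 agree and that the log-odds of 𝑝 recover 𝑦. The helper names `sigmoid` and `logit` are assumptions chosen for illustration.

```python
import math


def sigmoid(y):
    """p = 1 / [1 + exp(-y)]"""
    return 1.0 / (1.0 + math.exp(-y))


def logit(p):
    """log(p / (1 - p)), the log-odds of p"""
    return math.log(p / (1.0 - p))


for y in (-3.0, -0.5, 0.0, 0.5, 3.0):
    p = sigmoid(y)
    # exp(y) / [1 + exp(y)] and 1 / [1 + exp(-y)] are the same quantity
    assert abs(p - math.exp(y) / (1.0 + math.exp(y))) < 1e-12
    # the log-odds of p give back the weighted sum y
    assert abs(logit(p) - y) < 1e-9
```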

Model Parameter 𝜃

𝜃 = {𝜃₀, 𝜃₁, …, 𝜃_𝑛}

Solving 𝜃

no closed form solution

gradient descent

TODO
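One possible way to fill in the gradient descent step is batch gradient descent on the negative log-likelihood, whose gradient with respect to 𝜃_𝑗 is 𝛴(𝑝_𝑖 - 𝑌_𝑖)𝑥_𝑖𝑗. The sketch below is an assumption about how this could look, not the notes' own method; the learning rate, step count, and toy data are arbitrary choices.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def predict_proba(theta, x):
    """P(y=1|x) with theta[0] as the intercept."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))


def fit_logistic(X, Y, lr=0.1, steps=5000):
    """Batch gradient descent on the mean negative log-likelihood."""
    n = len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for x, y in zip(X, Y):
            err = predict_proba(theta, x) - y  # (p_i - Y_i)
            grad[0] += err                     # intercept term (x_0 = 1)
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        theta = [t - lr * g / len(X) for t, g in zip(theta, grad)]
    return theta


# Toy, non-separable data (an assumption for illustration)
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
Y = [0, 0, 1, 0, 1, 1]
theta = fit_logistic(X, Y)
print(predict_proba(theta, [0.0]), predict_proba(theta, [5.0]))
```

The data is deliberately not linearly separable, so the weights stay finite instead of growing without bound.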

Graphical Model

Advantages

correlated features 𝑥 don’t lead to problems (in contrast to naive Bayes)

well-calibrated probabilities (in contrast to SVM)

𝐏(𝑌_𝑖=1|𝑋_𝑖) = 𝑝_𝑖, ∀ instances {𝑋_𝑖,𝑌_𝑖}

→ 𝐄[𝛴𝑝_𝑖] = 𝛴𝑌_𝑖 # number of instances with 𝑌=1

not sensitive to unbalanced training data

𝑌_𝑖 = 1, if 𝐏(𝑌_𝑖=1|𝑋_𝑖) > 𝐏(𝑌=1)

𝑌_𝑖 = 0, otherwise
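The decision rule above compares each predicted probability with the base rate 𝐏(𝑌=1) rather than with 0.5. A minimal sketch, assuming the base rate is estimated as the fraction of positive training labels (the function name and the numbers are made up for illustration):

```python
def classify_with_prior_threshold(probs, train_labels):
    """Label an instance 1 when its predicted P(Y_i=1|X_i) exceeds the
    base rate P(Y=1), estimated from the training labels."""
    prior = sum(train_labels) / len(train_labels)
    return [1 if p > prior else 0 for p in probs]


train_labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 10% positives
predicted = [0.05, 0.30, 0.08, 0.60]
print(classify_with_prior_threshold(predicted, train_labels))  # [0, 1, 0, 1]
```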

multinomial logistic regression

Input/Output

input 𝑥: real-valued features, 𝑛-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙

output 𝑝_𝑐: Multinoulli Distribution, 𝑚-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙

Input/Output Relation

𝑦_𝑐 = 𝜃_{𝑐,0} + 𝜃_{𝑐,1}𝑥₁ + … + 𝜃_{𝑐,𝑛}𝑥_𝑛 # weighted sum for class 𝑐

𝑝_𝑐 = 𝑒𝑥𝑝(𝑦_𝑐) / 𝛴_{𝑐’∊𝑎𝑙𝑙-𝑐𝑙𝑎𝑠𝑠𝑒𝑠}[𝑒𝑥𝑝(𝑦_𝑐’)] # probability of class 𝑐
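The class probabilities can be sketched as a softmax over the per-class weighted sums. The sketch uses the sign convention 𝑒𝑥𝑝(𝑦_𝑐), under which the two-class case reduces to the logistic form 𝑝 = 1 / [1 + 𝑒𝑥𝑝(-𝑦)]. The parameter values and helper names are assumptions for illustration.

```python
import math


def class_scores(theta, x):
    """theta[c] = [theta_c0, theta_c1, ..., theta_cn]; returns y_c per class."""
    return [row[0] + sum(t * xi for t, xi in zip(row[1:], x)) for row in theta]


def softmax(scores):
    """p_c = exp(y_c) / sum over classes of exp(y_c')."""
    m = max(scores)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# m = 3 classes, n = 2 features (arbitrary illustrative parameters)
theta = [[0.0, 1.0, -1.0],
         [0.5, -1.0, 0.5],
         [-0.5, 0.0, 1.0]]
x = [1.0, 2.0]
p = softmax(class_scores(theta, x))
print(p)
```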

Model Parameter 𝜃

𝜃 = {𝜃_{1,0}, 𝜃_{1,1}, …, 𝜃_{1,𝑛}, 𝜃_{2,0}, …, 𝜃_{2,𝑛}, …, 𝜃_{𝑚,𝑛}}

Solving 𝜃

no closed form solution

gradient descent

TODO

Graphical Model

log-linear model

log-linear model is a structured logistic regression

structured: allows non-numerical input and output by defining 𝑘 feature functions

special case: logistic regression, where 𝑘 = 𝑛 (the number of input features)

Input/Output

input 𝑥: real-valued features, 𝑛-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙

output 𝑝_𝑐: Multinoulli Distribution, 𝑚-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙

Input/Output Relation

𝑠(𝒙,𝑦) = 𝑤₀ + 𝑤₁𝐹₁(𝒙,𝑦) + … + 𝑤_𝑘𝐹_𝑘(𝒙,𝑦) # weighted sum of 𝑘 features (score of output 𝑦)

𝐏(𝑌=𝑦|𝒙) = 𝑒𝑥𝑝(𝑠(𝒙,𝑦)) / 𝛴_{𝑦’∊𝑌}[𝑒𝑥𝑝(𝑠(𝒙,𝑦’))] # probability of 𝑌=𝑦
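To make the role of feature functions concrete, here is a small sketch with a non-numerical input (a word) and a non-numerical output (a label). It assumes the convention that a larger weighted sum means a more probable output. The two feature functions, their weights, and the label set are all hypothetical.

```python
import math

LABELS = ["noun", "verb"]


def f_ing_verb(x, y):
    """1 if the word ends in 'ing' and the label is verb."""
    return 1.0 if x.endswith("ing") and y == "verb" else 0.0


def f_cap_noun(x, y):
    """1 if the word is capitalized and the label is noun."""
    return 1.0 if x[:1].isupper() and y == "noun" else 0.0


FEATURES = [f_ing_verb, f_cap_noun]  # k = 2 feature functions
WEIGHTS = [1.5, 2.0]                 # w_1, w_2 (made-up values)
W0 = 0.0                             # constant term; cancels when normalizing


def score(x, y):
    """Weighted sum of the k feature functions for the pair (x, y)."""
    return W0 + sum(w * f(x, y) for w, f in zip(WEIGHTS, FEATURES))


def prob(x, y):
    """exp(score) normalized over all candidate outputs."""
    z = sum(math.exp(score(x, y2)) for y2 in LABELS)
    return math.exp(score(x, y)) / z


print(prob("running", "verb"), prob("Paris", "noun"))
```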

Model Parameter 𝜃

𝜃 = {𝑤₀, 𝑤₁,…, 𝑤_𝑘}

Solving 𝜃

TODO

Graphical Model

linear-chain CRF

a linear-chain CRF is a Conditional Random Field with a specific (chain) structure

it is a log-linear model where:

the length 𝐿 of the output 𝑦 can vary

the form of feature function is the sum of “low-level feature functions”:

𝐹_𝑗(𝒙,𝑦) = 𝛴_{1≤𝑖≤𝐿}𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖)


Example: Part of Speech (PoS) Tagging

input (observed) 𝒙: word sequence

output (hidden) 𝒚: PoS tag sequence

𝒙 = {He, sat, on, the, mat}

𝒚 = {pronoun, verb, preposition, article, noun}

with CRF we hope:

𝐏({pron, v, prep, art, n}|{He, sat, on, the, mat}) > 𝐏(⟨PoS Tags⟩|{He, sat, on, the, mat}), ∀⟨PoS Tag Sequence⟩ ≠ {pron, v, prep, art, n}

CRF

𝑠(𝒙,𝒚) = 𝑤₀ + 𝑤₁𝐹₁(𝒙,𝒚) + … + 𝑤_𝑘𝐹_𝑘(𝒙,𝒚) # weighted sum of 𝑘 features (score of tag sequence 𝒚)

𝐏(𝒀=𝒚|𝒙) = 𝑒𝑥𝑝(𝑠(𝒙,𝒚)) / 𝛴_{𝒚’∊𝒀}[𝑒𝑥𝑝(𝑠(𝒙,𝒚’))] # probability of 𝒀=𝒚

where:

𝐹_𝑗(𝒙,𝒚) = 𝛴_{1≤𝑖≤𝐿}𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖)

An example of low-level feature function 𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖):

“The 𝑖^th word in 𝒙 is capitalized, and PoS tag 𝑦_𝑖 = proper noun” [TRUE(1) or FALSE(0)]

If 𝑤_𝑗 is large and positive, then, with 𝒙 and everything else held fixed, a tag sequence 𝒚 is more probable when it activates 𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖)
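The example feature function and the sum that turns it into 𝐹_𝑗 can be sketched as follows. The code uses 0-based positions, and the start symbol `"<s>"` standing in for the tag before the first word is an assumption, as are the function names and the sample sentence.

```python
def f_capitalized_proper_noun(y_prev, y_i, x, i):
    """Low-level feature: word i is capitalized and its tag is proper noun."""
    return 1 if x[i][:1].isupper() and y_i == "proper noun" else 0


def F(f, x, y, start="<s>"):
    """High-level feature: sum of the low-level feature over all positions."""
    total = 0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else start  # tag preceding position i
        total += f(y_prev, y[i], x, i)
    return total


x = ["Alice", "met", "Bob"]
y = ["proper noun", "verb", "proper noun"]
print(F(f_capitalized_proper_noun, x, y))  # 2
```

Because 𝐹_𝑗 is a sum over positions, the same feature function applies to sequences of any length 𝐿.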

CRF Training

stochastic gradient ascent

partial derivative of conditional log-likelihood:

𝑠(𝒙,𝒚) = 𝑤₀ + 𝑤₁𝐹₁(𝒙,𝒚) + … + 𝑤_𝑘𝐹_𝑘(𝒙,𝒚)

𝐏(𝒀=𝒚|𝒙) = 𝑒𝑥𝑝(𝑠(𝒙,𝒚)) / 𝛴_{𝒚’∊𝒀}[𝑒𝑥𝑝(𝑠(𝒙,𝒚’))]

𝛿𝑙𝑜𝑔𝐏(𝒀=𝒚|𝒙) / 𝛿𝑤_𝑗 = 𝐹_𝑗(𝒙,𝒚) - 𝛴_{𝒚’∊𝒀}[𝐹_𝑗(𝒙,𝒚’)𝐏(𝒀=𝒚’|𝒙)] # observed feature value minus its expected value under the model

update weight by:

𝑤_𝑗 ← 𝑤_𝑗 + 𝛼 [𝛿𝑙𝑜𝑔𝐏(𝒀=𝒚|𝒙) / 𝛿𝑤_𝑗]

NOTE: if 𝑗^th feature function is not activated by this training example:

we don’t need to update it

usually only a few weights need to be updated in each iteration
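A single stochastic gradient ascent step can be sketched by brute force on a tiny example, enumerating every tag sequence to compute the expected feature values. This is only feasible for very short sequences and is not how a practical CRF computes the expectation. The tag set, feature functions, start symbol, and step size are all assumptions, and the constant 𝑤₀ is omitted because it cancels in the normalization.

```python
import itertools
import math

TAGS = ["N", "V"]
START = "<s>"  # assumed placeholder for the tag before the first word


def f_noun_then_verb(y_prev, y_i, x, i):
    return 1.0 if y_prev == "N" and y_i == "V" else 0.0


def f_ends_s_verb(y_prev, y_i, x, i):
    return 1.0 if x[i].endswith("s") and y_i == "V" else 0.0


LOW_LEVEL = [f_noun_then_verb, f_ends_s_verb]


def F(j, x, y):
    """Sum of low-level feature j over all positions (0-based)."""
    return sum(LOW_LEVEL[j](y[i - 1] if i > 0 else START, y[i], x, i)
               for i in range(len(x)))


def score(w, x, y):
    return sum(w[j] * F(j, x, y) for j in range(len(w)))


def log_prob(w, x, y):
    seqs = itertools.product(TAGS, repeat=len(x))
    log_z = math.log(sum(math.exp(score(w, x, y2)) for y2 in seqs))
    return score(w, x, y) - log_z


def sgd_step(w, x, y, alpha=0.1):
    """w_j <- w_j + alpha * (F_j(x, y) - E_{y'}[F_j(x, y')])"""
    seqs = list(itertools.product(TAGS, repeat=len(x)))
    probs = [math.exp(log_prob(w, x, y2)) for y2 in seqs]
    new_w = []
    for j in range(len(w)):
        expected = sum(p * F(j, x, y2) for p, y2 in zip(probs, seqs))
        new_w.append(w[j] + alpha * (F(j, x, y) - expected))
    return new_w


x = ["dog", "barks"]
y = ("N", "V")
w = [0.0, 0.0]
w_new = sgd_step(w, x, y)
print(w_new, log_prob(w, x, y), log_prob(w_new, x, y))
```

With all weights at zero the four tag sequences are equally likely, so the step moves each weight by 𝛼 times the gap between the observed and the average feature value, and the log-likelihood of the training pair increases.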

CRF Testing

for 1-best derivation:

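𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_𝒚 𝐏(𝒀=𝒚|𝒙)

𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_𝒚 𝑒𝑥𝑝(𝑠(𝒙,𝒚)) / 𝛴_{𝒚’∊𝒀}[𝑒𝑥𝑝(𝑠(𝒙,𝒚’))] # 𝑠(𝒙,𝒚) = 𝑤₀ + 𝛴_{1≤𝑗≤𝑘}[𝑤_𝑗·𝐹_𝑗(𝒙,𝒚)]

𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_𝒚 𝑒𝑥𝑝(𝑠(𝒙,𝒚)) # the denominator does not depend on 𝒚

𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_𝒚 𝛴_{1≤𝑗≤𝑘}[𝑤_𝑗·𝐹_𝑗(𝒙,𝒚)] # 𝑒𝑥𝑝 is increasing, and 𝑤₀ is constant in 𝒚

𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_𝒚 𝛴_{1≤𝑗≤𝑘}[𝑤_𝑗·𝛴_{1≤𝑖≤𝐿}𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖)]

𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_{𝑦₀, …, 𝑦_𝐿} 𝛴_{1≤𝑖≤𝐿}𝛴_{1≤𝑗≤𝑘}[𝑤_𝑗·𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖)] # swap the order of the two sums

𝒚* = 𝑎𝑟𝑔𝑚𝑎𝑥_{𝑦₀, …, 𝑦_𝐿} 𝛴_{1≤𝑖≤𝐿}𝑔_𝑖(𝑦_{𝑖-1},𝑦_𝑖) # 𝑔_𝑖(𝑦_{𝑖-1},𝑦_𝑖) = 𝛴_{1≤𝑗≤𝑘}[𝑤_𝑗·𝑓_𝑗(𝑦_{𝑖-1},𝑦_𝑖,𝒙,𝑖)], given {𝑤_𝑗} and 𝒙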

to compute the 1-best derivation:

precompute 𝑔_𝑖(𝑦_{𝑖-1},𝑦_𝑖) as a table for each 𝑖

perform dynamic programming to find the best sequence 𝒚:

𝑠𝑐𝑜𝑟𝑒_𝑖(𝑦_𝑖) ← 𝑚𝑎𝑥_{𝑦_{𝑖-1}}[𝑠𝑐𝑜𝑟𝑒_{𝑖-1}(𝑦_{𝑖-1}) + 𝑔_𝑖(𝑦_{𝑖-1},𝑦_𝑖)] # best score of any prefix ending in tag 𝑦_𝑖

complexity

𝑂(𝑀²𝐿𝑘)

where:

𝑀 - number of possible tags (each table 𝑔 has 𝑀² entries)

𝐿 - number of elements in sequence

𝑘 - number of feature functions
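The dynamic program can be sketched as a Viterbi-style search over precomputed pair-score tables. In this sketch, `g[i][(y_prev, y_i)]` is assumed to hold the already-summed weighted feature score for position `i`, the start symbol `"<s>"` is a placeholder for 𝑦₀, and the table values are made up. For brevity it stores whole prefixes instead of back-pointers.

```python
def viterbi(g, tags, start="<s>"):
    """Return (best_score, best_sequence) for a list of L pair-score tables.

    g[i] maps (y_prev, y_i) to the score of that tag pair at position i.
    """
    # best[t] = (score of the best prefix ending in tag t, that prefix)
    best = {t: (g[0][(start, t)], [t]) for t in tags}
    for i in range(1, len(g)):
        new_best = {}
        for t in tags:
            # pick the previous tag that maximizes prefix score + pair score
            prev = max(tags, key=lambda p: best[p][0] + g[i][(p, t)])
            new_best[t] = (best[prev][0] + g[i][(prev, t)],
                           best[prev][1] + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])


tags = ["N", "V"]
g = [
    {("<s>", "N"): 1.0, ("<s>", "V"): 0.0},
    {("N", "N"): 0.0, ("N", "V"): 2.0, ("V", "N"): 0.5, ("V", "V"): 0.0},
    {("N", "N"): 0.5, ("N", "V"): 0.0, ("V", "N"): 1.5, ("V", "V"): 0.0},
]
best_score, best_seq = viterbi(g, tags)
print(best_score, best_seq)  # 4.5 ['N', 'V', 'N']
```

Given the tables, the search itself visits 𝑀² tag pairs at each of the 𝐿 positions; the factor 𝑘 comes from evaluating the 𝑘 feature functions when filling each table entry.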

／var／log marcus chiu
