ML Models - List
see: ML - Models
ML Model - Comparisons
- Histogram vs KDE
- Linear Discriminant Analysis (LDA) vs Quadratic Discriminant Analysis (QDA)
- Linear Regression vs Gaussian Regression
- Logistic Regression (LR) vs Linear Discriminant Analysis (LDA)
- Logistic Regression vs Linear SVM vs Non-Linear SVM
linear regression
Input/Output
- input 𝑥: real-valued features
- output 𝑦: Guassian Distribution
Input/Output Relation
- 𝑦 = 𝜃0 + 𝜃1𝑥1 + … + 𝜃𝑛𝑥𝑛
Model Parameter 𝜃
- 𝜃 = {𝜃1, …, 𝜃𝑛}
Solving 𝜃
maximum likelihood estimate
- 𝜃𝑀𝐿𝐸 = 𝑎𝑟𝑔𝑚𝑎𝑥𝜃 𝐏(𝑦|𝑥1, …, 𝑥𝑛)
- 𝜃𝑀𝐿𝐸 = (𝑋𝑇𝑋)-1𝑋𝑇𝑌
where {𝑋,𝑌} are training data
Graphical Model
logistic regression
Input/Output
- input 𝑥: real-valued features
- output 𝑝: Bernoulli Distribution
Input/Output Relation
- 𝑦 = 𝜃0 + 𝜃1𝑥1 + … + 𝜃𝑛𝑥𝑛
- 𝑝 = 1 / [1 + 𝑒𝑥𝑝(-𝑦)]
Click here to expand...
- 𝐏(𝑦=1|𝑥) = 𝑝
- 𝐏(𝑦=0|𝑥) = 1 - 𝑝
- 𝑙𝑜𝑔(𝑝/(1-𝑝)) = 𝜃0 + 𝜃1𝑥1 + … + 𝜃𝑛𝑥𝑛
- 𝑙𝑜𝑔(𝑝/(1-𝑝)) = 𝑦
- 𝑝/(1-𝑝) = 𝑒𝑥𝑝(𝑦)
- 𝑝 = (1-𝑝)𝑒𝑥𝑝(𝑦)
- 𝑝 = 𝑒𝑥𝑝(𝑦) - 𝑝·𝑒𝑥𝑝(𝑦)
- 𝑝 + 𝑝·𝑒𝑥𝑝(𝑦) = 𝑒𝑥𝑝(𝑦)
- 𝑝 [1 + 𝑒𝑥𝑝(𝑦)] = 𝑒𝑥𝑝(𝑦)
- 𝑝 = 𝑒𝑥𝑝(𝑦) / [1 + 𝑒𝑥𝑝(𝑦)]
- # multiply by 𝑒𝑥𝑝(𝑦) / 𝑒𝑥𝑝(𝑦)
- 𝑝 = 1 / [1 + 𝑒𝑥𝑝(-𝑦)]
Model Parameter 𝜃
- 𝜃 = {𝜃1, …, 𝜃𝑛}
Solving 𝜃
- no closed form solution
gradient descent
TODO
Graphical Model
Advantages
- correlated features 𝑥 don’t lead to problems (contrast to naive bayes)
- well calibrated probability (contrast to SVM)
- 𝐏(𝑌𝑖=1|𝑋𝑖) = 𝑝𝑖, ∀ instances {𝑋𝑖,𝑌𝑖}
- → 𝐄[𝛴𝑝𝑖] = 𝛴𝑌𝑖 # number of ”𝑌=1”
- not sensitive to unbalanced training data
- 𝑌𝑖 = 1, if 𝐏(𝑌𝑖=1|𝑋𝑖) > 𝐏(𝑌=1)
- 𝑌𝑖 = 0, otherwise
multinomial logistic regression
Input/Output
- input 𝑥: real-valued features, 𝑛-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙
- output 𝑝𝑐: Multinoulli Distribution, 𝑚-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙
Input/Output Relation
- 𝑦𝑐 = 𝜃𝑐,0 + 𝜃𝑐,1𝑥1 + … + 𝜃𝑐,𝑛𝑥𝑛# weighted sum for class 𝑐
- 𝑝𝑐 = 𝑒𝑥𝑝(-𝑦𝑐) / 𝛴𝑐’∊𝑎𝑙𝑙-𝑐𝑙𝑎𝑠𝑠𝑒𝑠[𝑒𝑥𝑝(-𝑦𝑐’)] # proability of class 𝑐
Model Parameter 𝜃
- 𝜃 = {𝜃1,1, …, 𝜃1,𝑛, 𝜃2,1, …,𝜃2,𝑛, …, 𝜃𝑚,𝑛}
Solving 𝜃
- no closed form solution
gradient descent
TODO
Graphical Model
log-linear model
log-linear model is a structured logistic regression
- structured: allows non-numerical input and output by defining 𝑘 feature functions
- special case: logistic regression where 𝑘 = (𝑛: number of input values)
Input/Output
- input 𝑥: real-valued features, 𝑛-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙
- output 𝑝𝑐: Multinoulli Distribution, 𝑚-𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙
Input/Output Relation
- 𝑦 = 𝑤0 + 𝑤1𝐹1(𝒙,𝑦) + … + 𝑤𝑘𝐹𝑘(𝒙,𝑦)# weighted sum of 𝑘 features
- 𝐏(𝑌=𝑦|𝒙) = 𝑒𝑥𝑝(-𝑦) / 𝛴𝑦’∊𝑌[𝑒𝑥𝑝(-𝑦’)] # proability of 𝑌=𝑦
Model Parameter 𝜃
- 𝜃 = {𝑤0, 𝑤1,…, 𝑤𝑘}
Solving 𝜃
- TODO
Graphical Model
linear-chain CRF
linear-chain CRF is a specific structure of Conditional Random Field
is a log-linear model where:
- the length 𝐿 of output 𝑦 can be varying
- the form of feature function is the sum of “low-level feature functions”:
- 𝐹𝑗(𝒙,𝑦) = 𝛴1≤𝑖≤𝐿𝑓𝑗(𝑦𝑖-1,𝑦𝑖,𝒙,𝑖)
List indent undo
Example: Part of Speech (PoS) Tagging
- input (observed) 𝒙: word sequence
- output (hidden) 𝒚: PoS tag sequence
- 𝒙 = {He, sat, on, the, mat}
- 𝒚 = {pronoun, verb, preposition, article, noun}
with CRF we hope:
- 𝐏({pron, v, prep, art, n}|{He, sat, on, the, mat}) > 𝐏(⟨PoS Tags⟩|{He, sat, on, the, mat}), ∀⟨PoS Tag Sequence⟩ ≠ {pron, v, prep, art, n}
CRF
- 𝒚 = 𝑤0 + 𝑤1𝐹1(𝒙,𝒚) + … + 𝑤𝑘𝐹𝑘(𝒙,𝒚) # weighted sum of 𝑘 features
- 𝐏(𝒀=𝒚|𝒙) = 𝑒𝑥𝑝(-𝒚) / 𝛴𝒚’∊𝒀[𝑒𝑥𝑝(-𝒚’)] # proability of 𝒀=𝒚
where:
- 𝐹𝑗(𝒙,𝒚) = 𝛴1≤𝑖≤𝐿𝑓𝑗(𝑦𝑖-1,𝑦𝑖,𝒙,𝑖)
An example of low-level feature function 𝑓𝑗(𝑦𝑖-1,𝑦𝑖,𝒙,𝑖):
- “The 𝑖th word in 𝒙 is capitalized, and PoS tag 𝑦𝑖 = proper noun” [TRUE(1) or FALSE(0)]
If 𝑤𝑗 positively large, given 𝒙 and other condition fixed, then 𝒚 is more probable if 𝑓𝑗(𝑦𝑖-1,𝑦𝑖,𝒙,𝑖) is activated
CRF Training
stochastic gradient ascent
- partial derivative of conditional log-likelihood:
- 𝒚 = 𝑤0 + 𝑤1𝐹1(𝒙,𝒚) + … + 𝑤𝑘𝐹𝑘(𝒙,𝒚)
- 𝐏(𝒀=𝒚|𝒙) = 𝑒𝑥𝑝(-𝒚) / 𝛴𝒚’∊𝒀[𝑒𝑥𝑝(-𝒚’)]
- 𝛿𝑙𝑜𝑔𝐏(𝒀=𝒚|𝒙) / 𝛿𝑤𝑗 = 𝐹𝑗(𝒙,𝒚) - 𝛴𝒚’∊𝒀[𝐹𝑗(𝒙,𝒚’)𝐏(𝒀=𝒚|𝒙)]
- update weight by:
- 𝑤𝑗← 𝑤𝑗 + 𝛼 [𝛿𝑙𝑜𝑔𝐏(𝒀=𝒚|𝒙) / 𝛿𝑤𝑗]
NOTE: if 𝑗th feature function is not activated by this training example:
- we don’t need to update it
- usually only a few weights need to be updated in each iteration
CRF Testing
for 1-best derivation:
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝒚𝐏(𝒀=𝒚|𝒙)
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝒚𝐏(𝒀=𝒚|𝒙)
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝒚𝑒𝑥𝑝(-𝒚) / 𝛴𝒚’∊𝒀[𝑒𝑥𝑝(-𝒚’)]
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝒚𝑒𝑥𝑝(-𝒚)
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝒚𝛴0≤𝑗≤𝑘[𝑤𝑗·𝐹𝑗(𝒙,𝒚)]
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝒚𝛴0≤𝑗≤𝑘[𝑤𝑗·𝛴1≤𝑖≤𝐿𝑓𝑗(𝑦𝑖-1,𝑦𝑖,𝒙,𝑖)]
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦0, …, 𝑦𝐿𝛴0≤𝑗≤𝑘[𝑤𝑗·𝛴1≤𝑖≤𝐿𝑓𝑗(𝑦𝑖-1,𝑦𝑖,𝒙,𝑖)]
- 𝒚’ = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦0, …, 𝑦𝐿𝛴1≤𝑖≤𝐿𝑔(𝑦𝑖-1,𝑦𝑖) # given {𝑤𝑗} and 𝒙
for 1-best derivation:
- precompute 𝑔(𝑦𝑖-1,𝑦𝑖) as a table for each 𝑖
- perform dynamic programming to find the best sequence 𝒚:
- 𝑠𝑐𝑜𝑟𝑒({𝑦0, …, 𝑦𝑖}) ← 𝑚𝑎𝑥𝑦𝑖-1(𝑠𝑐𝑜𝑟𝑒({𝑦0, …, 𝑦𝑖-1}), 𝑔(𝑦𝑖-1,𝑦𝑖))
complexity
- 𝑂(𝑀2𝐿𝑘)
where:
- 𝑀 - build a table
- 𝐿 - number of elements in sequence
- 𝑘 - number of feature functions
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/linear-regression.png)
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/sigmoid.png)
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/multinomial-logistic-regression.png)
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/log-linear-model.png)
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/linear-chain-crf.png)
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/linear-chain-crf-part-of-peech-tagging.png)
---cognitive-computing---machine-intelligence/ai---subfields/machine-learning-(ml)---pattern-recognition/ml---model-comparisons/example-pos-tagging-dynamic-programming.png)