Binomial/Binary Logistic Regression (BLR)

is a type of logistic regression where the dependent variable is nominal and binary
similar to a perceptron whose activation function is the sigmoid function
similar to Linear SVM (SVM Without Kernel): both learn a linear decision boundary, but BLR minimizes log loss while SVM minimizes hinge loss

BLR - Model Representation (without Features)

Given input attribute values 𝒙 find probability of 𝑦=1

𝑦 - binary output value
𝒙 - input attribute values vector (i.e. 𝒙 = [𝑥₀, …, 𝑥_𝑘]) # 𝑥₀=1 always, so 𝜃₀ acts as the bias
𝜽 - weight/parameter vector (i.e. 𝜽 = [𝜃₀, …, 𝜃_𝑘])

ℎ_𝜽(𝒙) = 1 / (1 + 𝑒^{-𝜽^𝑇𝒙})
ℎ_𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)
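
A minimal sketch of the model above, assuming plain Python lists for 𝜽 and 𝒙. The names `sigmoid` and `hypothesis` and the sample weights are illustrative and not from the original notes.

```python
import math

def sigmoid(z):
    """Sigmoid g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [0.0, 1.0, -1.0]     # hypothetical weights, theta_0 is the bias
x = [1.0, 2.0, 2.0]          # x_0 = 1, then two attribute values
print(hypothesis(theta, x))  # theta^T x = 0, so this prints 0.5
```

The returned value is read as the estimated probability that 𝑦 = 1 for the given 𝒙.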

ℎ_𝜽(𝒙) = 𝑔(𝜽^𝑇𝒙)

𝑔(𝑧) = 1 / (1 + 𝑒^-𝑧)

𝑔(𝑧) = (𝑒^𝑧/𝑒^𝑧) [1 / (1 + 𝑒^-𝑧)]

𝑔(𝑧) = (𝑒^𝑧 * 1) / [𝑒^𝑧* (1 + 𝑒^-𝑧)]

𝑔(𝑧) = 𝑒^𝑧 / (𝑒^𝑧 + 𝑒^𝑧𝑒^-𝑧)

𝑔(𝑧) = 𝑒^𝑧 / (𝑒^𝑧 + 𝑒⁰)

𝑔(𝑧) = 𝑒^𝑧 / (𝑒^𝑧 + 1)

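𝑔(𝑧) = 𝑒^{𝑧·1} / 𝛴_{0≤𝑖≤1}[𝑒^{𝑧·𝑖}] # the sum over 𝑖 expands to 𝑒⁰ + 𝑒^𝑧

𝑔(𝑧) = 𝑒^{𝛴_{0≤𝑗≤𝑙}[𝜃_𝑗·𝑓_𝑗(𝑦=1,𝒙)]} / 𝛴_{0≤𝑖≤1}[𝑒^{𝛴_{0≤𝑗≤𝑙}[𝜃_𝑗·𝑓_𝑗(𝑦=𝑖,𝒙)]}] # holds when 𝛴_𝑗[𝜃_𝑗·𝑓_𝑗(𝑦=1,𝒙)] = 𝑧 and 𝛴_𝑗[𝜃_𝑗·𝑓_𝑗(𝑦=0,𝒙)] = 0, e.g. 𝑓_𝑗(𝑦=𝑖,𝒙) = 𝑖·𝑥_𝑗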

where:

𝑔(..) - is the Sigmoid Function that has a range of (0, 1)

therefore:

ℎ_𝜽(𝒙) = 1 / (1 + 𝑒^{-𝜽^𝑇𝒙})

ℎ_𝜽(𝒙) is the estimated probability that 𝑦 = 1

ℎ_𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)

𝑙𝑜𝑔[ 𝐏(𝑦=1|𝒙;𝜽) / 𝐏(𝑦=0|𝒙;𝜽) ] = 𝜽^𝑇𝒙
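
The two algebraically equal forms of 𝑔(𝑧) derived above, 1 / (1 + 𝑒^-𝑧) and 𝑒^𝑧 / (𝑒^𝑧 + 1), are handy in code because each one avoids overflow on one side of zero. The sketch below is one possible way to use them and also checks numerically that the log-odds recover 𝑧; the helper names and the value of `z` are hypothetical.

```python
import math

def sigmoid_stable(z):
    """Sigmoid that never exponentiates a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))   # form 1 / (1 + e^-z)
    ez = math.exp(z)
    return ez / (ez + 1.0)                  # form e^z / (e^z + 1)

def log_odds(p):
    """log[ P(y=1) / P(y=0) ] for p = P(y=1)."""
    return math.log(p / (1.0 - p))

z = 1.5  # hypothetical value of theta^T x
p = sigmoid_stable(z)
print(abs(log_odds(p) - z) < 1e-9)  # prints True: the log-odds recover z
```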

BLR - Model Representation (with Features)

Given input attribute values 𝒙 find probability of 𝑦=1

𝑦 - binary output value
𝒙 - input attribute values vector (i.e. 𝒙 = [𝑥₀, …, 𝑥_𝑘]) # 𝑥₀=1 always, so 𝜃₀ acts as the bias
𝐹(𝑦=𝑖,𝒙) - set of 𝑙 features extracted from 𝒙 (i.e. [𝑓₀(𝑦=𝑖,𝒙), …, 𝑓_𝑙(𝑦=𝑖,𝒙)]) # this will act as 𝒙 in the case of model representation without features
𝜽 - weight/parameter vector (i.e. 𝜽 = [𝜃₀, …, 𝜃_𝑙]) # where 𝑙 is the number of features extracted from 𝒙

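ℎ_𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)
ℎ_𝜽(𝒙) = 𝑒^{𝛴_{0≤𝑗≤𝑙}[𝜃_𝑗·𝑓_𝑗(𝑦=1,𝒙)]} / 𝛴_{0≤𝑖≤1}[𝑒^{𝛴_{0≤𝑗≤𝑙}[𝜃_𝑗·𝑓_𝑗(𝑦=𝑖,𝒙)]}]
ℎ_𝜽(𝒙) = 1 / (1 + 𝑒^{-𝜽^𝑇[𝐹(𝑦=1,𝒙) - 𝐹(𝑦=0,𝒙)]}) # divide numerator and denominator by the numerator
ℎ_𝜽(𝒙) = 1 / (1 + 𝑒^{-𝜽^𝑇𝐹(𝑦=1,𝒙)}) # when every 𝑓_𝑗(𝑦=0,𝒙) = 0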
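
A small numeric check of the feature-based form, under the assumption (not stated in the notes) that the feature map is 𝑓_𝑗(𝑦=𝑖,𝒙) = 𝑖·𝑥_𝑗, so that every feature is zero for 𝑦=0. With that hypothetical choice the normalized-exponential form and the sigmoid form should agree.

```python
import math

def features(i, x):
    """Hypothetical feature map: f_j(y=i, x) = i * x_j, so F(y=0, x) is all zeros."""
    return [i * xj for xj in x]

def score(theta, f):
    """Weighted feature sum: sum_j theta_j * f_j."""
    return sum(t * fj for t, fj in zip(theta, f))

def h_features(theta, x):
    """P(y=1 | x) as e^{score for y=1} / sum over y in {0, 1} of e^{score}."""
    scores = [score(theta, features(i, x)) for i in (0, 1)]
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    return exps[1] / sum(exps)

def h_sigmoid(theta, x):
    """Sigmoid form 1 / (1 + e^{-theta^T x})."""
    return 1.0 / (1.0 + math.exp(-score(theta, x)))

theta = [0.5, -1.0, 2.0]   # hypothetical weights
x = [1.0, 3.0, 1.0]        # x_0 = 1
print(abs(h_features(theta, x) - h_sigmoid(theta, x)) < 1e-12)  # prints True
```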

Info

Instead of coming up with 𝑙 features manually, consider “automating” it with neural networks (see Feed-Forward Network)

BLR - Cost Function (Using Squared Error) - DO NOT USE

𝑐𝑜𝑠𝑡(ℎ_𝜽(𝒙), 𝑦) = (1/2) * [ℎ_𝜽(𝒙) - 𝑦]²

This cost function is not convex with respect to 𝜽: ℎ_𝜽(𝒙) is a sigmoid of 𝜽^𝑇𝒙 rather than a linear function of 𝜽 as in Linear Regression, so gradient descent is not guaranteed to reach the global minimum
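
The non-convexity can be seen on a toy case. The sketch below assumes a single scalar weight and one hypothetical sample (𝑥=1, 𝑦=1) and applies the midpoint test for convexity; the log loss is included only as a contrast.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(theta, x=1.0, y=1.0):
    """Squared-error cost of one sample as a function of a scalar weight."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

def log_loss(theta, x=1.0, y=1.0):
    """Log-loss cost of the same sample."""
    h = sigmoid(theta * x)
    return -(y * math.log(h) + (1.0 - y) * math.log(1.0 - h))

def violates_midpoint_convexity(f, a, b):
    """A convex f satisfies f((a+b)/2) <= (f(a)+f(b))/2 for every a, b."""
    return f((a + b) / 2.0) > (f(a) + f(b)) / 2.0

print(violates_midpoint_convexity(squared_error, -10.0, 0.0))  # True
print(violates_midpoint_convexity(log_loss, -10.0, 0.0))       # False
```

A single violating pair is enough to show that the squared-error cost is not convex; passing the test on one pair does not by itself prove convexity of the log loss.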

BLR - Cost Function (Using Log Loss Function)

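Cost Function of Single Sample (𝒙,𝑦) - Pairwise

𝑐𝑜𝑠𝑡(ℎ_𝜽(𝒙), 𝑦) = -𝑙𝑜𝑔(ℎ_𝜽(𝒙)), if 𝑦 = 1
𝑐𝑜𝑠𝑡(ℎ_𝜽(𝒙), 𝑦) = -𝑙𝑜𝑔(1-ℎ_𝜽(𝒙)), if 𝑦 = 0

Cost Function of Single Sample (𝒙,𝑦) - Combined

𝑐𝑜𝑠𝑡(ℎ_𝜽(𝒙), 𝑦) = -𝑦·𝑙𝑜𝑔(ℎ_𝜽(𝒙)) - (1-𝑦)·𝑙𝑜𝑔(1-ℎ_𝜽(𝒙))

Cost Function of Multiple Samples {(𝒙₁,𝑦₁), …, (𝒙_𝑛,𝑦_𝑛)}

𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖)·𝑙𝑜𝑔(ℎ_𝜽(𝒙^(𝑖))) + (1-𝑦^(𝑖))·𝑙𝑜𝑔(1-ℎ_𝜽(𝒙^(𝑖)))]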
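
A sketch of the log loss in code, assuming the per-sample cost is -𝑙𝑜𝑔(ℎ) when 𝑦=1 and -𝑙𝑜𝑔(1-ℎ) when 𝑦=0, and that 𝐽(𝜽) is the mean of that cost over 𝑛 samples. The function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_single(h, y):
    """Pairwise form: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_combined(h, y):
    """Combined form: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

def J(theta, X, Y):
    """Mean log loss over n samples; each row of X starts with x_0 = 1."""
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += cost_combined(h, y)
    return total / len(X)

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]   # hypothetical toy data
Y = [1, 0, 1]
print(J([0.0, 0.0], X, Y))  # every h is 0.5, so J = log(2), about 0.6931
```

This sketch does not guard against ℎ being exactly 0 or 1, where the logarithm is undefined; a practical version would clip ℎ away from those endpoints.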

BLR - Learning 𝜃s With Gradient Descent

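to minimize 𝐽(𝜽) we need its derivative with respect to each 𝜃_𝑗, which gradient descent uses to update all 𝜃_𝑗 simultaneously:

𝜃_𝑗 ← 𝜃_𝑗 - 𝛼·(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) # 𝛼 is the learning rate

derivative of 𝐽(𝜽) with respect to 𝜃_𝑗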

𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖)·𝑙𝑜𝑔(ℎ_𝜽(𝒙^(𝑖))) + (1-𝑦^(𝑖))·𝑙𝑜𝑔(1-ℎ_𝜽(𝒙^(𝑖)))]

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = (𝛿/𝛿𝜃_𝑗) [ -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖)·𝑙𝑜𝑔(ℎ_𝜽(𝒙^(𝑖))) + (1-𝑦^(𝑖))·𝑙𝑜𝑔(1-ℎ_𝜽(𝒙^(𝑖)))] ]

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖)·(𝛿/𝛿𝜃_𝑗)𝑙𝑜𝑔(ℎ_𝜽(𝒙^(𝑖))) + (1-𝑦^(𝑖))·(𝛿/𝛿𝜃_𝑗)𝑙𝑜𝑔(1-ℎ_𝜽(𝒙^(𝑖)))]

(𝛿/𝛿𝜃_𝑗)𝑙𝑜𝑔(ℎ_𝜽(𝒙^(𝑖))) = (1/ℎ_𝜽(𝒙^(𝑖)))·(𝛿/𝛿𝜃_𝑗)ℎ_𝜽(𝒙^(𝑖))

(𝛿/𝛿𝜃_𝑗)𝑙𝑜𝑔(1-ℎ_𝜽(𝒙^(𝑖))) = -(1/(1-ℎ_𝜽(𝒙^(𝑖))))·(𝛿/𝛿𝜃_𝑗)ℎ_𝜽(𝒙^(𝑖))

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[(𝑦^(𝑖)/ℎ_𝜽(𝒙^(𝑖)))·(𝛿/𝛿𝜃_𝑗)ℎ_𝜽(𝒙^(𝑖)) - (1-𝑦^(𝑖))/(1-ℎ_𝜽(𝒙^(𝑖)))·(𝛿/𝛿𝜃_𝑗)ℎ_𝜽(𝒙^(𝑖))]

ℎ_𝜽(𝒙^(𝑖)) = 𝑔(𝑧), where 𝑧 = 𝜽·𝒙^(𝑖)

(𝛿/𝛿𝜃_𝑗)ℎ_𝜽(𝒙^(𝑖)) = 𝑔’(𝑧) * (𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖) # chain rule

𝑔(𝑧) = 1/[1 + 𝑒^-𝑧]

𝑔’(𝑧) = (𝑑/𝑑𝑧)[1 + 𝑒^-𝑧]⁻¹

𝑔’(𝑧) = -[1 + 𝑒^-𝑧]⁻² * (-𝑒^-𝑧)

𝑔’(𝑧) = 𝑒^-𝑧/[1 + 𝑒^-𝑧]²

𝑔’(𝑧) = 1/[1 + 𝑒^-𝑧] * 𝑒^-𝑧/[1 + 𝑒^-𝑧]

𝑔’(𝑧) = 𝑔(𝑧) * [1 - 𝑔(𝑧)] # since 𝑒^-𝑧/[1 + 𝑒^-𝑧] = 1 - 1/[1 + 𝑒^-𝑧]

(𝛿/𝛿𝜃_𝑗)ℎ_𝜽(𝒙^(𝑖)) = ℎ_𝜽(𝒙^(𝑖)) * [1 - ℎ_𝜽(𝒙^(𝑖))] * (𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖)

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[(𝑦^(𝑖)/ℎ_𝜽(𝒙^(𝑖))) * ℎ_𝜽(𝒙^(𝑖)) * [1 - ℎ_𝜽(𝒙^(𝑖))] * (𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖) - ((1-𝑦^(𝑖))/(1-ℎ_𝜽(𝒙^(𝑖)))) * ℎ_𝜽(𝒙^(𝑖)) * [1 - ℎ_𝜽(𝒙^(𝑖))] * (𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖)]

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[[𝑦^(𝑖) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))] * (𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖) - [ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))] * (𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖)]

(𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖) = (𝛿/𝛿𝜃_𝑗)𝛴_{0≤𝑙≤𝑘}[𝜃_𝑙·𝑥_𝑙^(𝑖)]

(𝛿/𝛿𝜃_𝑗)𝜽·𝒙^(𝑖) = 𝑥_𝑗^(𝑖)

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[[𝑦^(𝑖) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))] * 𝑥_𝑗^(𝑖) - [ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))] * 𝑥_𝑗^(𝑖)]

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[[𝑦^(𝑖) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))] - [ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))]] * 𝑥_𝑗^(𝑖)

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖) - 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖)) - ℎ_𝜽(𝒙^(𝑖)) + 𝑦^(𝑖)ℎ_𝜽(𝒙^(𝑖))] * 𝑥_𝑗^(𝑖)

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖) - ℎ_𝜽(𝒙^(𝑖))] * 𝑥_𝑗^(𝑖)

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴_{1≤𝑖≤𝑛}[ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)] * 𝑥_𝑗^(𝑖) * (-1)

(𝛿/𝛿𝜃_𝑗)𝐽(𝜽) = (1/𝑛) 𝛴_{1≤𝑖≤𝑛}[ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)] * 𝑥_𝑗^(𝑖)
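
One way to sanity-check the final expression is to compare it with a finite-difference estimate of the derivative of 𝐽(𝜽). The sketch below does that on hypothetical toy data; the function names, `eps`, and the tolerance are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def J(theta, X, Y):
    """Mean log loss over the n samples."""
    n = len(X)
    return -sum(y * math.log(h(theta, x)) + (1 - y) * math.log(1 - h(theta, x))
                for x, y in zip(X, Y)) / n

def gradient(theta, X, Y):
    """dJ/dtheta_j = (1/n) * sum_i [h(x_i) - y_i] * x_ij for each j."""
    n = len(X)
    return [sum((h(theta, x) - y) * x[j] for x, y in zip(X, Y)) / n
            for j in range(len(theta))]

def numerical_gradient(theta, X, Y, eps=1e-6):
    """Central differences, used only to sanity-check the analytic formula."""
    grad = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        down = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grad.append((J(up, X, Y) - J(down, X, Y)) / (2 * eps))
    return grad

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]   # hypothetical toy data, x_0 = 1
Y = [1, 0, 1]
theta = [0.1, -0.3]
analytic = gradient(theta, X, Y)
numeric = numerical_gradient(theta, X, Y)
print(all(abs(a - b) < 1e-6 for a, b in zip(analytic, numeric)))  # True
```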

BLR - Cost Function With Regularization

cost function after regularization of the 𝑘 regression coefficients 𝜃₁, …, 𝜃_𝑘 (the bias 𝜃₀ is not regularized):

𝐽(𝜽) = -(1/𝑛)·𝛴_{1≤𝑖≤𝑛}[𝑦^(𝑖)·𝑙𝑜𝑔(ℎ_𝜽(𝒙^(𝑖))) + (1-𝑦^(𝑖))·𝑙𝑜𝑔(1-ℎ_𝜽(𝒙^(𝑖)))] + (𝜆/(2𝑛))·𝛴_{1≤𝑗≤𝑘}(𝜃_𝑗)²

therefore, the original gradient descent update:

𝜃_𝑗 ← 𝜃_𝑗 - (𝛼/𝑛) * [ (𝛴_{1≤𝑖≤𝑛}[ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)]𝑥_𝑗^(𝑖)) ]

now becomes:

𝜃_𝑗 ← 𝜃_𝑗 - (𝛼/𝑛) * [ (𝛴_{1≤𝑖≤𝑛}[ℎ_𝜽(𝒙^(𝑖)) - 𝑦^(𝑖)]𝑥_𝑗^(𝑖)) + (𝜆𝜃_𝑗) ] # for 𝑗 ≥ 1; the update for 𝜃₀ keeps the original form
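
A minimal sketch of batch gradient descent with this update, assuming 𝜃₀ is left unpenalized (the penalty sum starts at 𝑗=1) and that all 𝜃_𝑗 are updated simultaneously. The function name `fit`, the default `alpha`, `lam`, and `steps`, and the toy data are hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, Y, alpha=0.1, lam=0.0, steps=2000):
    """Batch gradient descent; lam = 0 gives the unregularized update."""
    n, k = len(X), len(X[0])
    theta = [0.0] * k
    for _ in range(steps):
        # h_theta(x_i) - y_i for every sample, computed with the current theta
        errors = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
                  for x, y in zip(X, Y)]
        new_theta = []
        for j in range(k):
            grad_sum = sum(e * x[j] for e, x in zip(errors, X))
            penalty = lam * theta[j] if j > 0 else 0.0   # theta_0 is not penalized
            new_theta.append(theta[j] - (alpha / n) * (grad_sum + penalty))
        theta = new_theta   # simultaneous update of every theta_j
    return theta

# Hypothetical toy data: x_0 = 1, label is 1 when the attribute is positive.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 1]
plain = fit(X, Y)
shrunk = fit(X, Y, lam=1.0)
print(abs(shrunk[1]) < abs(plain[1]))  # True: the penalty shrinks the weight
```

On separable data like this the unregularized weight keeps growing with more steps, while the penalized one settles at a finite value.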

BLR - Hypothesis

given 𝒙 and the optimized values of 𝜽, the assigned output label ŷ (i.e. hypothesis) is obtained by thresholding ℎ_𝜽(𝒙) at 0.5, which is equivalent to checking the sign of 𝜽^𝑇𝒙:

ŷ = 1, if ℎ_𝜽(𝒙) ≥ 0.5 (i.e. 𝜽^𝑇𝒙 ≥ 0)
ŷ = 0, otherwise
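
The decision rule can be sketched without evaluating the sigmoid at all, since only the sign of 𝜽^𝑇𝒙 matters. The name `predict` and the weights below are hypothetical.

```python
def predict(theta, x):
    """Return label 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-1.0, 2.0]                 # hypothetical optimized weights
print(predict(theta, [1.0, 1.0]))   # theta^T x = 1, prints 1
print(predict(theta, [1.0, 0.0]))   # theta^T x = -1, prints 0
```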

Resources

Andrew Ng’s Coursera

／var／log marcus chiu
