Binomial/Binary Logistic Regression (BLR)

BLR - Model Representation (without Features)

Given input attribute values 𝒙 find probability of 𝑦=1

  • 𝑦 - binary output value
  • 𝒙 - input attribute values vector (i.e. 𝒙 = [𝑥0, …, 𝑥𝑘]) # 𝑥0=1 always, 𝑥0 is the bias
  • 𝜽 - weight/parameter vector (i.e. 𝜽 = [𝜃0, …, 𝜃𝑘])

ℎ𝜽(𝒙) = 1 / (1 + 𝑒^(-𝜽ᵀ𝒙))
ℎ𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)

𝑙𝑜𝑔[ 𝐏(𝑦=1|𝒙;𝜽) / 𝐏(𝑦=0|𝒙;𝜽) ] = 𝜽ᵀ𝒙
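
As a minimal sketch of the model above, the snippet below computes 𝐏(𝑦=1|𝒙;𝜽) with the sigmoid and checks numerically that the log-odds equal 𝜽ᵀ𝒙. The weights and inputs are made-up illustration values, and the names sigmoid and predict_proba are hypothetical helpers, not part of any library.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Return P(y=1 | x; theta). x[0] is expected to be the constant 1."""
    z = sum(t * xi for t, xi in zip(theta, x))  # theta^T x
    return sigmoid(z)


# Made-up values for illustration only.
theta = [-1.0, 2.0, 0.5]  # theta[0] is the weight of the bias input
x = [1.0, 0.8, -0.4]      # x[0] = 1 always

p1 = predict_proba(theta, x)
linear_score = sum(t * xi for t, xi in zip(theta, x))
log_odds = math.log(p1 / (1.0 - p1))  # matches linear_score up to rounding
```

Here log_odds and linear_score agree up to floating-point rounding, which is the log-odds identity stated above.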

BLR - Model Representation (with Features)

Given input attribute values 𝒙 find probability of 𝑦=1

  • 𝑦 - binary output value
  • 𝒙 - input attribute values vector (i.e. 𝒙 = [𝑥0, …, 𝑥𝑘]) # 𝑥0=1 always, 𝑥0 is the bias
  • 𝐹(𝑦=𝑖,𝒙) - set of 𝑙 features extracted from 𝒙 (i.e. [𝑓0(𝑦=𝑖,𝒙), …, 𝑓𝑙(𝑦=𝑖,𝒙)]) # this will act as 𝒙 in the case of model representation without features
  • 𝜽 - weight/parameter vector (i.e. 𝜽 = [𝜃0, …, 𝜃𝑙]) # where 𝑙 is the number of features extracted from 𝒙

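ℎ𝜽(𝒙) = 1 / (1 + 𝑒^(-𝜽ᵀ[𝐹(𝑦=1,𝒙) - 𝐹(𝑦=0,𝒙)]))
ℎ𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)
ℎ𝜽(𝒙) = 𝑒^(𝛴0≤𝑗≤𝑙[𝜃𝑗·𝑓𝑗(𝑦=1,𝒙)]) / 𝛴0≤𝑖≤1[𝑒^(𝛴0≤𝑗≤𝑙[𝜃𝑗·𝑓𝑗(𝑦=𝑖,𝒙)])] # dividing numerator and denominator by the numerator gives the sigmoid form; if 𝐹(𝑦=0,𝒙) = 0 the exponent reduces to -𝜽ᵀ𝐹(𝑦=1,𝒙)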
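
The following sketch checks numerically that the normalized-exponential form over the two labels equals a sigmoid applied to the difference of the two label scores. The feature values and weights are invented for illustration, and score, prob_y1_normalized and prob_y1_sigmoid are hypothetical helper names.

```python
import math


def score(theta, feats):
    """Weighted feature sum: sum_j theta_j * f_j(y=i, x)."""
    return sum(t * f for t, f in zip(theta, feats))


def prob_y1_normalized(theta, feats_y0, feats_y1):
    """P(y=1 | x) as exp(score for y=1) over the sum of exp(score) for y=0 and y=1."""
    e0 = math.exp(score(theta, feats_y0))
    e1 = math.exp(score(theta, feats_y1))
    return e1 / (e0 + e1)


def prob_y1_sigmoid(theta, feats_y0, feats_y1):
    """P(y=1 | x) as a sigmoid of the score difference between y=1 and y=0."""
    diff = score(theta, feats_y1) - score(theta, feats_y0)
    return 1.0 / (1.0 + math.exp(-diff))


# Invented feature vectors F(y=0, x) and F(y=1, x) for one input x.
theta = [0.3, -1.2, 0.7]
feats_y0 = [1.0, 0.0, 2.0]
feats_y1 = [1.0, 1.5, 0.5]

p_normalized = prob_y1_normalized(theta, feats_y0, feats_y1)
p_sigmoid = prob_y1_sigmoid(theta, feats_y0, feats_y1)
```

Both functions return the same probability up to rounding, so either form can be used for the binary case.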

Info

Instead of coming up with 𝑙 features manually, consider “automating” it with neural networks (see Feed-Forward Network)

BLR - Cost Function (Using Squared Error) - DO NOT USE

  • 𝑐𝑜𝑠𝑡(ℎ𝜃(𝑥), 𝑦) = (1/2) * [ℎ𝜃(𝑥) - 𝑦]²

This cost function is not convex with respect to 𝜃 because ℎ𝜃(𝑥) is a sigmoid, which is nonlinear in 𝜃 (unlike the linear hypothesis in Linear Regression), so gradient descent is not guaranteed to reach the global minimum
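
A small numeric check can make the non-convexity concrete. The sketch below assumes a single sample with 𝑥 = 1 and 𝑦 = 1 and a scalar 𝜃 (an illustrative simplification), and tests the midpoint condition f((a+b)/2) ≤ (f(a)+f(b))/2 that every convex function must satisfy.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_cost(theta, x=1.0, y=1.0):
    """(1/2) * (h - y)^2 for one sample and a scalar theta."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2


def log_cost(theta, x=1.0, y=1.0):
    """Log loss for one sample and a scalar theta."""
    h = sigmoid(theta * x)
    return -(y * math.log(h) + (1.0 - y) * math.log(1.0 - h))


def violates_midpoint_convexity(f, a, b):
    """True if f at the midpoint lies above the chord, which rules out convexity."""
    return f((a + b) / 2.0) > (f(a) + f(b)) / 2.0


squared_violates = violates_midpoint_convexity(squared_cost, -10.0, 0.0)
log_violates = violates_midpoint_convexity(log_cost, -10.0, 0.0)
```

For this pair of points the squared-error cost violates the condition while the log loss does not. One pair is enough to show non-convexity, but it does not by itself prove that the log loss is convex.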

BLR - Cost Function (Using Log Loss Function)

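Cost Function of Single Sample (𝒙,𝑦) - Pairwise

  • 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = -𝑙𝑜𝑔(ℎ𝜽(𝒙)), if 𝑦 = 1
  • 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = -𝑙𝑜𝑔(1 - ℎ𝜽(𝒙)), if 𝑦 = 0

Cost Function of Single Sample (𝒙,𝑦) - Combined

  • 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = -𝑦·𝑙𝑜𝑔(ℎ𝜽(𝒙)) - (1-𝑦)·𝑙𝑜𝑔(1-ℎ𝜽(𝒙))

Cost Function of Multiple Samples {(𝒙1,𝑦1), …, (𝒙𝑛,𝑦𝑛)}

  • 𝐽(𝜽) = -(1/𝑛)·𝛴1≤𝑖≤𝑛[𝑦(𝑖)·𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))]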
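
The log loss can be sketched in a few lines. The snippet below assumes the per-sample cost -[𝑦·𝑙𝑜𝑔(ℎ) + (1-𝑦)·𝑙𝑜𝑔(1-ℎ)] averaged over 𝑛 samples. The tiny dataset and the helper names hypothesis, sample_cost and cost_J are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x[0] is the constant 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def sample_cost(h, y):
    """Log loss of one sample: -log(h) if y = 1, -log(1 - h) if y = 0."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


def cost_J(theta, X, Y):
    """Average log loss over all samples."""
    n = len(X)
    return sum(sample_cost(hypothesis(theta, x), y) for x, y in zip(X, Y)) / n


# Invented toy data: each row is [x0 = 1, x1].
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
Y = [1, 0, 1]

# With all weights at zero every prediction is 0.5, so the cost is log(2).
J_zero = cost_J([0.0, 0.0], X, Y)
```

A confident correct prediction costs little (sample_cost(0.9, 1) is about 0.105), while a confident wrong one is penalized heavily (sample_cost(0.1, 1) is about 2.303).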

BLR - Learning 𝜃s With Gradient Descent

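to minimize 𝐽(𝜽) we take its derivative with respect to each 𝜃𝑗 and repeatedly update all 𝜃𝑗 simultaneously:

  • ∂𝐽(𝜽)/∂𝜃𝑗 = (1/𝑛)·𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖)
  • 𝜃𝑗 ← 𝜃𝑗 - (𝛼/𝑛) * [ (𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖)) ]

this has the same form as the gradient descent update for linear regression; only the definition of ℎ𝜽(𝒙) differs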
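
A minimal batch gradient descent loop for the log loss might look like the sketch below. It assumes the update 𝜃𝑗 ← 𝜃𝑗 - (𝛼/𝑛)·𝛴[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖) applied to all 𝜃𝑗 simultaneously. The toy data, the step size 0.1, the iteration count and the function names are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost_J(theta, X, Y):
    """Average log loss, used here only to monitor progress."""
    n = len(X)
    return -sum(
        y * math.log(hypothesis(theta, x)) + (1 - y) * math.log(1 - hypothesis(theta, x))
        for x, y in zip(X, Y)
    ) / n


def gradient_step(theta, X, Y, alpha):
    """One simultaneous update of every theta_j."""
    n = len(X)
    errors = [hypothesis(theta, x) - y for x, y in zip(X, Y)]
    return [
        t - (alpha / n) * sum(e * x[j] for e, x in zip(errors, X))
        for j, t in enumerate(theta)
    ]


# Invented, non-separable toy data: rows are [x0 = 1, x1].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
Y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost_J(theta, X, Y)
for _ in range(2000):
    theta = gradient_step(theta, X, Y, alpha=0.1)
final_cost = cost_J(theta, X, Y)
```

The errors are computed once per step from the old 𝜽, which is what makes the update simultaneous rather than coordinate by coordinate.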

BLR - Cost Function With Regularization

cost function after regularization of the 𝑘 regression coefficients 𝜃1, …, 𝜃𝑘 (the bias weight 𝜃0 is not regularized):

  • 𝐽(𝜽) = -(1/𝑛)·𝛴1≤𝑖≤𝑛[𝑦(𝑖)·𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))] + (𝜆/(2𝑛))·[𝛴1≤𝑗≤𝑘(𝜃𝑗)²]

therefore, the original gradient descent update:

  • 𝜃𝑗 ← 𝜃𝑗 - (𝛼/𝑛) * [ (𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖)) ]

now becomes:

  • 𝜃𝑗 ← 𝜃𝑗 - (𝛼/𝑛) * [ (𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖)) + (𝜆𝜃𝑗) ] # for 𝑗 ≥ 1; 𝜃0 keeps the original update
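
The regularized update can be sketched as follows. The snippet assumes that the penalty term 𝜆𝜃𝑗 is added only for 𝑗 ≥ 1, so the bias weight 𝜃0 is left unpenalized. The toy data, 𝛼 = 0.1, 𝜆 = 5 and the function names are illustrative choices, not values from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def regularized_step(theta, X, Y, alpha, lam):
    """One gradient descent step with an L2 penalty on theta_1..theta_k."""
    n = len(X)
    errors = [hypothesis(theta, x) - y for x, y in zip(X, Y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad_sum = sum(e * x[j] for e, x in zip(errors, X))
        penalty = lam * t if j >= 1 else 0.0  # theta_0 is not penalized
        new_theta.append(t - (alpha / n) * (grad_sum + penalty))
    return new_theta


def fit(X, Y, alpha, lam, steps):
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        theta = regularized_step(theta, X, Y, alpha, lam)
    return theta


# Invented toy data: rows are [x0 = 1, x1].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
Y = [0, 0, 1, 0, 1, 1]

theta_plain = fit(X, Y, alpha=0.1, lam=0.0, steps=2000)
theta_reg = fit(X, Y, alpha=0.1, lam=5.0, steps=2000)
```

Comparing theta_plain[1] with theta_reg[1] shows the expected shrinkage: the penalized slope weight ends up smaller in magnitude than the unpenalized one.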

BLR - Hypothesis

given 𝒙 and the optimized values of 𝜽, the assigned output label is defined as (i.e. hypothesis):

  • 𝑦 = 1, if 𝜽ᵀ𝒙 ≥ 0 (equivalently, ℎ𝜽(𝒙) ≥ 0.5)
  • 𝑦 = 0, otherwise
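
The decision rule can be written directly on the linear score, since the sigmoid is at least 0.5 exactly when its argument is non-negative. In the sketch below the weights are made up and predict_label is a hypothetical helper name.

```python
import math


def predict_label(theta, x):
    """Return 1 if theta^T x >= 0, otherwise 0. x[0] is the constant 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0


def predict_label_via_probability(theta, x):
    """Same rule expressed through the probability threshold 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0


theta = [-2.0, 1.0]  # made-up optimized weights
labels = [predict_label(theta, [1.0, x1]) for x1 in (1.0, 2.0, 3.0)]
labels_via_probability = [
    predict_label_via_probability(theta, [1.0, x1]) for x1 in (1.0, 2.0, 3.0)
]
```

With these weights the boundary sits at 𝑥1 = 2, where 𝜽ᵀ𝒙 = 0 and the label is 1 under the "≥" convention used above.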

Resources