Binomial/Binary Logistic Regression (BLR)
- is a type of logistic regression where the dependent variable is nominal binary
- similar to a perceptron whose activation function is the sigmoid function
- similar to Linear SVM (SVM Without Kernel)
BLR - Model Representation (without Features)
Given input attribute values 𝒙 find probability of 𝑦=1
- 𝑦 - binary output value
- 𝒙 - input attribute values vector (i.e. 𝒙 = [𝑥0, …, 𝑥𝑘]) # 𝑥0 = 1 always; 𝑥0 is the bias input, so 𝜃0 acts as the bias (intercept)
- 𝜽 - weight/parameter vector (i.e. 𝜽 = [𝜃0, …, 𝜃𝑘])
ℎ𝜽(𝒙) = 1 / (1 + 𝑒^(-𝜽ᵀ𝒙))
ℎ𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)
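- ℎ𝜽(𝒙) = 𝑔(𝜽ᵀ𝒙)
- 𝑔(𝑧) = 1 / (1 + 𝑒^(-𝑧))
- 𝑔(𝑧) = (𝑒^𝑧/𝑒^𝑧) [1 / (1 + 𝑒^(-𝑧))]
- 𝑔(𝑧) = (𝑒^𝑧 * 1) / [𝑒^𝑧 * (1 + 𝑒^(-𝑧))]
- 𝑔(𝑧) = 𝑒^𝑧 / (𝑒^𝑧 + 𝑒^𝑧·𝑒^(-𝑧))
- 𝑔(𝑧) = 𝑒^𝑧 / (𝑒^𝑧 + 𝑒^0)
- 𝑔(𝑧) = 𝑒^𝑧 / (𝑒^𝑧 + 1)
- 𝑔(𝑧) = 𝑒^𝑧 / 𝛴0≤𝑖≤1[𝑒^(𝑧𝑖)] # with class scores 𝑧1 = 𝑧 and 𝑧0 = 0
- 𝑔(𝑧) = 𝑒^(𝛴0≤𝑗≤𝑙[𝜃𝑗·𝑓𝑗(𝑦=1,𝒙)]) / 𝛴0≤𝑖≤1[𝑒^(𝛴0≤𝑗≤𝑙[𝜃𝑗·𝑓𝑗(𝑦=𝑖,𝒙)])] # softmax over the two classes, with 𝑓𝑗(𝑦=1,𝒙) = 𝑥𝑗 and 𝑓𝑗(𝑦=0,𝒙) = 0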
where:
- 𝑔(..) - is the Sigmoid Function that has a range of (0, 1)
therefore:
- ℎ𝜽(𝒙) = 1 / (1 + 𝑒^(-𝜽ᵀ𝒙))
ℎ𝜽(𝒙) is the estimated probability that 𝑦 = 1
- ℎ𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)
𝑙𝑜𝑔[ 𝐏(𝑦=1|𝒙;𝜽) / 𝐏(𝑦=0|𝒙;𝜽) ] = 𝜽ᵀ𝒙 # the log-odds (logit) of 𝑦=1 is linear in 𝒙
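The model representation can be sketched in a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function names `sigmoid` and `hypothesis` and the example values of `theta` and `x` are made up for illustration. It uses the two equivalent forms of 𝑔(𝑧) from the expansion (1 / (1 + 𝑒^(-𝑧)) and 𝑒^𝑧 / (𝑒^𝑧 + 1)) to avoid overflow, and checks that the log-odds of the output recover 𝜽ᵀ𝒙.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (e^z + 1).
    ez = math.exp(z)
    return ez / (ez + 1.0)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x[0] is expected to be 1 (the bias input)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Illustrative values only (assumptions, not from the notes).
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.3, 1.2]                    # x[0] = 1

p1 = hypothesis(theta, x)              # estimated P(y=1 | x; theta)
log_odds = math.log(p1 / (1.0 - p1))   # should equal theta^T x
print(p1, log_odds)
```

Splitting on the sign of `z` is a design choice: each branch only ever evaluates `math.exp` on a non-positive argument, so very large |𝑧| cannot raise `OverflowError`.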
BLR - Model Representation (with Features)
Given input attribute values 𝒙 find probability of 𝑦=1
- 𝑦 - binary output value
- 𝒙 - input attribute values vector (i.e. 𝒙 = [𝑥0, …, 𝑥𝑘]) # 𝑥0 = 1 always; 𝑥0 is the bias input, so 𝜃0 acts as the bias (intercept)
- 𝐹(𝑦=𝑖,𝒙) - set of 𝑙 features extracted from 𝒙 (i.e. [𝑓0(𝑦=𝑖,𝒙), …, 𝑓𝑙(𝑦=𝑖,𝒙)]) # this will act as 𝒙 in the case of model representation without features
- 𝜽 - weight/parameter vector (i.e. 𝜽 = [𝜃0, …, 𝜃𝑙]) # where 𝑙 is the number of features extracted from 𝒙
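ℎ𝜽(𝒙) = 1 / (1 + 𝑒^(-𝜽ᵀ𝐹(𝑦=1,𝒙))) # assuming all features of class 0 are zero, i.e. 𝐹(𝑦=0,𝒙) = 𝟎
ℎ𝜽(𝒙) = 𝐏(𝑦=1|𝒙;𝜽)
ℎ𝜽(𝒙) = 𝑒^(𝛴0≤𝑗≤𝑙[𝜃𝑗·𝑓𝑗(𝑦=1,𝒙)]) / 𝛴0≤𝑖≤1[𝑒^(𝛴0≤𝑗≤𝑙[𝜃𝑗·𝑓𝑗(𝑦=𝑖,𝒙)])] # general two-class (softmax) form; equals 1 / (1 + 𝑒^(-𝜽ᵀ[𝐹(𝑦=1,𝒙) - 𝐹(𝑦=0,𝒙)]))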
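The relation between the sigmoid form and the two-class softmax form can be checked numerically. The sketch below is an illustration under an assumed feature map: `features` is hypothetical and simply copies 𝒙 for 𝑦=1 and returns zeros for 𝑦=0, which is one possible choice of 𝐹(𝑦=𝑖,𝒙) and not necessarily the one intended in these notes.

```python
import math

def features(y, x):
    """Hypothetical feature map F(y, x): copy x for class 1, zeros for class 0."""
    return [xi if y == 1 else 0.0 for xi in x]

def score(theta, y, x):
    """Class score: sum_j theta_j * f_j(y, x)."""
    return sum(t * f for t, f in zip(theta, features(y, x)))

def p_y1_softmax(theta, x):
    """P(y=1 | x) as e^score(1) / (e^score(0) + e^score(1))."""
    scores = [score(theta, i, x) for i in (0, 1)]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[1] / sum(exps)

def p_y1_sigmoid(theta, x):
    """P(y=1 | x) as a sigmoid of the score difference between the classes."""
    z = score(theta, 1, x) - score(theta, 0, x)
    return 1.0 / (1.0 + math.exp(-z))

theta = [-1.0, 2.0, 0.5]   # illustrative weights
x = [1.0, 0.3, 1.2]        # x[0] = 1
print(p_y1_softmax(theta, x), p_y1_sigmoid(theta, x))
```

Both functions return the same probability for any feature map, because dividing numerator and denominator of the softmax by 𝑒^score(1) leaves a sigmoid of the score difference.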
Info
Instead of coming up with 𝑙 features manually, consider “automating” it with neural networks (see Feed-Forward Network)
BLR - Cost Function (Using Squared Error) - DO NOT USE
- 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = (1/2) * [ℎ𝜽(𝒙) - 𝑦]²
This cost function is not convex with respect to 𝜽, because ℎ𝜽(𝒙) is a sigmoid of 𝜽ᵀ𝒙 rather than a linear function of 𝜽 as in Linear Regression. Gradient descent on it is therefore not guaranteed to reach the global minimum.
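The non-convexity can be seen on a single sample. The sketch below assumes one sample with 𝑥 = 1 and 𝑦 = 1 and a scalar 𝜃 (all chosen only for illustration) and tests the midpoint condition that every convex function satisfies: cost((a+b)/2) ≤ (cost(a) + cost(b))/2. The helper name `violates_midpoint_convexity` is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error_cost(theta, x=1.0, y=1.0):
    """(1/2) * [h_theta(x) - y]^2 for one sample and a scalar theta."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2

def log_loss_cost(theta, x=1.0, y=1.0):
    """-[y*log(h) + (1-y)*log(1-h)] for one sample and a scalar theta."""
    h = sigmoid(theta * x)
    return -(y * math.log(h) + (1.0 - y) * math.log(1.0 - h))

def violates_midpoint_convexity(cost, a, b):
    """True if the cost at the midpoint lies above the chord between a and b."""
    return cost((a + b) / 2.0) > (cost(a) + cost(b)) / 2.0

a, b = -6.0, -2.0   # a region where the sigmoid is nearly flat
squared_violates = violates_midpoint_convexity(squared_error_cost, a, b)
log_loss_violates = violates_midpoint_convexity(log_loss_cost, a, b)
print(squared_violates, log_loss_violates)
```

On this interval the squared-error cost sits above its chord (so it cannot be convex), while the log loss does not.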
BLR - Cost Function (Using Log Loss Function)
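Cost Function of Single Sample (𝒙,𝑦) - Pairwise
- 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = -𝑙𝑜𝑔(ℎ𝜽(𝒙)), if 𝑦 = 1
- 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = -𝑙𝑜𝑔(1-ℎ𝜽(𝒙)), if 𝑦 = 0
Cost Function of Single Sample (𝒙,𝑦) - Combined
- 𝑐𝑜𝑠𝑡(ℎ𝜽(𝒙), 𝑦) = -[𝑦·𝑙𝑜𝑔(ℎ𝜽(𝒙)) + (1-𝑦)·𝑙𝑜𝑔(1-ℎ𝜽(𝒙))]
Cost Function of Multiple Samples {(𝒙1,𝑦1), …, (𝒙𝑛,𝑦𝑛)}
- 𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[𝑦(𝑖)·𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))]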
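The three log-loss forms named by the headings above can be sketched as follows. The formulas are taken from the 𝐽(𝜽) used in the gradient-descent derivation in the next section; the function names and the sample probabilities are assumptions made for illustration.

```python
import math

def cost_pairwise(h, y):
    """Single-sample log loss written as two cases."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_combined(h, y):
    """Single-sample log loss as one expression; y selects the active term."""
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))

def cost_mean(hs, ys):
    """J(theta): mean of the single-sample costs over n samples."""
    return sum(cost_combined(h, y) for h, y in zip(hs, ys)) / len(ys)

# hs holds hypothetical model outputs h_theta(x^(i)); ys holds the labels y^(i).
hs = [0.9, 0.2, 0.6]
ys = [1, 0, 1]
print([cost_pairwise(h, y) for h, y in zip(hs, ys)], cost_mean(hs, ys))
```

The pairwise and combined forms agree for 𝑦 ∈ {0, 1}. Note that `math.log` raises `ValueError` at ℎ = 0 or ℎ = 1, so a practical implementation would usually clip ℎ slightly away from those values.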
BLR - Learning 𝜃s With Gradient Descent
[Image: binomial-logitic-regression-cost-function-graph.png]
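to minimize 𝐽(𝜽) we take its derivative with respect to each 𝜃𝑗 and repeatedly step against it:
- 𝜃𝑗 ← 𝜃𝑗 - 𝛼·(∂/∂𝜃𝑗)𝐽(𝜽) # 𝛼 is the learning rate; all 𝜃𝑗 are updated simultaneously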
derivative of 𝐽(𝜽) with respect to 𝜃𝑗
- 𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[𝑦(𝑖)·𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))]
- (∂/∂𝜃𝑗)𝐽(𝜽) = (∂/∂𝜃𝑗) [ -(1/𝑛) 𝛴1≤𝑖≤𝑛[𝑦(𝑖)·𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))] ]
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[𝑦(𝑖)·(∂/∂𝜃𝑗)𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·(∂/∂𝜃𝑗)𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))]
- (∂/∂𝜃𝑗)𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) = (1/ℎ𝜽(𝒙(𝑖)))·(∂/∂𝜃𝑗)ℎ𝜽(𝒙(𝑖))
- (∂/∂𝜃𝑗)𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖))) = -(1/(1-ℎ𝜽(𝒙(𝑖))))·(∂/∂𝜃𝑗)ℎ𝜽(𝒙(𝑖))
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[(𝑦(𝑖)/ℎ𝜽(𝒙(𝑖)))·(∂/∂𝜃𝑗)ℎ𝜽(𝒙(𝑖)) - ((1-𝑦(𝑖))/(1-ℎ𝜽(𝒙(𝑖))))·(∂/∂𝜃𝑗)ℎ𝜽(𝒙(𝑖))]
- ℎ𝜽(𝒙(𝑖)) = 𝑔(𝑧), where 𝑧 = 𝜽·𝒙(𝑖)
- (∂/∂𝜃𝑗)ℎ𝜽(𝒙(𝑖)) = 𝑔’(𝑧) * (∂/∂𝜃𝑗)𝜽·𝒙(𝑖) # chain rule
- 𝑔(𝑧) = 1/[1 + 𝑒^(-𝑧)]
- 𝑔’(𝑧) = (𝑑/𝑑𝑧)[1 + 𝑒^(-𝑧)]⁻¹
- 𝑔’(𝑧) = -[1 + 𝑒^(-𝑧)]⁻² * (-𝑒^(-𝑧))
- 𝑔’(𝑧) = 𝑒^(-𝑧)/[1 + 𝑒^(-𝑧)]²
- 𝑔’(𝑧) = 1/[1 + 𝑒^(-𝑧)] * 𝑒^(-𝑧)/[1 + 𝑒^(-𝑧)]
- 𝑔’(𝑧) = 𝑔(𝑧) * [1 - 𝑔(𝑧)] # since 𝑒^(-𝑧)/[1 + 𝑒^(-𝑧)] = 1 - 1/[1 + 𝑒^(-𝑧)]
- (∂/∂𝜃𝑗)ℎ𝜽(𝒙(𝑖)) = ℎ𝜽(𝒙(𝑖)) * [1 - ℎ𝜽(𝒙(𝑖))] * (∂/∂𝜃𝑗)𝜽·𝒙(𝑖)
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[(𝑦(𝑖)/ℎ𝜽(𝒙(𝑖))) * ℎ𝜽(𝒙(𝑖)) * [1 - ℎ𝜽(𝒙(𝑖))] * (∂/∂𝜃𝑗)𝜽·𝒙(𝑖) - ((1-𝑦(𝑖))/(1-ℎ𝜽(𝒙(𝑖)))) * ℎ𝜽(𝒙(𝑖)) * [1 - ℎ𝜽(𝒙(𝑖))] * (∂/∂𝜃𝑗)𝜽·𝒙(𝑖)]
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[[𝑦(𝑖) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))] * (∂/∂𝜃𝑗)𝜽·𝒙(𝑖) - [ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))] * (∂/∂𝜃𝑗)𝜽·𝒙(𝑖)]
- (∂/∂𝜃𝑗)𝜽·𝒙(𝑖) = (∂/∂𝜃𝑗)𝛴0≤𝑙≤𝑘[𝜃𝑙·𝑥𝑙(𝑖)]
- (∂/∂𝜃𝑗)𝜽·𝒙(𝑖) = 𝑥𝑗(𝑖)
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[[𝑦(𝑖) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))] * 𝑥𝑗(𝑖) - [ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))] * 𝑥𝑗(𝑖)]
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[[𝑦(𝑖) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))] - [ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))]] * 𝑥𝑗(𝑖)
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[𝑦(𝑖) - 𝑦(𝑖)ℎ𝜽(𝒙(𝑖)) - ℎ𝜽(𝒙(𝑖)) + 𝑦(𝑖)ℎ𝜽(𝒙(𝑖))] * 𝑥𝑗(𝑖)
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[𝑦(𝑖) - ℎ𝜽(𝒙(𝑖))] * 𝑥𝑗(𝑖)
- (∂/∂𝜃𝑗)𝐽(𝜽) = -(1/𝑛) 𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)] * 𝑥𝑗(𝑖) * (-1)
- (∂/∂𝜃𝑗)𝐽(𝜽) = (1/𝑛) 𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)] * 𝑥𝑗(𝑖)
this has the same form as the gradient for linear regression; the only difference is that ℎ𝜽(𝒙) is the sigmoid of 𝜽ᵀ𝒙 instead of 𝜽ᵀ𝒙 itself
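The final result of the derivation, (1/𝑛) 𝛴[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)] * 𝑥𝑗(𝑖), can be sanity-checked against a finite-difference approximation of 𝐽(𝜽). The sketch below is only a check under assumed toy data; `X`, `Y`, `theta`, and the helper names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, Y):
    """J(theta): mean log loss over all samples."""
    total = 0.0
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(X)

def gradient(theta, X, Y):
    """Analytic gradient: (1/n) * sum_i (h(x_i) - y_i) * x_ij for each j."""
    grad = [0.0] * len(theta)
    for x, y in zip(X, Y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += (h - y) * xj
    return [g / len(X) for g in grad]

def numeric_gradient(theta, X, Y, eps=1e-6):
    """Central finite differences of J(theta), one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, Y) - cost(down, X, Y)) / (2 * eps))
    return grad

# Toy data (assumed); the first column is the bias input x0 = 1.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.1]]
Y = [1, 0, 1, 0]
theta = [0.1, -0.3]
print(gradient(theta, X, Y), numeric_gradient(theta, X, Y))
```

If the two gradients disagree beyond a small tolerance, the usual suspects are a sign error in the derivative of 𝑙𝑜𝑔(1-ℎ) or in the derivative of the sigmoid.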
BLR - Cost Function With Regularization
cost function after regularization of the 𝑘 regression coefficients 𝜃1, …, 𝜃𝑘 (the bias 𝜃0 is not regularized):
- 𝐽(𝜽) = (1/𝑛)·[ -𝛴1≤𝑖≤𝑛[𝑦(𝑖)·𝑙𝑜𝑔(ℎ𝜽(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜽(𝒙(𝑖)))] + (𝜆/2)·[𝛴1≤𝑗≤𝑘(𝜃𝑗)²] ]
therefore, the original gradient descent update:
- 𝜃𝑗 ← 𝜃𝑗 - (𝛼/𝑛) * [ (𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖)) ]
now becomes:
- 𝜃𝑗 ← 𝜃𝑗 - (𝛼/𝑛) * [ (𝛴1≤𝑖≤𝑛[ℎ𝜽(𝒙(𝑖)) - 𝑦(𝑖)]𝑥𝑗(𝑖)) + (𝜆𝜃𝑗) ] # for 𝑗 ≥ 1; 𝜃0 is not regularized and keeps the original update
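Both update rules can be sketched as one loop, where 𝜆 = 0 gives the original update. This is a minimal batch gradient-descent sketch under assumptions: the function name `gradient_descent`, the hyperparameter values, and the toy data are made up for illustration, and the bias 𝜃0 is left unregularized to match the penalty sum, which starts at 𝑗 = 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, Y, alpha=0.1, lam=0.0, steps=1000):
    """Batch gradient descent for BLR with an optional L2 penalty (lam)."""
    n = len(X)
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        # Errors h(x_i) - y_i are computed once per step from the current
        # theta, so all theta_j are updated simultaneously.
        errors = [
            sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
            for x, y in zip(X, Y)
        ]
        new_theta = []
        for j in range(len(theta)):
            data_term = sum(e * x[j] for e, x in zip(errors, X))
            penalty = lam * theta[j] if j >= 1 else 0.0  # theta_0 not regularized
            new_theta.append(theta[j] - (alpha / n) * (data_term + penalty))
        theta = new_theta
    return theta

# Linearly separable toy data (assumed); the first column is x0 = 1.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 1]

theta_plain = gradient_descent(X, Y, lam=0.0)
theta_reg = gradient_descent(X, Y, lam=1.0)
print(theta_plain, theta_reg)
```

On separable data like this, the unregularized weight keeps growing as long as training continues, while the penalty term pulls the regularized weight toward a finite value.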
BLR - Hypothesis
given 𝒙 and the optimized values of 𝜽, the assigned output label is obtained by thresholding the hypothesis at 0.5:
- 𝑦 = 1, if 𝜽ᵀ𝒙 ≥ 0 # equivalent to ℎ𝜽(𝒙) ≥ 0.5, since 𝑔(0) = 0.5
- 𝑦 = 0, otherwise
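The decision rule can be sketched in two equivalent ways: by the sign of 𝜽ᵀ𝒙, or by comparing the sigmoid output with 0.5. The function names, the `threshold` parameter, and the example values are assumptions made for illustration.

```python
import math

def predict(theta, x):
    """Label 1 if theta^T x >= 0, else 0 (no sigmoid needed)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

def predict_from_probability(theta, x, threshold=0.5):
    """Same rule expressed on the probability: g(theta^T x) >= threshold."""
    z = sum(t * xi for t, xi in zip(theta, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if p >= threshold else 0

theta = [-1.0, 2.0, 0.5]          # illustrative optimized weights
samples = [
    [1.0, 0.3, 1.2],              # theta^T x = 0.2  -> label 1
    [1.0, 0.0, 0.0],              # theta^T x = -1.0 -> label 0
    [1.0, 0.5, 0.0],              # theta^T x = 0.0  -> label 1 (boundary)
]
labels = [predict(theta, x) for x in samples]
print(labels)
```

Working on 𝜽ᵀ𝒙 directly avoids evaluating the exponential at prediction time; the probability form is useful when a threshold other than 0.5 is wanted.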
[Image: binomial-logistic-regression-cost-function.png]
[Image: binomial-logistic-regression-cost-function-combined.png]
[Image: binomial-logistic-regression-cost-function-of-multiple-samples.png]
[Image: binomial-logistic-regression-learning-thetas-gradient-descent.png]