Vanilla/Feed-Forward Neural Networks (FNN/FFNN/FFN) - Multi-Layer/Multilayer Perceptrons (MLP)
- the simplest type of artificial neural network architecture: connections between the perceptrons do not form a cycle
- networks whose connections do form cycles (feedback) are recurrent neural networks
FNN - Prerequisite
FNN - Model Representation
given 𝑛 training samples:
- (𝑦(1), 𝒙(1)) = (𝑦(1), 𝑥1(1), 𝑥2(1), …, 𝑥𝑘(1)) # sample 1
- (𝑦(2), 𝒙(2)) = (𝑦(2), 𝑥1(2), 𝑥2(2), …, 𝑥𝑘(2)) # sample 2
- …
- (𝑦(𝑛), 𝒙(𝑛)) = (𝑦(𝑛), 𝑥1(𝑛), 𝑥2(𝑛), …, 𝑥𝑘(𝑛)) # sample 𝑛
we define:
- 𝐿 - total number of layers in the network
- 𝑠𝑙 - number of perceptrons (not counting the bias unit) in layer 𝑙
- 𝑠𝐿- number of output units
Binomial Classification (2 classes)
- 𝑦 = 0 or 1
- 𝑠𝐿= 1 (i.e. 1 output unit)
- ℎ𝜃(𝒙) outputs a scalar value strictly between 0 and 1 when the output unit uses a sigmoid activation (e.g. 0.99 or 0.45 is a possible output)
Multinomial Classification (𝑐 classes)
- 𝑦 ∊ ℝ𝑐 (e.g. when 𝑐=3 then 𝑦 ∊ {[1,0,0]𝑇, [0,1,0]𝑇, [0,0,1]𝑇})
- 𝑠𝐿= 𝑐 (i.e. 𝑐 output units)
- ℎ𝜃(𝒙) outputs a 𝑐-dimensional vector where each entry is a scalar value strictly between 0 and 1 when the output units use a sigmoid activation (e.g. when 𝑐=3 then [0.99,0.02,0.45]𝑇 is a possible output)
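The label and output conventions above can be sketched in a few lines of Python. This is only an illustration: the helper names `one_hot` and `predict_class` are hypothetical, and numbering the classes from 0 is an assumption of the sketch, not something the notes specify.

```python
def one_hot(label, c):
    """Return the c-dimensional indicator vector for a class index in 0..c-1.

    Assumption: classes are numbered from 0; the notes do not fix a numbering.
    """
    if not 0 <= label < c:
        raise ValueError("label must be in the range 0..c-1")
    return [1 if j == label else 0 for j in range(c)]


def predict_class(h):
    """Pick the index of the output unit with the largest activation."""
    return max(range(len(h)), key=lambda j: h[j])
```

For example, `one_hot(1, 3)` gives `[0, 1, 0]`, and `predict_class([0.99, 0.02, 0.45])` picks class index 0.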
FNN - Cost Function
Neural Network’s Cost Function (binomial classification)
- 𝐽(𝜃) = -(1/𝑛)·[𝛴1≤𝑖≤𝑛[(𝑦(𝑖))·𝑙𝑜𝑔(ℎ𝜃(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜃(𝒙(𝑖)))]] # same as binomial logistic regression
Neural Network’s Cost Function (multinomial classification)
- 𝐽(𝜃) = -(1/𝑛)·[𝛴1≤𝑖≤𝑛[𝛴1≤𝑗≤𝑐[(𝑦(𝑖)[𝑗])·𝑙𝑜𝑔(ℎ𝜃(𝒙(𝑖))[𝑗]) + (1-𝑦(𝑖)[𝑗])·𝑙𝑜𝑔(1-ℎ𝜃(𝒙(𝑖))[𝑗])]]] # same as multinomial logistic regression
where:
- 𝑦[𝑗] - is the 𝑗𝑡ℎ entry of the label vector 𝑦
- ℎ𝜃(𝒙)[𝑗] - is the 𝑗𝑡ℎ entry of the output vector ℎ𝜃(𝒙)
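As a rough sketch of the two cost functions above, the following computes the average cross-entropy over 𝑛 samples, given labels and network outputs that have already been computed. The function names `binary_cost` and `multiclass_cost` are hypothetical, and every output is assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def binary_cost(ys, hs):
    """Average cross-entropy for scalar labels ys (0 or 1) and outputs hs."""
    n = len(ys)
    total = 0.0
    for y, h in zip(ys, hs):
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / n


def multiclass_cost(Ys, Hs):
    """Average cross-entropy summed over the c output units of each sample.

    Ys holds one indicator vector per sample, Hs the matching output vectors.
    """
    n = len(Ys)
    total = 0.0
    for y, h in zip(Ys, Hs):
        for yj, hj in zip(y, h):
            total += yj * math.log(hj) + (1 - yj) * math.log(1 - hj)
    return -total / n
```

With a single output unit, `multiclass_cost` reduces to `binary_cost`, which mirrors the relationship between the two formulas.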
FNN - Cost Function With Regularization of 𝜃s
Neural Network’s Cost Function with regularization of 𝜃s (binomial classification)
- 𝐽(𝜃) = -(1/𝑛)·[𝛴1≤𝑖≤𝑛[(𝑦(𝑖))·𝑙𝑜𝑔(ℎ𝜃(𝒙(𝑖))) + (1-𝑦(𝑖))·𝑙𝑜𝑔(1-ℎ𝜃(𝒙(𝑖)))]] + (𝜆/(2𝑛))·[𝛴1≤𝑙≤𝐿-1 𝛴1≤𝑖≤𝑠𝑙+1 𝛴1≤𝑗≤𝑠𝑙 (𝜃𝑙[𝑖,𝑗])²] # similar to binomial logistic regression
Neural Network’s Cost Function with regularization of 𝜃s (multinomial classification)
- 𝐽(𝜃) = -(1/𝑛)·[𝛴1≤𝑖≤𝑛[𝛴1≤𝑗≤𝑐[(𝑦(𝑖)[𝑗])·𝑙𝑜𝑔(ℎ𝜃(𝒙(𝑖))[𝑗]) + (1-𝑦(𝑖)[𝑗])·𝑙𝑜𝑔(1-ℎ𝜃(𝒙(𝑖))[𝑗])]]] + (𝜆/(2𝑛))·[𝛴1≤𝑙≤𝐿-1 𝛴1≤𝑖≤𝑠𝑙+1 𝛴1≤𝑗≤𝑠𝑙 (𝜃𝑙[𝑖,𝑗])²] # similar to multinomial logistic regression
where:
- 𝜃𝑙[𝑖,𝑗] - the coefficient 𝜃 connecting (perceptron 𝑗 at layer 𝑙) to (perceptron 𝑖 at layer 𝑙+1), so that 𝑧𝑙+1 = 𝜃𝑙𝑎𝑙
- there are 𝐿-1 coefficient matrices 𝜃1, …, 𝜃𝐿-1; the bias coefficients (𝑗 = 0) are left out of the penalty
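The regularization term can be sketched as a small helper. This is an illustration under stated assumptions: the name `l2_penalty` is hypothetical, each 𝜃 matrix is stored as a list of rows, and column 0 of every row is assumed to hold the bias coefficient, which is left out of the penalty.

```python
def l2_penalty(thetas, lam, n):
    """Return (lam / (2n)) times the sum of squared non-bias coefficients.

    Assumption: thetas is a list of matrices (lists of rows) and column 0 of
    each row is the bias coefficient, which is not penalized.
    """
    total = 0.0
    for theta in thetas:
        for row in theta:
            total += sum(w * w for w in row[1:])  # skip the bias column
    return lam / (2 * n) * total
```

The full regularized cost is then the unregularized cost plus `l2_penalty(thetas, lam, n)`. A larger `lam` pushes the non-bias coefficients toward zero.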
FNN - Learning 𝜃s With Gradient Descent & Backpropagation
need to compute the partial derivative (∂/∂𝜃𝑙[𝑖,𝑗]) 𝐽(𝜃) for every 𝜃𝑙[𝑖,𝑗]
Given 1 Training Sample (𝑦, 𝑥1, …, 𝑥𝑘)
forward propagation:
- 𝑎1= [𝑥1, …, 𝑥𝑘]𝑇
- 𝑧2 = 𝜃1𝑎1
- 𝑎2= 𝑔(𝑧2) # 𝑔 is the sigmoid activation 𝑔(𝑧) = 1/(1 + 𝑒^(-𝑧)), applied element-wise
- 𝑧3 = 𝜃2𝑎2
- 𝑎3= 𝑔(𝑧3)
- …
- 𝑧𝐿 = 𝜃𝐿-1𝑎𝐿-1
- 𝑎𝐿= 𝑔(𝑧𝐿)
- ℎ𝜃(𝑥1, …, 𝑥𝑘) = 𝑎𝐿
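The forward propagation steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `sigmoid`, `matvec` and `forward` are hypothetical, 𝑔 is assumed to be the sigmoid, and bias units are omitted because the listed steps do not include them.

```python
import math


def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(thetas, x):
    """Return the activations [a1, a2, ..., aL] for one input vector x.

    thetas[0] maps layer 1 to layer 2, thetas[1] maps layer 2 to layer 3, ...
    Bias units are left out to match the steps listed in the notes.
    """
    activations = [list(x)]                            # a1 = x
    for theta in thetas:
        z = matvec(theta, activations[-1])             # z(l+1) = theta(l) a(l)
        activations.append([sigmoid(zj) for zj in z])  # a(l+1) = g(z(l+1))
    return activations
```

The last entry, `forward(thetas, x)[-1]`, plays the role of ℎ𝜃(𝑥1, …, 𝑥𝑘). The intermediate activations are kept because backpropagation needs them.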
𝛿𝑙[𝑗]= error of node 𝑗 at layer 𝑙
for each output unit 𝑗 at the last layer 𝐿:
- 𝛿𝐿[𝑗] = ℎ𝜃(𝑥1, …, 𝑥𝑘)[𝑗] - 𝑦[𝑗]
- 𝛿𝐿[𝑗] = 𝑎𝐿[𝑗] - 𝑦[𝑗]
in vector format
- 𝛿𝐿= ℎ𝜃(𝑥1, …, 𝑥𝑘) - 𝑦
- 𝛿𝐿= 𝑎𝐿 - 𝑦
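The simple form of the output-layer error follows from combining the cross-entropy cost with a sigmoid output unit. A short derivation for one output unit, assuming 𝑔 is the sigmoid and writing ℓ for the cost contribution of a single sample (ℓ is a symbol introduced here for illustration only):

```latex
\begin{aligned}
a &= g(z) = \frac{1}{1 + e^{-z}}, \qquad
\ell = -\bigl[\, y \log a + (1 - y) \log (1 - a) \,\bigr] \\
\frac{\partial \ell}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a}
  = \frac{a - y}{a\,(1 - a)} \\
\frac{\partial a}{\partial z} &= g'(z) = a\,(1 - a) \\
\frac{\partial \ell}{\partial z} &= \frac{\partial \ell}{\partial a} \cdot \frac{\partial a}{\partial z}
  = a - y
\end{aligned}
```

The factor 𝑎·(1 - 𝑎) cancels, which is why no 𝑔’ term appears in 𝛿𝐿 while it does appear in the errors of the earlier layers.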
for previous layers (𝐿-1 down to 2), where · between two vectors is the element-wise product:
- 𝛿𝐿-1= (𝜃𝐿-1)𝑇𝛿𝐿 · 𝑔’(𝑧𝐿-1) = (𝜃𝐿-1)𝑇𝛿𝐿 · 𝑎𝐿-1 · (1 - 𝑎𝐿-1)
- 𝛿𝐿-2= (𝜃𝐿-2)𝑇𝛿𝐿-1 · 𝑔’(𝑧𝐿-2) = (𝜃𝐿-2)𝑇𝛿𝐿-1 · 𝑎𝐿-2 · (1 - 𝑎𝐿-2)
- …
- 𝛿2= (𝜃2)𝑇𝛿3 · 𝑔’(𝑧2) = (𝜃2)𝑇𝛿3 · 𝑎2 · (1 - 𝑎2)
- no need for 𝛿1
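A minimal sketch of the backward pass for a single sample, given activations that were already computed by forward propagation. The function name `backward_deltas` is hypothetical. The sketch assumes sigmoid activations, no bias units, and 0-based Python lists, so `activations[0]` is 𝑎1 and `thetas[0]` is 𝜃1.

```python
def backward_deltas(thetas, activations, y):
    """Return per-layer error vectors for one sample.

    The result has one slot per layer; slot 0 (the input layer) stays None
    because no error term is needed there.
    """
    num_layers = len(activations)
    deltas = [None] * num_layers
    # Output layer: delta_L = a_L - y
    deltas[-1] = [a - t for a, t in zip(activations[-1], y)]
    # Hidden layers, from the last hidden layer back to the first one
    for l in range(num_layers - 2, 0, -1):
        theta = thetas[l]            # maps layer l to layer l+1 (0-based)
        a = activations[l]
        nxt = deltas[l + 1]
        deltas[l] = [
            # (theta^T delta_next)[j] times g'(z)[j] = a[j] * (1 - a[j])
            sum(theta[i][j] * nxt[i] for i in range(len(nxt))) * a[j] * (1 - a[j])
            for j in range(len(a))
        ]
    return deltas
```

Multiplying by the transpose of 𝜃 sends each unit's error back along the same connections that carried its activation forward.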
Given Training Set {(𝑦(1), 𝑥1(1), 𝑥2(1), …, 𝑥𝑘(1)), …, (𝑦(𝑛), 𝑥1(𝑛), 𝑥2(𝑛), …, 𝑥𝑘(𝑛))}
- set 𝛥𝑙[𝑖,𝑗] = 0 for all 𝑙, 𝑖, 𝑗
- for 𝑡 = 1 to 𝑛
  - set 𝑎1= [𝑥1(𝑡), …, 𝑥𝑘(𝑡)]𝑇
  - perform forward propagation to compute 𝑎𝑙 for 𝑙 = 2 to 𝐿
  - using 𝑦(𝑡), compute 𝛿𝐿= 𝑎𝐿 - 𝑦(𝑡)
  - compute 𝛿𝐿-1, …, 𝛿2
  - 𝛥𝑙[𝑖,𝑗] ← 𝛥𝑙[𝑖,𝑗] + 𝑎𝑙[𝑗]·𝛿𝑙+1[𝑖] # vectorized form 𝛥𝑙 ← 𝛥𝑙 + (𝛿𝑙+1)·(𝑎𝑙)𝑇
- after the loop, for 𝑗 ≠ 0: (∂/∂𝜃𝑙[𝑖,𝑗])𝐽(𝜃) = (1/𝑛)·𝛥𝑙[𝑖,𝑗] + (𝜆/𝑛)·𝜃𝑙[𝑖,𝑗] # the (𝜆/𝑛) factor is the derivative of the (𝜆/2𝑛)·𝜃² penalty
- after the loop, for 𝑗 = 0: (∂/∂𝜃𝑙[𝑖,𝑗])𝐽(𝜃) = (1/𝑛)·𝛥𝑙[𝑖,𝑗] # bias coefficient, not regularized
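The whole procedure can be sketched end to end in plain Python. This is a hedged illustration rather than a definitive implementation: the names `forward` and `gradients` are hypothetical, 𝑔 is assumed to be the sigmoid, every non-output layer is assumed to carry a bias unit fixed at 1 in position 0, and column 0 of each 𝜃 matrix is assumed to hold the bias coefficient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(thetas, x):
    """Activations per layer; each non-output layer gets a leading bias unit 1.0."""
    acts = [[1.0] + list(x)]
    for idx, theta in enumerate(thetas):
        a = [sigmoid(sum(w * v for w, v in zip(row, acts[-1]))) for row in theta]
        if idx < len(thetas) - 1:
            a = [1.0] + a                      # prepend the bias unit
        acts.append(a)
    return acts


def gradients(thetas, X, Y, lam):
    """Partial derivatives of the regularized cost, one matrix per theta."""
    n = len(X)
    # Accumulators Delta, same shapes as thetas, initialised to zero
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        acts = forward(thetas, x)
        delta = [a - t for a, t in zip(acts[-1], y)]   # output-layer error
        for l in range(len(thetas) - 1, -1, -1):
            theta, a = thetas[l], acts[l]
            # Delta[l][i][j] += a_l[j] * delta_(l+1)[i]
            for i, d in enumerate(delta):
                for j, aj in enumerate(a):
                    Delta[l][i][j] += aj * d
            if l > 0:
                # Error of layer l; j starts at 1 because the bias unit has none
                delta = [
                    sum(theta[i][j] * delta[i] for i in range(len(delta)))
                    * a[j] * (1 - a[j])
                    for j in range(1, len(a))
                ]
    grads = []
    for theta, D in zip(thetas, Delta):
        grads.append([
            # Average the accumulator; add the penalty term except for j = 0
            [D[i][j] / n + (lam / n * theta[i][j] if j > 0 else 0.0)
             for j in range(len(theta[i]))]
            for i in range(len(theta))
        ])
    return grads
```

A gradient descent step would then subtract a learning rate times each entry of `grads` from the matching entry of `thetas`. Comparing `grads` against finite-difference estimates of the cost is a common way to check a backpropagation implementation.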