Introduction
Next
Let’s use gradient descent to solve Linear Regression by using Linear Least Squares (LLS) in its cost function 𝐽(𝜃)
given sample/training data:
- (𝑦(1), 𝑥1(1), …, 𝑥𝑘(1)) # sample 1
- (𝑦(2), 𝑥1(2), …, 𝑥𝑘(2)) # sample 2
- …
- (𝑦(𝑛), 𝑥1(𝑛), …, 𝑥𝑘(𝑛)) # sample 𝑛
given inputs {𝑥1, …, 𝑥𝑘} and trained model parameters {𝜃0, …, 𝜃𝑘}, the predicted value 𝑦̂ is defined as (i.e. hypothesis):
- 𝐄{𝑌|𝑋1=𝑥1, …, 𝑋𝑘=𝑥𝑘} = ℎ(𝑥1, …, 𝑥𝑘) = 𝑦̂ = 𝜃0+ 𝜃1𝑓1(𝑥1, …, 𝑥𝑘) + … + 𝜃𝑘𝑓𝑘(𝑥1, …, 𝑥𝑘)
Cost Function
the cost function of a single sample 𝑖:
- (1/2)[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)]2
the cost function of all samples:
- 𝐽(𝜃0, …, 𝜃𝑘) = (1/2𝑛)𝛴1≤𝑖≤𝑛[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)]2
Goal
minimize 𝐽(𝜃0, …, 𝜃𝑘) with respect to the regression coefficients {𝜃0, …, 𝜃𝑘} using gradient descent
Gradient Descent
repeat until convergence:
- 𝜃0← 𝜃0- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · 𝑓1(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
- …
- 𝜃𝑘← 𝜃𝑘- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · 𝑓𝑘(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
derivation:
Click here to expand...
generic gradient descent algorithm:
repeat until convergence:
- 𝜃0← 𝜃0- 𝛼·(𝛿/𝛿𝜃0)𝐽(𝜃0, …, 𝜃𝑘)
- …
- 𝜃𝑘← 𝜃𝑘- 𝛼·(𝛿/𝛿𝜃𝑘)𝐽(𝜃0, …, 𝜃𝑘)
taking the partial derivative of 𝐽(𝜃0, …, 𝜃𝑘) wrt to 𝜃𝑖:
- (𝛿/𝛿𝜃𝑖)𝐽(𝜃0, …, 𝜃𝑘) = (𝛿/𝛿𝜃𝑖)(1/2𝑚)𝛴1≤𝑖≤𝑛[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)]2
- (𝛿/𝛿𝜃𝑖)𝐽(𝜃0, …, 𝜃𝑘) = (1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · (𝛿/𝛿𝜃𝑖)ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
modified gradient descent algorithm:
repeat until convergence:
- 𝜃0← 𝜃0- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · (𝛿/𝛿𝜃0)ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
- …
- 𝜃𝑘← 𝜃𝑘- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · (𝛿/𝛿𝜃𝑘)ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
partial derivative of ℎ(𝑥1, …, 𝑥𝑘) wrt to 𝜃𝑗:
- (𝛿/𝛿𝜃𝑗)ℎ(𝑥1, …, 𝑥𝑘) = (𝛿/𝛿𝜃𝑗) [𝜃0+ 𝜃1𝑓1(𝑥1, …, 𝑥𝑘) + … + 𝜃𝑘𝑓𝑘(𝑥1, …, 𝑥𝑘)]
- (𝛿/𝛿𝜃𝑗)ℎ(𝑥1, …, 𝑥𝑘) = 𝑓𝑗(𝑥1, …, 𝑥𝑘)
FINAL modified gradient descent algorithm:
repeat until convergence:
- 𝜃0← 𝜃0- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · 𝑓1(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
- …
- 𝜃𝑘← 𝜃𝑘- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · 𝑓𝑘(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]