Introduction

Next

Let’s use gradient descent to solve Linear Regression by using Linear Least Squares (LLS) in its cost function 𝐽(𝜃)

given sample/training data:

  • (𝑦(1), 𝑥1(1), …, 𝑥𝑘(1)# sample 1
  • (𝑦(2), 𝑥1(2), …, 𝑥𝑘(2)) # sample 2
  • (𝑦(𝑛), 𝑥1(𝑛), …, 𝑥𝑘(𝑛)) # sample 𝑛

given inputs {𝑥1, …, 𝑥𝑘} and trained model parameters {𝜃0, …, 𝜃𝑘}, the predicted value 𝑦̂ is defined as (i.e. hypothesis):

  • 𝐄{𝑌|𝑋1=𝑥1, …, 𝑋𝑘=𝑥𝑘} = ℎ(𝑥1, …, 𝑥𝑘) = 𝑦̂ = 𝜃0+ 𝜃1𝑓1(𝑥1, …, 𝑥𝑘) + … + 𝜃𝑘𝑓𝑘(𝑥1, …, 𝑥𝑘)

Cost Function

the cost function of a single sample 𝑖:

  • (1/2)[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)]2

the cost function of all samples:

  • 𝐽(𝜃0, …, 𝜃𝑘) = (1/2𝑛)𝛴1≤𝑖≤𝑛[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)]2

Goal

minimize 𝐽(𝜃0, …, 𝜃𝑘) with respect to the regression coefficients {𝜃0, …, 𝜃𝑘} using gradient descent

Gradient Descent

repeat until convergence:

  • 𝜃0← 𝜃0- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · 𝑓1(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]
  • 𝜃𝑘← 𝜃𝑘- 𝛼·(1/𝑚)𝛴1≤𝑖≤𝑛[[ℎ(𝑥1(𝑖), …, 𝑥𝑘(𝑖)) - 𝑦(𝑖)] · 𝑓𝑘(𝑥1(𝑖), …, 𝑥𝑘(𝑖))]

derivation:

Gradient Descent Variants

see Ascent Algorithms