Comparing different Linear Regression Models

Linear Model Setup

𝐄[𝑌|𝑋₁=𝑥₁, …, 𝑋_𝑘=𝑥_𝑘] = ℎ(𝑥₁, …, 𝑥_𝑘) = 𝑦̂ = 𝜃₀+ 𝜃₁𝑓₁(𝑥₁, …, 𝑥_𝑘) + … + 𝜃_𝑘𝑓_𝑘(𝑥₁, …, 𝑥_𝑘)

Notation in Ordinary Least Squares (OLS) Regression

𝑌 = 𝑋𝜽 + 𝜖

𝑌 - target
𝑋 - feature matrix
𝜽 - vector of regression coefficients
𝜖 - error terms with expected value zero (i.e. 𝐄[𝜖] = 0)

Problem Statement - solve for 𝜽

Regression Solution Comparisons

Ordinary Least Squares (OLS) Regression

To derive a closed-form solution for the OLS estimator 𝜽ˆ_𝑂𝐿𝑆, we minimize the sum of squared residuals (𝑆𝑆𝑅), where 𝑒 = 𝑌 - 𝑋𝜽ˆ is the vector of residuals:

𝑆𝑆𝑅 = ||𝑒||₂²

𝑆𝑆𝑅 = ||𝑌 - 𝑋𝜽ˆ||₂²

𝑆𝑆𝑅 = (𝑌 - 𝑋𝜽ˆ)^𝑇(𝑌 - 𝑋𝜽ˆ)

𝑆𝑆𝑅 = 𝑌^𝑇𝑌 - 𝜽ˆ^𝑇𝑋^𝑇𝑌 - 𝑌^𝑇𝑋𝜽ˆ + 𝜽ˆ^𝑇𝑋^𝑇𝑋𝜽ˆ

𝑆𝑆𝑅 = 𝑌^𝑇𝑌 - 2𝜽ˆ^𝑇𝑋^𝑇𝑌 + 𝜽ˆ^𝑇𝑋^𝑇𝑋𝜽ˆ

Take the derivative of 𝑆𝑆𝑅 with respect to 𝜽

𝛿𝑆𝑆𝑅/𝛿𝜽 = - 2𝑋^𝑇𝑌 + 2𝑋^𝑇𝑋𝜽ˆ

Set derivative of 𝑆𝑆𝑅 to zero and solve for 𝜽

0 = - 2𝑋^𝑇𝑌 + 2𝑋^𝑇𝑋𝜽ˆ

2𝑋^𝑇𝑋𝜽ˆ = 2𝑋^𝑇𝑌

𝑋^𝑇𝑋𝜽ˆ = 𝑋^𝑇𝑌

(𝑋^𝑇𝑋)^-1𝑋^𝑇𝑋𝜽ˆ = (𝑋^𝑇𝑋)^-1𝑋^𝑇𝑌

𝜽ˆ = (𝑋^𝑇𝑋)^-1𝑋^𝑇𝑌

𝜽ˆ_𝑂𝐿𝑆 = (𝑋^𝑇𝑋)^-1𝑋^𝑇𝑌
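The closed-form OLS estimator above can be checked numerically. The following is a minimal sketch in NumPy, assuming synthetic data; the sample size, the true coefficients, and the noise level are made up for illustration and are not from the derivation. It solves the normal equations 𝑋^𝑇𝑋𝜽ˆ = 𝑋^𝑇𝑌 directly rather than forming (𝑋^𝑇𝑋)^-1, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic data: 50 samples, an intercept column plus 2 features.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([1.0, 2.0, -3.0])        # assumed "true" coefficients
Y = X @ theta_true + 0.1 * rng.normal(size=n)  # Y = X theta + noise

# Normal equations: (X^T X) theta_hat = X^T Y.
# np.linalg.solve avoids computing the explicit inverse of X^T X.
theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_ols)
print(theta_lstsq)
```

Both printed vectors should agree up to floating-point error, since both minimize the same 𝑆𝑆𝑅.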

Ridge Regression

To derive a closed-form solution for the Ridge Regression estimator 𝜽ˆ_{𝑅𝑖𝑑𝑔𝑒}, we minimize the sum of squared residuals (𝑆𝑆𝑅) plus an L2-norm penalty term with tuning parameter 𝜆:

𝑆𝑆𝑅 = ||𝑒||₂² + 𝜆||𝜽ˆ||₂²

𝑆𝑆𝑅 = ||𝑌 - 𝑋𝜽ˆ||₂² + 𝜆||𝜽ˆ||₂²

𝑆𝑆𝑅 = (𝑌 - 𝑋𝜽ˆ)^𝑇(𝑌 - 𝑋𝜽ˆ) + 𝜆𝜽ˆ^𝑇𝜽ˆ

𝑆𝑆𝑅 = 𝑌^𝑇𝑌 - 𝜽ˆ^𝑇𝑋^𝑇𝑌 - 𝑌^𝑇𝑋𝜽ˆ + 𝜽ˆ^𝑇𝑋^𝑇𝑋𝜽ˆ + 𝜆𝜽ˆ^𝑇𝜽ˆ

𝑆𝑆𝑅 = 𝑌^𝑇𝑌 - 2𝜽ˆ^𝑇𝑋^𝑇𝑌 + 𝜽ˆ^𝑇𝑋^𝑇𝑋𝜽ˆ + 𝜆𝜽ˆ^𝑇𝜽ˆ

Take the derivative of 𝑆𝑆𝑅 with respect to 𝜽

𝛿𝑆𝑆𝑅/𝛿𝜽 = - 2𝑋^𝑇𝑌 + 2𝑋^𝑇𝑋𝜽ˆ + 2𝜆𝜽ˆ

Set derivative of 𝑆𝑆𝑅 to zero and solve for 𝜽

0 = - 2𝑋^𝑇𝑌 + 2𝑋^𝑇𝑋𝜽ˆ + 2𝜆𝜽ˆ

2𝑋^𝑇𝑋𝜽ˆ + 2𝜆𝜽ˆ = 2𝑋^𝑇𝑌

𝑋^𝑇𝑋𝜽ˆ + 𝜆𝜽ˆ = 𝑋^𝑇𝑌

(𝑋^𝑇𝑋 + 𝜆𝐼)𝜽ˆ = 𝑋^𝑇𝑌

(𝑋^𝑇𝑋 + 𝜆𝐼)^-1(𝑋^𝑇𝑋 + 𝜆𝐼)𝜽ˆ = (𝑋^𝑇𝑋 + 𝜆𝐼)^-1𝑋^𝑇𝑌

𝜽ˆ = (𝑋^𝑇𝑋 + 𝜆𝐼)^-1𝑋^𝑇𝑌

𝜽ˆ_{𝑅𝑖𝑑𝑔𝑒} = (𝑋^𝑇𝑋 + 𝜆𝐼)^-1𝑋^𝑇𝑌
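As a rough illustration of the ridge estimator above, the sketch below implements it with NumPy. The helper name `ridge_closed_form`, the synthetic data, and the two values of 𝜆 are assumptions chosen for the example. Setting 𝜆 = 0 recovers the OLS solution, and a larger 𝜆 shrinks the coefficient vector toward zero.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve (X^T X + lam * I) theta = X^T Y for theta (hypothetical helper)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

rng = np.random.default_rng(1)

# Hypothetical synthetic data: 40 samples, 3 features, no intercept column.
X = rng.normal(size=(40, 3))
Y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=40)

theta_0 = ridge_closed_form(X, Y, lam=0.0)    # lam = 0 reduces to OLS
theta_10 = ridge_closed_form(X, Y, lam=10.0)  # larger lam shrinks the estimate

print(np.linalg.norm(theta_0), np.linalg.norm(theta_10))
```

The second printed norm should be smaller than the first, which is the shrinkage effect of the L2 penalty.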

Kernelized OLS (with Ridge Penalty)

Given the refresher on OLS and Ridge regression above, let’s derive the closed-form estimator for ridge-penalized OLS regression with a kernelized feature space.

Let’s express the regression coefficients 𝜽 as the matrix product of 𝑋^𝑇 and a new vector of regression coefficients 𝑟, with one entry per sample. We also define a kernel matrix 𝐾_𝑥:

𝜽 = 𝑋^𝑇𝑟

𝐾_𝑥 = 𝑋𝑋^𝑇

Now we parameterize our linear model as a function of 𝐾_𝑥 instead of 𝑋:

𝑌 = 𝑋𝜽 + 𝜖

𝑌 = 𝑋𝑋^𝑇𝑟 + 𝜖

𝑌 = 𝐾_𝑥𝑟 + 𝜖

Now we minimize the sum of squared residuals (𝑆𝑆𝑅) plus an L2-norm penalty term with tuning parameter 𝜆, with respect to the coefficients 𝑟ˆ:

𝑆𝑆𝑅 = ||𝑒||₂² + 𝜆||𝜽ˆ||₂²

𝑆𝑆𝑅 = ||𝑌 - 𝐾_𝑥𝑟ˆ||₂² + 𝜆||𝑋^𝑇𝑟ˆ||₂²

𝑆𝑆𝑅 = (𝑌 - 𝐾_𝑥𝑟ˆ)^𝑇(𝑌 - 𝐾_𝑥𝑟ˆ) + 𝜆𝑟ˆ^𝑇𝑋𝑋^𝑇𝑟ˆ

𝑆𝑆𝑅 = (𝑌 - 𝐾_𝑥𝑟ˆ)^𝑇(𝑌 - 𝐾_𝑥𝑟ˆ) + 𝜆𝑟ˆ^𝑇𝐾_𝑥𝑟ˆ

𝑆𝑆𝑅 = (𝑌^𝑇 - 𝑟ˆ^𝑇𝐾_𝑥^𝑇)(𝑌 - 𝐾_𝑥𝑟ˆ) + 𝜆𝑟ˆ^𝑇𝐾_𝑥𝑟ˆ

𝑆𝑆𝑅 = 𝑌^𝑇𝑌 - 2𝑟ˆ^𝑇𝐾_𝑥𝑌 + 𝑟ˆ^𝑇𝐾_𝑥𝐾_𝑥𝑟ˆ + 𝜆𝑟ˆ^𝑇𝐾_𝑥𝑟ˆ

Take the derivative of 𝑆𝑆𝑅 with respect to 𝑟ˆ

𝛿𝑆𝑆𝑅/𝛿𝑟ˆ = -2𝐾_𝑥𝑌 + 2𝐾_𝑥𝐾_𝑥𝑟ˆ + 2𝜆𝐾_𝑥𝑟ˆ

Set derivative of 𝑆𝑆𝑅 to zero and solve for 𝑟ˆ

0 = -2𝐾_𝑥𝑌 + 2𝐾_𝑥𝐾_𝑥𝑟ˆ + 2𝜆𝐾_𝑥𝑟ˆ

2𝐾_𝑥𝐾_𝑥𝑟ˆ + 2𝜆𝐾_𝑥𝑟ˆ = 2𝐾_𝑥𝑌

𝐾_𝑥𝐾_𝑥𝑟ˆ + 𝜆𝐾_𝑥𝑟ˆ = 𝐾_𝑥𝑌

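The equation above holds for any 𝑟ˆ that satisfies the simpler system below (left-multiplying it by 𝐾_𝑥 recovers the line above), so it suffices to solve:

𝐾_𝑥𝑟ˆ + 𝜆𝑟ˆ = 𝑌

(𝐾_𝑥 + 𝜆𝐼)𝑟ˆ = 𝑌

(𝐾_𝑥 + 𝜆𝐼)^-1(𝐾_𝑥 + 𝜆𝐼)𝑟ˆ = (𝐾_𝑥 + 𝜆𝐼)^-1𝑌

𝑟ˆ = (𝐾_𝑥 + 𝜆𝐼)^-1𝑌

𝑟ˆ_{𝑅𝑖𝑑𝑔𝑒} = (𝐾_𝑥 + 𝜆𝐼)^-1𝑌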
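The kernelized solution can be sketched in a few lines of NumPy. The data and the value of 𝜆 below are assumptions for illustration only. Note that the system being solved is 𝑛 × 𝑛 (one coefficient per sample), not 𝑘 × 𝑘 as in the ridge solution for 𝜽ˆ, and that the scalar 𝜆 is added to the diagonal of 𝐾_𝑥, i.e. as 𝜆 times the identity matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic data: 30 samples, 4 features.
n, k = 30, 4
X = rng.normal(size=(n, k))
Y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=n)
lam = 0.5  # assumed ridge tuning parameter

# Linear kernel (Gram) matrix K_x = X X^T, shape (n, n).
K_x = X @ X.T

# Solve (K_x + lam * I) r = Y for the per-sample coefficients r.
r_hat = np.linalg.solve(K_x + lam * np.eye(n), Y)

# Fitted values depend on X only through the kernel matrix.
Y_hat = K_x @ r_hat

# The gradient of the penalized objective with respect to r should vanish.
grad = -2 * K_x @ Y + 2 * K_x @ K_x @ r_hat + 2 * lam * K_x @ r_hat
print(np.abs(grad).max())
```

The printed value should be close to zero, confirming that this 𝑟ˆ is a stationary point of the penalized objective.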

Info

Why perform ridge regularization with our kernelization?

By construction, 𝐾_𝑥 = 𝑋𝑋^𝑇 is a positive semi-definite matrix, so some of its eigenvalues may be equal to zero (this always happens when there are more samples than features). If that is the case, then 𝐾_𝑥 is not invertible. The ridge penalty adds a positive constant 𝜆 to the diagonal of 𝐾_𝑥, which shifts every eigenvalue up by 𝜆. Hence, the square matrix (𝐾_𝑥 + 𝜆𝐼) is guaranteed to be positive definite and is therefore invertible.
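The invertibility argument can be seen directly from the eigenvalues. The sketch below uses a small made-up feature matrix with more samples than features, so the kernel matrix is rank-deficient; the sizes and the value of 𝜆 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sizes: 6 samples but only 2 features, so rank(K_x) <= 2.
n, k = 6, 2
X = rng.normal(size=(n, k))
lam = 0.1  # assumed ridge tuning parameter

K_x = X @ X.T

# eigvalsh is meant for symmetric matrices and returns eigenvalues in ascending order.
eig_K = np.linalg.eigvalsh(K_x)
eig_reg = np.linalg.eigvalsh(K_x + lam * np.eye(n))

rank_K = np.linalg.matrix_rank(K_x)

print(rank_K)   # rank of the unregularized kernel matrix
print(eig_K)    # the smallest n - k eigenvalues are numerically zero
print(eig_reg)  # every eigenvalue is shifted up by lam
```

Adding 𝜆 to the diagonal leaves the eigenvectors unchanged and moves each eigenvalue from 𝑠 to 𝑠 + 𝜆, so the smallest eigenvalue of the regularized matrix is 𝜆 rather than zero.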

If we have access to feature matrix 𝑋, we can recover 𝜽ˆ_{𝑅𝑖𝑑𝑔𝑒}:

𝜽ˆ_{𝑅𝑖𝑑𝑔𝑒} = 𝑋^𝑇𝑟ˆ_{𝑅𝑖𝑑𝑔𝑒}
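To check that the two routes agree, the sketch below computes the ridge coefficients once from the 𝑘 × 𝑘 system and once by mapping the kernel coefficients back through 𝑋^𝑇. The data, the new points `X_new`, and the value of 𝜆 are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic data: 25 samples, 5 features.
n, k = 25, 5
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)
lam = 1.0  # assumed ridge tuning parameter

# Primal route: theta = (X^T X + lam I)^-1 X^T Y, a k x k system.
theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

# Dual route: r = (K_x + lam I)^-1 Y, an n x n system, then theta = X^T r.
r_hat = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)
theta_dual = X.T @ r_hat

# Predictions for new points can use theta, or only inner products with X.
X_new = rng.normal(size=(3, k))
pred_primal = X_new @ theta_primal
pred_dual = (X_new @ X.T) @ r_hat

print(np.allclose(theta_primal, theta_dual))
print(np.allclose(pred_primal, pred_dual))
```

The dual predictions touch the features only through inner products between rows of `X_new` and rows of `X`, which is the property that lets a kernel function stand in for those inner products.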

／var／log marcus chiu
