Comparing different Linear Regression Models
Linear Model Setup
$E[Y \mid X_1 = x_1, \dots, X_p = x_p] = h(x_1, \dots, x_p) = \hat{y} = \theta_0 + \theta_1 f_1(x_1, \dots, x_p) + \dots + \theta_m f_m(x_1, \dots, x_p)$
- syntax-semantics in Ordinary Least Squares Regression (OLS)
$Y = X\beta + \epsilon$
- $Y$ - target
- $X$ - feature matrix
- $\beta$ - vector of regression coefficients
- $\epsilon$ - error terms with expected value zero (i.e. $E[\epsilon] = 0$)
Problem Statement - solve for $\beta$
Regression Solution Comparisons
Ordinary Least Squares (OLS) Regression
To obtain a closed-form solution for the OLS estimator $\hat{\beta}_{OLS}$, we minimize the sum of squared residuals ($RSS$):
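- $RSS = \|\epsilon\|_2^2$
- $RSS = \|Y - X\hat{\beta}\|_2^2$
- $RSS = (Y - X\hat{\beta})^T (Y - X\hat{\beta})$
- $RSS = Y^T Y - \hat{\beta}^T X^T Y - Y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta}$
- $RSS = Y^T Y - 2\hat{\beta}^T X^T Y + \hat{\beta}^T X^T X \hat{\beta}$
Take the derivative of $RSS$ with respect to $\beta$
- $\partial RSS / \partial \beta = -2 X^T Y + 2 X^T X \hat{\beta}$
Set the derivative of $RSS$ to zero and solve for $\beta$
- $0 = -2 X^T Y + 2 X^T X \hat{\beta}$
- $2 X^T X \hat{\beta} = 2 X^T Y$
- $X^T X \hat{\beta} = X^T Y$
- $(X^T X)^{-1} X^T X \hat{\beta} = (X^T X)^{-1} X^T Y$
- $\hat{\beta} = (X^T X)^{-1} X^T Y$
- $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y$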
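As a quick numerical check of the OLS derivation, here is a minimal NumPy sketch. The synthetic data, the seed, and the names `beta_true`, `beta_ols` and `beta_lstsq` are illustrative assumptions, not part of the original derivation. It solves the normal equations $X^T X \hat{\beta} = X^T Y$ directly rather than forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

# Synthetic data (assumed for illustration): n samples, p features
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) beta = X^T Y, solved without an explicit inverse
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("closed form:", beta_ols)
print("lstsq      :", beta_lstsq)
```

The two estimates should agree up to floating-point error, and the gradient $-2X^T Y + 2X^T X\hat{\beta}$ evaluated at `beta_ols` should be numerically zero.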
Ridge Regression
To obtain a closed-form solution for the ridge regression estimator $\hat{\beta}_{ridge}$, we minimize the sum of squared residuals ($RSS$) plus an L2-norm penalty term with tuning parameter $\lambda$:
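- $RSS = \|\epsilon\|_2^2 + \lambda \|\hat{\beta}\|_2^2$
- $RSS = \|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_2^2$
- $RSS = (Y - X\hat{\beta})^T (Y - X\hat{\beta}) + \lambda \hat{\beta}^T \hat{\beta}$
- $RSS = Y^T Y - \hat{\beta}^T X^T Y - Y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta} + \lambda \hat{\beta}^T \hat{\beta}$
- $RSS = Y^T Y - 2\hat{\beta}^T X^T Y + \hat{\beta}^T X^T X \hat{\beta} + \lambda \hat{\beta}^T \hat{\beta}$
Take the derivative of $RSS$ with respect to $\beta$
- $\partial RSS / \partial \beta = -2 X^T Y + 2 X^T X \hat{\beta} + 2\lambda \hat{\beta}$
Set the derivative of $RSS$ to zero and solve for $\beta$
- $0 = -2 X^T Y + 2 X^T X \hat{\beta} + 2\lambda \hat{\beta}$
- $2 X^T X \hat{\beta} + 2\lambda \hat{\beta} = 2 X^T Y$
- $X^T X \hat{\beta} + \lambda \hat{\beta} = X^T Y$
- $(X^T X + \lambda I)\hat{\beta} = X^T Y$
- $(X^T X + \lambda I)^{-1} (X^T X + \lambda I)\hat{\beta} = (X^T X + \lambda I)^{-1} X^T Y$
- $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T Y$
- $\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T Y$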
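The ridge closed form can be sketched the same way. The helper name `ridge_closed_form`, the synthetic data, and the choice $\lambda = 2$ are assumptions made purely for illustration. The sketch also compares coefficient norms, since the penalty is expected to shrink the ridge solution relative to OLS.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve (X^T X + lam * I) beta = X^T Y for beta (hypothetical helper)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 2.0
beta_ridge = ridge_closed_form(X, Y, lam)
beta_ols = ridge_closed_form(X, Y, 0.0)  # lam = 0 reduces to the OLS solution

print("ridge norm:", np.linalg.norm(beta_ridge))
print("OLS norm  :", np.linalg.norm(beta_ols))
```

Setting `lam = 0.0` recovers the OLS estimator, which only works here because $X^T X$ is invertible for this well-conditioned synthetic design.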
Kernelized OLS (with Ridge Penalty)
Given the refresher on OLS and ridge regression above, let's derive the closed-form sampling estimator for ridge-penalized OLS regression with a kernelized feature space.
Let's specify the regression coefficients $\beta$ as the matrix product of $X^T$ and a new set of regression coefficients $\alpha$ (one coefficient per observation). We also specify a kernel matrix $K_x$, which is square with one row and one column per observation:
- $\beta = X^T \alpha$
- $K_x = X X^T$
Now we parameterize our linear model as a function of $K_x$ instead of $X$:
- $Y = X\beta + \epsilon$
- $Y = X X^T \alpha + \epsilon$
- $Y = K_x \alpha + \epsilon$
Now we minimize the sum of squared residuals ($RSS$) plus an L2-norm penalty term with tuning parameter $\lambda$, with respect to the coefficients $\hat{\alpha}$:
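- $RSS = \|\epsilon\|_2^2 + \lambda \|\hat{\beta}\|_2^2$
- $RSS = \|Y - K_x \hat{\alpha}\|_2^2 + \lambda \|X^T \hat{\alpha}\|_2^2$
- $RSS = (Y - K_x \hat{\alpha})^T (Y - K_x \hat{\alpha}) + \lambda \hat{\alpha}^T X X^T \hat{\alpha}$
- $RSS = (Y - K_x \hat{\alpha})^T (Y - K_x \hat{\alpha}) + \lambda \hat{\alpha}^T K_x \hat{\alpha}$
- $RSS = (Y^T - \hat{\alpha}^T K_x^T)(Y - K_x \hat{\alpha}) + \lambda \hat{\alpha}^T K_x \hat{\alpha}$
- $RSS = Y^T Y - 2\hat{\alpha}^T K_x Y + \hat{\alpha}^T K_x K_x \hat{\alpha} + \lambda \hat{\alpha}^T K_x \hat{\alpha}$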
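Take the derivative of $RSS$ with respect to $\hat{\alpha}$
- $\partial RSS / \partial \hat{\alpha} = -2 K_x Y + 2 K_x K_x \hat{\alpha} + 2\lambda K_x \hat{\alpha}$
Set the derivative of $RSS$ to zero and solve for $\hat{\alpha}$
- $0 = -2 K_x Y + 2 K_x K_x \hat{\alpha} + 2\lambda K_x \hat{\alpha}$
- $2 K_x K_x \hat{\alpha} + 2\lambda K_x \hat{\alpha} = 2 K_x Y$
- $K_x K_x \hat{\alpha} + \lambda K_x \hat{\alpha} = K_x Y$
- $K_x \hat{\alpha} + \lambda \hat{\alpha} = Y$ (cancelling the common left factor $K_x$; this is exact when $K_x$ is invertible, and any $\hat{\alpha}$ that satisfies this equation also satisfies the previous one)
- $(K_x + \lambda I)\hat{\alpha} = Y$
- $(K_x + \lambda I)^{-1} (K_x + \lambda I)\hat{\alpha} = (K_x + \lambda I)^{-1} Y$
- $\hat{\alpha} = (K_x + \lambda I)^{-1} Y$
- $\hat{\alpha}_{ridge} = (K_x + \lambda I)^{-1} Y$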
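The kernelized estimator can be sketched in NumPy as follows. The linear kernel $K_x = XX^T$ comes from the text, while the helper name `kernel_ridge_alpha`, the synthetic data (with more features than samples), and the value of `lam` are assumptions for illustration only.

```python
import numpy as np

def kernel_ridge_alpha(K, Y, lam):
    """Solve (K + lam * I) alpha = Y for alpha (hypothetical helper)."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), Y)

# Synthetic data (assumed for illustration): fewer samples than features
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

K = X @ X.T          # linear kernel matrix, shape (n, n)
lam = 0.5
alpha_ridge = kernel_ridge_alpha(K, Y, lam)

# In-sample fitted values use only the kernel matrix, never X directly
Y_fit = K @ alpha_ridge
print("alpha shape:", alpha_ridge.shape)
print("training residual norm:", np.linalg.norm(Y - Y_fit))
```

Note that the linear system here is of size $n \times n$ (number of observations) rather than $p \times p$ (number of features), which is one practical reason to prefer this form when $p$ is much larger than $n$.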
Info
By construction, $K_x = X X^T$ is a positive semi-definite matrix. The issue is that $K_x$ may have some eigenvalues equal to zero, in which case $K_x$ is not invertible. The ridge penalty adds a small positive perturbation $\lambda$ to the diagonal of $K_x$, which shifts every eigenvalue up by $\lambda$. Hence, for $\lambda > 0$, the square matrix $(K_x + \lambda I)$ is guaranteed to be positive definite and is therefore invertible.
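The eigenvalue argument can be illustrated numerically. In this minimal sketch the data shape (6 observations, 2 features) and `lam = 0.1` are assumptions chosen so that the linear kernel matrix is rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2                      # more samples than features, so rank(X X^T) <= 2
X = rng.normal(size=(n, p))
K = X @ X.T                      # 6 x 6 positive semi-definite, singular
lam = 0.1

# eigvalsh is for symmetric matrices and returns eigenvalues in ascending order
eig_K = np.linalg.eigvalsh(K)
eig_shifted = np.linalg.eigvalsh(K + lam * np.eye(n))

print("rank of K_x:", np.linalg.matrix_rank(K))
print("smallest eigenvalue of K_x:", eig_K.min())
print("smallest eigenvalue of K_x + lam*I:", eig_shifted.min())
```

The smallest eigenvalue of the kernel matrix is zero up to rounding, while every eigenvalue of the shifted matrix is larger by `lam`, so the shifted matrix is safely invertible.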
WHY PERFORM RIDGE REGULARIZATION WITH OUR KERNELIZATION?
If we have access to the feature matrix $X$, we can recover $\hat{\beta}_{ridge}$:
- $\hat{\beta}_{ridge} = X^T \hat{\alpha}_{ridge}$
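To check that the two routes agree, the following sketch computes the ridge coefficients once in feature space and once via the kernel coefficients, then compares them. The synthetic data and the names `beta_primal` and `beta_from_alpha` are illustrative assumptions.

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 0.5

# Feature-space ridge: beta = (X^T X + lam*I)^{-1} X^T Y, a p x p system
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Kernel route: alpha = (K_x + lam*I)^{-1} Y, an n x n system, then beta = X^T alpha
K = X @ X.T
alpha_ridge = np.linalg.solve(K + lam * np.eye(n), Y)
beta_from_alpha = X.T @ alpha_ridge

print("max abs difference:", np.max(np.abs(beta_primal - beta_from_alpha)))
```

For $\lambda > 0$ the difference should be at the level of floating-point error, since $X^T (X X^T + \lambda I)^{-1} = (X^T X + \lambda I)^{-1} X^T$.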