Ridge Regression

Ridge regression is a type of linear regression model for estimating the coefficients of multiple-regression models when the independent variables are highly correlated (i.e. the multicollinearity problem).
utilizes Adjusted R Squared?

Ridge Regression

The ridge regression estimator of 𝛽 is defined as:

$\hat{𝛽}_{R} = (X^{T} X + 𝛬)^{-1} X^{T} y$

where:

$𝛬 = \mathrm{diag}(𝜆_{j})$ is a diagonal matrix of positive numbers, which are to be chosen.
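The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `ridge_estimator` and the toy data are assumptions made for the example, and the linear system is solved directly instead of forming the inverse.

```python
import numpy as np


def ridge_estimator(X, y, lambdas):
    """Return (X^T X + Lambda)^{-1} X^T y with Lambda = diag(lambdas)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Lambda = np.diag(np.asarray(lambdas, dtype=float))
    # Solve the linear system rather than computing the inverse explicitly.
    return np.linalg.solve(X.T @ X + Lambda, X.T @ y)


# Hypothetical toy data: the two columns are nearly identical.
X_demo = np.array([[1.0, 1.01], [2.0, 1.98], [3.0, 3.02], [4.0, 3.99]])
y_demo = np.array([2.0, 4.1, 5.9, 8.0])

beta_ols = ridge_estimator(X_demo, y_demo, [0.0, 0.0])    # Lambda = 0 is least squares
beta_ridge = ridge_estimator(X_demo, y_demo, [0.5, 0.5])  # equal penalties on both coefficients

print("least squares:", beta_ols)
print("ridge:        ", beta_ridge)
```

Setting every 𝜆_𝑗 to 0 recovers the least-squares estimator, which gives a convenient baseline for comparison.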

If 𝑋^T𝑋 = 𝐷 is diagonal with entries 𝑑_𝑗, then:

$\mathrm{Var}(\hat{𝛽}_{Rj}) = 𝜎^{2} \frac{d_{j}}{(d_{j} + 𝜆_{j})^{2}}$

The downside to reducing the variance is that the estimator is biased:

$E[\hat{𝛽}_{Rj}] = 𝛽_{j} \frac{d_{j}}{d_{j} + 𝜆_{j}}$

The mean square error (MSE), which is the squared bias plus the variance, is given by:

$\mathrm{MSE}(\hat{𝛽}_{Rj}) = 𝛽_{j}^{2} \frac{𝜆_{j}^{2}}{(d_{j} + 𝜆_{j})^{2}} + 𝜎^{2} \frac{d_{j}}{(d_{j} + 𝜆_{j})^{2}}$

The aim would be to set 𝜆_𝑗 so that this is minimized.

This cannot be done exactly, since 𝛽_𝑗 is unknown.
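A short worked step shows where the minimum would lie if 𝛽_𝑗 were known. Using the variance and expectation given above, write the MSE as squared bias plus variance and differentiate with respect to 𝜆_𝑗:

$\mathrm{MSE}(𝜆_{j}) = \frac{𝛽_{j}^{2} 𝜆_{j}^{2} + 𝜎^{2} d_{j}}{(d_{j} + 𝜆_{j})^{2}}$
$\frac{\partial\, \mathrm{MSE}}{\partial 𝜆_{j}} = \frac{2 d_{j} (𝛽_{j}^{2} 𝜆_{j} - 𝜎^{2})}{(d_{j} + 𝜆_{j})^{3}}$

The derivative is negative for small 𝜆_𝑗 and vanishes at 𝜆_𝑗 = 𝜎²/𝛽_𝑗². This value depends on the unknown 𝛽_𝑗, so it can only be approximated.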

Note that:

$\hat{𝛽}_{Rj} = \hat{𝛽}_{j} \frac{d_{j}}{d_{j} + 𝜆_{j}}$

where $\hat{𝛽}_{j}$ is the least-squares estimator, and so $\hat{𝛽}_{Rj}$ is shrunk towards the origin. For this reason it is known as a shrinkage estimator.
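The trade-off between bias and variance for a single coefficient can be explored numerically. The sketch below assumes the diagonal case 𝑋^T𝑋 = 𝐷 and uses made-up values for 𝛽_𝑗, 𝜎 and 𝑑_𝑗; the helper names are hypothetical. It evaluates the MSE (squared bias plus the variance stated above) on a grid of 𝜆 values.

```python
def shrinkage_factor(d, lam):
    """Factor d / (d + lam) that multiplies the least-squares estimate."""
    return d / (d + lam)


def ridge_mse(beta, sigma, d, lam):
    """Squared bias plus variance of the ridge estimator of one coefficient."""
    bias = beta * shrinkage_factor(d, lam) - beta
    variance = sigma**2 * d / (d + lam) ** 2
    return bias**2 + variance


# Hypothetical values, chosen only for illustration.
beta_j, sigma, d_j = 2.0, 1.0, 0.5

mse_ols = ridge_mse(beta_j, sigma, d_j, 0.0)   # lam = 0 is least squares
grid = [k * 0.01 for k in range(0, 501)]       # lam from 0 to 5 in steps of 0.01
best_lam = min(grid, key=lambda lam: ridge_mse(beta_j, sigma, d_j, lam))
best_mse = ridge_mse(beta_j, sigma, d_j, best_lam)

print(f"MSE at lam = 0: {mse_ols:.3f}")
print(f"best lam on grid: {best_lam:.2f}, MSE there: {best_mse:.3f}")
print(f"shrinkage factor at best lam: {shrinkage_factor(d_j, best_lam):.3f}")
```

In a real problem 𝛽_𝑗 is unknown, so a grid search like this is not available directly; it only illustrates that some positive 𝜆_𝑗 gives a lower MSE than least squares.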

Another Way to Derive

The ridge estimator can be derived using “regularization”, or the inclusion of a penalty term in the objective function.

Suppose instead of minimizing:

𝐼(𝛽) = (𝑦 - 𝑋𝛽)^T(𝑦 - 𝑋𝛽)

We minimize:

𝐼_𝑅(𝛽) = (𝑦 - 𝑋𝛽)^T(𝑦 - 𝑋𝛽) + 𝜆||𝛽||²

Setting the derivative with respect to 𝛽 to zero and solving gives:

$\frac{\partial I_{R}}{\partial 𝛽} = 2 X^{T} X 𝛽 - 2 X^{T} y + 2𝜆𝛽$
$0 = 2 X^{T} X 𝛽 - 2 X^{T} y + 2𝜆𝛽$
$2 X^{T} y = 2 X^{T} X 𝛽 + 2𝜆𝛽$
$X^{T} y = X^{T} X 𝛽 + 𝜆𝛽$
$X^{T} y = (X^{T} X + 𝜆I) 𝛽$
$(X^{T} X + 𝜆I)^{-1} X^{T} y = 𝛽$
$\hat{𝛽}_{R} = (X^{T} X + 𝜆I)^{-1} X^{T} y$

This generalizes directly to a different 𝜆_𝑗 for each coefficient: replacing the penalty 𝜆||𝛽||² with 𝛽^T𝛬𝛽 gives the same estimator with 𝛬 in place of 𝜆𝐼.
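The derivation can be checked numerically. The sketch below uses synthetic data with an arbitrary seed and an arbitrary 𝜆 = 2 (all assumptions for the example). It confirms that the gradient of the penalized objective is essentially zero at the closed-form solution and that nearby points have a larger objective value.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
X = rng.normal(size=(50, 3))                   # synthetic design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
lam = 2.0                                      # arbitrary penalty weight


def ridge_objective(beta):
    """(y - X beta)^T (y - X beta) + lam * ||beta||^2."""
    resid = y - X @ beta
    return float(resid @ resid + lam * beta @ beta)


def ridge_gradient(beta):
    """2 X^T X beta - 2 X^T y + 2 lam beta."""
    return 2 * X.T @ X @ beta - 2 * X.T @ y + 2 * lam * beta


# Closed-form solution from the derivation.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

grad_norm = np.linalg.norm(ridge_gradient(beta_ridge))
perturbed = beta_ridge + 0.1 * rng.normal(size=3)

print("gradient norm at solution:", grad_norm)
print("objective at solution: ", ridge_objective(beta_ridge))
print("objective when perturbed:", ridge_objective(perturbed))
```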

Regularization methods for estimating 𝛽 are now standard:

𝐼(𝛽) = (𝑦 - 𝑋𝛽)^T(𝑦 - 𝑋𝛽) + 𝑃(𝛽)

for some penalty term 𝑃.

The penalty term prevents the estimator of 𝛽 from becoming large, and some penalties (for example the L1 penalty used by the LASSO) can set some components of the estimator to exactly 0.

Ridge Regression - Example


Take:

𝑛 = 100

𝑝 = 5

𝑥_𝑖𝑗 are independent standard uniform for 𝑗 = 1:4, and for 𝑗 = 5 we take 𝑥_𝑖5 = 𝑥_𝑖1 + 0.01𝑧_𝑖, where the 𝑧_𝑖 are independent standard normal.

The true parameter values are:

𝜎 = 1, which we assume to be known

𝛽^T = [2, -1, 3, -2, 0]

The first and last columns of 𝑋 (and hence of 𝑋^T𝑋) are highly collinear.

The smallest eigenvalue of 𝑋^T𝑋 is 0.005. This will cause a high variance for some of the estimators $\hat{𝛽}_{j}$.

The diagonal elements of (𝑋^T𝑋)^-1 are (95.92, 0.10, 0.10, 0.11, 96.33).

The least-squares estimate of 𝛽 is:

𝛽ˆ^T = (9.78, -1.19, 3.00, -2.09, -7.77)

The first and fifth estimators are unreliable, as anticipated.

We can compute $\hat{𝛽}_{R}$ for a range of 𝜆 values.

In practice 𝜆 can be chosen close to 0; a large value is not needed.

A plot of $\hat{𝛽}_{R1}$ and $\hat{𝛽}_{R5}$ as 𝜆 ranges between 0 and 5 is shown below.
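The setup of this example can be reproduced approximately with the sketch below. The random seed is an arbitrary assumption, so the numbers will not match those quoted above; only the qualitative behaviour (a very small eigenvalue, unstable first and fifth coefficients, and shrinkage as 𝜆 grows) should carry over. A single 𝜆 is used for all coefficients.

```python
import numpy as np

rng = np.random.default_rng(42)                # arbitrary seed, not from the original example
n, p = 100, 5

X = rng.uniform(size=(n, p))                   # columns 1-4: independent standard uniform
X[:, 4] = X[:, 0] + 0.01 * rng.standard_normal(n)   # column 5 nearly equals column 1

beta_true = np.array([2.0, -1.0, 3.0, -2.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)     # noise with sigma = 1

XtX = X.T @ X
smallest_eig = np.linalg.eigvalsh(XtX)[0]      # eigenvalues are returned in ascending order
var_factors = np.diag(np.linalg.inv(XtX))      # diagonal of (X^T X)^{-1}

# Ridge estimates for lambda between 0 and 5; lambda = 0 is least squares.
lambdas = np.linspace(0.0, 5.0, 51)
path = np.array([np.linalg.solve(XtX + lam * np.eye(p), X.T @ y) for lam in lambdas])
beta_ols = path[0]

print("smallest eigenvalue:", smallest_eig)
print("diag of (X^T X)^-1: ", np.round(var_factors, 2))
print("least squares:      ", np.round(beta_ols, 2))
print("ridge at lambda=5:  ", np.round(path[-1], 2))
```

Plotting `path[:, 0]` and `path[:, 4]` against `lambdas` gives curves analogous to the ones described for the first and fifth coefficients.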

Subpages

Ridge Regression vs LASSO Regression

Resources

http://r-statistics.co/Ridge-Regression-With-R.html

／var／log marcus chiu
