nested model - model 𝐴 is nested in model 𝐵 if the predictors of 𝐴 are a subset of the predictors of 𝐵

Comparing 2 Models

the 2 models to be compared:

𝑀_𝐿 - larger model - predictors {𝑥₁, …, 𝑥_𝑘}
𝑀_𝑆 - smaller model (nested in 𝑀_𝐿) - predictors {𝑥₁, …, 𝑥_𝑗} with 𝑗 < 𝑘, i.e. it does not have {𝑥_{𝑗+1}, …, 𝑥_𝑘}

Extra Sum of Squares

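the extra sum of squares 𝑆𝑆_𝐸𝑋 is the increase in the regression sum of squares 𝑆𝑆_𝑅𝐸𝐺, or equivalently the decrease in the error sum of squares 𝑆𝑆_𝐸𝑅𝑅, when moving from 𝑀_𝑆 to 𝑀_𝐿 (the two agree because both models share the same total sum of squares):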

𝑆𝑆_𝐸𝑋 = 𝑆𝑆_𝑅𝐸𝐺(𝑀_𝐿) - 𝑆𝑆_𝑅𝐸𝐺(𝑀_𝑆)
𝑆𝑆_𝐸𝑋 = 𝑆𝑆_𝐸𝑅𝑅(𝑀_𝑆) - 𝑆𝑆_𝐸𝑅𝑅(𝑀_𝐿)

degrees of freedom of 𝑆𝑆_𝐸𝑋:

𝑑𝑓_𝐸𝑋 = 𝑑𝑓_𝑅𝐸𝐺(𝑀_𝐿) - 𝑑𝑓_𝑅𝐸𝐺(𝑀_𝑆)
𝑑𝑓_𝐸𝑋 = 𝑛𝑢𝑚-𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑜𝑟-𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠(𝑀_𝐿) - 𝑛𝑢𝑚-𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑜𝑟-𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠(𝑀_𝑆)
𝑑𝑓_𝐸𝑋 = 𝑘 - 𝑗
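
The two expressions for 𝑆𝑆_𝐸𝑋 and the degrees of freedom can be checked numerically. The following is a minimal sketch, assuming R's built-in mtcars data as a stand-in; the variables mpg, wt, hp and qsec and the helper names ss_err and ss_reg are illustrative and not part of this page's example.

```r
# Stand-in data: built-in mtcars (an assumption, not the home price data).
fit_small <- lm(mpg ~ wt, data = mtcars)              # M_S: j = 1 predictor
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # M_L: k = 3 predictors

# SS_ERR is the residual sum of squares; SS_REG is the explained sum of squares.
ss_err <- function(fit) sum(resid(fit)^2)
ss_reg <- function(fit) sum((fitted(fit) - mean(mtcars$mpg))^2)

# SS_EX computed both ways; the two values should agree.
ss_ex_from_reg <- ss_reg(fit_large) - ss_reg(fit_small)
ss_ex_from_err <- ss_err(fit_small) - ss_err(fit_large)

# df_EX = k - j, which equals the drop in residual degrees of freedom.
df_ex <- df.residual(fit_small) - df.residual(fit_large)

print(c(ss_ex_from_reg = ss_ex_from_reg,
        ss_ex_from_err = ss_ex_from_err,
        df_ex = df_ex))
```

Both routes give the same 𝑆𝑆_𝐸𝑋 because each model includes an intercept and is fitted to the same response, so 𝑆𝑆_𝑅𝐸𝐺 + 𝑆𝑆_𝐸𝑅𝑅 is the same total for both.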

Partial F-Test Statistic

the significance of the additional explained variation (measured by 𝑆𝑆_𝐸𝑋) is tested with the partial F-test statistic:

𝐹 = 𝑀𝑆_𝐸𝑋 / 𝑀𝑆_𝐸𝑅𝑅(𝑀_𝐿)
𝐹 = [𝑆𝑆_𝐸𝑋/(𝑘-𝑗)] / [𝑆𝑆_𝐸𝑅𝑅(𝑀_𝐿)/(𝑛-𝑘-1)]

where 𝑛 is the number of observations

the additional predictors that are in 𝑀_𝐿 but not in 𝑀_𝑆, {𝑥_{𝑗+1}, …, 𝑥_𝑘}, affect the response 𝑌 if at least one of their slopes {𝜃_{𝑗+1}, …, 𝜃_𝑘} is not zero in 𝑀_𝐿. The partial F-test is a test of:

null hypothesis 𝐻₀:
- 𝜃_{𝑗+1} = … = 𝜃_𝑘 = 0
- the full model 𝑀_𝐿 does not capture more variation than the reduced model 𝑀_𝑆
alternative hypothesis 𝐻_𝐴:
- at least one of {𝜃_{𝑗+1}, …, 𝜃_𝑘} is ≠ 0
- the full model 𝑀_𝐿 captures more variation than the reduced model 𝑀_𝑆

distribution of 𝐹 under the null hypothesis 𝐻₀:

𝐹_{(𝑘-𝑗),(𝑛-𝑘-1)}

The partial F-test is used for sequential selection of predictors in multiple regression

rejection region (at the 5% significance level):
- reject 𝐻₀ if 𝐹_𝑜𝑏𝑠 > qf(0.95, df1=k-j, df2=n-k-1)
p-value (upper-tail probability):
- pf(𝐹_𝑜𝑏𝑠, df1=k-j, df2=n-k-1, lower.tail=FALSE)
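
The statistic, critical value and p-value can be computed by hand and compared with what anova() reports for two nested fits. This is a minimal sketch, assuming R's built-in mtcars data as a stand-in (the variables mpg, wt, hp and qsec are not part of this page's example). The p-value is taken as the upper-tail probability, hence lower.tail = FALSE.

```r
# Stand-in data: built-in mtcars (an assumption, not the home price data).
fit_small <- lm(mpg ~ wt, data = mtcars)              # M_S, j = 1
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # M_L, k = 3

n <- nrow(mtcars)
j <- 1
k <- 3

# deviance() of an lm fit is its residual sum of squares (SS_ERR).
ss_ex      <- deviance(fit_small) - deviance(fit_large)
ms_err_big <- deviance(fit_large) / (n - k - 1)        # MS_ERR(M_L)
f_obs      <- (ss_ex / (k - j)) / ms_err_big

# Reject H0 at the 5% level if f_obs exceeds the critical value.
f_crit  <- qf(0.95, df1 = k - j, df2 = n - k - 1)
p_value <- pf(f_obs, df1 = k - j, df2 = n - k - 1, lower.tail = FALSE)

# anova() on the two nested fits reports the same F and p-value in row 2.
tab <- anova(fit_small, fit_large)
print(c(f_obs = f_obs, f_crit = f_crit, p_value = p_value))
print(tab)
```

Comparing f_obs with f_crit and comparing p_value with 0.05 always lead to the same decision.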

Example R Code

read the home price data

home <- read.table("homeprice_multiple_predictors.txt", sep=",", header=TRUE)

display contents

str(home)
'data.frame': 29 obs. of 7 variables:
$ list         : num 80 151 310 295 339  ...
$ sale         : num 118 151 300 275 340 ...
$ full         : int 1 1 2 2 2 1 3 1 1 1 ...
$ half         : int 0 0 1 1 0 1 0 1 2 0 ...
$ bedrooms     : int 3 4 4 4 3 4 3 3 3 1 ...
$ rooms        : int 6 7 9 8 7 8 7 7 7 3 ...
$ neighborhood : int 1 1 3 3 4 3 2 2 3 2 ...

attach the data frame to R’s search path so that its variables can be referenced directly by name

attach(home)

look at distributions of some predictors

table(bedrooms)
bedrooms
1  2  3  4  5
1  3 16  8  1

table(full)
full
1  2  3
13 11 5

table(half)
half
0  1  2
13 13 3

table(neighborhood)
neighborhood
1  2  3  4  5
2  8 12  5  2

Create reduced model: regress sale price on # bedrooms and neighborhood

fit1 <- lm(sale ~ bedrooms + neighborhood)
summary(fit1)
Call:
lm(formula = sale ~ bedrooms + neighborhood)

Residuals:
    Min      1Q  Median      3Q     Max
-90.871 -39.861   0.636  28.815 107.660

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -132.057      40.341   -3.273  0.003001 **
bedrooms       42.483      11.446    3.712  0.000987 ***
neighborhood   93.493       9.101   10.273  1.21e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.3 on 26 degrees of freedom
Multiple R-squared: 0.8491, Adjusted R-squared: 0.8375
F-statistic: 73.16 on 2 and 26 DF, p-value: 2.1e-11

create another model: add # full and half baths to previous model

fit2 <- update(fit1, . ~ . + full + half)
summary(fit2)

Call:
lm(formula = sale ~ bedrooms + neighborhood + full + half)

Residuals:
    Min      1Q  Median      3Q     Max
-56.554 -38.067   6.027  26.998  53.311

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -125.121      33.136   -3.776  0.000926 ***
bedrooms       29.513      10.091    2.925  0.007419 **
neighborhood   78.724       9.669    8.142  2.31e-08 ***
full           27.345      13.604    2.010  0.055785 .
half           45.553      12.129    3.756  0.000974 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.79 on 24 degrees of freedom
Multiple R-squared: 0.9063, Adjusted R-squared: 0.8907
F-statistic: 58.05 on 4 and 24 DF, p-value: 5.425e-12

drop # full baths

fit3 <- update(fit2, . ~ . - full)
summary(fit3)

Call:
lm(formula = sale ~ bedrooms + neighborhood + half)

Residuals:
   Min     1Q Median     3Q    Max
-67.55 -42.27   7.17  26.93  68.83

Coefficients:
            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -127.348      35.073   -3.631   0.00127 **
bedrooms      35.649      10.187    3.500   0.00177 **
neighborhood  90.982       7.947   11.449  1.95e-11 ***
half          37.004      12.030    3.076   0.00503 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.08 on 25 degrees of freedom
Multiple R-squared: 0.8905, Adjusted R-squared: 0.8774
F-statistic: 67.8 on 3 and 25 DF, p-value: 3.808e-12

compare the nested models

# check anova.lm

important note: when comparing 2 models using anova, the results are exactly those of the partial F-test. However, when more than 2 models are passed to anova, the F-statistic and p-value in each row may not be what we want. Each row compares consecutive models, so its numerator is the mean 𝑆𝑆_𝐸𝑋 between them, but the denominator is always the 𝑀𝑆_𝐸𝑅𝑅 of the largest model in the list, not of the larger model in that pair

anova(fit1, fit3, fit2)
Analysis of Variance Table

Model 1: sale ~ bedrooms + neighborhood
Model 2: sale ~ bedrooms + neighborhood + half
Model 3: sale ~ bedrooms + neighborhood + full + half
  Res.Df    RSS  Df  Sum of Sq        F    Pr(>F)
1     26  58164
2     25  42194   1    15970.1  10.6132  0.003338 **
3     24  36114   1     6080.1   4.0406  0.055785 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
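
The F column of this three-model table can be reproduced from its Res.Df and RSS columns, which shows which denominator anova() uses. This is a worked sketch using the values as printed in the table (so results match only up to rounding); the variable names are illustrative.

```r
# Res.Df and RSS copied from the anova(fit1, fit3, fit2) table, as printed.
rss_fit1 <- 58164; df_fit1 <- 26   # sale ~ bedrooms + neighborhood
rss_fit3 <- 42194; df_fit3 <- 25   # ... + half
rss_fit2 <- 36114; df_fit2 <- 24   # ... + full + half (largest model)

ms_err_fit2 <- rss_fit2 / df_fit2  # MS_ERR of the largest model in the list
ms_err_fit3 <- rss_fit3 / df_fit3  # MS_ERR of the larger model in the pair (fit1, fit3)

# What the table reports: both rows are divided by MS_ERR of fit2.
f_row2 <- ((rss_fit1 - rss_fit3) / (df_fit1 - df_fit3)) / ms_err_fit2
f_row3 <- ((rss_fit3 - rss_fit2) / (df_fit3 - df_fit2)) / ms_err_fit2
p_row2 <- pf(f_row2, df1 = 1, df2 = df_fit2, lower.tail = FALSE)

# A two-model partial F-test of fit1 vs fit3 divides by MS_ERR of fit3 instead.
f_pair <- ((rss_fit1 - rss_fit3) / (df_fit1 - df_fit3)) / ms_err_fit3
p_pair <- pf(f_pair, df1 = 1, df2 = df_fit3, lower.tail = FALSE)

print(c(f_row2 = f_row2, f_row3 = f_row3, f_pair = f_pair))
print(c(p_row2 = p_row2, p_pair = p_pair))
```

Here f_row2 and f_row3 agree with the F column above up to rounding, while f_pair, the statistic a two-model comparison of fit1 and fit3 would use, is smaller (about 9.46) because fit3 has a larger 𝑀𝑆_𝐸𝑅𝑅 than fit2.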

anova(fit1, fit2)
Analysis of Variance Table

Model 1: sale ~ bedrooms + neighborhood
Model 2: sale ~ bedrooms + neighborhood + full + half
  Res.Df    RSS  Df  Sum of Sq       F    Pr(>F)
1     26  58164
2     24  36114   2      22050  7.3269  0.003283 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit3, fit2)
Analysis of Variance Table

Model 1: sale ~ bedrooms + neighborhood + half
Model 2: sale ~ bedrooms + neighborhood + full + half
  Res.Df    RSS  Df  Sum of Sq       F  Pr(>F)
1     25  42194
2     24  36114   1     6080.1  4.0406  0.05579 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
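
A comparison that drops a single predictor, like fit3 against fit2, is one step of sequential selection. Base R's drop1() with test = "F" runs that partial F-test for every predictor of a fitted model at once. This is a minimal sketch, assuming the built-in mtcars data as a stand-in (the variables mpg, wt, hp and qsec are not part of this page's example).

```r
# Stand-in data: built-in mtcars (an assumption, not the home price data).
fit_full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# One partial F-test per predictor: full model vs the model without that term.
single_term <- drop1(fit_full, test = "F")
print(single_term)

# The hp row matches the two-model anova() comparison for dropping hp.
fit_no_hp <- update(fit_full, . ~ . - hp)
pairwise  <- anova(fit_no_hp, fit_full)
print(pairwise)
```

Each row of the drop1() table uses the 𝑀𝑆_𝐸𝑅𝑅 of the full model as denominator, so it agrees with the corresponding two-model anova() call.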

residual plot

plot(fitted(fit3), resid(fit3))
abline(h=0)

QQ plot

qqnorm(resid(fit3))
qqline(resid(fit3))

take sqrt(sale) rather than sale as response

fit4 <- update(fit3, sqrt(sale) ~ .)

new QQ plot

qqnorm(resid(fit4))
qqline(resid(fit4))

／var／log marcus chiu

LR - Comparing 2 Models (Extra Sum of Squares & Partial F-Test Statistic)