Abstract
Multiple Linear Regression - dummy variables and curvilinear relationship.One of the problems that appraisers of real estate face is the lack of sufficient data. However, over the last few years they can access enormous data banks from the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) that have substantially reduced the problem of data availability. The data on millions of transactions in which these secondary mortgage market organizations are involved are now stored and available online to lenders and appraisers. The models and procedures that will govern the application of these data sources are now largely developed and will undoubtedly lead to major changes both in the scope of the residential appraisal function and in the way that residential appraisals are conducted. Multiple regression analysis is very useful in estimating models for appraising rents for one- to four-bedroom family residences as well as other properties where substantial data on a large number of transactions are available. The independent variables that theory suggest include categorical variables like whether the apartment has a dishwasher (DSHWSHER=1), a tennis court in the complex (TENISCRT=1), a sauna in the complex (SAUNA=1), whether the apartment is for yearly lease only (YRLYLEAS=1), are there some extra utilities like Ethernet connections and laundry services (XTRAUTIL=1), whether the apartment is in a poor neighborhood (POORNBHD=1), whether utilities are included (UTLITIES=1), whether the apartment is furnished (FURNSHED=1), and the type of apartment (efficiency, one-, two, three-, four-bedroom). In the data, the most commonly occurring type of apartment was one-bedroom. Besides those characteristics, the apartment rate also depends on the number of baths per bedroom (BTHPERBD), average square feet per bedroom (BDRMSIZE), and a measure of location in number of miles from city center (LOCATION). You decide to use multiple regression analysis in estimating models for appraising rents for apartments (efficiencies and one- to four-bedroom apartments). The data on rent and the aforementioned characteristics for 300 of these apartments are stored in this Excel spreadsheet.
library(xlsx)
library(fastDummies)
data<-read.xlsx("rent_proj2_dat7.xls", sheetName ="Data", header=T, as.data.frame = T)
lapply(data, class)
## $RENT
## [1] "numeric"
##
## $DSHWSHER
## [1] "factor"
##
## $TENISCRT
## [1] "factor"
##
## $SAUNA
## [1] "factor"
##
## $YRLYLEAS
## [1] "factor"
##
## $XTRAUTIL
## [1] "factor"
##
## $POORNBHD
## [1] "factor"
##
## $UTLITIES
## [1] "factor"
##
## $FURNSHED
## [1] "factor"
##
## $BTHPERBD
## [1] "numeric"
##
## $BDRMSIZE
## [1] "numeric"
##
## $LOCATION
## [1] "numeric"
##
## $EFFICNCY
## [1] "factor"
##
## $TWOBDRM
## [1] "factor"
##
## $THREBDRM
## [1] "factor"
##
## $FOURBDRM
## [1] "factor"
## Since independent variables with Yes/No are factors, there is no need to
## create dummy variables for each case. Just run the regression
## with . and you will use all the Xs in your dataframe
## Full Model
##
full<-lm(RENT~., data=data)
summary(full)
##
## Call:
## lm(formula = RENT ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -203.202 -51.120 -3.478 48.719 198.613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 266.6188 23.2248 11.480 < 2e-16 ***
## DSHWSHERYes 19.3591 11.3219 1.710 0.08838 .
## TENISCRTYes 1.1612 13.1639 0.088 0.92977
## SAUNAYes 7.5378 11.4829 0.656 0.51207
## YRLYLEASYes 24.4220 46.6911 0.523 0.60134
## XTRAUTILYes 20.9381 15.5396 1.347 0.17892
## POORNBHDYes -39.9552 9.4414 -4.232 3.13e-05 ***
## UTLITIESYes 24.7427 10.4972 2.357 0.01910 *
## FURNSHEDYes 42.0003 9.3352 4.499 9.96e-06 ***
## BTHPERBD 8.2657 6.6296 1.247 0.21350
## BDRMSIZE 0.4278 0.0554 7.722 1.98e-13 ***
## LOCATION -12.4218 1.0479 -11.854 < 2e-16 ***
## EFFICNCYYes -43.0677 13.3683 -3.222 0.00142 **
## TWOBDRMYes 59.5707 11.6458 5.115 5.78e-07 ***
## THREBDRMYes 99.9038 15.7285 6.352 8.43e-10 ***
## FOURBDRMYes 148.4485 18.5817 7.989 3.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.42 on 284 degrees of freedom
## Multiple R-squared: 0.5965, Adjusted R-squared: 0.5752
## F-statistic: 27.99 on 15 and 284 DF, p-value: < 2.2e-16
## Drop all the not statistically significant variables and run the reduced model
reduc<-lm(RENT~.-TENISCRT-SAUNA-YRLYLEAS-XTRAUTIL-BTHPERBD-DSHWSHER, data=data)
summary(reduc)
##
## Call:
## lm(formula = RENT ~ . - TENISCRT - SAUNA - YRLYLEAS - XTRAUTIL -
## BTHPERBD - DSHWSHER, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -197.685 -54.088 -4.217 51.406 214.275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 292.74182 20.43674 14.324 < 2e-16 ***
## POORNBHDYes -41.11861 9.36937 -4.389 1.60e-05 ***
## UTLITIESYes 27.91858 10.43578 2.675 0.00789 **
## FURNSHEDYes 39.59889 9.28595 4.264 2.72e-05 ***
## BDRMSIZE 0.42977 0.05474 7.852 8.02e-14 ***
## LOCATION -12.79089 1.04282 -12.266 < 2e-16 ***
## EFFICNCYYes -43.12174 13.27260 -3.249 0.00129 **
## TWOBDRMYes 57.89485 11.57784 5.000 9.92e-07 ***
## THREBDRMYes 101.84946 15.71738 6.480 3.93e-10 ***
## FOURBDRMYes 145.35713 18.48332 7.864 7.37e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.85 on 290 degrees of freedom
## Multiple R-squared: 0.5834, Adjusted R-squared: 0.5705
## F-statistic: 45.13 on 9 and 290 DF, p-value: < 2.2e-16
## Partial F-test
anova(full, reduc)
## Analysis of Variance Table
##
## Model 1: RENT ~ DSHWSHER + TENISCRT + SAUNA + YRLYLEAS + XTRAUTIL + POORNBHD +
## UTLITIES + FURNSHED + BTHPERBD + BDRMSIZE + LOCATION + EFFICNCY +
## TWOBDRM + THREBDRM + FOURBDRM
## Model 2: RENT ~ (DSHWSHER + TENISCRT + SAUNA + YRLYLEAS + XTRAUTIL + POORNBHD +
## UTLITIES + FURNSHED + BTHPERBD + BDRMSIZE + LOCATION + EFFICNCY +
## TWOBDRM + THREBDRM + FOURBDRM) - TENISCRT - SAUNA - YRLYLEAS -
## XTRAUTIL - BTHPERBD - DSHWSHER
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 284 1746359
## 2 290 1802929 -6 -56570 1.5333 0.1671
############### P-value>.05 and you do not reject the null. Therefore,
############### for prediction purposes, use the reduced model.
## Interaction dummy for DSHWSHER, SAUNA and TENISCRT
## Since those variables are factors, I will create dummy variables (numeric) for them.
## After that, I construct the interaction dummy
dummies<-dummy_cols(data[,c(2,3,4)], select_columns = c("DSHWSHER","SAUNA","TENISCRT"))
dummies$interact<-dummies$DSHWSHER_Yes*dummies$SAUNA_Yes*dummies$TENISCRT_Yes
################################
## Bringing the new column to your data set
data$interact<-dummies$interact
full_2<-lm(RENT~., data=data)
summary(full_2)
##
## Call:
## lm(formula = RENT ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -204.198 -50.335 -3.744 48.523 197.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 266.1601 23.4386 11.356 < 2e-16 ***
## DSHWSHERYes 21.8702 19.2854 1.134 0.25774
## TENISCRTYes 0.2771 14.2845 0.019 0.98454
## SAUNAYes 9.0061 14.6797 0.614 0.54003
## YRLYLEASYes 23.9018 46.8829 0.510 0.61058
## XTRAUTILYes 21.0792 15.5909 1.352 0.17745
## POORNBHDYes -40.0073 9.4632 -4.228 3.19e-05 ***
## UTLITIESYes 24.9458 10.5908 2.355 0.01918 *
## FURNSHEDYes 42.0172 9.3518 4.493 1.02e-05 ***
## BTHPERBD 8.1916 6.6569 1.231 0.21952
## BDRMSIZE 0.4277 0.0555 7.706 2.21e-13 ***
## LOCATION -12.4212 1.0497 -11.833 < 2e-16 ***
## EFFICNCYYes -42.8418 13.4646 -3.182 0.00163 **
## TWOBDRMYes 59.7214 11.7034 5.103 6.14e-07 ***
## THREBDRMYes 100.1873 15.8537 6.319 1.02e-09 ***
## FOURBDRMYes 148.4632 18.6139 7.976 3.77e-14 ***
## interact -3.5809 22.2435 -0.161 0.87222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.55 on 283 degrees of freedom
## Multiple R-squared: 0.5965, Adjusted R-squared: 0.5737
## F-statistic: 26.15 on 16 and 283 DF, p-value: < 2.2e-16