Homework #18 Problem 3

A statistics professor wanted to know whether time limits on quizzes affect students’ scores on the quiz. Accordingly, she took a random sample of economics statistics students and split them into five groups of 20 students each. All students took a quiz that involved simple manual calculations. Each group was given a different time limit: group 1 was limited to 40 minutes; group 2, 45 minutes; group 3, 50 minutes; group 4, 55 minutes; and group 5, 60 minutes. The quizzes were marked (out of 40) and recorded in this Excel file. Run a regression of Score on Time (be sure to check the Residuals and Standardized Residuals check boxes) and answer the following questions.

The algebraic form of the estimated equation of your model is Score = _____ + _____ × Time.

The value of the coefficient of determination is ________

Now, you want to check whether the required assumption #1 (Normality) is violated. Create a histogram of the standardized residuals. From the residual analysis, you can conclude that the errors are ________

Also, you want to check whether the required assumption #2 (Homoskedasticity) is violated. You can conclude that the errors have ________

One way to deal with the problem you found in the previous part, as has been addressed in class, is to try a log transformation of the dependent variable. So run a regression using the log of Score (ln(Score)) as the dependent variable (Y) and Time as the independent variable.

The algebraic form of the estimated equation of your transformed model is lnScore = ______ + ______ × Time.

Transformation of the dependent variable has ________ the coefficient of determination.

The estimated quiz score of a student who gets 55 minutes to complete the quiz is ________

Answer

library(ggplot2)
library(xlsx)
## Importing the Data. Remember to set up your working directory with setwd()
quizScore<-read.xlsx("quizScorev3.xls", sheetName = "Data", as.data.frame = T, header = T)

## lm is the function used to fit linear models
model<-lm(Score~Time, data=quizScore)
summary(model)
## 
## Call:
## lm(formula = Score ~ Time, data = quizScore)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.355  -1.355   0.190   1.735   9.100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.64000    2.41423  -0.265    0.791    
## Time         0.50900    0.04781  10.647   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.381 on 98 degrees of freedom
## Multiple R-squared:  0.5363, Adjusted R-squared:  0.5316 
## F-statistic: 113.3 on 1 and 98 DF,  p-value: < 2.2e-16
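## A hedged extra (not required by the problem): the values the blanks above ask for can also be
## pulled directly from the fitted object instead of being read off the printout
coef(model)                ## intercept and slope for Score = b0 + b1*Time
summary(model)$r.squared   ## coefficient of determination (R-squared)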
residuals<-resid(model)
predicted<-model$fitted.values
## Attaching those new series in your dataset
quizScore$residuals<-residuals
quizScore$predicted<-predicted

## Can you assume Homoskedasticity?
homosk<-ggplot(quizScore, aes(x=predicted, y=residuals)) + geom_point()+labs(x="Predicted Values",y="Residuals") 
homosk

## Can you assume Normality?
## Histogram
hist(residuals)
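## The question asks for a histogram of the *standardized* residuals; a small sketch using
## base R's rstandard() (the hist() call above uses the raw residuals instead)
std_residuals<-rstandard(model)
hist(std_residuals, main="Histogram of Standardized Residuals", xlab="Standardized residuals")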

############### Creating a new dependent variable 
quizScore$lnScore<-log(quizScore$Score, base=exp(1))  ## base = exp(1) gives the natural logarithm

model2<-lm(lnScore~Time, data=quizScore)
summary(model2)
## 
## Call:
## lm(formula = lnScore ~ Time, data = quizScore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09190 -0.04913  0.00444  0.07812  0.27593 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.205441   0.111679  19.748  < 2e-16 ***
## Time        0.019703   0.002212   8.909 2.81e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1564 on 98 degrees of freedom
## Multiple R-squared:  0.4475, Adjusted R-squared:  0.4419 
## F-statistic: 79.37 on 1 and 98 DF,  p-value: 2.813e-14
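############## Optional check on the transformed model
## A sketch (same ggplot approach as above) to see whether the log transformation
## mitigated the heteroskedasticity found in the original model
quizScore$residuals2<-resid(model2)
quizScore$predicted2<-model2$fitted.values
homosk2<-ggplot(quizScore, aes(x=predicted2, y=residuals2)) + geom_point()+labs(x="Predicted Values (lnScore)",y="Residuals")
homosk2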
############## Predicted Value for 55 minutes
beta_0<-model2$coefficients[1]
beta_1<-model2$coefficients[2]

predicted_lnscore<-beta_0+beta_1*55
predicted_score<-exp(predicted_lnscore)
predicted_score
## (Intercept) 
##    26.81925
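## Equivalent, and arguably less error-prone, way to get the same prediction with predict();
## it returns the same value as predicted_score above
predicted_score_alt<-exp(predict(model2, newdata=data.frame(Time=55)))
predicted_score_alt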

Homework #18 Problem 6

You want to know how the weather affects ticket sales at a ski resort. This Excel file contains data on tickets sold, total snowfall, and average temperature over Christmas week for 20 consecutive years. Run a multiple regression of ticket sales on snowfall and temperature and answer the following questions. Plotting the residuals against time suggests that:

The calculated value of the Durbin-Watson statistic is _______

Answer

library(lmtest)
## Importing the Data
ski_resort<-read.xlsx("ski_resortv3.xls", sheetName = "Data", as.data.frame = T, header = T)

## lm is the function used to fit linear models
reg<-lm(Tickets~Snowfall+Temperature, data=ski_resort)
summary(reg)
## 
## Call:
## lm(formula = Tickets ~ Snowfall + Temperature, data = ski_resort)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5258.9  -958.9   430.7   973.0  3032.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7761.251   1124.297   6.903 2.56e-06 ***
## Snowfall      73.042     64.162   1.138    0.271    
## Temperature    8.354     24.514   0.341    0.737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2129 on 17 degrees of freedom
## Multiple R-squared:  0.07586,    Adjusted R-squared:  -0.03286 
## F-statistic: 0.6978 on 2 and 17 DF,  p-value: 0.5114
residuals<-resid(reg)
ski_resort$resid<-residuals

## Creating a new column (TIME)
ski_resort$Time<-c(1:20)


## Can you assume Non-autocorrelation?

autocorr<-ggplot(ski_resort, aes(x=Time, y=resid)) + geom_point()+labs(x="Year",y="Residuals")
autocorr

### Durbin Watson Using the Package lmtest
DW<-dwtest(reg)
DW$statistic
##        DW 
## 0.4388053
### Durbin Watson by hand

## First loop for the numerator: take the successive differences of the residuals,
## resid[2]-resid[1], resid[3]-resid[2], etc., square them, and place them in a vector called num.
## Note that there are only 19 differences even though there are 20 residuals (take a look at the formula).

num<-rep(0,19)
for(i in 1:19){num[i]<-(ski_resort$resid[i+1]-ski_resort$resid[i])^2}

## Second loop for the denominator: you are squaring each value for your residuals
## and placing them in a vector called den

den<-rep(0,20)
for(i in 1:20){den[i]<-ski_resort$resid[i]^2}

## Sum all those values for "num" and "den". Then, divide them. 

DW_by_hand<-sum(num)/sum(den)
DW_by_hand
## [1] 0.4388053
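## Same calculation, vectorized with diff() instead of the two loops above
DW_vectorized<-sum(diff(ski_resort$resid)^2)/sum(ski_resort$resid^2)
DW_vectorized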

Homework #19 Problem 2

A real estate agent is interested in estimating the value of a piece of lakefront property. She believes that price is a function of lot size (thousands of square feet), number of mature trees on the lot, and distance to the lake (in yards). She has collected the data in this Excel file on the basis of recent sales. So in this model the price of a home is said to depend on the lot size, the number of trees on the lot, and the distance from the home to the lake. Run a regression on the data, selecting the Price column as the Y range and the Lot Size, Trees, and Distance columns as the X range. Look at the regression output and determine whether any signs of multicollinearity are present. (Hint: Remember the two ways multicollinearity is usually detected: if the results of the overall F-test and the individual t-tests contradict each other, or if the correlation between any pair of independent variables exceeds 0.8 in absolute value.)

Answer

## Importing the Data
trees_are_good<-read.xlsx("trees_are_goodv4.xls", sheetName = "Data", as.data.frame = T, header = T)

reg2<-lm(Price~Lot.size+Trees+Distance , data=trees_are_good)
summary(reg2)
## 
## Call:
## lm(formula = Price ~ Lot.size + Trees + Distance, data = trees_are_good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61.94 -32.48   1.60  36.59  62.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  57.7660    29.7197   1.944   0.0594 .
## Lot.size      0.5757     0.8578   0.671   0.5062  
## Trees         0.5875     0.3944   1.489   0.1446  
## Distance     -0.3883     0.2324  -1.671   0.1029  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.84 on 38 degrees of freedom
## Multiple R-squared:  0.2321, Adjusted R-squared:  0.1715 
## F-statistic: 3.829 on 3 and 38 DF,  p-value: 0.01721
corr_matrix<-data.frame(cor(trees_are_good))
corr_matrix
##               Price   Lot.size      Trees    Distance
## Price     1.0000000  0.3904720 0.36754984 -0.28792515
## Lot.size  0.3904720  1.0000000 0.64190886 -0.27819594
## Trees     0.3675498  0.6419089 1.00000000  0.02325195
## Distance -0.2879251 -0.2781959 0.02325195  1.00000000
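## A sketch of the hint's two detection rules applied to this output:
## 1) Overall F-test vs. individual t-tests
fstat<-summary(reg2)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   ## overall F-test p-value (0.01721 < 0.05)
summary(reg2)$coefficients[-1, "Pr(>|t|)"]             ## individual t-test p-values (all > 0.05)
## The overall test is significant while none of the individual coefficients is,
## which is the contradiction the hint describes.
## 2) Pairwise correlations between the independent variables
cor(trees_are_good[, c("Lot.size", "Trees", "Distance")])  ## largest |r| is about 0.64, below 0.8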