Lab Quiz Review

Reading the data

This exercise will be using data on quality-of-life (QOL) in the US from Albouy and Lue (2015). You already saw it here and here. Download the .RDS file here.

You know the drill: setwd() first, and then readRDS(). Also, call the packages that we will be using during the exercise: tidyverse and tidycensus.

library(tidyverse)
library(tidycensus)
setwd("C:/Users/User/Desktop/LabReview")
qol<-readRDS("QOL.RDS")
# Always good to View() the data
View(qol)

The columns QOL, PopDens, and PopChg19902000 refer to quality-of-life, population density, and population change between 1990-2000, respectively.

Top 5 & Bottom 3

Let’s arrange the dataset to see the five most and the three least desirable US regions in terms of quality-of-life.

Top 5

qol%>%arrange(desc(QOL))%>%slice(1:5)

##   statefip stusps     stname  PumaID
## 1       36     NY   New York 3603805
## 2        6     CA California  605410
## 3        6     CA California  602707
## 4        6     CA California  601202
## 5       36     NY   New York 3604201
##                                                   msaname       QOL    PopDens
## 1 New York, Northern New Jersey, Long Island, NY-NJ-CT-PA 0.3269864 106676.922
## 2                 Los Angeles-Riverside-Orange County, CA 0.3109373   2273.541
## 3                      San Francisco-Oakland-San Jose, CA 0.3087680   1247.340
## 4                      San Francisco-Oakland-San Jose, CA 0.3037749   2896.483
## 5 New York, Northern New Jersey, Long Island, NY-NJ-CT-PA 0.2988747   2860.220
##   PopChg19902000
## 1     0.02931997
## 2     0.03910486
## 3     0.08461364
## 4     0.11845883
## 5     0.06038354

Bottom 3

qol%>%arrange(QOL)%>%slice(1:3)

##   statefip stusps       stname  PumaID                                            msaname
## 1       26     MI     Michigan 2603706                        Detroit-Ann Arbor-Flint, MI
## 2       24     MD     Maryland 2400100                                  Cumberland, MD-WV
## 3       42     PA Pennsylvania 4204107 Philadelphia-Wilmington-Atlantic City, PA-NJ-DE-MD
##          QOL    PopDens PopChg19902000
## 1 -0.2256128  5932.3267    -0.05896693
## 2 -0.2173824    97.6136     0.01641372
## 3 -0.2022377 14952.6895    -0.16778216

Using group_by()

One can condense the data into state-level quality-of-life. A better way to do that would be using a weighted (by population) average, but we will use the simple average to make things easier. Although we do not have NAs in the data, na.rm=T does not hurt you.

state_qol<-qol%>%group_by(stname)%>%summarize(qol=mean(QOL, na.rm=T))
head(state_qol)

## # A tibble: 6 x 2
##   stname           qol
##   <chr>          <dbl>
## 1 Alabama     -0.0515 
## 2 Arizona      0.0156 
## 3 Arkansas    -0.0504 
## 4 California   0.0880 
## 5 Colorado     0.0483 
## 6 Connecticut -0.00245

Using get_acs()

Having information about average quality-of-life per state, I am curious to see if there is a correlation between rents as a share of household income and quality of life. Go back to the tidycensus Rlab if you are not comfortable getting acs(). The geography is state, the dataset is acs5, and the period is 2005-2009 - the oldest available for acs5. Hence, year=2009. You should also load_variables() to check the variables’ code and their concept.

state09<-get_acs(geography = "state", 
                 variables = c(rent_income="B25071_001"), 
                 dataset="acs5",
                 output="wide",
                 year=2009)
# View(state09)

Merging

Let’s put both data frames together. Sometimes, you get lucky, and the common columns between datasets have the same name. More frequently, you need to rename columns, and that is our case. Hence, I will need to rename the first column in state_qol (stname) to NAME. It is also fine to do the opposite: rename the second column in state09 (NAME) to stname and merge by="stname".

left_join() keeps in the merged dataset all the states included in the state09 - the first input in the left_join() function. Since there is no information about quality of life in Hawaii, Alaska, and Puerto Rico, those states have NA in the column qol.

names(state_qol)[1]<-"NAME"
final_data<-left_join(state09, state_qol, by="NAME")
View(final_data)

Using filter()

Let’s check how many states have above the average rent/income share. To do that, one needs to filter() by mean().

The mean() is

mean(final_data$rent_incomeE)

## [1] 29.23654

And here are the states (26 in total) that have a rent/income share higher than 29.23:

final_data%>%filter(rent_incomeE>=mean(rent_incomeE))

## # A tibble: 26 x 5
##    GEOID NAME        rent_incomeE rent_incomeM      qol
##    <chr> <chr>              <dbl>        <dbl>    <dbl>
##  1 01    Alabama             29.3          0.3 -0.0515 
##  2 04    Arizona             30            0.3  0.0156 
##  3 06    California          32.3          0.2  0.0880 
##  4 08    Colorado            30.3          0.3  0.0483 
##  5 09    Connecticut         30.3          0.4 -0.00245
##  6 10    Delaware            29.7          0.6 -0.0290 
##  7 12    Florida             33.6          0.2  0.0227 
##  8 13    Georgia             30            0.2 -0.0316 
##  9 15    Hawaii              32            0.5 NA      
## 10 17    Illinois            29.9          0.2 -0.00615
## # ... with 16 more rows

Using lm()

As the final exercise, we run a simple regression using the lm() function. Below you can see a positive relationship between quality of life and rent as a share of income - the coefficient of qol is positive and statistically significant (p-value < 0.05). That was somehow expected: the higher rent/income share must offset the higher quality-of-life.

model1<-lm(rent_incomeE~qol, data=final_data)
summary(model1)

## 
## Call:
## lm(formula = rent_incomeE ~ qol, data = final_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6976 -0.6852  0.0027  0.6414  3.7739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.3995     0.2277 129.126  < 2e-16 ***
## qol          18.7923     5.4882   3.424  0.00129 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.505 on 47 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.1997, Adjusted R-squared:  0.1826 
## F-statistic: 11.72 on 1 and 47 DF,  p-value: 0.001289