Lab Quiz Review
Reading the data
This exercise will be using data on quality-of-life (QOL) in the US from Albouy and Lue (2015). You already saw it here and here. Download the .RDS
file here.
You know the drill: setwd() first, and then readRDS(). Also, call the packages that we will be using during the exercise: tidyverse
and tidycensus
.
library(tidyverse)
library(tidycensus)
setwd("C:/Users/User/Desktop/LabReview")
<-readRDS("QOL.RDS")
qol# Always good to View() the data
View(qol)
The columns QOL
, PopDens
, and PopChg19902000
refer to quality-of-life, population density, and population change between 1990-2000, respectively.
Top 5 & Bottom 3
Let’s arrange the dataset to see the five most and the three least desirable US regions in terms of quality-of-life.
Top 5
%>%arrange(desc(QOL))%>%slice(1:5) qol
## statefip stusps stname PumaID
## 1 36 NY New York 3603805
## 2 6 CA California 605410
## 3 6 CA California 602707
## 4 6 CA California 601202
## 5 36 NY New York 3604201
## msaname QOL PopDens
## 1 New York, Northern New Jersey, Long Island, NY-NJ-CT-PA 0.3269864 106676.922
## 2 Los Angeles-Riverside-Orange County, CA 0.3109373 2273.541
## 3 San Francisco-Oakland-San Jose, CA 0.3087680 1247.340
## 4 San Francisco-Oakland-San Jose, CA 0.3037749 2896.483
## 5 New York, Northern New Jersey, Long Island, NY-NJ-CT-PA 0.2988747 2860.220
## PopChg19902000
## 1 0.02931997
## 2 0.03910486
## 3 0.08461364
## 4 0.11845883
## 5 0.06038354
Bottom 3
%>%arrange(QOL)%>%slice(1:3) qol
## statefip stusps stname PumaID msaname
## 1 26 MI Michigan 2603706 Detroit-Ann Arbor-Flint, MI
## 2 24 MD Maryland 2400100 Cumberland, MD-WV
## 3 42 PA Pennsylvania 4204107 Philadelphia-Wilmington-Atlantic City, PA-NJ-DE-MD
## QOL PopDens PopChg19902000
## 1 -0.2256128 5932.3267 -0.05896693
## 2 -0.2173824 97.6136 0.01641372
## 3 -0.2022377 14952.6895 -0.16778216
Using group_by()
One can condense the data into state-level quality-of-life. A better way to do that would be using a weighted (by population) average, but we will use the simple average to make things easier. Although we do not have NAs in the data, na.rm=T
does not hurt you.
<-qol%>%group_by(stname)%>%summarize(qol=mean(QOL, na.rm=T))
state_qolhead(state_qol)
## # A tibble: 6 x 2
## stname qol
## <chr> <dbl>
## 1 Alabama -0.0515
## 2 Arizona 0.0156
## 3 Arkansas -0.0504
## 4 California 0.0880
## 5 Colorado 0.0483
## 6 Connecticut -0.00245
Using get_acs()
Having information about average quality-of-life per state, I am curious to see if there is a correlation between rents as a share of household income and quality of life. Go back to the tidycensus Rlab if you are not comfortable getting acs()
. The geography
is state
, the dataset is acs5
, and the period is 2005-2009 - the oldest available for acs5. Hence, year=2009
. You should also load_variables()
to check the variables’ code and their concept.
<-get_acs(geography = "state",
state09variables = c(rent_income="B25071_001"),
dataset="acs5",
output="wide",
year=2009)
# View(state09)
Merging
Let’s put both data frames together. Sometimes, you get lucky, and the common columns between datasets have the same name. More frequently, you need to rename columns, and that is our case. Hence, I will need to rename the first column in state_qol
(stname) to NAME
. It is also fine to do the opposite: rename the second column in state09
(NAME) to stname
and merge by="stname"
.
left_join()
keeps in the merged dataset all the states included in the state09
- the first input in the left_join() function. Since there is no information about quality of life in Hawaii, Alaska, and Puerto Rico, those states have NA
in the column qol
.
names(state_qol)[1]<-"NAME"
<-left_join(state09, state_qol, by="NAME")
final_dataView(final_data)
Using filter()
Let’s check how many states have above the average rent/income share. To do that, one needs to filter() by mean().
The mean() is
mean(final_data$rent_incomeE)
## [1] 29.23654
And here are the states (26 in total) that have a rent/income share higher than 29.23:
%>%filter(rent_incomeE>=mean(rent_incomeE)) final_data
## # A tibble: 26 x 5
## GEOID NAME rent_incomeE rent_incomeM qol
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 01 Alabama 29.3 0.3 -0.0515
## 2 04 Arizona 30 0.3 0.0156
## 3 06 California 32.3 0.2 0.0880
## 4 08 Colorado 30.3 0.3 0.0483
## 5 09 Connecticut 30.3 0.4 -0.00245
## 6 10 Delaware 29.7 0.6 -0.0290
## 7 12 Florida 33.6 0.2 0.0227
## 8 13 Georgia 30 0.2 -0.0316
## 9 15 Hawaii 32 0.5 NA
## 10 17 Illinois 29.9 0.2 -0.00615
## # ... with 16 more rows
Using lm()
As the final exercise, we run a simple regression using the lm()
function. Below you can see a positive relationship between quality of life and rent as a share of income - the coefficient of qol
is positive and statistically significant (p-value < 0.05). That was somehow expected: the higher rent/income share must offset the higher quality-of-life.
<-lm(rent_incomeE~qol, data=final_data)
model1summary(model1)
##
## Call:
## lm(formula = rent_incomeE ~ qol, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6976 -0.6852 0.0027 0.6414 3.7739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.3995 0.2277 129.126 < 2e-16 ***
## qol 18.7923 5.4882 3.424 0.00129 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.505 on 47 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.1997, Adjusted R-squared: 0.1826
## F-statistic: 11.72 on 1 and 47 DF, p-value: 0.001289