Intro to tidyverse

Marcelino Guerra

Last update: 01/27/2021

Working with .csv and .RDS files

The first step is to set up your working directory. To organize things better, I have a folder named Rlabs on my desktop. Inside of it, I also have different folders for each lab - in this case, Lab1. To change the working directory, use setwd() with the path to the folder Lab1. Download both datasets that we are using and place them inside of the folder Lab1.

setwd("C:/Users/User/Desktop/Rlabs/Lab1")

Since all the packages we need in the exercise belong to tidyverse, I will call that library. After that, you need to import the data to the R environment. The first one is the Household Income dataset. It is a .csv file and you might want to use the function read_csv(), from the package readr.

library(tidyverse)
HHincome<-read_csv("HHincome18.csv", col_names = TRUE) ## col_names is TRUE because all the columns have names in the first row

If you want to see the first values on that dataset, you can use the function head(), or use View(HHincome) to open a new tab.

#View(HHincome)
head(HHincome)

## # A tibble: 6 x 4
##   FIPStxt Stabr area_name          Median_Household_Income_2018
##   <chr>   <chr> <chr>                                     <dbl>
## 1 00000   US    United States                             61937
## 2 01000   AL    Alabama                                   49881
## 3 01001   AL    Autauga County, AL                        59338
## 4 01003   AL    Baldwin County, AL                        57588
## 5 01005   AL    Barbour County, AL                        34382
## 6 01007   AL    Bibb County, AL                           46064

Keeping this exercise going, there is a .RDS file storing health indicators of 3141 US counties. You may find more information about the meaning of those variables here. Read the .RDS data using the function readRDS()

health_data<-readRDS("health.RDS")
View(health_data)

One possibility that you might want to explore is the creation of new variables. For instance, the column #EligibleforMedicare2018 refers to the number of people eligible for Medicare per county in 2018. Note that you also have the estimated 2018 population (PopulationEstimate2018 column), so it is possible to create a variable % of people eligible for Medicare using health_data$your_new_column after your dataset. I am calling this new column perc_Medicare, and this new variable will be placed last in your dataset (check it using View(health_data)).

health_data$perc_Medicare<-health_data$`#EligibleforMedicare2018`/health_data$PopulationEstimate2018
head(health_data$perc_Medicare)

## [1] 0.1497090 0.1643653 0.1513478 0.2034678 0.1813108 0.1943226

Another way to create columns is using the mutate() function. We use the pipe %>% to express a sequence of operations. First, I want to store the results in my old dataset and that is the reason for health_data<-. Second, the mutate() operation is on the health_data dataset. Inside mutate(), I am calling my new variable perc_Medicare2. Now, compare perc_Medicare with perc_Medicare2 - they should be exactly the same.

health_data<-health_data%>%mutate(perc_Medicare2=`#EligibleforMedicare2018`/PopulationEstimate2018)

Merging datasets

In case you want to put the two datasets (income and health) together, you can use join function. There are some details here. You have 3,275 counties (rows) in HHincome, but only 3,141 counties in health_data. Here I will use left_join restricting the merge to those 3,141 counties inside health_data. Check the other join types here. Using the function dim() you will realize that your new data has 3,141 rows and 91 columns.

full_data<-left_join(health_data, HHincome,by = "FIPStxt")
dim(full_data)

## [1] 3141   92

Subsetting/Filtering data

For now, we just want to work with population density per squared mile (PopulationDensityperSqMile2010) and median household income (Median_Household_Income_2018). Those to variables are placed in columns 19 and 91 (check your dataset). Since we do not need the rest of the data, I will just create a new one with those two variables and the name of counties (area_name) and state (Stabr).

sub_data<-full_data[c(91,90,92, 19)]
head(sub_data)

## # A tibble: 6 x 4
##   area_name           Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
##   <chr>               <chr>                        <dbl>                          <dbl>
## 1 Kings County, NY    NY                           60862                         35369.
## 2 Queens County, NY   NY                           68432                         20554.
## 3 Bronx County, NY    NY                           38566                         32903.
## 4 Nassau County, NY   NY                          115301                          4705.
## 5 New York County, NY NY                           84610                         69468.
## 6 Wayne County, MI    MI                           46440                          2974.

Another way to do that is using select():

sub_data2<-full_data%>%select(area_name, Stabr, Median_Household_Income_2018, PopulationDensityperSqMile2010)
head(sub_data2)

## # A tibble: 6 x 4
##   area_name           Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
##   <chr>               <chr>                        <dbl>                          <dbl>
## 1 Kings County, NY    NY                           60862                         35369.
## 2 Queens County, NY   NY                           68432                         20554.
## 3 Bronx County, NY    NY                           38566                         32903.
## 4 Nassau County, NY   NY                          115301                          4705.
## 5 New York County, NY NY                           84610                         69468.
## 6 Wayne County, MI    MI                           46440                          2974.

With the new dataset, let’s order it by the median household income values:

sub_data%>%arrange(Median_Household_Income_2018)

## # A tibble: 3,141 x 4
##    area_name               Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
##    <chr>                   <chr>                        <dbl>                          <dbl>
##  1 Wilcox County, AL       AL                           25385                           13.1
##  2 Buffalo County, SD      SD                           25973                            4.1
##  3 Owsley County, KY       KY                           26278                           24.1
##  4 Holmes County, MS       MS                           26449                           25.4
##  5 Perry County, AL        AL                           26814                           14.7
##  6 Issaquena County, MS    MS                           26936                            3.4
##  7 McDowell County, WV     WV                           27292                           41.5
##  8 East Carroll Parish, LA LA                           27714                           18.4
##  9 Quitman County, MS      MS                           27767                           20.3
## 10 Sumter County, AL       AL                           27859                           15.2
## # ... with 3,131 more rows

As one can see, Wilcox county has the minimum value of household income in this sample. Using the functionfilter(), you can take a look at counties from Illinois:

sub_data%>%filter(Stabr=="IL")%>%arrange(Median_Household_Income_2018)

## # A tibble: 102 x 4
##    area_name            Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
##    <chr>                <chr>                        <dbl>                          <dbl>
##  1 Alexander County, IL IL                           35249                           35  
##  2 Pulaski County, IL   IL                           38400                           30.9
##  3 Saline County, IL    IL                           40761                           65.6
##  4 Jackson County, IL   IL                           41445                          103. 
##  5 Hardin County, IL    IL                           41713                           24.3
##  6 Franklin County, IL  IL                           42071                           96.8
##  7 McDonough County, IL IL                           42887                           55.3
##  8 Pope County, IL      IL                           43053                           12.1
##  9 Gallatin County, IL  IL                           43330                           17.3
## 10 Lawrence County, IL  IL                           43767                           45.2
## # ... with 92 more rows

Hence, Alexander County had the lowest median household income in 2018.

One can save some lines of code. Try to select variables while filtering results:

full_data%>% select(area_name, Stabr,PopulationDensityperSqMile2010, Median_Household_Income_2018)%>%filter(Stabr=="IL")%>%arrange(PopulationDensityperSqMile2010)

## # A tibble: 102 x 4
##    area_name            Stabr PopulationDensityperSqMile2010 Median_Household_Income_2018
##    <chr>                <chr>                          <dbl>                        <dbl>
##  1 Pope County, IL      IL                              12.1                        43053
##  2 Schuyler County, IL  IL                              17.3                        52018
##  3 Gallatin County, IL  IL                              17.3                        43330
##  4 Henderson County, IL IL                              19.4                        53632
##  5 Hamilton County, IL  IL                              19.5                        53028
##  6 Jasper County, IL    IL                              19.6                        54552
##  7 Pike County, IL      IL                              19.8                        47815
##  8 Calhoun County, IL   IL                              20.1                        56099
##  9 Stark County, IL     IL                              20.8                        52651
## 10 Scott County, IL     IL                              21.3                        55198
## # ... with 92 more rows

What if you want to sort a variable in descending order? Use desc():

full_data%>% select(area_name, Stabr,PopulationDensityperSqMile2010, Median_Household_Income_2018)%>%filter(Stabr=="IL")%>%arrange(desc(PopulationDensityperSqMile2010))

## # A tibble: 102 x 4
##    area_name            Stabr PopulationDensityperSqMile2010 Median_Household_Income_2018
##    <chr>                <chr>                          <dbl>                        <dbl>
##  1 Cook County, IL      IL                             5495.                        63347
##  2 DuPage County, IL    IL                             2800.                        93540
##  3 Lake County, IL      IL                             1586.                        87336
##  4 Kane County, IL      IL                              991.                        79834
##  5 Will County, IL      IL                              810.                        86108
##  6 Winnebago County, IL IL                              575.                        53795
##  7 McHenry County, IL   IL                              512.                        86268
##  8 St. Clair County, IL IL                              411.                        58222
##  9 Madison County, IL   IL                              376.                        60121
## 10 Kendall County, IL   IL                              358.                        93345
## # ... with 92 more rows

Summarizing data

An important thing to do is to get summary statistics by group. Let’s say you want to group your data by some characteristic in your dataset (that means you want group_by()), and summarize() some information among groups. How to that?

Assume you want to check how many ICU beds and Hospitals are in each state - in other words, we need to sum() all of the counties’ ICU beds and Hospitals in each state. Do this:

full_data%>%group_by(State)%>%summarize(sumICU=sum(`#ICU_beds`, na.rm=T), sumHosp=sum(`#Hospitals`, na.rm=T))

## # A tibble: 50 x 3
##    State                sumICU sumHosp
##    <chr>                 <dbl>   <dbl>
##  1 Alabama                1533      86
##  2 Arizona                1559      76
##  3 Arkansas                732      74
##  4 California             7338     329
##  5 Colorado               1095      80
##  6 Connecticut             674      30
##  7 Delaware                186       6
##  8 District Of Columbia    314       6
##  9 Florida                5604     178
## 10 Georgia                2508     129
## # ... with 40 more rows

Finally, let’s arrange that data to see the states with the highest numbers of hospitals in the US:

full_data%>%group_by(State)%>%summarize(sumICU=sum(`#ICU_beds`, na.rm=T), sumHosp=sum(`#Hospitals`, na.rm=T))%>%arrange(desc(sumHosp))

## # A tibble: 50 x 3
##    State        sumICU sumHosp
##    <chr>         <dbl>   <dbl>
##  1 Texas          6199     384
##  2 California     7338     329
##  3 Florida        5604     178
##  4 Illinois       3144     176
##  5 New York       3952     165
##  6 Pennsylvania   3169     162
##  7 Ohio           3314     156
##  8 Kansas          767     132
##  9 Michigan       2423     130
## 10 Georgia        2508     129
## # ... with 40 more rows