Working with .csv and .RDS files
The first step is to set up your working directory. To organize things better, I have a folder named Rlabs
on my desktop. Inside of it, I also have different folders for each lab - in this case, Lab1
. To change the working directory, use setwd()
with the path to the folder Lab1
. Download both datasets that we are using and place them inside of the folder Lab1
.
setwd("C:/Users/User/Desktop/Rlabs/Lab1")
Since all the packages we need in the exercise belong to tidyverse, I will call that library. After that, you need to import the data to the R environment. The first one is the Household Income dataset. It is a .csv
file and you might want to use the function read_csv()
, from the package readr
.
library(tidyverse)
read_csv("HHincome18.csv", col_names = TRUE) ## col_names is TRUE because all the columns have names in the first row HHincome<-
If you want to see the first values on that dataset, you can use the function head()
, or use View(HHincome)
to open a new tab.
#View(HHincome)
head(HHincome)
## # A tibble: 6 x 4
## FIPStxt Stabr area_name Median_Household_Income_2018
## <chr> <chr> <chr> <dbl>
## 1 00000 US United States 61937
## 2 01000 AL Alabama 49881
## 3 01001 AL Autauga County, AL 59338
## 4 01003 AL Baldwin County, AL 57588
## 5 01005 AL Barbour County, AL 34382
## 6 01007 AL Bibb County, AL 46064
Keeping this exercise going, there is a .RDS
file storing health indicators of 3141 US counties. You may find more information about the meaning of those variables here. Read the .RDS
data using the function readRDS()
readRDS("health.RDS")
health_data<-View(health_data)
One possibility that you might want to explore is the creation of new variables. For instance, the column #EligibleforMedicare2018
refers to the number of people eligible for Medicare per county in 2018. Note that you also have the estimated 2018 population (PopulationEstimate2018
column), so it is possible to create a variable % of people eligible for Medicare
using health_data$your_new_column
after your dataset. I am calling this new column perc_Medicare
, and this new variable will be placed last in your dataset (check it using View(health_data)
).
$perc_Medicare<-health_data$`#EligibleforMedicare2018`/health_data$PopulationEstimate2018
health_datahead(health_data$perc_Medicare)
## [1] 0.1497090 0.1643653 0.1513478 0.2034678 0.1813108 0.1943226
Another way to create columns is using the mutate()
function. We use the pipe %>%
to express a sequence of operations. First, I want to store the results in my old dataset and that is the reason for health_data<-
. Second, the mutate()
operation is on the health_data
dataset. Inside mutate()
, I am calling my new variable perc_Medicare2
. Now, compare perc_Medicare
with perc_Medicare2
- they should be exactly the same.
%>%mutate(perc_Medicare2=`#EligibleforMedicare2018`/PopulationEstimate2018) health_data<-health_data
Merging datasets
In case you want to put the two datasets (income and health) together, you can use join
function. There are some details here. You have 3,275 counties (rows) in HHincome
, but only 3,141 counties in health_data
. Here I will use left_join
restricting the merge to those 3,141 counties inside health_data
. Check the other join types here. Using the function dim()
you will realize that your new data has 3,141 rows and 91 columns.
left_join(health_data, HHincome,by = "FIPStxt")
full_data<-dim(full_data)
## [1] 3141 92
Subsetting/Filtering data
For now, we just want to work with population density per squared mile (PopulationDensityperSqMile2010
) and median household income (Median_Household_Income_2018
). Those to variables are placed in columns 19 and 91 (check your dataset). Since we do not need the rest of the data, I will just create a new one with those two variables and the name of counties (area_name
) and state (Stabr
).
c(91,90,92, 19)]
sub_data<-full_data[head(sub_data)
## # A tibble: 6 x 4
## area_name Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
## <chr> <chr> <dbl> <dbl>
## 1 Kings County, NY NY 60862 35369.
## 2 Queens County, NY NY 68432 20554.
## 3 Bronx County, NY NY 38566 32903.
## 4 Nassau County, NY NY 115301 4705.
## 5 New York County, NY NY 84610 69468.
## 6 Wayne County, MI MI 46440 2974.
Another way to do that is using select()
:
%>%select(area_name, Stabr, Median_Household_Income_2018, PopulationDensityperSqMile2010)
sub_data2<-full_datahead(sub_data2)
## # A tibble: 6 x 4
## area_name Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
## <chr> <chr> <dbl> <dbl>
## 1 Kings County, NY NY 60862 35369.
## 2 Queens County, NY NY 68432 20554.
## 3 Bronx County, NY NY 38566 32903.
## 4 Nassau County, NY NY 115301 4705.
## 5 New York County, NY NY 84610 69468.
## 6 Wayne County, MI MI 46440 2974.
With the new dataset, let’s order it by the median household income values:
%>%arrange(Median_Household_Income_2018) sub_data
## # A tibble: 3,141 x 4
## area_name Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
## <chr> <chr> <dbl> <dbl>
## 1 Wilcox County, AL AL 25385 13.1
## 2 Buffalo County, SD SD 25973 4.1
## 3 Owsley County, KY KY 26278 24.1
## 4 Holmes County, MS MS 26449 25.4
## 5 Perry County, AL AL 26814 14.7
## 6 Issaquena County, MS MS 26936 3.4
## 7 McDowell County, WV WV 27292 41.5
## 8 East Carroll Parish, LA LA 27714 18.4
## 9 Quitman County, MS MS 27767 20.3
## 10 Sumter County, AL AL 27859 15.2
## # ... with 3,131 more rows
As one can see, Wilcox county has the minimum value of household income in this sample. Using the functionfilter()
, you can take a look at counties from Illinois:
%>%filter(Stabr=="IL")%>%arrange(Median_Household_Income_2018) sub_data
## # A tibble: 102 x 4
## area_name Stabr Median_Household_Income_2018 PopulationDensityperSqMile2010
## <chr> <chr> <dbl> <dbl>
## 1 Alexander County, IL IL 35249 35
## 2 Pulaski County, IL IL 38400 30.9
## 3 Saline County, IL IL 40761 65.6
## 4 Jackson County, IL IL 41445 103.
## 5 Hardin County, IL IL 41713 24.3
## 6 Franklin County, IL IL 42071 96.8
## 7 McDonough County, IL IL 42887 55.3
## 8 Pope County, IL IL 43053 12.1
## 9 Gallatin County, IL IL 43330 17.3
## 10 Lawrence County, IL IL 43767 45.2
## # ... with 92 more rows
Hence, Alexander County had the lowest median household income in 2018.
One can save some lines of code. Try to select variables while filtering results:
%>% select(area_name, Stabr,PopulationDensityperSqMile2010, Median_Household_Income_2018)%>%filter(Stabr=="IL")%>%arrange(PopulationDensityperSqMile2010) full_data
## # A tibble: 102 x 4
## area_name Stabr PopulationDensityperSqMile2010 Median_Household_Income_2018
## <chr> <chr> <dbl> <dbl>
## 1 Pope County, IL IL 12.1 43053
## 2 Schuyler County, IL IL 17.3 52018
## 3 Gallatin County, IL IL 17.3 43330
## 4 Henderson County, IL IL 19.4 53632
## 5 Hamilton County, IL IL 19.5 53028
## 6 Jasper County, IL IL 19.6 54552
## 7 Pike County, IL IL 19.8 47815
## 8 Calhoun County, IL IL 20.1 56099
## 9 Stark County, IL IL 20.8 52651
## 10 Scott County, IL IL 21.3 55198
## # ... with 92 more rows
What if you want to sort a variable in descending order? Use desc()
:
%>% select(area_name, Stabr,PopulationDensityperSqMile2010, Median_Household_Income_2018)%>%filter(Stabr=="IL")%>%arrange(desc(PopulationDensityperSqMile2010)) full_data
## # A tibble: 102 x 4
## area_name Stabr PopulationDensityperSqMile2010 Median_Household_Income_2018
## <chr> <chr> <dbl> <dbl>
## 1 Cook County, IL IL 5495. 63347
## 2 DuPage County, IL IL 2800. 93540
## 3 Lake County, IL IL 1586. 87336
## 4 Kane County, IL IL 991. 79834
## 5 Will County, IL IL 810. 86108
## 6 Winnebago County, IL IL 575. 53795
## 7 McHenry County, IL IL 512. 86268
## 8 St. Clair County, IL IL 411. 58222
## 9 Madison County, IL IL 376. 60121
## 10 Kendall County, IL IL 358. 93345
## # ... with 92 more rows
Summarizing data
An important thing to do is to get summary statistics by group. Let’s say you want to group your data by some characteristic in your dataset (that means you want group_by()
), and summarize()
some information among groups. How to that?
Assume you want to check how many ICU beds and Hospitals are in each state - in other words, we need to sum()
all of the counties’ ICU beds and Hospitals in each state. Do this:
%>%group_by(State)%>%summarize(sumICU=sum(`#ICU_beds`, na.rm=T), sumHosp=sum(`#Hospitals`, na.rm=T)) full_data
## # A tibble: 50 x 3
## State sumICU sumHosp
## <chr> <dbl> <dbl>
## 1 Alabama 1533 86
## 2 Arizona 1559 76
## 3 Arkansas 732 74
## 4 California 7338 329
## 5 Colorado 1095 80
## 6 Connecticut 674 30
## 7 Delaware 186 6
## 8 District Of Columbia 314 6
## 9 Florida 5604 178
## 10 Georgia 2508 129
## # ... with 40 more rows
Finally, let’s arrange that data to see the states with the highest numbers of hospitals in the US:
%>%group_by(State)%>%summarize(sumICU=sum(`#ICU_beds`, na.rm=T), sumHosp=sum(`#Hospitals`, na.rm=T))%>%arrange(desc(sumHosp)) full_data
## # A tibble: 50 x 3
## State sumICU sumHosp
## <chr> <dbl> <dbl>
## 1 Texas 6199 384
## 2 California 7338 329
## 3 Florida 5604 178
## 4 Illinois 3144 176
## 5 New York 3952 165
## 6 Pennsylvania 3169 162
## 7 Ohio 3314 156
## 8 Kansas 767 132
## 9 Michigan 2423 130
## 10 Georgia 2508 129
## # ... with 40 more rows