Tidycensus
Tidycensus is a package for interfacing with the US Census Bureau’s data. Using the function get_decennial(), you have access to the 2000 and 2010 decennial US Census, and using get_acs(), you can get data from the 1-year and 5-year American Community Survey.
To get started with tidycensus, you need to install the package and set a Census API key. Request a key from the Census Bureau’s API key signup page. You can use “University of Illinois at Urbana-Champaign” as the organization and your student account (@illinois.edu) as the email address. After that, run census_api_key("your_api_number_goes_here", install = TRUE) and you are set.
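A quick way to confirm that the key is visible to your session is sketched below. It assumes the key was stored with install = TRUE, which writes it to your .Renviron file under the name CENSUS_API_KEY; the object names key and key_is_set are only illustrative.

```r
# Check whether a Census API key is visible to the current R session.
# census_api_key(..., install = TRUE) stores the key in ~/.Renviron as
# CENSUS_API_KEY; restarting R or calling readRenviron("~/.Renviron") loads it.
key <- Sys.getenv("CENSUS_API_KEY")
key_is_set <- nzchar(key)

if (key_is_set) {
  message("Census API key found.")
} else {
  message("No key found. Run census_api_key() with install = TRUE, then restart R.")
}
```

If the message says no key was found right after installing it, reloading the environment file or restarting R is usually enough.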
Let’s examine the variables in the American Community Survey 5-year estimates (acs5) for 2019.
library(tidyverse)
library(tidycensus)
### Get your Census API key and apply the function census_api_key
#census_api_key("put_your_key_here", install =TRUE)
variables <- load_variables(year = 2019, dataset = "acs5")
head(variables)
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 B01001_001 Estimate!!Total: SEX BY AGE
## 2 B01001_002 Estimate!!Total:!!Male: SEX BY AGE
## 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years SEX BY AGE
## 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years SEX BY AGE
## 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE
## 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE
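The full variables table has thousands of rows, so it is usually easier to filter it than to scroll through it. The sketch below shows one possible approach with dplyr and stringr. To stay self-contained it uses a small hand-copied stand-in (vars_demo, an illustrative name) built from the rows printed above, rather than calling the API.

```r
library(dplyr)
library(stringr)
library(tibble)

# Rows copied by hand from the printed output above; they stand in for the
# full table returned by load_variables().
vars_demo <- tibble(
  name = c("B01001_001", "B01001_002", "B01001_003", "B01001_004"),
  label = c("Estimate!!Total:",
            "Estimate!!Total:!!Male:",
            "Estimate!!Total:!!Male:!!Under 5 years",
            "Estimate!!Total:!!Male:!!5 to 9 years"),
  concept = "SEX BY AGE"
)

# Keep only the variables whose label mentions "years" (case-insensitive)
age_vars <- vars_demo %>%
  filter(str_detect(label, regex("years", ignore_case = TRUE)))

age_vars
```

The same filter() call should work on the real variables object, for example by searching the concept column for a phrase such as "median household income".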
Working with tidycensus, you can get data for a range of geographies: state, county, tract, block, etc. Here we work with ACS data for census tracts in Cook County, IL, in 2014-2018 (the 5-year period ending in the requested year, 2018). Every variable has a specific code, so you may want to come back to the variables table to check what is available and which code corresponds to which variable. Note that get_acs() returns two columns for each variable: the estimate and the margin of error.
chi_data <- get_acs(geography = "tract",
                    state = "IL",
                    county = "Cook County",
                    year = 2018,
                    survey = "acs5",  # get_acs() calls this argument `survey`, not `dataset`
                    output = "wide",
                    variables = c(medincome = "B19013_001", housevalue = "B25077_001"))
head(chi_data)
## # A tibble: 6 x 6
## GEOID NAME medincomeE medincomeM housevalueE housevalueM
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 170314302~ Census Tract 4302, Cook County~ 22945 8863 439600 58172
## 2 170314305~ Census Tract 4305, Cook County~ 18432 2442 161600 25885
## 3 170314314~ Census Tract 4314, Cook County~ 28952 8079 76000 36879
## 4 170314407~ Census Tract 4407, Cook County~ 37228 17809 155700 11612
## 5 170314701~ Census Tract 4701, Cook County~ 25287 7241 98300 6500
## 6 170318214~ Census Tract 8214.02, Cook Cou~ 49300 9327 108600 8574
The E after the variable’s name stands for “estimate”: that is the value we are looking for. The M after the variable’s name is the margin of error (MOE); the tidycensus documentation has more information about margins of error in the ACS. With the following code, you get a scatter plot of median household income and median home value for census tracts in Cook County, IL.
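To make the margin of error more concrete, the sketch below turns the estimate and MOE columns into an interval and a rough reliability measure. It assumes the default 90% confidence level used by get_acs(), for which the standard error is the MOE divided by 1.645. The three rows are copied from the output above, and the names moe_demo, lower, upper, se and cv are illustrative.

```r
library(dplyr)
library(tibble)

# First three tracts from the printed output above (median household income)
moe_demo <- tibble(
  tract      = c("4302", "4305", "4314"),
  medincomeE = c(22945, 18432, 28952),
  medincomeM = c(8863, 2442, 8079)
)

moe_demo <- moe_demo %>%
  mutate(
    lower = medincomeE - medincomeM,  # lower bound of the 90% interval
    upper = medincomeE + medincomeM,  # upper bound of the 90% interval
    se    = medincomeM / 1.645,       # standard error implied by a 90% MOE
    cv    = 100 * se / medincomeE     # coefficient of variation, in percent
  )

moe_demo
```

Tract-level estimates often come with wide intervals, so a high coefficient of variation is a hint to treat that estimate with caution.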
library(ggthemes)
library(ggrepel)
ggplot(chi_data, aes(x = medincomeE, y = housevalueE)) +
  geom_point(color = "#6794a7", size = 3, alpha = .7) +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "#014d64") +
  scale_x_continuous(name = "Median Household Income") +
  scale_y_continuous(name = "Median Home Value") +
  theme_economist(base_size = 17) +
  theme(axis.text = element_text(size = 15),
        axis.title = element_text(size = 15, face = "bold"),
        panel.grid.major.x = element_line(size = .05, color = "white"))
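The line drawn by stat_smooth(method = "lm") is an ordinary least squares fit, so the same relationship can be inspected numerically with lm(). The sketch below uses only the six tracts printed earlier (tracts_demo is an illustrative name), so its coefficients will not match a fit on the full chi_data; it only shows the mechanics.

```r
# Six tracts copied from the printed chi_data output above
tracts_demo <- data.frame(
  medincomeE  = c(22945, 18432, 28952, 37228, 25287, 49300),
  housevalueE = c(439600, 161600, 76000, 155700, 98300, 108600)
)

# Same model formula as the smoother in the plot: home value on income
fit <- lm(housevalueE ~ medincomeE, data = tracts_demo)

# Intercept and slope of the fitted line
coef(fit)
```

Replacing tracts_demo with chi_data should give the line shown in the plot, since lm() drops rows with missing values by default.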
Now, let’s get information about median rent as a percentage of household income by state in 2005-2009. First, load the ACS 2005-2009 variables:
var09 <- load_variables(year = 2009, dataset = "acs5")
Knowing the variable’s code, you can easily import the information you want and make a table, for instance. As you can see, in the period 2005-2009, Florida had the highest value for the median gross rent as a share of household income:
state09 <- get_acs(geography = "state",
                   variables = c(rent_income = "B25071_001"),
                   survey = "acs5",  # get_acs() calls this argument `survey`, not `dataset`
                   output = "wide",
                   year = 2009)
state09 %>% select(NAME, rent_incomeE) %>% arrange(desc(rent_incomeE)) %>% slice(1:10)
## # A tibble: 10 x 2
## NAME rent_incomeE
## <chr> <dbl>
## 1 Florida 33.6
## 2 Puerto Rico 33.2
## 3 California 32.3
## 4 Hawaii 32
## 5 Michigan 31.9
## 6 Mississippi 31.3
## 7 Louisiana 30.6
## 8 New York 30.5
## 9 Vermont 30.5
## 10 New Jersey 30.4
Did those values change a lot? Let’s check these numbers for the period 2015-2019:
state19 <- get_acs(geography = "state",
                   variables = c(rent_income = "B25071_001"),
                   survey = "acs5",  # get_acs() calls this argument `survey`, not `dataset`
                   output = "wide",
                   year = 2019)
state19 %>% select(NAME, rent_incomeE) %>% arrange(desc(rent_incomeE)) %>% slice(1:10)
## # A tibble: 10 x 2
## NAME rent_incomeE
## <chr> <dbl>
## 1 Florida 33.3
## 2 California 32.5
## 3 Hawaii 32.5
## 4 Louisiana 32.5
## 5 Puerto Rico 32.3
## 6 New York 31.2
## 7 Connecticut 30.9
## 8 New Jersey 30.8
## 9 Colorado 30.5
## 10 Oregon 30.3
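Rather than comparing the two tables by eye, one option is to join them and compute the change directly. The sketch below does this for the seven states that appear in both top-ten lists, with values copied from the outputs above. The names top09, top19, rent09, rent19 and rent_change are illustrative.

```r
library(dplyr)
library(tibble)

# States present in both printed top-ten tables, values copied from above
top09 <- tibble(
  NAME   = c("Florida", "Puerto Rico", "California", "Hawaii",
             "Louisiana", "New York", "New Jersey"),
  rent09 = c(33.6, 33.2, 32.3, 32, 30.6, 30.5, 30.4)
)
top19 <- tibble(
  NAME   = c("Florida", "California", "Hawaii", "Louisiana",
             "Puerto Rico", "New York", "New Jersey"),
  rent19 = c(33.3, 32.5, 32.5, 32.5, 32.3, 31.2, 30.8)
)

# Join on state name and compute the change in percentage points
rent_change <- inner_join(top09, top19, by = "NAME") %>%
  mutate(change = rent19 - rent09) %>%
  arrange(desc(change))

rent_change
```

With the full state09 and state19 data frames, the same join would cover every state, not only those in the top ten.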
Finally, let’s look at population growth in US states over the period 2000-2010. This time, we use decennial Census data via get_decennial(). The variable code for Total Population is P001001. We get that variable for 2000 and 2010 and merge the two data frames. After that, we calculate the growth rate and arrange the data to see the top and bottom five states in terms of population growth between those years.
state_pop00 <- get_decennial(geography = "state",
                             variables = c(pop00 = "P001001"),
                             output = "wide",
                             year = 2000)
state_pop10 <- get_decennial(geography = "state",
                             variables = c(pop10 = "P001001"),
                             output = "wide",
                             year = 2010)
state_pop <- left_join(state_pop00, state_pop10, by = c("NAME", "GEOID"))
state_pop$pop_growth <- (state_pop$pop10 - state_pop$pop00) / state_pop$pop00 * 100
### TOP 5
state_pop %>% select(NAME, pop_growth) %>% arrange(desc(pop_growth)) %>% slice(1:5)
## # A tibble: 5 x 2
## NAME pop_growth
## <chr> <dbl>
## 1 Nevada 35.1
## 2 Arizona 24.6
## 3 Utah 23.8
## 4 Idaho 21.1
## 5 Texas 20.6
### BOTTOM 5
state_pop %>% select(NAME, pop_growth) %>% arrange(pop_growth) %>% slice(1:5)
## # A tibble: 5 x 2
## NAME pop_growth
## <chr> <dbl>
## 1 Puerto Rico -2.17
## 2 Michigan -0.551
## 3 Rhode Island 0.405
## 4 Louisiana 1.44
## 5 Ohio 1.62
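The growth rate used above is the percentage change (pop10 - pop00) / pop00 * 100. As a small sanity check, the sketch below applies the same formula with mutate() to made-up numbers. The regions and populations are hypothetical, chosen only so the arithmetic is easy to verify by hand.

```r
library(dplyr)
library(tibble)

# Hypothetical populations, made up for illustration only
pop_demo <- tibble(
  NAME  = c("Region A", "Region B", "Region C"),
  pop00 = c(1000, 2000, 4000),
  pop10 = c(1200, 1900, 4000)
)

# Percentage growth between the two years, largest first
pop_demo <- pop_demo %>%
  mutate(pop_growth = (pop10 - pop00) / pop00 * 100) %>%
  arrange(desc(pop_growth))

pop_demo
```

Here Region A grows by 20%, Region C does not change, and Region B shrinks by 5%, which mirrors how states with negative growth end up at the bottom of the ranking.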