Intro to Tidycensus

Marcelino Guerra

Last update: 2/25/2021

Tidycensus

Tidycensus is an R package for interfacing with the US Census Bureau’s data. The function get_decennial() gives you access to the 2000 and 2010 decennial US Census, and get_acs() retrieves data from the 1-year and 5-year American Community Survey (ACS).

To get started with tidycensus, you need to install the package and set a Census API key. Request a key from the Census Bureau’s API key signup page. You can use “University of Illinois at Urbana-Champaign” as the organization and your student account (@illinois.edu) as the email address. After that, run census_api_key("your_api_number_goes_here", install = TRUE) and you are set. With install = TRUE, the key is saved to your .Renviron file, so you only need to do this once.
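A quick way to confirm that the key was stored is to look for it in the environment. The sketch below assumes the key was saved under the environment variable CENSUS_API_KEY, which is where census_api_key(..., install = TRUE) writes it; the name has_key is only illustrative.

```r
# Sys.getenv() returns "" when the variable is not set
has_key <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (has_key) {
  message("Census API key found.")
} else {
  # A key installed during this session is only picked up after
  # restarting R or re-reading the .Renviron file
  message("No key found. Run census_api_key() and then readRenviron(\"~/.Renviron\").")
}
```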

Let’s examine the variables in the American Community Survey 5-year estimates (acs5) for 2015-2019, which tidycensus identifies by the end year, 2019.

library(tidyverse)
library(tidycensus)
### Get your Census API key and register it once with census_api_key()
# census_api_key("put_your_key_here", install = TRUE)
variables <- load_variables(year = 2019, dataset = "acs5")
head(variables)

## # A tibble: 6 x 3
##   name       label                                   concept   
##   <chr>      <chr>                                   <chr>     
## 1 B01001_001 Estimate!!Total:                        SEX BY AGE
## 2 B01001_002 Estimate!!Total:!!Male:                 SEX BY AGE
## 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years  SEX BY AGE
## 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years   SEX BY AGE
## 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE
## 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE
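Because load_variables() returns an ordinary tibble, the usual dplyr and stringr tools can be used to search it. As a minimal sketch, the code below retypes the six rows printed above into a small tibble (variables_head, a name used only for this example) so that it runs without an API key; the same filter() call should work on the full variables table.

```r
library(dplyr)
library(stringr)
library(tibble)

# The six rows shown in the output above, retyped by hand
variables_head <- tribble(
  ~name,        ~label,                                    ~concept,
  "B01001_001", "Estimate!!Total:",                        "SEX BY AGE",
  "B01001_002", "Estimate!!Total:!!Male:",                 "SEX BY AGE",
  "B01001_003", "Estimate!!Total:!!Male:!!Under 5 years",  "SEX BY AGE",
  "B01001_004", "Estimate!!Total:!!Male:!!5 to 9 years",   "SEX BY AGE",
  "B01001_005", "Estimate!!Total:!!Male:!!10 to 14 years", "SEX BY AGE",
  "B01001_006", "Estimate!!Total:!!Male:!!15 to 17 years", "SEX BY AGE"
)

# Keep the rows whose label mentions "Male"
male_vars <- variables_head %>%
  filter(str_detect(label, "Male"))

male_vars$name
```

On the full table, matching a keyword against the concept column instead of the label column is a convenient way to narrow thousands of rows down to a handful of candidate codes.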

Working with tidycensus, you can get data for a range of geographies: state, county, tract, block, and so on. Here we work with ACS data for census tracts in Cook County, IL, using the 2014-2018 5-year estimates (year = 2018). Each variable has a specific code, so you may want to return to the variables table to check what is available and which code corresponds to which variable. Note that get_acs() returns two values for each variable: the estimate and its margin of error.

chi_data <- get_acs(geography = "tract",
                    state = "IL",
                    county = "Cook County",
                    year = 2018,
                    survey = "acs5",  # get_acs() names this argument survey, not dataset
                    output = "wide",
                    variables = c(medincome = "B19013_001", housevalue = "B25077_001"))
head(chi_data)

## # A tibble: 6 x 6
##   GEOID      NAME                            medincomeE medincomeM housevalueE housevalueM
##   <chr>      <chr>                                <dbl>      <dbl>       <dbl>       <dbl>
## 1 170314302~ Census Tract 4302, Cook County~      22945       8863      439600       58172
## 2 170314305~ Census Tract 4305, Cook County~      18432       2442      161600       25885
## 3 170314314~ Census Tract 4314, Cook County~      28952       8079       76000       36879
## 4 170314407~ Census Tract 4407, Cook County~      37228      17809      155700       11612
## 5 170314701~ Census Tract 4701, Cook County~      25287       7241       98300        6500
## 6 170318214~ Census Tract 8214.02, Cook Cou~      49300       9327      108600        8574

The E suffix on a variable’s name stands for “estimate”, which is the value we are looking for. The M suffix refers to the margin of error (MOE), which get_acs() reports at a 90 percent confidence level by default. The following code draws a scatter plot of median household income against median home value for census tracts in Cook County, IL.

library(ggthemes)
library(ggrepel)

# One point per census tract, with a fitted linear trend line
ggplot(chi_data, aes(x = medincomeE, y = housevalueE)) +
  geom_point(color = "#6794a7", size = 3, alpha = .7) +
  # se = FALSE hides the confidence band around the fitted line
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "#014d64") +
  scale_x_continuous(name = "Median Household Income") +
  scale_y_continuous(name = "Median Home Value") +
  theme_economist(base_size = 17) +
  theme(axis.text = element_text(size = 15),
        axis.title = element_text(size = 15, face = "bold"),
        panel.grid.major.x = element_line(size = .05, color = "white"))
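Returning to the margin-of-error columns: ACS margins of error are published at a 90 percent confidence level, so estimate ± MOE gives an approximate 90 percent interval, and dividing the MOE by 1.645 gives an approximate standard error. The sketch below applies this to the first three tracts printed earlier, retyped by hand into a tibble named chi_head (a name introduced here for illustration) so that it runs without an API key.

```r
library(dplyr)
library(tibble)

# First three rows of the chi_data output shown earlier, retyped by hand
chi_head <- tribble(
  ~tract,              ~medincomeE, ~medincomeM,
  "Census Tract 4302", 22945,       8863,
  "Census Tract 4305", 18432,       2442,
  "Census Tract 4314", 28952,       8079
)

income_ci <- chi_head %>%
  mutate(lower = medincomeE - medincomeM,  # lower bound of the 90% interval
         upper = medincomeE + medincomeM,  # upper bound of the 90% interval
         se    = medincomeM / 1.645,       # approximate standard error
         cv    = 100 * se / medincomeE)    # coefficient of variation, in percent

income_ci
```

For the first tract the interval runs from 14,082 to 31,808 dollars, a reminder that tract-level estimates can be quite imprecise.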

Now, let’s get information about median gross rent as a percentage of household income by state in 2005-2009. First, load the ACS 2005-2009 variables:

var09 <- load_variables(year = 2009, dataset = "acs5")

Once you know a variable’s code, you can import the data you want and, for instance, build a table from it. As you can see, in the period 2005-2009, Florida had the highest median gross rent as a share of household income:

state09 <- get_acs(geography = "state",
                   variables = c(rent_income = "B25071_001"),
                   survey = "acs5",  # get_acs() names this argument survey, not dataset
                   output = "wide",
                   year = 2009)
state09 %>% select(NAME, rent_incomeE) %>% arrange(desc(rent_incomeE)) %>% slice(1:10)

## # A tibble: 10 x 2
##    NAME        rent_incomeE
##    <chr>              <dbl>
##  1 Florida             33.6
##  2 Puerto Rico         33.2
##  3 California          32.3
##  4 Hawaii              32  
##  5 Michigan            31.9
##  6 Mississippi         31.3
##  7 Louisiana           30.6
##  8 New York            30.5
##  9 Vermont             30.5
## 10 New Jersey          30.4

Did those values change a lot? Let’s check these numbers for the period 2015-2019:

state19 <- get_acs(geography = "state",
                   variables = c(rent_income = "B25071_001"),
                   survey = "acs5",  # get_acs() names this argument survey, not dataset
                   output = "wide",
                   year = 2019)
state19 %>% select(NAME, rent_incomeE) %>% arrange(desc(rent_incomeE)) %>% slice(1:10)

## # A tibble: 10 x 2
##    NAME        rent_incomeE
##    <chr>              <dbl>
##  1 Florida             33.3
##  2 California          32.5
##  3 Hawaii              32.5
##  4 Louisiana           32.5
##  5 Puerto Rico         32.3
##  6 New York            31.2
##  7 Connecticut         30.9
##  8 New Jersey          30.8
##  9 Colorado            30.5
## 10 Oregon              30.3
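To compare the two periods state by state, the tables can be joined on the state name. As a sketch, the code below retypes the two top-ten tables printed above (top09 and top19 are names introduced for this example) and uses inner_join(), so only states that appear in both lists are kept. With the full state09 and state19 objects, the same kind of join on NAME would cover every state.

```r
library(dplyr)
library(tibble)

# Top ten for 2005-2009, retyped from the output above
top09 <- tribble(
  ~NAME,         ~rent09,
  "Florida",     33.6,
  "Puerto Rico", 33.2,
  "California",  32.3,
  "Hawaii",      32.0,
  "Michigan",    31.9,
  "Mississippi", 31.3,
  "Louisiana",   30.6,
  "New York",    30.5,
  "Vermont",     30.5,
  "New Jersey",  30.4
)

# Top ten for 2015-2019, retyped from the output above
top19 <- tribble(
  ~NAME,         ~rent19,
  "Florida",     33.3,
  "California",  32.5,
  "Hawaii",      32.5,
  "Louisiana",   32.5,
  "Puerto Rico", 32.3,
  "New York",    31.2,
  "Connecticut", 30.9,
  "New Jersey",  30.8,
  "Colorado",    30.5,
  "Oregon",      30.3
)

# Keep states present in both lists and compute the change in percentage points
rent_change <- inner_join(top09, top19, by = "NAME") %>%
  mutate(change = rent19 - rent09) %>%
  arrange(desc(change))

rent_change
```

Among the seven states that appear in both lists, Louisiana shows the largest increase (30.6 to 32.5), while Florida and Puerto Rico declined slightly. These differences ignore the margins of error, so small changes should be read with caution.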

Finally, let’s look at population growth in US states over the period 2000-2010. This time we use decennial Census data through get_decennial(). The variable code for total population is P001001. We retrieve it for 2000 and 2010 and merge the two data frames. We then calculate the percentage growth rate and sort the data to see the top and bottom five in terms of population growth between those years. Note that the state geography also includes the District of Columbia and Puerto Rico, which is why Puerto Rico appears in the results.

state_pop00<-get_decennial(geography = "state",
                         variables=c(pop00="P001001"),
                         output="wide",
                         year=2000)

state_pop10<-get_decennial(geography = "state",
                         variables=c(pop10="P001001"),
                         output="wide",
                         year=2010)

state_pop<-left_join(state_pop00, state_pop10, by=c("NAME","GEOID"))

state_pop$pop_growth<-(state_pop$pop10-state_pop$pop00)/state_pop$pop00*100
### TOP 5
state_pop%>%select(NAME, pop_growth)%>%arrange(desc(pop_growth))%>%slice(1:5)

## # A tibble: 5 x 2
##   NAME    pop_growth
##   <chr>        <dbl>
## 1 Nevada        35.1
## 2 Arizona       24.6
## 3 Utah          23.8
## 4 Idaho         21.1
## 5 Texas         20.6

### BOTTOM 5
state_pop%>%select(NAME, pop_growth)%>%arrange(pop_growth)%>%slice(1:5)

## # A tibble: 5 x 2
##   NAME         pop_growth
##   <chr>             <dbl>
## 1 Puerto Rico      -2.17 
## 2 Michigan         -0.551
## 3 Rhode Island      0.405
## 4 Louisiana         1.44 
## 5 Ohio              1.62
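The growth calculation and the two rankings can also be written as a single dplyr pipeline. The sketch below uses made-up populations for three hypothetical regions (A, B, and C) purely to show the mechanics; toy_pop and toy_growth are names introduced for this example, and slice_max() and slice_min() require dplyr 1.0.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical populations, not Census figures
toy_pop <- tribble(
  ~NAME, ~pop00, ~pop10,
  "A",   100,    120,
  "B",   200,    190,
  "C",   300,    330
)

# Percentage growth between the two years
toy_growth <- toy_pop %>%
  mutate(pop_growth = (pop10 - pop00) / pop00 * 100)

fastest <- toy_growth %>% slice_max(pop_growth, n = 1)  # highest growth
slowest <- toy_growth %>% slice_min(pop_growth, n = 1)  # lowest growth
```

With the real data, state_pop would take the place of toy_pop, and n = 5 would return the same top and bottom five as the tables above.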