Econ 414 Final Project - SPRING 2021

Getting the data from tidycensus

You are going to use the tidycensus package to get data from neighborhoods within a metro area (your choice!). The goal of the project is to identify some short-run changes in these neighborhoods. You are interested in the following variables:

Variables Description
Variable Definition Code
total_pop Total population B03002_001
white Total White (alone and non-hispanic) B03002_003
black Total Black/African American (alone and non-hispanic) B03002_004
asian Total Asian (alone and non-hispanic) B03002_006
hisp Total Hispanic or Latino B03002_012
foreign Total foreign born B05002_013
educ_tot Total population (>25) B15003_001
bach Total bachelor’s degree B15003_022
master Total master’s degree B15003_023
prof Total professional school degree B15003_024
doc Total doctorate degree B15003_025
med_inc Median housheold income B19013_001
rent_inc_share Median gross rent as a percentage of income B25071_001
med_home_value Median home value B25077_001
pov_tot Total population (poverty status) B17001_001
below_pov Total population with income below the poverty line B17001_002
tenure_tot Total housing units (tenure) B25003_001
house_renter Total housing units renter-occupied B25003_003
occupancy_tot Total housing units (occupancy status) B25002_001
house_vacant Total vacant housing units B25002_003
build_age Median year structure built B25035_001

Using the following codes, I will request the data using get_acs(). Since we are working with metro areas, we might need to get data from multiple states. In this example, I want to get data from Washington-Arlington-Alexandria, DC-VA-MD-WV MSA. Hence, I have states=c("DC","VA", "MD", "WV"). The geography is “tract” because we want to look at neighborhoods. In case you are working with Chicago-Naperville-Elgin, IL-IN-WI, you would need states=c("IL","IN", "WI").

Also, I am calling get_acs() twice: one for 2008-2012 and another for 2015-2019. To map the data, you need geography=TRUE in one of your calls (no need to get geometry twice).

library(tidycensus)
library(tidyverse)
library(tigris)
options(tigris_class = "sf")
options(tigris_use_cache = TRUE)

## You can always load_variables to check name and concept
#var2012_acs5<-load_variables(2012, dataset="acs5")
#var2019_acs5<-load_variables(2019, dataset="acs5")

Variables12 <- get_acs(geography = "tract", 
                       variables = c(total_pop12="B03002_001",
                                     white12="B03002_003",
                                     black12="B03002_004",
                                     asian12="B03002_006",
                                     hisp12="B03002_012",
                                     foreign12="B05002_013",
                                     educ_tot12="B15003_001",
                                     bach12="B15003_022",
                                     master12="B15003_023",
                                     prof12="B15003_024",
                                     doc12="B15003_025",
                                     med_inc12="B19013_001",
                                     rent_inc_share12="B25071_001",
                                     med_home_value12="B25077_001",
                                     pov_tot12="B17001_001",
                                     below_pov12="B17001_002",
                                     tenure_tot12="B25003_001",
                                     house_renter12="B25003_003",
                                     occupancy_tot12="B25002_001",
                                     house_vacant12="B25002_003",
                                     build_age12="B25035_001"), 
                       state = c("DC","VA", "MD", "WV"), 
                       geometry = F,
                       output="wide",
                       year=2012)

Variables19 <- get_acs(geography = "tract", 
                       variables = c(total_pop19="B03002_001",
                                     white19="B03002_003",
                                     black19="B03002_004",
                                     asian19="B03002_006",
                                     hisp19="B03002_012",
                                     foreign19="B05002_013",
                                     educ_tot19="B15003_001",
                                     bach19="B15003_022",
                                     master19="B15003_023",
                                     prof19="B15003_024",
                                     doc19="B15003_025",
                                     med_inc19="B19013_001",
                                     rent_inc_share19="B25071_001",
                                     med_home_value19="B25077_001",
                                     pov_tot19="B17001_001",
                                     below_pov19="B17001_002",
                                     tenure_tot19="B25003_001",
                                     house_renter19="B25003_003",
                                     occupancy_tot19="B25002_001",
                                     house_vacant19="B25002_003",
                                     build_age19="B25035_001"), 
                       state = c("DC","VA", "MD", "WV"), 
                       geometry = T,
                       output="wide",
                       year=2019)

After getting all the data, let’s do some cleaning. There is no need to keep the columns ending with “M” since they represent the margin of error, and we are looking for the estimates (variables ending with E). So we drop them and merge those two files into total_census.

Variables12<-Variables12 %>% select(-ends_with('M'))
Variables19<-Variables19 %>% select(-ends_with('M'))

total_census<-left_join(Variables19, Variables12)

Subsetting for metro area

We will be using additional packages in this task. tigris helps with the MSA boundaries, sf with the spatial join, and mapview is a quick way to map (I prefer tmap though). I also show you how to create a static map with ggplot().

## If you don't have them, first install.packages()
library(sf)
library(mapview)

The function core_based_statistical_areas() gives you all the CBAs in the US . When I search for Washington-Arlington-Alexandria, I found the GEOID 47900. Then, to get the specific boundaries for Washington-Arlington-Alexandria, DC-VA-MD-WV MSA, filter the metros by the specific GEOID of the MSA. For instance, if you are working with Chicago-Naperville-Elgin, you would filter() by `GEOID %in% c(“16980”).

Finally, st_join subsets all the census tracts within the MSA boundary (WA_MSA).

metros <- core_based_statistical_areas(cb = TRUE)
#View(metros) to search the specific GEOID of your MSA
WA_MSA<- metros%>%
  filter(GEOID %in% c("47900")) %>%
  select(metro_name = NAME)

dc_data <- st_join(total_census, WA_MSA, join = st_within,left = FALSE)

Mapping information

Before mapping, let’s create the desired shares and organize the data frame for 2012 and 2019. First, the Median year structure built for the ACS 5-year 2015-2019 gives zero to the buildings constructed before 1939, and we need to change it to compare that with data from 2012. Then, calculate the shares and bring prices from 2012 to 2019.

dc_data$build_age19E<-ifelse(dc_data$build_age19E==0, 1939, dc_data$build_age19E)
dc_data<-dc_data%>%
  mutate(white_share12=white12E/total_pop12E,
         white_share19=white19E/total_pop19E,
         black_share12=black12E/total_pop12E,
         black_share19=black19E/total_pop19E,
         hisp_share12=hisp12E/total_pop12E,
         hisp_share19=hisp19E/total_pop19E,
         asian_share12=asian12E/total_pop12E,
         asian_share19=asian19E/total_pop19E,
         foreign_share12=foreign12E/total_pop12E,
         foreign_share19=foreign19E/total_pop19E,
         bachelor_share12=(bach12E+master12E+prof12E+doc12E)/educ_tot12E,
         bachelor_share19=(bach19E+master19E+prof19E+doc19E)/educ_tot19E,
         med_inc12E=med_inc12E*1.1, ## Bring 2012 dollars to 2019 dollars 
         med_home_value12E=med_home_value12E*1.1, ## Bring 2012 dollars to 2019 dollars 
         poverty_share12=below_pov12E/pov_tot12E,
         poverty_share19=below_pov19E/pov_tot19E,
         share_renter12=house_renter12E/tenure_tot12E,
         share_renter19=house_renter19E/tenure_tot19E,
         vacant_share12=house_vacant12E/occupancy_tot12E,
         vacant_share19=house_vacant19E/occupancy_tot19E,
         med_age12=2012-dc_data$build_age12E,
         med_age19=2019-dc_data$build_age19E)

With the data ready, it is straightforward to map some variables:

library(patchwork)
map_pov12<-dc_data%>%ggplot(aes(fill = poverty_share12)) + 
  geom_sf(color = NA) + 
  coord_sf(crs = 26911) + 
  scale_fill_viridis_c(option = "magma") 

map_pov19<-dc_data%>%ggplot(aes(fill = poverty_share19)) + 
  geom_sf(color = NA) + 
  coord_sf(crs = 26911) + 
  scale_fill_viridis_c(option = "magma") 

map_pov12+map_pov19

If you want a quick dynamic map:

library(leafpop)
mapviewOptions(basemaps ="OpenStreetMap.DE")
mapview(
  dc_data, 
  zcol = "med_inc19E", 
  legend = T,
  popup = 
     popupTable(dc_data,zcol=c("med_inc19E","poverty_share19", "poverty_share12","med_home_value19E")))

Making tables

To compare neighborhoods’ characteristics over time, create tables summarizing important aspects about demographics, education, and housing. Here I will separate areas with poverty levels above 40% and summarize information for 2012. In the Final Project, you also need to do that for 2019.

Identifying the high-poverty tracts in 2012:

dc_data$HP12<-ifelse(dc_data$poverty_share12>=.4, "High-Poverty tracts", "Other tracts")

dc_data12<-dc_data%>%select(contains("12")|contains("12E"))
dc_data12<-dc_data12%>%drop_na()

Summarizing neighborhood characteristics:

options(scipen = 999, digits=1)
library(kableExtra)

HP12<-dc_data12%>%
  filter(HP12=="High-Poverty tracts")

table_HP12<-HP12%>%summarise(`Median household income (U$ 2019)` = weighted.mean(med_inc12E, total_pop12E),
               `Population`=sum(total_pop12E),
               `Population below poverty line`=sum(below_pov12E),
               `% White (non-hispanic)`=weighted.mean(white_share12, total_pop12E)*100,
               `% Black`=weighted.mean(black_share12, total_pop12E)*100,
               `% Hispanic`=weighted.mean(hisp_share12, total_pop12E)*100,
               `% Asian`=weighted.mean(asian_share12, total_pop12E)*100,
               `% Foreign born`=weighted.mean(foreign_share12, total_pop12E)*100,
               `% Adults with bachelor's degree or higher`=weighted.mean(bachelor_share12, educ_tot12E)*100,
               `Rent as a share of income (%)`=weighted.mean(rent_inc_share12E, total_pop12E),
               `Median home value (U$ 2019)`=weighted.mean(med_home_value12E, total_pop12E),
               `% of renter-occupied`=weighted.mean(share_renter12, tenure_tot12E)*100,
               `Vacancy rate`=weighted.mean(vacant_share12, occupancy_tot12E)*100,
               `Median year structure built`=median(build_age12E),
               `Observations`=n())


other12<-dc_data12%>%
  filter(HP12=="Other tracts")

table_other12<-other12%>%summarise(`Median household income (U$ 2019)` = weighted.mean(med_inc12E, total_pop12E),
                            `Population`=sum(total_pop12E),
                            `Population below poverty line`=sum(below_pov12E),
                            `% White (non-hispanic)`=weighted.mean(white_share12, total_pop12E)*100,
                            `% Black`=weighted.mean(black_share12, total_pop12E)*100,
                            `% Hispanic`=weighted.mean(hisp_share12, total_pop12E)*100,
                            `% Asian`=weighted.mean(asian_share12, total_pop12E)*100,
                            `% Foreign born`=weighted.mean(foreign_share12, total_pop12E)*100,
                            `% Adults with bachelor's degree or higher`=weighted.mean(bachelor_share12, educ_tot12E)*100,
                            `Rent as a share of income (%)`=weighted.mean(rent_inc_share12E, total_pop12E),
                            `Median home value (U$ 2019)`=weighted.mean(med_home_value12E, total_pop12E),
                            `% of renter-occupied`=weighted.mean(share_renter12, tenure_tot12E)*100,
                            `Vacancy rate`=weighted.mean(vacant_share12, occupancy_tot12E)*100,
                            `Median year structure built`=median(build_age12E),
                            `Observations`=n())

Making the 2012 table:

table_HP12<-t(table_HP12)
table_other12<-t(table_other12)

table2012<-data.frame(cbind(table_HP12, table_other12))
names(table2012)[1]<-"High-poverty tracts"
names(table2012)[2]<-"Other tracts"

table2012 %>%
  kbl(caption = "Washington-Arlington-Alexandria neighborhood characteristics - 2012") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Washington-Arlington-Alexandria neighborhood characteristics - 2012
High-poverty tracts Other tracts
Median household income (U$ 2019) 26912 107550
Population 76843 5379244
Population below poverty line 27228 401984
% White (non-hispanic) 24 49
% Black 68 25
% Hispanic 3 14
% Asian 4 9
% Foreign born 5 22
% Adults with bachelor’s degree or higher 16 47
Rent as a share of income (%) 40 30
Median home value (U$ 2019) 328998 450051
% of renter-occupied 82 35
Vacancy rate 17 7
Median year structure built 1961 1976
Observations 21 1273

Is there any tipping-like behavior?

Since 1990, differences in residential location between minorities and whites occur due to white flight. Card, Mas and Rothstein (2008) provide evidence that white population flows exhibited tipping-like behavior in most U.S. cities, from 1970 to 2000. With tract-level data in 2012 and 2019, it is possible to run a visual inspection on changes in the white population and the minority share seven years ago.

By no means we are replicating Card, Mas and Rothstein (2008), but one can wonder if there was “white flight” in the past years plotting the mean percentage changes in the white population of Census tracts from 2012 to 2019 against the minority share (non-whites) in 2012. First, prepare the data to get the change and the share and group tracts into cells of width 1% by the 2012 minority. I do that by rounding the shares to the nearest integer.

tipping<-dc_data%>%select(GEOID, NAME, white12E, white19E, total_pop12E, total_pop19E)
tipping<-st_drop_geometry(tipping)
tipping<-tipping%>%mutate(minority_share12=(total_pop12E-white12E)/total_pop12E*100)
tipping$group<-round(tipping$minority_share12)

tipping_plot<-tipping%>%select(white12E, white19E, total_pop12E, total_pop19E, group)
tipping_plot<-tipping_plot%>%drop_na()
tipping_plot<-tipping_plot%>%group_by(group)%>%summarise_all(.funs = c(sum="sum"))
tipping_plot<-tipping_plot%>%mutate(minority_share12=(total_pop12E_sum-white12E_sum)/total_pop12E_sum*100, 
                                    white_change=(white19E_sum-white12E_sum)/total_pop12E_sum*100)

It is simple to plot that information. There is no visible tipping point for Washington MSA: you actually see white inflows at many minority share levels.

options(digits=2)
library(ggthemes)
library(plotly)
g1<-ggplot(tipping_plot, aes(group, white_change))+ 
  theme_fivethirtyeight() +
  geom_point(size=5, alpha=.5)+ 
  geom_hline(yintercept = 0, linetype = 'dashed')+ 
  labs(x = "Percent non-white in tract in 2008-2012", 
       y="Change in white population - 20012-2019")+ 
  scale_x_continuous(breaks = seq(0,100,5))+
  theme(axis.text=element_text(size=18),
        axis.title=element_text(size=18,face="bold"), 
        legend.text=element_text(size=15),
        panel.grid.major = element_line( size=.2))+
  theme(legend.title=element_blank())

ggplotly(g1, tooltip = c("x", "y"))

ESDA - Poverty share in 2019

Let’s check the presence of spatial autocorrelation in the share of poverty (2019) within Washington’s census tracts. We are back using dc_data, and the variable in question is poverty_share19. The first step (assuming that you already have quantile maps for poverty) is to construct neighborhoods - here we use 10 nearest neighbors. Also, to better organize things, I will choose only a few variables from dc_data and store them inside poverty_data. Finally, I will consider whoever is NA as equal to zero.

############### Neighborhood construction
library(spdep)
coords<-st_centroid(dc_data)
knn<-knearneigh(coords, k = 10, longlat = TRUE)
knn_nb<-knn2nb(knn)
knn_listw <- nb2listw(knn_nb,style="W")

############## Selecting some variables
poverty_data<-dc_data%>%select(NAME, poverty_share19, total_pop19E, 
                               white_share19, black_share19, 
                               med_age19, vacant_share19, geometry)

poverty_data[is.na(poverty_data)] = 0

############# Spatially lagged poverty share
poverty_data$Wpoverty_share19<-lag.listw(knn_listw, poverty_data$poverty_share19)

#
#library(plotly)
#library(ggthemes)
#g1<-ggplot(poverty_data, aes(x=poverty_share19, 
#                             y=Wpoverty_share19, 
#                             label=NAME)) + 
#  geom_point(color="darkorange")+
#  stat_smooth(method = "lm", formula =y~x, se=F, color="darkblue") +
#  scale_y_continuous(name = "Spatially lagged poverty share 2019") +
#  scale_x_continuous(name = "Poverty share 2019")+
#  theme_economist_white(base_size = 17, gray_bg=FALSE)+
#  theme(axis.text=element_text(size=12),
#        axis.title=element_text(size=12,face="bold"))

#ggplotly(g1, tooltip = c("label", "x", "y"))

To get the Moran’s I, run moran.mc() with poverty_share19 and the knn_listw. As one can see below, the p-value is less than .05, and we can reject the null: overall, poverty is concentrated over space. The next step is to get the local Moran’s statistics.

moran.mc(poverty_data$poverty_share19, knn_listw, 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  poverty_data$poverty_share19 
## weights: knn_listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.4, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

Use the localmoran() function to get the local Moran’s I for each one of the 1,360 census tracts within Washington MSA.

locali<-as.data.frame(localmoran(poverty_data$poverty_share19, knn_listw, alternative = "two.sided", p.adjust.method="fdr"))

poverty_data$localp<-locali[,5]

poverty_data$pov19_s<-scale(poverty_data$poverty_share19)
poverty_data$wpov19_s<-scale(poverty_data$Wpoverty_share19)

poverty_data$label <- NA
poverty_data$label[poverty_data$pov19_s >= 0 & poverty_data$wpov19_s >= 0 & poverty_data$localp <= 0.05] <- "High-High"
poverty_data$label[poverty_data$pov19_s <= 0 & poverty_data$wpov19_s <= 0 & poverty_data$localp <= 0.05] <- "Low-Low"
poverty_data$label[poverty_data$pov19_s >= 0 & poverty_data$wpov19_s <= 0 & poverty_data$localp <= 0.05] <- "High-Low"
poverty_data$label[poverty_data$pov19_s <= 0 & poverty_data$wpov19_s >= 0 & poverty_data$localp <= 0.05] <- "Low-High"
poverty_data$label[poverty_data$localp > 0.05] <- "Not Significant" 

unique(poverty_data$label)
## [1] "Not Significant" "High-High"       "Low-High"        "High-Low"

As one can see above, there are 4 labels: Not Significant, High-High, Low-High, High-Low. For each label, I will assign a color: High-High is red3, Not Significant is white (#FFFFFF), and the outliers Low-High and High-Low are associated with skyblue and coral.

According to the LISA map, in general, high-poverty areas are clustered in neighborhoods of the District of Columbia and Prince George’s County. In your project, go one step further and compare the suburbs with central city. Where those clusters are located? Do you spot any outlier?

library(tmap)
tmap_mode("view")
tm_shape(poverty_data)+
  tm_borders(lwd = .4, col = "black")+
  tm_polygons("label",  
              palette = c("red3", "coral", "skyblue",  "#FFFFFF") ,
              alpha=0.4, 
              title="LISA map - Poverty rate in Washington Metro 2019",
              popup.vars=c("Census tract"="NAME", "Share of whites"="white_share19", "Poverty share"="poverty_share19", "Total Population"="total_pop19E", "Share of Vacant units"="vacant_share19" ))+
  tm_basemap(server="OpenStreetMap",alpha=0.5)

One can put LISA (local) and the Moran’s I (global) together in the Moran’s scatterplot using the code below:

library(plotly)
library(ggthemes)

g1<-ggplot(poverty_data, aes(pov19_s, wpov19_s,color=label, label=NAME))+ 
  theme_fivethirtyeight() +
  geom_point(size=5)+ 
  geom_hline(yintercept = 0, linetype = 'dashed')+ 
  geom_vline(xintercept = 0, linetype = 'dashed')+ 
  scale_colour_manual(values=c("red3", "coral", "skyblue",  "#FFFFFF"))+ 
  labs(x = "Poverty share 2019", y="Spatially lagged poverty share 2019")+ 
  theme(axis.text=element_text(size=18),axis.title=element_text(size=18,face="bold"), legend.text=element_text(size=15))+
  theme(legend.title=element_blank())+
  ggtitle("Moran's I: 0.406")

ggplotly(g1, tooltip = c("label", "x", "y"))

The spatial autocorrelation is driven by hotspots of poverty within Washington, although many of those hotspots have a share of poverty lower than 40%.