Urban Transportation [50 points]

Freeway Congestion [35 points]

The relationship between commuting cost \(g\) (in hundreds) and \(T\) (in thousands) is shown in the figure below, with some degree of poetic license. The monthly commuting cost per driver when the freeway is not congested (i.e., when there are fewer than \(\bar{T}=500\) cars at the same time) is $300. The commuting cost function when the freeway is congested is given by \(g(T)=T^{2}+3\), and the demand function at rush hour is \(D=5-T\). Assume \(T>0\) and \(n=2,500\) commuters in total using either the freeway or the alternate routes. If you need help solving equations, use Wolfram.

  1. What are the aggregate and marginal costs of using the freeway as a function of \(T\)?

  2. What is the number of cars using the freeway when there is no congestion pricing (free market equilibrium)? Also, compute the total commuting cost in this scenario: the cost of using the freeway + the alternate costs.

Hint: find the area below the \(MC\) curve from .5 to \(T_{eq}\). Find the area below the demand curve from \(T_{eq}\) to 2.5. Finally, find the area of the rectangle from 0 to \(\bar{T}\). The sum of those areas is the total commuting cost. To get the areas below those curves you might need to use integrals.

  3. What is the optimal number of cars \(T_{opt}\) using the freeway? Also, compute the total commuting cost in this scenario.

Hint: same as before, but now use \(T_{opt}\) instead of \(T_{eq}\).

  4. Suppose the mayor wants to impose a congestion toll to reach the optimal number of drivers using the freeway. Find the value of the congestion toll that must be charged to achieve \(T_{opt}\) freeway commuters at rush hour.

  5. Assume that the mayor wants to avoid congestion at rush hour by imposing the following restriction: only the first 500 cars would be allowed to commute using the freeway. It is your duty as a good economist to advise him on the matter. What do you say to him?

Hint: compare the total commuting costs between the scenario where there are \(T_{opt}\) cars with the one that has 500 cars.

I hope you still remember calculus…

xkcd 1050: Forgot Algebra

a)

The aggregate commuting cost is:

\(Tg(T)=T(T^{2}+3)=T^{3}+3T\)

The marginal cost is:

\(MC=\frac{d(Tg(T))}{dT}=g(T)+Tg^{'}(T)\)

Hence, \(MC=T^{2}+3+T(2T)=3T^{2}+3\)
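If you want a quick sanity check of that derivative, base R can differentiate the aggregate cost symbolically (a minimal sketch):

## Symbolic derivative of the aggregate cost T^3 + 3T
D(expression(T^3 + 3*T), "T")
## 3 * T^2 + 3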

b)

When there is no congestion pricing, the equilibrium is \(T_{eq}\), where the private cost \(g(T)\) intersects the demand curve: \(5-T=T^{2} +3 \rightarrow T=1\) or \(T=-2\). Since \(T>0\), \(T_{eq}=1\); the equilibrium number of cars using the freeway when there is no congestion toll is \(1,000\).
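If you would rather stay in R than go to Wolfram, polyroot() solves the quadratic directly; coefficients are passed in increasing order of the power of \(T\) (a small sketch):

## Roots of T^2 + T - 2 = 0; only the positive root makes economic sense
Re(polyroot(c(-2, 1, 1)))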

The total commuting cost is equal to \(3*0.5+\underbrace{\int_{.5}^{1}(3T^{2}+3)dT}_\text{Freeway cost} + \underbrace{\int_{1}^{2.5}(5-T)dT}_\text{Alternate routes cost}\). You can get those areas using the following code:

MC<-function(t) {3*t^2+3}
D<-function(t){5-t}

## Market Eq
Teq<-integrate(MC, lower=.5, upper=1)
Deq<-integrate(D, lower=1,upper=2.5)

cost_eq<-(.5*3+Teq$value+Deq$value)*10^5
cost_eq
## [1] 875000

c)

The socially optimal number of cars using the freeway is given by the intersection of \(D\) and \(MC\): \(5-T=3T^{2} +3 \rightarrow T=\frac{2}{3}\) or \(T=-1\). Since \(T>0\), \(T_{opt}=\frac{2}{3}\), i.e., roughly 667 cars.
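The same polyroot() trick works for this quadratic:

## Roots of 3T^2 + T - 2 = 0
Re(polyroot(c(-2, 1, 3)))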

The total commuting cost is equal to \(3*0.5+\underbrace{\int_{.5}^{\frac{2}{3}}(3T^{2}+3)dT}_\text{Freeway cost} + \underbrace{\int_{\frac{2}{3}}^{2.5}(5-T)dT}_\text{Alternate routes cost}\). You can get those areas using the following code:

MC<-function(t) {3*t^2+3}
D<-function(t){5-t}

## Social optimum
Top<-integrate(MC, lower=.5, upper=2/3)
Dop<-integrate(D, lower=2/3,upper=2.5)

cost_opt<-(.5*3+Top$value+Dop$value)*10^5
cost_opt
## [1] 843518.5

d)

The optimal outcome can be achieved by charging a congestion toll equal to the vertical difference between the marginal cost \(MC\) and the private (average) cost \(g(T)\) at the optimal point \(T_{opt}\). Then, \(Toll = MC(T_{opt})-g(T_{opt}) = T_{opt}\,g'(T_{opt}) = \frac{2}{3}\cdot\frac{4}{3}=\frac{8}{9}\)

Therefore, the monthly congestion toll should be approximately \(\$88.89\).
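As a quick numeric check (remember that costs are expressed in hundreds of dollars):

g<-function(t) {t^2+3}
MC<-function(t) {3*t^2+3}
(MC(2/3)-g(2/3))*100
## [1] 88.88889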

e)

If the mayor sets a limit of 500 cars using the freeway, there would be no congestion at all. The total cost of using the freeway is now \(3*0.5\). However, the alternate-route cost is higher: \(\int_{.5}^{2.5}(5-T)dT\). You can get those areas using the following code:

D<-function(t){5-t}

Dm<-integrate(D, lower=.5,upper=2.5)

cost_mayor<-(.5*3+Dm$value)*10^5
cost_mayor
## [1] 850000
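Putting the three scenarios side by side, reusing the cost objects computed in the previous chunks:

## Total monthly commuting costs: market equilibrium, social optimum, 500-car cap
c(market_eq=cost_eq, social_opt=cost_opt, cap_500=cost_mayor)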

As one can see, the total commuting cost in this scenario is higher than when the number of cars using the freeway is \(T_{opt}\). You should explain to him that the goal is to minimize the total commuting cost, and it would be better to charge a congestion toll that brings \(T_{eq}\) to \(T_{opt}\).

Commuting Time in the U.S. [15 points]

Let’s find out where America’s longest and shortest travel times to work are. Here you have a shapefile that contains the average commute time per U.S. county (excluding Hawaii and Alaska) based on the American Community Survey 2014-18. What are the top five and bottom five counties in terms of commute time? What is America’s average travel time to work? Finally, map the spatial distribution of commuting to work in U.S. counties.

## Packages that you might need to work with shapefiles
libs <- c("tigris","rgdal", "maptools", "reshape", "sp", "spdep", "GISTools", "ggplot2", "tidyverse", "tmap", "viridis")
lapply(libs, library, character.only = TRUE)

To access the dataframe contained in the .shp use @data. The average commute time to work is \(23.68\) minutes.

### Importing the .shp file
commute<-readOGR("commute_time.shp", layer="commute_time",  verbose=F)

### Bottom five
commute@data%>%arrange(mean_tt)%>%slice(1:5)
##   GEOID                      NAME mean_tt
## 1 30107 Wheatland County, Montana     7.9
## 2 48023      Baylor County, Texas     8.4
## 3 48263        Kent County, Texas     9.1
## 4 20071    Greeley County, Kansas     9.5
## 5 20123   Mitchell County, Kansas     9.7
### Top five
commute@data%>%arrange(desc(mean_tt))%>%slice(1:5)
##   GEOID                      NAME mean_tt
## 1 42103 Pike County, Pennsylvania    45.0
## 2 36005    Bronx County, New York    44.8
## 3 36085 Richmond County, New York    44.5
## 4 24017  Charles County, Maryland    44.4
## 5 36081   Queens County, New York    44.1
### America's average travel time to work

mean(commute@data$mean_tt, na.rm=T)
## [1] 23.6792

In case you want to define the breaks manually instead of using quantiles, set breaks=c() and use the values that you want. In the second map, I use manual breaks to split the commuting times between roughly 28 and 45 minutes (the top class of the quantile map) into more classes.

tmap_mode("plot")
tm_shape(commute)+
tm_borders(lwd = 2, col = "white", alpha = .4)+
tm_polygons("mean_tt", style="quantile", palette = "-viridis", 
            title="Commute time in US counties")+
tm_legend(legend.title.size = 1,legend.text.size = .7, 
          legend.position = c("left", "top"))+
tm_layout(inner.margins = c(.02,.18,.10,.02))

tmap_mode("view")
tm_shape(commute)+tm_borders(lwd = 2, col = "white", alpha = .4)+
tm_polygons("mean_tt", breaks=c(7,15,22,30,36,45), palette = "-viridis", 
              title="Commute time in US counties", id="NAME",
              popup.vars=c("County Name"="NAME","Commute time"="mean_tt"))+
tm_legend(legend.title.size = 1,legend.text.size = .7)+
tm_basemap(server="OpenStreetMap")

Housing Demand and Tenure Choice [25 points]

Vanilla Version of Hedonic Regression [15 points]

Using this .RDS data on house sales in King County-WA (May-Dec 2014), estimate a hedonic regression using the lm() function - you can find the original dataset here. The dependent variable is price, and all the other columns (besides GEOID and id) are treated as explanatory variables. First, run the regression with only house characteristics (everything but med_inc, popdens14, bach_share, and owner_share) and answer the following:

  1. What is the implicit price of number of bathrooms?

  2. By how much does a house's value rise with an increase of 10 square feet in the size of the living area?

  3. Are older houses worth less than newer homes on average?

  4. Does the implicit price of distance from downtown agree with the Monocentric City Model? Why?

  5. Now, run a regression including the Census tracts characteristics (i.e., with everything plus med_inc, popdens14, bach_share, and owner_share). Explain the signs of the coefficients related to neighborhood characteristics.

Variables Description

Variable      Definition
GEOID         Census tract id
id            Identification
price         Sale price
bathrooms     Number of bathrooms
sqft_liv      Size of living area in square feet
sqft_lot      Size of the lot in square feet
waterfront    ‘1’ if the property has a waterfront, ‘0’ if not
view          An index from 0 to 4 of how good the view of the property was
condition     Condition of the house, ranked from 1 to 5
grade         Refers to the types of materials used and the quality of workmanship. Buildings of better quality have higher grade
dist          Distance from downtown Seattle
basement      ‘1’ if the property has a basement, ‘0’ if not
renov         ‘1’ if the property was renovated in the last 10 years, ‘0’ if not
age           Age of the property
med_inc       Median Household Income in the Census tract
popdens14     Population Density in the Census tract
bach_share    Share of people with bachelor’s degree in the Census tract
owner_share   Share of houses that are owner-occupied in the Census tract

a)

library(kableExtra)
options(scipen=999)
house_kc<-readRDS("kings_sales.RDS")
reg<-lm(price~bathrooms+sqft_liv+sqft_lot+waterfront+view+condition+grade+dist+basement+renov+age, data=house_kc)
### If you want to see the regression results without a formatted table, 
### use summary(reg)
table<-coef(summary(reg))
knitr::kable(table, caption = "OLS Regression Estimates")%>% kable_classic(full_width = F, html_font = "Cambria")
OLS Regression Estimates
Estimate Std. Error t value Pr(>|t|)
(Intercept) -665731.4132033 25175.1568576 -26.443983 0.0000000
bathrooms 24549.4418892 4232.5943764 5.800093 0.0000000
sqft_liv 166.0619775 4.3283601 38.366027 0.0000000
sqft_lot 0.2777733 0.0620816 4.474327 0.0000078
waterfront 601915.0429530 25510.0072007 23.595252 0.0000000
view 49429.4473895 3167.5125816 15.605131 0.0000000
condition 33654.2454338 3479.8999410 9.671038 0.0000000
grade 104425.8710998 3097.8659320 33.708970 0.0000000
dist -9328.8488731 239.6982844 -38.919131 0.0000000
basement -47190.2531980 4821.6324395 -9.787194 0.0000000
renov 66333.3282294 17304.8164597 3.833229 0.0001274
age 1486.5778381 101.8193279 14.600154 0.0000000

The implicit price of an additional bathroom is $24,549.44

b)

Keeping all the other housing characteristics constant, an increase of 10 sq feet in the size of the living area is associated, on average, with a price increase of around $1,660.61
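A one-line check of that number, assuming the reg object from the chunk above:

## Price change associated with 10 extra square feet of living area
coef(reg)["sqft_liv"]*10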

c)

Actually, the opposite is happening: keeping all the other housing characteristics constant, an increase of 10 years in the house’s age is associated, on average, with a price increase of around $14,866.

d)

Yes! There is a negative relationship between housing prices and distance from downtown Seattle. Even controlling for other house characteristics, that relationship is still statistically significant.

e)

house_kc<-readRDS("kings_sales.RDS")
reg2<-lm(price~bathrooms+sqft_liv+sqft_lot+waterfront+view+condition+grade+dist+basement+renov+age+med_inc+popdens14+bach_share+owner_share, data=house_kc)
### If you want to see the regression results without a formatted table, 
### use summary(reg2)
table2<-coef(summary(reg2))
knitr::kable(table2, caption = "OLS Regression Estimates - Neighborhood Characteristics" )%>% kable_classic(full_width = F, html_font = "Cambria")
OLS Regression Estimates - Neighborhood Characteristics
Estimate Std. Error t value Pr(>|t|)
(Intercept) -596441.7436664 24734.5469216 -24.113712 0.0000000
bathrooms 21435.9372574 3879.9188892 5.524842 0.0000000
sqft_liv 157.3886994 4.0417375 38.940851 0.0000000
sqft_lot 0.2979985 0.0567360 5.252370 0.0000002
waterfront 629359.1975379 23278.7117288 27.035826 0.0000000
view 53122.7728574 2886.8748729 18.401481 0.0000000
condition 32111.4656938 3174.9350407 10.114054 0.0000000
grade 71783.6638541 2949.2844213 24.339349 0.0000000
dist -5588.5139022 282.8405517 -19.758531 0.0000000
basement -38536.3308977 4409.3534115 -8.739678 0.0000000
renov 77159.7164860 15753.6242832 4.897902 0.0000010
age 1344.0266923 93.6062033 14.358308 0.0000000
med_inc 2.1107609 0.1468974 14.368945 0.0000000
popdens14 -7.0621086 2.6392022 -2.675850 0.0074692
bach_share 299130.0700330 20014.1724288 14.945913 0.0000000
owner_share -272989.9444110 18520.5312476 -14.739855 0.0000000

The median household income and the share of people with a bachelor’s degree (or higher) living in the neighborhood are associated with higher house sale prices. That was expected since those two neighborhood characteristics are very desirable. There is a negative correlation between population density and house sale prices: on average, people pay less for houses in more crowded areas. Finally, homeowners have stronger incentives to get involved and take care of their neighborhood. Hence, the negative sign of owner_share is unexpected, since higher ownership is usually associated with desirable neighborhood characteristics such as social cohesion.

Tenure Choice [10 points]

In this question, use the model with accelerated depreciation deduction and assume the following parameters: \(i=0.03\), \(h=0.02\), \(d=0.02\), \(g=0.04\), \(e=0.02\) and \(\lambda = 0.35\).

  1. Compute the tax rate \(\hat{\tau}\) that separates renters and owner-occupiers

  2. Suppose \(\lambda\) increases to 0.40. What is the new \(\hat{\tau}\)? Give an intuitive explanation for your answer - what happens with the housing tenure choice graph?

a)

The tax rate \(\hat{\tau}\) that separates renters and owner-occupiers is the intersection point of the flat rent line and the downward-sloping line corresponding to the owner-occupier’s cost. Hence,

\[ \{(1-\tau)(i+h)+d-g\}v=(i+h+d-g)v-\frac{\lambda ev}{1-\lambda} \]

Dividing both sides by \(v\) and replacing \(i\), \(h\), \(d\), \(g\),\(e\), and \(\lambda\) by the numbers provided:

i=.03
h=.02
d=.02
g=.04
e=.02
lambda=.35

i+h+d-g-lambda*e/(1-lambda)-d+g
## [1] 0.03923077

\((1-\tau)(.05)=.03923\rightarrow \tau\approx.215\)
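Since \(d\) and \(g\) cancel out, the cutoff can be written as \(\hat{\tau}=1-\frac{i+h-\lambda e/(1-\lambda)}{i+h}\). A small helper function computing it directly (a sketch using the parameters above):

tau_hat<-function(i, h, e, lambda) {1-(i+h-lambda*e/(1-lambda))/(i+h)}
tau_hat(i=.03, h=.02, e=.02, lambda=.35)
## [1] 0.2153846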

b)

i=.03
h=.02
d=.02
g=.04
e=.02
lambda=.4

i+h+d-g-lambda*e/(1-lambda)-d+g
## [1] 0.03666667

\((1-\tau)(.05)=.03667\rightarrow \tau\approx.267\)

An increase in the landlord’s income-tax rate \(\lambda\) decreases the renter user-cost, shifting the horizontal line downward. As a result, the cutoff tax rate \(\hat{\tau}\) rises, decreasing the number of households that are homeowners.

Innovation Jobs & US Tech-Hubs [40 points]

One way to map innovation is to look at the number of patents per resident. Of course, this is an imperfect measure of innovation since not all new ideas are patented, and not all patents turn into valuable innovations. However, we are going to use that measure as a proxy for innovative activity. Consider as tech-hubs the areas with a high or very high number of patents per capita.

The United States Patent and Trademark Office provides a periodic tabulation of patent data, and you can find this information in a .RDS file here.

Patents per 100,000 population [10 points]

  1. Using tidycensus, find the ACS 2015 5-year estimate of the total population per county. This time, we want to add the option geometry = T inside the get_acs() function

Hint: load the variables and find the one related to TOTAL POPULATION.

  2. Merge the datasets of patents and total population while keeping all the counties from the ACS data

Hint: you have more counties in the ACS data (3,220) than in the patents data (3,019), so use left_join(). You are going to get NAs related to patent information for 201 counties.

  3. Calculate the total number of patents in 2000-2015 (column Total) per 100,000 population. What are the top five and bottom five counties in terms of innovativeness?

a)

The first step is to load the variables related to the ACS 5-year estimates from 2011-2015. Then, you can search for the variable TOTAL POPULATION using View(var) and the search box in the top right corner (just type TOTAL POPULATION there).

library(tidyverse)
library(tidycensus)
var<-load_variables("acs5", year=2015)
#View(var)

You are looking for the variable B01003_001, and the geography is county. Set geometry=T to get the counties’ coordinates and shift_geo=T to shift Alaska and Hawaii below the continental US on the map.

pop<-get_acs(geography = "county", 
                      variables = c(total_pop="B01003_001"), 
                      year=2015,
                      geometry = T,
                      survey = "acs5",
                      shift_geo = TRUE,
                      output="wide")

b)

patents<-readRDS("patentsUS")

full_data<-left_join(pop, patents, by="GEOID")

## check the names of the columns in your new dataset
## View(full_data)

c)

full_data$patents_per_100<-full_data$Total/full_data$total_popE*100000

## Bottom 5

full_data%>%dplyr::select(NAME, patents_per_100)%>%arrange(patents_per_100)%>%slice(1:5)
## Simple feature collection with 5 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 83414.34 ymin: -2074505 xmax: 1743484 ymax: -745065
## CRS:            +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
##                              NAME patents_per_100                       geometry
## 1           Knox County, Kentucky        3.143764 MULTIPOLYGON (((1450274 -75...
## 2             Starr County, Texas        3.192440 MULTIPOLYGON (((168816.4 -2...
## 3     Simpson County, Mississippi        3.649502 MULTIPOLYGON (((971316.4 -1...
## 4     Coahoma County, Mississippi        3.959769 MULTIPOLYGON (((865312.4 -1...
## 5 Barnwell County, South Carolina        4.525296 MULTIPOLYGON (((1689534 -11...
## Top 5
full_data%>%dplyr::select(NAME, patents_per_100)%>%arrange(desc(patents_per_100))%>%slice(1:5)
## Simple feature collection with 5 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -1960346 ymin: -1688207 xmax: 479269.7 ymax: 1470.373
## CRS:            +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
##                             NAME patents_per_100                       geometry
## 1 Santa Clara County, California        7654.529 MULTIPOLYGON (((-1902539 -6...
## 2           Guthrie County, Iowa        6033.520 MULTIPOLYGON (((441129.7 -3...
## 3             Hays County, Texas        5162.704 MULTIPOLYGON (((176728 -162...
## 4   San Mateo County, California        4895.617 MULTIPOLYGON (((-1952774 -6...
## 5              Ada County, Idaho        4269.211 MULTIPOLYGON (((-1281220 -5...

Mapping data using tmap+tidycensus [15 points]

Using tidycensus with the option geometry = T, one can get the coordinates of the desired areas (in our case, counties). Then, mapping the data is straightforward - treat your data as a shapefile: tm_shape(your_data)+…+….

  1. Use tmap to map the patents per 100,000 population in U.S. counties. Are the tech-hubs randomly distributed over space? What do you see?

  2. Tech-hubs have at least one thing in common: a very skilled labor force. That means the geographical distribution of skilled people is similar to the distribution of tech-hubs in U.S. counties. And here is the question: summarize Rebecca Diamond’s take on US workers’ diverging locations by skill, and try to relate it with what you know about the role of innovation jobs and the geographical distribution of tech-hubs in the US.

a)

Again, if you want to define the breaks manually instead of using quantiles, set breaks=c() and use the values you want. In this case, I want to highlight counties with a high/very-high number of patents per 100,000 people (between 1,500-3,000 and 3,000-7,655).

As one can see, there are very few counties in the category high/very-high, and they are located in a handful of states such as California, Texas, and New York.

tmap_mode("plot")
tm_shape(full_data)+
tm_borders(lwd = 2, col = "white", alpha = .4)+
tm_polygons("patents_per_100", breaks=c(0, 200, 700, 1500, 3000, 7655), palette = "BuPu", title="Patents per 100,000 people in US counties")+  tm_legend(legend.title.size = 1,legend.text.size = .7, 
           legend.position = c("left", "top"))+
tm_layout(inner.margins = c(.02,.18,.10,.02))

b)

We have discussed The Great Divergence: the socio-economic gaps between more and less educated areas have become quite large. Indeed, skills have increasingly come to dictate metropolitan fortunes, and high-skilled people are geographically concentrated in a few places, mostly tech-hubs (which you just mapped!). The desire to reduce transportation costs for people and ideas is becoming more, not less, critical. Also, as you know, agglomeration forces are naturally self-reinforcing - people and firms want to enjoy the labor market pooling and knowledge spillovers already present. This self-reinforcing process may keep widening the gap: America could become even more spatially uneven in the future. Finally, more college graduates are associated with higher earnings for low-skilled people in metro areas, and each innovation job can generate around five additional jobs in that area.

A previous take on this tendency of skilled people to live in high-wage, high-rent cities was that local costs of living would offset part of the income inequality; hence, income differences would overstate the real welfare disparity. The rents in New York and Silicon Valley are absurd, and a significant share of the above-average salaries of those high-skilled workers goes to housing costs in those areas. On the other hand, as Rebecca Diamond shows, those high-skilled areas also offer better amenities: better air and school quality, lower crime rates, and a larger supply of bars and entertainment. In that sense, nominal wage inequality understates real welfare inequality.

Your neighbor’s education affects your salary [15 points]

Now, we want to explore the link between local human capital and salaries across U.S. micro and metropolitan regions. Examining the data from the American Community Survey 2019 - 5 years using the package tidycensus, construct a scatter plot to show the relationship between the median earnings in the past 12 months for high school graduates in the 945 core-based statistical areas and the share of the population that has a bachelor’s degree. What are the main reasons for that relationship to hold?

Hint: load the variables from acs5/year 2019 first. Then, look for median earnings in the past 12 months, and get the name of the one that is labeled as Total!!High school graduate. After that, you need the total population and the population with at least a bachelor’s degree. You are searching for educational attainment for the population 25 years and over with labels !!Total, Total!!Bachelor's degree, Total!!Master's degree, Total!!Professional school degree, Total!!Doctorate degree. Also, remember that your geography is “metropolitan statistical area/micropolitan statistical area”.

Answer

The link between local human capital and salaries can be seen below, and it holds for most core-based statistical areas and for metropolitan areas (second plot). On average, more college graduates are associated with higher high school earnings. The economic effect is considerable in metro areas: when the share of people with a bachelor’s degree increases by 10%, high school earnings rise by about 1.36%.

Skilled and unskilled workers complement each other. Increasing productivity from one raises the productivity of the other, and an increase in the overall level of human capital generates knowledge spillovers. Since this positive externality is not mediated by the market, one way to correct that market failure is to provide subsidies for a college education.

The cross-section analysis might raise concerns. Maybe this effect is not related to places per se, i.e., the fact that you are working in a skilled city. Instead, it could be due to sorting: the workers who pick those cities with many college grads are fundamentally different from workers who pick less skilled cities, explaining the differences in earnings. However, Moretti (2004) tracks the life histories of 12,000 individuals over time and concludes that the same individual earns different salaries depending on how many skilled workers surround him.
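Before pulling the data, you can browse the 2019 ACS 5-year variable list to confirm the codes used below (a sketch; the exact label and concept strings may differ slightly across vintages):

library(tidyverse)
library(tidycensus)
var19<-load_variables(2019, "acs5")
## Median earnings tables (B20004_003 is labeled Total!!High school graduate)
var19 %>% filter(str_detect(concept, "MEDIAN EARNINGS"))
## Educational attainment for the population 25 years and over (table B15003)
var19 %>% filter(str_detect(concept, "EDUCATIONAL ATTAINMENT"))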

library(ggthemes)
hs_earnings<-get_acs(geography = "metropolitan statistical area/micropolitan statistical area", 
                     variables = "B20004_003", 
                     sumfile="acs5", 
                     year=2019)

pop_share<-get_acs(geography = "metropolitan statistical area/micropolitan statistical area", 
                     variables = c(total="B15003_001",
                                   bachelor="B15003_022", 
                                   prof="B15003_024", 
                                   master="B15003_023", 
                                   phd="B15003_025"), 
                     sumfile="acs5", 
                     year=2019,
                     output="wide")

pop_share<-pop_share%>%mutate(bachelor_share=(bachelorE+masterE+phdE+profE)/totalE)

full_data<-full_join(pop_share, hs_earnings, by="GEOID")

plot_data<-full_data%>%dplyr::select(NAME.x, bachelor_share, HS_grad=estimate)

g1<-ggplot(plot_data, aes(x=bachelor_share, y=HS_grad)) + 
  geom_point(color="darkgrey") +
  stat_smooth(method = "lm", formula =y~x, se=F, color="black") +
  scale_x_continuous(name = "Share of population with bachelor's degree (%)") +
  scale_y_continuous(name = "Median earnings - High School Grad") +
  theme_economist_white(base_size = 17, gray_bg=FALSE)+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=12,face="bold"))
g1

If you want to work only with Metropolitan Areas, remove all the micropolitan areas using the column NAME.x. You will need the package stringr. You are asking R to keep all the rows whose NAME.x contains Metro Area. To get the effect I mentioned in my answer, run the regression of log(HS_grad) on log(bachelor_share) using the lm() function.

library(stringr)
metro_data<-plot_data%>% filter(str_detect(NAME.x, 'Metro Area'))

g2<-ggplot(metro_data, aes(x=bachelor_share, y=HS_grad)) + 
  geom_point(color="darkgrey") +
  stat_smooth(method = "lm", formula =y~x, se=F, color="black") +
  scale_x_continuous(name = "Share of population with bachelor's degree (%)") +
  scale_y_continuous(name = "Median earnings - High School Grad") +
  theme_economist_white(base_size = 17, gray_bg=FALSE)+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=12,face="bold"))
g2

reg<-lm(log(HS_grad)~log(bachelor_share), data=metro_data)
summary(reg)
## 
## Call:
## lm(formula = log(HS_grad) ~ log(bachelor_share), data = metro_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83509 -0.04432  0.01090  0.06899  0.35909 
## 
## Coefficients:
##                     Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         10.49508    0.03273 320.610 < 0.0000000000000002 ***
## log(bachelor_share)  0.13601    0.02464   5.521         0.0000000617 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1417 on 390 degrees of freedom
## Multiple R-squared:  0.07248,    Adjusted R-squared:  0.07011 
## F-statistic: 30.48 on 1 and 390 DF,  p-value: 0.00000006172