5 Instrumental Variables

5.1 Preliminaries

Here we replicate some tables from Angrist and Krueger (1991). The dataset ak91_census1980 covers men born in 1930-1939 in the U.S. Census. The table below describes the variables:

Table 5.1: Variables Description
Variable Definition
lnw Log weekly wages
s Years of schooling
yob Year of birth
qob Quarter of birth
sob State of birth
age Age

The authors present an IV analysis of the returns to schooling, instrumenting years of education with quarter of birth (QOB).

5.2 OLS results

Reading the ak91_census1980.RDS file and running the regression of lnw on s:

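# Point R to the folder that holds the dataset (adjust the path to your machine)
setwd("C:/Users/User/Desktop/474-Rlab/datasets")
ak91<-readRDS("ak91_census1980.RDS")

library(fixest)
# OLS of log weekly wages on years of schooling, heteroskedasticity-robust SEs
reg_ols1<-feols(lnw~s, data=ak91, se="hetero")
# Same regression with year-of-birth and state-of-birth fixed effects
reg_ols2<-feols(lnw~s|yob+sob, data=ak91, se="hetero")
# Show both models side by side
etable(reg_ols1, reg_ols2, signifCode = c("***"=0.01, "**"=0.05, "*"=0.10))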
##                           reg_ols1           reg_ols2
## Dependent Var.:                lnw                lnw
##                                                      
## (Intercept)      4.995*** (0.0051)                   
## s               0.0708*** (0.0004) 0.0673*** (0.0004)
## Fixed-Effects:  ------------------ ------------------
## yob                             No                Yes
## sob                             No                Yes
## _______________ __________________ __________________
## S.E. type       Heteroskedas.-rob. Heteroskedas.-rob.
## Observations               329,509            329,509
## R2                         0.11729            0.12878
## Within R2                       --            0.10289

The results point to an average increase in wages of around 6.73-7.08% associated with one additional year of schooling. The second regression controls for year of birth and state of birth.
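Reading a coefficient from a log-wage regression directly as a percentage is an approximation that works well when the coefficient is small. As a rough check, the sketch below compares that reading with the exact percentage change, 100(exp(b) - 1), for the two coefficients on s reported in the table above. The object names are illustrative and not part of the lab.

```r
# Coefficients on s copied from the etable output above
b_ols <- c(no_controls = 0.0708, yob_sob_fe = 0.0673)

# Approximate reading: coefficient times 100
approx_pct <- 100 * b_ols

# Exact percentage change implied by a coefficient measured in log points
exact_pct <- 100 * (exp(b_ols) - 1)

round(rbind(approx_pct, exact_pct), 2)
```

The two rows differ only in the first decimal place, so the approximate reading is adequate for coefficients of this size.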

5.3 Wald Estimator

Let’s create the instrument \(Z_{i}\), which takes the value 1 if the individual was born in the \(1^{st}\) quarter of the year and 0 otherwise:

ak91$instrument<-ifelse(ak91$qob==1,1,0)

Now, take a look at the average log weekly wages and years of schooling by QOB status:

library(tidyverse)
ak91%>%group_by(instrument)%>%summarize(wages=mean(lnw), schooling=mean(s))
## # A tibble: 2 x 3
##   instrument wages schooling
##        <dbl> <dbl>     <dbl>
## 1          0  5.90      12.8
## 2          1  5.89      12.7

People born in the first quarter of the year (i.e., \(Z_{i}=1\)) have, on average, slightly lower wages and years of schooling - the same pattern we saw in the lecture notes. To estimate the returns to schooling using the Wald estimator, divide the reduced form by the first stage results:

\[\text{Effect of schooling on wages}= \frac{\text{Effect of QOB on wages}}{\text{Effect of QOB on schooling}}\]

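# Reduced form: difference in mean log wages, first-quarter births vs. everyone else
RF<-mean(ak91$lnw[ak91$instrument==1])-mean(ak91$lnw[ak91$instrument==0])
# First stage: difference in mean years of schooling between the same two groups
FS<-mean(ak91$s[ak91$instrument==1])-mean(ak91$s[ak91$instrument==0])

# Wald estimator: reduced form divided by first stage
Wald<-RF/FS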

Wald
## [1] 0.101995

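The Wald estimator gives a return to education of around 10.2%. The difference from the OLS results is consistent with omitted variable bias in the OLS regression, although other explanations (such as measurement error in schooling) are possible.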
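To see the mechanics of the Wald estimator without the census file, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the variable ability, the coefficients, and the sample size are invented, and the direction of the OLS bias is upward only because the simulation is built that way. It shows that the ratio of differences in means equals the covariance ratio cov(lnw, z) / cov(s, z), and that OLS drifts away from the true return when a confounder is omitted.

```r
set.seed(474)
n <- 100000

# Simulated data (hypothetical, NOT the ak91 sample)
z <- rbinom(n, 1, 0.25)                          # binary instrument, e.g. born in Q1
ability <- rnorm(n)                              # unobserved confounder
s <- 12 - 0.5 * z + ability + rnorm(n)           # schooling depends on z and ability
lnw <- 5 + 0.10 * s + 0.3 * ability + rnorm(n)   # true return to schooling set to 0.10

# Reduced form and first stage as differences in means
rf <- mean(lnw[z == 1]) - mean(lnw[z == 0])
fs <- mean(s[z == 1]) - mean(s[z == 0])
wald <- rf / fs

# The same estimate written as a ratio of covariances
wald_cov <- cov(lnw, z) / cov(s, z)

# OLS slope for comparison; ability is omitted, so it is biased by construction
ols <- unname(coef(lm(lnw ~ s))["s"])

c(wald = wald, wald_cov = wald_cov, ols = ols)
```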

5.4 2SLS Estimates

5.4.1 Manually getting the coefficient (don’t do it!)

To better understand the 2SLS estimation method, run the regression \(s_{i}\) on \(Z_{i}\) and store the fitted values \(\hat{s}_{i}\):

first_stage<-feols(s~instrument, data=ak91, se="hetero")
ak91$s_hat<-first_stage$fitted.values

If the instrument is valid, \(\hat{s}_{i}\) is no longer correlated with the error term, so the regression of \(lnw_{i}\) on \(\hat{s}_{i}\) gives you the causal effect of years of schooling on weekly wages:

second_stage<-feols(lnw~s_hat, data=ak91, se="hetero")
summary(second_stage)
## OLS estimation, Dep. Var.: lnw
## Observations: 329,509 
## Standard-errors: Heteroskedasticity-robust 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 4.597500   0.322078  14.274 < 2.2e-16 ***
## s_hat       0.101995   0.025221   4.044   5.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.678806   Adj. R2: 4.68e-5

This is the same 10.2% we got using the Wald estimator. The advantage of the regression approach is that it can include more than one instrument as well as covariates (such as year of birth and state of birth).
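The match between the manual 2SLS slope and the Wald estimate is not a coincidence. As a short derivation, write the fitted first stage as \(\hat{s}_{i}=\hat{\pi}_{0}+\hat{\pi}_{1}Z_{i}\), where \(\hat{\pi}_{0}\) and \(\hat{\pi}_{1}\) are notation introduced here for the first-stage intercept and slope. The second-stage slope is then

\[\hat{\beta}_{2SLS}=\frac{Cov(lnw_{i},\hat{s}_{i})}{Var(\hat{s}_{i})}=\frac{\hat{\pi}_{1}\,Cov(lnw_{i},Z_{i})}{\hat{\pi}_{1}^{2}\,Var(Z_{i})}=\frac{Cov(lnw_{i},Z_{i})/Var(Z_{i})}{\hat{\pi}_{1}}=\frac{Cov(lnw_{i},Z_{i})/Var(Z_{i})}{Cov(s_{i},Z_{i})/Var(Z_{i})}\]

With a binary \(Z_{i}\), the numerator is the difference in mean log wages between the two groups (the reduced form) and the denominator is the difference in mean schooling (the first stage), which is exactly the Wald ratio.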

5.4.2 The correct way

The procedure above gives you the correct coefficient, but the standard errors are not quite right. Instead, we let feols estimate the IV regression using 2SLS for us.
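One way to see why the manual standard errors are off: the second-stage regression computes its residuals with the fitted schooling values, while 2SLS residuals should use observed schooling. The sketch below illustrates this on simulated data (the variables and coefficients are invented for illustration and are not the ak91 sample). It uses classical standard errors in base R to keep the arithmetic visible, whereas the lab reports heteroskedasticity-robust ones, so treat it as an illustration of the idea rather than a reproduction of the feols output.

```r
set.seed(474)
n <- 100000

# Simulated data (hypothetical, NOT the ak91 sample)
z <- rbinom(n, 1, 0.25)
ability <- rnorm(n)
s <- 12 - 0.5 * z + ability + rnorm(n)
lnw <- 5 + 0.10 * s + 0.3 * ability + rnorm(n)

# Manual two-step procedure
s_hat <- fitted(lm(s ~ z))
second <- lm(lnw ~ s_hat)
b <- coef(second)

# Naive residuals use s_hat; 2SLS residuals use the observed s
res_naive <- lnw - b[1] - b[2] * s_hat
res_2sls  <- lnw - b[1] - b[2] * s

# Classical standard error of the slope under each set of residuals
sxx <- sum((s_hat - mean(s_hat))^2)
se_naive <- sqrt(sum(res_naive^2) / (n - 2) / sxx)
se_2sls  <- sqrt(sum(res_2sls^2)  / (n - 2) / sxx)

c(coef = unname(b[2]), se_naive = se_naive, se_2sls = se_2sls)
```

Here se_naive reproduces what summary(second) reports, while se_2sls is the version based on the correct residuals. The coefficient is identical in both cases.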

The simple case is one instrument without any covariates:

ivreg<-feols(lnw~1|s~instrument, data=ak91, se="hetero")
summary(ivreg)
## TSLS estimation, Dep. Var.: lnw, Endo.: s, Instr.: instrument
## Second stage: Dep. Var.: lnw
## Observations: 329,509 
## Standard-errors: Heteroskedasticity-robust 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 4.597500   0.306890 14.9810 < 2.2e-16 ***
## fit_s       0.101995   0.024032  4.2442   2.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.645908   Adj. R2: 0.094623
## F-test (1st stage): stat = 67.6   , p < 2.2e-16 , on 1 and 329,507 DoF.
##         Wu-Hausman: stat =  1.7349, p = 0.187788, on 1 and 329,506 DoF.

The part of the formula after the vertical bar | has the form endogenous ~ instrument: the endogenous variable (in this case, \(s_{i}\)) goes on the left and the instrument \(Z_{i}\) you want to use goes on the right.

To add controls (here, year-of-birth and state-of-birth fixed effects), place them between the two vertical bars:

ivreg2<-feols(lnw~1|yob+sob|s~instrument, data=ak91, se="hetero")
summary(ivreg2)
## TSLS estimation, Dep. Var.: lnw, Endo.: s, Instr.: instrument
## Second stage: Dep. Var.: lnw
## Observations: 329,509 
## Fixed-effects: yob: 10,  sob: 51
## Standard-errors: Heteroskedasticity-robust 
##       Estimate Std. Error t value Pr(>|t|)    
## fit_s 0.104194   0.025669  4.0592  4.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.644398     Adj. R2: 0.098689
##                  Within R2: 0.072069
## F-test (1st stage): stat = 62.5   , p = 2.673e-15, on 1 and 329,448 DoF.
##         Wu-Hausman: stat =  2.1472, p = 0.142833 , on 1 and 329,447 DoF.

You can also combine multiple instruments. For instance, let’s add dummies for the second and third quarters of birth to the first-quarter instrument:

ak91$instrument2<-ifelse(ak91$qob==2,1,0)
ak91$instrument3<-ifelse(ak91$qob==3,1,0)

Now we have three instruments: dummies for the first three quarters of the year. As the output below shows, the estimate is more precise (the standard error falls from 0.0257 to 0.0196), and the estimated return to schooling rises slightly, from 10.4% to 10.8%.

ivreg3<-feols(lnw~1|yob+sob|s~instrument+instrument2+instrument3, data=ak91, se="hetero")
summary(ivreg3)
## TSLS estimation, Dep. Var.: lnw, Endo.: s, Instr.: instrument, instrument2, instrument3
## Second stage: Dep. Var.: lnw
## Observations: 329,509 
## Fixed-effects: yob: 10,  sob: 51
## Standard-errors: Heteroskedasticity-robust 
##       Estimate Std. Error t value Pr(>|t|)    
## fit_s 0.107694   0.019559  5.5061 3.67e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.646523     Adj. R2: 0.092734
##                  Within R2: 0.065938
## F-test (1st stage): stat = 36.0   , p < 2.2e-16 , on 3 and 329,446 DoF.
##         Wu-Hausman: stat =  4.453 , p = 0.034841, on 1 and 329,447 DoF.
##             Sargan: stat =  3.0652, p = 0.215978, on 2 DoF.
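With three instruments and one endogenous variable the model is overidentified, which is why a Sargan line now appears in the output. As a rough sketch of what such a test does, the textbook version regresses the 2SLS residuals on all the instruments and compares n times the R-squared with a chi-squared distribution. The code below applies that textbook form to simulated data (variables and coefficients are invented for illustration). It is an assumption that this matches the fixest computation in every detail, so the numbers are not meant to reproduce the output above.

```r
set.seed(474)
n <- 50000

# Simulated data (hypothetical, NOT the ak91 sample)
qob <- sample(1:4, n, replace = TRUE)
z1 <- as.numeric(qob == 1)
z2 <- as.numeric(qob == 2)
z3 <- as.numeric(qob == 3)
ability <- rnorm(n)
s <- 12 - 0.5 * z1 - 0.3 * z2 - 0.1 * z3 + ability + rnorm(n)
lnw <- 5 + 0.10 * s + 0.3 * ability + rnorm(n)

# 2SLS point estimates via the two-step procedure
s_hat <- fitted(lm(s ~ z1 + z2 + z3))
b <- coef(lm(lnw ~ s_hat))

# 2SLS residuals use the observed s, not s_hat
u_hat <- lnw - b[1] - b[2] * s

# Sargan statistic: n * R^2 from regressing the residuals on all instruments
aux <- lm(u_hat ~ z1 + z2 + z3)
sargan <- n * summary(aux)$r.squared
df <- 3 - 1    # number of instruments minus number of endogenous regressors
p_value <- pchisq(sargan, df = df, lower.tail = FALSE)

c(sargan = sargan, df = df, p_value = p_value)
```

A large p-value means the test does not reject the hypothesis that the instruments are jointly valid. In the output above, the Sargan p-value of 0.216 points in that direction.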