5 Instrumental Variables
5.1 Preliminaries
Here we replicate some tables from Angrist and Krueger (1991). The dataset ak91_census1980
covers men born in 1930-1939, drawn from the 1980 U.S. Census. The table below describes the variables:
Variable | Definition |
---|---|
lnw | Log weekly wages |
s | Years of schooling |
yob | Year of birth |
qob | Quarter of birth |
sob | State of birth |
age | Age |
The authors present an IV analysis of returns to schooling, instrumenting years of education with birth quarters (QOB).
5.2 OLS results
Read the ak91_census1980.RDS file and run the regression of lnw on s:
setwd("C:/Users/User/Desktop/474-Rlab/datasets")
ak91<-readRDS("ak91_census1980.RDS")
library(fixest)
# Simple OLS, then OLS with year-of-birth and state-of-birth fixed effects
reg_ols1<-feols(lnw~s, data=ak91, se="hetero")
reg_ols2<-feols(lnw~s|yob+sob, data=ak91, se="hetero")
etable(reg_ols1, reg_ols2, signifCode = c("***"=0.01, "**"=0.05, "*"=0.10))
## reg_ols1 reg_ols2
## Dependent Var.: lnw lnw
##
## (Intercept) 4.995*** (0.0051)
## s 0.0708*** (0.0004) 0.0673*** (0.0004)
## Fixed-Effects: ------------------ ------------------
## yob No Yes
## sob No Yes
## _______________ __________________ __________________
## S.E. type Heteroskedas.-rob. Heteroskedas.-rob.
## Observations 329,509 329,509
## R2 0.11729 0.12878
## Within R2 -- 0.10289
The results point to weekly wages that are, on average, about 6.73-7.08% higher for each additional year of schooling. The second regression controls for year of birth and state of birth through fixed effects.
5.3 Wald Estimator
Let’s create the instrument \(Z_{i}\) that takes on 1 if the individual was born in the \(1^{st}\) quarter of the year and 0 otherwise:
ak91$instrument<-ifelse(ak91$qob==1,1,0)
Now, take a look at the average log weekly wages and years of schooling by QOB status:
library(tidyverse)
ak91%>%group_by(instrument)%>%summarize(wages=mean(lnw), schooling=mean(s))
## # A tibble: 2 x 3
## instrument wages schooling
## <dbl> <dbl> <dbl>
## 1 0 5.90 12.8
## 2 1 5.89 12.7
People born in the first quarter of the year (i.e., \(Z_{i}=1\)) have, on average, slightly lower wages and years of schooling - the same pattern we saw in the lecture notes. To estimate the returns to schooling using the Wald estimator, divide the reduced form by the first stage results:
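\[\text{Effect of schooling on wages}= \frac{\text{Effect of QOB on wages}}{\text{Effect of QOB on schooling}}\]
# Reduced form: difference in mean log wages by instrument status
RF<-mean(ak91$lnw[ak91$instrument==1])-mean(ak91$lnw[ak91$instrument==0])
# First stage: difference in mean years of schooling by instrument status
FS<-mean(ak91$s[ak91$instrument==1])-mean(ak91$s[ak91$instrument==0])
Wald<-RF/FS
Wald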
## [1] 0.101995
The Wald estimator gives a return to education of around 10.2%. The gap relative to the OLS results is commonly attributed to bias in the OLS estimates, such as omitted variable bias.
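As a rough sanity check, the same ratio can be written with the group means from the tibble above. The means there are rounded, so this only approximately reproduces the 0.102 computed in R:

\[\hat{\beta}_{Wald}=\frac{\overline{lnw}_{Z=1}-\overline{lnw}_{Z=0}}{\overline{s}_{Z=1}-\overline{s}_{Z=0}}\approx\frac{5.89-5.90}{12.7-12.8}=0.10\]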
5.4 2SLS Estimates
5.4.1 Manually getting the coefficient (don’t do it!)
To better understand the 2SLS estimation method, regress \(s_{i}\) on \(Z_{i}\) and store the fitted values \(\hat{s}_{i}\):
first_stage<-feols(s~instrument, data=ak91, se="hetero")
ak91$s_hat<-first_stage$fitted.values
If the instrument is valid, \(\hat{s}_{i}\) is uncorrelated with the error term, so regressing \(lnw_{i}\) on \(\hat{s}_{i}\) recovers the causal effect of years of schooling on log weekly wages:
second_stage<-feols(lnw~s_hat, data=ak91, se="hetero")
summary(second_stage)
## OLS estimation, Dep. Var.: lnw
## Observations: 329,509
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.597500 0.322078 14.274 < 2.2e-16 ***
## s_hat 0.101995 0.025221 4.044 5.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.678806 Adj. R2: 4.68e-5
This is the same 10.2% we got using the Wald estimator. The advantage of the regression approach is that it can accommodate more than one instrument as well as covariates (such as year of birth and state of birth).
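The match between the Wald estimate and the manual 2SLS coefficient is not specific to this dataset: with a single binary instrument and no covariates, the two are algebraically identical. Here is a minimal sketch on simulated data, using base R only. The variable names, sample size, and coefficients are made up for illustration and are not taken from Angrist and Krueger (1991).

```r
set.seed(1991)
n <- 10000
z <- rbinom(n, 1, 0.25)            # hypothetical binary instrument
u <- rnorm(n)                      # unobserved confounder
s <- 12 - 0.3 * z + u + rnorm(n)   # endogenous regressor
y <- 5 + 0.1 * s + u + rnorm(n)    # outcome; true effect of s is 0.1

# Wald estimator: reduced form divided by first stage
rf <- mean(y[z == 1]) - mean(y[z == 0])
fs <- mean(s[z == 1]) - mean(s[z == 0])
wald <- rf / fs

# Equivalent covariance-ratio form of the IV estimator
iv_cov <- cov(y, z) / cov(s, z)

# Manual two-step procedure: regress s on z, then y on the fitted values
s_hat <- fitted(lm(s ~ z))
tsls <- unname(coef(lm(y ~ s_hat))["s_hat"])

# The three numbers agree up to floating-point error
c(wald = wald, cov_ratio = iv_cov, manual_2sls = tsls)
```

Once covariates or additional instruments enter the model, the simple ratio of mean differences no longer applies, which is why the regression form is the one that generalizes.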
5.4.2 The correct way
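The procedure above gives you the correct coefficient, but the standard errors are not quite right: the manual second stage computes its residuals from \(\hat{s}_{i}\) rather than from \(s_{i}\). Instead, we let feols estimate the IV regression using 2SLS for us.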
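To see where the discrepancy comes from, here is a minimal sketch on simulated data, using base R only. It assumes homoskedastic errors to keep the formulas short (the regressions in this chapter use heteroskedasticity-robust standard errors instead), and all names and coefficients are hypothetical.

```r
set.seed(474)
n <- 5000
z <- rbinom(n, 1, 0.25)            # hypothetical binary instrument
u <- rnorm(n)                      # unobserved confounder
s <- 12 + 0.5 * z + u + rnorm(n)   # endogenous regressor
y <- 5 + 0.1 * s + u + rnorm(n)    # outcome; true effect of s is 0.1

# Manual two-step procedure
s_hat <- fitted(lm(s ~ z))
second <- lm(y ~ s_hat)
beta <- unname(coef(second))

# Naive standard error: lm() builds residuals from s_hat
se_naive <- summary(second)$coefficients["s_hat", "Std. Error"]

# 2SLS standard error: same coefficients, but residuals built from the actual s
res_2sls <- y - beta[1] - beta[2] * s
sigma2 <- sum(res_2sls^2) / (n - 2)
se_2sls <- sqrt(sigma2 / sum((s_hat - mean(s_hat))^2))

c(naive = se_naive, corrected = se_2sls)
```

The two standard errors share the same denominator and differ only in the residual variance, so their ratio is the square root of the ratio of the two residual sums of squares.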
The simple case is one instrument without any covariates:
ivreg<-feols(lnw~1|s~instrument, data=ak91, se="hetero")
summary(ivreg)
## TSLS estimation, Dep. Var.: lnw, Endo.: s, Instr.: instrument
## Second stage: Dep. Var.: lnw
## Observations: 329,509
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.597500 0.306890 14.9810 < 2.2e-16 ***
## fit_s 0.101995 0.024032 4.2442 2.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.645908 Adj. R2: 0.094623
## F-test (1st stage): stat = 67.6 , p < 2.2e-16 , on 1 and 329,507 DoF.
## Wu-Hausman: stat = 1.7349, p = 0.187788, on 1 and 329,506 DoF.
After the vertical bar |, identify the endogenous variable (in this case, \(s_{i}\)) and the instrument \(Z_{i}\) you want to use.
If you want to add controls, such as year-of-birth and state-of-birth fixed effects, place them between the two vertical bars:
ivreg2<-feols(lnw~1|yob+sob|s~instrument, data=ak91, se="hetero")
summary(ivreg2)
## TSLS estimation, Dep. Var.: lnw, Endo.: s, Instr.: instrument
## Second stage: Dep. Var.: lnw
## Observations: 329,509
## Fixed-effects: yob: 10, sob: 51
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## fit_s 0.104194 0.025669 4.0592 4.9e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.644398 Adj. R2: 0.098689
## Within R2: 0.072069
## F-test (1st stage): stat = 62.5 , p = 2.673e-15, on 1 and 329,448 DoF.
## Wu-Hausman: stat = 2.1472, p = 0.142833 , on 1 and 329,447 DoF.
You can also combine multiple instruments. For instance, let’s combine the first-quarter instrument with indicators for the second and third quarters:
ak91$instrument2<-ifelse(ak91$qob==2,1,0)
ak91$instrument3<-ifelse(ak91$qob==3,1,0)
Now we have three instruments: one indicator for each of the first three quarters of the year, with the fourth quarter as the omitted category. As the output below shows, the estimate is more precise (the standard error falls from 0.0257 to 0.0196), and the estimated return to schooling rises slightly, from 10.4% to 10.8%.
ivreg3<-feols(lnw~1|yob+sob|s~instrument+instrument2+instrument3, data=ak91, se="hetero")
summary(ivreg3)
## TSLS estimation, Dep. Var.: lnw, Endo.: s, Instr.: instrument, instrument2, instrument3
## Second stage: Dep. Var.: lnw
## Observations: 329,509
## Fixed-effects: yob: 10, sob: 51
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## fit_s 0.107694 0.019559 5.5061 3.67e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.646523 Adj. R2: 0.092734
## Within R2: 0.065938
## F-test (1st stage): stat = 36.0 , p < 2.2e-16 , on 3 and 329,446 DoF.
## Wu-Hausman: stat = 4.453 , p = 0.034841, on 1 and 329,447 DoF.
## Sargan: stat = 3.0652, p = 0.215978, on 2 DoF.
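With three instruments and one endogenous variable, the model is overidentified, which is why a Sargan test with 2 degrees of freedom appears in the output. One common textbook form of the statistic is \(n R^{2}\) from a regression of the 2SLS residuals on the instruments. The sketch below computes it on simulated data with base R. All names and coefficients are hypothetical, and this is not necessarily the exact computation fixest performs internally.

```r
set.seed(3)
n <- 20000
q <- sample(1:4, n, replace = TRUE)          # hypothetical quarter of birth
z1 <- as.numeric(q == 1)
z2 <- as.numeric(q == 2)
z3 <- as.numeric(q == 3)
u <- rnorm(n)                                # unobserved confounder
s <- 12 - 0.3 * z1 - 0.2 * z2 - 0.1 * z3 + u + rnorm(n)
y <- 5 + 0.1 * s + u + rnorm(n)              # instruments affect y only through s

# 2SLS point estimates via the two-step shortcut
s_hat <- fitted(lm(s ~ z1 + z2 + z3))
b <- unname(coef(lm(y ~ s_hat)))

# 2SLS residuals are built from the actual s, not s_hat
res <- y - b[1] - b[2] * s

# Sargan statistic: n times the R2 of residuals regressed on the instruments
r2 <- summary(lm(res ~ z1 + z2 + z3))$r.squared
sargan <- n * r2
df <- 3 - 1                                  # instruments minus endogenous regressors
p_value <- pchisq(sargan, df = df, lower.tail = FALSE)

c(stat = sargan, df = df, p = p_value)
```

A large p-value means the overidentifying restrictions are not rejected. In the output above, the Sargan p-value of about 0.22 gives no evidence against the joint validity of the three quarter-of-birth instruments.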