Subclassification and Matching
Subclassification Example
In this example, we are going to follow Cunningham (2021) and use a dataset of Titanic passengers. Download the .RDS
file here.
The variables’ description is the following:
Variable | Definition |
---|---|
class | Passenger’s class: 1 if located in the upper decks |
age | Passenger’s age: 0 if Child, 1 if Adult |
sex | Passenger’s gender: 0 if woman, 1 if man |
survived | 1 if survived, 0 otherwise |
Out of 2,201 people among passengers and crew, only 711 survived. It is well known the role that wealth and norms - women and children had priority for lifeboats - played in passengers’ probability of survival, so let’s check whether or not being part of the first class made someone more likely to survive the tragedy. First, let’s compare survival rates between passengers in the upper deck and the others.
Create a dummy variable first_class
that takes 1 if class==1
and 0 otherwise. Then, use group_by()
and summarize()
the information calculating the average values of survived
library(tidyverse)
<-readRDS("titanic.RDS")
titanic$first_class<-ifelse(titanic$class==1,"Upper Deck","Lower Decks")
titanic%>%group_by(first_class)%>%summarize(survival_rate=mean(survived)) titanic
## # A tibble: 2 x 2
## first_class survival_rate
## <chr> <dbl>
## 1 Lower Decks 0.271
## 2 Upper Deck 0.625
The naive comparison gives a difference in survival rates of 35.4% for first class seats. That difference is considerable and also statistically significant:
t.test(survived~first_class, data=titanic)
##
## Welch Two Sample t-test
##
## data: survived by first_class
## t = -12.289, df = 423.61, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Lower Decks and group Upper Deck is not equal to 0
## 95 percent confidence interval:
## -0.4104198 -0.2972332
## sample estimates:
## mean in group Lower Decks mean in group Upper Deck
## 0.2707889 0.6246154
However, we also know that women and children were explicitly given priority for boarding the very limited number of lifeboats. Hence, that difference might be biased if those groups - lower and upper deck - are fundamentally different. If women and children are more likely to be seated in the first class, then that difference is driven by social norms.
What if we adjust for those two observable confounders (age and gender)? To use subclassification weighting:
Stratify the data into four groups: male children, female children, male adult, female adult
Calculate the difference in survival probabilities for each group
Calculate the number of people in the non-first-class groups and divide by the total number of non-first-class population. These are our strata-specific weights
Calculate the weighted average survival rate using the strata weights
%>%group_by(age, sex, first_class)%>%summarize(outcome=mean(survived), count=n()) titanic
## # A tibble: 8 x 5
## # Groups: age, sex [4]
## age sex first_class outcome count
## <dbl+lbl> <dbl+lbl> <chr> <dbl> <int>
## 1 0 [child] 0 [women] Lower Decks 0.614 44
## 2 0 [child] 0 [women] Upper Deck 1 1
## 3 0 [child] 1 [man] Lower Decks 0.407 59
## 4 0 [child] 1 [man] Upper Deck 1 5
## 5 1 [adults] 0 [women] Lower Decks 0.626 281
## 6 1 [adults] 0 [women] Upper Deck 0.972 144
## 7 1 [adults] 1 [man] Lower Decks 0.188 1492
## 8 1 [adults] 1 [man] Upper Deck 0.326 175
=1-0.614
diff1=1-0.407
diff2=0.972-0.626
diff3=0.326-0.188
diff4
=44/1876*diff1+59/1876*diff2+281/1876*diff3+1492/1876*diff4
weightedATE weightedATE
## [1] 0.189282
Using the subclassification weighting scheme, the average treatment effect (ATE) is 18.9%, compared to 35.4% using a simple comparison between first and non-first classes.
Propensity Score Example
In this example, we follow Cunningham (2021). Download the NSW experimental data here and let’s replicate the table we saw in class.
The variables’ description is the following:
Variable | Definition |
---|---|
data_id | Individual’s Identification |
treat | Treatment: 1 if participated in the Job training, 0 otherwise |
age | Age |
educ | Years of Education |
black | Race: 1 if Black, 0 otherwise |
hisp | Race: 1 if Hispanic, 0 otherwise |
marr | Marital status: 1 if married, 0 otherwise |
nodegree | 1 if participant does not have a degree, 0 otherwise |
re74 | Earnings in 1974 (pre-program) |
re75 | Earnings in 1975 (pre-program) |
re78 | Earnings in 1978 (post-program) |
First, read the .RDS
file. Then, summarize the baseline characteristics to check for balance and get the difference in means of earnings in 1978 re78
:
library(tidyverse)
<-readRDS("NSW_EXPER.RDS")
exper_data%>%group_by(treat)%>%summarize(Age=mean(age), Education=mean(educ),
exper_dataBlack=mean(black), Hispanic=mean(hisp),
Married=mean(marr), `No degree`=mean(nodegree),
`Earnings in 1974`=mean(re74),
`Earnings in 1975`=mean(re75),
`Earnings in 1978`=mean(re78),
`Number of Observations`=n())
## # A tibble: 2 x 11
## treat Age Education Black Hispanic Married `No degree` `Earnings in 1974`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 25.1 10.1 0.827 0.108 0.154 0.835 2107.
## 2 1 25.8 10.3 0.843 0.0595 0.189 0.708 2096.
## # ... with 3 more variables: Earnings in 1975 <dbl>, Earnings in 1978 <dbl>,
## # Number of Observations <int>
Since the NSW was randomized, the independence assumption \((Y_{1i}, Y_{0i}) \perp\!\!\!\perp D_{i}\) is satisfied and simple comparisons between means give the average treatment effect (ATE):
\[\frac{1}{N_{Treat}}\sum_{D_{i}=1} Y_{i}-\frac{1}{N_{Control}}\sum_{D_{i}=0}Y_{i} \approx \underbrace{E[Y_{1i}-Y_{0i}]}_{\text{Average Treatment Effect}}\]
<-exper_data%>%filter(treat==1)%>%summarize(avg_earnings78=mean(re78))
mean_treated<-exper_data%>%filter(treat==0)%>%summarize(avg_earnings78=mean(re78))
mean_control
-mean_control mean_treated
## avg_earnings78
## 1 1794.342
The causal effect of the NSW job-training program is $1,794.
What if we replace the experimental control group with a random sample of Americans from the Current Population Survey
? Download this data here and let’s see how it goes.
<-readRDS("NSW_NE.RDS")
nonexper_data
%>%group_by(treat)%>%summarize(Age=mean(age), Education=mean(educ),
nonexper_dataBlack=mean(black), Hispanic=mean(hisp),
Married=mean(marr), `No degree`=mean(nodegree),
`Earnings in 1974`=mean(re74),
`Earnings in 1975`=mean(re75),
`Earnings in 1978`=mean(re78),
`Number of Observations`=n())
## # A tibble: 2 x 11
## treat Age Education Black Hispanic Married `No degree` `Earnings in 1974`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 33.2 12.0 0.0735 0.0720 0.712 0.296 14017.
## 2 1 25.8 10.3 0.843 0.0595 0.189 0.708 2096.
## # ... with 3 more variables: Earnings in 1975 <dbl>, Earnings in 1978 <dbl>,
## # Number of Observations <int>
As you can see, the NSW participants are younger, less educated, more black, less likely to be married, more likely to have no degree, among others - those two groups are fundamentally different.
Nearest-Neighbor Matching
One possibility is to use a matched sample that better approximates the NSW participants using propensity scores - the conditional probability of receiving the treatment given covariate values (in this case, education, age, past earnings, etc.). To get those estimated probabilities, we can use a logit regression:
# Use the glm() function with
# the family = binomial(link = 'logit') option
<- glm( treat ~ age + educ + marr + nodegree + black +
logit_reg + re74 + re75, data = nonexper_data,
hisp family = binomial(link = 'logit'))
And get its fitted values:
<- data.frame(prop_score = predict(logit_reg, type = "response"),
probs job_training = logit_reg$model$treat)
#View(probs)
head(probs,5)
## prop_score job_training
## 1 0.24751114 1
## 2 0.07257909 1
## 3 0.25039795 1
## 4 0.47895432 1
## 5 0.44753066 1
One of the assumptions for matching is the existence of common support: there must be units in both the treatment and control groups for any probability. In other words, common support ensures enough overlap in the characteristics of treated and control units to find suitable matches.
You can look at the distributions of the propensity score in the treated and control groups:
library(patchwork)
<-probs %>%
hist1filter(job_training == 0) %>%
ggplot() +
geom_histogram(aes(x = prop_score))+
xlab("Probability of getting Job-training: control")+
theme_bw()
<-probs %>%
hist2filter(job_training == 1) %>%
ggplot() +
geom_histogram(aes(x = prop_score))+
xlab("Probability of getting Job-training: treated")+
theme_bw()
+hist2 hist1
It does not look good! The characteristics of individuals in the treatment group are scarce in the CPS sample we are using as a non-experimental control group.
A simple method for estimating the average treatment effect of the NSW training is to restrict the sample finding pairs of observations with very similar propensity scores but with different treatment statuses:
library(MatchIt)
<- matchit(treat ~ age + educ +
m_out + nodegree +
marr + hisp + re74 + re75,
black data = nonexper_data, method = "nearest",
distance = "logit")
<- match.data(m_out)
m_data #View(m_data)
dim(m_data)
## [1] 370 14
The matched sample has 370 individuals: 185 in the treatment group and 185 in the control group. You might want to see if there is indeed a balance between covariates in the new dataset:
%>%group_by(treat)%>%summarize(Age=mean(age), Education=mean(educ),
m_dataBlack=mean(black), Hispanic=mean(hisp),
Married=mean(marr), `No degree`=mean(nodegree),
`Earnings in 1974`=mean(re74),
`Earnings in 1975`=mean(re75),
`Earnings in 1978`=mean(re78),
`Number of Observations`=n())
## # A tibble: 2 x 11
## treat Age Education Black Hispanic Married `No degree` `Earnings in 1974`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 23.8 10.4 0.827 0.0432 0.135 0.681 1905.
## 2 1 25.8 10.3 0.843 0.0595 0.189 0.708 2096.
## # ... with 3 more variables: Earnings in 1975 <dbl>, Earnings in 1978 <dbl>,
## # Number of Observations <int>
You can also run t.test()
and see if the differences are statistically significant (they shouldn’t be). To get the average treatment effect, you can calculate the difference in 1978 average earnings between the treatment and control group or run a regression:
library(fixest)
<-feols(re78~treat, data=m_data, se="hetero")
reg1<-feols(re78~treat+age+educ+black+hisp+marr+nodegree+re74+re75, data=m_data, se="hetero")
reg2
etable(reg1, reg2, signifCode = c("***"=0.01, "**"=0.05, "*"=0.10))
## reg1 reg2
## Dependent Var.: re78 re78
##
## (Intercept) 4,586.9*** (423.8) 199.9 (3,716.3)
## treat 1,762.2** (717.1) 1,678.8** (697.5)
## age -6.903 (43.98)
## educ 428.4* (231.7)
## black -689.8 (1,050.8)
## hisp 1,152.2 (1,872.4)
## marr 377.2 (1,089.4)
## nodegree 271.5 (1,162.5)
## re74 -0.0108 (0.1801)
## re75 0.3189* (0.1764)
## _______________ __________________ _________________
## S.E. type Heteroskedas.-rob. Heteroskeda.-rob.
## Observations 370 370
## R2 0.01615 0.05328
## Adj. R2 0.01347 0.02961
Weighting on the Propensity Score
Besides nearest-neighbor matching, there are several other ways to get average treatment effects using the estimated propensity score. Assuming the CIA holds, one can use the individuals’ propensity score to weigh their outcomes.
The first step is to generate the propensity scores using the logit regression:
<-readRDS("NSW_NE.RDS")
nonexper_data
# Logit model to predict probability of treatment
<- glm( treat ~ age + educ + marr + nodegree + black +
logit_reg + re74 + re75, data = nonexper_data,
hisp family = binomial(link = 'logit'))
# Generate propensity scores
$prop_scores<-logit_reg$fitted.values nonexper_data
After that, trim the data using a good rule of thumb: we keep observations on the interval [0.1,0.9]. Then, calculate the inverse probability weights (IPW) following this formula:
\[\frac{Treatment}{Propensity}-\frac{1-Treatment}{1-Propensity}\]
## Trimming the data
<-nonexper_data%>%filter(prop_scores>=.1 &prop_scores<=.9 )
nonexper_data## Generating the IPW
<-nonexper_data %>%
nonexper_datamutate(ipw = (treat / prop_scores) + ((1 - treat) / (1 - prop_scores)))
Finally, run a weighted regression using ipw
as weights:
<- feols(re78 ~ treat, data = nonexper_data, se="hetero", weights = nonexper_data$ipw)
reg3 <- feols(re78 ~ treat+age+educ+black+hisp+marr+nodegree+re74+re75, data = nonexper_data, se="hetero", weights = nonexper_data$ipw)
reg4 etable(reg3,reg4,signifCode = c("***"=0.01, "**"=0.05, "*"=0.10))
## reg3 reg4
## Dependent Var.: re78 re78
##
## (Intercept) 4,425.6*** (312.3) 2,542.1 (2,937.0)
## treat 1,713.8** (732.8) 1,686.3** (740.4)
## age -42.71 (36.13)
## educ 299.6* (181.8)
## marr 74.42 (1,125.0)
## nodegree -448.5 (1,104.9)
## re74 -0.1361 (0.1192)
## re75 0.3328 (0.2175)
## _______________ __________________ _________________
## S.E. type Heteroskedas.-rob. Heteroskeda.-rob.
## Observations 504 504
## R2 0.01541 0.04135
## Adj. R2 0.01345 0.02782
The average treatment effect using IPW is $1,686-1,714, close to the actual causal effect we got from the randomized trial.